Can you "download the internet"? Surely, that's impossible, you simply won't have enough storage. However, developers of free and open-source software always have a keen sense of humor. A simple example is the wget utility. Its name is the abbreviation of "www get," where WWW stands for World Wide Web. Thus, the term can be understood as "download the Internet."
In this material, however, we will focus not on the utility itself but on ways to make it work through a proxy. This is usually required for organizing multiple parallel connections and for parsing operations.
We have already covered a similar utility, cURL (it works with proxies perfectly well, given enough skill). Below, we will additionally compare the two utilities and discuss their differences.
What Is Wget and How to Use It
Wget is a command-line utility that ships with practically all popular Linux distributions; it is designed for fast downloading of files and other content over various internet protocols.
If needed, the utility can be installed and used on other platforms, as the program has open-source code that can be compiled for different execution environments.
Wget boasts a very simple syntax and is therefore ideal for everyday use, including by beginners. Since wget is included in the base environment of Linux distributions, other programs and packages can be downloaded quickly and easily. Tasks can also be added to the cron scheduler (scripts and commands executed on a schedule). Plus, wget can be incorporated into any other scripts and console commands.
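For instance, a crontab entry like this one (the schedule, the path, and the URL are placeholders) would quietly fetch a fresh file every night at 03:00:
- 0 3 * * * wget -q -O /home/user/reports/daily.csv https://example.com/daily.csv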
For example, wget can fully download a target website if the options for traversing URL addresses (with recursion) are set correctly.
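A minimal sketch (the domain and the depth limit are placeholders): -r enables recursion, -l caps its depth, and --no-parent keeps wget from climbing above the starting directory:
- wget -r -l 3 --no-parent https://example.com/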
Wget supports working with HTTP, HTTPS, FTP and FTPS protocols (+ some other, less popular ones).
The more precise name is GNU Wget (official website and documentation).
Note that there is a parallel implementation of wget, wget2, which brings a number of small innovations and features.
An example of using wget to download an archive:
- wget https://your.site/directory/archive.zip
Multiple files can be downloaded in bulk by simply specifying all their names (links) separated by spaces:
- wget https://your.site/directory/archive1.zip https://your.site/directory/archive2.zip https://your.site/directory/archive3.zip
The utility will download files sequentially with progress displayed directly in the console.
The names of target files (list of URLs) can be saved in a separate document and "fed" to wget like this:
- wget --input-file=~/urls.txt
The same command with the shortened option:
- wget -i ~/urls.txt
If access is protected by a login and password, wget can handle that as well (replace user and password with the actual values):
- wget ftp://user:password@host/path
This is how you can create a local version of a specific website (it will be downloaded as HTML pages with all related content):
- wget --mirror -p --convert-links -P /home/user/site111 source-site.com
You can download only files of a certain type from a website:
- wget -r -A "*.png" domain.zone
Note! Wget cannot execute JavaScript, meaning it will only fetch and save the static HTML code. All dynamically loaded elements will be ignored.
There are plenty of possible wget applications.
A complete list of all options and keys for the utility can be found in the program documentation as well as on the official website. In particular, you can (see the combined example after this list):
- Limit download speed and set other quotas;
- Change the user-agent to your own value (for example, you can pretend to be a Chrome browser to the website);
- Resume download;
- Set offset when reading a file;
- Analyze creation/modification time, MIME type;
- Use constant and random delays between requests;
- Recursively traverse specified directories and subdirectories;
- Use traffic compression (gzip at the HTTP level);
- Switch to the background mode;
- Employ proxies.
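To illustrate how several of these options combine, here is a hedged sketch (the URL, the speed cap, and the user-agent string are placeholders):
- wget --limit-rate=500k --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" --wait=2 --random-wait -c https://example.com/big-archive.zip
Here --limit-rate caps the download speed, --user-agent masquerades as a browser, --wait and --random-wait add randomized delays between requests, and -c resumes an interrupted download.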
Naturally, we are mostly interested in the last point.
When parsing, wget can help by saving HTML content, which can later be dissected and analyzed by other tools and scripts. For more details, see the materials on Python web scraping libraries and Golang Scraper.
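As a rough illustration (the URL and the pattern are placeholders), a page can be fetched straight to standard output and piped into other tools:
- wget -q -O - https://example.com/ | grep -oE 'href="[^"]+"'
This crudely extracts all link attributes from the page; serious parsing is better left to the libraries mentioned above.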
Why Use a Proxy with Wget
A proxy is an intermediary server. Its main task is to organize an alternative route for exchanging requests between a client and a server.
Proxies can use different connection schemes and technologies. For example, proxies can be anonymous or not, run on different types of devices (server-based, mobile, residential), be paid or free, provide reverse-connection mechanisms (backconnect proxies), use static or dynamic addresses, etc.
No matter what they are, their tasks remain roughly the same: redirection, location change, content modification (compression, cleaning etc.).
When parsing, using wget with a proxy is also needed to hide the real client's address and to organize multiple parallel connections, for example, to speed up data collection (scraping, not to be confused with web crawling).
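Wget itself downloads files one by one, so parallelism is usually organized at the shell level. A minimal sketch, assuming a placeholder file urls.txt with one URL per line: xargs launches up to four wget processes at a time, one URL each:
- xargs -n 1 -P 4 wget -q < urls.txt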
How to Install Wget
In many Linux distributions, wget comes pre-installed. If the wget command returns a "command not found" error, the utility can be easily installed using the native package manager.
Debian-based distributions, including Ubuntu:
- sudo apt-get install wget
Fedora, CentOS and RHEL:
- sudo dnf install wget (or sudo yum install wget on older releases)
ArchLinux and equivalents:
- sudo pacman -S wget
In macOS, wget can be installed either from source (using the make and make install commands) or via the Homebrew package manager. For beginners, the latter option is the most convenient (note that the installation command below relies on the cURL utility, which is preinstalled in macOS by default):
- /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- brew install wget
In the latest versions of Windows (10 and 11), wget can be installed inside the Linux subsystem (WSL), as a precompiled binary (for example, builds can be found here), or via third-party package managers such as Chocolatey. The installation command for Chocolatey:
- choco install wget
If you install wget on Windows as a standalone binary, you will need to add the program's location to the PATH variable so that the command can be invoked correctly from the command line. Otherwise, you will have to call the file directly each time, as ".\directory\wget.exe" followed by the list of options and parameters.
Running Wget
Once the utility is installed, it can be launched either from the command line or accessed within shell scripts.
Typical launch:
- wget https://site.zone/directory/file.zip
Immediately after pressing Enter, the utility will start downloading the file into the current working directory (or another location, depending on the options and environment settings).
In the console, wget displays the current speed and overall download progress.
You can save the downloaded file under a different name:
- wget -O new-name.zip https://site.zone/directory/source-file.zip
If you need to call up help for the set of options, type:
- wget -h
Setting Up Wget to Work Through Proxy
The simplest way to point wget at a proxy is via special options on the command line:
- If the proxy does not require authentication:
wget -e use_proxy=on -e http_proxy=proxy.address.or.IP.address:port -e https_proxy=proxy.address.or.IP.address:port https://target.site/directory/file.zip
- If authentication with a username and password is required:
wget -e use_proxy=on -e http_proxy=132.217.171.127:1234 -e https_proxy=132.217.171.127:1234 --proxy-user=USERNAME --proxy-password=PASSWORD https://target.site/directory/file.zip
In some cases, instead of the option "use_proxy=on", the combination "use_proxy=yes" may be used. Note also that for https:// targets it is the https_proxy variable that matters; http_proxy alone covers only plain HTTP requests.
If specifying options in the console every time is inconvenient, you can configure the proxy for wget at the configuration file level. This can be done either in the global configuration file (/etc/wgetrc) or in the local user config (~/.wgetrc; if the file does not exist, it can be created manually). Just set the following options (if the user config is created from scratch, simply add them to the empty file):
use_proxy=on
http_proxy=http://155.217.170.121:12345
https_proxy=http://155.217.170.121:12345
Naturally, instead of 155.217.170.121:12345, you should specify the actual IP address and port number.
If authentication with a username and password is required, you can use the following construction:
use_proxy = on
http_proxy = http://USERNAME:PASSWORD@155.217.170.121:12345
Now you can run wget without additional keys; the utility will keep working through the proxy.
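To make sure traffic really flows through the proxy, you can query an IP-echo service (api.ipify.org is one public example) and check that the reported address belongs to the proxy rather than to you:
- wget -qO- https://api.ipify.org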
Rotating Proxy for Wget
Wget has no built-in tools for proxy rotation. Therefore, if you want each new wget request to go through a different proxy, you need to write a bash script or pass the "-e" options explicitly each time.
Example:
wget -e use_proxy=on -e http_proxy=104.254.41.36:1234 -e https_proxy=104.254.41.36:1234 --proxy-user=USERNAME-one --proxy-password=PASSWORD-one https://site-one.zone/directory/file-one.zip
wget -e use_proxy=on -e http_proxy=26.104.52.225:2234 -e https_proxy=26.104.52.225:2234 --proxy-user=USERNAME-two --proxy-password=PASSWORD-two https://site-two.zone/directory/file-two.zip
wget -e use_proxy=on -e http_proxy=70.174.89.3:44444 -e https_proxy=70.174.89.3:44444 --proxy-user=USERNAME-three --proxy-password=PASSWORD-three https://site-three.zone/directory/file-three.zip
And here is what a bash script for forced proxy rotation might look like, with the proxy picked at random from a list stored in the file proxies.txt (let's assume it contains 10 lines):
for i in {1..10}
do
    proxy=$(shuf -n 1 proxies.txt)
    wget -e use_proxy=on -e http_proxy="$proxy" -e https_proxy="$proxy" --proxy-user=USERNAME --proxy-password=PASSWORD https://target-site.zone/subdirectory/some-file
done
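The same idea extends naturally to a list of target pages. Below is a sketch under the assumption that the placeholder files urls.txt (one URL per line) and proxies.txt exist; every download gets a freshly picked proxy:
while read -r url
do
    proxy=$(shuf -n 1 proxies.txt)
    wget -e use_proxy=on -e http_proxy="$proxy" -e https_proxy="$proxy" --proxy-user=USERNAME --proxy-password=PASSWORD "$url"
done < urls.txt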
If you're not familiar with scripting, there's another elegant solution: backconnect proxies. Let's take Froxy as an example:
- A port is configured in the personal account (this is where the location and the rotation conditions for outgoing IP addresses are defined, e.g., rotation with each new request);
- Proxy port data is copied (this will be a regular proxy for wget).
- The requests are then executed just as with regular individual proxies (wget -e use_proxy=on -e http_proxy=255.89.155.178:1234 -e https_proxy=255.89.155.178:1234 --proxy-user=USERNAME --proxy-password=PASSWORD https://target.site/directory/file.zip).
- IP address rotation is carried out on the proxy provider's side. The input port remains the same (there is no need to add or update anything in wget).
cURL vs Wget
Both cURL and wget are open-source utilities for downloading files and other content over the HTTP and FTP protocols. Both handle HTTP POST and GET requests and cookies, work with secure versions of websites (HTTPS), and can be incorporated into bash scripts.
However, they also have distinctions.
Let's start with cURL.
- This is not just a utility but also a software library that can be used at the code level;
- Unlike wget, cURL supports a vast number of additional protocols (here is a detailed comparison table).
- cURL can work through SOCKS proxies (wget supports only HTTP proxies);
- It offers more capabilities for site authentication and SSL connection support;
- In addition to POST and GET, it also supports some other methods (e.g., PUT).
On the other hand, wget also has something to offer:
- Recursive downloading of directory contents is possible;
- Creating copies of websites is available;
- Interrupted downloads can be resumed, so there is no need to re-download large files (see the example after this list);
- It has a smaller set of options, making wget easier to manage and configure.
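For instance, resuming takes a single key: re-running the same command with -c picks the transfer up where it stopped (the URL is a placeholder):
- wget -c https://example.com/large-image.iso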
Take a moment to learn how to integrate cURL with proxies as well.
Conclusion and Recommendations
Wget is a simple yet powerful utility for downloading files and HTML pages. It can be adapted to parsing tasks and used either directly in the console or from bash scripts. Its downsides are that it cannot be used as a software library and that it lacks built-in proxy rotation.
You can find quality residential and mobile proxies with automatic rotation in our service. Froxy offers over 8 million IP addresses, a convenient interface, and targeting down to the city level (with solid coverage in all countries worldwide). Pricing depends on traffic only. A special trial package is available for testing.