Can you "download the internet"? Surely, that's impossible, you simply won't have enough storage. However, developers of free and open-source software always have a keen sense of humor. A simple example is the wget utility. Its name is the abbreviation of "www get," where WWW stands for World Wide Web. Thus, the term can be understood as "download the Internet."
In this material, however, we will focus not on the utility itself but on ways to make it work through a proxy. Usually, this is required for organizing multi-threaded connections and parsing operations.
We have already covered a similar utility, cURL (it works well with proxies, provided you have the necessary skills). Below, we will also compare the two utilities and discuss their differences.
Wget is a command-line utility that ships with practically all popular Linux distributions; it is designed for fast downloading of files and other content via various internet protocols.
If needed, the utility can be installed and used on other platforms, as the program has open-source code that can be compiled for different execution environments.
Wget boasts a very simple syntax and is therefore ideal for everyday use, including by beginners. The fact that wget is included in the basic environment of Linux distributions makes downloading other programs and packages quick and easy. Tasks can also be added to the cron scheduler (so scripts and commands run on a schedule), as shown below. Plus, wget can be incorporated into any other scripts and console commands.
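For instance, a crontab entry like this (the schedule, path, and URL here are placeholders) would quietly fetch a file every night at 3:00:
0 3 * * * wget -q -O /home/user/backup.zip https://example.com/backup.zip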
For example, wget can fully download a target website, provided the options for traversing URLs (with recursion) are set correctly.
Wget supports working with HTTP, HTTPS, FTP and FTPS protocols (+ some other, less popular ones).
A more correct name is GNU Wget (official website and documentation).
Note that there is a parallel implementation of wget, wget2, which offers a number of small improvements and new features.
An example of using wget to download an archive (here and in the examples below, example.com stands in for a real address):
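wget https://example.com/archive.tar.gz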
Several files can be downloaded at once by simply specifying all their URLs separated by spaces:
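wget https://example.com/file-one.zip https://example.com/file-two.zip https://example.com/file-three.zip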
The utility will download files sequentially with progress displayed directly in the console.
The list of target URLs can also be saved in a separate file and "fed" to wget like this:
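wget --input-file=list-of-urls.txt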
The same with the shortened option:
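wget -i list-of-urls.txt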
If access is protected by a login and password, wget can handle that as well (replace user and password with the actual credentials):
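wget --user=user --password=password https://example.com/protected/file.zip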
This is how you can create a local version of a specific website (it will be downloaded as HTML pages with all related content):
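wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/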
You can download only files of a certain type from a website:
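wget -r -A pdf https://example.com/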
Note! Wget cannot execute JavaScript, meaning it will only download and save the static HTML code. All dynamically loaded elements will be ignored.
There are plenty of possible wget applications.
A complete list of all options and keys for the utility can be found in the program documentation as well as on the official website. Naturally, we are mostly interested in the proxy-related options.
When parsing, wget can help with saving HTML content, which can later be dissected and analyzed by other tools and scripts. For more details, see materials on Python web scraping libraries and Golang Scraper.
A proxy is an intermediary server. Its main task is to organize an alternative route for exchanging requests between a client and a server.
Proxies can use different connection schemes and technologies. For example, proxies can be anonymous or not, run on different types of devices (server-based, mobile, residential), be paid or free, offer backconnect mechanisms (backconnect proxies), use static or dynamic addresses, etc.
No matter what they are, their tasks remain roughly the same: redirection, location change, content modification (compression, cleaning etc.).
When parsing, using wget through a proxy is also needed to hide the client's real address and to organize multiple parallel connections, for example, to speed up the data collection procedure (scraping, not to be confused with web crawling).
In many Linux distributions, wget comes pre-installed. If the wget command returns an error, the utility can be easily installed using the native package manager.
Debian-based distributions, including Ubuntu:
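sudo apt install wget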
Fedora, CentOS and RHEL:
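sudo dnf install wget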
ArchLinux and equivalents:
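sudo pacman -S wget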
On macOS, wget can be installed either from source (with the "make" and "make install" commands) or using the Homebrew package manager. For beginners, the latter option is the most convenient (note that the Homebrew install script itself is fetched with cURL, which is pre-installed on macOS by default):
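brew install wget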
In the latest versions of Windows (10 and 11), wget can be installed in the Windows Subsystem for Linux (WSL), directly from compiled binaries (for example, they can be found here), or using third-party package managers like Chocolatey. The installation command for Chocolatey:
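choco install wget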
If you install wget on Windows as a standalone binary, you will need to add the program's location to the PATH variable so it can be invoked correctly from the command line. Otherwise, you will have to call the file directly each time as ".\directory\wget.exe", followed by the list of options and parameters.
Once the utility is installed, it can be launched either from the command line or accessed within shell scripts.
Typical launch:
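wget https://example.com/file.zip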
Immediately after pressing Enter, the utility will start downloading the file to the current working directory (or to another location if one is specified via options).
In the console, wget displays the current speed and overall download progress.
You can change the filename during download:
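wget -O new-name.zip https://example.com/file.zip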
If you need to call up help for the set of options, type:
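wget --help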
The simplest way to specify a wget proxy is through special options in the command line:
wget -e use_proxy=on -e http_proxy=proxy.address.or.IP.address:port https://target.site/directory/file.zip
wget -e use_proxy=on -e http_proxy=132.217.171.127:1234 --proxy-user=USERNAME --proxy-password=PASSWORD https://target.site/directory/file.zip
In some cases, instead of the option "use_proxy=on", the combination "use_proxy=yes" may be used.
If it is inconvenient to specify these options in the console every time, you can set the wget proxy at the configuration file level. This can be done either in the global configuration file (/etc/wgetrc) or in the local user config (~/.wgetrc; if the file does not exist, it can be created manually). Just add the following options (if the user config is created from scratch, add them to the empty file):
use_proxy=on
http_proxy=155.217.170.121:12345
https_proxy=155.217.170.121:12345
Naturally, instead of 155.217.170.121:12345, you should specify the actual IP address and port number.
If authentication with a username and password is required, you can use the following construction:
use_proxy = on
http_proxy = http://USERNAME:PASSWORD@155.217.170.121:12345
Now you can run wget without additional keys; the utility will keep working through the proxy.
Wget does not have built-in tools for proxy rotation. Therefore, if you want each new wget run to go through a different proxy, you need to write a bash script or use the "-e" option.
Example:
wget -e use_proxy=on -e http_proxy=104.254.41.36:1234 --proxy-user=USERNAME-one --proxy-password=PASSWORD-one https://site-one.zone/directory/file-one.zip
wget -e use_proxy=on -e http_proxy=26.104.52.225:2234 --proxy-user=USERNAME-two --proxy-password=PASSWORD-two https://site-two.zone/directory/file-two.zip
wget -e use_proxy=on -e http_proxy=70.174.89.3:44444 --proxy-user=USERNAME-three --proxy-password=PASSWORD-three https://site-three.zone/directory/file-three.zip
And here's how a bash script for forced proxy rotation might look, with a proxy randomly selected on each iteration from a list stored in the file proxies.txt (let's assume it contains 10 lines):
#!/bin/bash
# Pick a random proxy from proxies.txt for each of the 10 download attempts
for i in {1..10}
do
  proxy=$(shuf -n 1 proxies.txt)
  wget -e use_proxy=on -e http_proxy="$proxy" --proxy-user=USERNAME --proxy-password=PASSWORD https://target-site.zone/subdirectory/some-file
done
If you're not familiar with scripting, there's another elegant solution – using backconnect proxies. Let’s take Froxy proxy as an example:
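A sketch of what this might look like (the gateway host and port below are placeholders; take the actual values from your Froxy dashboard):
wget -e use_proxy=on -e http_proxy=GATEWAY-HOST:PORT --proxy-user=USERNAME --proxy-password=PASSWORD https://target-site.zone/subdirectory/some-file
The provider rotates the outgoing IP behind this single entry point on its side, so no scripting is required.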
Both cURL and wget are open-source utilities used for downloading files and other content via the HTTP and FTP protocols. Both handle HTTP POST and GET requests and cookies, work with secure versions of websites (HTTPS), and can be incorporated into bash scripts.
However, they also have distinctions.
Let's start with cURL.
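Its typical advantages: it supports a much wider range of protocols (SFTP, SCP, SMTP, POP3, IMAP, LDAP and others), it can upload data as well as download it, and it ships as a library (libcurl) that can be embedded in applications written in almost any popular language.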
On the other hand, wget also has something to offer:
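Its key strengths are recursive downloading of entire websites (up to full mirroring with --mirror), straightforward resumption of interrupted downloads via the -c option, and a shorter syntax for everyday download tasks.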
Take a moment to learn how to integrate cURL with proxies.
Wget is a simple yet powerful utility for downloading files and HTML pages. It can be adapted for parsing tasks and can be accessed in the console or through bash scripts. Its downsides include the inability to use it as a library and the lack of built-in proxy rotation.
You can find quality residential and mobile proxies with automatic rotation in our service. Froxy offers over 8 million IP addresses, a convenient interface, and targeting down to the city level (with solid coverage in all countries worldwide). Pricing depends on traffic only. There's a special trial package available for testing the service's features.