When it comes to parsing or automating routine tasks on the internet, bearded sysadmins and web programmers (both beginners and pros) always know the magical recipe that can solve almost any issue - this is where the great and terrible cURL comes into play.
For those of a nervous disposition, please step away from the screens and switch to something more positive. Below, we discuss hardcore solutions for professionals as well as for those thinking about creating their own parser.
Previously, we talked about various Python web scraping libraries and frameworks as well as Golang parsing libraries. Now, let's focus on the command-line interface and on how to use a curl proxy.
cURL is a combination of a cross-platform library (libcurl) and a command-line program designed for exchanging data and files over various internet protocols. It supports well over a dozen protocols, all unified by the fact that they address resources with URLs.
Specifically, the name cURL is derived from the term "Client URL", where the abbreviation URL stands for Uniform Resource Locator.
Many people commonly refer to URL addresses as links. The typical URL structure looks as follows:
<protocol_or_scheme>://<username>:<password>@<host>:<port>/<path>?<additional_parameters>#<anchor>
If certain elements are not used, they are omitted (removed from the URL).
Here's how URL addresses look for the well-known HTTP protocol:
http://www.domain.zone/path/to/page/
The same address can look significantly more complex when additional parameters and anchor links are used:
http://www.domain.zone/path/to/page/?parameter1=value1&parameter2=value2#anchor_for_navigation_to_specific_element_identifier
If the SOCKS5 protocol and authentication parameters are used (like with our residential or mobile proxies), the URL will resemble something like this:
socks5://login:password@123.123.123.123:port
Note that a specific IP address is used here instead of a domain.
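To make the URL structure more tangible, here is a minimal sketch that splits a proxy address into its parts using nothing but shell parameter expansion. The address is the sample one from the text; the port number 1080 is an assumption standing in for the port placeholder, and the approach assumes the login and password contain no ":" or "@" characters.

```shell
# Split scheme://user:password@host:port into parts with parameter expansion.
# Port 1080 is an assumed value; the rest is the sample address from the text.
proxy_url="socks5://login:password@123.123.123.123:1080"

scheme="${proxy_url%%://*}"     # everything before "://"
rest="${proxy_url#*://}"        # everything after "://"
credentials="${rest%%@*}"       # login:password
hostport="${rest#*@}"           # 123.123.123.123:1080
user="${credentials%%:*}"       # part before the first ":"
password="${credentials#*:}"    # part after the first ":"
host="${hostport%%:*}"          # part before the port separator
port="${hostport##*:}"          # part after the last ":"

echo "scheme=$scheme user=$user host=$host port=$port"
```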
cURL supports the following protocols:
The most pleasant aspect, however, is that curl HTTP proxy and curl SOCKS proxy options are also available.
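The exact set of protocols depends on how a particular cURL build was compiled, so it may be worth checking the local binary. The sketch below relies on the "Protocols:" line printed by curl --version; the function name is our own.

```shell
# Print the protocols supported by the local cURL build, one per line.
# "curl --version" prints a line that starts with "Protocols:".
list_curl_protocols() {
  curl --version | sed -n 's/^Protocols: //p' | tr ' ' '\n'
}

# Usage:
#   list_curl_protocols
```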
The range of tasks applicable for cURL is quite extensive. It's almost like a text-based browser, but its operations are controlled from the console. That's exactly why it's crucial to know the key cURL arguments and the order of their specification.
cURL is typically used:
Let's dive deep into the details.
For Windows Systems
Windows 10 version 1803 and all later Windows releases come with a built-in cURL client. However, there is a nuance. In Windows PowerShell, the name curl is an alias for the Invoke-WebRequest cmdlet, which uses its own syntax and does not accept most cURL command-line parameters.
Therefore, if you want to use the standard cURL interface, you'll need to choose one of several solutions:
For Linux Distributions
Installing cURL is usually unnecessary: most popular distributions ship both the library and the utility by default.
If the system displays an error about the missing executable file when invoking the curl command in the terminal, you can install the program using the standard package manager, for example:
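As a rough sketch, the function below prints a likely install command for whichever common package manager it finds. It installs nothing itself; the function name is hypothetical, and the commands listed are the usual ones for each distribution family.

```shell
# Print a likely install command for the package manager found on this system.
# Nothing is installed here; the function only prints a suggestion.
curl_install_hint() {
  if command -v apt-get >/dev/null 2>&1; then
    echo "sudo apt-get install curl"      # Debian, Ubuntu
  elif command -v dnf >/dev/null 2>&1; then
    echo "sudo dnf install curl"          # Fedora, RHEL
  elif command -v pacman >/dev/null 2>&1; then
    echo "sudo pacman -S curl"            # Arch Linux
  elif command -v apk >/dev/null 2>&1; then
    echo "apk add curl"                   # Alpine
  else
    echo "no known package manager found; see your distribution's documentation"
  fi
}

curl_install_hint
```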
For macOS
cURL is installed in all macOS versions by default. To start using the utility, simply open the "Terminal" application and start entering commands.
To call the program in the console (terminal or command prompt), it's sufficient to simply enter its name: curl.
Unfortunately, running the command without any arguments won't yield any results. At most, it will display a prompt on how to request help (the help section).
To see a basic list of available arguments, you need to enter the command:
curl --help
or for Windows:
curl.exe --help
If you need a comprehensive overview of all options, input the command:
curl --help all
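Please note, arguments can be specified in two ways: in short form (a single dash and one letter, such as -x) or in long form (two dashes and a full name, such as --proxy).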
Using the utility without arguments but with a specified URL is analogous to a GET request. For instance:
curl https://blog.froxy.com/en/what-is-data-parsing
This is the same as:
curl --request GET https://blog.froxy.com/en/what-is-data-parsing
cURL also works with HTTPS (that is, with SSL/TLS certificates) and with other request methods: PUT, POST, DELETE, HEAD.
In response to a GET request, cURL retrieves the entire content of the HTML page at the specified address.
To retrieve only the HTTP response headers instead of the page body, use the -I option:
curl -I https://blog.froxy.com/en/what-is-data-parsing
And you can save the received data into a file:
curl -o what-is-data-parsing.html "https://blog.froxy.com/en/what-is-data-parsing"
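In scripts, it often helps to wrap these basic calls in small functions. The sketch below is one possible way to do it; the function names are our own, while the options used (--silent, --show-error, --fail, --location, --output, --write-out) are standard cURL options.

```shell
# Print only the HTTP status code for a URL (the body is discarded).
http_status() {
  curl --silent --output /dev/null --write-out '%{http_code}' "$1"
}

# Save a page to a file; follow redirects and fail on HTTP errors (4xx/5xx)
# instead of saving the error page.
save_page() {
  curl --silent --show-error --fail --location --output "$2" "$1"
}

# Usage:
#   http_status "https://blog.froxy.com/en/what-is-data-parsing"
#   save_page "https://blog.froxy.com/en/what-is-data-parsing" what-is-data-parsing.html
```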
The list of general cURL arguments includes:
Interestingly, cURL supports substitution and permutation functions.
For instance, you can use the following syntax:
curl "http://domain.{one,two,three}.com"
cURL will sequentially iterate over the comma-separated values inside the curly braces.
If you need to indicate a range of numeric values, you can utilize square brackets, for example:
"http://domain.zone/archive[2000-2022]/vol[1-4]/part{a,b,c}.html"
The script will automatically specify all combinations from 2000 to 2022 (for archive) and from 1 to 4 (for vol). By using a colon, you can define the automatic iteration step, for instance, like this [2000-2022:3], resulting in the script using the sequence 2000, 2003, 2006 etc.
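When many pages are downloaded this way, each one usually needs its own file name. A minimal sketch, assuming the placeholder URL from the text: in the --output template, "#1", "#2" and "#3" stand for the current value of the first, second and third pattern in the URL. The function name and the archive/ directory are hypothetical.

```shell
# Download every combination and store each page under its own name.
# The URL must stay quoted, otherwise the shell may expand the braces itself
# before cURL ever sees them.
fetch_archive() {
  curl --silent --show-error --create-dirs \
       --output "archive/#1_vol#2_part#3.html" \
       "http://domain.zone/archive[2000-2022:3]/vol[1-4]/part{a,b,c}.html"
}

# Usage:
#   fetch_archive
```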
Starting from cURL version 8.3.0, the utility supports variables. Variables are created with the --variable option.
For example:
curl --variable '%DATA=default' --expand-url "https://example.com/api/{{DATA}}/method"
Keep the URL in quotes so that the shell passes the curly braces to cURL unchanged; the substitution is performed by cURL itself.
In the example above, we first declared the variable DATA (the leading % tells cURL to import it from the environment variable of the same name) and immediately assigned it the default value 'default', which the utility uses if that environment variable is not set.
The variable is accessed using the construction {{DATA}}, and it is expanded only in options given with the --expand- prefix, such as --expand-url.
Additionally, when working with cURL, you can use variables declared at the shell script level. Here double quotes are essential, because the shell does not expand variables inside single quotes:
your_variable=XXX
curl -X GET "http://domain.com:8080/details/${your_variable}"
To use cURL with a proxy, pass the "-x" or "--proxy" option.
For instance:
curl --proxy "http://login:password@123.123.123.123:port" "http://domain.com/target-page"
Or:
curl -x "http://login:password@123.123.123.123:port" "http://domain.com/target-page"
However, it's inconvenient to enter the same data every time. What if the proxy details change frequently? There are several typical solutions for such cases. Let’s discuss them below.
If the task doesn't involve iterating through a list of proxy servers, a simple approach will be enough.
Example:
For Linux Systems
export yourproxy=http://LOGIN:PASSWORD@proxy-url.com:8080
For Windows Systems
set yourproxy=http://LOGIN:PASSWORD@proxy-url.com:8080
That is how we've created a variable and assigned a value to it.
Now, you can call the variable where needed. In the case of proxies, it could look like this (in the Windows command prompt, write %yourproxy% instead of $yourproxy):
curl -x "$yourproxy" https://blog.froxy.com/en/what-is-data-parsing
The option involving iterating through a proxy list would be significantly more complicated – see the section above about using environment variables.
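For illustration only, here is a minimal sketch of what such iteration might look like: the function tries each proxy from a list until one request succeeds. The proxy host names, the PROXY_LIST variable and the function name are hypothetical, and the 15-second limit is an arbitrary choice.

```shell
# Hypothetical proxy list, one URL per line; replace with real addresses.
PROXY_LIST="
http://LOGIN:PASSWORD@proxy-one.example:8080
http://LOGIN:PASSWORD@proxy-two.example:8080
"

# Try each proxy in turn until one request succeeds.
fetch_via_any_proxy() {
  url="$1"
  for proxy in $PROXY_LIST; do              # unquoted on purpose: split on whitespace
    if curl --silent --show-error --fail --max-time 15 \
            --proxy "$proxy" "$url"; then
      return 0                              # first working proxy wins
    fi
    echo "proxy failed: ${proxy##*@}" >&2   # log host:port only, not credentials
  done
  return 1                                  # every proxy failed
}

# Usage:
#   fetch_via_any_proxy "https://blog.froxy.com/en/what-is-data-parsing"
```

The function returns a non-zero status when every proxy fails, so a calling script can react to that case.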
If you need the SOCKS protocol, you can explicitly specify it within the proxy address, for example:
--proxy "socks5://login:password@123.123.123.123:port".
Alternatively, instead of the "--proxy" or "-x" attribute, you can use specific attributes:
Their syntax is similar.
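As one example of such protocol-specific options, cURL provides --socks5-hostname, which takes host:port and lets the proxy resolve domain names, while credentials are passed separately through --proxy-user. The sketch below wraps them in a function; its name and the sample values are placeholders.

```shell
# Fetch a URL through a SOCKS5 proxy, letting the proxy resolve host names.
# Arguments: proxy host:port, proxy credentials (user:password), target URL.
fetch_via_socks5() {
  curl --silent --show-error \
       --socks5-hostname "$1" \
       --proxy-user "$2" \
       "$3"
}

# Usage (placeholder values):
#   fetch_via_socks5 "123.123.123.123:1080" "login:password" "http://domain.com/target-page"
```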
Imagine now that you have to work with the command line but you wish to avoid entering cURL connection parameters to the same proxy server every time.
The most interesting solution to avoid this routine is creating a shortcut (alias or pseudonym). This is done using the 'alias' command in Linux systems.
Here's an example of a shortcut with pre-configured connection parameters for proper curl using proxy options:
alias xcurl="curl --proxy http://LOGIN:PASSWORD@proxy-url.com:8080"
Now, to make a request to a specific website or use other utility parameters, instead of the standard curl use proxy command, you simply call our alias:
xcurl https://blog.froxy.com/en/what-is-data-parsing
Instead of 'xcurl', you can use any other name, as long as it doesn't conflict with the names of other system utilities and is easily remembered/typed in the terminal.
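A shell function is a possible alternative to an alias. The sketch below assumes the proxy address is kept in an environment variable named XCURL_PROXY (a name of our own choosing), so the credentials are not hard-coded in the function itself.

```shell
# Proxy address comes from the environment; the fallback is the placeholder
# address used in the alias example.
XCURL_PROXY="${XCURL_PROXY:-http://LOGIN:PASSWORD@proxy-url.com:8080}"

# Same idea as the alias, but as a function. Unlike aliases, functions also
# work inside bash scripts, where aliases are not expanded by default.
xcurl() {
  curl --proxy "$XCURL_PROXY" "$@"
}

# Usage:
#   xcurl https://blog.froxy.com/en/what-is-data-parsing
```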
As for Windows, there is also a command 'alias,' but it functions quite differently. To create a shortcut for a new command, you'll have to work with the registry and .bat/.cmd files.
All curl proxy settings associated with a specific user are stored in their home directory:
For Linux Systems
/home/user/.curlrc
Please note, the dot at the beginning of the file name indicates that it is hidden. To display it in a file manager, you need to activate the option to show hidden files. Alternatively, you can directly access editing the configuration through the command line:
nano /home/user/.curlrc
or
nano ~/.curlrc
Nano is a text editor for the console, but you can use software with a graphical interface like Gedit instead.
If the file doesn't exist, simply create it (the command mentioned above for the console won't change).
Insert the proxy setting into the file and save it:
proxy="http://USERNAME:PASSWORD@123.123.123.123:1234"
Now, cURL can be used as usual, and the proxy parameters will be applied by default:
curl https://blog.froxy.com/en/what-is-data-parsing
For Windows Systems, the settings file should be located in the folder C:\Users\<YOUR-USER-NAME>\AppData\Roaming and should be named _curlrc.
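On Linux or macOS, the same configuration can be created from a script. This is only a sketch: the function name is hypothetical, the proxy address is the placeholder from the text, and the function overwrites the target file, so back up any existing configuration first.

```shell
# Write a minimal cURL config file to the given path (default: ~/.curlrc).
# WARNING: overwrites the target file. The proxy address is a placeholder.
write_curlrc() {
  target="${1:-$HOME/.curlrc}"
  cat > "$target" <<'EOF'
# Route every request through this proxy
proxy = "http://USERNAME:PASSWORD@123.123.123.123:1234"
EOF
  chmod 600 "$target"    # the file contains credentials, keep it private
}

# Usage:
#   write_curlrc "$HOME/.curlrc"
#   curl https://blog.froxy.com/en/what-is-data-parsing
```

A configuration stored under a different name or location can be loaded explicitly with the --config option.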
cURL is undeniably a powerful and functional tool in skilled hands. With this utility alone, you can gather and retrieve headers from target website pages, download their entire content, including HTML pages. All you need is a console and a few standard commands.
The libcurl library has bindings for many programming languages; PHP's cURL extension is one example. Therefore, automation scripts for parsing can be enhanced with just a few additional lines of code.
At the same time, cURL can be difficult for novices because of its complex syntax and non-obvious operating principles.
Regardless of how straightforward your parsing script is, there's a high probability that it could get blocked due to repeated requests from the same IP address. To avoid such blocks, proxies are a must. cURL can work with them out of the box.
The only challenge is to find a reliable service. We offer top-notch mobile and residential proxies with automatic rotation. Our services include an API and straightforward list exports, targeting up to the city and mobile operator levels. We provide access to a pool of over 8 million IP addresses. Additionally, there's a special trial package available for testing purposes.