So, what are user agents? When your parser crawls pages, it has to identify itself to the website. The parameter that carries the client application's name is called the user-agent. Various design decisions and more can depend on it, but many large websites primarily use the values of the user-agent line to organize protection against malicious activity, including blocking parsers.
Below, we'll explain what the popular user agents are and how to keep a parser from being blocked by anti-fraud systems.
Understanding User Agents
A user agent is a text identifier that software sends in HTTP requests when connecting to websites or web services.
Put simply, it is the conventional name of a browser or another program such as a search bot or spider.
Below are a few example user-agent strings. You may notice that multiple parameters are transmitted within a single user-agent line: the platform and operating system, the rendering engine, and the browser name with its version.
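The strings in this sketch are illustrative only; the version numbers are examples rather than the current releases.

```python
# Illustrative user-agent strings (version numbers are examples, not the latest builds).
EXAMPLE_USER_AGENTS = {
    # Chrome on Windows: the legacy "Mozilla/5.0" prefix, the OS details in parentheses,
    # the AppleWebKit engine, the "KHTML, like Gecko" compatibility token, then Chrome
    # and a trailing Safari token - several product names crammed into one line.
    "chrome_windows": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    # Firefox on Linux: platform details, the Gecko engine and the browser with its version.
    "firefox_linux": "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    # Google's search crawler identifies itself openly and even links to an info page.
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}
```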
Attentive readers will notice that user-agent strings from different browsers look very similar, yet they still differ, and sites determine the browser type from exactly these differences.
The reasons for this "similarity" are described in detail on Wikipedia: at the height of the competition between browser vendors, each tried to impersonate its rivals' browsers to stay compatible. As a result, a modern user-agent string can contain up to five product names at a time.
Logically, the most common user agent is the one used on the largest number of devices. Currently, that is the stable Google Chrome build running on Android: according to Statcounter, Android accounts for over 43% of all internet users.
When it comes to desktops, the undisputed leader is Windows paired with Chrome; together they account for over 27% of all devices connected to the internet.
The long-standing leader among browsers, and thus the most popular user-agent, is Google Chrome: it accounts for over 65% of all internet requests.
So, the most reliable options to specify in HTTP requests when fine-tuning a parser are the current stable Chrome user-agents for Android and for Windows.
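As a minimal sketch (assuming the requests library is installed), this is how such a user-agent can be attached to a request; the target URL is a placeholder and the version number is illustrative, so keep it close to the current stable Chrome release:

```python
import requests

# A current-style Chrome-on-Windows user-agent; update the version number regularly.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
}

# https://example.com is a placeholder target, not a real parsing endpoint.
response = requests.get("https://example.com", headers=HEADERS, timeout=15)
print(response.status_code)
```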
It is worth stressing that the user-agent is one of the most important parameters of a digital fingerprint.
Mind that spoofing the User-Agent alone may not be enough for stable parsing without blocks: websites can gather much more information about the client browser or application (details on best parsing practices here). Headless browsers and quality proxies, for example, may come in handy.
Analyzing user-agent strings allows a website to adapt pages to the visitor's device and browser, collect audience statistics, and filter out suspicious or parasitic traffic.
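As a rough illustration of the site's side of this, a handful of regular expressions is enough to pull basic device and bot signals out of the header; the sketch below is deliberately simplified and uses only the Python standard library:

```python
import re

def classify(user_agent: str) -> dict:
    """Extract coarse signals a site might log or feed into a first-pass bot filter."""
    return {
        "is_bot": bool(re.search(r"bot|spider|crawl", user_agent, re.I)),
        "os": next(
            (name for name, pattern in [
                ("Windows", r"Windows NT"),
                ("Android", r"Android"),
                ("macOS", r"Mac OS X"),
                ("Linux", r"Linux"),
            ] if re.search(pattern, user_agent)),
            "unknown",
        ),
        "browser": next(
            (name for name, pattern in [
                ("Chrome", r"Chrome/"),
                ("Firefox", r"Firefox/"),
                ("Safari", r"Version/.+Safari/"),
            ] if re.search(pattern, user_agent)),
            "unknown",
        ),
    }

mobile_ua = ("Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36")
print(classify(mobile_ua))  # {'is_bot': False, 'os': 'Android', 'browser': 'Chrome'}
```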
And yes, you can be blocked simply for not sending a user-agent at all: requests that arrive without any self-identification are the first signal of parasitic traffic.
The real question is how exactly you identify yourself:
Tip 1. Use a user-agent that matches your real device.
The point is that the real operating system and hardware platform can be checked in various ways. Advanced anti-fraud systems compare the lists of pre-installed fonts: Linux and macOS ship different default sets, and a free Ubuntu distribution will almost certainly not contain proprietary fonts covered by Microsoft's copyright.
Another interesting technique is HTML5 canvas fingerprinting: certain elements are drawn on the page using the browser's built-in features, and rendering differs across operating systems. By sampling pixel colors from specific areas of the page, a site can check whether the visitor lied in the user-agent.
Similar scripts can be used to check the platform based on other technologies: WebGL, WebRTC etc.
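To see how such a mismatch surfaces in practice, here is a minimal sketch assuming Playwright is installed (pip install playwright, then playwright install chromium). It spoofs a macOS user-agent in headless Chromium and reads back what the JavaScript environment actually reports, including a simple canvas hash:

```python
import hashlib
from playwright.sync_api import sync_playwright

# Illustrative macOS Chrome user-agent; the version number is an example.
SPOOFED_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(user_agent=SPOOFED_UA)
    page.goto("about:blank")

    # navigator.platform still reflects the real OS unless it is patched separately.
    platform = page.evaluate("navigator.platform")

    # Canvas fingerprint: draw a short string and hash the rendered result.
    canvas_hash = hashlib.sha256(
        page.evaluate(
            """() => {
                const c = document.createElement('canvas');
                const ctx = c.getContext('2d');
                ctx.font = '16px Arial';
                ctx.fillText('fingerprint test', 2, 20);
                return c.toDataURL();
            }"""
        ).encode()
    ).hexdigest()

    print("User-Agent claims macOS, navigator.platform says:", platform)
    print("Canvas hash (differs across OS/GPU/font stacks):", canvas_hash[:16], "...")
    browser.close()
```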
Tip 2. Use up-to-date stable browser versions.
This is also an interesting point: it's best to specify a browser version inside the user-agent that lags no more than 2 versions behind the current stable branch.
The thing is, many developers drop outdated browser versions from the list of supported clients. By continuing to use an old browser, you risk running into compatibility problems and missing support for current web standards.
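A quick self-check along these lines can be run on your own user-agent strings before a crawl; in the sketch below, CURRENT_STABLE_MAJOR is a placeholder you would update or fetch yourself, not a live value:

```python
import re

CURRENT_STABLE_MAJOR = 124  # assumption: replace with the actual current Chrome major version
MAX_LAG = 2                 # the "no more than 2 versions behind" rule from the tip above

def chrome_major(user_agent: str) -> int | None:
    """Extract the Chrome major version from a user-agent string, if present."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return int(match.group(1)) if match else None

def looks_outdated(user_agent: str) -> bool:
    major = chrome_major(user_agent)
    return major is not None and CURRENT_STABLE_MAJOR - major > MAX_LAG

old_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36")
print(looks_outdated(old_ua))  # True: Chrome 109 lags far behind the assumed stable 124
```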
Development teams at the largest websites and services can use this as a signal to flag suspicious traffic.
Plus, many anti-detection browsers run on old Chromium versions. Isn't that a reason to take a closer look at the connected clients?
Tip 3. Rotate IP addresses and fingerprints.
To more accurately identify problematic requests, anti-fraud systems need time. For example, they can measure delays between requests from one IP address, compare identifiers stored in cookies and analyze user-agent strings.
By changing digital fingerprints along with IP addresses, you connect to the target site each time as if from a new device. At least, that's how it looks from the perspective of anti-fraud systems.
Protective mechanisms simply don't have time to react.
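A minimal rotation sketch with the requests library might look like this; the proxy endpoints and user-agent strings below are placeholders, not working credentials:

```python
import itertools
import requests

PROXIES = [
    "http://user:pass@proxy-1.example.com:8080",  # hypothetical rotating endpoints
    "http://user:pass@proxy-2.example.com:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36",
]

# Cycle through (proxy, user-agent) pairs so every request looks like a different device.
rotation = itertools.cycle(zip(PROXIES, USER_AGENTS))

def fetch(url: str) -> requests.Response:
    proxy, ua = next(rotation)
    return requests.get(
        url,
        headers={"User-Agent": ua},
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
```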
Analyzing the user-agent is the basis of many protective systems and mechanisms on large websites. The user-agent alone may not be enough, though, so the parameter is studied in conjunction with other data: cookies, HTML5 canvas, WebGL, etc.
Specifying a parsing user agent is essential, but it must be done correctly to avoid detection by anti-fraud systems.
Rotating proxies can help with bypassing blocks and sending a large number of requests to the same website.
You can buy quality mobile and residential proxies from us. Froxy offers over 8 million IPs with city-level targeting. You only pay for traffic, while the number of parallel connections may vary (up to 1000 ports per user).