Browsers inevitably store a vast amount of information about their users: saved passwords, cached website data (scripts, images, videos, etc.), browsing history, bookmarks, cookies and more.
All of this serves certain technical tasks, but primarily user convenience: authenticating less frequently on favorite websites and services, loading pages faster, not having to memorize passwords, and always having a list of favorite or frequently visited sites at hand.
However, the same data can be used for other purposes: identifying the user, serving personalized advertising and tracking their online activity (websites visited, purchases and orders, interests, etc.).
Taken together, these browser profile parameters form a complete digital fingerprint, and many major websites use such fingerprints to filter out what they consider parasitic traffic.
Let's delve into all of this more thoroughly and, most importantly, understand how digital fingerprints relate to web scraping, and whether the browser profile verification process can be bypassed.
Why Websites Use Fingerprints
Simple illustrative examples are screen resolution, locale and browser version. Based on the screen resolution, the web server can serve the client a different version of the website (desktop or mobile). The locale determines which interface translation is activated, while the browser version helps render styles (CSS) correctly.
A client's IP address is used similarly: based on it, the nearest caching server can be selected (if a CDN is used), so that even a very large and complex web service responds and loads as quickly as possible.
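As a minimal sketch (the route and language list here are illustrative, not taken from any real site), server-side logic of this kind only needs the headers the browser already sends with every request:

```python
# Minimal Flask sketch: branching on headers the browser sends automatically.
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def index():
    ua = request.headers.get("User-Agent", "")
    # Pick an interface language from the Accept-Language header.
    lang = request.accept_languages.best_match(["en", "de", "fr"]) or "en"
    # Crude device split based on the user-agent string.
    variant = "mobile" if "Mobile" in ua else "desktop"
    return f"Serving the {variant} version in '{lang}'"

if __name__ == "__main__":
    app.run()
```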
A site can access the operating system version, font set and many other device parameters. On mobile devices, for example, hardware sensors and touchscreen support can be detected and used.
Cool, right? Yes, but only as long as all of this is used for its direct purpose: enhancing user comfort. However, that isn't always the case. Websites and specialized monitoring systems can put user data to other uses:
- Hidden client identification (even if a user has merely visited the site without entering anything anywhere, the server can already personalize content or ads);
- Covert tracking of user actions – which passwords they enter and where, their card details, location, gender, age and interests. All of this can be exploited (don't fool yourself into thinking that no one is interested in you or your data);
- Tracking competitor sites (based on cookie analysis);
- Detecting bots and anomalous visitor behavior. Based on this data, decisions are made about blocking or other sanctions (which significantly complicates scraping);
- Organizing affiliate programs (where a site can track and count everyone redirected to it by affiliate sites).
Some sites are created by malicious actors and can intercept (steal) elements of digital fingerprints to resell them later (there are even dedicated marketplaces for digital fingerprints). Fingerprints are also used in certain attacks, such as session hijacking via substituted cookies.
Impact of Website Fingerprinting Techniques on Scraping
A digital fingerprint is a set of user parameters by which a person can be identified or tracked online. More often than not, the term digital fingerprint refers to a browser profile.
A browser profile is the set of parameters a browser transmits to a remote server during an HTTP/HTTPS connection.
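You can see the bare minimum such a profile contains by echoing your own request headers (httpbin.org is a public echo service; the exact header set depends on your client):

```python
# Print the headers a plain Python client transmits with every request.
import requests

resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"])
# Typically just Accept, Accept-Encoding, Host and a "python-requests/x.y.z"
# User-Agent - far less than a real browser sends, which by itself is a
# red flag for anti-bot systems.
```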
Beyond those bare-bones headers, a full browser profile may include (a sketch of how a page reads some of these follows the list):
- Operating system and browser version (the so-called user-agent);
- Processor type, architecture and model (set of available instructions);
- Graphics accelerator model;
- RAM amount;
- Screen resolution;
- A set of installed browser plugins;
- Cookies files;
- Bookmarks;
- Locale (browser language);
- Time zone;
- List of available fonts;
- Available media devices (camera, microphone, speakers/headphones);
- Supported technologies (JavaScript execution, WebGL rendering support, HTML5 Canvas etc.);
- Geolocation (generally determined based on IP address).
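Most of these parameters are readable from JavaScript alone. Here is a minimal sketch using Playwright (the property list is just a sample of what a page script can collect):

```python
# Inspect what a fingerprinting script can read from the browser itself.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    profile = page.evaluate("""() => ({
        userAgent: navigator.userAgent,
        platform: navigator.platform,
        language: navigator.language,
        hardwareConcurrency: navigator.hardwareConcurrency,  // CPU threads
        deviceMemory: navigator.deviceMemory,                // RAM hint, GB
        screen: {width: screen.width, height: screen.height},
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
        plugins: [...navigator.plugins].map(pl => pl.name),
    })""")
    print(profile)
    browser.close()
```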
Some sites go further with audio fingerprinting: rather than capturing sound from the microphone, they measure how the device's audio stack processes a synthetically generated signal (via the Web Audio API), which yields a stable per-device signature.
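A rough sketch of that audio technique (the oscillator settings and sample slice follow the approach popularized by open-source fingerprinting libraries; treat the exact numbers as illustrative):

```python
# Hash an OfflineAudioContext render - roughly how audio fingerprinting
# libraries derive a per-device signature without touching the microphone.
from playwright.sync_api import sync_playwright

AUDIO_FP_JS = """async () => {
    const ctx = new OfflineAudioContext(1, 44100, 44100);
    const osc = ctx.createOscillator();
    osc.type = 'triangle';
    osc.frequency.value = 10000;
    const comp = ctx.createDynamicsCompressor();
    osc.connect(comp);
    comp.connect(ctx.destination);
    osc.start(0);
    const buf = await ctx.startRendering();
    // Sum a slice of samples: tiny floating-point differences in the
    // audio pipeline vary per device/browser build.
    const data = buf.getChannelData(0).slice(4500, 5000);
    return data.reduce((a, b) => a + Math.abs(b), 0).toString();
}"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    print(page.evaluate(AUDIO_FP_JS))
    browser.close()
```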
Earlier, we discussed practices that reduce the risk of being blocked while scraping data.
For example, in particularly challenging situations, one should use headless browsers or even resort to screen scraping (recognizing content directly from the rendered screen).
However, understanding what digital fingerprints and browser profiles are, and applying that knowledge in real situations, are two different things.
The simplest examples of how sites scan digital fingerprints include:
- The server checks for JavaScript availability and analyzes cursor movement. If the client's browser executes no scripts and shows no cursor activity, it is highly likely being driven by a bot;
- The site checks the operating system version against the available font set. If the operating system is reported as Windows, a set of standard proprietary fonts must be present. If the list of available fonts is empty, the browser is likely trying to deceive someone, and the user and their connection are blocked (a sketch of such a check follows this list);
- The structure of the browser's HTTP requests and responses is analyzed. Certain constructions easily reveal the use of anti-detection browsers (although high-quality anti-detect browsers can mask themselves correctly);
- A site may keep its own database of known browser fingerprints, and each new client undergoes a validity check against it. If a browser attempts to log into one account, but its real digital fingerprint (browser profile) indicates an association with another client, the security system may ask the user to take additional identification steps (activate a second authentication factor, answer a secret question, etc.).
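As a toy illustration of the font check above (the font list and flagging rule are hypothetical, not any real site's logic):

```python
# Hypothetical server-side consistency check: a client claiming Windows
# should expose at least some of the standard Windows fonts.
WINDOWS_CORE_FONTS = {"Arial", "Calibri", "Segoe UI", "Times New Roman", "Tahoma"}

def looks_suspicious(user_agent: str, reported_fonts: set[str]) -> bool:
    """Flag clients whose claimed OS contradicts their reported font list."""
    if "Windows" in user_agent:
        # No core font present at all -> the profile is probably fabricated.
        return not (WINDOWS_CORE_FONTS & reported_fonts)
    return False

# A spoofed Windows user-agent with an empty font list gets flagged.
print(looks_suspicious("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", set()))  # True
```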
The conclusion is as follows: if you want to scrape competitor sites or gather data from large platforms like Amazon or eBay, you need to take care of your browser's (scraper's) digital fingerprints.
How to Bypass Browser Fingerprinting while Scraping?
Web scraping is not always malicious. More often than not, automated requests are driven by simple and peaceful tasks: data retrieval, price monitoring, competitor analysis, niche selection, counterparty verification, etc.
Protecting browser fingerprints on one side and scanning them on the other is an endless tug-of-war. Some want to protect their personal data (browser profiles, digital fingerprints) or bypass site/service restrictions, while others want to know everything about their clients in order to sell more effectively or, conversely, to block loads they consider parasitic.
There is no single correct position on this issue, and there cannot be. Anyone can end up on either side of the barricade.
The challenges for scraping, then, are clear: because digital fingerprints (browser profiles) are verified, automated data collection becomes harder. Sites can check a large number of client parameters and block a visitor at the slightest suspicion.
Available bypassing methods include:
- A unique browser profile (with emulation of all possible parameters - from the operating system version and font set to cookies and the list of installed plugins) is generated for each new account created on the target site;
- To maintain a large number of isolated browsers or browser profiles, virtualization environments (virtual machines and containers) can be used along with special browsers - anti-detection browsers. The latter can fake almost any parameters requested by target sites;
- Neither virtual machines nor anti-detection browsers can change a client's location. For this task, proxies should be used. Proxy selection should match the emulated client's parameters, i.e., the address should correlate with the digital fingerprint. For example, if a smartphone screen resolution is emulated, the IP address should belong to a mobile network operator (more details on mobile proxies). If the browser has a desktop resolution, it makes sense to use the IP addresses of home users (residential proxies);
- Since many major sites also check for support of certain web technologies, including JavaScript, task automation should rely on browsers capable of emulating user behavior (pointer movement, manual input in forms, etc.). Headless browsers can do all of this (a sketch combining these measures follows below).
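Putting the pieces together, here is a minimal Playwright sketch of a self-consistent profile; the user-agent string, proxy address and credentials are placeholders (assumptions), not working endpoints:

```python
# Spoof a coherent mobile profile: UA, locale, timezone, viewport and a
# matching (hypothetical) mobile proxy, plus crude pointer activity.
from playwright.sync_api import sync_playwright

MOBILE_UA = ("Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36")

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={"server": "http://proxy.example.com:8080",  # placeholder
               "username": "user", "password": "pass"},
    )
    context = browser.new_context(
        user_agent=MOBILE_UA,
        locale="de-DE",                          # should match the proxy's region
        timezone_id="Europe/Berlin",             # ditto
        viewport={"width": 412, "height": 915},  # phone-sized screen
    )
    page = context.new_page()
    page.goto("https://example.com")
    page.mouse.move(100, 200)  # minimal human-like cursor movement
    browser.close()
```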
Conclusion and Recommendations
Websites, especially those backed by large IT teams, have learned to distinguish bots and automatically generated traffic in order to block it and reduce hosting expenses.
Digital fingerprints (browser profile parameters) are mostly used to distinguish real clients from fake ones.
However, for every action there is a counteraction. Scrapers can be taught to emulate user behavior and spoof most digital fingerprint parameters. For this purpose, headless or anti-detection browsers are typically used in conjunction with proxies.
Proxies are an extremely important element, responsible for changing the apparent location and protecting real IP addresses (in case of blocks).
We, the Froxy team, offer high-quality mobile and residential proxies with payment based on traffic packages. IP rotation can be done on demand or on a timer. New addresses can be selected in the same location (up to the city level) and even from the same telecommunications operator, significantly reducing the risk of blocking. The pool of addresses includes over 8 million IPs in 200+ countries.