Parsing works best when nothing blocks it. In practice, however, that is quite rare. Below, we'll discuss how to bring reality closer to the dream and minimize the risk of blocks during automated site parsing.
Certain websites and some major internet platforms know that their clients parse them, so they provide dedicated APIs with clearly defined limits and data retrieval methods. This is a rare occurrence, however. Parsing is usually perceived as a negative practice because a large number of parallel requests creates a significant load on the server. This, in turn, can increase the cost of maintaining the IT infrastructure.
For this simple reason, many websites and online services try to detect automatic requests and then reset or block connections to reduce the parasitic load on the server.
However, for every cunning defense system, there will undoubtedly be an equally cunning means to bypass it. We have previously written about what parsing is, the differences between web scraping and web crawling, available tools for Amazon parsing, existing Python web scraping libraries as well as Golang libraries.
In this material, we'll only discuss ways to scrape data from websites without being blocked. To be precise, we will focus on the methods and approaches that reduce the likelihood of encountering blocks.
Setting Random Intervals between Requests
The most prominent indicator that a site is interacting with a bot rather than a real person is requests arriving at regular intervals. Humans do not switch between pages within milliseconds, nor do they do it with clockwork precision.
Many parsing programs operate based on predetermined delays between requests. Yes, these delays can often be replaced with custom values, but these settings are typically global and are centrally defined for a session or a specific project. The delay itself is usually set very precisely (in seconds or milliseconds), making it easily traceable at the request analysis level. It takes only 3-4 requests to realize that a bot has initiated the session.
However, if you're building your own parser or you have access to communicate with the developers of the original software, randomizing the time between requests can be relatively simple.
For instance, in Python, just a couple of lines of code help achieve this:
import random  # library for generating random numbers
import time  # library that provides the sleep() pause
time_sleep = random.uniform(1, 9)  # pick a random delay between 1 and 9 seconds
time.sleep(time_sleep)  # pause before the next request; a new value should be drawn on each iteration
Naturally, it's advisable to use this construction within a loop that traverses a list of URLs.
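As a minimal sketch of such a loop: the function name crawl_with_random_pauses and its fetch and sleep parameters are assumptions made for illustration, not part of any particular parser. The fetch callable stands for whatever code actually downloads a page.

```python
import random
import time


def crawl_with_random_pauses(urls, fetch, min_delay=1.0, max_delay=9.0, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing a random time between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # A fresh random pause is drawn before every request except the first
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

Passing the sleep function as a parameter keeps the loop easy to test: a fake can be substituted to record the pauses instead of actually waiting.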
Using Rotating Proxies and Cleaning Services
The primary factor that allows identifying a violator on the network is their IP address. All devices connected to the Internet have IP addresses; you simply cannot send or receive requests without them.
IP addresses can be static (assigned to a specific client) or dynamic (rotated based on the provider's internal algorithms). If interested, you can delve deeper into what a dynamic IP address is.
When a website's defense algorithms (anti-fraud systems) detect a violator, the violator's IP address is placed on a special blacklist - either permanently or for a specified period (depending on the type of IP address and the nature of the violation).
Upon a subsequent connection attempt, if the request originates from a blacklisted IP address, the connection is reset or the client is shown a placeholder page. Parsing becomes impossible.
Bypassing such protective measures is quite easy: you substitute the IP address. The problem is where to find a large pool of IP addresses for rotation. Free proxies quickly stop working. It is close to impossible to get them from a specific location (which can be critical for some sites). Their connection speed is low, and so is their anonymity (your request data might be stored and processed by third parties).
The best option is high-quality specialized providers offering proxy rental services, such as Froxy.
The proxy type and network coverage are crucial. The best types of proxies for parsing are mobile and residential ones.
If you have no desire to manage IP address lists yourself, it's better to use back-connect (BackConnect) proxies: you connect to a single entry point, and the provider rotates the outgoing IP addresses on its side.
If your parser does not support proxy operation, then it is likely not a parser.
Open the program settings and load the proxy list into the corresponding field. Ideally, a parser supports different types of proxies (HTTP or SOCKS) and can check IP availability in advance.
Proxy rotation can be performed based on various rules, depending on the target site's security policies and your preferences. For instance, IP replacement can occur with each new request. Attempts can be made to maintain the connection as long as possible, or IP changes can be time-based.
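The rotation rules described above can be sketched with the Python standard library alone. The ProxyRotator class, its hold_seconds parameter and the proxy addresses in the usage note are hypothetical; a real parser or proxy provider may organize rotation differently.

```python
import itertools
import time
import urllib.request


class ProxyRotator:
    """Hands out proxies from a fixed list, per request or per time slice."""

    def __init__(self, proxies, hold_seconds=0.0, clock=time.monotonic):
        if not proxies:
            raise ValueError("proxy list is empty")
        self._cycle = itertools.cycle(proxies)
        self._hold_seconds = hold_seconds  # 0 means: a new proxy for every request
        self._clock = clock
        self._current = next(self._cycle)
        self._since = clock()
        self._first = True

    def get(self):
        """Return the proxy to use for the next request."""
        now = self._clock()
        if self._first:
            self._first = False
        elif now - self._since >= self._hold_seconds:
            # The current proxy has been held long enough: switch to the next one
            self._current = next(self._cycle)
            self._since = now
        return self._current


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

With hold_seconds left at 0, every call to get() returns the next address, for example ProxyRotator(["http://proxy1.example:8080", "http://proxy2.example:8080"]). With hold_seconds=300, the same address is kept for five minutes before switching, which suits sites where a stable session matters.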
Setting Correct Headers and User-Agent Name
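Every browser request carries technical headers, including the User-Agent string that names the browser and the operating system. What prevents anti-fraud systems from analyzing this information and identifying suspicious requests? Well, nothing really. That's why many major websites do the following: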
- Check account compliance and identifiers in cookies;
- Read their own and third-party cookies to acquire digital fingerprints;
- Analyze the set of fonts available in the operating system;
- Verify encoding, locale etc.
Proper parsers can mimic current browser versions and generate essential technical headers.
If you're building your parser from scratch, don't forget to set the headers that a real browser normally sends, such as User-Agent, Accept, Accept-Language and Accept-Encoding.
Certain websites may also scrutinize other request headers, such as Cookie and Authorization. More details about referer headers will be discussed in the section below.
You can check which headers your real browser sends by opening https://httpbin.org/headers in it.
To see what a complete HTTP request looks like, open https://httpbin.org/anything in the same browser.
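A minimal sketch of sending browser-like headers with the standard library is shown here. The header values are illustrative samples only (the User-Agent string in particular is an assumption and should be replaced with the one your own browser reports), and build_request is a hypothetical helper name.

```python
import urllib.request

# Sample values; copy the real ones from your own browser
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Accept-Encoding is left out on purpose: urllib does not decompress
    # gzip responses automatically, so asking for them would need extra code.
}


def build_request(url, extra_headers=None):
    """Return a urllib Request carrying browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)
```

Opening build_request("https://httpbin.org/headers") with urllib.request.urlopen should echo the headers back, which makes it easy to compare them with what the real browser sends.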
Source Specification (Referer Header)
When a user seeks information, they do that within a search engine or directly within a website. Consequently, direct transition to a complex URL with numerous filters and parameters may raise security system concerns.
To alleviate suspicions, it's logical to indicate to the website that the transition isn't direct but originates from an organic source, such as a search engine's site.
Similarly, security systems can track the way a user browses website pages. If a user navigates from one URL to another, the previous URL acts as the referrer.
The HTTP Referer header is responsible for conveying this information.
This example not only specifies the search engine site but also includes the search query and device type on which the browser operates within the parameters.
Even a mere specification of the search engine's site can be an advantage:
This makes the HTTP request more convincing and does not cause any suspicions.
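Both variants can be sketched in a few lines. The helper names search_referer and referer_chain, the Google search address and the target.example URLs are assumptions made for illustration: the first helper builds a Referer that includes a search query, and the second one imitates page-to-page navigation, where each previous URL becomes the referrer of the next.

```python
from urllib.parse import urlencode


def search_referer(query, engine="https://www.google.com/search"):
    """Build a Referer value that looks like a search results page."""
    return f"{engine}?{urlencode({'q': query})}"


def referer_chain(urls, entry_referer="https://www.google.com/"):
    """Pair each URL with the Referer header for visiting the URLs in order."""
    previous = entry_referer
    pairs = []
    for url in urls:
        pairs.append((url, {"Referer": previous}))
        previous = url  # the page just visited becomes the next referrer
    return pairs
```

For example, referer_chain(["https://target.example/catalog", "https://target.example/catalog?page=2"]) assigns the search engine's site to the first request and the catalog page to the second.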
Using Headless Browsers
Many modern websites rely heavily on JavaScript. Some content loads dynamically or only in response to specific user actions, which significantly complicates the parsing process.
The main problem is that the final HTML document is generated not on the server but directly within the browser after executing a series of JS scripts.
To start working with such sites, you need either a real browser (along with a real person) or a specialized library (driver) that translates programmatic requests into browser actions and vice versa.
Such a software interface (also known as an API) can be obtained using:
- Selenium WebDriver
- Chrome DevTools Protocol
Chrome can also be started in headless mode directly from the command line (with the --headless flag).
Leading solutions allow combining Headless browsers with proxy servers.
Note that each new instance of a headless browser consumes almost as many computer or server resources as a regular browser instance with a graphical interface.
If you're using the Selenium library or its analogs, consider implementing a stealth mode. By the way, you may require other additional libraries, such as undetected_chromedriver.
Another significant issue is managing cookies and creating unique digital fingerprints for multi-account scenarios: each browser instance should operate through its own proxy (a unique IP address) and have its own set of digital fingerprint attributes (cookies, fonts, screen resolution, etc.).
Many large-scale websites like Amazon, Instagram, eBay, Craigslist, Facebook etc. track attempts to access various accounts from one IP and may block all related accounts due to policy violations. Therefore, it's essential to work through proxies.
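As a rough sketch of combining a headless browser with a proxy and a separate profile, the code below only assembles Chrome command-line flags and runs the browser once. The function names, the "google-chrome" binary name and the default window size are assumptions; the binary name differs between systems, and Selenium or a DevTools-based library would be configured through its own options instead.

```python
import shutil
import subprocess


def headless_chrome_args(url, proxy=None, profile_dir=None, window_size=(1366, 768)):
    """Assemble command-line flags for one isolated headless Chrome run."""
    args = ["--headless", f"--window-size={window_size[0]},{window_size[1]}"]
    if proxy:
        # Route this instance through its own proxy (its own IP address)
        args.append(f"--proxy-server={proxy}")
    if profile_dir:
        # A separate profile directory keeps cookies apart between instances
        args.append(f"--user-data-dir={profile_dir}")
    # --dump-dom prints the page's HTML after scripts have run
    args += ["--dump-dom", url]
    return args


def render_page(url, chrome_binary="google-chrome", **options):
    """Run Chrome once and return the rendered HTML (requires Chrome installed)."""
    if shutil.which(chrome_binary) is None:
        raise FileNotFoundError(f"{chrome_binary} was not found on PATH")
    completed = subprocess.run(
        [chrome_binary, *headless_chrome_args(url, **options)],
        capture_output=True, text=True, timeout=60, check=True,
    )
    return completed.stdout
```

Giving every instance its own proxy and profile_dir is the simplest way to keep accounts from sharing an IP address or a cookie store. Fonts and other fingerprint attributes need dedicated tooling beyond these flags.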
Bypassing Honeypots and Error Analysis
Honeypots are straightforward yet effective means to safeguard websites against bots and parsers. Typically, these are certain page elements (like input forms or links) visible to automated HTML code analysis tools but hidden from regular users. Any attempt to access such an element helps unequivocally identify the violator.
For instance, a site's code may contain a link like this:
<a href="https://target.domain/honeypot" style="display: none">Some text</a>
The attribute style="display: none" makes it invisible to regular users. As a result, they cannot click or navigate to the specified address.
However, a parser "sees" all links in the document. When it attempts to access this trap, it gets caught, instantly revealing itself.
To bypass such traps at the HTML code level, one can:
- Monitor the use of CSS properties like "display: none" or "visibility: hidden."
- Attempt to detect text color matching the background color for links.
- Track elements being positioned beyond the visible screen area.
- Adhere to the rules stated in the robots.txt file (avoid visiting sections and pages disallowed for indexing).
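Two of these checks can be sketched with the Python standard library: skipping links hidden through inline styles and respecting robots.txt. The class and function names are hypothetical. Note the limits of this sketch: it inspects only inline style attributes and the hidden attribute, so links hidden through CSS classes, external stylesheets, matching colors or off-screen positioning would require analyzing the rendered page.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

HIDING_RULES = ("display:none", "visibility:hidden")


class VisibleLinkCollector(HTMLParser):
    """Collects link targets, setting aside links hidden by inline styles."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.skipped = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalize "display: none" and "DISPLAY:NONE" to one comparable form
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attributes or any(rule in style for rule in HIDING_RULES)
        (self.skipped if hidden else self.links).append(href)


def safe_links(html, robots_txt="", user_agent="*"):
    """Return visible links that robots.txt does not disallow."""
    collector = VisibleLinkCollector()
    collector.feed(html)
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return [link for link in collector.links if robots.can_fetch(user_agent, link)]
```

Applied to the honeypot link shown earlier, the collector places https://target.domain/honeypot into skipped, so the parser never requests it.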
Another interesting aspect involves tracking 4xx errors (missing, moved or forbidden pages) and 5xx errors (server failures). If a parser hits the server too often or keeps requesting non-existent pages, it can easily be blacklisted.
To avoid negative statistics, it makes sense to exponentially increase the time between accessing problematic pages and the site in general: each repeated error will lead to an increase in time between requests.
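A rough formula for this could be: min(2^n + random_number_milliseconds, maximum_backoff), where n is the number of consecutive failed attempts and the result is the waiting time in seconds.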
The maximum time could be set approximately in the range of 33-65 seconds.
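The backoff rule can be written as a small function. This is a sketch under assumptions: the name backoff_delay, the 64-second default cap (chosen from the range mentioned above) and the jitter of up to one second are illustrative choices.

```python
import random


def backoff_delay(attempt, maximum_backoff=64.0, jitter=None):
    """Seconds to wait after `attempt` consecutive errors (attempt starts at 0)."""
    if jitter is None:
        # Up to 1000 random milliseconds, expressed in seconds
        jitter = random.uniform(0.0, 1.0)
    return min(2 ** attempt + jitter, maximum_backoff)
```

With these defaults the pauses grow roughly as 1, 2, 4, 8, 16 and 32 seconds (plus jitter) and then stay at 64 seconds. The random part prevents several workers from retrying at exactly the same moment.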
Using Services to Solve Captcha
When a website's security system suspects automated requests, it may prompt for human verification. This is commonly done through CAPTCHA or similar tools, where users solve simple puzzles, find characters in an image etc.
While there are software implementations capable of recognizing images and automatically solving CAPTCHA, security methods periodically change, making human intervention the most reliable CAPTCHA solver.
If you seek to automate the resolution of any CAPTCHA, specialized services like 2Captcha, AntiCAPTCHA etc. should be considered. These platforms function as exchanges that engage real people and pay them for their work.
Interaction with these services occurs via APIs. To initiate the process, you need to fund your account and set the rates for solving operations to attract individuals solving CAPTCHAs.
Tracking Speed Limits and Access Denials
We've discussed why websites block automated traffic - it consumes server resources and disrupts regular users, creating parasitic load.
Consequently, many websites intentionally restrict connection speeds and the number of requests to their servers (made concurrently or within a specific time frame) for each individual user (IP address). This helps protect their computational resources and guarantees availability for other users.
Limits imposed on a specific client may be reported in response headers such as X-RateLimit-Limit (the maximum number of requests allowed within a given time window). It's logical to assume that violating these limits while parsing will result in the server blocking your connections.
Moreover, instead of a 200 status code, the server might return a 429 error (too many requests) or a 503 error (the service is temporarily unavailable).
When receiving such errors, parsing should be paused for some time and an attempt to resume should be made later. If the error persists, you should try changing proxies or the user-agent parameter.
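This decision logic can be sketched as a pure function. The name next_step, the action labels and the max_attempts threshold are hypothetical. The sketch also reads the standard Retry-After header, which the text above does not mention but which servers may attach to 429 and 503 responses; only its "number of seconds" form is handled here.

```python
def next_step(status, headers, attempt, max_attempts=3, maximum_backoff=64.0):
    """Decide what to do after a response: proceed, wait, or change identity."""
    if status not in (429, 503):
        return ("proceed", 0.0)
    if attempt >= max_attempts:
        # The error persists: switch the proxy and/or the User-Agent
        return ("rotate_identity", 0.0)
    lowered = {name.lower(): value for name, value in headers.items()}
    retry_after = lowered.get("retry-after", "").strip()
    if retry_after.isdigit():
        # The server stated how many seconds to wait
        return ("wait", float(retry_after))
    # No hint from the server: fall back to an exponentially growing pause
    return ("wait", min(2.0 ** attempt, maximum_backoff))
```

For example, next_step(429, {"Retry-After": "30"}, 0) asks the caller to wait 30 seconds, while a fourth consecutive 429 (attempt=3) asks for a new proxy or User-Agent instead.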
Scraping Google Search Results from Cache
If the target website is reliably protected and quickly detects parsers, an alternative method can be used: Google search scraping from cache instead of direct parsing.
This method, however, is not always suitable:
- Some websites may request Google not to cache their content (meaning there won’t be any cached versions available);
- Information in the cache may be outdated.
Accessing the cached web page version is done as follows:
where you need to specify the target URL instead of "https://froxy.com/".
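A small helper for building such an address might look like this. It assumes the webcache.googleusercontent.com address format that has historically been used for Google's cached copies; both that endpoint and the cached_url helper name are assumptions here, and the endpoint may not respond for every page or may no longer be served at all.

```python
from urllib.parse import quote

# Assumed historical address of Google's cache; availability is not guaranteed
CACHE_ENDPOINT = "https://webcache.googleusercontent.com/search"


def cached_url(target_url):
    """Return the cache lookup address for target_url."""
    return f"{CACHE_ENDPOINT}?q=cache:{quote(target_url, safe=':/')}"
```

Calling cached_url("https://froxy.com/") yields the address to request instead of the original page.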
Keep in mind that Google's caching service has its own limits on the number of requests and its own bot protection system.
Conclusions and Recommendations
Web scraping without getting blocked is feasible, but it requires properly configured specialized software and attention to several factors. While you can create your own parser from scratch, it's essential to account for the above-mentioned aspects: traps, delays, headers, cookies, digital fingerprints, CAPTCHA solving and proxy rotation.
Regardless of your parser's structure, proxies are indispensable when dealing with large volumes of data and simultaneously gathering information from numerous pages. Only they can help circumvent IP blocking and connection limitations to servers, significantly reducing CAPTCHA frequency.
The best rotating proxies with broad global coverage and precise geotargeting (up to the city level) can be found at Froxy. Our network encompasses over 8.5 million IPs from over 200 countries, offering customizable proxy list export formats, an API and up to 1000 parallel ports. Proxy types include mobile and residential (ideal for content scraping and other business tasks), operating on HTTP and SOCKS protocols.