Using Rotating Proxy in Scrapy: Comprehensive Guide

Learn how to use rotating proxies in Scrapy to enhance your Python web scraping. This detailed guide eliminates bans and improves data accuracy while scraping.

Team Froxy 19 Dec 2024 8 min read

Proxy servers are practically indispensable in web scraping. A proxy solves several tasks at once: from ensuring anonymity and protecting the user (and their devices) to scaling and parallelizing data collection.

This article will explain how to integrate custom software based on the Scrapy framework with rotating proxies. Let's start with the theory.

What Is Scrapy?

Scrapy is a powerful open-source framework that comprehensively solves the problem of writing your own parsers. We mentioned Scrapy in our material about Python libraries, which is no accident: the framework is written in Python.

A working scraping script can be created in just a few lines of code. Moreover, Scrapy supports parallel request handling, can integrate with ready-made scraping services via API (like Zyte), deploys easily in the cloud and has a huge number of extensions.

This is effectively a ready-made scraper that can be quickly adapted to your needs and tasks. It already knows how to export data in common formats (JSON, CSV and XML), manage cookies, create scraping queues, etc.

The most interesting feature for proxy use is that Scrapy can work with SOCKS and HTTP(S) proxies out of the box. Both static IPs (v4/v6) and rotating IP proxies are supported.

Additionally, Scrapy has a built-in API interface and can interact with headless browsers like Playwright or anti-detect services.

Scrapy Installation

Scrapy is installed using the standard package manager pip command:

pip install Scrapy

If you use the Anaconda or Miniconda distribution, you can use the conda package manager instead:

conda install -c conda-forge scrapy

Note: When installing in a Windows environment, you need to pre-install the Microsoft Visual C++ environment, as it may be required for certain functions.

Since Scrapy depends on several other libraries, they can be installed in advance: lxml, parsel, w3lib, twisted, cryptography and pyOpenSSL. However, these are typically installed automatically when using a package manager.

The Twisted library should be installed with the additional "tls" extra to avoid some common errors:

pip install twisted[tls]

Creating and Launching Your First Scrapy Parser

The parser is essentially ready to use. You just need to use the built-in command shell (official documentation):

scrapy shell

After you enter the command, the Scrapy terminal will open. Here, you can start scraping pages. For example:

fetch("https://scrapingclub.com/exercise/list_infinite_scroll/")

To display the fetched HTML code, use the following command:

print(response.text)

In your queries, you can extract text or the content of specific tags (including via CSS selectors). However, if you need to locate an element through a more complex structure of tags, classes and attributes, use XPath syntax. For example:

response.xpath("//div[@class='p-4']/h4/a/text()").extract()
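
If the markup is simple enough, the same data can also be selected with Scrapy's CSS selector syntax; a rough equivalent of the XPath query above would be:

response.css('div.p-4 h4 a::text').getall()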

To create a full-fledged parser, use the startproject command. For example:

scrapy startproject my_first_parser

Scrapy will create a separate project directory containing configuration files, spiders, etc. (Note that project names may only contain letters, numbers and underscores.)

At a minimum, you need to create your first spider in the project, specifying the URLs for scraping and crawling tasks (along with markup schemes).

Here's an example of such a spider (create a file my_first_spider.py in the my_first_parser/spiders folder):

# python 3
import scrapy
from urllib.parse import urljoin

class MySpider(scrapy.Spider):
    name = "firstparser"
    start_urls = [
        'https://scrapingclub.com/exercise/list_infinite_scroll/',
    ]

    def parse(self, response):
        for post_link in response.xpath('//div[@class="p-4"]/h4/a/@href').extract():
            url = urljoin(response.url, post_link)
            print(url)

Now all that's left is to launch the spider:

scrapy crawl firstparser

In this case, the console will display a list of URLs found in the product titles (but only on one page).

What Are Rotating Proxies and Why Do We Need Them?

Rotating proxies are proxies that change their outgoing IP addresses during operation. Rotation can be triggered by various conditions and events.

Typically, rotating proxies refer to BackConnect proxies (also known as reverse proxies). These proxies operate as follows:

  • There is an access point to the network (an entry point), which is also a proxy.
  • When a request is sent to the entry point, it is routed through the internal network according to specific rules (the end user can partially influence these rules if provided with the appropriate interfaces - APIs or a web dashboard).
  • The request exits the network at a point determined by the routing rules.
  • Since the proxy network is constantly evolving (some addresses go offline, others are added, etc.), the entry point (the so-called super-proxy) must know where it can forward a request and where it can't. Feedback channels are configured specifically for this purpose: all proxies in the network communicate with the super-proxy, reporting their status and IP.
  • Technically, a user connects to just one proxy (entering only one address). However, the exit point can change each time because all available proxies are intentionally rotated within the system. This is done to distribute and balance the load as well as to reduce the risk of being blocked.

Nevertheless, IP rotation can follow other schemes. For example, static IPs (server proxies) can be rotated manually using scripts or specialized proxy managers.

Some programs can independently test proxies and rotate them based on a predefined list.

The following types of proxies fall under the definition of rotating proxies: server proxies, residential proxies, and mobile proxies - provided the address pool is large and they are connected via a BackConnect scheme.

3 Most Popular Proxy Integration Methods

So, we have settled on the proxy format. Now let's look at the connection methods. Each has its advantages and disadvantages; we review them below.

Proxy Lists

In this case, you work proactively by copying an entire list of proxy servers and connecting them to your script.

The script itself can test proxies, incorporate them into operation and distribute them among crawlers (worker processes). Rotation can be implemented based on custom conditions and triggers.

Essentially, you can take a large list of static proxies and make them rotatable.
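
As a rough illustration of this approach (separate from the ready-made rotator described later), below is a minimal sketch of a custom downloader middleware that assigns a random proxy from your own list to every request; the list, the addresses and the class name are purely hypothetical:

import random

# hypothetical list of static proxies that you maintain yourself
PROXIES = [
    'http://user:pass@proxy1.example.com:8001',
    'http://user:pass@proxy2.example.com:8002',
]

class RandomProxyFromListMiddleware:
    """Downloader middleware: picks a random proxy from the list for each request."""

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware will pick up the 'proxy' key from request.meta
        request.meta['proxy'] = random.choice(PROXIES)

Testing proxies, banning dead ones and re-activating them over time would still have to be added on top of this sketch.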

Disadvantages include:

  • The amount of code and the script's complexity grow significantly;
  • Since the list may contain a vast number of proxies, each address must be considered individually and categorized based on various characteristics like functionality, ping, location etc.;
  • No matter how large the list is, it can eventually run out. This necessitates planning for load balancing and reactivation of used proxies in advance;
  • If the proxies are low-quality (free ones, for example), the percentage of working proxies may be critically low. Even a large list can quickly become useless. In that case, you'll need to find new proxies and integrate them back into the list and the workflow.

Rotating/BackConnect Proxies

BackConnect proxies solve most of the mentioned issues. The remote server handles tasks like rotation, quality analysis, proxy location sorting etc.

You are provided with a single connection address (the entry point to the proxy network). You configure the parameters for address selection and rotation. That's it - you're ready to start working.

However, there are certain issues here as well:

  • Rotation services are never free. BackConnect proxies are full-fledged services with their own subscriptions and pricing plans (Froxy, for example).
  • Internal rotation algorithms may conflict with your proxy requirements. For instance, there might not be any free IPs in your chosen location or within the same ASN-numbered subnet. In such cases, authentication could reset, requiring you to log in more frequently, which increases the risk of account bans.

Despite these challenges, the undeniable advantage is the simplicity of integration — just one proxy address needs to be connected to your parser, without any additional rotation logic in the code.

API

Let's assume that the final (output) IP is stable but is blocked by a specific website or web service. Parsing is impossible in this case.

As a result, you need to manually access the proxy control panel and force an IP update. However, this is not the most convenient solution for an automated parsing script.

It makes sense to write code that refreshes the "problematic" IP without user involvement. However, not all proxy services provide a dedicated interface for programmatic interaction, the so-called API (Application Programming Interface).

IP rotation via API is performed using specific commands defined by the proxy service. Each service has its own syntax. Additionally, to tell clients apart, a special key is passed in API requests, serving as a user identifier.

Instead of complex API call sequences, many proxy services let you update the outgoing IP address via a special link. Simply opening this link, or sending a request to it from the parser, triggers proxy rotation.
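
As a minimal sketch (the URL, parameters and key below are hypothetical and will differ between providers), triggering rotation via such a link from Python could look like this:

import requests

# hypothetical rotation link issued by the proxy provider
ROTATION_URL = 'https://proxy-provider.example.com/rotate?key=YOUR_API_KEY'

def rotate_exit_ip():
    # a plain GET request to the link triggers IP rotation on the provider's side
    response = requests.get(ROTATION_URL, timeout=10)
    response.raise_for_status()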

The advantage is that the proxy connection address remains static (as long as reverse/BackConnect proxies are used and there is a link or API for updates).

If the API is well-designed, almost any actions can be automated without user involvement: recharging the balance, creating a new proxy port (filter) or deleting it, rotation etc.

How to Integrate and Rotate Proxy Lists with Scrapy

You can manage rotating proxies in Scrapy using the free scrapy-rotating-proxies middleware (see the official GitHub page and documentation).

To integrate the rotator into your parser, first install the package with pip:

pip install scrapy-rotating-proxies

Right after installation, add the proxy list to the parser settings (the settings.py file in a Scrapy project).

Code sample:

ROTATING_PROXY_LIST = [
    'Address.proxy1.com:8001',
    'Address.proxy2.com:8002',
    'Address.proxy3.com:8003',
    'Address.proxy4.com:8004',
]

The list can be of any size. Ensure it is formatted as a valid Python list of strings.

If you prefer to manage proxies in a separate text file, you can connect the list in the following way:

ROTATING_PROXY_LIST_PATH = '/folder/with/proxy/proxies.txt'
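
The file should simply list one proxy per line, for example (hypothetical addresses):

Address.proxy1.com:8001
Address.proxy2.com:8002
Address.proxy3.com:8003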

Pay attention to path resolution: relative paths in Python are resolved from the directory where the script is run. This means you can use relative paths, but the list file should then live inside the Scrapy project folder.

If both ROTATING_PROXY_LIST_PATH and ROTATING_PROXY_LIST are defined, the file path takes priority.

Enable the proxy rotator in Scrapy's downloader middleware (for details, see the DOWNLOADER_MIDDLEWARES documentation). This is done with the following code:

DOWNLOADER_MIDDLEWARES = {
    # existing entries (if any)
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # other entries for other modules
}

That’s it. Right after that, the Scrapy spiders’ requests will be sent via proxies from the list.

Note that requests whose meta already contains a "proxy" parameter are not processed by the rotator.

So if you want to disable the proxy for a particular request, or assign it a specific proxy, use the following constructions:

request.meta['proxy'] = None
request.meta['proxy'] = "<my-own-proxy>"

The plugin also comes with extra settings (defined in the same settings.py file):

  • ROTATING_PROXY_CLOSE_SPIDER – can be True or False. When True, the spider stops as soon as all proxies in the list are dead. When False (the default), the rotator periodically re-checks dead proxies;
  • ROTATING_PROXY_PAGE_RETRY_TIMES – sets the maximum number of proxies to try for a single page before considering the request failed. If this many proxies have been tried and the page still returns an error, the failure is treated as a page (parsing) error rather than a proxy error, since too many proxies were already spent on one page;
  • ROTATING_PROXY_BACKOFF_BASE – sets the base backoff time for re-checking dead proxies, in seconds. Default is 300 (5 minutes);
  • ROTATING_PROXY_BACKOFF_CAP – sets the maximum backoff time, in seconds. Default is 3600 (1 hour);
  • ROTATING_PROXY_BAN_POLICY – defines a custom ban detection policy if needed. By default, 'rotating_proxies.policy.BanDetectionPolicy' is used.
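
Put together, a sketch of these options in settings.py might look like this (the backoff values repeat the defaults mentioned above; the retry count is only an illustrative value):

# settings.py - example rotator options
ROTATING_PROXY_CLOSE_SPIDER = False      # keep re-checking dead proxies instead of stopping
ROTATING_PROXY_PAGE_RETRY_TIMES = 5      # proxies to try per page before giving up on it
ROTATING_PROXY_BACKOFF_BASE = 300        # base re-check backoff in seconds (default)
ROTATING_PROXY_BACKOFF_CAP = 3600        # maximum re-check backoff in seconds (default)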

How to Use Rotating/BackConnect Proxies

The simplest way to use rotating proxies is to specify the connection parameters directly in the request's meta field. For example:

## your spider first_spider.py
from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # yield (rather than return) so every start URL gets its own request
        yield Request(url=url, callback=self.parse,
                      meta={"proxy": "http://user:password@your-proxy:8081"})
# and the rest of the spider code

The second option is to set up the proxy configuration at the downloader middleware level of your parser. This is achieved by extending the HttpProxyMiddleware (refer to the documentation). Here's an example of such a middleware (save it, for example, as middlewares.py inside the project package, i.e. my_first_parser/middlewares.py):

import base64


class MyFirstProxyMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.user = settings.get('USER')
        self.password = settings.get('PASSWORD')
        self.endpoint = settings.get('ENDPOINT')
        self.port = settings.get('PORT')

    def process_request(self, request, spider):
        user_credentials = '{user}:{passw}'.format(user=self.user, passw=self.password)
        basic_authentication = 'Basic ' + base64.b64encode(user_credentials.encode()).decode()
        host = 'http://{endpoint}:{port}'.format(endpoint=self.endpoint, port=self.port)
        request.meta['proxy'] = host
        request.headers['Proxy-Authorization'] = basic_authentication

After creating the middleware, define the necessary variables and extend DOWNLOADER_MIDDLEWARES in your project's settings.py file with the following lines:

# don't forget to replace the variables with those of your own
# the setting names must match what the middleware reads (USER, PASSWORD, ENDPOINT, PORT)
USER = 'proxy-user'
PASSWORD = 'proxy-password'
ENDPOINT = 'your.proxy.provider.com'
PORT = '8083'

DOWNLOADER_MIDDLEWARES = {
    'my_first_parser.middlewares.MyFirstProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

How to Use Proxy APIs

To begin with, your proxy provider must offer an API; without it, remote management is not possible. Also remember to use your unique access key in requests (every key is different).

The implementation will largely depend on the logic of your program and the set of methods available in your chosen proxy service's API. For example, a script might manage subscriptions, configure new filters for different locations, or distribute proxies among multiple crawlers or parsers.

It would be impossible to cover all possible use cases in this material.

For example, Froxy makes it possible to parse websites without additional frameworks (like Scrapy). We examined an online parser's functionality by collecting data from Google search results.

The online parser enables you to obtain structured data from the most popular platforms like social networks, search engines and marketplaces. For all other websites, there is a universal tool for retrieving HTML code from pages.

Here is the API documentation (including scrapers).

After completing a task, scrapers can notify an external service using webhooks.
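
If you want to receive such notifications, a minimal sketch of a webhook receiver (assuming Flask is installed; the endpoint path and port are arbitrary) could look like this:

from flask import Flask, request

app = Flask(__name__)

@app.route('/froxy-webhook', methods=['POST'])
def handle_webhook():
    # the scraping service POSTs the finished task data to this URL
    payload = request.get_json(force=True)
    print('Task finished:', payload)
    return '', 200

if __name__ == '__main__':
    app.run(port=8000)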

Here is a sample API request for parsing Google search results:

curl -X POST https://froxy.com/api/subscription/API-KEY/task \
  -H "X-Authorization: Authorization Token" \
  -d "location[country]"="US" \
  -d "filters[upload_date]"="past_year" \
  -d domain="us" \
  -d page=1 \
  -d per_page=10 \
  -d query="search phrase" \
  -d type="google"

Here is what the same request looks like in JSON format:

{"type": "google","query": "Search phrase","location": {"country": "US"},"domain": "us","filters": {"content_type": "news","upload_date": "past_week"},"is_device": false,"is_safe_search": false,"page": 1,"per_page": "10","webhook": "my.webhook.url"}

This syntax can easily be embedded in any Python script.
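
For instance, here is a sketch of the same request using the requests library (the endpoint and header mirror the curl example above; replace API-KEY and the token with your own values):

import requests

API_URL = 'https://froxy.com/api/subscription/API-KEY/task'  # as in the curl example
HEADERS = {'X-Authorization': 'Authorization Token'}          # your real token goes here

task = {
    "type": "google",
    "query": "Search phrase",
    "location": {"country": "US"},
    "domain": "us",
    "filters": {"content_type": "news", "upload_date": "past_week"},
    "page": 1,
    "per_page": "10",
    "webhook": "my.webhook.url",
}

response = requests.post(API_URL, headers=HEADERS, json=task)
print(response.json())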

Conclusion and Recommendations

Rotating proxies can be managed independently, but in this case, all technical details must be accounted for in the code of your parser. At a minimum, you will have to divide the proxies into working and non-working.

To save time, you can use ready-made solutions like the scrapy-rotating-proxies rotator, or simply connect BackConnect (rotating IP) proxies.

You can find high-quality rotating proxies with us. Froxy makes it possible to configure targeting and rotation parameters through a personal dashboard (or via API). Connection to the parser is done only once, with all subsequent actions handled on our service's side: either by timer or with each new request.

You can choose from mobile, residential or server proxies. The pool of addresses exceeds 10 million IPs, with targeting up to the city and telecom operator level.
