It often happens that your personal or corporate web scraper suddenly stops in the middle of a process and throws an error. In some cases, it may not stop completely but instead start writing useless or corrupted data into the database. The reason? A designer slightly updated the layout, or the website’s structure was modified just a little. Even these small changes can be enough to make a target site “unreadable” for your scraping script.
So, how can you avoid such issues?
This article covers practical strategies for making your Python web scraping scripts more resilient to minor updates in website layouts and structures. Let’s start by looking at the most common causes.
Let’s look at the most frequent problems that can cause a web scraping process to fail:

- Changes in the HTML structure – the tags, classes, IDs, and other attributes your selectors rely on.
- Dynamic content that is rendered by JavaScript after the initial page load.
- Anti-bot protection that blocks or throttles automated requests.
- Changes in URL structures and page addressing.
In the following sections, we’ll dive deeper into each of these issues so you’ll know how to detect them and how to respond effectively when they occur.
Most web scrapers, unless they rely on computer vision techniques or large language models (LLMs such as ChatGPT), work by parsing the DOM structure of a web page: the hierarchy of HTML tags, CSS classes, IDs, and similar attributes.
When a target element disappears or its attributes are modified, data extraction becomes impossible – the scraper simply cannot find the required element on the page. And if it does detect something with similar criteria, the extracted content may turn out to be garbage data, meaning it is useless for your database or analysis.
HTML code on a website can change for many different reasons: a redesign, an A/B test, a CMS or front-end framework update, or auto-generated class names that differ from build to build.
The problem is most common for scrapers that rely on absolute paths instead of relative ones.
For example, in XPath syntax it might look like this:
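A minimal sketch (the markup and selectors below are invented purely for illustration):

from lxml import html

page_source = "<html><body><div><h1 class='product-title'>Sample item</h1></div></body></html>"
tree = html.fromstring(page_source)

# Fragile: an absolute path breaks as soon as any tag in the chain is added, removed, or reordered
title_abs = tree.xpath("/html/body/div/h1/text()")

# More resilient: a relative path anchored to a stable attribute survives most layout reshuffles
title_rel = tree.xpath("//h1[@class='product-title']/text()")

print(title_abs, title_rel)  # ['Sample item'] ['Sample item']

Both queries return the same value today, but only the relative one will keep working if, say, the designer wraps the heading in one more div.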
To minimize issues caused by changed HTML markup, you can integrate a dedicated module that checks your scraper’s functionality before running it on a large scale. An even better option is to set up continuous monitoring for errors. However, keep in mind that such a controlled “sandbox” can consume significant time and computing resources.
Modern websites built with frameworks such as React, Angular, or Vue.js often do not deliver a ready-made HTML structure. Instead, the browser receives an almost empty HTML file and links to JavaScript scripts. The site is then assembled piece by piece as these scripts execute – one of the central challenges of scraping dynamic websites.
Since the page does not initially contain the required code and is built asynchronously, classical scrapers using libraries like requests and BeautifulSoup only see a functional skeleton of the page, without the final useful content. As a result, scraping becomes impossible because the required data is literally missing at the moment of parsing – it may only appear later when one of the JavaScript scripts loads it.
To render and access the same content that a user sees, you need a real browser or a dedicated JavaScript rendering engine (for example, HtmlUnit). For browser automation, libraries such as Selenium, Playwright, or Puppeteer are commonly used. You can also drive Chrome directly through the Chrome DevTools Protocol (CDP), but that, too, usually requires a helper library.
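As a minimal sketch (the URL and selector are placeholders), here is how Playwright can wait until the JavaScript-rendered element actually exists before reading the page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/catalog")    # placeholder URL
    page.wait_for_selector("div.product-list")  # wait until the JS-rendered block appears (placeholder selector)
    rendered_html = page.content()              # the full DOM after rendering
    browser.close()

print(len(rendered_html))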
Headless browsers are the most reliable solution in such cases, but they are also very slow and resource-intensive. It is also worth noting that some advanced anti-fraud systems have learned to detect headless browsers within general traffic flows and can effectively block them.
Site owners generally do not want to be scraped. At some point, parasitic load can lead to significant hosting costs. For this reason, administrators and developers use various protection mechanisms, including:

- rate limiting and request throttling;
- IP-address blocking and blacklists;
- CAPTCHAs and JavaScript challenges;
- browser and TLS fingerprinting;
- web application firewalls (WAFs) and dedicated anti-bot services.
If a security system detects a bot, it will often return only an error code to the scraper instead of the full web page, saving server resources.
A sensible approach is to add server-error handling and bypass logic to your Python scraping script. For example, implement proxy rotation.
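A minimal sketch of that idea (the proxy addresses are placeholders): if the server answers with a typical blocking status code, retry the request through a different proxy.

from typing import Optional
import random
import requests

PROXIES = [
    "http://user:pass@proxy1.example:3128",  # placeholder proxies
    "http://user:pass@proxy2.example:3128",
]

def fetch_with_rotation(url: str, attempts: int = 3) -> Optional[requests.Response]:
    for _ in range(attempts):
        proxy = random.choice(PROXIES)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
            if resp.status_code in (403, 429, 503):  # common "blocked" responses
                continue                             # try again through another proxy
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            continue
    return None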
Many scrapers that need to navigate large catalogs of articles or entire websites rely on predictable URL structures. For example, a Google search can be triggered with a pattern like https://www.google.com/search?q=my%20query, and you can even embed special operators and settings inside the query, so it almost works like a ready-made API. Many websites do, in fact, expose a real API with its own command structure.
If previously a product page was accessible via a URL such as site.com/products/123, but later the format was changed to something like site.com/catalog/item/123, the scraper will no longer be able to find the page and will throw an error.
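One hedge against this (the URL templates below are hypothetical) is to keep alternative URL templates and treat the failure of all of them on a known-good ID as a signal that the addressing scheme has changed:

from typing import Optional
import requests

URL_TEMPLATES = [
    "https://site.example/products/{id}",      # old scheme (hypothetical)
    "https://site.example/catalog/item/{id}",  # new scheme (hypothetical)
]

def fetch_product_page(product_id: int) -> Optional[requests.Response]:
    for template in URL_TEMPLATES:
        try:
            resp = requests.get(template.format(id=product_id), timeout=15)
        except requests.RequestException:
            continue
        if resp.ok:
            return resp
    # Every known template failed: the site's URL structure has probably changed.
    print(f"No URL template worked for product {product_id} – check the site's addressing scheme")
    return None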
Now that we have classified the main problems, let’s talk about more effective ways to use Python for web scraping. Below are approaches that can help reduce the number of failures and increase the overall stability of your scraper.
The right choice of libraries, frameworks, and tools is half the battle. But it is important to remember that every technical solution has its own use cases, limitations, features, syntax, and architecture. There is no universal library that fits every possible scenario. You need to select tools based strictly on your goals and the specifics of the target websites.
For example:

- For static pages whose HTML already contains the data, requests plus BeautifulSoup (or lxml) is usually enough and by far the fastest option.
- For JavaScript-heavy sites, you will need a headless browser driven by Selenium, Playwright, or Puppeteer.
- For large catalogs with predictable URL structures, check first whether the site exposes an official API – it is almost always more stable than scraping.
- For messy or unstructured text, regular expressions can serve as an additional extraction layer.
We have already mentioned absolute selector paths, but this is only one of the most common issues. To make your data extraction queries in HTML documents more reliable, it is important to follow a few basic rules (a short sketch follows the list):

- Prefer relative paths over absolute ones.
- Anchor selectors to stable attributes (semantic IDs, data-* attributes) rather than auto-generated class names.
- Avoid positional indexes such as div[3] wherever possible.
- Keep selectors as short as the task allows – every extra level is one more place for the layout to break.
- Define fallback selectors (for example, a CSS selector first, then an XPath expression) so that a single change does not stop the whole job.
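Here is a minimal sketch of the last rule (the selectors are placeholders): try a specific CSS selector first and fall back to a looser XPath expression only if it returns nothing.

from typing import Optional
from bs4 import BeautifulSoup
from lxml import html

def extract_title(html_text: str) -> Optional[str]:
    # Primary attempt: a specific CSS selector
    soup = BeautifulSoup(html_text, "lxml")
    el = soup.select_one("h1.product-title")  # placeholder primary selector
    if el:
        return el.get_text(strip=True)
    # Fallback: a more generic XPath expression
    result = html.fromstring(html_text).xpath("//h1/text()")  # placeholder fallback
    return result[0].strip() if result else None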
Don't fool yourself into thinking your scraper will run flawlessly, without errors or interruptions. You need to anticipate problems and prepare standard response scenarios in advance. Some errors may be new or unusual, but most can be predicted even without extensive experience.
Common best practices (a short sketch follows the list):

- Wrap network calls and parsing in try/except blocks and handle specific exceptions separately.
- Retry failed requests with an increasing backoff instead of hammering the server.
- Set explicit timeouts on every request.
- Log every step so that failures can be traced afterwards.
- Save progress regularly so an interrupted job can resume instead of restarting from scratch.
- Send alerts (for example, by e-mail or Telegram) when failures pass a threshold.
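A compact sketch of the retry, backoff, and timeout items (the URL is a placeholder), using the Retry helper that ships with requests/urllib3:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_retry_session() -> requests.Session:
    session = requests.Session()
    retries = Retry(
        total=3,                                    # give up after three attempts
        backoff_factor=1.0,                         # exponentially increasing delay between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    session.mount("https://", HTTPAdapter(max_retries=retries))
    session.mount("http://", HTTPAdapter(max_retries=retries))
    return session

try:
    resp = build_retry_session().get("https://example.com", timeout=15)  # placeholder URL
    resp.raise_for_status()
except requests.RequestException as exc:
    print(f"Request ultimately failed: {exc}")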
We already mentioned that headless browsers allow your Python scraping script to see the final page markup just as end users do. Moreover, with such browsers you can simulate user actions: fill forms, move the cursor, scroll the page, and more.
There are, however, several nuances:

- Headless browsers are much slower and heavier than plain HTTP requests, so they scale poorly and cost more to run.
- Pages are built asynchronously, so you must explicitly wait for the required element or event before reading the DOM.
- Advanced anti-bot systems can fingerprint headless browsers, so realistic user agents, proxies, and human-like behavior still matter.
- Each browser instance consumes noticeable CPU and memory, which limits how many you can run in parallel.
Regular expressions (regex) are one of the most powerful tools for extracting data from unstructured text. Their capabilities are impressive: with regex you can track repeating patterns (prefixes, endings, code blocks, etc.), as well as clean and normalize data.
Regex can also serve as a fallback strategy when none of the previous approaches work.
For example, with regex you can quickly extract recognizable data such as phone numbers, email addresses, prices, and similar values.
Unlike parsers such as BeautifulSoup, regular expressions are tied to the structure of the data itself, not to the structure of the HTML document.
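A small sketch of this idea (the sample text is invented): pulling recognizable values out of plain text with patterns rather than selectors.

import re

text = "Contact us at sales@example.com or +1 (555) 010-2030. Price: $49.99"  # invented sample text

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
phones = re.findall(r"\+?\d[\d\s().-]{7,}\d", text)
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)

print(emails)  # ['sales@example.com']
print(phones)  # ['+1 (555) 010-2030']
print(prices)  # ['$49.99']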
Automating Monitoring and Alerts for Website Changes
Large-scale scraping can run for a very long time, and many jobs are executed repeatedly on a schedule. In such cases, it is worth planning in advance for:

- detailed logging of every request and parsing step;
- saving progress so that an interrupted run can resume where it stopped;
- notifications (for example, via e-mail or a Telegram bot) when errors start to accumulate;
- periodic sanity checks on the collected data.
This covers the scraper’s runtime itself. You can also react proactively to layout changes on the target site. For example, set up a small job to parse specific pages and detect changes. If data extraction on these pages starts failing, that is a separate signal to developers to adjust the data collection logic and fine-tune the selectors used for queries.
Let’s take a look at a sample Python scraping script with error handling and fault tolerance.
#!/usr/bin/env python3
"""
Robust scraper: requests -> fallback Playwright (headless)
Features:
- requests session with Retry/backoff
- UA rotation + optional proxy rotation
- CSS selector -> XPath fallback parsing
- DOM snapshot + diff detection (canonicalization + SHA256 signature)
- Telegram notifications for alerts
- Save progress to resume
- Playwright fallback: waits for a specific selector to appear
"""

import os
import time
import random
import json
import logging
import hashlib
import difflib
from typing import Optional, List, Dict

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
from lxml import html
from playwright.sync_api import sync_playwright

# ---------------- CONFIG ----------------
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15",
    # add more realistic UA strings
]

PROXIES = [
    # "http://user:pass@proxy1.example:3128",
    # "http://proxy2.example:3128",
]

HEADERS_BASE = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

MAX_RETRIES = 3
BACKOFF_FACTOR = 1.0
SAVE_FILE = "scrape_progress.json"
LOG_FILE = "scraper.log"
SNAPSHOT_DIR = "snapshots"

# Telegram config (fill your values)
TELEGRAM_BOT_TOKEN = ""  # e.g. "123456:ABC-DEF..."
TELEGRAM_CHAT_ID = ""    # e.g. "987654321"

# Playwright settings
PLAYWRIGHT_TIMEOUT_MS = 20000  # 20s default navigation/selector timeout
# ----------------------------------------

os.makedirs(SNAPSHOT_DIR, exist_ok=True)

# Logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)


# ---------- Telegram ----------
def telegram_send(text: str):
    if not TELEGRAM_BOT_TOKEN or not TELEGRAM_CHAT_ID:
        logging.debug("Telegram not configured; skipping send.")
        return
    url = f"https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage"
    payload = {"chat_id": TELEGRAM_CHAT_ID, "text": text}
    try:
        r = requests.post(url, json=payload, timeout=10)
        if r.status_code != 200:
            logging.warning("Telegram send failed: %s %s", r.status_code, r.text)
    except Exception as e:
        logging.exception("Telegram send exception: %s", e)


def send_alert(msg: str):
    logging.warning("ALERT: %s", msg)
    telegram_send(f"ALERT: {msg}")


# ---------- session builder ----------
def build_session(proxy: Optional[str] = None) -> requests.Session:
    s = requests.Session()
    retries = Retry(
        total=MAX_RETRIES,
        backoff_factor=BACKOFF_FACTOR,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset(["GET", "POST"])
    )
    adapter = HTTPAdapter(max_retries=retries)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    if proxy:
        s.proxies.update({"http": proxy, "https": proxy})
    return s


# ---------- fetch with requests ----------
def fetch_url_requests(url: str, session: requests.Session, timeout: int = 20) -> Optional[requests.Response]:
    headers = HEADERS_BASE.copy()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    try:
        # polite jitter
        time.sleep(random.uniform(0.4, 1.6))
        resp = session.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        # basic sanity check: must be non-empty and contain <html> or <body>
        if resp.text and ("<html" in resp.text.lower() or "<body" in resp.text.lower()):
            return resp
        logging.debug("Requests returned small body for %s", url)
    except requests.HTTPError as e:
        logging.error("HTTP error for %s: %s", url, e)
    except requests.RequestException as e:
        logging.error("Request error for %s: %s", url, e)
    return None


# ---------- fallback: Playwright ----------
def fetch_with_playwright(url: str, wait_selector: str = "body", timeout_ms: int = PLAYWRIGHT_TIMEOUT_MS) -> Optional[str]:
    """
    Load page with Playwright (Chromium headless) and wait for wait_selector to appear.
    Returns rendered HTML or None.
    """
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            context = browser.new_context(user_agent=random.choice(USER_AGENTS))
            page = context.new_page()
            # navigate and wait for DOMContentLoaded; then wait for selector
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            # wait for selector (if not found -> TimeoutError)
            page.wait_for_selector(wait_selector, timeout=timeout_ms)
            content = page.content()
            browser.close()
            # sanity check
            if content and ("<html" in content.lower() or "<body" in content.lower()):
                return content
            logging.debug("Playwright returned unexpected content for %s", url)
    except Exception as e:
        logging.exception("Playwright fallback failed for %s: %s", url, e)
        send_alert(f"Playwright error for {url}: {e}")
    return None


# ---------- robust fetch combining requests + playwright ----------
def robust_fetch_html(url: str, proxy: Optional[str], wait_selector: str) -> Optional[str]:
    session = build_session(proxy=proxy)
    resp = fetch_url_requests(url, session)
    if resp:
        return resp.text
    logging.info("Requests failed or returned bad HTML for %s – switching to Playwright fallback", url)
    html_text = fetch_with_playwright(url, wait_selector=wait_selector)
    if html_text:
        logging.info("Playwright succeeded for %s", url)
    else:
        logging.error("Both requests and Playwright failed for %s", url)
    return html_text


# ---------- parsing helpers ----------
def parse_with_css(soup: BeautifulSoup, css_selector: str) -> Optional[str]:
    try:
        el = soup.select_one(css_selector)
        return el.get_text(strip=True) if el else None
    except Exception:
        return None


def parse_with_xpath(text: str, xpath_expr: str) -> Optional[str]:
    try:
        tree = html.fromstring(text)
        result = tree.xpath(xpath_expr)
        if not result:
            return None
        if isinstance(result[0], html.HtmlElement):
            return result[0].text_content().strip()
        return str(result[0]).strip()
    except Exception:
        return None


def extract_item(html_text: str, css_selector: str, xpath_expr: str) -> Optional[str]:
    soup = BeautifulSoup(html_text, "lxml")
    val = parse_with_css(soup, css_selector)
    if val:
        return val
    return parse_with_xpath(html_text, xpath_expr)


# ---------- DOM canonicalization + signature ----------
def canonicalize_html_fragment(html_fragment: str) -> str:
    soup = BeautifulSoup(html_fragment, "lxml")
    for el in soup.find_all(True):
        attrs = dict(el.attrs)
        new_attrs = {}
        for k, v in attrs.items():
            if k.startswith("on"):
                continue
            if k in ("id", "class"):
                if isinstance(v, list):
                    v = " ".join(v)
                # remove digits to reduce dynamic-id noise
                normalized = "".join(ch for ch in v if not ch.isdigit())
                new_attrs[k] = normalized
            else:
                new_attrs[k] = v
        el.attrs = new_attrs
    text = soup.prettify()
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)


def signature_of_fragment(html_fragment: str) -> str:
    canonical = canonicalize_html_fragment(html_fragment)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_snapshot(url_key: str) -> Optional[Dict]:
    path = os.path.join(SNAPSHOT_DIR, f"{url_key}.json")
    if not os.path.exists(path):
        return None
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception:
        return None


def save_snapshot(url_key: str, canonical_html: str, sig: str):
    path = os.path.join(SNAPSHOT_DIR, f"{url_key}.json")
    payload = {"signature": sig, "html": canonical_html, "ts": int(time.time())}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)


def diff_fragments(old: str, new: str) -> str:
    return "\n".join(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))


# ---------- progress storage ----------
def load_progress() -> Dict:
    try:
        with open(SAVE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"visited": [], "data": []}
    except Exception:
        return {"visited": [], "data": []}


def save_progress(progress: Dict):
    try:
        with open(SAVE_FILE, "w", encoding="utf-8") as f:
            json.dump(progress, f, ensure_ascii=False, indent=2)
    except Exception as e:
        logging.error("Failed to save progress: %s", e)


# ---------- structure monitor ----------
def monitor_structure(url: str, html_text: str, css_container: str, url_key: str) -> bool:
    soup = BeautifulSoup(html_text, "lxml")
    el = soup.select_one(css_container) if css_container else soup.body
    if el is None:
        logging.warning("Container selector '%s' not found on %s", css_container, url)
        send_alert(f"Container selector '{css_container}' not found on {url}")
        return False
    canonical = canonicalize_html_fragment(str(el))
    sig = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    prev = load_snapshot(url_key)
    if not prev:
        save_snapshot(url_key, canonical, sig)
        logging.info("Saved initial snapshot for %s", url)
        return False
    if prev.get("signature") != sig:
        old_html = prev.get("html", "")
        diff_text = diff_fragments(old_html, canonical)
        diff_path = os.path.join(SNAPSHOT_DIR, f"{url_key}_diff_{int(time.time())}.txt")
        with open(diff_path, "w", encoding="utf-8") as f:
            f.write(diff_text)
        save_snapshot(url_key, canonical, sig)
        logging.info("Structure changed for %s, diff saved to %s", url, diff_path)
        send_alert(f"STRUCTURE CHANGE: {url}\nSelector: {css_container}\nDiff saved: {diff_path}")
        return True
    return False


# ---------- main ----------
def main(urls: List[str], container_selector_for_monitoring: str = "main", wait_selector_for_playwright: str = "body"):
    progress = load_progress()
    visited = set(progress.get("visited", []))
    for url in urls:
        if url in visited:
            logging.info("Skipping visited: %s", url)
            continue
        proxy = random.choice(PROXIES) if PROXIES else None
        html_text = robust_fetch_html(url, proxy, wait_selector_for_playwright)
        if not html_text:
            logging.error("Unable to fetch %s via both requests and Playwright", url)
            send_alert(f"Failed to fetch {url} via both requests and Playwright")
            visited.add(url)
            progress["visited"] = list(visited)
            save_progress(progress)
            continue
        # monitor structure
        url_key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        try:
            changed = monitor_structure(url, html_text, container_selector_for_monitoring, url_key)
            if changed:
                logging.info("Detected structure change on %s", url)
        except Exception as e:
            logging.exception("Structure monitor failed for %s: %s", url, e)
            send_alert(f"Structure monitor exception for {url}: {e}")
        # example extraction: CSS then XPath fallback
        css_selector = "h1.product-title"  # customize
        xpath_expr = "//h1[contains(@class,'product') or contains(.,'Product')]/text()"
        try:
            item = extract_item(html_text, css_selector, xpath_expr)
            if item:
                logging.info("Extracted from %s : %s", url, item)
                progress["data"].append({"url": url, "title": item})
            else:
                logging.info("Primary extraction failed for %s; trying meta fallback", url)
                meta_title = parse_with_xpath(html_text, "//meta[@property='og:title']/@content")
                if meta_title:
                    progress["data"].append({"url": url, "title": meta_title})
                else:
                    send_alert(f"Parsing failed for {url}")
        except Exception as e:
            logging.exception("Parsing error for %s: %s", url, e)
            send_alert(f"Parsing crash {url}: {e}")
        visited.add(url)
        progress["visited"] = list(visited)
        save_progress(progress)
        time.sleep(random.uniform(1.0, 3.0))
    logging.info("Scraping finished. Items collected: %d", len(progress.get("data", [])))
    telegram_send(f"Scraping finished. Items: {len(progress.get('data', []))}")


# ---------- run ----------
if __name__ == "__main__":
    seed_urls = [
        # replace with real URLs
        "https://example.com/product/1",
        "https://example.com/product/2",
    ]
    # choose container to monitor (important block for detecting structural changes)
    container_selector = "div#content"
    # choose selector Playwright should wait for to consider page fully rendered
    wait_selector = "div#content"
    main(seed_urls, container_selector_for_monitoring=container_selector, wait_selector_for_playwright=wait_selector)
In this script, you can define your own list of proxies for rotation, custom user-agent headers, and the set of target pages to scrape.
Key features: the script first tries to fetch a page using requests (with retry/backoff). If that fails, it automatically switches to a headless browser via Playwright, waits for the specified CSS selector to appear, and then returns the rendered HTML.
It also includes logging, Telegram notifications (set your own tokens), DOM snapshots with change detection (via a diff function), progress saving, and basic proxy/user-agent rotation. Scraping is done with CSS selectors by default; if they fail, an XPath fallback is used.
Don’t forget to install the required libraries and dependencies:
pip install requests beautifulsoup4 lxml playwright
Then install the Chromium browser binary for Playwright:
python -m playwright install chromium
This is just one possible example of how a Python scraping script can be made resilient to layout changes.
By now, you should understand that it’s not enough to write a barebones scraper and start harvesting a target site. Finding a repeating pattern in the DOM is only the beginning. That approach won’t let you build a stable business process or achieve impressive results at large data-collection scales.
You also need to plan fallback paths: logging, proxy rotation, error handling, automated detection of URL and DOM changes, progress persistence, and much more. The more contingencies you build in, the more stable your Python scraping script will be.
Large commercial scrapers are useless without a reliable network infrastructure and high-quality proxies. And Froxy has just that. At your service are 10+ million IPs with precise targeting and automatic rotation by time or with each new request.