It often happens that your personal or corporate web scraper suddenly stops in the middle of a process and throws an error. In some cases, it may not stop completely but instead start writing useless or corrupted data into the database. The reason? A designer slightly updated the layout, or the website’s structure was modified just a little. Even these small changes can be enough to make a target site “unreadable” for your scraping script.
So, how can you avoid such issues?
This article covers practical strategies for making your Python web scraping scripts more resilient to minor updates in website layouts and structures. Let’s start by looking at the most common causes.
Let’s look at the most frequent problems that can cause a web scraping process to fail:

- Changes in the HTML structure – the tags, classes, IDs, and other attributes your selectors rely on.
- Dynamic content that is rendered by JavaScript after the initial page load.
- Anti-bot protection that blocks or throttles automated requests.
- Changes in URL structures and page addressing.
In the following sections, we’ll dive deeper into each of these issues so you’ll know how to detect them and how to respond effectively when they occur.
Most web scrapers, unless they rely on computer vision techniques or large language models (LLMs such as ChatGPT), work by parsing the DOM structure of a web page: the hierarchy of HTML tags, CSS classes, IDs, and similar attributes.
When a target element disappears or its attributes are modified, data extraction becomes impossible – the scraper simply cannot find the required element on the page. And if it does detect something with similar criteria, the extracted content may turn out to be garbage data, meaning it is useless for your database or analysis.
HTML code on a website can change for many different reasons: a redesign, an A/B test, a CMS or front-end framework update, or auto-generated class names that differ from build to build.
The problem is most common for scrapers that rely on absolute paths instead of relative ones.
For example, in XPath syntax it might look like this:
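A minimal sketch (the markup and selectors below are invented purely for illustration):

from lxml import html

page_source = "<html><body><div><h1 class='product-title'>Sample item</h1></div></body></html>"
tree = html.fromstring(page_source)

# Fragile: an absolute path breaks as soon as any tag in the chain is added, removed, or reordered
title_abs = tree.xpath("/html/body/div/h1/text()")

# More resilient: a relative path anchored to a stable attribute survives most layout reshuffles
title_rel = tree.xpath("//h1[@class='product-title']/text()")

print(title_abs, title_rel)  # ['Sample item'] ['Sample item']

Both queries return the same value today, but only the relative one will keep working if, say, the designer wraps the heading in one more div.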
To minimize issues caused by changed HTML markup, you can integrate a dedicated module that checks your scraper’s functionality before running it on a large scale. An even better option is to set up continuous monitoring for errors. However, keep in mind that such a controlled “sandbox” can consume significant time and computing resources.
Modern websites built with frameworks such as React, Angular, or Vue.js often do not deliver a ready-made HTML structure. Instead, the browser receives an almost empty HTML file and links to JavaScript scripts. The site is then assembled piece by piece as these scripts execute – one of the central challenges of scraping dynamic websites.
Since the page does not initially contain the required code and is built asynchronously, classical scrapers using libraries like requests and BeautifulSoup only see a functional skeleton of the page, without the final useful content. As a result, scraping becomes impossible because the required data is literally missing at the moment of parsing – it may only appear later when one of the JavaScript scripts loads it.
To render and access the same content that a user sees, you need a real browser or a dedicated JavaScript rendering engine (for example, HtmlUnit). For browser automation, libraries such as Selenium, Playwright, or Puppeteer are commonly used. You can also drive Chrome directly through the Chrome DevTools Protocol (CDP), but that, too, usually requires a helper library.
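As a minimal sketch (the URL and selector are placeholders), here is how Playwright can wait until the JavaScript-rendered element actually exists before reading the page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/catalog")    # placeholder URL
    page.wait_for_selector("div.product-list")  # wait until the JS-rendered block appears (placeholder selector)
    rendered_html = page.content()              # the full DOM after rendering
    browser.close()

print(len(rendered_html))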
Headless browsers are the most reliable solution in such cases, but they are also very slow and resource-intensive. It is also worth noting that some advanced anti-fraud systems have learned to detect headless browsers within general traffic flows and can effectively block them.
Site owners generally do not want to be scraped. At some point, parasitic load can lead to significant hosting costs. For this reason, administrators and developers use various protection mechanisms, including:

- rate limiting and request throttling;
- IP-address blocking and blacklists;
- CAPTCHAs and JavaScript challenges;
- browser and TLS fingerprinting;
- web application firewalls (WAFs) and dedicated anti-bot services.
If a security system detects a bot, it will often return only an error code to the scraper instead of the full web page, saving server resources.
A sensible approach is to add server-error handling and bypass logic to your Python scraping script. For example, implement proxy rotation.
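A minimal sketch of that idea (the proxy addresses are placeholders): if the server answers with a typical blocking status code, retry the request through a different proxy.

from typing import Optional
import random
import requests

PROXIES = [
    "http://user:pass@proxy1.example:3128",  # placeholder proxies
    "http://user:pass@proxy2.example:3128",
]

def fetch_with_rotation(url: str, attempts: int = 3) -> Optional[requests.Response]:
    for _ in range(attempts):
        proxy = random.choice(PROXIES)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
            if resp.status_code in (403, 429, 503):  # common "blocked" responses
                continue                             # try again through another proxy
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            continue
    return None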
Many scrapers that need to navigate large catalogs of articles or entire websites rely on predictable URL structures. For example, a Google search can be triggered with a pattern like https://www.google.com/search?q=my%20query, and you can even embed special operators and settings inside the query, so it almost works like a ready-made API. Many websites do, in fact, expose a real API with its own command structure.
If previously a product page was accessible via a URL such as site.com/products/123, but later the format was changed to something like site.com/catalog/item/123, the scraper will no longer be able to find the page and will throw an error.
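One hedge against this (the URL templates below are hypothetical) is to keep alternative URL templates and treat the failure of all of them on a known-good ID as a signal that the addressing scheme has changed:

from typing import Optional
import requests

URL_TEMPLATES = [
    "https://site.example/products/{id}",      # old scheme (hypothetical)
    "https://site.example/catalog/item/{id}",  # new scheme (hypothetical)
]

def fetch_product_page(product_id: int) -> Optional[requests.Response]:
    for template in URL_TEMPLATES:
        try:
            resp = requests.get(template.format(id=product_id), timeout=15)
        except requests.RequestException:
            continue
        if resp.ok:
            return resp
    # Every known template failed: the site's URL structure has probably changed.
    print(f"No URL template worked for product {product_id} – check the site's addressing scheme")
    return None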
Now that we have classified the main problems, let’s talk about more effective ways to use Python for web scraping. Below are approaches that can help reduce the number of failures and increase the overall stability of your scraper.
The right choice of libraries, frameworks, and tools is half the battle. But it is important to remember that every technical solution has its own use cases, limitations, features, syntax, and architecture. There is no universal library that fits every possible scenario. You need to select tools based strictly on your goals and the specifics of the target websites.
For example:

- For static pages whose HTML already contains the data, requests plus BeautifulSoup (or lxml) is usually enough and by far the fastest option.
- For JavaScript-heavy sites, you will need a headless browser driven by Selenium, Playwright, or Puppeteer.
- For large catalogs with predictable URL structures, check first whether the site exposes an official API – it is almost always more stable than scraping.
- For messy or unstructured text, regular expressions can serve as an additional extraction layer.
We have already mentioned absolute selector paths, but this is only one of the most common issues. To make your data extraction queries in HTML documents more reliable, it is important to follow a few basic rules (a short sketch follows the list):

- Prefer relative paths over absolute ones.
- Anchor selectors to stable attributes (semantic IDs, data-* attributes) rather than auto-generated class names.
- Avoid positional indexes such as div[3] wherever possible.
- Keep selectors as short as the task allows – every extra level is one more place for the layout to break.
- Define fallback selectors (for example, a CSS selector first, then an XPath expression) so that a single change does not stop the whole job.
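Here is a minimal sketch of the last rule (the selectors are placeholders): try a specific CSS selector first and fall back to a looser XPath expression only if it returns nothing.

from typing import Optional
from bs4 import BeautifulSoup
from lxml import html

def extract_title(html_text: str) -> Optional[str]:
    # Primary attempt: a specific CSS selector
    soup = BeautifulSoup(html_text, "lxml")
    el = soup.select_one("h1.product-title")  # placeholder primary selector
    if el:
        return el.get_text(strip=True)
    # Fallback: a more generic XPath expression
    result = html.fromstring(html_text).xpath("//h1/text()")  # placeholder fallback
    return result[0].strip() if result else None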
Don't fool yourself into thinking your scraper will run flawlessly, without errors or interruptions. You need to anticipate problems and prepare standard response scenarios in advance. Some errors may be new or unusual, but most can be predicted even without extensive experience.
Common best practices (a short sketch follows the list):

- Wrap network calls and parsing in try/except blocks and handle specific exceptions separately.
- Retry failed requests with an increasing backoff instead of hammering the server.
- Set explicit timeouts on every request.
- Log every step so that failures can be traced afterwards.
- Save progress regularly so an interrupted job can resume instead of restarting from scratch.
- Send alerts (for example, by e-mail or Telegram) when failures pass a threshold.
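A compact sketch of the retry, backoff, and timeout items (the URL is a placeholder), using the Retry helper that ships with requests/urllib3:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_retry_session() -> requests.Session:
    session = requests.Session()
    retries = Retry(
        total=3,                                    # give up after three attempts
        backoff_factor=1.0,                         # exponentially increasing delay between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    session.mount("https://", HTTPAdapter(max_retries=retries))
    session.mount("http://", HTTPAdapter(max_retries=retries))
    return session

try:
    resp = build_retry_session().get("https://example.com", timeout=15)  # placeholder URL
    resp.raise_for_status()
except requests.RequestException as exc:
    print(f"Request ultimately failed: {exc}")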
We already mentioned that headless browsers allow your Python scraping script to see the final page markup just as end users do. Moreover, with such browsers you can simulate user actions: fill forms, move the cursor, scroll the page, and more.
There are, however, several nuances:

- Headless browsers are much slower and heavier than plain HTTP requests, so they scale poorly and cost more to run.
- Pages are built asynchronously, so you must explicitly wait for the required element or event before reading the DOM.
- Advanced anti-bot systems can fingerprint headless browsers, so realistic user agents, proxies, and human-like behavior still matter.
- Each browser instance consumes noticeable CPU and memory, which limits how many you can run in parallel.
Regular expressions (regex) are one of the most powerful tools for extracting data from unstructured text. Their capabilities are impressive: with regex you can track repeating patterns (prefixes, endings, code blocks, etc.), as well as clean and normalize data.
Regex can also serve as a fallback strategy when none of the previous approaches work.
For example, with regex you can quickly extract recognizable data such as phone numbers, email addresses, prices, and similar values.
Unlike parsers such as BeautifulSoup, regular expressions are tied to the structure of the data itself, not to the structure of the HTML document.
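A small sketch of this idea (the sample text is invented): pulling recognizable values out of plain text with patterns rather than selectors.

import re

text = "Contact us at sales@example.com or +1 (555) 010-2030. Price: $49.99"  # invented sample text

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
phones = re.findall(r"\+?\d[\d\s().-]{7,}\d", text)
prices = re.findall(r"\$\d+(?:\.\d{2})?", text)

print(emails)  # ['sales@example.com']
print(phones)  # ['+1 (555) 010-2030']
print(prices)  # ['$49.99']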
Automating Monitoring and Alerts for Website Changes
Large-scale scraping can run for a very long time, and many jobs are executed repeatedly on a schedule. In such cases, it is worth planning in advance for:

- detailed logging of every request and parsing step;
- saving progress so that an interrupted run can resume where it stopped;
- notifications (for example, via e-mail or a Telegram bot) when errors start to accumulate;
- periodic sanity checks on the collected data.
This covers the scraper’s runtime itself. You can also react proactively to layout changes on the target site. For example, set up a small job to parse specific pages and detect changes. If data extraction on these pages starts failing, that is a separate signal to developers to adjust the data collection logic and fine-tune the selectors used for queries.
Let’s take a look at a sample Python scraping script with error handling and fault tolerance.
#!/usr/bin/env python3
"""
Robust scraper: requests -> fallback Playwright (headless)
Features:
- requests session with Retry/backoff
- UA rotation + optional proxy rotation
- CSS selector -> XPath fallback parsing
- DOM snapshot + diff detection (canonicalization + SHA256 signature)
- Telegram notifications for alerts
- Save progress to resume
- Playwright fallback: waits for a specific selector to appear
"""

import os
import time
import random
import json
import logging
import hashlib
import difflib
from typing import Optional, List, Dict

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
from lxml import html
from playwright.sync_api import sync_playwright

# ---------------- CONFIG ----------------
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15",
    # add more realistic UA strings
]

PROXIES = [
    # "http://user:pass@proxy1.example:3128",
    # "http://proxy2.example:3128",
]

HEADERS_BASE = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

MAX_RETRIES = 3
BACKOFF_FACTOR = 1.0
SAVE_FILE = "scrape_progress.json"
LOG_FILE = "scraper.log"
SNAPSHOT_DIR = "snapshots"

# Telegram config (fill your values)
TELEGRAM_BOT_TOKEN = ""  # e.g. "123456:ABC-DEF..."
TELEGRAM_CHAT_ID = ""    # e.g. "987654321"

# Playwright settings
PLAYWRIGHT_TIMEOUT_MS = 20000  # 20s default navigation/selector timeout
# ----------------------------------------

os.makedirs(SNAPSHOT_DIR, exist_ok=True)

# Logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.FileHandler(LOG_FILE), logging.StreamHandler()]
)


# ---------- Telegram ----------
def telegram_send(text: str):
    if not TELEGRAM_BOT_TOKEN or not TELEGRAM_CHAT_ID:
        logging.debug("Telegram not configured; skipping send.")
        return
    url = f"https://api.telegram.org/bot{TELEGRAM_BOT_TOKEN}/sendMessage"
    payload = {"chat_id": TELEGRAM_CHAT_ID, "text": text}
    try:
        r = requests.post(url, json=payload, timeout=10)
        if r.status_code != 200:
            logging.warning("Telegram send failed: %s %s", r.status_code, r.text)
    except Exception as e:
        logging.exception("Telegram send exception: %s", e)


def send_alert(msg: str):
    logging.warning("ALERT: %s", msg)
    telegram_send(f"ALERT: {msg}")


# ---------- session builder ----------
def build_session(proxy: Optional[str] = None) -> requests.Session:
    s = requests.Session()
    retries = Retry(
        total=MAX_RETRIES,
        backoff_factor=BACKOFF_FACTOR,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset(["GET", "POST"])
    )
    adapter = HTTPAdapter(max_retries=retries)
    s.mount("https://", adapter)
    s.mount("http://", adapter)
    if proxy:
        s.proxies.update({"http": proxy, "https": proxy})
    return s


# ---------- fetch with requests ----------
def fetch_url_requests(url: str, session: requests.Session, timeout: int = 20) -> Optional[requests.Response]:
    headers = HEADERS_BASE.copy()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    try:
        # polite jitter
        time.sleep(random.uniform(0.4, 1.6))
        resp = session.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        # basic sanity check: must be non-empty and contain <html> or <body>
        if resp.text and ("<html" in resp.text.lower() or "<body" in resp.text.lower()):
            return resp
        logging.debug("Requests returned small body for %s", url)
    except requests.HTTPError as e:
        logging.error("HTTP error for %s: %s", url, e)
    except requests.RequestException as e:
        logging.error("Request error for %s: %s", url, e)
    return None


# ---------- fallback: Playwright ----------
def fetch_with_playwright(url: str, wait_selector: str = "body", timeout_ms: int = PLAYWRIGHT_TIMEOUT_MS) -> Optional[str]:
    """
    Load page with Playwright (Chromium headless) and wait for wait_selector to appear.
    Returns rendered HTML or None.
    """
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            context = browser.new_context(user_agent=random.choice(USER_AGENTS))
            page = context.new_page()
            # navigate and wait for DOMContentLoaded; then wait for selector
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            # wait for selector (if not found -> TimeoutError)
            page.wait_for_selector(wait_selector, timeout=timeout_ms)
            content = page.content()
            browser.close()
            # sanity check
            if content and ("<html" in content.lower() or "<body" in content.lower()):
                return content
            logging.debug("Playwright returned unexpected content for %s", url)
    except Exception as e:
        logging.exception("Playwright fallback failed for %s: %s", url, e)
        send_alert(f"Playwright error for {url}: {e}")
    return None


# ---------- robust fetch combining requests + playwright ----------
def robust_fetch_html(url: str, proxy: Optional[str], wait_selector: str) -> Optional[str]:
    session = build_session(proxy=proxy)
    resp = fetch_url_requests(url, session)
    if resp:
        return resp.text
    logging.info("Requests failed or returned bad HTML for %s – switching to Playwright fallback", url)
    html_text = fetch_with_playwright(url, wait_selector=wait_selector)
    if html_text:
        logging.info("Playwright succeeded for %s", url)
    else:
        logging.error("Both requests and Playwright failed for %s", url)
    return html_text


# ---------- parsing helpers ----------
def parse_with_css(soup: BeautifulSoup, css_selector: str) -> Optional[str]:
    try:
        el = soup.select_one(css_selector)
        return el.get_text(strip=True) if el else None
    except Exception:
        return None


def parse_with_xpath(text: str, xpath_expr: str) -> Optional[str]:
    try:
        tree = html.fromstring(text)
        result = tree.xpath(xpath_expr)
        if not result:
            return None
        if isinstance(result[0], html.HtmlElement):
            return result[0].text_content().strip()
        return str(result[0]).strip()
    except Exception:
        return None


def extract_item(html_text: str, css_selector: str, xpath_expr: str) -> Optional[str]:
    soup = BeautifulSoup(html_text, "lxml")
    val = parse_with_css(soup, css_selector)
    if val:
        return val
    return parse_with_xpath(html_text, xpath_expr)


# ---------- DOM canonicalization + signature ----------
def canonicalize_html_fragment(html_fragment: str) -> str:
    soup = BeautifulSoup(html_fragment, "lxml")
    for el in soup.find_all(True):
        attrs = dict(el.attrs)
        new_attrs = {}
        for k, v in attrs.items():
            if k.startswith("on"):
                continue
            if k in ("id", "class"):
                if isinstance(v, list):
                    v = " ".join(v)
                # remove digits to reduce dynamic-id noise
                normalized = "".join(ch for ch in v if not ch.isdigit())
                new_attrs[k] = normalized
            else:
                new_attrs[k] = v
        el.attrs = new_attrs
    text = soup.prettify()
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)


def signature_of_fragment(html_fragment: str) -> str:
    canonical = canonicalize_html_fragment(html_fragment)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_snapshot(url_key: str) -> Optional[Dict]:
    path = os.path.join(SNAPSHOT_DIR, f"{url_key}.json")
    if not os.path.exists(path):
        return None
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception:
        return None


def save_snapshot(url_key: str, canonical_html: str, sig: str):
    path = os.path.join(SNAPSHOT_DIR, f"{url_key}.json")
    payload = {"signature": sig, "html": canonical_html, "ts": int(time.time())}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)


def diff_fragments(old: str, new: str) -> str:
    return "\n".join(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))


# ---------- progress storage ----------
def load_progress() -> Dict:
    try:
        with open(SAVE_FILE, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"visited": [], "data": []}
    except Exception:
        return {"visited": [], "data": []}


def save_progress(progress: Dict):
    try:
        with open(SAVE_FILE, "w", encoding="utf-8") as f:
            json.dump(progress, f, ensure_ascii=False, indent=2)
    except Exception as e:
        logging.error("Failed to save progress: %s", e)


# ---------- structure monitor ----------
def monitor_structure(url: str, html_text: str, css_container: str, url_key: str) -> bool:
    soup = BeautifulSoup(html_text, "lxml")
    el = soup.select_one(css_container) if css_container else soup.body
    if el is None:
        logging.warning("Container selector '%s' not found on %s", css_container, url)
        send_alert(f"Container selector '{css_container}' not found on {url}")
        return False
    canonical = canonicalize_html_fragment(str(el))
    sig = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    prev = load_snapshot(url_key)
    if not prev:
        save_snapshot(url_key, canonical, sig)
        logging.info("Saved initial snapshot for %s", url)
        return False
    if prev.get("signature") != sig:
        old_html = prev.get("html", "")
        diff_text = diff_fragments(old_html, canonical)
        diff_path = os.path.join(SNAPSHOT_DIR, f"{url_key}_diff_{int(time.time())}.txt")
        with open(diff_path, "w", encoding="utf-8") as f:
            f.write(diff_text)
        save_snapshot(url_key, canonical, sig)
        logging.info("Structure changed for %s, diff saved to %s", url, diff_path)
        send_alert(f"STRUCTURE CHANGE: {url}\nSelector: {css_container}\nDiff saved: {diff_path}")
        return True
    return False


# ---------- main ----------
def main(urls: List[str], container_selector_for_monitoring: str = "main", wait_selector_for_playwright: str = "body"):
    progress = load_progress()
    visited = set(progress.get("visited", []))
    for url in urls:
        if url in visited:
            logging.info("Skipping visited: %s", url)
            continue
        proxy = random.choice(PROXIES) if PROXIES else None
        html_text = robust_fetch_html(url, proxy, wait_selector_for_playwright)
        if not html_text:
            logging.error("Unable to fetch %s via both requests and Playwright", url)
            send_alert(f"Failed to fetch {url} via both requests and Playwright")
            visited.add(url)
            progress["visited"] = list(visited)
            save_progress(progress)
            continue
        # monitor structure
        url_key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        try:
            changed = monitor_structure(url, html_text, container_selector_for_monitoring, url_key)
            if changed:
                logging.info("Detected structure change on %s", url)
        except Exception as e:
            logging.exception("Structure monitor failed for %s: %s", url, e)
            send_alert(f"Structure monitor exception for {url}: {e}")
        # example extraction: CSS then XPath fallback
        css_selector = "h1.product-title"  # customize
        xpath_expr = "//h1[contains(@class,'product') or contains(.,'Product')]/text()"
        try:
            item = extract_item(html_text, css_selector, xpath_expr)
            if item:
                logging.info("Extracted from %s : %s", url, item)
                progress["data"].append({"url": url, "title": item})
            else:
                logging.info("Primary extraction failed for %s; trying meta fallback", url)
                meta_title = parse_with_xpath(html_text, "//meta[@property='og:title']/@content")
                if meta_title:
                    progress["data"].append({"url": url, "title": meta_title})
                else:
                    send_alert(f"Parsing failed for {url}")
        except Exception as e:
            logging.exception("Parsing error for %s: %s", url, e)
            send_alert(f"Parsing crash {url}: {e}")
        visited.add(url)
        progress["visited"] = list(visited)
        save_progress(progress)
        time.sleep(random.uniform(1.0, 3.0))
    logging.info("Scraping finished. Items collected: %d", len(progress.get("data", [])))
    telegram_send(f"Scraping finished. Items: {len(progress.get('data', []))}")


# ---------- run ----------
if __name__ == "__main__":
    seed_urls = [
        # replace with real URLs
        "https://example.com/product/1",
        "https://example.com/product/2",
    ]
    # choose container to monitor (important block for detecting structural changes)
    container_selector = "div#content"
    # choose selector Playwright should wait for to consider page fully rendered
    wait_selector = "div#content"
    main(seed_urls, container_selector_for_monitoring=container_selector, wait_selector_for_playwright=wait_selector)
In this script, you can define your own list of proxies for rotation, custom user-agent headers, and the set of target pages to scrape.
Key features: the script first tries to fetch a page using requests (with retry/backoff). If that fails, it automatically switches to a headless browser via Playwright, waits for the specified CSS selector to appear, and then returns the rendered HTML.
It also includes logging, Telegram notifications (set your own tokens), DOM snapshots with change detection (via a diff function), progress saving, and basic proxy/user-agent rotation. Scraping is done with CSS selectors by default; if they fail, an XPath fallback is used.
Don’t forget to install the required libraries and dependencies:
pip install requests beautifulsoup4 lxml playwright
Then install the Chromium browser binary for Playwright:
python -m playwright install chromium
This is just one possible example of how a Python scraping script can be made resilient to layout changes.
By now, you should understand that it’s not enough to write a barebones scraper and start harvesting a target site. Finding a repeating pattern in the DOM is only the beginning. That approach won’t let you build a stable business process or achieve impressive results at large data-collection scales.
You also need to plan fallback paths: logging, proxy rotation, error handling, automated detection of URL and DOM changes, progress persistence, and much more. The more contingencies you build in, the more stable your Python scraping script will be.
Large commercial scrapers are useless without a reliable network infrastructure and high-quality proxies. And Froxy has just that. At your service are 10+ million IPs with precise targeting and automatic rotation by time or with each new request.