Scraper developers have long noticed that sophisticated site-protection systems can detect automated traffic even when scripts use headless browsers and web drivers. Ready-made stealth plugins and libraries such as nodriver offer only a partial solution. Below, we'll examine in detail how automated requests are detected and blocked, and how to bypass a WAF without relying on off-the-shelf tools.
A Web Application Firewall (WAF) is a software system that protects websites and web applications from attacks that target vulnerabilities in their code, architecture, or business logic. Unlike a classic network firewall, which filters traffic by IP addresses, ports, and protocols, a WAF inspects HTTP(S) requests and responses in depth, monitoring activity at the application layer (Layer 7 of the OSI model).
A WAF can detect and prevent attempts to exploit vulnerabilities such as:
A WAF can operate in passive mode (logging suspicious requests) or active mode (blocking malicious traffic). Many advanced WAFs are adaptive: they learn from traffic patterns and user behavior to refine protection rules.
Modern WAF solutions are often deployed as part of cloud security or CDN platforms (for example, Cloudflare), integrated with SIEM (Security Information and Event Management) systems, and augmented with AI and machine-learning models to detect anomalies.
For the purposes of this article, we’re specifically interested in how WAFs are used against scraping — and, more importantly, in practical tactics for bypassing a WAF during data collection.
There are several reasons why WAF systems actively block scrapers and automated traffic. Many websites and web applications deliberately restrict bots because automated traffic creates unnecessary server load. Excessive requests can slow down a site to the point where real users can't access it, essentially the same effect as a DDoS attack.
Another reason is that a WAF continuously monitors client behavior and may interpret scraper traffic as malicious or unsafe. For example, if too many requests are sent from the same IP address in a short period, or if a bot repeatedly tries to submit hidden "honeypot" forms, the system can flag and block such activity.
Finally, automated scraping often violates a site’s Terms of Service. In some cases, the extracted data is copyrighted or otherwise protected, so site owners deploy WAF rules and additional security layers as a proactive defense against unauthorized data collection.
In short, WAF systems don’t just guard against hackers — they also act as gatekeepers to limit large-scale scraping, whether intentional or not.
Here are the specific signals a web application firewall (WAF) can use to identify scrapers and bots:
If we combine and simplify everything mentioned above, a WAF blocks scraping based on:
Ethical considerations when bypassing a Web Application Firewall (WAF) are tied to balancing security, privacy, accessibility, and the responsibility of handling user or site data (especially if it is protected by copyright). Key aspects include:
By following these rules, you remain within legal and ethical boundaries, avoid unnecessary blocks, and respect the balance of interests between the site owner and its end users.
A WAF bypass is an attempt to evade the filters and blocking algorithms of a web application firewall in order to access a site or web service for automation, scraping, or other tasks.
To bypass WAF protection, you simply need to understand and account for the main detection methods for automated traffic — many of which we described above. For example, if a WAF blocks server IPs, use residential or mobile proxies; if it inspects cookies, issues session tokens, and checks for JavaScript execution, route traffic through headless/full browsers; if it analyzes fingerprints and automation traces, use realistic browser profiles combined with anti-detect solutions.
Below, we’ll outline a number of technical techniques for specific, narrow scraping tasks.
The User-Agent header should be realistic and, ideally, match current versions of popular browsers. It’s equally important to make other headers appear natural: locale, encoding, etc.
Example of random headers in Python:
import random
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36'
]
headers = {
'User-Agent': random.choice(user_agents), # choose a random user-agent from the list above
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # shows preferred media types and quality factors
'Accept-Language': 'en-US,en;q=0.5', # preferred language
'Accept-Encoding': 'gzip, deflate', # preferred compression methods
'Referer': 'https://www.google.com/' # the referring source
}
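For completeness, here is a minimal sketch of attaching these headers to an actual request with the requests library; the URL is only a placeholder:
import requests

# Send a request with the randomized headers defined above;
# https://example.com is a placeholder target
response = requests.get('https://example.com', headers=headers, timeout=10)
print(response.status_code)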
If you’re connecting via headless browsers, also consider disabling or adjusting certain browser attributes and applying more “natural” context settings: window size, available fonts, etc. See the proxy rotation block below for an example of specifying such context settings.
If you have disparate proxy lists, you need to rotate them yourself. For example:
import requests
import random
proxies_list = [
'http://user:pass@192.168.1.1:8080',
'http://user:pass@192.168.1.2:8080',
'http://user:pass@192.168.1.3:8080'
]
# Pick one proxy and use it for both schemes so all traffic exits via the same IP
chosen_proxy = random.choice(proxies_list)
proxy = {
    'http': chosen_proxy,
    'https': chosen_proxy
}
# The timeout here is fixed; this is just an example
response = requests.get('https://example.com', proxies=proxy, timeout=10)
If you use a rotating-proxy service such as Froxy, your code can connect to a proxy only once; subsequent rotation and selection rules are configured in the service dashboard or via the API.
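As an illustration, here is a sketch of what that looks like with requests; the endpoint, port, and credentials are placeholders you would replace with the values from your provider's dashboard:
import requests

# A single rotating-proxy endpoint; hostname, port, and credentials are placeholders
ROTATING_PROXY = 'http://user:pass@proxy.example.com:9000'
proxies = {'http': ROTATING_PROXY, 'https': ROTATING_PROXY}

# The code always connects to the same endpoint; the exit IP changes
# according to the rotation rules configured on the provider side
for url in ['https://example.com/page1', 'https://example.com/page2']:
    response = requests.get(url, proxies=proxies, timeout=10)
    print(url, response.status_code)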
Below is an example of connecting a scraper through a proxy with a headless browser and realistic browser attributes, without stealth plugins (using the Playwright web driver):
from playwright.sync_api import sync_playwright

def create_browser_with_residential_proxy(p):
    # Residential proxy configuration
    proxy_config = {
        'server': 'http://proxy.froxy.com:9000',  # or socks5:// for SOCKS5
        'username': 'your_login',
        'password': 'your_password'
    }
    # Launch the browser with the proxy
    browser = p.chromium.launch(
        headless=True,
        proxy=proxy_config,  # pass the proxy configuration
        args=[
            # Hide automation attributes
            '--disable-blink-features=AutomationControlled',
            '--no-sandbox',
            '--disable-dev-shm-usage'
        ]
    )
    # Create a context with additional settings
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},  # realistic screen resolution
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'  # realistic user-agent
    )
    # Further hide webdriver traces at the JS level before any page scripts run
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
        });
    """)
    return browser, page

# Usage: keep all work inside the sync_playwright() block,
# otherwise the browser is closed as soon as the block exits
with sync_playwright() as p:
    browser, page = create_browser_with_residential_proxy(p)
    try:
        # Navigate to the target site through the proxy
        page.goto('https://target-site.org/page1', timeout=60000)
        # Further scraping actions...
        # data = page.locator('.content').text_content()
    finally:
        # Close the browser
        browser.close()
A WAF-bypass script should be able to save and reuse cookies to emulate a real user session.
Example with Python + Requests (see Requests docs for details):
import requests
session = requests.Session()
# First request to obtain cookies
session.get('https://example.com/login')
# Subsequent requests reuse the same cookies
response = session.get('https://example.com/protected-page')
# Manually set cookies
session.cookies.update({'session_id': 'abc123', 'user_prefs': 'en'})
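Note that a requests Session keeps cookies only in memory. A minimal sketch for persisting them between runs, continuing the example above (the cookies.json file name is arbitrary):
import json

# Save the session cookies to disk so a later run can reuse them
with open('cookies.json', 'w') as f:
    json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)

# ...later, restore them into a fresh session
new_session = requests.Session()
with open('cookies.json') as f:
    new_session.cookies = requests.utils.cookiejar_from_dict(json.load(f))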
Headless browsers automatically keep cookies within a single session, although you can implement custom logic to persist and restore cookies manually.
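In Playwright, for example, this can be done via storage_state, which saves cookies and localStorage to a file and loads them into a new context. A minimal sketch (the state.json file name is arbitrary):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # First session: visit the site, obtain cookies, then save the state
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://example.com/login')
    context.storage_state(path='state.json')  # saves cookies + localStorage
    context.close()

    # Later session: restore cookies by loading the saved state
    restored = browser.new_context(storage_state='state.json')
    restored.new_page().goto('https://example.com/protected-page')
    browser.close()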
A human never maintains identical intervals between actions. Therefore, adding randomness to delays between requests is crucial. We recommend setting time limits — a minimum and a maximum — so delays don’t become too long or too short.
import time
import random
def human_delay(min_delay=1, max_delay=5):
# Random delay between requests
time.sleep(random.uniform(min_delay, max_delay))
# Usage inside a scraping loop
for page in range(10):
human_delay(2, 7) # Pause from 2 to 7 seconds
# ... perform the request
Here is one way to mask the signs of an automated browser using Selenium options (Python):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
# Set a normal user-agent (must be added before the driver is created)
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

driver = webdriver.Chrome(options=options)

# Hide webdriver traces at the JS level on every new document, not just the current page
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"}
)
Below are additional approaches for bypassing a WAF, so you know which techniques are available and what they’re useful for:
To continue exploring scraping tips, check examples of scraping with ChatGPT and resources on using frameworks like LangChain and LangGraph.
As we’ve said several times, websites and their protection mechanisms get more sophisticated every year. But this is a two-player game: WAFs and the techniques to bypass them also evolve continuously. With enough effort, you can emulate almost anything — from digital fingerprints and user behavior to client location and even microphone audio streams. It’s like dancing the tango: you must move in sync with your partner.
We provide high-quality rotating proxies with precise targeting down to specific cities and ISPs worldwide, as well as a set of ready-made cloud scrapers accessible via API.