Scraper developers have long noticed that sophisticated site-protection systems can detect automated traffic even when scripts use headless browsers and web drivers. Ready-made stealth plugins and libraries such as nodriver offer only a partial solution. Below, we'll examine in detail how automated requests are detected and blocked, and how to bypass a WAF without relying on off-the-shelf tools.
A Web Application Firewall (WAF) is a software system that protects websites and web applications from attacks that target vulnerabilities in their code, architecture, or business logic. Unlike a classic network firewall, which filters traffic by IP addresses, ports, and protocols, a WAF inspects HTTP(S) requests and responses in depth, monitoring activity at the application layer (Layer 7 of the OSI model).
A WAF can detect and prevent attempts to exploit vulnerabilities such as:
A WAF can operate in passive mode (logging suspicious requests) or active mode (blocking malicious traffic). Many advanced WAFs are adaptive: they learn from traffic patterns and user behavior to refine protection rules.
Modern WAF solutions are often deployed as part of cloud security or CDN platforms (for example, Cloudflare), integrated with SIEM (Security Information and Event Management) systems, and augmented with AI and machine-learning models to detect anomalies.
For the purposes of this article, we’re specifically interested in how WAFs are used against scraping — and, more importantly, in practical tactics for bypassing a WAF during data collection.
There are several reasons why WAF systems actively block scrapers and automated traffic. Many websites and web applications deliberately restrict bots because automated traffic creates unnecessary server load. Excessive requests can slow down a site to the point where real users can't access it, essentially the same effect as a DDoS attack.
Another reason is that a WAF continuously monitors client behavior and may interpret scraper traffic as malicious or unsafe. For example, if too many requests are sent from the same IP address in a short period, or if a bot repeatedly tries to submit hidden "honeypot" forms, the system can flag and block such activity.
Finally, automated scraping often violates a site’s Terms of Service. In some cases, the extracted data is copyrighted or otherwise protected, so site owners deploy WAF rules and additional security layers as a proactive defense against unauthorized data collection.
In short, WAF systems don’t just guard against hackers — they also act as gatekeepers to limit large-scale scraping, whether intentional or not.
Here are the specific signals a web application firewall (WAF) can use to identify scrapers and bots:
If we combine and simplify everything mentioned above, a WAF blocks scraping based on:
Ethical considerations when bypassing a Web Application Firewall (WAF) are tied to balancing security, privacy, accessibility, and the responsibility of handling user or site data (especially if it is protected by copyright). Key aspects include:
By following these rules, you remain within legal and ethical boundaries, avoid unnecessary blocks, and respect the balance of interests between the site owner and its end users.
A WAF bypass is an attempt to evade the filters and blocking algorithms of a web application firewall in order to access a site or web service for automation, scraping, or other tasks.
To bypass WAF protection, you simply need to understand and account for the main detection methods for automated traffic — many of which we described above. For example, if a WAF blocks server IPs, use residential or mobile proxies; if it inspects cookies, issues session tokens, and checks for JavaScript execution, route traffic through headless/full browsers; if it analyzes fingerprints and automation traces, use realistic browser profiles combined with anti-detect solutions.
Below, we’ll outline a number of technical techniques for specific, narrow scraping tasks.
The User-Agent header should be realistic and, ideally, match current versions of popular browsers. It’s equally important to make other headers appear natural: locale, encoding, etc.
Example of random headers in Python:
import random
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36'
]
headers = {
'User-Agent': random.choice(user_agents), # choose a random user-agent from the list above
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', # shows preferred media types and quality factors
'Accept-Language': 'en-US,en;q=0.5', # preferred language
'Accept-Encoding': 'gzip, deflate', # preferred compression methods
'Referer': 'https://www.google.com/' # the referring source
}
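For completeness, here is a minimal sketch of attaching these headers to an actual request with the requests library; the URL is only a placeholder:
import requests

# Send a request with the randomized headers defined above;
# https://example.com is a placeholder target
response = requests.get('https://example.com', headers=headers, timeout=10)
print(response.status_code)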
If you’re connecting via headless browsers, also consider disabling or adjusting certain browser attributes and applying more “natural” context settings: window size, available fonts, etc. See the proxy rotation block below for an example of specifying such context settings.
If you have disparate proxy lists, you need to rotate them yourself. For example:
import requests
import random
proxies_list = [
'http://user:pass@192.168.1.1:8080',
'http://user:pass@192.168.1.2:8080',
'http://user:pass@192.168.1.3:8080'
]
# Pick one proxy and use it for both schemes so all traffic exits via the same IP
chosen_proxy = random.choice(proxies_list)
proxy = {
    'http': chosen_proxy,
    'https': chosen_proxy
}
# The timeout here is fixed; this is just an example
response = requests.get('https://example.com', proxies=proxy, timeout=10)
If you use a rotating-proxy service such as Froxy, your code can connect to a proxy only once; subsequent rotation and selection rules are configured in the service dashboard or via the API.
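As an illustration, here is a sketch of what that looks like with requests; the endpoint, port, and credentials are placeholders you would replace with the values from your provider's dashboard:
import requests

# A single rotating-proxy endpoint; hostname, port, and credentials are placeholders
ROTATING_PROXY = 'http://user:pass@proxy.example.com:9000'
proxies = {'http': ROTATING_PROXY, 'https': ROTATING_PROXY}

# The code always connects to the same endpoint; the exit IP changes
# according to the rotation rules configured on the provider side
for url in ['https://example.com/page1', 'https://example.com/page2']:
    response = requests.get(url, proxies=proxies, timeout=10)
    print(url, response.status_code)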
Below is an example of connecting a scraper through a proxy with a headless browser and realistic browser attributes, without stealth plugins (using the Playwright web driver):
from playwright.sync_api import sync_playwright

def create_browser_with_residential_proxy(p):
    # Residential proxy configuration
    proxy_config = {
        'server': 'http://proxy.froxy.com:9000',  # or socks5:// for SOCKS5
        'username': 'your_login',
        'password': 'your_password'
    }
    # Launch the browser with the proxy
    browser = p.chromium.launch(
        headless=True,
        proxy=proxy_config,  # pass the proxy configuration
        args=[
            # Hide automation attributes
            '--disable-blink-features=AutomationControlled',
            '--no-sandbox',
            '--disable-dev-shm-usage'
        ]
    )
    # Create a context with additional settings
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},  # realistic screen resolution
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'  # realistic user-agent
    )
    # Further hide webdriver traces at the JS level before any page scripts run
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
        });
    """)
    return browser, page

# Usage: keep all work inside the sync_playwright() block,
# otherwise the browser is closed as soon as the block exits
with sync_playwright() as p:
    browser, page = create_browser_with_residential_proxy(p)
    try:
        # Navigate to the target site through the proxy
        page.goto('https://target-site.org/page1', timeout=60000)
        # Further scraping actions...
        # data = page.locator('.content').text_content()
    finally:
        # Close the browser
        browser.close()
A WAF-bypass script should be able to save and reuse cookies to emulate a real user session.
Example with Python + Requests (see Requests docs for details):
import requests
session = requests.Session()
# First request to obtain cookies
session.get('https://example.com/login')
# Subsequent requests reuse the same cookies
response = session.get('https://example.com/protected-page')
# Manually set cookies
session.cookies.update({'session_id': 'abc123', 'user_prefs': 'en'})
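Note that a requests Session keeps cookies only in memory. A minimal sketch for persisting them between runs, continuing the example above (the cookies.json file name is arbitrary):
import json

# Save the session cookies to disk so a later run can reuse them
with open('cookies.json', 'w') as f:
    json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)

# ...later, restore them into a fresh session
new_session = requests.Session()
with open('cookies.json') as f:
    new_session.cookies = requests.utils.cookiejar_from_dict(json.load(f))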
Headless browsers automatically keep cookies within a single session, although you can implement custom logic to persist and restore cookies manually.
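In Playwright, for example, this can be done via storage_state, which saves cookies and localStorage to a file and loads them into a new context. A minimal sketch (the state.json file name is arbitrary):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # First session: visit the site, obtain cookies, then save the state
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://example.com/login')
    context.storage_state(path='state.json')  # saves cookies + localStorage
    context.close()

    # Later session: restore cookies by loading the saved state
    restored = browser.new_context(storage_state='state.json')
    restored.new_page().goto('https://example.com/protected-page')
    browser.close()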
A human never maintains identical intervals between actions. Therefore, adding randomness to delays between requests is crucial. We recommend setting time limits — a minimum and a maximum — so delays don’t become too long or too short.
import time
import random
def human_delay(min_delay=1, max_delay=5):
# Random delay between requests
time.sleep(random.uniform(min_delay, max_delay))
# Usage inside a scraping loop
for page in range(10):
human_delay(2, 7) # Pause from 2 to 7 seconds
# ... perform the request
Here is one way to mask the signs of an automated browser using Selenium options (Python):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
# Set a normal user-agent (must be added before the driver is created)
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

driver = webdriver.Chrome(options=options)

# Hide webdriver traces at the JS level on every new document, not just the current page
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"}
)
Below are additional approaches for bypassing a WAF, so you know which techniques are available and what they’re useful for:
To continue exploring scraping tips, check examples of scraping with ChatGPT and resources on using frameworks like LangChain and LangGraph.
As we’ve said several times, websites and their protection mechanisms get more sophisticated every year. But this is a two-player game: WAFs and the techniques to bypass them also evolve continuously. With enough effort, you can emulate almost anything — from digital fingerprints and user behavior to client location and even microphone audio streams. It’s like dancing the tango: you must move in sync with your partner.
We provide high-quality rotating proxies with precise targeting down to specific cities and ISPs worldwide, as well as a set of ready-made cloud scrapers accessible via API.