More and more large websites are moving away from static HTML and rendering content client-side with JavaScript frameworks. For businesses and site owners, this is convenient: it reduces server load, since rendering happens in the visitor's browser, and static assets can be served from high-performance CDNs.
For data researchers, however, this creates a serious problem: traditional scraping tools often "see" an empty page or just the JavaScript shell without the actual content.
Below, we’ll focus specifically on heavy JavaScript sites without an API and share practical techniques for scraping dynamic content.
The main challenge is that sites with heavy JavaScript don’t deliver the final page as plain HTML. Technically, HTML is sent, but instead of the main content, it contains references to JavaScript files. These scripts are then loaded and executed directly in the browser, pulling in the required data and assembling the final version of the page’s HTML.
This means that to capture the data, you need either a full browser or a special environment capable of downloading and rendering all the JavaScript (in browsers, this job is handled by the JavaScript engine).
If you try to scrape such a site “head-on” via HTTP requests — for example, with Python’s requests library or similar tools — the scraper will only see empty containers or bare markup without content.
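As a quick illustration, here is what a plain HTTP request typically returns on such a site (a minimal sketch with a hypothetical SPA URL; the root container id and bundle names will differ from site to site):

import requests
from bs4 import BeautifulSoup

# Hypothetical single-page-application URL
response = requests.get("https://example.com/catalog", timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# On a JS-rendered site, the body usually contains only an empty mount node and script tags
print(soup.find("div", id="root"))  # e.g. <div id="root"></div> with no products inside
print([s.get("src") for s in soup.find_all("script")])  # references to the JS bundles that build the page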
The problem would be fully solved by an open API, which would allow requesting and receiving structured data in formats like JSON or XML. But many websites deliberately avoid offering APIs, making scraping even harder on purpose.
As a result of the challenges described above, scraping JavaScript-heavy websites comes with several issues: the data you need is missing from the initial HTML response; the page has to be rendered before anything can be extracted, which sharply increases resource requirements; content often appears only after user actions such as scrolling or clicking "Show more"; and such sites frequently add anti-bot protection in the form of behavioral analysis, fingerprinting, and CAPTCHAs.
Below is a set of practical methods and techniques for scraping JavaScript-heavy sites: SPA/SSR hybrids, React, Vue, Angular, and others.
The most reliable option, when a page with dynamic content doesn’t expose a clear API, is to render it in a real browser. To let a scraping script “talk” to browsers, special libraries called web drivers are used.
Google Chrome now exposes a built-in automation interface based on the Chrome DevTools Protocol (CDP). In practice, though, you will still work through the same driver libraries: Playwright and Selenium (commonly used from Python) and Puppeteer (for JavaScript-based scraping).
The workflow is straightforward: launch a browser (usually headless), open the target page, wait until the JavaScript has rendered the content, and then locate and extract the elements you need.
A simple Python script for web scraping a JavaScript site using Playwright (don’t forget to install the library and browser with the commands pip install playwright and playwright install):
from playwright.sync_api import sync_playwright

# Main scraping function, takes the target page as an argument
def run(url):
    with sync_playwright() as p:
        # Launch the browser in headless mode
        browser = p.chromium.launch(headless=True)
        # Open a new tab
        page = browser.new_page()
        # Navigate to the given address
        page.goto(url)
        # Wait for product cards to appear
        page.wait_for_selector(".product-card, .catalog-item")
        # Use locators to search for elements
        cards = page.locator(".product-card, .catalog-item")
        # Count the number of elements found
        count = cards.count()
        # Print the data to the console
        print(f"Elements found: {count}")
        # Process the array of elements
        for i in range(count):
            card = cards.nth(i)
            title_el = card.locator(".title, .product-title, h3")
            price_el = card.locator(".price, .product-price")
            title = title_el.first.inner_text(timeout=1000) if title_el.count() else "-"
            price = price_el.first.inner_text(timeout=1000) if price_el.count() else "-"
            # Output the product title and price
            print(f"{i+1}. {title} — {price}")
        # Close the browser
        browser.close()

if __name__ == "__main__":
    run("https://example.com")  # replace with your target site
If a site doesn’t have a public API, that doesn’t mean there’s no API at all. Heavy JavaScript-based sites — especially those built with React, Vue, Angular, etc. — often load dynamic content over the network using XHR or fetch.
The API may simply be hard to detect, intentionally protected (with obfuscation or special authorization tokens), or hidden.
How to find the internal API of JavaScript sites: open the browser DevTools (F12), switch to the Network tab, and filter requests by XHR/Fetch. Reload the page or trigger the action that loads the data, then inspect the request URL, parameters, headers (Authorization, X-CSRF-Token, cookies), and the JSON response. Once you understand the request, replicate it directly in your script.
Example:
// install and import the fetch module — npm i node-fetch@2
const fetch = require('node-fetch');

// Main function
async function callApi() {
  // Call the internal API
  const url = 'https://example.com/api/products?page=1';
  // Send the request with fetch
  const res = await fetch(url, {
    headers: {
      'Accept': 'application/json',
      'User-Agent': 'Mozilla/5.0 ...'
      // include the required headers: Authorization, X-CSRF-Token, etc.
    }
  });
  // Parse the response data
  const data = await res.json();
  // Output the number of items found
  console.log(data.items.length);
}

callApi();
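If the rest of your stack is in Python, the same internal call can be reproduced with the requests library. A minimal sketch, where the endpoint, headers, and the "items" field are placeholders taken from the example above; replace them with whatever you see in DevTools:

import requests

# Hypothetical internal endpoint copied from the DevTools Network tab
url = "https://example.com/api/products?page=1"

headers = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 ...",
    # add the required headers you see in DevTools: Authorization, X-CSRF-Token, cookies, etc.
}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # fail fast on 4xx/5xx responses

data = response.json()
# The response structure depends on the site; here we assume an "items" list as in the example above
print(len(data.get("items", [])))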
JavaScript-powered dynamic websites may not deliver content immediately, instead responding to user actions. They may track client behavior, show CAPTCHA, complicate the DOM structure, and more. Below are the most common situations and methods for handling them.
Here’s an example of a Python script that performs a set number of scrolls with a fixed pause between them:
import time
from playwright.sync_api import sync_playwright

def run(url, max_scroll=5, pause=2):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # how many scrolls to perform
        for i in range(max_scroll):
            # scroll to the bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(pause)  # wait for new content to load
            print(f"Scroll {i+1}/{max_scroll} done")
        # Continue with your parsing code here
For maximum reliability, you can replace fixed pauses with random delays. Before starting the scroll loop, you can also wait for a specific element to appear to make sure the page has loaded its normal content rather than a CAPTCHA placeholder.
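A minimal sketch of that idea (the .product-card selector used as the "content really loaded" marker is an assumption; substitute a selector from your target page):

import random
from playwright.sync_api import sync_playwright

def run(url, max_scroll=5):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Make sure real content (not a CAPTCHA placeholder) has appeared before scrolling
        page.wait_for_selector(".product-card", timeout=15000)
        for i in range(max_scroll):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            # Random pause between 1.5 and 4 seconds instead of a fixed delay
            page.wait_for_timeout(random.uniform(1500, 4000))
            print(f"Scroll {i+1}/{max_scroll} done")
        browser.close()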
If the page loads additional content via a “Show more” button instead of infinite scroll, the script might look like this:
from playwright.sync_api import sync_playwright

# Define the click limit
def run(url, max_clicks=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # initialize click counter
        clicks = 0
        # while the counter is below the maximum
        while clicks < max_clicks:
            try:
                # search for the button
                button = page.locator("button:has-text('Show more'), .load-more, .show-more")
                if not button.count():
                    print("The 'Show more' button was not found — stopping.")
                    break
                # click the button
                button.first.click()
                page.wait_for_timeout(2000)  # wait for new data to load
                # increase the counter
                clicks += 1
                print(f"Button click {clicks}/{max_clicks}")
            except Exception as e:
                print(f"Error when clicking: {e}")
                break
        # After finishing clicks, collect the data
Some sites analyze not only the digital fingerprints of browser profiles but also how "human" the interaction with dynamic content looks. Headless browsers were built for end-to-end testing, so they can simulate user actions, and you can leverage that capability when scraping JavaScript-heavy sites. Below is an example of typical user actions when interacting with a search form (in Python):
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # deliberately disable headless so you can watch the process
        page = browser.new_page()
        # open the site
        page.goto("https://example.com")  # Replace with your target URL
        # type a query into the search field
        page.fill("input[name='q']", "smartphone")  # Replace the query as needed
        # press Enter (simulate keyboard)
        page.press("input[name='q']", "Enter")
        # wait for results
        page.wait_for_selector(".search-results")  # Can be replaced with a timeout
        # click the first product
        page.locator(".search-results .item").first.click()
        # wait for the product card
        page.wait_for_selector(".product-card")
        print("Product card has been successfully opened!")
        browser.close()

if __name__ == "__main__":
    run()
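To make the interaction look even more human, you can add mouse movement and per-keystroke typing delays. Below is a minimal sketch; the coordinates, delay ranges, and the input[name='q'] selector are illustrative assumptions, and in newer Playwright versions locator.press_sequentially() can replace page.type():

import random
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://example.com")  # Replace with your target URL
        # Move the mouse through a few intermediate points instead of jumping straight to the field
        for x, y in [(200, 150), (350, 220), (480, 260)]:
            page.mouse.move(x, y)
            page.wait_for_timeout(random.uniform(100, 400))
        # Type the query character by character with a randomized delay, as a person would
        page.click("input[name='q']")
        page.type("input[name='q']", "smartphone", delay=random.uniform(80, 160))
        page.keyboard.press("Enter")
        browser.close()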
The first line of defense is using proxies. If you use our Froxy service, rotation and selection of exit IPs can be configured in the dashboard. The proxy is connected to the dynamic-content scraping script only once.
Example of a basic implementation (Playwright + proxy):
from playwright.async_api import async_playwright
import asyncio

async def main():
    # Proxy credentials
    proxy_server = "http://proxy.example.com:8080"  # your proxy server
    proxy_username = "your_username"  # username
    proxy_password = "your_password"  # password
    # Launch the browser with the proxy settings
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                'server': proxy_server,
                'username': proxy_username,
                'password': proxy_password
            }
        )
        # Open a page
        page = await browser.new_page()
        # Check the external IP via a test site
        await page.goto("https://httpbin.org/ip")
        # Get the result
        content = await page.content()
        print("Proxy IP:", content)
        # Close the browser
        await browser.close()

# Run the script
asyncio.run(main())
In Playwright you can detect the presence of a CAPTCHA just like any other page element — using a locator or CSS selector. After detection you can pause the main script and let an operator solve the challenge manually in the opened browser.
A more robust approach is to automate CAPTCHA solving via specialized services. Pay special attention to Cloudflare CAPTCHAs — they may appear before the site finishes loading.
Here’s a simple Python example with manual captcha solving:
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # disable headless mode so you can see the captcha
        page = browser.new_page()
        # Load the target site
        page.goto("https://example.com")  # You can replace with your own URL
        # Check for a captcha block (for example, a reCAPTCHA iframe)
        try:
            page.wait_for_selector("iframe[title='reCAPTCHA']", timeout=5000)
            print("Captcha detected! Solve it manually...")
            # Wait until the user solves the captcha and new content appears
            page.wait_for_selector("#content-loaded", timeout=0)  # timeout=0 means wait indefinitely
            print("Captcha solved, continuing.")
        except Exception:
            print("No captcha appeared, continuing.")
        # Further parsing or actions
        browser.close()

if __name__ == "__main__":
    run()
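For the automated route mentioned above, the usual pattern for classic reCAPTCHA v2 is to read the widget's site key, send it to a solving service, and inject the returned token into the hidden g-recaptcha-response field. The sketch below assumes a hypothetical solve_recaptcha() helper that wraps whichever service you use, and the selectors reflect standard reCAPTCHA markup, which may differ on your target site:

from playwright.sync_api import sync_playwright

def solve_recaptcha(site_key: str, page_url: str) -> str:
    # Hypothetical helper: call your CAPTCHA-solving service's API here
    # and return the solution token it sends back.
    raise NotImplementedError

def run(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Read the site key from the reCAPTCHA widget (classic v2 markup)
        widget = page.locator(".g-recaptcha")
        if widget.count():
            site_key = widget.first.get_attribute("data-sitekey")
            token = solve_recaptcha(site_key, url)
            # Inject the token into the hidden response field
            page.evaluate(
                "token => { document.getElementById('g-recaptcha-response').value = token; }",
                token,
            )
            # The site-specific callback or form submit goes here
        browser.close()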
Since some protection systems closely examine the browser profile, it makes sense to mask certain parameters when launching your browser instance. For example:
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            # The list of arguments can be extended
            args=[
                '--no-sandbox',
                '--disable-blink-features=AutomationControlled',
                '--disable-dev-shm-usage',
                '--disable-features=VizDisplayCompositor',
                '--disable-background-timer-throttling',
                '--disable-backgrounding-occluded-windows',
                '--disable-renderer-backgrounding',
                '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
            ]
        )
        page = await browser.new_page()
        # Hide headless-browser attributes using an initialization script
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined,
            });
            delete navigator.__proto__.webdriver;
        """)
        # Navigate to the target page
        await page.goto('https://site.com/')
        # Close the browser
        await browser.close()

# Run the async function
asyncio.run(main())
Detection of headless browsers can be partially mitigated by stealth libraries such as playwright_stealth, selenium_stealth, undetected-chromedriver, playwright_extra, and similar tools.
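For example, with the playwright_stealth package the patches are typically applied with a single call per page. A minimal sketch, assuming the stealth_sync() helper that the package has commonly exposed (check the API of the version you install):

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # pip install playwright-stealth

def run(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Apply the stealth patches (navigator.webdriver, plugins, languages, etc.) before navigation
        stealth_sync(page)
        page.goto(url)
        print(page.title())
        browser.close()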
Sometimes the built-in mechanisms of headless browsers for locating elements in the layout are not sufficient. In addition, some developers may find the syntax of external parsers such as BeautifulSoup, lxml, and others more familiar and convenient.
There’s nothing stopping you from passing the final rendered HTML code to these libraries for further processing.
Here’s an example of combining Selenium and BeautifulSoup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# Setting up Selenium
def setup_driver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Remove this line for debugging
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=chrome_options)
    return driver

# Parsing dynamic content
def parse_dynamic_content(url):
    driver = setup_driver()
    try:
        # Load the page
        driver.get(url)
        # Wait for dynamic content to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content")))
        # Allow some time for JavaScript execution
        time.sleep(2)
        # Get the final HTML after JavaScript execution
        page_source = driver.page_source
        # Pass it to BeautifulSoup for parsing
        soup = BeautifulSoup(page_source, 'html.parser')
        # Extract data
        products = []
        product_elements = soup.find_all('div', class_='product-item')  # Class identifying all product cards
        # Iterate through product cards and extract details
        for product in product_elements:
            name = product.find('h3')
            price = product.find('span', class_='price')
            description = product.find('p', class_='description')
            products.append({
                'name': name.text.strip() if name else 'No name',
                'price': price.text.strip() if price else 'No price',
                'description': description.text.strip() if description else 'No description'
            })
        return products
    finally:
        driver.quit()

# Example usage
if __name__ == "__main__":
    url = "https://example.com/dynamic-products"
    data = parse_dynamic_content(url)
    for item in data:
        print(f"Name: {item['name']}")
        print(f"Price: {item['price']}")
        print(f"Description: {item['description']}")
        print("-" * 50)
Working with dynamic sites is considerably harder than working with sites that deliver ready-made HTML. But that doesn’t mean scraping JavaScript-driven sites is impossible. You just need the right libraries and careful configuration.
A new component appears in the scraping stack: a headless or anti-detect browser that renders all the JS. In that browser, you can simulate user behavior, manage cookies and fingerprints, and interact with the page’s UI elements.
However, the browser alone does not solve the challenges of dynamic content 100%. You still need high-quality proxies, reliable and high-performance hardware (resource requirements increase dramatically), and a well-designed scraper architecture that can withstand anti-bot systems and adapt to changes in the structure of dynamic content.