More and more large websites are moving away from static HTML and rendering content client-side with JavaScript frameworks. For businesses and site owners, this is convenient: it reduces server load, since rendering happens in the visitor's browser, and static assets can be served from high-performance CDNs.
For data researchers, however, this creates a serious problem: traditional scraping tools often "see" an empty page or just the JavaScript shell without the actual content.
Below, we’ll focus specifically on heavy JavaScript sites without an API and share practical techniques for scraping dynamic content.
The main challenge is that sites with heavy JavaScript don’t deliver the final page as plain HTML. Technically, HTML is sent, but instead of the main content, it contains references to JavaScript files. These scripts are then loaded and executed directly in the browser, pulling in the required data and assembling the final version of the page’s HTML.
This means that to capture the data, you need either a full browser or a special environment capable of downloading and rendering all the JavaScript (in browsers, this job is handled by the JavaScript engine).
If you try to scrape such a site “head-on” via HTTP requests — for example, with Python’s requests library or similar tools — the scraper will only see empty containers or bare markup without content.
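As a quick illustration, here is what a plain HTTP request typically returns on such a site (a minimal sketch with a hypothetical SPA URL; the root container id and bundle names will differ from site to site):

import requests
from bs4 import BeautifulSoup

# Hypothetical single-page-application URL
response = requests.get("https://example.com/catalog", timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# On a JS-rendered site, the body usually contains only an empty mount node and script tags
print(soup.find("div", id="root"))  # e.g. <div id="root"></div> with no products inside
print([s.get("src") for s in soup.find_all("script")])  # references to the JS bundles that build the page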
The problem would be fully solved by an open API, which would allow requesting and receiving structured data in formats like JSON or XML. But many websites deliberately avoid offering APIs, making scraping even harder on purpose.
As a result of the challenges described above, scraping JavaScript-heavy websites comes with several issues: the data you need is missing from the initial HTML response; the page has to be rendered before anything can be extracted, which sharply increases resource requirements; content often appears only after user actions such as scrolling or clicking "Show more"; and such sites frequently add anti-bot protection in the form of behavioral analysis, fingerprinting, and CAPTCHAs.
Below is a set of practical methods and techniques for scraping JavaScript-heavy sites: SPA/SSR hybrids, React, Vue, Angular, and others.
The most reliable option, when a page with dynamic content doesn’t expose a clear API, is to render it in a real browser. To let a scraping script “talk” to browsers, special libraries called web drivers are used.
Google Chrome now exposes a built-in automation interface based on the Chrome DevTools Protocol (CDP). In practice, though, you will still work through the same driver libraries: Playwright and Selenium (commonly used from Python) and Puppeteer (for JavaScript-based scraping).
The workflow is straightforward: launch a browser (usually headless), open the target page, wait until the JavaScript has rendered the content, and then locate and extract the elements you need.
A simple Python script for web scraping a JavaScript site using Playwright (don’t forget to install the library and browser with the commands pip install playwright and playwright install):
from playwright.sync_api import sync_playwright

# Main scraping function, takes the target page as an argument
def run(url):
    with sync_playwright() as p:
        # Launch the browser in headless mode
        browser = p.chromium.launch(headless=True)
        # Open a new tab
        page = browser.new_page()
        # Navigate to the given address
        page.goto(url)
        # Wait for product cards to appear
        page.wait_for_selector(".product-card, .catalog-item")
        # Use locators to search for elements
        cards = page.locator(".product-card, .catalog-item")
        # Count the number of elements found
        count = cards.count()
        # Print the data to the console
        print(f"Elements found: {count}")
        # Process the array of elements
        for i in range(count):
            card = cards.nth(i)
            title_el = card.locator(".title, .product-title, h3")
            price_el = card.locator(".price, .product-price")
            title = title_el.first.inner_text(timeout=1000) if title_el.count() else "-"
            price = price_el.first.inner_text(timeout=1000) if price_el.count() else "-"
            # Output the product title and price
            print(f"{i+1}. {title} — {price}")
        # Close the browser
        browser.close()

if __name__ == "__main__":
    run("https://example.com")  # replace with your target site
If a site doesn’t have a public API, that doesn’t mean there’s no API at all. Heavy JavaScript-based sites — especially those built with React, Vue, Angular, etc. — often load dynamic content over the network using XHR or fetch.
The API may simply be hard to detect, intentionally protected (with obfuscation or special authorization tokens), or hidden.
How to find the internal API of JavaScript sites: open the browser DevTools (F12), switch to the Network tab, and filter requests by XHR/Fetch. Reload the page or trigger the action that loads the data, then inspect the request URL, parameters, headers (Authorization, X-CSRF-Token, cookies), and the JSON response. Once you understand the request, replicate it directly in your script.
Example:
// install and import the fetch module — npm i node-fetch@2
const fetch = require('node-fetch');

// Main function
async function callApi() {
  // Call the internal API
  const url = 'https://example.com/api/products?page=1';
  // Send the request with fetch
  const res = await fetch(url, {
    headers: {
      'Accept': 'application/json',
      'User-Agent': 'Mozilla/5.0 ...'
      // include the required headers: Authorization, X-CSRF-Token, etc.
    }
  });
  // Parse the response data
  const data = await res.json();
  // Output the number of items found
  console.log(data.items.length);
}

callApi();
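If the rest of your stack is in Python, the same internal call can be reproduced with the requests library. A minimal sketch, where the endpoint, headers, and the "items" field are placeholders taken from the example above; replace them with whatever you see in DevTools:

import requests

# Hypothetical internal endpoint copied from the DevTools Network tab
url = "https://example.com/api/products?page=1"

headers = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0 ...",
    # add the required headers you see in DevTools: Authorization, X-CSRF-Token, cookies, etc.
}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # fail fast on 4xx/5xx responses

data = response.json()
# The response structure depends on the site; here we assume an "items" list as in the example above
print(len(data.get("items", [])))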
JavaScript-powered dynamic websites may not deliver content immediately, instead responding to user actions. They may track client behavior, show CAPTCHA, complicate the DOM structure, and more. Below are the most common situations and methods for handling them.
Here’s an example of a Python script that performs a set number of scrolls with a fixed pause between them:
import time
from playwright.sync_api import sync_playwright

def run(url, max_scroll=5, pause=2):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # how many scrolls to perform
        for i in range(max_scroll):
            # scroll to the bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(pause)  # wait for new content to load
            print(f"Scroll {i+1}/{max_scroll} done")
        # Continue with your parsing code here
For maximum reliability, you can replace fixed pauses with random delays. Before starting the scroll loop, you can also wait for a specific element to appear to make sure the page has loaded its normal content rather than a CAPTCHA placeholder.
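A minimal sketch of that idea (the .product-card selector used as the "content really loaded" marker is an assumption; substitute a selector from your target page):

import random
from playwright.sync_api import sync_playwright

def run(url, max_scroll=5):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Make sure real content (not a CAPTCHA placeholder) has appeared before scrolling
        page.wait_for_selector(".product-card", timeout=15000)
        for i in range(max_scroll):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            # Random pause between 1.5 and 4 seconds instead of a fixed delay
            page.wait_for_timeout(random.uniform(1500, 4000))
            print(f"Scroll {i+1}/{max_scroll} done")
        browser.close()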
If the page loads additional content via a “Show more” button instead of infinite scroll, the script might look like this:
from playwright.sync_api import sync_playwright

# Define the click limit
def run(url, max_clicks=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # initialize click counter
        clicks = 0
        # while the counter is below the maximum
        while clicks < max_clicks:
            try:
                # search for the button
                button = page.locator("button:has-text('Show more'), .load-more, .show-more")
                if not button.count():
                    print("The 'Show more' button was not found — stopping.")
                    break
                # click the button
                button.first.click()
                page.wait_for_timeout(2000)  # wait for new data to load
                # increase the counter
                clicks += 1
                print(f"Button click {clicks}/{max_clicks}")
            except Exception as e:
                print(f"Error when clicking: {e}")
                break
        # After finishing clicks, collect the data
Some sites analyze not only the digital fingerprints of browser profiles but also how "human" the interaction with dynamic content looks. Headless browsers were built for end-to-end testing, so they can simulate user actions, and you can leverage that capability when scraping JavaScript-heavy sites. Below is an example of typical user actions when interacting with a search form (in Python):
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # deliberately disable headless so you can watch the process
        page = browser.new_page()
        # open the site
        page.goto("https://example.com")  # Replace with your target URL
        # type a query into the search field
        page.fill("input[name='q']", "smartphone")  # Replace the query as needed
        # press Enter (simulate keyboard)
        page.press("input[name='q']", "Enter")
        # wait for results
        page.wait_for_selector(".search-results")  # Can be replaced with a timeout
        # click the first product
        page.locator(".search-results .item").first.click()
        # wait for the product card
        page.wait_for_selector(".product-card")
        print("Product card has been successfully opened!")
        browser.close()

if __name__ == "__main__":
    run()
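To make the interaction look even more human, you can add mouse movement and per-keystroke typing delays. Below is a minimal sketch; the coordinates, delay ranges, and the input[name='q'] selector are illustrative assumptions, and in newer Playwright versions locator.press_sequentially() can replace page.type():

import random
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://example.com")  # Replace with your target URL
        # Move the mouse through a few intermediate points instead of jumping straight to the field
        for x, y in [(200, 150), (350, 220), (480, 260)]:
            page.mouse.move(x, y)
            page.wait_for_timeout(random.uniform(100, 400))
        # Type the query character by character with a randomized delay, as a person would
        page.click("input[name='q']")
        page.type("input[name='q']", "smartphone", delay=random.uniform(80, 160))
        page.keyboard.press("Enter")
        browser.close()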
The first line of defense is using proxies. If you use our Froxy service, rotation and selection of exit IPs can be configured in the dashboard. The proxy is connected to the dynamic-content scraping script only once.
Example of a basic implementation (Playwright + proxy):
from playwright.async_api import async_playwright
import asyncio

async def main():
    # Proxy credentials
    proxy_server = "http://proxy.example.com:8080"  # your proxy server
    proxy_username = "your_username"  # username
    proxy_password = "your_password"  # password
    # Launch the browser with the proxy settings
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={
                'server': proxy_server,
                'username': proxy_username,
                'password': proxy_password
            }
        )
        # Open a page
        page = await browser.new_page()
        # Check the external IP via a test site
        await page.goto("https://httpbin.org/ip")
        # Get the result
        content = await page.content()
        print("Proxy IP:", content)
        # Close the browser
        await browser.close()

# Run the script
asyncio.run(main())
In Playwright you can detect the presence of a CAPTCHA just like any other page element — using a locator or CSS selector. After detection you can pause the main script and let an operator solve the challenge manually in the opened browser.
A more robust approach is to automate CAPTCHA solving via specialized services. Pay special attention to Cloudflare CAPTCHAs — they may appear before the site finishes loading.
Here’s a simple Python example with manual captcha solving:
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # disable headless mode so you can see the captcha
        page = browser.new_page()
        # Load the target site
        page.goto("https://example.com")  # You can replace with your own URL
        # Check for a captcha block (for example, a reCAPTCHA iframe)
        try:
            page.wait_for_selector("iframe[title='reCAPTCHA']", timeout=5000)
            print("Captcha detected! Solve it manually...")
            # Wait until the user solves the captcha and new content appears
            page.wait_for_selector("#content-loaded", timeout=0)  # timeout=0 means wait indefinitely
            print("Captcha solved, continuing.")
        except Exception:
            print("No captcha appeared, continuing.")
        # Further parsing or actions
        browser.close()

if __name__ == "__main__":
    run()
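For the automated route mentioned above, the usual pattern for classic reCAPTCHA v2 is to read the widget's site key, send it to a solving service, and inject the returned token into the hidden g-recaptcha-response field. The sketch below assumes a hypothetical solve_recaptcha() helper that wraps whichever service you use, and the selectors reflect standard reCAPTCHA markup, which may differ on your target site:

from playwright.sync_api import sync_playwright

def solve_recaptcha(site_key: str, page_url: str) -> str:
    # Hypothetical helper: call your CAPTCHA-solving service's API here
    # and return the solution token it sends back.
    raise NotImplementedError

def run(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Read the site key from the reCAPTCHA widget (classic v2 markup)
        widget = page.locator(".g-recaptcha")
        if widget.count():
            site_key = widget.first.get_attribute("data-sitekey")
            token = solve_recaptcha(site_key, url)
            # Inject the token into the hidden response field
            page.evaluate(
                "token => { document.getElementById('g-recaptcha-response').value = token; }",
                token,
            )
            # The site-specific callback or form submit goes here
        browser.close()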
Since some protection systems closely examine the browser profile, it makes sense to mask certain parameters when launching your browser instance. For example:
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            # The list of arguments can be extended
            args=[
                '--no-sandbox',
                '--disable-blink-features=AutomationControlled',
                '--disable-dev-shm-usage',
                '--disable-features=VizDisplayCompositor',
                '--disable-background-timer-throttling',
                '--disable-backgrounding-occluded-windows',
                '--disable-renderer-backgrounding',
                '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
            ]
        )
        page = await browser.new_page()
        # Hide headless-browser attributes using an initialization script
        await page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined,
            });
            delete navigator.__proto__.webdriver;
        """)
        # Navigate to the target page
        await page.goto('https://site.com/')
        # Close the browser
        await browser.close()

# Run the async function
asyncio.run(main())
Detection of headless browsers can be partially mitigated by stealth libraries such as playwright_stealth, selenium_stealth, undetected-chromedriver, playwright_extra, and similar tools.
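For example, with the playwright_stealth package the patches are typically applied with a single call per page. A minimal sketch, assuming the stealth_sync() helper that the package has commonly exposed (check the API of the version you install):

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # pip install playwright-stealth

def run(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Apply the stealth patches (navigator.webdriver, plugins, languages, etc.) before navigation
        stealth_sync(page)
        page.goto(url)
        print(page.title())
        browser.close()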
Sometimes the built-in mechanisms of headless browsers for locating elements in the layout are not sufficient. In addition, some developers may find the syntax of external parsers such as BeautifulSoup, lxml, and others more familiar and convenient.
There’s nothing stopping you from passing the final rendered HTML code to these libraries for further processing.
Here’s an example of combining Selenium and BeautifulSoup:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# Setting up Selenium
def setup_driver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Remove this line for debugging
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(options=chrome_options)
    return driver

# Parsing dynamic content
def parse_dynamic_content(url):
    driver = setup_driver()
    try:
        # Load the page
        driver.get(url)
        # Wait for dynamic content to load
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content")))
        # Allow some time for JavaScript execution
        time.sleep(2)
        # Get the final HTML after JavaScript execution
        page_source = driver.page_source
        # Pass it to BeautifulSoup for parsing
        soup = BeautifulSoup(page_source, 'html.parser')
        # Extract data
        products = []
        product_elements = soup.find_all('div', class_='product-item')  # Class identifying all product cards
        # Iterate through product cards and extract details
        for product in product_elements:
            name = product.find('h3')
            price = product.find('span', class_='price')
            description = product.find('p', class_='description')
            products.append({
                'name': name.text.strip() if name else 'No name',
                'price': price.text.strip() if price else 'No price',
                'description': description.text.strip() if description else 'No description'
            })
        return products
    finally:
        driver.quit()

# Example usage
if __name__ == "__main__":
    url = "https://example.com/dynamic-products"
    data = parse_dynamic_content(url)
    for item in data:
        print(f"Name: {item['name']}")
        print(f"Price: {item['price']}")
        print(f"Description: {item['description']}")
        print("-" * 50)
Working with dynamic sites is considerably harder than working with sites that deliver ready-made HTML. But that doesn’t mean scraping JavaScript-driven sites is impossible. You just need the right libraries and careful configuration.
A new component appears in the scraping stack: a headless or anti-detect browser that renders all the JS. In that browser, you can simulate user behavior, manage cookies and fingerprints, and interact with the page’s UI elements.
However, the browser alone does not solve the challenges of dynamic content 100%. You still need high-quality proxies, reliable and high-performance hardware (resource requirements increase dramatically), and a well-designed scraper architecture that can withstand anti-bot systems and adapt to changes in the structure of dynamic content.