The static web is gradually becoming a thing of the past. Along with simple HTML, the familiar “monolithic” architecture, in which the final HTML for dynamic sites was generated entirely on the server and sent to clients’ browsers fully rendered, is also slowly disappearing. Many large websites and web services are now built entirely with JavaScript. If you try to load their HTML code, you’ll be quite surprised: apart from the <HEAD> and <BODY> tags, there’s almost nothing there. The content consists mainly of links to numerous scripts. These scripts are loaded directly in the browser and executed there, dynamically generating the necessary HTML code as required.
This significantly complicates web scraping. You can no longer rely on the Python Requests and BeautifulSoup libraries alone to analyze the content and extract the required data. You first need to execute all the scripts and render the page, and only then examine the resulting HTML. To do this, you need either a dedicated JavaScript engine or a full-fledged browser (browsers already ship with such engines: V8 in Chromium-based browsers and SpiderMonkey in Firefox).
This guide is for those who want to learn how to scrape websites with Python when the pages have dynamic content. It will explain what tools and libraries are needed for the job, whether effective techniques and approaches exist, and what errors and pitfalls you might encounter along the way.
The modern web has changed noticeably. It all started with HTML, which provided nothing but markup that made documents and web pages look presentable rather than plain. Browsers received the HTML content, rendered it, and displayed it to users.
With the emergence of hypertext preprocessors (based on full-fledged programming languages like PHP), it became possible to assemble web pages from various blocks and components. Content itself started being stored in specialized databases. These types of websites came to be known as “dynamic.” CMS platforms like WordPress, Joomla, and Drupal typically manage their functionality.
This, however, was also insufficient. It turned out that not only websites but also the content itself could be dynamic. JavaScript gained remarkable popularity because it allowed real-time updates to webpage content directly in the browser, often in response to user interactions. The well-known AJAX technique made it possible to “refresh” and re-render only a specific section or element of a page without reloading the whole thing. Eventually, developers realized that the traditional monolithic approach, with server-side preprocessing of every page, was becoming too resource-intensive. Despite scaling efforts, servers were struggling to keep up with the growing number of users. Enter Google, Amazon, and Facebook.
Static site generators, powered by modern JavaScript frameworks like React, Angular, and Vue, became a clever solution. But don’t let the word “static” deceive you: these websites are not static. The difference is in where and when the HTML is generated. Instead of assembling the final HTML on the server every time a user requests a page, it is assembled dynamically in the browser.
Since JavaScript files are just “static files” from the server’s perspective, they can be distributed across different servers, such as a decentralized CDN. This notably reduces the load on the hosting infrastructure and makes convenient, decentralized content storage possible.
Thus, when users request a particular page, they’re redirected to the nearest CDN node, loading the required JS scripts from there and dynamically building the page in the browser.
Clever, right? It is, until you try to scrape data from such “dynamic web pages.” To extract content, you must load all the JS scripts and execute them in a special runtime environment (a JavaScript engine, similar in concept to the Java Virtual Machine). Only then is the actual HTML generated. This is why scraping dynamic websites often means running the site through a real browser. That’s precisely how the need for headless browsers and web drivers arose: tools that automate and interact with real browsers (including libraries like Selenium, Playwright, and Puppeteer).
In general, static web pages are pure HTML files. They can be saved as regular text documents with the .html extension, which means they will open in any web browser.
Here’s an example of a simple static page:
<!DOCTYPE html>
<html>
<head>
<title>This is the page title that appears in search results and when hovering over the browser tab</title>
<meta charset="utf-8">
</head>
<body>
<h1>This is the main heading shown in the page body</h1>
<p>
This is just a paragraph — it might contain some descriptive text about something useful…
</p>
<h2>Subheading — let’s call it "Proxy Types":</h2>
<ul id="thebestlist">
<li>Residential proxies</li>
<li>Mobile proxies</li>
<li>General proxies</li>
<li>Datacenter proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
You can try it out yourself: create a plain text file on your computer, paste in the code above, and change the file extension from .txt to .html. Double-click the file — it should open in your browser and display the page.
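For a page like this, no browser automation is needed at all: Requests and BeautifulSoup are enough. Below is a minimal sketch that extracts the list items; the file name static_page.html is an assumption, so use whatever name you saved the markup under (fetching a URL with Requests would work the same way):
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Read the page saved above from disk (the file name is just an example)
with open("static_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# The list in the markup above has id="thebestlist"
for li in soup.select("#thebestlist li"):
    print(li.get_text(strip=True))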
Pages built with PHP (or other hypertext preprocessors) are first executed on the server in a special runtime environment and only then sent to the browser. From the browser’s point of view, this is still the same “static content” as plain HTML.
Dynamic web pages look quite different. The clearest example is Google search results. The only things you’ll see in the raw HTML are basic elements like <title>, <head>, and <body>, and even those are mainly used to load external JS files.
Trying to “see” meaningful content or layout in that code leads nowhere. Moreover, the code is deliberately minified (spaces, line breaks, and other special characters are removed to reduce its size).
Many other large websites and web services are organized similarly, so the same approach applies whether you scrape TikTok, get data from IMDb, or work with other such platforms.
The most common format of dynamic content used on the vast majority of websites and services, however, is embedding blocks or elements that load based on user actions. These may include:
At this point, the main difficulty of scraping dynamic content should be pretty clear. But let’s spell out the details, just in case:
Since dynamic web page content needs to be rendered before you can scrape it, the right tools are crucial, especially when working with Python. Here's a breakdown of the key tools and libraries that may come in handy:
This is a standard set for web scraping dynamic content. If you’d like, you can also look into additional libraries for parsing in Python.
You can add the rest as needed: tools for implementing random delays and asynchronous requests, scripts for advanced proxy rotation, libraries for convenient data formatting, etc.
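For example, random delays between requests take only a couple of lines from the standard library (the 2 to 6 second range below is just an illustration):
import random
import time

# Pause for a random interval of 2 to 6 seconds between requests
time.sleep(random.uniform(2, 6))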
Headless browsers are intentionally excluded from the list since they are usually bundled with web drivers nowadays.
The general algorithm for web scraping dynamic content looks as follows:
Let’s dive deeper into the technical details of parsing dynamic web pages.
Special attention should be paid to the possibility of exchanging data with target websites via an API. Some major platforms know that the information on their pages feeds various business processes. To reduce the load on their main website (the web version of the interface), they provide a separate, programmatic interface for parsers: an API (Application Programming Interface).
APIs can be found, for example, on platforms like Amazon (paid, with a small volume of free requests), IMDb (paid, GraphQL format), and TikTok (free, including for business use).
Accordingly, you can write your own Python script to generate requests to the desired API resource and parse the received responses into components. Usually, the data is supplied in a structured format, so parsing it into separate parts won’t be a problem. No headless browsers or web drivers are needed for this. Simple libraries for working with HTTP requests, like Python Requests, are enough (it even has built-in support for parsing responses in JSON format).
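As a rough illustration, such a script can be very short. The endpoint, key, query parameters, and JSON fields below are made up for the example; a real API defines its own:
import requests

# Hypothetical endpoint, key, and response structure, shown purely for illustration;
# every real API documents its own URLs, parameters, and JSON layout
API_URL = "https://api.example.com/v1/products"
API_KEY = "your_api_key_here"

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"query": "proxies", "page": 1},
    timeout=30,
)
response.raise_for_status()  # raise an error on HTTP failures (e.g., exceeded limits)

for item in response.json().get("items", []):
    print(item.get("title"), item.get("price"))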
The main problem is that each site usually has its own API, its own set of commands, and response formats. Plus, to connect, you must register a developer account and obtain an API key (which serves as an identifier). Most accounts have specific technical limits. It’s worth noting that there are no severe sanctions if you exceed these limits. At most, you’ll receive an error instead of a response. Once the limit is reset, you can resume the data exchange.
Many large services, such as Google Search or YouTube, provide no public API at all for much of the data scrapers need, or restrict it heavily. Yet this data is extremely valuable and can feed complex analytics.
Instead, you can use intermediary services that implement the missing APIs. These, however, are usually paid.
This is how the scheme of working with such services looks:
Froxy Web Scraper is an example of such a service. You can work with many major platforms like Google, Yahoo, Bing, Amazon, etc. For non-standard tasks, receiving the resulting HTML code from a list of target pages is an option. For each parsing task, you can set frequency and targeting parameters (proxy location – up to the city and mobile provider, and address type: mobile or residential). After parsing is complete, the service can send calls to specified webhooks, so your script can download the results and proceed to further processing. Data is returned in JSON or CSV format.
An example of this workflow is available in the material about scraping Google SERPs.
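On your side, a webhook is simply an HTTP endpoint that the service calls when a task is finished. Here is a minimal, generic sketch using only the standard library; the port and the payload field name are assumptions, since the actual format depends on the specific service:
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the notification body sent by the scraping service
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # The field name "result_url" is an assumption; check the service's documentation
        print("Task finished, results available at:", payload.get("result_url"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()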
Install the Selenium library using the command:
pip install selenium
PIP is the standard package manager for Python. Note that for the command to run successfully on Windows, the path to the executable pip.exe must be added to the system environment variables.
A compatible WebDriver is downloaded automatically (recent Selenium releases include Selenium Manager for this, and it can even fetch the browser itself if none is found). The process may take some time, so be sure to wait until the installation is successfully completed.
Here is an example of a simple script that will address a special test website (scrapingcourse.com/javascript-rendering) and collect a list of products:
import time
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium settings
driver = webdriver.Chrome()  # Launch the browser
driver.get("https://www.scrapingcourse.com/javascript-rendering")  # Open the target website
time.sleep(5)  # Wait for the results to load

# Get the search results
results = driver.find_elements(By.CSS_SELECTOR, "div.product-item")  # Containers with products
data = []  # List for data storage

for result in results:
    try:
        title = result.find_element(By.CSS_SELECTOR, "span.product-name").text  # Title
        link = result.find_element(By.TAG_NAME, "a").get_attribute("href")  # Link
        price = result.find_element(By.CSS_SELECTOR, "span.product-price").text  # Price
        data.append([title, link, price])  # Add to the list
    except Exception as e:
        print(f"Error while parsing: {e}")

# Close the browser
driver.quit()

# Save to CSV
csv_filename = "products_results.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link", "Price"])
    writer.writerows(data)

# Print data to the console
for item in data:
    print(item)
Create a text file named web_scraping_dynamic_content_with_selenium.py (make sure the extension is .py rather than .txt). Open it in any text editor, paste in the code above, and save the changes.
Now all that’s left is to run the script using the command in the console:
python web_scraping_dynamic_content_with_selenium.py
If you haven’t navigated to the directory where the script is located, you’ll need to specify the full path to the file, for example:
python C:\my-scraper\web_scraping_dynamic_content_with_selenium.py
After the browser window opens and then closes (about 5 seconds later), a list of products with prices and links should be printed to the console. Additionally, a CSV file with the same content, formatted as a ready-to-use table, will appear in the script's folder.
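Two common refinements of this Selenium script are worth knowing: running Chrome without a visible window and replacing the fixed time.sleep(5) with an explicit wait that ends as soon as the product cards appear. A sketch of both (the 15-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapingcourse.com/javascript-rendering")

# Wait up to 15 seconds for the product cards instead of sleeping a fixed 5 seconds
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.product-item"))
)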
Installing Playwright takes a couple of commands. The first installs the library itself along with its dependencies, while the second installs all the necessary browsers:
pip install playwright
playwright install
If you need a specific browser, specify it as an additional argument. For example:
playwright install chromium
Here is an example of the simplest script, which works exactly like the Selenium script. That is, it accesses a test dynamic website (scrapingcourse.com/javascript-rendering) and collects a list of products from it:
import csv
from playwright.sync_api import sync_playwright

# Logic for parsing products
with sync_playwright() as playwright:
    browser = playwright.chromium.launch(headless=False)  # Launch browser with GUI
    page = browser.new_page()

    # Open the target site
    page.goto("https://www.scrapingcourse.com/javascript-rendering")

    # Wait for the results by waiting for a specific selector instead of a fixed delay
    # (the default timeout is 30 seconds in case the selector is not found)
    page.wait_for_selector("div.product-item")

    # Collect the results
    results = page.locator("div.product-item").all()
    data = []

    for result in results:
        try:
            title = result.locator("span.product-name").inner_text()  # Title
            link = result.locator("a").get_attribute("href")  # Link
            price = result.locator("span.product-price").inner_text()  # Price
            data.append([title, link, price])
        except Exception as e:
            print(f"Error while parsing: {e}")

    browser.close()  # Close the browser

# Save to CSV
csv_filename = "results_playwright.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Link", "Price"])
    writer.writerows(data)

# Print data to the console
for item in data:
    print(item)
Create a text file named web_scraping_dynamic_content_with_playwright.py (again, with the .py extension). Copy in the code above and save the file.
Run the script in the console:
python web_scraping_dynamic_content_with_playwright.py
The output will be similar to the first script. In the script's directory, a file named results_playwright.csv should appear (containing a table of product titles, links, and prices).
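Since the launch call above explicitly sets headless=False, the browser window is shown while the script runs. To run it invisibly (which is Playwright's default behavior), change a single argument:
browser = playwright.chromium.launch(headless=True)  # no visible browser window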
First, you need to thoroughly explore and understand how parsers and selectors work. This is the cornerstone of quality scraping: they are what you use to navigate and interact with the DOM structure (i.e., the HTML markup).
Without understanding what to look for and where, you won’t be able to “explain” anything to your program.
Different web drivers use different syntaxes to search and extract data from elements.
For example, Playwright works with locators (its internal mechanism) and supports both XPath and CSS selector syntax. If that is not enough, you can install additional parsing libraries such as BeautifulSoup (which comes with its own syntax and DOM traversal logic).
Selenium takes a slightly different approach but is broadly similar to Playwright. Its internal By tool can locate elements by name, ID, tag, class, CSS selector, or XPath. Again, you can pass the resulting HTML to another parser if needed.
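For comparison, here is how the same product title from the test page could be located with a CSS selector and with XPath in both tools (assuming the page and driver objects from the examples above; the XPath also assumes the element's class attribute is exactly product-name):
# Playwright: the same element via a CSS selector and via XPath
title_css = page.locator("span.product-name").first.inner_text()
title_xpath = page.locator("xpath=//span[@class='product-name']").first.inner_text()

# Selenium: equivalent lookups via the By tool
title_css = driver.find_element(By.CSS_SELECTOR, "span.product-name").text
title_xpath = driver.find_element(By.XPATH, "//span[@class='product-name']").text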
Second, large web services with dynamic content have learned to detect headless browsers through digital fingerprints. To avoid being blocked or shown CAPTCHAs, you need to hide these traces and work through proxies (each new proxy IP address appears as a new client, which security systems analyze, and possibly block, individually).
Proxies are usually connected to headless browsers at the level of an individual browser instance. Therefore, instead of devising clunky workarounds for rotating proxies and launching new browsers (your device’s memory isn’t unlimited), it makes more sense to set up rotating proxies from the start.
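Both tools accept a proxy at browser-launch level. Here is a sketch with placeholder addresses and credentials; substitute your actual proxy details:
# Playwright: pass the proxy when launching the browser (placeholder address and credentials)
browser = playwright.chromium.launch(proxy={
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass",
})

# Selenium: pass the proxy to Chrome as a command-line switch (no credentials via this flag)
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://proxy.example.com:8080")
driver = webdriver.Chrome(options=options)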
Modern websites have become much more complicated. These are no longer just static HTML pages but full-fledged web applications. Working with dynamic content has become significantly more challenging because you can’t simply grab the DOM structure and extract the required data from the HTML. Instead, you need to load the site or its dynamic pages in a real browser (this can be done using headless or anti-detect browsers) before you even start scraping.
Even though you’re working with content through browsers, you still have to consider many technical nuances, including digital fingerprint emulation, user behavior simulation, and working through proxies.
You can find high-quality proxies with us. Froxy offers over 10 million IPs (residential, mobile, and datacenter proxies targeting city and mobile carrier level).
You can always switch to API-based parsing if you don’t want to overcomplicate your Python scraper. Froxy can help you with that, too. You can submit a data collection task for specific websites, and we’ll handle everything for you, delivering ready-to-use, structured data in JSON or CSV format. No need to configure proxies or headless browsers. Read more in the Froxy Scraper service description.