Business is impossible without data analysis. Naturally, data comes from various sources: competitor websites, electronic encyclopedias, marketplaces, applications and web services, specialized databases, etc. Each source presents information in its own format, whether structured or unstructured. The task of programmers is to ensure that businesses receive data in a consistent, structured format suitable for further analysis and integration into operations. This brings us to the realization that companies need a high-quality and functional parser.
Since there are many data sources, and most of them are websites that do not provide APIs (ready-made programmatic interfaces), the best solution is usually to develop a custom parsing script. However, this presents an interesting challenge - most modern websites use dynamic technologies, such as asynchronous JavaScript and AJAX. As a result, a simple analysis of the HTML code returned by the server is ineffective. This means an additional engine is required to execute JavaScript or, even better, a full-featured browser. But how can a parser "communicate" with a browser?
This is done through a special local API. Such APIs are provided by libraries known as web drivers. We’ll review one of these solutions below.
Why Playwright?
Playwright is an open-source web driver developed by Microsoft for its internal needs (primarily for testing websites and applications). The framework supports all major operating systems (Windows, macOS, Linux) and offers bindings for multiple programming languages: JavaScript (TypeScript), Python, Java, and .NET. It can interact via API with all popular desktop browsers, including Chrome (and many Chromium-based browsers like Edge or Opera), Firefox, and Safari (via WebKit builds, 18.2 at the time of writing).
What else should a library have to attract everyone's attention? A unified command syntax. That’s exactly what makes Playwright appealing: regardless of the programming language you use for your parser, the browser you control, or the operating system you work in, API calls in Playwright keep the same syntax.
Perhaps that is why so many developers have chosen Playwright for their projects since its first release in 2020. With Playwright, web scraping becomes truly convenient and functional.
Playwright Advantages for Web Scraping
Here are the common benefits of using Playwright for web scraping:
- Cross-platform compatibility (supports multiple programming languages, operating systems, and browsers. It can even emulate mobile browser environments);
- Unified command syntax;
- Comprehensive documentation and examples;
- Regular updates (Microsoft - a major vendor - is responsible for ensuring stability);
- Easy installation and setup (the script automatically downloads and installs the required headless browser versions);
- Parallel process support (no limitations on requests or resources, with robust isolation between browser contexts);
- Built-in inspector and tracer;
- Native network request interception (proxies, including rotating ones, are supported out of the box);
- Asynchronous support (no need for additional libraries or custom scripts for handling async operations);
- Pre-built Docker images (allows remote execution and browser management);
- Screenshot and PDF export functionality.
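To illustrate the last point, here is a minimal sketch of exporting a page as an image and a PDF (the file names and URL are arbitrary; note that PDF export works only in headless Chromium):

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # Headless by default; PDF export requires headless Chromium
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # Save a full-page screenshot (the file name is arbitrary)
    page.screenshot(path="example.png", full_page=True)
    # Export the same page as a PDF document
    page.pdf(path="example.pdf")
    browser.close()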
For more details, check out our Playwright vs Puppeteer comparison.
Below, we will focus on code examples and integrating Playwright into scrapers.
Web Scraping Basics with Playwright
Playwright is essentially a browser automation tool. When using Playwright for web scraping, the process involves the following steps:
- Writing a script that instructs the browser on what to do (which websites to visit, which page elements to wait for, where to click, etc.);
- Extracting and parsing the required data from the page after interaction.
Since most web scrapers and parsing frameworks are built with Python, we’ll use the same language in our examples. As the operating system, we’ll use Windows, which is more widespread on desktops than Linux or macOS.
Let’s break down Playwright web scraping step by step - especially for beginners.
Playwright Setup
We have discussed the process of installing the Python runtime environment multiple times. Here is what you should do during the preparation stage:
Step 1. Download and Install Python. Visit the Python download page to get the latest version (3.13 at the time of writing). During installation, check the option to automatically add Python to the PATH environment variable.
Step 2. Add Python and the Pip Package Manager to the Environment Variables (if you checked the box in the first step, Python is already added). The pip executable lives in the Scripts subfolder of your Python installation; that is the path to add.
Step 3. Install the Playwright Library. The easiest way to install Playwright is via the pip package manager using the following command (preferably with administrator rights, especially if Python is installed system-wide and not just for a single user; the pytest-playwright package is only needed if you also want the pytest test plugin):
pip install playwright
Pip will automatically resolve the required dependencies and install any missing libraries. If the console returns an error saying the pip command is not recognized (because it wasn't added to the environment variables), use the following command instead:
python -m pip install playwright
Step 4. Download the Browser Binaries. Playwright manages its own browser builds; fetch them with the following command:
playwright install
If you need specific browsers, you can specify them as parameters after the «install» command, e.g.:
playwright install chrome firefox
You can optionally install additional Python libraries for parsing. BeautifulSoup works well for high-quality HTML syntax analysis. Pandas is a good pick for fast data conversion into tables and structured datasets. There are even full scraping frameworks like Scrapy. All of them can be installed through pip, as shown below.
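If you plan to use these extras, they can be installed with one command (package names as published on PyPI):

pip install beautifulsoup4 pandas scrapy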
Nuances of Using Playwright for Web Scraping
- Before scraping any website, review its robots.txt file and Terms of Service. Whatever parser you use, it’s important to respect the rules the website owner sets; ignoring them may result in bans and blocks. (More on scraping without getting blocked);
- Don’t overload the website with too many requests in a short time. This creates an unnecessary server load. Be mindful of detection mechanisms as well - unnatural browsing patterns (e.g., rapid page transitions, high request frequency from one IP, or identical time gaps between requests) make it easier to get flagged;
- Never scrape personal or sensitive user information unless you have explicit permission from the data owner.
- Always run Playwright in headless mode to save computing resources in production projects. This guide provides some examples using a graphical interface mode for educational purposes only.
- Consider emulating actual user behavior to avoid detection and bypass advanced bot protection: use unique digital fingerprints. For example, you can reuse actual cookies from real browsing sessions, simulate mouse movements, random scrolling, and clicks, and so on. A key part of a bot's fingerprint is the User-Agent string, which should be set dynamically.
- Finally, protect your privacy and boost parser performance. Proxies handle both tasks. Rotating mobile and residential proxies are the best options for business scraping projects. For lower-priority tasks, server proxies can work but should still support rotation.
- Playwright can extract elements from pages but has limited syntax analysis capabilities. Because of this, we recommend using additional libraries for parsing. They will simplify searching for the required web page elements and further extracting data.
- Remember the asynchronous nature of page loading. Slow internet connections can delay content loading, affecting the extracted HTML. To avoid data loss, use proper waiting strategies: wait for specific page elements to load or implement randomized delays to simulate human-like browsing behavior (see the sketch after this list).
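Here is a minimal sketch combining several of these tips: a custom User-Agent, waiting for a concrete element, and a randomized delay (the URL, selector, and User-Agent string are placeholders, not values from a real project):

import random
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    # Set the User-Agent at the context level (the string below is just an example)
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
    page = context.new_page()
    page.goto("https://example.com")
    # Wait for a concrete element instead of a blind fixed pause (the selector is hypothetical)
    page.wait_for_selector("div.content")
    # Add a randomized 1-3 second pause so request timing doesn't look machine-regular
    page.wait_for_timeout(random.randint(1000, 3000))
    browser.close()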
Creating a Simple Web Scraper with Playwright
Want to see your Playwright scraper in action? No problem! To visualize how it works, we’ll disable headless mode in Chrome.
Create a new dedicated folder, for example, on drive C:
C:\Playwright-scraper
Create a text file in this folder and name it, for example, «First-Playwright-web-scraping-script.txt».
Then change the extension from .txt to .py so that Python can run it.
Open the file in any text editor (we use Notepad++) and paste the following content:
# Import libraries; in particular, we need the synchronous API
from playwright.sync_api import sync_playwright

# Define the browser instance settings here
with sync_playwright() as pw:
    # Choose the Chromium browser and disable headless mode for it
    browser = pw.chromium.launch(headless=False)
    # Create a new browser context with window parameters: width 1366 pixels, height 700; replace them with your own if needed
    context = browser.new_context(viewport={"width": 1366, "height": 700})
    # Open a new browser tab
    page = context.new_page()
    print("Open browser")
    # Navigate to the https://twitch.tv/directory/game/Art page
    print("Navigate to the twitch.tv/directory/game/Art page")
    page.goto("https://twitch.tv/directory/game/Art")
    # Wait for the DIV selector that signals the content has loaded
    print("Wait for the div selector with the directory-first-item attribute")
    page.wait_for_selector("div[data-target=directory-first-item]")
    # Launch the parsing process, create the «parsed» list
    print("Launch Playwright parsing")
    parsed = []
    # Find all div elements with the 'tw-tower' class and their child div elements with a data-target attribute. These are the video cards.
    stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
    # Find titles, links, and other elements in each video card
    for box in stream_boxes.element_handles():
        parsed.append({
            # Find the title in the H3 element and extract its text content
            "title": box.query_selector("h3").inner_text(),
            # Find the link by the «tw-link» class and extract the href attribute
            "url": box.query_selector(".tw-link").get_attribute("href"),
            # Find the author's name using the same «tw-link» class, but extract the text content
            "username": box.query_selector(".tw-link").inner_text(),
            # Find the view count by the «tw-media-card-stat» class and extract the text content
            "viewers": box.query_selector(".tw-media-card-stat").inner_text(),
            # Find the tag by the «tw-tag» class; store None if nothing is found
            "tags": box.query_selector(".tw-tag").inner_text() if box.query_selector(".tw-tag") else None,
        })
    # For all videos in the «parsed» list
    for video in parsed:
        # Print the lines with the elements found
        print("Found the following video information:")
        print(video)
    # Close the browser when done
    browser.close()
Save the edits (you can leave the file open).
Now open the terminal and go to the folder with our Playwright scraping script:
cd C:\Playwright-scraper
Launch the script in the Python environment and wait for it to finish:
python First-Playwright-web-scraping-script.py
A new browser window should open; leave it as it is :)
That’s it. Our first Playwright scraper is up and running!
If the script doesn't return any data, the website's structure has probably changed. Verify the selectors using the Developer Tools in your browser, update them accordingly, and restart the script.
Expanding the Capabilities of the Parser
As mentioned earlier, Playwright's built-in parsing capabilities are fairly limited. In the example above, XPath syntax is used, which is inconvenient for complex structures and for horizontal navigation within the element tree (e.g., selecting the first, second, or any other element on the same level).
Analyzing the resulting HTML using specialized libraries such as Parsel or BeautifulSoup is much more convenient.
Here’s how the script would look using BeautifulSoup (make sure to install the library first with «pip install beautifulsoup4»):
# Import libraries; in particular, we need the synchronous API
from playwright.sync_api import sync_playwright
# Import the BeautifulSoup library
from bs4 import BeautifulSoup

# Define the browser instance settings here
with sync_playwright() as pw:
    # Select the Chromium browser and disable headless mode for it
    browser = pw.chromium.launch(headless=False)
    # Create a new context with window parameters: width 1366 pixels, height 700; replace them with your own if needed
    context = browser.new_context(viewport={"width": 1366, "height": 700})
    # Open a new tab
    page = context.new_page()
    print("Open the browser")
    # Navigate to the page https://twitch.tv/directory/game/Art
    print("Navigate to the page twitch.tv/directory/game/Art")
    page.goto("https://twitch.tv/directory/game/Art")
    # Wait for the DIV selector that signals the content has loaded
    print("Wait for the div selector with the directory-first-item attribute")
    page.wait_for_selector("div[data-target=directory-first-item]")
    # Hand the rendered HTML over to BeautifulSoup and create the «parsed» list
    print("Launch Playwright parsing")
    soup = BeautifulSoup(page.content(), "html.parser")
    parsed = []
    # Find all elements with the 'tw-tower' class, then locate the child divs with a data-target attribute. These are the video cards.
    for item in soup.select(".tw-tower div[data-target]"):
        parsed.append({
            # Find the title in the H3 element and extract its text content
            'title': item.select_one('h3').text,
            # Find the link by the «tw-link» class and extract the href attribute
            'url': item.select_one('.tw-link').attrs.get("href"),
            # Find the author's name using the same «tw-link» class, but extract the text content
            'username': item.select_one('.tw-link').text,
            # Find the tags by the «tw-tag» class and extract them as a list
            'tags': [tag.text for tag in item.select('.tw-tag')],
            # Find the view count by the «tw-media-card-stat» class and extract the text content
            'viewers': item.select_one('.tw-media-card-stat').text,
        })
    # Print the «parsed» list
    for video in parsed:
        print("Extracted the following video details:")
        print(video)
    # Close the browser when done
    browser.close()
Simulating User Actions (Button Clicks, Text Input)
Let’s open a specific web page using Playwright, locate the search form on the page, and use it to perform a search.
This is what we’ll do in our script:
- Navigate to the https://twitch.tv/directory/game/Art page.
- Open the search bar (it is hidden behind a dedicated button by default).
- Type a search query into the form.
- Press the Enter key.
You can also use the .click() method on target elements. However, you must provide a unique element identifier when working with a headless browser: if multiple elements match, the browser won’t know which one to “click”:
search_button.click()
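If your locator matches several elements, narrow it down before clicking. A minimal sketch (the selector below is illustrative, not taken from Twitch):

# If the locator matches several elements, pick an exact one before clicking
search_button = page.locator("button[aria-label='Search']").first  # or .nth(0), or a stricter selector
search_button.click()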
Here is a script sample:
# Import libraries; in particular, we need the synchronous API
from playwright.sync_api import sync_playwright

# Define the browser instance settings here
with sync_playwright() as pw:
    # Select the Chromium browser and disable headless mode for it
    browser = pw.chromium.launch(headless=False)
    # Create a new context with window parameters: width 1366 pixels, height 700; replace them with your own if needed
    context = browser.new_context(viewport={"width": 1366, "height": 700})
    # Open a new tab
    page = context.new_page()
    print("Open the browser")
    # Navigate to the https://twitch.tv/directory/game/Art page
    print("Navigate to the twitch.tv/directory/game/Art page")
    page.goto("https://twitch.tv/directory/game/Art")
    # Find the search input field
    print("Use the search bar")
    search_box = page.locator('input[autocomplete="twitch-nav-search"]')
    # Type the query character by character with a 100 ms delay
    print("Type the Minecraft text with a 100 ms delay")
    search_box.type("Minecraft", delay=100)
    # Press the Enter key when finished
    print("Press the Enter key")
    search_box.press("Enter")
    # Click the link in the channels found
    page.locator('.search-results .tw-link[href*="all/tags"]').click()
Scrolling Pages (for Websites with Unlimited Feed and Dynamic Content Loading)
Many modern websites use infinite scrolling instead of traditional pagination. As you scroll down, a trigger fires and new content is loaded.
Technically, you can’t "reach the end" of such a page (get to the footer): new content keeps loading. However, this very behavior can be leveraged for parsing instead of traditional page-by-page navigation (although much depends on how content is selected for the feed).
In Playwright, we can use the scroll_into_view_if_needed() method to ensure new content loads, letting us scrape large amounts of video-related data.
Here is what the updated Playwright script will look like:
# Import libraries; in particular, we need the synchronous API
from playwright.sync_api import sync_playwright
# Import the BeautifulSoup library
from bs4 import BeautifulSoup

# Define the browser instance settings here
with sync_playwright() as pw:
    # Select the Chromium browser and disable headless mode for it
    browser = pw.chromium.launch(headless=False)
    # Create a new context with window parameters: width 1366 pixels, height 700; replace them with your own if needed
    context = browser.new_context(viewport={"width": 1366, "height": 700})
    # Open a new tab
    page = context.new_page()
    print("Open the browser")
    # Navigate to the https://twitch.tv/directory/game/Art page
    print("Navigate to the twitch.tv/directory/game/Art page")
    page.goto("https://twitch.tv/directory/game/Art")
    # Wait for the DIV selector that signals the content has loaded
    print("Wait for the div selector with the directory-first-item attribute")
    page.wait_for_selector("div[data-target=directory-first-item]")
    # Infinite loading: keep scrolling to the last element in the list until no new videos load... BE CAUTIOUS!
    stream_boxes = None
    print("Activate the infinite feed loading")
    while True:
        stream_boxes = page.locator("//div[contains(@class,'tw-tower')]/div[@data-target]")
        # Scroll the last element into view; this triggers the loading of new cards
        stream_boxes.element_handles()[-1].scroll_into_view_if_needed()
        items_on_page = len(stream_boxes.element_handles())
        print("Elements:", items_on_page)
        print("Wait for 10 seconds")
        page.wait_for_timeout(10000)  # Replace the parameter (in milliseconds) with your own if needed
        # Count the elements again after scrolling
        items_on_page_after_scroll = len(stream_boxes.element_handles())
        print("Elements after scrolling:", items_on_page_after_scroll)
        # While the number of videos keeps growing, continue scrolling
        if items_on_page_after_scroll > items_on_page:
            continue
        else:
            break  # Enough, stop scrolling
    print("No new videos, stop scrolling")
    # Start parsing, create the «parsed» list
    print("Launch Playwright parsing")
    soup = BeautifulSoup(page.content(), "html.parser")
    parsed = []
    # Find all elements with the 'tw-tower' class and their child div elements with a data-target attribute. These are the video cards.
    for item in soup.select(".tw-tower div[data-target]"):
        parsed.append({
            # Find the title in the H3 element and extract its text content
            'title': item.select_one('h3').text,
            # Find the link by the «tw-link» class and extract the href attribute
            'url': item.select_one('.tw-link').attrs.get("href"),
            # Find the author's name using the same «tw-link» class, but extract the text content
            'username': item.select_one('.tw-link').text,
            # Find the tags by the «tw-tag» class and extract them as a list
            'tags': [tag.text for tag in item.select('.tw-tag')],
            # Find the view count by the «tw-media-card-stat» class and extract the text content
            'viewers': item.select_one('.tw-media-card-stat').text,
        })
    # For all videos in the «parsed» list
    for video in parsed:
        # Print the lines with the elements found
        print("Found the following information about the video:")
        print(video)
    # Close the browser when done
    browser.close()
Advanced Web Scraping with Playwright
- Execution of custom JavaScript. You can write any code and bind it to the required triggers; for example, this could be your own algorithm for scrolling a page (this and the next two capabilities are sketched after this list).
- Intercepting server requests and responses. First, this lets you see exactly what the browser sends to the website/web application. Second, you can configure specific events as triggers and use them in your parsing script.
- Blocking resources. To reduce the amount of traffic consumed, you can block the loading of specific content like videos or images. This can include any file types, specific URLs, or resource names. For example, you can disable CDN services, JavaScript scripts, CSS files, etc.
- Using proxies. Proxies allow parallelizing requests to a website and speeding up data collection processes. Additionally, they prevent IP-based blocking, ensuring that your parsing process is not interrupted.
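Before the proxy example, here is a rough sketch of the first three capabilities in the Python API (the URL and patterns are illustrative):

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    # Block images and media files to save traffic (the pattern is illustrative)
    page.route("**/*.{png,jpg,jpeg,webp,mp4}", lambda route: route.abort())
    # Intercept responses: log the status and URL of everything the browser receives
    page.on("response", lambda response: print(response.status, response.url))
    page.goto("https://example.com")
    # Execute custom JavaScript in the page context, e.g., read the scroll height
    height = page.evaluate("() => document.body.scrollHeight")
    print("Page height:", height)
    browser.close()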
Here is a sample of how proxies are connected to Playwright parsing:
# Import libraries; in particular, we need the synchronous API
from playwright.sync_api import sync_playwright

# Enable the browser instance
with sync_playwright() as pw:
    # Load Chromium as an example and define its launch parameters
    browser = pw.chromium.launch(
        # Headless mode is disabled here only for demonstration purposes
        headless=False,
        # This is how a plain proxy server is specified; if you don't have a login/password, uncomment the line below and delete the variant with credentials
        # proxy={"server": "112.113.114.115:9999"},
        # And this is the variant with a login and password:
        proxy={"server": "112.113.114.115:9999", "username": "HERE-LOGIN", "password": "HERE-PASSWORD"},
    )
    # The rest is as usual: open a new tab
    page = browser.new_page()
    # Here is your parser code…
Conclusion and Recommendations
Web scraping with Playwright is simple and functional, as the library’s developers have covered most possible use cases. It includes everything you might need, from intercepting server responses to working through proxies. There’s even a built-in parser, though it’s less convenient than BeautifulSoup.
With Playwright, you can scrape any modern website or web application without worrying about asynchronous JavaScript execution. If needed, you can significantly save bandwidth and block unnecessary content.
The Playwright API syntax is the same across all programming languages and does not depend on the browsers where the target sites are opened.
However, Playwright is not a silver bullet. The library has to work with your code and in combination with other libraries.
One of the most crucial elements of any parser designed to collect large amounts of data is proxies.
You can find and purchase high-quality rotating proxies from us! Froxy provides access to a network of over 10 million IPs and offers traffic packages for mobile, residential, and server proxies. Connecting proxies to your scrapers takes just a single line of code!