We have already covered various web drivers, libraries, and frameworks for building custom website scraping programs. Among the most popular solutions for managing headless browsers are Playwright, Selenium, and Puppeteer. Each has its own features and application areas, as well as specific requirements regarding the programming languages used.
In this article, we will discuss the Puppeteer library and how it helps test applications and handle web parsing tasks. We will also provide a step-by-step guide on Puppeteer web scraping and explore the most common use cases.
Puppeteer is a Node.js library for controlling Google Chrome and Chromium browsers via the Chrome DevTools Protocol (CDP). It is an open-source project developed by the same Google team responsible for Chrome. The library was originally created to demonstrate the capabilities of the browser's built-in API.
Before 2017, Chrome could only be automated through third-party web driver tools such as Selenium, as it lacked a native automation API; external solutions were responsible for providing automation capabilities.
After the CDP (Chrome DevTools Protocol) was released, it became possible to communicate directly with the browser using special flags, console commands, and API interfaces.
Puppeteer was developed to showcase these capabilities.
Later, CDP support was also integrated into existing web drivers (Playwright, Selenium, etc.), and new high-level frameworks like Chromedp emerged.
However, this article is mainly about Puppeteer web scraping.
Note! Each Puppeteer release is tightly linked to a specific Chromium version. This is because Puppeteer is a reference implementation of browser automation that fully utilizes the Chrome API.
However, this approach comes with some limitations. The Puppeteer team focuses on a single programming language and does not handle cross-browser orchestration the way Playwright or Selenium do. As a result, Puppeteer is best suited for specific tasks, particularly web scraping with JavaScript.
To start Puppeteer web scraping, you first need to install and set up a Node.js environment. This guide will focus on Windows, as this is the most commonly used platform among general users. However, you can also set up Linux (either as a standalone OS, through WSL in Windows, or via virtualization tools like VMware or VirtualBox) or use containerization platforms like Docker.
On many Linux distributions, Node.js can be installed via the default package manager or through the Snap package system.
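For example, on Debian/Ubuntu-based systems the commands typically look something like this (exact package names and available versions vary by distribution):
# Install Node.js and npm from the distribution repositories
sudo apt install nodejs npm
# Or install Node.js as a Snap package instead
sudo snap install node --classic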
We will go with the more straightforward option: Node.js, and therefore any Puppeteer web scraping script, runs on Windows natively, with no virtualization layers.
Download the installer package from the official Node.js website. There, you will also find example commands for installing Node.js from the console (PowerShell).
If you need a Node.js version manager, you can complete the setup via nvm, fnm, Brew, or Docker instead.
The default Node.js installation path is "C:\Program Files\nodejs\", but you can change it as you like.
Remember to allow the npm package manager to be installed and to add Node.js to the Windows PATH environment variable.
Wait for the installation to complete (the installer may also download and install additional tools, such as the Chocolatey package manager, Microsoft Visual Studio build tools, the .NET environment, and the Python programming language).
First, create the directory that will store your Puppeteer web scraping script:
cd c:\
mkdir puppeteer-web-scraping-script
cd puppeteer-web-scraping-script
That's it - we are in the required directory. Now check the Node.js and npm versions using the following commands:
npm -v
node --version
The console should print the installed npm and Node.js version numbers (the exact numbers depend on the release you installed).
Install Puppeteer:
npm install puppeteer
New directories and files will appear in your project folder (the node_modules directory will contain around a hundred sub-folders, including @puppeteer, chromium-bidi, devtools-protocol, http-proxy-agent, etc.).
Create a new text file in the root of the project directory and name it, for example, "first-puppeteer-web-scraping-script". Replace the .txt extension with .js, so you end up with "first-puppeteer-web-scraping-script.js". Open the file in any text editor (we use Notepad++) and paste in the code below:
const puppeteer = require('puppeteer')
// Connect (import) the Puppeteer library

// The main function runs asynchronously
async function run(){
  // Launch a headless-browser instance (asynchronously)
  const browser = await puppeteer.launch({
    // Disable headless mode so that everything the script does is visible in a real window.
    // If you want the browser to run in the background with no window, set this flag to true.
    headless: false,
    // Ask the browser to ignore HTTPS connection errors
    ignoreHTTPSErrors: true,
  })
  // Open a new tab in the browser
  let page = await browser.newPage();
  // Navigate to a specific URL
  await page.goto('http://httpbin.org/html', {
    // Wait until the DOM has been built
    waitUntil: 'domcontentloaded',
  });
  // Output the page's HTML content to the console
  console.log(await page.content());
  // Close the tab and the browser
  await page.close();
  await browser.close();
}
run();
Save the file and run the script using the command:
node first-puppeteer-web-scraping-script.js
A Chrome browser window will open. Then it will close and the console will display the web page HTML.
That's it - our first scraper is ready! It opened the browser (the browser can also run in the background, with no graphical interface), loaded a specific page, and retrieved its HTML code.
We used the following Puppeteer features: puppeteer.launch() to start the browser, browser.newPage() to open a tab, page.goto() with the waitUntil option to load the page, page.content() to read its HTML, and page.close() / browser.close() to shut everything down.
If needed, you can pass the returned HTML to a parser to select (and save) only the required elements. We'll cover that below.
The 'domcontentloaded' option corresponds to the standard DOMContentLoaded browser event, which fires once the HTML has been parsed and the browser has built the DOM.
Alternatively, you can wait for a specific HTML tag or selector to appear, with a timeout. For example:
await page.waitForSelector('h2', {timeout: 3_000})
// Wait for an H2 tag to appear (timeout: 3 seconds)
Now, let's take our web scraper a step further by selecting and extracting only specific elements from a webpage.
Puppeteer supports various query selector methods, including CSS selectors, XPath queries, search by name, role, text, prefixes, and Shadow DOM (see the query syntax documentation).
Puppeteer's Page class includes a range of built-in methods for simulating user actions, managing browser sessions, handling cookies, taking screenshots and screencasts, generating PDF versions of pages, setting custom user agents, and more. Read the details in the official documentation.
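For illustration, here is a minimal sketch of a few of those Page methods (it assumes a page object has already been opened with browser.newPage(), as in the scripts below; note that PDF generation generally requires headless mode):
// Set a custom user agent for all requests from this tab
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
// Save a full-page screenshot to disk
await page.screenshot({ path: 'page.png', fullPage: true });
// Generate a PDF version of the page
await page.pdf({ path: 'page.pdf', format: 'A4' });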
For now, we are focused on the element selection methods: page.$() returns the first element matching a selector, and page.$$() returns an array of all matching elements.
There are also "eval" variations - page.$eval() and page.$$eval() - which select elements and immediately execute a JavaScript function on them inside the page, as shown in the sketch below.
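A minimal sketch of these four methods, assuming a page object is already open on the product test page used below:
// First element matching a CSS selector (or null if nothing matches)
const firstLink = await page.$('.mb-0 a');
// All matching elements (an array of element handles)
const allLinks = await page.$$('.mb-0 a');
// Select the first match and run a function on it inside the page
const firstTitle = await page.$eval('.mb-0 a', el => el.innerText);
// Select all matches and run a function on the whole list
const allTitles = await page.$$eval('.mb-0 a', els => els.map(el => el.innerText));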
Here is a sample script that extracts product information from a dedicated test page:
const puppeteer = require('puppeteer')
// Import the Puppeteer library

// The main function runs asynchronously
async function run(){
  // Launch the browser instance (asynchronously)
  const browser = await puppeteer.launch({
    // Disable headless mode so that everything the script does is visible.
    // If you want the browser to run in the background with no window, set this flag to true.
    headless: false,
    // Ask the browser to ignore HTTPS connection errors
    ignoreHTTPSErrors: true,
  })
  // Open a new tab in the browser
  let page = await browser.newPage();
  // Navigate to a specific URL
  // In our case, this is an example page with product cards
  await page.goto('https://web-scraping.dev/products', {
    // Wait until the DOM has been built
    waitUntil: 'domcontentloaded',
  });
  // As an example, find the first product heading (it has the "mb-0" class)
  // ...and copy the text of the <a> tag inside it
  let textfirsturl = await (await page.$('.mb-0 a')).evaluate(node => node.innerText);
  // Copy the link from the same tag
  let firsturl = await (await page.$('.mb-0 a')).evaluate(node => node.getAttribute("href"));
  // Output the result to the console
  console.log("First Product:", textfirsturl, "Its URL:", firsturl);
  // Now find all product names and links at once
  // page.$$() returns an array of matching elements
  let alllinks = await page.$$('.mb-0 a');
  // Loop over the links and print the name and URL of each product
  for (const link of alllinks){
    console.log("Product:", await link.evaluate(node => node.innerText));
    console.log("URL:", await link.evaluate(node => node.getAttribute("href")));
  }
  // Close the tab and the browser
  await page.close();
  await browser.close();
}
run();
Save and run the Puppeteer web scraping script.
The trickiest part of scraping dynamic websites (those that update based on JavaScript events) is processing infinite scrolling. Each new scrolling attempt loads fresh content as new elements are appended to the end of the list.
Below is a Puppeteer web scraping script for a list of reviews with infinite loading.
You can adjust the delay parameters and the maximum number of scroll attempts as needed.
const puppeteer = require('puppeteer')
// Connect (import) the Puppeteer library

// First, define a helper function that handles the scrolling
async function scrollPageDown(page) {
  // Previous page height
  let prevHeight = -1;
  // Maximum number of scroll iterations (if the page stops loading new data earlier, the loop ends early)
  let maxScrolls = 50;
  // Counter for scroll attempts
  let scrollCount = 0;
  // The loop stops when the maximum number of iterations is reached
  // or when the page stops loading new content
  while (scrollCount < maxScrolls) {
    // Scroll to the bottom of the page, using the current height of the body element
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    // Wait for new content to load (at least 1 second)
    await new Promise(resolve => setTimeout(resolve, 1000));
    // Get the new page height
    let newHeight = await page.evaluate('document.body.scrollHeight');
    // If the new height equals the previous one...
    if (newHeight == prevHeight) {
      // ...output the current scroll counter
      console.log("Scrolls Count:", scrollCount);
      // ...and exit the loop
      break;
    }
    // Store the new body height as the previous one
    prevHeight = newHeight;
    // Increment the scroll counter
    scrollCount += 1;
  }
};

// Describe the logic of the Puppeteer web scraping function
async function parseReviews(page) {
  // Find all elements with the "testimonial" class, used as the root element of each review block
  let elements = await page.$$('.testimonial');
  // Create an array for the results
  let results = [];
  // Loop over the collected elements
  for (let element of elements) {
    // Extract the rating; in our case these are the star pictograms (SVG icons)
    let rate = await element.$$eval('span.rating > svg', elements => elements.map(el => el.innerHTML))
    results.push({
      // Collect the review text from the element with the "text" class
      "text": await element.$eval('.text', node => node.innerHTML),
      // Count the number of stars
      "rate": rate.length
    });
  }
  // Return the array
  return results;
};

// Describe the logic of the main function - what to launch and where
async function run(){
  // Launch a headless-browser instance via Puppeteer
  const browser = await puppeteer.launch({
    // Disable headless mode so the process is visible
    headless: false,
    // Ignore HTTPS errors
    ignoreHTTPSErrors: true,
  });
  // Open a new tab
  let page = await browser.newPage();
  // Navigate to the target page
  await page.goto('https://web-scraping.dev/testimonials/');
  // Scroll down (until we hit the iteration limit or the page stops growing in height)
  await scrollPageDown(page);
  // Build an array of reviews and fill it with data (parse the page with Puppeteer)
  let reviews = await parseReviews(page);
  // Close the browser
  await browser.close();
  // Output the array to the console
  console.log(reviews);
};
run();
After running the Puppeteer web scraping, the console will display the extracted reviews.
In our test case, the script stopped after 7 scroll attempts when no new content was detected.
Another common scenario in Puppeteer web scraping involves navigating through paginated lists.
Product listings on our test target website are static, but each page displays only 10 items. To scrape all products, we need a script that modifies the pagination parameter and inserts it into the URL of each new page.
Below is our Puppeteer-based web scraper:
const puppeteer = require("puppeteer");
// Connect the Puppeteer library
// Describe the logic of parsing a certain web page
async function parseProducts(page) {
// Search all blocks with product descriptions, this is the div-element with "row" and "product" classes
let boxes = await page.$$('div.row.product');
// Create an array
let results = [];
// Browse the elements in the cycle
for(let box of boxes) {
results.push({
// The product title hides behind the “a" tag, but you need to collect the entire text content
"title": await box.$eval('a', node => node.innerHTML),
// Link to the product page. The same "a" tag, but this time we collect the meaning of the Href attribute
"link": await box.$eval('a', node => node.getAttribute('href')),
// Product price, specified in the div block with the price class
"price": await box.$eval('div.price', node => node.innerHTML)
})
}
return results;
}
// Describe the logic of the main function of Puppeteer parsing
async function run(){
//Download the headless browser
const browser = await puppeteer.launch({
// Make it visible (disable the headless mode)
headless: false,
// Ignore HTTPS errors
ignoreHTTPSErrors: true,
});
// Open a new page
page = await browser.newPage();
// Create an array
data = [];
// Currently, the number of pages to browse is less than 5 (that is, not more than 4)
for (let i=1; i < 5; i++) {
//Access the page with the required pagination number, getting the number from the cycle iteration (the cycle and the number are the same)
await page.goto(`https://web-scraping.dev/products?page=${i}`)
// Fill the array with product descriptions (parsing)
products = await parseProducts(page)
// Append the data array with that of the parsing page
data.push(...products);
}
// Output the array to the console
console.log(data);
// Close the browser
browser.close();
}
run();
You can disable unnecessary resources like images, videos, and fonts to save bandwidth and speed up page loading. Puppeteer web scraping allows request interception, enabling us to block specific resource types or domains:
// List the resource types to block, for example:
const blockResourceType = ['image', 'media', 'font', 'stylesheet'];
// List the resource names (domains) to block, for example:
const blockResourceName = ['cdn.api.twitter', 'fontawesome', 'google-analytics', 'googletagmanager'];

// Open a new tab in an already launched browser
const page = await browser.newPage();
// Activate request interception
await page.setRequestInterception(true);
// Now we can inspect every request the browser makes
// and decide which requests go through and which are blocked
page.on('request', request => {
  const requestUrl = request.url().split('?')[0];
  if (
    // If the resource type is in the list of blocked types...
    blockResourceType.includes(request.resourceType()) ||
    // ...or the URL matches one of the blocked resource names...
    blockResourceName.some(resource => requestUrl.includes(resource))
  ) {
    // ...abort the outgoing request
    request.abort();
  // Otherwise
  } else {
    // ...let it through as usual
    request.continue();
  }
});
Everything here is quite simple. All you need to do is pass the proxy parameters when launching the browser, for example:
// Launch a new instance of the headless browser
const browser = await puppeteer.launch({
  args: [ '--proxy-server=http://123.456.123.456:8080' ]
});
// The rest of the code
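If your proxy requires a username and password, Puppeteer can pass the credentials at the page level via page.authenticate(). A minimal sketch (the credentials below are placeholders):
const page = await browser.newPage();
// Send proxy credentials for this tab (placeholder values)
await page.authenticate({ username: 'proxy_user', password: 'proxy_pass' });
await page.goto('http://httpbin.org/ip');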
This approach has one drawback - the proxy is set at the browser level. Consequently, to route a stream of requests through a different proxy server, you need to open another instance of the browser.
There are various workarounds for forced proxy rotation at the page and request levels, including the proxy-chain package for Node.js.
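As an illustration, here is a minimal sketch of how the proxy-chain package can wrap an upstream (possibly authenticated) proxy into a local one that Puppeteer can use; the upstream proxy URL is a placeholder, and the package must be installed with npm install proxy-chain:
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  // Wrap the upstream proxy (placeholder URL) into a local anonymous proxy
  const proxyUrl = await proxyChain.anonymizeProxy('http://user:pass@123.456.123.456:8080');
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`]
  });
  // ...scraping logic goes here...
  await browser.close();
  // Shut down the local proxy created by proxy-chain
  await proxyChain.closeAnonymizedProxy(proxyUrl, true);
})();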
However, we recommend a more efficient solution - using rotating proxies. For example, you can rent rotating mobile, residential, and datacenter proxies from us.
In this case, our service will handle proxy rotation. Backconnect proxies will connect to the parser only once. You’ll simply have to define the logic for rotating outgoing IP addresses in your personal account - based on time intervals or with each new request.
This approach significantly reduces the likelihood of the parser getting blocked and simplifies the writing process.
More details are available in our dedicated article: How to Parse Without Getting Blocked.
A few concise tips are listed below:
Various scripts on web servers can detect a headless browser driven by Puppeteer based on specific attributes in HTTP request headers and in the JavaScript environment. To avoid this, you need to hide these attributes using specialized plugins such as puppeteer-extra-plugin-stealth (the stealth plugin for puppeteer-extra).
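A minimal sketch of wiring up the stealth plugin (it assumes the puppeteer-extra and puppeteer-extra-plugin-stealth packages have been installed via npm):
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin before launching the browser
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('http://httpbin.org/headers');
  console.log(await page.content());
  await browser.close();
})();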
Puppeteer is a powerful tool for automating testing scripts and building fully functional web scrapers. It provides everything needed for user behavior emulation, HTTP request interception, blocking unnecessary resources, parsing HTML content, and even working with proxies.
However, there are some nuances. The tool officially supports only one programming language (JavaScript), and a proxy is bound to a browser instance. Rotating proxies are essential for block-free scraping and easy scaling.
Froxy provides quality mobile and residential proxies with rotation. We offer 10 million IPs, rotating by time and per new request.