We have already covered various web drivers, libraries, and frameworks for building custom website scraping programs. Among the most popular solutions for managing headless browsers are Playwright, Selenium, and Puppeteer. Each has its own features and application areas, as well as specific requirements regarding the programming languages used.
In this article, we will discuss the Puppeteer library and how it helps test applications and handle web parsing tasks. We will also provide a step-by-step guide on Puppeteer web scraping and explore the most common use cases.
Puppeteer is a Node.js library for controlling Google Chrome and Chromium browsers via the Chrome DevTools Protocol (CDP). It is an open-source project developed by the same Google team responsible for Chrome. The library was originally created to demonstrate the capabilities of the browser's built-in API.
Before 2017, Chrome could only be automated through third-party web driver tools such as Selenium, as it lacked a native automation API; external solutions were responsible for providing automation capabilities.
After the CDP (Chrome DevTools Protocol) was released, it became possible to communicate directly with the browser using special flags, console commands, and API interfaces.
Puppeteer was developed to showcase these capabilities.
Later, CDP support was also integrated into existing web drivers (Playwright, Selenium, etc.), and new high-level frameworks like Chromedp emerged.
However, this article is mainly about Puppeteer web scraping.
Note! Each Puppeteer release is tightly linked to a specific Chromium version. This is because Puppeteer is a reference implementation of browser automation that fully utilizes the Chrome API.
However, this approach comes with some limitations. The Puppeteer team focuses on a single programming language and does not handle cross-browser orchestration the way Playwright or Selenium do. As a result, Puppeteer is best suited for specific tasks, particularly web scraping with JavaScript.
To start Puppeteer web scraping, you first need to install and set up a Node.js environment. This guide will focus on Windows, as this is the most commonly used platform among general users. However, you can also set up Linux (either as a standalone OS, through WSL in Windows, or via virtualization tools like VMware or VirtualBox) or use containerization platforms like Docker.
On many Linux distributions, Node.js can be installed via the default package manager or through the Snap package system.
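For example, on Debian/Ubuntu-based systems the commands typically look something like this (exact package names and available versions vary by distribution):
# Install Node.js and npm from the distribution repositories
sudo apt install nodejs npm
# Or install Node.js as a Snap package instead
sudo snap install node --classic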
We will go with the more straightforward option: Node.js, and therefore any Puppeteer web scraping script, runs on Windows natively, with no virtualization layers.
Download the installer package from the official Node.js website. There, you will also find example commands for installing Node.js from the console (PowerShell).
If you need a Node.js version manager, you can complete the setup via nvm, fnm, Brew, or Docker instead.
The default Node.js installation path is "C:\Program Files\nodejs\", but you can change it as you like.
Remember to allow the npm package manager to be installed and to add Node.js to the Windows PATH environment variable.
Wait for the installation to complete (the installer may also download and install additional tools, such as the Chocolatey package manager, Microsoft Visual Studio build tools, the .NET environment, and the Python programming language).
First, create the directory that will store your Puppeteer web scraping script:
cd c:\
mkdir puppeteer-web-scraping-script
cd puppeteer-web-scraping-script
That's it - we are in the required directory. Now check the Node.js and npm versions using the following commands:
npm -v
node --version
The console should print the installed npm and Node.js version numbers (the exact numbers depend on the release you installed).
Install Puppeteer:
npm install puppeteer
New directories and files will appear in your project folder (the node_modules directory will contain around a hundred sub-folders, including @puppeteer, chromium-bidi, devtools-protocol, http-proxy-agent, etc.).
Create a new text file in the root of the project directory and name it, for example, "first-puppeteer-web-scraping-script". Replace the .txt extension with .js, so you end up with "first-puppeteer-web-scraping-script.js". Open the file in any text editor (we use Notepad++) and paste in the code below:
const puppeteer = require('puppeteer')
// Connect (import) the Puppeteer library

// The main function runs asynchronously
async function run(){
  // Launch a headless-browser instance (asynchronously)
  const browser = await puppeteer.launch({
    // Disable headless mode so that everything the script does is visible in a real window.
    // If you want the browser to run in the background with no window, set this flag to true.
    headless: false,
    // Ask the browser to ignore HTTPS connection errors
    ignoreHTTPSErrors: true,
  })
  // Open a new tab in the browser
  let page = await browser.newPage();
  // Navigate to a specific URL
  await page.goto('http://httpbin.org/html', {
    // Wait until the DOM has been built
    waitUntil: 'domcontentloaded',
  });
  // Output the page's HTML content to the console
  console.log(await page.content());
  // Close the tab and the browser
  await page.close();
  await browser.close();
}
run();
Save the file and run the script using the command:
node first-puppeteer-web-scraping-script.js
A Chrome browser window will open. Then it will close and the console will display the web page HTML.
That's it - our first scraper is ready! It opened the browser (the browser can also run in the background, with no graphical interface), loaded a specific page, and retrieved its HTML code.
We used the following Puppeteer features: puppeteer.launch() to start the browser, browser.newPage() to open a tab, page.goto() with the waitUntil option to load the page, page.content() to read its HTML, and page.close() / browser.close() to shut everything down.
If needed, you can pass the returned HTML to a parser to select (and save) only the required elements. We'll cover that below.
The 'domcontentloaded' option corresponds to the standard DOMContentLoaded browser event, which fires once the HTML has been parsed and the browser has built the DOM.
Alternatively, you can wait for a specific HTML tag or selector to appear, with a timeout. For example:
await page.waitForSelector('h2', {timeout: 3_000})
// Wait for an H2 tag to appear (timeout: 3 seconds)
Now, let's take our web scraper a step further by selecting and extracting only specific elements from a webpage.
Puppeteer supports various query selector methods, including CSS selectors, XPath queries, search by name, role, text, prefixes, and Shadow DOM (see the query syntax documentation).
Puppeteer's Page class includes a range of built-in methods for simulating user actions, managing browser sessions, handling cookies, taking screenshots and screencasts, generating PDF versions of pages, setting custom user agents, and more. Read the details in the official documentation.
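For illustration, here is a minimal sketch of a few of those Page methods (it assumes a page object has already been opened with browser.newPage(), as in the scripts below; note that PDF generation generally requires headless mode):
// Set a custom user agent for all requests from this tab
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
// Save a full-page screenshot to disk
await page.screenshot({ path: 'page.png', fullPage: true });
// Generate a PDF version of the page
await page.pdf({ path: 'page.pdf', format: 'A4' });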
For now, we are focused on the element selection methods: page.$() returns the first element matching a selector, and page.$$() returns an array of all matching elements.
There are also "eval" variations - page.$eval() and page.$$eval() - which select elements and immediately execute a JavaScript function on them inside the page, as shown in the sketch below.
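A minimal sketch of these four methods, assuming a page object is already open on the product test page used below:
// First element matching a CSS selector (or null if nothing matches)
const firstLink = await page.$('.mb-0 a');
// All matching elements (an array of element handles)
const allLinks = await page.$$('.mb-0 a');
// Select the first match and run a function on it inside the page
const firstTitle = await page.$eval('.mb-0 a', el => el.innerText);
// Select all matches and run a function on the whole list
const allTitles = await page.$$eval('.mb-0 a', els => els.map(el => el.innerText));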
Here is a sample script that extracts product information from a dedicated test page:
const puppeteer = require('puppeteer')
// Import the Puppeteer library

// The main function runs asynchronously
async function run(){
  // Launch the browser instance (asynchronously)
  const browser = await puppeteer.launch({
    // Disable headless mode so that everything the script does is visible.
    // If you want the browser to run in the background with no window, set this flag to true.
    headless: false,
    // Ask the browser to ignore HTTPS connection errors
    ignoreHTTPSErrors: true,
  })
  // Open a new tab in the browser
  let page = await browser.newPage();
  // Navigate to a specific URL
  // In our case, this is an example page with product cards
  await page.goto('https://web-scraping.dev/products', {
    // Wait until the DOM has been built
    waitUntil: 'domcontentloaded',
  });
  // As an example, find the first product heading (it has the "mb-0" class)
  // ...and copy the text of the <a> tag inside it
  let textfirsturl = await (await page.$('.mb-0 a')).evaluate(node => node.innerText);
  // Copy the link from the same tag
  let firsturl = await (await page.$('.mb-0 a')).evaluate(node => node.getAttribute("href"));
  // Output the result to the console
  console.log("First Product:", textfirsturl, "Its URL:", firsturl);
  // Now find all product names and links at once
  // page.$$() returns an array of matching elements
  let alllinks = await page.$$('.mb-0 a');
  // Loop over the links and print the name and URL of each product
  for (const link of alllinks){
    console.log("Product:", await link.evaluate(node => node.innerText));
    console.log("URL:", await link.evaluate(node => node.getAttribute("href")));
  }
  // Close the tab and the browser
  await page.close();
  await browser.close();
}
run();
Save and run the Puppeteer web scraping script.
The trickiest part of scraping dynamic websites (those that update based on JavaScript events) is processing infinite scrolling. Each new scrolling attempt loads fresh content as new elements are appended to the end of the list.
Below is a Puppeteer web scraping script for a list of reviews with infinite loading.
You can adjust the delay parameters and the maximum number of scroll attempts as needed.
const puppeteer = require('puppeteer')
// Connect (import) the Puppeteer library

// First, define a helper function that handles the scrolling
async function scrollPageDown(page) {
  // Previous page height
  let prevHeight = -1;
  // Maximum number of scroll iterations (if the page stops loading new data earlier, the loop ends early)
  let maxScrolls = 50;
  // Counter for scroll attempts
  let scrollCount = 0;
  // The loop stops when the maximum number of iterations is reached
  // or when the page stops loading new content
  while (scrollCount < maxScrolls) {
    // Scroll to the bottom of the page, using the current height of the body element
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    // Wait for new content to load (at least 1 second)
    await new Promise(resolve => setTimeout(resolve, 1000));
    // Get the new page height
    let newHeight = await page.evaluate('document.body.scrollHeight');
    // If the new height equals the previous one...
    if (newHeight == prevHeight) {
      // ...output the current scroll counter
      console.log("Scrolls Count:", scrollCount);
      // ...and exit the loop
      break;
    }
    // Store the new body height as the previous one
    prevHeight = newHeight;
    // Increment the scroll counter
    scrollCount += 1;
  }
};

// Describe the logic of the Puppeteer web scraping function
async function parseReviews(page) {
  // Find all elements with the "testimonial" class, used as the root element of each review block
  let elements = await page.$$('.testimonial');
  // Create an array for the results
  let results = [];
  // Loop over the collected elements
  for (let element of elements) {
    // Extract the rating; in our case these are the star pictograms (SVG icons)
    let rate = await element.$$eval('span.rating > svg', elements => elements.map(el => el.innerHTML))
    results.push({
      // Collect the review text from the element with the "text" class
      "text": await element.$eval('.text', node => node.innerHTML),
      // Count the number of stars
      "rate": rate.length
    });
  }
  // Return the array
  return results;
};

// Describe the logic of the main function - what to launch and where
async function run(){
  // Launch a headless-browser instance via Puppeteer
  const browser = await puppeteer.launch({
    // Disable headless mode so the process is visible
    headless: false,
    // Ignore HTTPS errors
    ignoreHTTPSErrors: true,
  });
  // Open a new tab
  let page = await browser.newPage();
  // Navigate to the target page
  await page.goto('https://web-scraping.dev/testimonials/');
  // Scroll down (until we hit the iteration limit or the page stops growing in height)
  await scrollPageDown(page);
  // Build an array of reviews and fill it with data (parse the page with Puppeteer)
  let reviews = await parseReviews(page);
  // Close the browser
  await browser.close();
  // Output the array to the console
  console.log(reviews);
};
run();
After running the Puppeteer web scraping, the console will display the extracted reviews.
In our test case, the script stopped after 7 scroll attempts when no new content was detected.
Another common scenario in Puppeteer web scraping involves navigating through paginated lists.
Product listings on our test target website are static, but each page displays only 10 items. To scrape all products, we need a script that modifies the pagination parameter and inserts it into the URL of each new page.
Below is our Puppeteer-based web scraper:
const puppeteer = require("puppeteer");
// Connect the Puppeteer library
// Describe the logic of parsing a certain web page
async function parseProducts(page) {
// Search all blocks with product descriptions, this is the div-element with "row" and "product" classes
let boxes = await page.$$('div.row.product');
// Create an array
let results = [];
// Browse the elements in the cycle
for(let box of boxes) {
results.push({
// The product title hides behind the “a" tag, but you need to collect the entire text content
"title": await box.$eval('a', node => node.innerHTML),
// Link to the product page. The same "a" tag, but this time we collect the meaning of the Href attribute
"link": await box.$eval('a', node => node.getAttribute('href')),
// Product price, specified in the div block with the price class
"price": await box.$eval('div.price', node => node.innerHTML)
})
}
return results;
}
// Describe the logic of the main function of Puppeteer parsing
async function run(){
//Download the headless browser
const browser = await puppeteer.launch({
// Make it visible (disable the headless mode)
headless: false,
// Ignore HTTPS errors
ignoreHTTPSErrors: true,
});
// Open a new page
page = await browser.newPage();
// Create an array
data = [];
// Currently, the number of pages to browse is less than 5 (that is, not more than 4)
for (let i=1; i < 5; i++) {
//Access the page with the required pagination number, getting the number from the cycle iteration (the cycle and the number are the same)
await page.goto(`https://web-scraping.dev/products?page=${i}`)
// Fill the array with product descriptions (parsing)
products = await parseProducts(page)
// Append the data array with that of the parsing page
data.push(...products);
}
// Output the array to the console
console.log(data);
// Close the browser
browser.close();
}
run();
You can disable unnecessary resources like images, videos, and fonts to save bandwidth and speed up page loading. Puppeteer web scraping allows request interception, enabling us to block specific resource types or domains:
// List the resource types to block, for example:
const blockResourceType = ['image', 'media', 'font', 'stylesheet'];
// List the resource names (domains) to block, for example:
const blockResourceName = ['cdn.api.twitter', 'fontawesome', 'google-analytics', 'googletagmanager'];

// Open a new tab in an already launched browser
const page = await browser.newPage();
// Activate request interception
await page.setRequestInterception(true);
// Now we can inspect every request the browser makes
// and decide which requests go through and which are blocked
page.on('request', request => {
  const requestUrl = request.url().split('?')[0];
  if (
    // If the resource type is in the list of blocked types...
    blockResourceType.includes(request.resourceType()) ||
    // ...or the URL matches one of the blocked resource names...
    blockResourceName.some(resource => requestUrl.includes(resource))
  ) {
    // ...abort the outgoing request
    request.abort();
  // Otherwise
  } else {
    // ...let it through as usual
    request.continue();
  }
});
Everything here is quite simple. All you need to do is pass the proxy parameters when launching the browser, for example:
// Launch a new instance of the headless browser
const browser = await puppeteer.launch({
  args: [ '--proxy-server=http://123.456.123.456:8080' ]
});
// The rest of the code
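If your proxy requires a username and password, Puppeteer can pass the credentials at the page level via page.authenticate(). A minimal sketch (the credentials below are placeholders):
const page = await browser.newPage();
// Send proxy credentials for this tab (placeholder values)
await page.authenticate({ username: 'proxy_user', password: 'proxy_pass' });
await page.goto('http://httpbin.org/ip');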
This approach has one drawback - the proxy is set at the browser level. Consequently, to route a stream of requests through a different proxy server, you need to open another instance of the browser.
There are various workarounds for forced proxy rotation at the page and request levels, including the proxy-chain package for Node.js.
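As an illustration, here is a minimal sketch of how the proxy-chain package can wrap an upstream (possibly authenticated) proxy into a local one that Puppeteer can use; the upstream proxy URL is a placeholder, and the package must be installed with npm install proxy-chain:
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
  // Wrap the upstream proxy (placeholder URL) into a local anonymous proxy
  const proxyUrl = await proxyChain.anonymizeProxy('http://user:pass@123.456.123.456:8080');
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`]
  });
  // ...scraping logic goes here...
  await browser.close();
  // Shut down the local proxy created by proxy-chain
  await proxyChain.closeAnonymizedProxy(proxyUrl, true);
})();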
However, we recommend a more efficient solution - using rotating proxies. For example, you can rent rotating mobile, residential, and datacenter proxies from us.
In this case, our service will handle proxy rotation. Backconnect proxies will connect to the parser only once. You’ll simply have to define the logic for rotating outgoing IP addresses in your personal account - based on time intervals or with each new request.
This approach significantly reduces the likelihood of the parser getting blocked and simplifies the writing process.
More details are available in our dedicated article: How to Parse Without Getting Blocked.
A few concise tips are listed below:
Various scripts on web servers can detect a headless browser driven by Puppeteer based on specific attributes in HTTP request headers and in the JavaScript environment. To avoid this, you need to hide these attributes using specialized plugins such as puppeteer-extra-plugin-stealth (the stealth plugin for puppeteer-extra).
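A minimal sketch of wiring up the stealth plugin (it assumes the puppeteer-extra and puppeteer-extra-plugin-stealth packages have been installed via npm):
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin before launching the browser
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('http://httpbin.org/headers');
  console.log(await page.content());
  await browser.close();
})();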
Puppeteer is a powerful tool for automating testing scripts and building fully functional web scrapers. It provides everything needed for user behavior emulation, HTTP request interception, blocking unnecessary resources, parsing HTML content, and even working with proxies.
However, there are some nuances. The tool officially supports only one programming language (JavaScript), and a proxy is bound to a browser instance. Rotating proxies are essential for block-free scraping and easy scaling.
Froxy provides quality mobile and residential proxies with rotation. We offer 10 million IPs, rotating by time and per new request.