
How Chromedp Can Help with Scraping [Guide]

Written by Team Froxy | Sep 12, 2024 9:00:00 AM

We have previously discussed popular libraries for the Go language that assist with webpage parsing. This material focuses on the chromedp library: what it offers, how to install and configure it, and how to use it. For better understanding, we will provide code examples and the most relevant use cases.

What Is Chromedp?

Chromedp is a free, open-source library written in the Go programming language and designed to control Google Chrome in headless mode.

The source code is available on GitHub: https://github.com/chromedp/chromedp

It's important to clarify what a headless browser is: a browser that is controlled via an API, that is, through programmatic calls and without a graphical interface.

For more details, see the material on Headless browsers.

Recently, Google Chrome has introduced its own official headless mode (without the use of third-party libraries and plugins like Selenium WebDriver).

The native Chrome protocol, designed to manage the browser from third-party programs, is called the Chrome DevTools Protocol. Note that it's supported not only by Chrome but also by many browsers built on its code base, including pure Chromium and a special Docker build for scraping called headless-shell.

The main problem with integrating headless Chrome with any external software is the need to write a special connector that could convert programmatic calls into commands and syntax for the Chrome DevTools Protocol.

This is exactly the task that the Chromedp library addresses: it lets your scripts manage headless Chrome without pulling in external dependencies.

Naturally, Chromedp is most often used for scraping web pages and for automated testing of websites and web applications.

Overview of Chromedp’s Features and Capabilities

Technically, you could write your own connector for headless Chrome and call the Chrome DevTools Protocol commands directly in your code. However, this greatly complicates the programming process and increases development time. Isn't it easier to use a ready-made library with a clear command syntax and code examples for all typical tasks? It is! That is exactly why parser developers working with Golang so often choose Chromedp.

Features and advantages of the Chromedp library can be described as follows:

  • Speeds up the creation of parsers and testing scripts in the Go language;
  • There are many ready-to-use code samples for typical tasks in Chromedp;
  • Chromedp allows parsing both static and dynamic websites;
  • The library operates without a long list of additional dependencies - you won't need anything but Chromedp;
  • User behavior emulation is available: filling in forms, controlling the cursor, scrolling, click events, etc.;
  • Capable of emulating specific devices, including desktops and mobile phones;
  • The library allows creating screenshots and PDF files of pages (see the sketch after this list);
  • Supports uploading and downloading files;
  • Advanced parsing features: extracting text from selected layout elements, geo-IP support and connections via proxy servers;
  • Cookie and digital fingerprint management.
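To illustrate the device emulation and screenshot features from the list above, here is a minimal sketch; it is only an assumption-based example with an arbitrary viewport size, a placeholder URL and an arbitrary output file name:

package main

import (
    "context"
    "log"
    "os"

    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var buf []byte
    err := chromedp.Run(ctx,
        chromedp.EmulateViewport(412, 915),       // emulate a phone-sized viewport (illustrative values)
        chromedp.Navigate("https://example.com"), // placeholder URL
        chromedp.FullScreenshot(&buf, 100),       // full-page screenshot; quality 100 produces a PNG
    )
    if err != nil {
        log.Fatal(err)
    }
    // save the screenshot next to the script
    if err := os.WriteFile("screenshot.png", buf, 0o644); err != nil {
        log.Fatal(err)
    }
}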

Getting Started with Chromedp

Enough with the theory, let's move on to the practical part.

To start working with the Chromedp library, you must have the Go environment installed on your device. If you haven't done this yet, visit the official site and download the Go distribution for your platform; Linux, Windows and macOS are supported.

If you don't have a compatible browser, you'll need to download Google Chrome, Chromium or headless-shell (provided as a Docker container, so you'll also need to install the Docker environment).

Installing and Setting Up the Chromedp Library

To install and set up the Chromedp library, start by opening a command-line environment (Command Prompt or PowerShell on Windows, Terminal on macOS and Linux); this can be done without root or administrator rights.

If you wish to initiate a project in a specific directory, you first need to create a folder and then navigate to it in the terminal. The commands for creating and switching directories may differ across platforms.

When working in Windows, you can create a directory at the root of the C drive with:

mkdir \My-project

cd \My-project
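On Linux or macOS, the equivalent commands might look like this (the directory name is arbitrary):

mkdir my-project

cd my-project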

Next, create and initialize your first Go module:

go mod init My-scraper

Then, add the Chromedp library to your module using the command:

go get -u github.com/chromedp/chromedp

Wait for the library to download and install. Once complete, a go.mod file will appear in your project directory.

If you need more code samples using the library, execute the following command:

go get -u -d github.com/chromedp/examples

Creating a Basic Scraping Script with Chromedp

To create a simple web scraping script using Chromedp, start by creating a new file named my-parser.go in the same directory. Open it in a code editor and enter the following code:

package main

// import additional modules:
// the standard libraries context (https://pkg.go.dev/context), fmt, log and time,
// as well as the external chromedp package and its cdproto/dom extension for reading the DOM structure
import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/cdproto/dom"
    "github.com/chromedp/chromedp"
)

// create the main program function
func main() {
    // create a new headless browser instance via a context
    ctx, cancel := chromedp.NewContext(
        context.Background(),
    )
    // when the browser is no longer needed, its instance is closed and the resources are released
    defer cancel()

    // string variable to store the HTML of the page
    var yourhtml string

    // run the browser tasks; any error ends up in the err variable
    err := chromedp.Run(ctx,
        // specify the target page, in our case a special test website with infinite content scrolling, courtesy of scrapingclub
        chromedp.Navigate("https://scrapingclub.com/exercise/list_infinite_scroll/"),
        // add a pause to wait for the content to load completely
        chromedp.Sleep(2000*time.Millisecond),
        // export the HTML code into the string, propagating any errors
        chromedp.ActionFunc(func(ctx context.Context) error {
            // find the root element of the DOM structure
            rootNode, err := dom.GetDocument().Do(ctx)
            // if an error occurred, return it
            if err != nil {
                return err
            }
            // fill the yourhtml variable with the page content
            yourhtml, err = dom.GetOuterHTML().WithNodeID(rootNode.NodeID).Do(ctx)
            return err
        }),
    )
    if err != nil {
        log.Fatal("Fatal error in the parsing logic: ", err)
    }

    // print the HTML to the terminal
    fmt.Println(yourhtml)
}

Now you just need to save the file and run it in your terminal using the command:

go run my-parser.go

After a brief pause, the HTML code of the page should appear in the console. If something goes wrong, an error will be displayed.

Let's complicate the task a bit. Instead of extracting the entire page HTML, which you can easily view in any browser, try to collect only the product names and prices from the page.

For this task, you will need to complete the following steps:

  • Identify the elements in the DOM structure, by their ID or class, that are associated with the relevant tags. On the page https://scrapingclub.com/exercise/list_infinite_scroll/, the product names sit in <h4> tags and the prices in <h5> tags;
  • Keep in mind that on real commercial websites identification can be harder: developers often deliberately make similar elements hard to match, for example by generating unique class names so that no pattern can be found. Even so, identification is usually done through CSS classes;
  • To collect the names, extract the text from the <h4> tags; to collect the prices, extract the text from the <h5> tags;
  • To get all the names and prices, iterate over the entire document structure.

Copy the following code into my-parser.go, replacing the previous version:

package main

// Import the modules
import (
    "context"
    "fmt"
    "log"

    "github.com/chromedp/cdproto/cdp"
    "github.com/chromedp/chromedp"
)

// Our data structure for a product, used to store the extracted information (name and price)
type Product struct {
    name, price string
}

// create the main function of your program
func main() {
    // to collect all the data, create the products variable as a slice of Product
    var products []Product

    // initiate the chromedp headless instance
    ctx, cancel := chromedp.NewContext(
        context.Background(),
    )
    // close it and release the resources when it is no longer needed
    defer cancel()

    // here is the logic for automating the browser
    var productNodes []*cdp.Node
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://scrapingclub.com/exercise/list_infinite_scroll/"), // target web page
        chromedp.Nodes(".post", &productNodes, chromedp.ByQueryAll),                  // select all product cards; they all have the "post" class
    )
    // handle errors right away
    if err != nil {
        log.Fatal("Error: ", err)
    }

    // parsing logic
    var name, price string
    for _, node := range productNodes {
        // extract data from the HTML product card
        err = chromedp.Run(ctx,
            // collect text from the <h4> tag - our product name
            chromedp.Text("h4", &name, chromedp.ByQuery, chromedp.FromNode(node)),
            // collect text from the <h5> tag - our price
            chromedp.Text("h5", &price, chromedp.ByQuery, chromedp.FromNode(node)),
        )
        // output errors, if any
        if err != nil {
            log.Fatal("Error: ", err)
        }

        // store the extracted values in a structured form
        product := Product{}
        product.name = name
        product.price = price

        products = append(products, product)
    }

    // output the product array
    fmt.Println(products)
}

Since we haven’t added a delay, the browser will only display those product cards that do not require additional loading, that is, the first 10 items.

Navigation and Interacting with Web Pages

At this point, we need to either load the page content if it's an infinite scroll or use the built-in pagination mechanism.

In classic parsing scripts (for static pages), a list of all URLs on the page is collected, then they are entered into a special table, checked for uniqueness (to avoid parsing the same pages multiple times) and queued for further parsing procedures.

In the Chromedp library, you can leverage user behavior emulation mechanisms (a short sketch follows the list below):

  • Waiting for content to load;
  • Clicking and moving the mouse pointer on the page;
  • Filling in (or clearing) input fields, submitting forms;
  • Waiting (delay);
  • Creating screenshots or PDF versions of the page (for subsequent recognition and other automation options based on computer vision);
  • Dragging elements.
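For illustration, here is a minimal sketch chaining a few of the actions listed above inside a single chromedp.Run call; the URL and the #search and button selectors are hypothetical placeholders, and ctx plus the time import are assumed from the earlier examples:

err := chromedp.Run(ctx,
    chromedp.Navigate("https://example.com"),                // hypothetical page
    chromedp.WaitVisible("#search", chromedp.ByQuery),       // wait for the input field to appear
    chromedp.SendKeys("#search", "proxy", chromedp.ByQuery), // fill in the field
    chromedp.Click("button[type=submit]", chromedp.ByQuery), // click the submit button
    chromedp.Sleep(2*time.Second),                           // simple delay before reading the results
)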

Notably, Chromedp allows the use of JavaScript to describe the logic of working with the page.

But let's return to our example. To get data on all the products on the page, you need to scroll the feed down at least 8-10 times (depending on screen height). Since new products do not appear instantly, it also makes sense to add a delay.

We'll write the scrolling logic in JavaScript, store the script in a variable and then use that variable in the Go code.

Here’s how scrolling in JS would look:

// Scroll the page 10 times
const numberScrolls = 10
let countScroll = 0

// After each scroll iteration, there is a 1-second pause (1000 milliseconds)
const interval = setInterval(() => {
    window.scrollTo(0, document.body.scrollHeight)
    countScroll++

    if (countScroll === numberScrolls) {
        clearInterval(interval)
    }
}, 1000)

Here is how the complete code will look in Go:

package main

// Import the modules; note that time has been added to the previous imports
import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/cdproto/cdp"
    "github.com/chromedp/chromedp"
)

// Our product data structure used to store the extracted data (name and price)
type Product struct {
    name, price string
}

// Create the main function of your program
func main() {
    // to collect all the data, create the products variable as a slice of Product
    var products []Product

    // initialize the headless Chrome instance
    ctx, cancel := chromedp.NewContext(
        context.Background(),
    )
    // close it and release the resources when it is no longer needed
    defer cancel()

    // describe the scrolling logic in JavaScript; don't forget the backquotes (opening and closing) of the raw string
    scriptForScrolling := `
        // scroll the page 10 times
        const numberScrolls = 10
        let countScroll = 0

        // after each scrolling iteration, there is a 1-second pause (1000 milliseconds)
        const interval = setInterval(() => {
            window.scrollTo(0, document.body.scrollHeight)
            countScroll++

            if (countScroll === numberScrolls) {
                clearInterval(interval)
            }
        }, 1000)
    `

    // browser automation logic - JS script execution and a general delay of 10 seconds have been added
    var productNodes []*cdp.Node
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://scrapingclub.com/exercise/list_infinite_scroll/"), // target page
        chromedp.Evaluate(scriptForScrolling, nil),                                   // execute the scrolling script
        chromedp.Sleep(10000*time.Millisecond),                                       // a 10-second delay
        chromedp.Nodes(".post", &productNodes, chromedp.ByQueryAll),                  // select all product cards with the "post" class
    )
    // handle errors right away
    if err != nil {
        log.Fatal("Error: ", err)
    }

    // parsing logic
    var name, price string
    for _, node := range productNodes {
        // extract data from the HTML product card
        err = chromedp.Run(ctx,
            // collect text from the <h4> tag - our product name
            chromedp.Text("h4", &name, chromedp.ByQuery, chromedp.FromNode(node)),
            // collect text from the <h5> tag - our price
            chromedp.Text("h5", &price, chromedp.ByQuery, chromedp.FromNode(node)),
        )
        // output errors, if any
        if err != nil {
            log.Fatal("Error: ", err)
        }

        // store the extracted values in a structured form
        product := Product{}
        product.name = name
        product.price = price

        products = append(products, product)
    }

    // output the product array
    fmt.Println(products)
}

Run the script and wait for it to complete. The console will display all 60 test products from the array.

As you noticed, we set a fixed waiting time. However, this isn't the most efficient approach, as it can waste a lot of parsing time. It's better to rely on an event that triggers when a specific element becomes visible. Chromedp has this capability in the form of the WaitVisible() function.

In our Go script, replace the line:

chromedp.Sleep(10000*time.Millisecond),

with the following one:

chromedp.WaitVisible(".post:nth-child(60)"),

Be sure to remove the time import as well: it is no longer used, and Go will not compile a program with an unused import.

In the new code, we wait for the 60th product card with the "post" class to load.

You can use other events to manage parsing logic, such as:

  • WaitReady()
  • WaitNotVisible()
  • WaitEnabled()
  • WaitSelected()
  • WaitNotPresent()
  • Poll()
  • ListenTarget()
  • PollFunction()
  • Click()

For example, the code

chromedp.Click(`.post a`, chromedp.ByQuery),

will perform a click on the image within the product card (.post a is a combination of the CSS class of the card and the <a> tag containing the link and image).
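In the same spirit, you can wait for an element to disappear before parsing. A minimal sketch, assuming a hypothetical loading spinner with the .loader class:

chromedp.WaitNotVisible(".loader", chromedp.ByQuery),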

Advanced Chromedp Techniques for Web Scraping

As mentioned above, chromedp can emulate user behavior, fill out forms, download and upload files.

Chromedp Proxy

One of the most advanced techniques for scraping with chromedp is working through a proxy. Here’s the official documentation for chromedp proxy.

Below, we’ll explore proxies through code examples.

Here's a script that creates a new context (browser instance) and runs it through a proxy server. The script finishes by displaying the current IP address (to confirm whether the traffic really goes through the proxy):

package main

// Import modules and the Chromedp library
import (
    "context"
    "fmt"

    "github.com/chromedp/chromedp"
)

// Create the main function of your program
func main() {
    // variable for storing the proxy address; here it is filled with localhost data
    // note: the proxy address format is protocol://login:password@IP_ADDRESS:port
    // replace the proxy data with your own
    var proxyAddress string = "https://127.0.0.1:80"

    // set up the allocator options: the proxy server and the headless-mode browser flag
    opts := []chromedp.ExecAllocatorOption{
        chromedp.ProxyServer(proxyAddress),
        chromedp.Flag("headless", true),
    }

    // create an allocator based on these options
    allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
    // release the resources when they are no longer needed
    defer cancel()

    // create a new browser context on top of the allocator
    ctx, cancel := chromedp.NewContext(allocCtx)
    defer cancel()

    // access the website and extract the required info;
    // in our case this is the httpbin.org website, which shows your IP address
    var ip string
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://httpbin.org/ip"),
        chromedp.WaitVisible("body"),
        chromedp.Text("body", &ip),
    )
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // print the retrieved IP address to the console
    fmt.Println("Your IP address:", ip)
}

If the proxy server is unavailable, the script will display a connection error (this will happen if you leave the localhost IP – 127.0.0.1).

However, using a single proxy is not always convenient for a parser.

Here’s a script that rotates proxies from a list:

package main

// Import modules and the Chromedp library
import (
    "context"
    "fmt"
    "time"

    "github.com/chromedp/chromedp"
)

// Create the main function of your program
func main() {
    // define the list of proxy servers for rotation
    proxyAddresses := []string{
        "https://123.123.123.123:80",
        "https://124.124.124.124:80",
        "https://125.125.125.125:80",
        // add real working proxies in the format protocol://login:password@IP_ADDRESS:port
        // the number of proxies may vary
    }

    // create a parent context with a 20-second timeout
    ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
    defer cancel()

    // rotate the proxy addresses and parse data from the page
    for _, proxyAddress := range proxyAddresses {
        // set the current proxy address
        opts := append(chromedp.DefaultExecAllocatorOptions[:],
            chromedp.Flag("headless", true),    // enable headless mode
            chromedp.ProxyServer(proxyAddress), // define the proxy here
        )

        // create the allocator for the current proxy options
        allocCtx, cancel := chromedp.NewExecAllocator(ctx, opts...)
        defer cancel()

        // create a new browser context
        ctx, cancel := chromedp.NewContext(allocCtx)
        // release the resources when they are no longer needed
        defer cancel()

        // access the website and parse data
        var ip string
        err := chromedp.Run(ctx,
            chromedp.Navigate("https://httpbin.org/ip"), // a web page showing the client's current IP address
            chromedp.WaitVisible("body"),
            chromedp.Text("body", &ip), // copy the content with the IP address
        )
        if err != nil {
            fmt.Println("Error:", err)
            // proceed to the next proxy in the list
            continue
        }

        // print the IP address scraped from the page;
        // every iteration will display a new address or a new connection error
        fmt.Println("IP Address:", ip)
    }
}

Setting the User-Agent

For more details, see the material on what a User Agent is.

Technically, you're already working through the latest Chrome version (if it's installed on your system). However, this doesn't always meet the needs of web scraping.

To emulate work through different browser types and devices, the first thing to change is the User-Agent string.

The UserAgent() function is responsible for this.

Here's how browser instance options with a custom User-Agent string might look:

// the browser instance options, including the proxy settings, are gathered here
options := append(chromedp.DefaultExecAllocatorOptions[:],
    chromedp.ProxyServer("123.123.123.123:80"), // proxy server address and port
    chromedp.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"), // the required User-Agent is specified in this string
    // other options may come next
)
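These options are then handed to an allocator, which serves as the parent for the browser context. A minimal sketch of that wiring, assuming the options slice defined above:

// create an allocator from the options and a browser context on top of it
allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), options...)
defer cancel()

ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()

// ctx can now be passed to chromedp.Run() as in the earlier examples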

Reading and Setting Cookies

GetCookies() is responsible for reading cookies. Below is a code sample that reads the available cookies and displays them:

// this snippet assumes the imports "log" and "github.com/chromedp/cdproto/storage"
chromedp.ActionFunc(func(ctx context.Context) error {
    // all cookies are read here
    cookies, err := storage.GetCookies().Do(ctx)
    // error handling
    if err != nil {
        return err
    }

    // loop through the cookie array and output each cookie to the console
    for i, cookie := range cookies {
        log.Printf("Cookie %d: %+v", i, cookie)
    }

    return nil
}),

Chromedp can not only read but also set cookies using the SetCookie() function. Here’s an example of how to set a cookie:

// this snippet assumes the imports "time", "github.com/chromedp/cdproto/cdp" and "github.com/chromedp/cdproto/network"
// accessing the active context
chromedp.ActionFunc(func(ctx context.Context) error {
    // the cookie as a "name: value" pair
    cookie := [2]string{"cookie_name", "cookie_value"}

    // set the cookie expiration time to the current moment + 1 year
    expirationTime := cdp.TimeSinceEpoch(time.Now().Add(365 * 24 * time.Hour))
    err := network.SetCookie(cookie[0], cookie[1]).
        WithExpires(&expirationTime).
        WithDomain("<YOUR_DOMAIN>"). // your domain should be specified here
        WithHTTPOnly(true).
        Do(ctx)

    if err != nil {
        return err
    }
    return nil
}),




Website Authorization with Chromedp

Many websites handle form submissions through POST requests, especially login forms. Below is an example of a script that fills in the login and password fields and submits the form to authorize on a website:

err := chromedp.Run(ctx,
    chromedp.Navigate("https://your-site.com/login"),  // do not forget to replace this with the login page of your website
    chromedp.WaitVisible(sel),                         // wait until the target selector (sel, chosen to suit your page) becomes visible
    chromedp.SendKeys(".login", "<YOUR_USERNAME>"),    // login is specified here
    chromedp.SendKeys(".password", "<YOUR_PASSWORD>"), // password is specified here
    chromedp.Submit("button[role=submit]"),            // submit the form (a Click action can be used instead of Submit)
)

General Recommendations for Parsing

We've covered the most critical steps for bypassing blocks when parsing, but target websites often have their own protection systems.

To parse effectively without getting banned, we recommend checking out our material on "Best Practices for Web Scraping Without Getting Blocked."

Here’s a quick summary:

  • Use random delays between requests (see the sketch after this list);
  • Always use proxies;
  • The proxies you use should be of good quality and match the type of device you are emulating - sites can check user agents and other characteristics such as screen size and system fonts;
  • Pay attention to the quality of digital fingerprints and don't forget about cookies;
  • Bypass bot traps;
  • Monitor the load and request frequency on the target website;
  • If possible, use the available API (APIs often have rate limits, but breaching them won't result in permanent bans).
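To illustrate the first recommendation, here is a minimal sketch of a randomized pause between chromedp actions; the 2-5 second range and the URL are arbitrary, and the snippet assumes the math/rand and time imports plus a ctx from the earlier examples:

// random delay between 2 and 5 seconds
delay := time.Duration(2+rand.Intn(4)) * time.Second

err := chromedp.Run(ctx,
    chromedp.Navigate("https://example.com"), // placeholder URL
    chromedp.Sleep(delay),                    // pause before the next action
)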

Keep in mind: Chromedp is optimized for use with headless Chrome/Chromium, leveraging the Chrome DevTools Protocol (CDP). It won’t work with other browsers that don’t support this protocol.

For resource efficiency, you might want to explore headless-shell, a lightweight Docker image with a stripped-down headless Chromium build.
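A typical setup (this is an assumption-based sketch: the chromedp/headless-shell image exposes the DevTools endpoint on port 9222 by default) is to start the container, for example with docker run -d -p 9222:9222 chromedp/headless-shell, and then connect to it from Go through a remote allocator:

// connect to the remote headless-shell instance instead of launching a local browser
allocCtx, cancel := chromedp.NewRemoteAllocator(context.Background(), "ws://127.0.0.1:9222")
defer cancel()

ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()

// ctx is then used with chromedp.Run() exactly as before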

Conclusion

The Chromedp library significantly simplifies the process of creating parsers based on the Golang-headless Chrome combination. Instead of writing large amounts of code for each new task, you can utilize ready-made functions, speeding up development and minimizing complexity.

The range of capabilities offered by Chromedp is impressive. You can automate almost anything - from handling cookies and user agents to downloading files and emulating user behavior.

However, no large-scale scraper can function effectively without quality rotating proxies. With Froxy, you can access over 10 million IPs, including residential and mobile proxies, with targeting up to specific cities and ISPs. Proxies can be integrated into your parser with just a single line of code, and further rotation and configuration are easily managed via the user dashboard. Pricing is based on traffic consumption rather than the number of proxies.

Take advantage of the affordable trial package to fully test the capabilities of our proxy services.