We have previously discussed popular libraries for the Go language that assist with webpage parsing. This material will focus on the chromedp library: how to use it, its features, how to install and configure it. For better understanding, we will provide code examples and the most relevant use cases.
Chromedp is a free, open-source library written in the Go programming language and designed to interact with Google Chrome in headless mode.
It's worth clarifying what a headless browser is: a browser that is controlled via an API, that is, through programmatic calls and without a graphical interface.
For more details, see the material on Headless browsers.
Recently, Google Chrome has introduced its own official headless mode (without the use of third-party libraries and plugins like Selenium WebDriver).
The native Chrome protocol designed for controlling the browser from external programs is called the Chrome DevTools Protocol. Note that it is supported not only by Chrome but also by many browsers built on its code base, including plain Chromium and a special Docker build for scraping called headless-shell.
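For reference, you can open that protocol endpoint yourself by starting the browser with its standard headless and remote-debugging flags (the exact binary name depends on your OS), for example:
chrome --headless --remote-debugging-port=9222 https://example.com
After that, the browser accepts Chrome DevTools Protocol commands over a local WebSocket on port 9222.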
The main problem with integrating headless Chrome into external software is the need to write a special connector that converts programmatic calls into the commands and syntax of the Chrome DevTools Protocol.
This is exactly the task that the Chromedp library solves: it lets your scripts control headless Chrome without pulling in additional external dependencies.
Naturally, Chromedp is most often used for parsing web pages and for automated testing of websites and web applications.
Technically, you could write your own connector for headless Chrome and call Chrome DevTools Protocol commands directly from your code. However, this greatly complicates programming and increases development time. Isn't it easier to use a ready-made library with a clear command syntax and code examples for all typical tasks? It is! That is exactly why most parser developers working in Golang rely on Chromedp.
To sum up the features and advantages: Chromedp is written in pure Go, requires no external dependencies such as Selenium or WebDriver, and works with any browser that supports the Chrome DevTools Protocol.
Enough with the theory, let's move on to the practical part.
To start working with the Chromedp library, you must have the Go environment installed on your device. If you haven't done this yet, visit the official site and download the Go distribution for your platform; Linux, Windows, and macOS are supported.
If you don't have a compatible browser, you'll need to download Google Chrome, Chromium or headless-shell (provided as a Docker container, so you'll also need to install the Docker environment).
To install and set up the Chromedp library, start by opening a command-line environment (for example, Windows Terminal or PowerShell on Windows); this does not require root or administrator rights.
If you wish to initiate a project in a specific directory, you first need to create a folder and then navigate to it in the terminal. The commands for creating and switching directories may differ across platforms.
When working in Windows, you can create a directory at the root of the C drive with:
mkdir \My-project
cd \My-project
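On Linux and macOS, the equivalent commands create the folder in your home directory and switch into it:
mkdir ~/My-project
cd ~/My-project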
Next, create and initialize your first Go module:
go mod init My-scraper
Then, add the Chromedp library to your module using the command:
go get -u github.com/chromedp/chromedp
Wait for the library to download and install. Once complete, a go.mod file will appear in your project directory.
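For reference, the resulting go.mod will look roughly like the following; the exact Go and chromedp versions depend on what is current when you run the commands, so treat the numbers here as placeholders:
module My-scraper

go 1.22

require github.com/chromedp/chromedp v0.9.5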
If you need more code samples using the library, execute the following command:
go get -u -d github.com/chromedp/examples
To create a simple web scraping script with Chromedp, start by creating a new file named my-parser.go in the same directory. Open it in a code editor and enter the following code:
package main
// import additional modules
// in our case these are the standard packages context (https://pkg.go.dev/context), fmt (https://pkg.go.dev/fmt), log and time, plus the external chromedp library and its cdproto/dom extension for reading the DOM structure
import (
"context"
"fmt"
"github.com/chromedp/cdproto/dom"
"github.com/chromedp/chromedp"
"log"
"time"
)
// create the main program function
func main() {
// initiate the creation of a new headless-browser sample via context
ctx, cancel := chromedp.NewContext(
context.Background(),
)
// when the browser is no longer needed,
// cancel releases its instance and frees the resources
defer cancel()
// string variable to store the HTML of the page
var yourhtml string
// run the browser tasks; any error ends up in the err variable
err := chromedp.Run(ctx,
// specify the target page; in our case this is a special test site with infinite content scrolling, courtesy of scrapingclub
chromedp.Navigate("https://scrapingclub.com/exercise/list_infinite_scroll/"),
// add a pause to wait for the complete content download
chromedp.Sleep(2000*time.Millisecond),
// extract the HTML code into the string, collecting any errors so they end up in the err variable
chromedp.ActionFunc(func(ctx context.Context) error {
// to do this, find the root node of the DOM structure
rootNode, err := dom.GetDocument().Do(ctx)
// if an error occurred, return it
if err != nil {
return err
}
// fill the yourhtml variable with content
yourhtml, err = dom.GetOuterHTML().WithNodeID(rootNode.NodeID).Do(ctx)
return err
}),
)
if err != nil {
log.Fatal("Fatal error in the parsing logic", err)
}
// Print the HTML to the terminal
fmt.Println(yourhtml)
}
Now you just need to save the file and run it in your terminal using the command:
go run my-parser.go
After a brief pause, the HTML code of the page should appear in the console. If something goes wrong, an error will be displayed.
Let's complicate the task a bit. Instead of extracting the entire page HTML, which you can easily view in any browser, try to collect only the product names and prices from the page.
For this task, you will need to complete the following steps:
Copy the following code into my-parser.go, replacing the previous version:
package main
// Import the modules
import (
"context"
"fmt"
"github.com/chromedp/cdproto/cdp"
"github.com/chromedp/chromedp"
"log"
)
// Our data structure for a product, used to store the extracted information (the name and price fields)
type Product struct {
name, price string
}
// create the main function of your program
func main() {
// to collect all the data, create the products variable of type []Product
var products []Product
// Initialize a headless chromedp browser instance
ctx, cancel := chromedp.NewContext(
context.Background(),
)
// release its resources when it is no longer needed
defer cancel()
// Here is the logic for automating the browser
var productNodes []*cdp.Node
err := chromedp.Run(ctx,
chromedp.Navigate("https://scrapingclub.com/exercise/list_infinite_scroll/"), //target web page
chromedp.Nodes(".post", &productNodes, chromedp.ByQueryAll), //search all product pages, they all have the post class
)
//check for errors right away
if err != nil {
log.Fatal("Error:", err)
}
// parsing logic
var name, price string
for _, node := range productNodes {
// extract data from HTML product card
err = chromedp.Run(ctx,
// collect text from the H4 tag, our name
chromedp.Text("h4", &name, chromedp.ByQuery, chromedp.FromNode(node)),
// collect text from the H5 tag, our price
chromedp.Text("h5", &price, chromedp.ByQuery, chromedp.FromNode(node)),
)
//report any errors here
if err != nil {
log.Fatal("Error:", err)
}
// create a new Product structure to hold the parsed data
product := Product{}
product.name = name
product.price = price
products = append(products, product)
}
//Output the product array
fmt.Println(products)
}
Since we haven’t added a delay, the browser will only display those product cards that do not require additional loading, that is, the first 10 items.
At this point, we need to either trigger the loading of additional content (in the case of infinite scroll) or use the site's built-in pagination mechanism.
In classic parsing scripts (for static pages), a list of all URLs on the page is collected, then they are entered into a special table, checked for uniqueness (to avoid parsing the same pages multiple times) and queued for further parsing procedures.
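If you ever need that classic approach with chromedp, here is a minimal sketch; it assumes the same imports and browser context (ctx) as in the examples above, and the listing URL is purely illustrative:
// collect all link nodes on the page
var linkNodes []*cdp.Node
err := chromedp.Run(ctx,
chromedp.Navigate("https://example.com/catalog"), // hypothetical listing page
chromedp.Nodes("a", &linkNodes, chromedp.ByQueryAll), // every link on the page
)
if err != nil {
log.Fatal("Error:", err)
}
// keep only unique URLs (to avoid parsing the same pages multiple times) and queue them for further parsing
seen := make(map[string]bool)
var queue []string
for _, node := range linkNodes {
href := node.AttributeValue("href")
if href != "" && !seen[href] {
seen[href] = true
queue = append(queue, href)
}
}
fmt.Println("Pages queued for parsing:", len(queue))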
In the Chromedp library, you can instead leverage built-in user behavior emulation mechanisms such as scrolling, clicking, and keyboard input.
Notably, Chromedp allows the use of JavaScript to describe the logic of working with the page.
But let's return to our example. To get data on all the products on the page, you need to scroll the feed down at least 8-10 times (depending on screen height). Since the page does not immediately show new products, it makes sense to add a delay time.
We'll write the scrolling logic in JS, add the script to a variable and then use that variable in Go.
Here’s how scrolling in JS would look:
// Scroll the page 10 times
const numberScrolls = 10
let countScroll = 0
// After each scroll iteration, add a 1-second pause (1000 milliseconds)
const interval = setInterval(() => {
window.scrollTo(0, document.body.scrollHeight)
countScroll++
if (countScroll === numberScrolls) {
clearInterval(interval)
}
}, 1000)
Here is how the complete code will look in Go:
package main
// Import modules; note that the time package has been added compared to the previous example
import (
"context"
"fmt"
"github.com/chromedp/cdproto/cdp"
"github.com/chromedp/chromedp"
"log"
"time"
)
// Our product data structure that we will use to store the extracted data (the name and price fields)
type Product struct {
name, price string
}
// Create the main function of your program
func main() {
// to collect all the data, create the products variable of type []Product
var products []Product
// Initialize a headless Chrome instance
ctx, cancel := chromedp.NewContext(
context.Background(),
)
// release its resources when no longer needed
defer cancel()
// Describe the scrolling logic in JavaScript; don't forget the backquotes (opening and closing) around the raw string
scriptForScrolling := `
// scroll the page 10 times
const numberScrolls = 10
let countScroll = 0
// After each scrolling iteration, add a 1-second pause (1000 milliseconds)
const interval = setInterval(() => {
window.scrollTo(0, document.body.scrollHeight)
countScroll++
if (countScroll === numberScrolls) {
clearInterval(interval)
}
}, 1000)
`
// Browser automation logic - added JS script execution and a general delay of 10 seconds
var productNodes []*cdp.Node
err := chromedp.Run(ctx,
chromedp.Navigate("https://scrapingclub.com/exercise/list_infinite_scroll/"), //target page
chromedp.Evaluate(scriptForScrolling, nil), //execute scrolling script
chromedp.Sleep(10000*time.Millisecond), //a 10-second delay
chromedp.Nodes(".post", &productNodes, chromedp.ByQueryAll), // search all product pages with post class
)
//check for errors
if err != nil {
log.Fatal("Error:", err)
}
// parsing logic
var name, price string
for _, node := range productNodes {
// extract data from the HTML product card
err = chromedp.Run(ctx,
// collect text from the H4 tag, our product name
chromedp.Text("h4", &name, chromedp.ByQuery, chromedp.FromNode(node)),
// collect text from the H5 tag, our price
chromedp.Text("h5", &price, chromedp.ByQuery, chromedp.FromNode(node)),
)
//report any errors here
if err != nil {
log.Fatal("Error:", err)
}
// create a new Product structure to hold the parsed data
product := Product{}
product.name = name
product.price = price
products = append(products, product)
}
//Output product array
fmt.Println(products)
}
Run the script and wait for it to complete. The console will display all 60 test products from the array.
As you may have noticed, we set a fixed wait time. However, this isn't the most efficient approach, since a hard delay can waste a lot of parsing time. It's better to react to an event that fires when a specific element becomes visible. Chromedp provides this capability through the WaitVisible() function.
In our Go script, replace the line:
chromedp.Sleep(10000*time.Millisecond),
with the following one:
chromedp.WaitVisible(".post:nth-child(60)"),
Be sure to remove the time module from the import block, as it is no longer needed in our script.
In the new code, we wait for the 60th product card with the "post" class to load.
You can also use other events and actions to manage the parsing logic.
For example, the code
chromedp.Click(`.post a`, chromedp.ByQuery),
will perform a click on the image within the product card (.post a is a combination of the CSS class of the card and the <a> tag containing the link and image).
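A few more chromedp actions that are often used to gate the parsing logic look like this (the selectors are placeholders for illustration):
chromedp.WaitReady("#content"), // the element exists in the DOM and is ready
chromedp.WaitNotVisible(".spinner"), // a loading indicator has disappeared
chromedp.WaitEnabled("button.next"), // a button has become clickable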
As mentioned above, chromedp can emulate user behavior, fill out forms, download and upload files.
One of the most advanced techniques for scraping with chromedp is working through a proxy. Here’s the official documentation for chromedp proxy.
Below, we’ll explore proxies through code examples.
Here's a script that creates a new context (browser instance) and runs it through a proxy server. At the end, the script displays the current IP address to confirm that traffic is actually going through the proxy:
package main
// Import modules and Chromedp library
import (
"context"
"fmt"
"github.com/chromedp/chromedp"
)
// Create the main function of your program
func main() {
//declare a variable for storing the proxy address; here it is filled with localhost data
// the proxy format may look as follows: protocol://login:password@IP_ADDRESS:port
//replace the proxy data with your own
var proxyAddress string = "https://127.0.0.1:80"
// set up the browser options: the proxy server plus a flag to run in headless mode
opts := append(chromedp.DefaultExecAllocatorOptions[:],
chromedp.ProxyServer(proxyAddress),
chromedp.Flag("headless", true),
)
// create an allocator context based on these options
allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
// release the resources when they are no longer needed
defer cancel()
// create a new browser context on top of the allocator
ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()
// access the website and extract the required info
// in our case, this is the httpbin.org website, where you can find your IP address
var ip string
err := chromedp.Run(ctx,
chromedp.Navigate("https://httpbin.org/ip"),
chromedp.WaitVisible("body"),
chromedp.Text("body", &ip),
)
if err != nil {
fmt.Println("Error:", err)
return
}
// Print the retrieved IP address to the console
fmt.Println("Your IP-address", ip)
}
If the proxy server is unavailable, the script will display a connection error (this will happen if you leave the localhost IP – 127.0.0.1).
However, using a single proxy is not always convenient for a parser.
Here’s a script that rotates proxies from a list:
package main
// Import modules and Chromedp library
import (
"context"
"fmt"
"time"
"github.com/chromedp/chromedp"
)
// Create the main function of your program
func main() {
// define the list of proxy servers for rotation
proxyAddresses := []string{
"https://123.123.123.123:80",
"https://124.124.124.124:80",
"https://125.125.125.125:80",
// you need to add real working proxies in the following format: protocol://login:password@IP_ADDRESS:port
//proxy amount may vary
}
// create a new context with a 20-second timeout
ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
// rotate proxy addresses and parse data from the page
for _, proxyAddress := range proxyAddresses {
// set the correct proxy address
opts := append(chromedp.DefaultExecAllocatorOptions[:],
chromedp.Flag("headless", true), // enable the working flag in the headless mode
chromedp.ProxyServer(proxyAddress), // define proxy here
)
// create the allocator for current proxy options
allocCtx, cancel := chromedp.NewExecAllocator(ctx, opts...)
defer cancel()
// create a new browser context instance
ctx, cancel := chromedp.NewContext(allocCtx)
// release the resources when no longer needed
defer cancel()
// access the website and parse data
var ip string
err := chromedp.Run(ctx,
chromedp.Navigate("https://httpbin.org/ip"), //a web page showing current IP address of a client
chromedp.WaitVisible("body"),
chromedp.Text("body", &ip), //copy content with IP address
)
if err != nil {
fmt.Println("Error:", err)
// proceed to the next proxy in the list
continue
}
// Print the IP address scraped from the page
// every new cycle will display a new address or a new connection error
fmt.Println("IP Address:", ip)
}
}
Technically, you're already working through the latest Chrome version (if it's installed on your system). However, this doesn't always meet the needs of web scraping.
To emulate work through different browser types and devices, the first thing to change is the User-Agent string.
The UserAgent() function is responsible for this.
Here’s how an example of browser instance options with a User-Agent string might look:
// all browser instance options are gathered here
options := append(chromedp.DefaultExecAllocatorOptions[:],
chromedp.ProxyServer("123.123.123.123:80"), // proxy server address and port
chromedp.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"), // the required User-Agent string is specified here
// other options may follow
)
Reading cookies is handled by the GetCookies() function from the cdproto/storage package. Below is a code sample that reads the available cookies and prints them to the console:
chromedp.ActionFunc(func(ctx context.Context) error {
cookies, err := storage.GetCookies().Do(ctx) // all cookies are read here
// error handling is realized here
if err != nil {
return err
}
// Loop to iterate through the cookie array and output each cookie in the console
for i, cookie := range cookies {
log.Printf("Cookie %d: %+v", i, cookie)
}
return nil
}),
Chromedp can not only read cookies but also set them, using the SetCookie() function from the cdproto/network package. Here's an example of how to set a cookie:
// Accessing the active context
chromedp.ActionFunc(func(ctx context.Context) error {
// define the cookie as a name/value pair
cookie := [2]string{"cookie_name", "cookie_value"}
// setting cookie expiration time based on the current moment + 1 year
expirationTime := cdp.TimeSinceEpoch(time.Now().Add(365 * 24 * time.Hour))
err := network.SetCookie(cookie[0], cookie[1]).
WithExpires(&expirationTime).
WithDomain("<YOUR_DOMAIN>"). //your domain should be specified here
WithHTTPOnly(true).
Do(ctx)
if err != nil {
return err
}
return nil
}),
Many websites handle form data submissions through POST requests, especially for login forms. Below is an example of a script that sends a POST request containing a login and password to authorize on a website:
err := chromedp.Run(ctx,
chromedp.Navigate("https://your-site.com/login"), // do not forget to replace the website login page with a relevant one
chromedp.WaitVisible(".login"), // wait until the login field becomes visible; choose a selector that fits your page
chromedp.SendKeys(".login", "<YOUR_USERNAME>"), // login is specified here
chromedp.SendKeys(".password", "<YOUR_PASSWORD>"), // password is specified here
chromedp.Submit("button[role=submit]"), // send the form, using the click function instead of the submit
We've covered the most critical steps for bypassing blocks when parsing, but target websites often have their own protection systems.
To parse effectively without getting banned, we recommend checking out our material on "Best Practices for Web Scraping Without Getting Blocked."
Here’s a quick summary:
Keep in mind: Chromedp is optimized for use with headless Chrome/Chromium, leveraging the Chrome DevTools Protocol (CDP). It won’t work with other browsers that don’t support this protocol.
For resource efficiency, you might want to explore headless-shell, a lightweight Docker image that ships a minimal headless build of Chromium.
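If you go this route, a rough sketch of the setup looks like the following; the image name and port follow the chromedp project's published container, so verify them against the current documentation. First start the container:
docker run -d -p 9222:9222 chromedp/headless-shell:latest
Then connect to it from Go with a remote allocator instead of launching a local browser:
// attach to the already running headless-shell instance
allocCtx, cancel := chromedp.NewRemoteAllocator(context.Background(), "ws://localhost:9222/")
defer cancel()
// create a browser context on top of the remote allocator
ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()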
The Chromedp library significantly simplifies the process of creating parsers based on the Golang-headless Chrome combination. Instead of writing large amounts of code for each new task, you can utilize ready-made functions, speeding up development and minimizing complexity.
The range of capabilities offered by Chromedp is impressive. You can automate almost anything - from handling cookies and user agents to downloading files and emulating user behavior.
However, no large-scale scraper can function effectively without quality rotating proxies. With Froxy, you can access over 10 million IPs, including residential and mobile proxies, with targeting up to specific cities and ISPs. Proxies can be integrated into your parser with just a single line of code, and further rotation and configuration are easily managed via the user dashboard. Pricing is based on traffic consumption rather than the number of proxies.
Take advantage of the affordable trial package to fully test the capabilities of our proxy services.