
Headless Browser: What is It and How to Use It for Scraping

Written by Team Froxy | Feb 7, 2024 10:00:00 PM

Only programmers and IT specialists could have come up with a term like "headless" browser. There is another English name for it, too - a "decoupled browser" - which likewise hints at the idea of beheading.

Delving deeper into the term's meaning clarifies the wordplay. There is indeed something to it - after all, how would you communicate with a person who has no head? In our case, the head represents the main interface: the part through which speech, facial expressions and everything else are conveyed.

So, what is a headless browser? Intrigued? Then let's dive deeper and explore what these browsers have to do with data parsing.

What Is a Headless Browser?

A headless browser, a term from programming, refers to a browser without a graphical user interface. The thing is, modern website engines (CMS systems) often separate the core engine from graphical interfaces, both for management and end-user display.

In this scenario, interface tasks may be handled by other services, such as native mobile apps for mobile operation and PWA applications for desktop browser rendering. A single engine can interact with various interfaces, minimizing core load. Each component focuses on its task and can be scaled efficiently. Such CMS systems are known as Headless CMS.

Not every programming task needs a graphical interface; for some tasks, graphical mode is outright unnecessary. Data parsing is one of them.

What does "headless" mean in a few words? That's simple: a headless browser is one that has no graphical interface and is operated solely through an API.

What Is Headless Browser Used For?

For standard data parsing tasks, pre-built libraries that analyze HTML document structures, locate specific tags and extract relevant information are usually sufficient. We've discussed them in earlier posts on this blog.

However, the challenge arises with modern websites that don't serve "pure" HTML. Many site operations are computed and completed right in the browser, by JavaScript.

Websites now function as full-fledged web applications, with user actions (including simple actions on a page) altering layout and content.

Instead of classic HTML, parsers may encounter links to JS scripts scattered across several locations. To render the HTML code in its final form, all these scripts have to be downloaded and executed (applied to the basic HTML initially sent over HTTP), and only then can parsing begin.

Standard parsing libraries don't typically handle this, hence using a browser to process the page and obtain the resultant HTML code is more straightforward.

Headless browsers are ideal for tasks involving heavy use of JavaScript scripts on web pages. They work as follows:

  1. The parser provides the headless browser with the URL of the target page and any specific actions required (like authorization, focusing on elements, delays etc.).
  2. The browser interacts with the page as a regular user would, executing and calculating all related JavaScript scripts.
  3. This results in a final HTML page.
  4. The parser then processes this resulting code.

From there, the parser follows its standard algorithm: it detects the required HTML tags, extracts the data, saves it to a table, exports it to other software and so on.
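To make the four steps above concrete, here is a minimal sketch of such a pipeline. It assumes Puppeteer as the headless browser and the cheerio library as the ordinary HTML parser; the URL and selectors are placeholders:

// npm i puppeteer cheerio

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

(async () => {
  // Steps 1-2: the headless browser opens the page and runs its JavaScript
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://site.com/index.html", { waitUntil: "networkidle0" });

  // Step 3: take the final, fully rendered HTML
  const html = await page.content();
  await browser.close();

  // Step 4: hand the rendered HTML to a standard parsing library
  const $ = cheerio.load(html);
  $("article h1 a").each((i, el) => {
    console.log($(el).attr("title"), $(el).attr("href"));
  });
})();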

Headless browsers are primarily used for rendering pages with extensive JavaScript; they handle just one stage of the parser's overall workflow.

The communication between the parser and the headless browser occurs via an API. Apart from parsing, headless browsers are also used for:

  • layout testing (design, error/problem detection, layout rendering on various screen sizes etc.);
  • website/web page (code) performance testing;
  • automating routine tasks like managing multiple accounts or activities in social networks.

Headless Browser Testing Mode

Browsers like Google Chrome offer interfaces for "headless" operation, providing programmers with debugging tools, APIs, libraries etc.

However, the browser can signal to websites that it is operating in headless mode. Advanced anti-fraud systems may use this information to block access or take other measures: any request bearing the signs of a headless connection can end in an IP block or other disappointing consequences.

To avoid detection and potential blocking while parsing, it's essential to use libraries and scripts that strip the headless indicators from requests and from the browser's fingerprint. This is crucial for effective and safe web scraping.
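One popular option in the Node.js ecosystem is the puppeteer-extra-plugin-stealth package, which patches the most common headless giveaways (such as the navigator.webdriver property). A minimal sketch:

// npm i puppeteer-extra puppeteer-extra-plugin-stealth

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

// Register the stealth plugin before launching the browser
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://site.com/index.html");

  // Without the plugin, this typically reports true in headless mode
  console.log(await page.evaluate(() => navigator.webdriver));

  await browser.close();
})();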

Popular Headless Browsers

Nowadays, it's no secret that most browsers are built either on the Chromium engine (like Google Chrome, Edge, Opera etc.) or on the Gecko/Quantum engine (like Firefox, Waterfox, Pale Moon, Basilisk, etc.).

However, a headless browser is not just a core for executing JS scripts; it also includes a special toolset for developers as well as specific extension modules and libraries.

So, what are the available options for headless browser environments?

Chrome DevTools Protocol

Starting with version 59 (released in 2017), Chrome can run without its interface and be driven through a special API - the Chrome DevTools Protocol. All you need to do is install the browser and start working.

Headless Chrome is managed at the command line (CLI mode): you invoke the standard browser executable with the special "--headless" flag.

Among the additional flags, there are tools for exporting PDF files, creating screenshots of pages, debugging and printing the DOM structure.

Even in this mode, you can do many useful things. For example, you can write a script for the command line that sequentially visits website pages, exports the content to HTML files and passes it to a parser for analysis.
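As an illustration, here is roughly what that could look like. The binary name varies by platform (google-chrome, chromium, chrome.exe), and the URLs are placeholders:

# Dump the rendered DOM of a page to an HTML file
google-chrome --headless --dump-dom "https://site.com/index.html" > page.html

# Export the page to PDF
google-chrome --headless --print-to-pdf=page.pdf "https://site.com/index.html"

# Take a screenshot of the page
google-chrome --headless --screenshot=page.png "https://site.com/index.html"

# Visit several pages in sequence and save each one for later parsing
for url in "https://site.com/a.html" "https://site.com/b.html"; do
  google-chrome --headless --dump-dom "$url" > "$(basename "$url").html"
done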

For more complex and advanced automation algorithms, however, it's better to use special libraries discussed below.

You can read more about the capabilities of the command line for Chrome in the official browser documentation.

Chrome-Remote-Interface Library

This is a low-level library distributed as an npm package. It uses the debugging interface (the DevTools protocol) of the original Chrome browser and exposes many fine-grained operations.

In fact, many higher-level libraries and utilities build their commands on top of this very library, and you can also use it directly when writing your own parsers.

Chrome-Remote-Interface keeps you as close to the metal as possible. However, the library has its downsides: for one, it cannot launch a browser instance on its own. For that, additional tools like ChromeLauncher are required.
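A hedged sketch of direct chrome-remote-interface usage - it assumes Chrome is already running with --remote-debugging-port=9222, precisely because the library won't launch it for you:

// npm i chrome-remote-interface

const CDP = require("chrome-remote-interface");

(async () => {
  // Connect to an already running Chrome instance (localhost:9222 by default)
  const client = await CDP();
  const { Page, Runtime } = client;

  await Page.enable();
  await Page.navigate({ url: "https://site.com/index.html" });
  await Page.loadEventFired();

  // Evaluate an expression inside the page and print the rendered HTML
  const result = await Runtime.evaluate({
    expression: "document.documentElement.outerHTML",
  });
  console.log(result.result.value);

  await client.close();
})();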

Puppeteer + Headless Chrome

Puppeteer is a young yet extremely ambitious project. The key point is that the library is developed and maintained by the Chrome team. This is a Node.js library that implements a high-level API interface to work with headless Google Chrome.

It boasts simple and understandable syntax, a vast array of extensions and integrations, support for deployment in Docker containers and high-quality documentation.

Puppeteer's drawbacks include the lack of audio/video support, software-only rendering and the absence of features for managing browser bookmarks and passwords.
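For a taste of the API beyond scraping (a full scraping example follows later in this article), here's a short sketch using two documented Puppeteer calls, page.screenshot() and page.pdf(); the URL and file names are placeholders:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://site.com/index.html", { waitUntil: "networkidle0" });

  // Save a full-page screenshot and an A4 PDF of the rendered page
  await page.screenshot({ path: "page.png", fullPage: true });
  await page.pdf({ path: "page.pdf", format: "A4" });

  await browser.close();
})();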

Selenium (Headless Chrome + Headless Firefox)

Selenium is a powerful and comprehensive solution that allows the use of a full-fledged Chrome browser in automated scripts. Note that Selenium requires a browser-specific driver implementing the WebDriver protocol - ChromeDriver in Chrome's case - to operate. Such drivers control a full-fledged browser rather than a purpose-built headless one; in effect, they are what turns a regular browser into a headless one.

Since drivers can be added to different browsers, not only Chrome, Selenium is compatible with Firefox, Opera, IE etc.

On top of Selenium, higher-level solutions can be used, such as libraries like WebDriverIO. As a result, a ready-to-use high-level API can be obtained.
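For reference, a minimal sketch with the plain selenium-webdriver npm package. It assumes Chrome and a matching ChromeDriver are installed; the --headless=new flag applies to recent Chrome versions (older ones use plain --headless), and the URL is a placeholder:

// npm i selenium-webdriver

const { Builder, By } = require("selenium-webdriver");
const chrome = require("selenium-webdriver/chrome");

(async () => {
  // Turn a regular Chrome into a headless one via a command-line flag
  const options = new chrome.Options().addArguments("--headless=new");
  const driver = await new Builder()
    .forBrowser("chrome")
    .setChromeOptions(options)
    .build();

  try {
    await driver.get("https://site.com/index.html");
    // Collect the text of every H1 heading on the page
    for (const el of await driver.findElements(By.css("h1"))) {
      console.log(await el.getText());
    }
  } finally {
    await driver.quit();
  }
})();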

Selenium has remained the most popular solution for organizing headless browsers for quite a long time. This is not because it is unique, but because the developers have approached the issue in a comprehensive manner. It includes a ready-made server implementation, an IDE (Integrated Development Environment to quickly write automation scripts), drivers for practically any browser version (from Android applications to Internet Explorer and Safari) as well as Selenium Grid (an environment for bulk management of a large number of headless browsers).

This provides a hint as to what most anti-detection browsers are based on.

HtmlUnit

HtmlUnit is one of the few attempts to create a solution capable of processing and rendering JS scripts without using full-fledged browsers.

It is written in Java and is used as a headless browser within other high-level testing and parsing solutions such as JWebUnit, Canoo WebTest, Serenity BDD, WebDriver, Arquillian Drone etc.

Interestingly, HtmlUnit can emulate various browsers and work through a proxy. It even includes a built-in parser and basic tools for emulating user behavior, such as following links, submitting forms, authentication etc.

A Few Words About Anti-Detect Browsers

There are not many people willing to create their own browsers from scratch. For this very reason, Microsoft discontinued Internet Explorer and openly moved its Edge browser to a competitor's engine, Chromium.

Anti-detect browsers are essentially a combination of various existing solutions, rather than standalone headless browsers. Most of them are based on a combination of Chromium and Selenium, along with other libraries. This is why many anti-detect browsers offer ready-made APIs for automation or instructions for integration through WebDriver/ChromeDriver.

Consequently, anti-detect browsers should not be considered as independent headless browsers but rather as pre-assembled packages designed for specific tasks.

There are also less common solutions for parsing JavaScript-heavy websites, such as Splinter, PhantomJS or Nightmare JS, but they are either abandoned or lack the level of support of the solutions above. Playwright is a notable exception: it is an actively maintained, Microsoft-backed alternative to Puppeteer.

Scraping with a Headless Browser

Let's walk through a simple example of web scraping in JavaScript using Puppeteer (in conjunction with headless Chrome).

// Install Puppeteer using the npm package manager:
// npm i puppeteer

const puppeteer = require("puppeteer");

(async () => {
  // Create and launch a new headless browser instance
  const browser = await puppeteer.launch({
    headless: true,
  });

  // Open a new page
  const page = await browser.newPage();

  // Navigate to the page at https://site.com/index.html
  await page.goto("https://site.com/index.html", {
    // Wait until the initial HTML document has loaded and been parsed
    waitUntil: "domcontentloaded",
  });

  // Run a JavaScript function in the context of the loaded page
  let data = await page.evaluate(() => {
    // Find all H1 headings inside <article> elements
    return Array.from(document.querySelectorAll("article h1")).map((el) => {
      return {
        // Read the title attribute of the heading's link
        title: el.querySelector("a").getAttribute("title"),

        // Read the link URL from the href attribute
        link: el.querySelector("a").getAttribute("href"),
      };
    });
  });

  // Output the collected data to the console
  console.log(data);

  // Close the browser instance
  await browser.close();
})();

This is just one example of web scraping with a headless browser.

You may use different programming languages and various sets of environments (headless browsers and related libraries) in your project. Therefore, the syntax, collected data and initial configurations may vary.

Puppeteer primarily works with JavaScript only, but other tools like Selenium can work with various programming languages, browser versions and platforms. You can even create your own web service for scraping specific websites or different types of websites.

For specific commands and API documentation, you can refer to the official documentation and communities of selected headless browsers or related libraries.

Advantages and Disadvantages of Headless Browsers

The main advantages of using headless browsers can be summarized as follows:

  • Headless browsers are essential for interacting with modern web applications and JavaScript-driven websites.
  • With well-documented APIs, you can automate virtually any action, including simulating human-like behavior. This ability can help bypass even complex security systems.
  • Headless browsers lack a graphical user interface and real hardware rendering, which leads to faster page processing and content retrieval, and they consume fewer resources compared to full-fledged browsers.
  • With properly configured proxies, you can run multiple independent browser instances at a time, accelerating the scraping process (see the sketch after this list).
  • Many libraries designed for headless browsers are high-level, reducing the work required to develop your own parsers; sometimes you can even use built-in solutions without reinventing the wheel.
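Regarding the proxy point above, here's a hedged sketch of routing a Puppeteer instance through a proxy; the proxy address and credentials are placeholders, and each parallel instance would get its own:

const puppeteer = require("puppeteer");

(async () => {
  // Route all of this instance's traffic through a proxy
  const browser = await puppeteer.launch({
    args: ["--proxy-server=http://proxy.example.com:8080"],
  });

  const page = await browser.newPage();

  // Authenticate if the proxy requires credentials
  await page.authenticate({ username: "user", password: "pass" });

  await page.goto("https://site.com/index.html");
  await browser.close();
})();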

However, there are some drawbacks to using headless browsers:

  • While headless browsers are more resource-efficient compared to full browsers, they still consume a significant amount of resources, especially in a programmatic implementation. Each new page consumes almost as much memory and resources for rendering as a regular graphical browser.
  • Each headless browser library has its own unique features and nuances, requiring knowledge of their APIs and fundamental commands.
  • Advanced website security systems can detect and block headless browsers. For full-fledged parsing, you have to learn to circumvent such protection (e.g., by adjusting request structures, increasing delays etc.).
  • Headless browsers alone do not fully solve the problem of parsing JavaScript-heavy websites; they are only part of the solution. You still need to set up and configure the browser, write the parser and integrate the two.
  • Headless browsers render pages at a programmatic level only, which can result in specific challenges and unexpected issues.

Conclusion

Headless browsers are an effective and almost the only adequate solution for parsing dynamic websites that rely on AJAX and JavaScript technologies.

Google Chrome now ships with a built-in headless mode; in other cases, you'll need a specialized driver plus a library that implements an API on top of it, such as Selenium, to access the browser.

Therefore, to develop a comprehensive web scraper that takes into account all technical nuances and configurations, you will need to invest a significant amount of time and effort.

Anyway, if you require high-speed web scraping, it's essential to use proxies. You can acquire the best mobile and residential proxies from us.

Froxy offers a vast network of IP addresses with dynamic rotation, featuring over 8.5 million endpoints across 200+ locations. The accuracy of IP selection can go down to the level of the city and internet service provider. Payments are based on traffic packages only, not on specific proxies. You can use up to 1000 ports at a time.