The process of creating your own parser might seem quite complex and confusing for beginners, especially if you need to write a program from scratch. However, the availability of specialized libraries like Beautiful Soup (written in Python, an ideal programming language for beginners), notably simplifies the task.
And yes, don't let the library's name mislead you — we are not going to teach you “how to cook a beautiful soup.” This material has nothing to do with cooking, although… it is still a step-by-step guide you can follow!
Let’s get started.
Beautiful Soup is a Python web scraping library meant to speed up the creation of custom web page parsing utilities. It transforms HTML code into an indexed, dictionary-like structure that is easy to search.
The library's developers aimed to simplify programmers' lives, as parsing tasks frequently arise in technical projects. These tasks go beyond simply gathering competitor data or automating multi-account management.
The library was created and is actively maintained by Leonard Richardson, a Python developer, RESTful API design expert, and science fiction writer.
In short, Beautiful Soup converts a complex HTML document into a tree of Python objects. There are four main types of these objects: Tag, NavigableString, BeautifulSoup, and Comment.
You can apply various methods and attributes to these tags, significantly accelerating document processing.
For example:
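Below is a minimal sketch using a throwaway HTML snippet (not the sample page from this guide):

from bs4 import BeautifulSoup
# A tiny document just to demonstrate the object types
soup = BeautifulSoup('<p id="intro">Hello, <b>world</b>!</p>', "html.parser")
tag = soup.p                # a Tag object
print(tag.name)             # "p" - the tag's name
print(tag["id"])            # "intro" - attributes work like dictionary keys
print(tag.b.string)         # "world" - a NavigableString inside the nested <b> tag
print(tag.get_text())       # "Hello, world!" - all the text inside the tag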
To start scraping websites, you need to prepare your programming environment: install Python, the Beautiful Soup library, and several additional components before proceeding to the practical part, writing the script.
Let's break the process down into steps and go through each of them in detail.
Install Python. To do this, visit the official Python download page and download the latest stable version (3.13 as of this writing). Note that Beautiful Soup also supports older Python versions: Python 2.7+ and Python 3.2+.
Python can be installed on these operating systems: Windows, macOS, and Linux. Many popular distributions already come with Python pre-installed. If needed, Python can also be installed and used on mobile devices: iOS and Android.
Note! When installing Python on Windows, check the box labeled «Add python.exe to PATH.» This automatically adds the Python executable to your system's environment variables. Otherwise, you'll need to add Python to PATH manually in order to run it from the command line without specifying the full path every time (C:\Program Files\Python313\python.exe or C:\Users\YOUR_LOGIN\AppData\Local\Programs\Python\Python313\python.exe, depending on the installation type).
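To quickly verify that Python and pip are reachable from the command line, you can run, for example:
python --version
python -m pip --version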
The latest version of Beautiful Soup is 4.12 as of this writing. If needed, you can download the older Beautiful Soup 3, but keep in mind that it is no longer maintained or updated.
Since Beautiful Soup is distributed via the official PyPI repository, it is enough to run the following command in the terminal:
pip install beautifulsoup4
If the pip command returns an error, pip is probably not in your PATH. If you haven't added it yet, you can call it through Python instead:
python -m pip install beautifulsoup4
If you use the (now legacy) easy_install package manager, the command is:
easy_install beautifulsoup4
In Ubuntu/Debian Linux distributions, Beautiful Soup can be installed as a system package:
apt-get install python3-bs4 (for the Python 3+ version)
apt-get install python-bs4 (for the Python 2 version)
More experienced users can download the source code and manually install the library.
If support for alternative parsers matters to you, you can additionally install lxml and html5lib:
pip install lxml
pip install html5lib
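The parser is then selected by name in the second argument of the BeautifulSoup() constructor, for example:

from bs4 import BeautifulSoup
html = "<p>Some HTML</p>"
soup_builtin = BeautifulSoup(html, "html.parser")  # Python's built-in parser, no extra dependencies
soup_lxml = BeautifulSoup(html, "lxml")            # fast parser, requires the lxml package
soup_html5 = BeautifulSoup(html, "html5lib")       # parses the way a browser does, requires html5lib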
If you wish to interact with external servers, you will have to send and receive HTTP(S) requests. To simplify composing requests and HTTP headers, install the requests library:
pip install requests
Make sure that all Python executables and scripts stay in consistent locations. For example, some libraries place their scripts in C:\Users\YOUR_LOGIN\AppData\Roaming\Python\Python313\Scripts during installation. It is also worth checking from time to time that the paths listed in the PATH environment variable are correct.
You can also work with local HTML documents to test your parsers.
HTML is a markup language used to structure content on the web. It consists of special tags and attributes that browsers interpret to render the final web page version.
Here is the simplest HTML page sample:
<!DOCTYPE html>
<html>
<head>
<title> This is a web page title displayed in the search results and when hovering the mouse over the browser tab </title>
<meta charset="utf-8">
</head>
<body>
<h1> This is the header displayed in the page body </h1>
<p>
This is just a paragraph - any descriptive text can be inside…
</p>
<h2>Subheading: Types of Proxies</h2>
<ul id="thebestlist">
<li>Residential proxies</li>
<li>Mobile proxies</li>
<li>General proxies</li>
<li>Datacenter proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
You can open any text editor, copy the code above, and save the file in any folder on your computer. Make sure to replace the .txt extension with .html. If you open this file in the browser, you'll see a heading, a subheading, and a list of proxy types displayed neatly. No extra code there.
Tags like <head>, <body>, <p>, <ul>, <li>, etc. are needed only by the browser. Most tags consist of an opening and a closing element (<li>Tag content</li>), with the closing tag containing a slash. Sometimes, however, a tag consists of a single element, for example: <img src="images/some-img.png" alt="Alternative text">.
Styles and other attributes can be used inside tags. For example: <a href="https://site.name/index.html">This is a link</a>. Here, the linked page's URL is specified in the href attribute.
Styles, classes, and identifiers work in a similar way:
<div class="container large-box">
<!-- Any block. By the way, this is how the comments are written in HTML. They are not displayed to users in the browser, yet they are visible in the code -->
</div>
Our div block has two CSS classes: «container» and «large-box». Using a special syntax, CSS rules describe how such blocks should look.
For example:
<style>
.container {
background-color: yellow;
font-size: 18px;
}
</style>
Styles can be described directly in the HTML code between the <style>…</style> tags. Another option is to link an external stylesheet file (for example, <link rel="stylesheet" type="text/css" href="https://site.name/styles.css">). Less commonly, styles are written directly inside a tag (inline).
If you wish to dive deeper into HTML, CSS, and JavaScript, you can explore the official specifications or take dedicated courses.
Our current task is to understand how a parser works, so let's stick to the code above.
Create a folder for your Python scripts. Let it be «C:\My-first-parser.»
Copy the HTML content we provided in Step 2 and save it as sample.html in the folder with your Python scripts: C:\My-first-parser\sample.html.
Create your first script in the same folder:
from bs4 import BeautifulSoup

with open('sample.html', 'r') as file:
    content = file.read()

soup = BeautifulSoup(content, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
This is how the Beautiful Soup library sees your document. Having indexed all HTML elements, it builds a ready-made tree from them. Once you know the Beautiful Soup syntax, picking out specific elements from this tree becomes very easy.
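For instance, with sample.html loaded into the soup variable as above, you can pull individual elements straight from this tree (a short sketch using standard navigation attributes):

print(soup.title.string)          # the page title text
print(soup.h1.text)               # the text of the <h1> heading
print(soup.ul["id"])              # the id attribute of the <ul> list ("thebestlist")
print(len(soup.find_all("li")))   # how many list items the page contains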
In the sample above, we read a ready-made file. Here is how you can address a real website:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://news.ycombinator.com/")

if response.status_code == 200:
    html_content = response.content
    print(html_content)
else:
    print(response)

soup = BeautifulSoup(html_content, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
As soon as you run the script, the entire HTML code of the web page will be displayed in the console, followed by the names of all HTML elements once the document has been processed by the Beautiful Soup library.
Let’s find out what the program has done:
import requests
from bs4 import BeautifulSoup
Here we import the requests and bs4 (Beautiful Soup) libraries.
response = requests.get("https://news.ycombinator.com/")
We request a specific page, «https://news.ycombinator.com/». The requests.get() method comes from the requests library and returns the server's response.
We further analyze the server response.
if response.status_code == 200:
If the response code is 200 (meaning the server is up and the requested page exists), we extract the HTML content and store it in the html_content variable.
html_content = response.content
To make sure the variable is filled, we display it in the console:
print(html_content)
So far, however, we have not used Beautiful Soup. Let's create the soup variable by calling the BeautifulSoup() constructor, passing it the HTML content and asking it to parse the document.
soup = BeautifulSoup(html_content, "html.parser")
The .descendants attribute iterates over all tags nested inside the <html> root tag. We only need to print the names of these tags one by one, which is done in the following loop:
for child in soup.descendants:
    if child.name:
        print(child.name)
Let's complicate the task: we will output the page title, count all the links, and parse only the text.
Copy the code below and replace the contents of your script file in the C:\My-first-parser folder.
import requests
from bs4 import BeautifulSoup
# Request the HTML content of the page and assign it to the variable html_content.
# If the server response code is different from 200, print only the response.
response = requests.get("https://news.ycombinator.com/")
if response.status_code == 200:
    html_content = response.content
else:
    print(response)
# Pass the HTML content to the BeautifulSoup library for analysis.
soup = BeautifulSoup(html_content, "html.parser")
# Print the entire <title> tag.
print(soup.title)
# The same, but without the surrounding tags – only the content.
print(soup.title.string)
# Find all links on the page (using the <a> tag) and count them.
all_links = len(soup.find_all("a"))
print(f"{all_links} links were found on the page")
# Print all the text from the page.
print(soup.get_text())
Since the website might not be working or its content could change, let's return to our HTML file (to ensure a guaranteed result).
We'll try to find a list (<ul> tag) with the identifier “thebestlist” and then output all the list items (they have the <li> tag).
Script code:
from bs4 import BeautifulSoup

# Open our HTML file and pass its content to the variable htmlcontent.
with open('sample.html', 'r') as file:
    htmlcontent = file.read()

soup = BeautifulSoup(htmlcontent, "html.parser")

# Find the <ul> tag with the attribute id="thebestlist".
# Note: If there are multiple matching elements, only the first one will be printed.
# Since there is a single list, it will be printed as a whole with all its items.
print(soup.find('ul', attrs={'id': 'thebestlist'}))

# Iterate through all found <li> tags (list items) and print their text content.
for tag in soup.find_all('li'):
    print(tag.text)
Congratulations, your first parser works!
We can try something more complicated:
print(soup.select_one('body ul li:nth-of-type(3)'))
This command will print the third item in the list.
print(soup.select_one('ul:nth-of-type(1) > li:nth-child(2)'))
This one will print the second <li> element in the first <ul> list found.
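If you need every matching element rather than just the first one, the select() method accepts the same CSS selector syntax. A short sketch based on the same sample.html:

for item in soup.select('ul#thebestlist li'):
    print(item.text)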
Listed below is the example code for multiple-page parsing. If request attempts fail, the parser will retry several times. The result will be saved in a JSON file hn_articles.json in the user's folder (C:\Users\YOUR_USER):
import requests
from bs4 import BeautifulSoup
import time
import json

# Initiate the variables
is_scraping = True  # Parser status, True = active
current_page = 1  # Current page, incremented by 1 on each iteration
scraped_data = []  # Array for storing data
max_retries = 3  # Maximum number of retries

print("Parsing news from Hacker News started...")

# Loop through pages until parsing is complete
while is_scraping:
    try:
        # Make a request to the page
        response = requests.get(
            f"https://news.ycombinator.com/?p={current_page}")  # start with 1 and increase the index one by one
        html_content = response.content
        print(f"Parsing page {current_page}: {response.url}")

        # Parse HTML content with BeautifulSoup
        soup = BeautifulSoup(html_content, "html.parser")

        # Find all elements with the class "athing"
        articles = soup.find_all(class_="athing")

        # If content exists, then ...
        if articles:
            # Extract data from the page
            for article in articles:
                article_data = {
                    "URL": article.find(class_="titleline").find("a").get("href"),  # find the "titleline" element and take only the href attribute of its link
                    "Title": article.find(class_="titleline").getText(),  # find the "titleline" element and extract its text
                    "Rank": article.find(class_="rank").getText().replace(".", ""),  # find the "rank" element, extract its text, and strip the trailing dot
                }
                # Append new data to the list
                scraped_data.append(article_data)
            print(f"Data from page {current_page} extracted successfully, found {len(articles)} articles.")

        # Check for a "next page" link (based on the "morelink" class)
        next_page_link = soup.find(class_="morelink")

        # If the link exists, increase the page counter so the next iteration requests the following page
        if next_page_link:
            current_page += 1
        # If there is no link, set the parsing status to False and save the results
        else:
            is_scraping = False
            print(f"Parsing completed! Processed: {current_page} pages")
            print(f"Total articles found: {len(scraped_data)}")

            # Save data to a JSON file (use utf-8 encoding for Linux, cp1251 for Windows)
            with open("hn_articles.json", "w", encoding="cp1251") as jsonfile:
                json.dump(scraped_data, jsonfile, indent=4)

    except requests.exceptions.RequestException as e:
        print(f"Error parsing page {current_page}: {e}")

        # Retry logic with delays between attempts
        for attempt in range(1, max_retries + 1):
            print(f"Retrying page {current_page} (attempt {attempt}/{max_retries})...")
            time.sleep(2)  # Change the delay, if needed
            try:
                response = requests.get(
                    f"https://news.ycombinator.com/?p={current_page}")
                break  # Successful request, keep parsing
            except requests.exceptions.RequestException:
                pass  # Continue retrying until max attempts are reached
        else:
            print(f"Gave up after {max_retries} retries for page {current_page}")

    # Add a delay between requests
    time.sleep(2)
The most challenging task in Beautiful Soup web scraping is parsing dynamic websites. Such websites do not have a final HTML structure; instead, they contain only partial HTML with references to JS scripts. All the content is rendered directly in the browser after executing the JavaScript code.
Beautiful Soup cannot execute JavaScript, so you'll need a fully-featured browser to render the page first and then pass the resulting HTML code to the library.
To make the task even more complex, we'll export an array of the extracted elements into a CSV file (a familiar tabular format for many users).
Here is how the integration of Beautiful Soup with a headless browser and CSV export looks:
First, install Selenium (you could also use other browser automation tools, such as Playwright or Puppeteer, which are primarily designed for website testing).
pip install selenium
Next, install Pandas for working with CSV files:
pip install pandas
The final script will look like this:
# Connect libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as panda

# Open Chrome in headless mode (no visible browser window)
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
# Pass the page URL to the browser and get the content
driver.get("http://quotes.toscrape.com/js/")
# Save the resulting content to the variable js_content
js_content = driver.page_source
# Pass the content for analysis to BeautifulSoup
soup = BeautifulSoup(js_content, "html.parser")
# Find all <span> tags with the class 'text' and store them in the array 'quotes'
quotes = soup.find_all("span", class_="text")
# Extract the quote texts, pass them to Pandas, and write them to a 'quotes.csv' file. The 'cp1251' encoding ensures proper Cyrillic display on Windows
df = panda.DataFrame({"Quotes": [quote.get_text() for quote in quotes]})
df.to_csv('quotes.csv', index=False, encoding='cp1251')
# Print the array to the console
print(quotes)
This approach already feels more professional, yet the code remains compact.
The next challenge is proxies!
Many large websites and web services are protected against bots. Additionally, if you want to speed up data collection, you’ll need to use multiple parallel threads. None of this is possible without proxies.
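As for parallel threads, here is a minimal sketch (the page URLs are just an example) that fetches several pages at once using Python's built-in concurrent.futures together with requests and Beautiful Soup:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# Example list of pages; replace it with the URLs you actually need
urls = [f"https://news.ycombinator.com/?p={page}" for page in range(1, 4)]

def fetch_title(url):
    # Download one page and return the text of its <title> tag
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.title.string

# Fetch the pages in parallel threads
with ThreadPoolExecutor(max_workers=3) as executor:
    for title in executor.map(fetch_title, urls):
        print(title)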
Different types of proxies are used for scraping.
We recommend rotating HTTP proxies with mobile or residential addresses. You connect them to your code once, and they rotate automatically through a large pool of addresses provided by services like Froxy.
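If you work with the requests library directly, a proxy can be passed via the proxies parameter (a minimal sketch; the address and credentials below are placeholders, substitute your own):

import requests

# Placeholder proxy endpoint; replace with your own address and credentials
proxy = "http://user:password@proxy.example.com:8080"
proxies = {"http": proxy, "https": proxy}

# The request goes out through the proxy; httpbin.io/ip echoes the IP it sees
response = requests.get("https://httpbin.io/ip", proxies=proxies)
print(response.text)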
Here is a code sample for the Selenium-based setup:
# Connect libraries
from selenium import webdriver
from selenium.webdriver.common.proxy import *
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
# Add your proxy settings here
myProxy = "188.114.98.233:80"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': myProxy,
    'sslProxy': myProxy,
    'noProxy': ''})

# Set up WebDriver options for the browser with the proxy. Options may vary; for example, you can set your own user agent, etc.
options = Options()
options.proxy = proxy
# Launch the Chrome web driver with our options
driver = webdriver.Chrome(options=options)
# Pass the target page URL
driver.get("https://httpbin.io/ip")
# Save the resulting content to the content variable
content = driver.page_source
# Analyze the content with BeautifulSoup
soup = BeautifulSoup(content, "html.parser")
# Find the <pre> tag content
yourip = soup.find("pre")
# Extract the IP address
print(yourip.text)
# Release resources and close the browser
driver.quit()
If the proxy connection is successful, you will see its IP address in the following format:
{
"origin": "188.114.98.233:80"
}
This is only a part of the more advanced functionality. As your programming skills improve, consider exploring further techniques.
Here is a detailed guide on parsing websites without blocking.
The Beautiful Soup library significantly simplifies HTML parsing. While it doesn't support proxies or integration with headless browsers out of the box, these features can easily be added using other libraries.
The major strength of Beautiful Soup lies in its ability to accept a server response (with arbitrary HTML code) and transform it into a navigable tree of elements. Knowing the Beautiful Soup syntax lets you quickly locate specific tags, extract their contents, save them to files, or pass them to other parts of your program.
To build an efficient parser, it's essential to consider several technical details, such as cookies, digital fingerprints, delays between requests, multithreading, and proxy usage.
For high-quality proxies compatible with any parser, Froxy offers over 10 million IPs (including server, residential, and mobile). IP rotation can be configured by time or for every new request via the user dashboard, with targeting options up to the city and network operator level.