The process of creating your own parser might seem quite complex and confusing for beginners, especially if you need to write a program from scratch. However, the availability of specialized libraries like Beautiful Soup (written in Python, an ideal programming language for beginners), notably simplifies the task.
And yes, don't let the library's name mislead you — we are not going to teach you “how to cook a beautiful soup.” This material has nothing to do with cooking, although… it is still a step-by-step guide you can follow!
Let’s get started.
Beautiful Soup is a Python web scraping library meant to speed up the creation of custom web page parsing utilities. It transforms HTML code into an indexed, dictionary-like structure that is easy to search.
The library's developers aimed to simplify programmers' lives, as parsing tasks frequently arise in technical projects. These tasks go beyond simply gathering competitor data or automating multi-account management.
The library was created and is actively maintained by Leonard Richardson, a Python developer, RESTful API design expert, and science fiction writer.
In short, Beautiful Soup converts a complex HTML document into a tree of Python objects. There are four main types of these objects: Tag, NavigableString, BeautifulSoup, and Comment.
You can apply various methods and attributes to these tags, significantly accelerating document processing.
For example:
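Below is a minimal sketch using a throwaway HTML snippet (not the sample page from this guide):

from bs4 import BeautifulSoup
# A tiny document just to demonstrate the object types
soup = BeautifulSoup('<p id="intro">Hello, <b>world</b>!</p>', "html.parser")
tag = soup.p                # a Tag object
print(tag.name)             # "p" - the tag's name
print(tag["id"])            # "intro" - attributes work like dictionary keys
print(tag.b.string)         # "world" - a NavigableString inside the nested <b> tag
print(tag.get_text())       # "Hello, world!" - all the text inside the tag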
To start scraping websites, you need to prepare your programming environment: install Python, the Beautiful Soup library, and several additional components before proceeding to the practical part, writing the script.
Let's break the process down into steps and go through each of them in detail.
Install Python. To do this, visit the official Python download page and download the latest stable version (3.13 as of this writing). Note that Beautiful Soup also supports older Python versions: Python 2.7+ and Python 3.2+.
Python can be installed on these operating systems: Windows, macOS, and Linux. Many popular distributions already come with Python pre-installed. If needed, Python can also be installed and used on mobile devices: iOS and Android.
Note! When installing Python on Windows, check the box labeled «Add python.exe to PATH.» This automatically adds the Python executable to your system's environment variables. Otherwise, you'll need to add Python to PATH manually in order to run it from the command line without specifying the full path every time (C:\Program Files\Python313\python.exe or C:\Users\YOUR_LOGIN\AppData\Local\Programs\Python\Python313\python.exe, depending on the installation type).
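To quickly verify that Python and pip are reachable from the command line, you can run, for example:
python --version
python -m pip --version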
The latest version of Beautiful Soup is 4.12 as of this writing. If needed, you can download the older Beautiful Soup 3, but keep in mind that it is no longer maintained or updated.
Since Beautiful Soup is distributed via the official PyPI repository, it is enough to run the following command in the terminal:
pip install beautifulsoup4
If the pip command returns an error, pip is probably not in your PATH. If you haven't added it yet, you can call it through Python instead:
python -m pip install beautifulsoup4
If you use the (now legacy) easy_install package manager, the command is:
easy_install beautifulsoup4
In Ubuntu/Debian Linux distributions, Beautiful Soup can be installed as a system package:
apt-get install python3-bs4 (for the Python 3+ version)
apt-get install python-bs4 (for the Python 2 version)
More experienced users can download the source code and manually install the library.
If support for alternative parsers matters to you, you can additionally install lxml and html5lib:
pip install lxml
pip install html5lib
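The parser is then selected by name in the second argument of the BeautifulSoup() constructor, for example:

from bs4 import BeautifulSoup
html = "<p>Some HTML</p>"
soup_builtin = BeautifulSoup(html, "html.parser")  # Python's built-in parser, no extra dependencies
soup_lxml = BeautifulSoup(html, "lxml")            # fast parser, requires the lxml package
soup_html5 = BeautifulSoup(html, "html5lib")       # parses the way a browser does, requires html5lib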
If you wish to interact with external servers, you will have to send and receive HTTP(S) requests. To simplify composing requests and HTTP headers, install the requests library:
pip install requests
Make sure that all Python executables and scripts stay in consistent locations. For example, some libraries place their scripts in C:\Users\YOUR_LOGIN\AppData\Roaming\Python\Python313\Scripts during installation. It is also worth checking from time to time that the paths listed in the PATH environment variable are correct.
You can also work with local HTML documents to test your parsers.
HTML is a markup language used to structure content on the web. It consists of special tags and attributes that browsers interpret to render the final web page version.
Here is the simplest HTML page sample:
<!DOCTYPE html>
<html>
<head>
<title> This is a web page title displayed in the search results and when hovering the mouse over the browser tab </title>
<meta charset="utf-8">
</head>
<body>
<h1> This is the header displayed in the page body </h1>
<p>
This is just a paragraph - any descriptive text can be inside…
</p>
<h2>Subheading: Types of Proxies</h2>
<ul id="thebestlist">
<li>Residential proxies</li>
<li>Mobile proxies</li>
<li>General proxies</li>
<li>Datacenter proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
You can open any text editor, copy the code above, and save the file in any folder on your computer. Make sure to replace the .txt extension with .html. If you open this file in the browser, you'll see a heading, a subheading, and a list of proxy types displayed neatly. No extra code there.
Tags like <head>, <body>, <p>, <ul>, <li>, etc. are needed only by the browser. Most tags consist of an opening and a closing element (<li>Tag content</li>), with the closing tag containing a slash. Sometimes, however, a tag consists of a single element, for example: <img src="images/some-img.png" alt="Alternative text">.
Styles and other attributes can be used inside tags. For example: <a href="https://site.name/index.html">This is a link</a>. Here, the linked page's URL is specified in the href attribute.
Styles, classes, and identifiers work in a similar way:
<div class="container large-box">
<!-- Any block. By the way, this is how the comments are written in HTML. They are not displayed to users in the browser, yet they are visible in the code -->
</div>
Our div block has two CSS classes: «container» and «large-box». Using a special syntax, CSS rules describe how such blocks should look.
For example:
<style>
.container {
background-color: yellow;
font-size: 18px;
}
</style>
Styles can be described directly in the HTML code between the <style>…</style> tags. Another option is to link an external stylesheet file (for example, <link rel="stylesheet" type="text/css" href="https://site.name/styles.css">). Less commonly, styles are written directly inside a tag (inline).
If you wish to dive deeper into HTML, CSS, and JavaScript, you can explore the official specifications or take dedicated courses.
Our current task is to understand how a parser works, so let's stick to the code above.
Create a folder for your Python scripts. Let it be «C:\My-first-parser.»
Copy the HTML content we provided in Step 2 and save it as sample.html in the folder with your Python scripts: C:\My-first-parser\sample.html.
Create your first script in the same folder:
from bs4 import BeautifulSoup

with open('sample.html', 'r') as file:
    content = file.read()

soup = BeautifulSoup(content, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
This is how the Beautiful Soup library sees your document. Having indexed all HTML elements, it builds a ready-made tree from them. Once you know the Beautiful Soup syntax, picking out specific elements from this tree becomes very easy.
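For instance, with sample.html loaded into the soup variable as above, you can pull individual elements straight from this tree (a short sketch using standard navigation attributes):

print(soup.title.string)          # the page title text
print(soup.h1.text)               # the text of the <h1> heading
print(soup.ul["id"])              # the id attribute of the <ul> list ("thebestlist")
print(len(soup.find_all("li")))   # how many list items the page contains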
In the sample above, we read a ready-made file. Here is how you can address a real website:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://news.ycombinator.com/")

if response.status_code == 200:
    html_content = response.content
    print(html_content)
else:
    print(response)

soup = BeautifulSoup(html_content, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
As soon as you run the script, the entire HTML code of the web page will be displayed in the console, followed by the names of all HTML elements once the document has been processed by the Beautiful Soup library.
Let’s find out what the program has done:
import requests
from bs4 import BeautifulSoup
Here we import the requests and bs4 (Beautiful Soup) libraries.
response = requests.get("https://news.ycombinator.com/")
We request a specific page, «https://news.ycombinator.com/». The requests.get() method comes from the requests library and returns the server's response.
We further analyze the server response.
if response.status_code == 200:
If the response code is 200 (meaning the server is up and the requested page exists), we extract the HTML content and store it in the html_content variable.
html_content = response.content
To make sure the variable is filled, we display it in the console:
print(html_content)
So far, however, we have not used Beautiful Soup. Let's create the soup variable by calling the BeautifulSoup() constructor, passing it the HTML content and asking it to parse the document.
soup = BeautifulSoup(html_content, "html.parser")
The .descendants attribute iterates over all tags nested inside the <html> root tag. We only need to print the names of these tags one by one, which is done in the following loop:
for child in soup.descendants:
    if child.name:
        print(child.name)
Let's complicate the task: we will output the page title, count all the links, and parse only the text.
Copy the code below and replace the contents of your script file in the C:\My-first-parser folder.
import requests
from bs4 import BeautifulSoup
# Request the HTML content of the page and assign it to the variable html_content.
# If the server response code is different from 200, print only the response.
response = requests.get("https://news.ycombinator.com/")
if response.status_code == 200:
    html_content = response.content
else:
    print(response)
# Pass the HTML content to the BeautifulSoup library for analysis.
soup = BeautifulSoup(html_content, "html.parser")
# Print the entire <title> tag.
print(soup.title)
# The same, but without the surrounding tags – only the content.
print(soup.title.string)
# Find all links on the page (using the <a> tag) and count them.
all_links = len(soup.find_all("a"))
print(f"{all_links} links were found on the page")
# Print all the text from the page.
print(soup.get_text())
Since the website might not be working or its content could change, let's return to our HTML file (to ensure a guaranteed result).
We'll try to find a list (<ul> tag) with the identifier “thebestlist” and then output all the list items (they have the <li> tag).
Script code:
from bs4 import BeautifulSoup

# Open our HTML file and pass its content to the variable htmlcontent.
with open('sample.html', 'r') as file:
    htmlcontent = file.read()

soup = BeautifulSoup(htmlcontent, "html.parser")

# Find the <ul> tag with the attribute id="thebestlist".
# Note: If there are multiple matching elements, only the first one will be printed.
# Since there is a single list, it will be printed as a whole with all its items.
print(soup.find('ul', attrs={'id': 'thebestlist'}))

# Iterate through all found <li> tags (list items) and print their text content.
for tag in soup.find_all('li'):
    print(tag.text)
Congratulations, your first parser works!
We can try something more complicated:
print(soup.select_one('body ul li:nth-of-type(3)'))
This command will print the third item in the list.
print(soup.select_one('ul:nth-of-type(1) > li:nth-child(2)'))
This one will print the second <li> element in the first <ul> list found.
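If you need every matching element rather than just the first one, the select() method accepts the same CSS selector syntax. A short sketch based on the same sample.html:

for item in soup.select('ul#thebestlist li'):
    print(item.text)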
Listed below is the example code for multiple-page parsing. If request attempts fail, the parser will retry several times. The result will be saved in a JSON file hn_articles.json in the user's folder (C:\Users\YOUR_USER):
import requests
from bs4 import BeautifulSoup
import time
import json

# Initiate the variables
is_scraping = True  # Parser status, True = active
current_page = 1  # Current page, incremented by 1 on each iteration
scraped_data = []  # Array for storing data
max_retries = 3  # Maximum number of retries

print("Parsing news from Hacker News started...")

# Loop through pages until parsing is complete
while is_scraping:
    try:
        # Make a request to the page
        response = requests.get(
            f"https://news.ycombinator.com/?p={current_page}")  # start with 1 and increase the index one by one
        html_content = response.content
        print(f"Parsing page {current_page}: {response.url}")

        # Parse HTML content with BeautifulSoup
        soup = BeautifulSoup(html_content, "html.parser")

        # Find all elements with the class "athing"
        articles = soup.find_all(class_="athing")

        # If content exists, then ...
        if articles:
            # Extract data from the page
            for article in articles:
                article_data = {
                    "URL": article.find(class_="titleline").find("a").get("href"),  # find the "titleline" element and take only the href attribute of its link
                    "Title": article.find(class_="titleline").getText(),  # find the "titleline" element and extract its text
                    "Rank": article.find(class_="rank").getText().replace(".", ""),  # find the "rank" element, extract its text, and strip the trailing dot
                }
                # Append new data to the list
                scraped_data.append(article_data)
            print(f"Data from page {current_page} extracted successfully, found {len(articles)} articles.")

        # Check for a "next page" link (based on the "morelink" class)
        next_page_link = soup.find(class_="morelink")

        # If the link exists, increase the page counter so the next iteration requests the following page
        if next_page_link:
            current_page += 1
        # If there is no link, set the parsing status to False and save the results
        else:
            is_scraping = False
            print(f"Parsing completed! Processed: {current_page} pages")
            print(f"Total articles found: {len(scraped_data)}")

            # Save data to a JSON file (use utf-8 encoding for Linux, cp1251 for Windows)
            with open("hn_articles.json", "w", encoding="cp1251") as jsonfile:
                json.dump(scraped_data, jsonfile, indent=4)

    except requests.exceptions.RequestException as e:
        print(f"Error parsing page {current_page}: {e}")

        # Retry logic with delays between attempts
        for attempt in range(1, max_retries + 1):
            print(f"Retrying page {current_page} (attempt {attempt}/{max_retries})...")
            time.sleep(2)  # Change the delay, if needed
            try:
                response = requests.get(
                    f"https://news.ycombinator.com/?p={current_page}")
                break  # Successful request, keep parsing
            except requests.exceptions.RequestException:
                pass  # Continue retrying until max attempts are reached
        else:
            print(f"Gave up after {max_retries} retries for page {current_page}")

    # Add a delay between requests
    time.sleep(2)
The most challenging task in Beautiful Soup web scraping is parsing dynamic websites. Such websites do not have a final HTML structure; instead, they contain only partial HTML with references to JS scripts. All the content is rendered directly in the browser after executing the JavaScript code.
Beautiful Soup cannot execute JavaScript, so you'll need a fully-featured browser to render the page first and then pass the resulting HTML code to the library.
To make the task even more complex, we'll export an array of the extracted elements into a CSV file (a familiar tabular format for many users).
Here is how the integration of Beautiful Soup with a headless browser and CSV export looks:
First, install Selenium (you could also use other browser automation tools, such as Playwright or Puppeteer, which are primarily designed for website testing).
pip install selenium
Next, install Pandas for working with CSV files:
pip install pandas
The final script will look like this:
# Connect libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as panda

# Open Chrome in headless mode (no visible browser window)
options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
# Pass the page URL to the browser and get the content
driver.get("http://quotes.toscrape.com/js/")
# Save the resulting content to the variable js_content
js_content = driver.page_source
# Pass the content for analysis to BeautifulSoup
soup = BeautifulSoup(js_content, "html.parser")
# Find all <span> tags with the class 'text' and store them in the array 'quotes'
quotes = soup.find_all("span", class_="text")
# Extract the quote texts, pass them to Pandas, and write them to a 'quotes.csv' file. The 'cp1251' encoding ensures proper Cyrillic display on Windows
df = panda.DataFrame({"Quotes": [quote.get_text() for quote in quotes]})
df.to_csv('quotes.csv', index=False, encoding='cp1251')
# Print the array to the console
print(quotes)
This approach already feels more professional, yet the code remains compact.
The next challenge is proxies!
Many large websites and web services are protected against bots. Additionally, if you want to speed up data collection, you’ll need to use multiple parallel threads. None of this is possible without proxies.
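As for parallel threads, here is a minimal sketch (the page URLs are just an example) that fetches several pages at once using Python's built-in concurrent.futures together with requests and Beautiful Soup:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

# Example list of pages; replace it with the URLs you actually need
urls = [f"https://news.ycombinator.com/?p={page}" for page in range(1, 4)]

def fetch_title(url):
    # Download one page and return the text of its <title> tag
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.title.string

# Fetch the pages in parallel threads
with ThreadPoolExecutor(max_workers=3) as executor:
    for title in executor.map(fetch_title, urls):
        print(title)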
Different types of proxies are used for scraping.
We recommend rotating HTTP proxies with mobile or residential addresses. You connect them to your code once, and they rotate automatically through a large pool of addresses provided by services like Froxy.
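If you work with the requests library directly, a proxy can be passed via the proxies parameter (a minimal sketch; the address and credentials below are placeholders, substitute your own):

import requests

# Placeholder proxy endpoint; replace with your own address and credentials
proxy = "http://user:password@proxy.example.com:8080"
proxies = {"http": proxy, "https": proxy}

# The request goes out through the proxy; httpbin.io/ip echoes the IP it sees
response = requests.get("https://httpbin.io/ip", proxies=proxies)
print(response.text)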
Here is a code sample for the Selenium-based setup:
# Connect libraries
from selenium import webdriver
from selenium.webdriver.common.proxy import *
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
# Add your proxy settings here
myProxy = "188.114.98.233:80"
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': myProxy,
    'sslProxy': myProxy,
    'noProxy': ''})

# Set up WebDriver options for the browser with the proxy. Options may vary; for example, you can set your own user agent, etc.
options = Options()
options.proxy = proxy
# Launch the Chrome web driver with our options
driver = webdriver.Chrome(options=options)
# Pass the target page URL
driver.get("https://httpbin.io/ip")
# Save the resulting content to the content variable
content = driver.page_source
# Analyze the content with BeautifulSoup
soup = BeautifulSoup(content, "html.parser")
# Find the <pre> tag content
yourip = soup.find("pre")
# Extract the IP address
print(yourip.text)
# Release resources and close the browser
driver.quit()
If the proxy connection is successful, you will see its IP address in the following format:
{
"origin": "188.114.98.233:80"
}
This is only a part of the more advanced functionality. As your programming skills improve, consider exploring further techniques.
Here is a detailed guide on parsing websites without blocking.
The Beautiful Soup library significantly simplifies HTML parsing. While it doesn't support proxies or integration with headless browsers out of the box, these features can easily be added using other libraries.
The major strength of Beautiful Soup lies in its ability to accept a server response (with arbitrary HTML code) and transform it into a navigable tree of elements. Knowing the Beautiful Soup syntax lets you quickly locate specific tags, extract their contents, save them to files, or pass them to other parts of your program.
To build an efficient parser, it's essential to consider several technical details, such as cookies, digital fingerprints, delays between requests, multithreading, and proxy usage.
For high-quality proxies compatible with any parser, Froxy offers over 10 million IPs (including server, residential, and mobile). IP rotation can be configured by time or for every new request via the user dashboard, with targeting options up to the city and network operator level.