
A Comprehensive Guide to Scraping Websites for Emails

Written by Team Froxy | Jul 25, 2024 9:00:00 AM

First, a quick reminder: parsing (web scraping) is the collection of data posted on web pages and its transformation into a form suitable for analysis and further work. Scraping emails from websites is quite a common task. Email addresses may be required for mailings, updating contact databases, adding to blacklists (trap mailboxes) and many other purposes.

For more information, see our article on what parsing is.

Below, we will discuss the technical nuances of the scraping process: what needs to be done, whether it makes sense to write your own email parser, whether there are ready-made online tools and which libraries and services might be useful to you.

Let's get started.

Why Do You Need Email Scraping

The most common practical application of email scrapers is the creation of databases for conducting advertising (marketing) mailings.

However, this is where most problems arise. Mailings without user consent are prohibited, and a company caught violating this rule may face serious fines. Such regulations are enforced in all developed countries.

Nevertheless, entrepreneurs sometimes try to save on creating full-fledged marketing campaigns and consciously break the law. Popular email services have learned to identify such mailings and quickly block or mark them as spam.

There are also more legitimate use cases for email scrapers:

  • Searching for contact information of counterparties (partners, suppliers, manufacturers etc.);
  • Creating a unified email database from scattered pages and sources within a single company (for example, scraping emails from the company's own websites);
  • Gathering additional information about clients and subscribers: the websites they register on, the topics and categories they are interested in, their behavior etc. The email or phone number serves as an identifier when browsing the web;
  • Timely detection of data leaks on the Darknet (to enhance the security of your personal data and/or the personal data of clients);
  • Converting data from one format to another (for systematization and convenient search/indexing);
  • Increasing the efficiency of B2B contacts: for example, by searching for available contacts of a potential client by corporate domain.

Scraping is not always associated with malicious hackers and spam.

How Email Scraping Works

An email is an electronic mail address. Despite the widespread use of messengers and social networks, email remains one of the most in-demand communication channels. Additionally, email addresses serve as customer identifiers in popular search engines, websites and online services.

First and foremost, an email is contact information. Naturally, businesses are very interested in collecting new contacts from the web.

Email addresses have a specific structure. An email consists of three components:

  • Login (username or nickname): This must be unique within the scope of a single email domain, i.e., within a single email service;
  • Separator: This is always the "@" symbol;
  • Email domain: This is a regular domain name. Anyone who owns a domain can create their own email boxes. However, most users on the web use ready-made email services like Gmail, Yahoo Mail, Outlook, Proton Mail etc. These services operate on domains they own, like @gmail.com etc.
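
As a tiny illustration of this structure, here is how an address splits into its components in Python (the address itself is made up):

email = "user.name@example.com"  # a made-up address for illustration
login, domain = email.split("@", 1)  # "@" is always the separator
print(login)   # user.name
print(domain)  # example.com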

Sometimes you may encounter clickable links with emails, organized using special attributes in the HTML code, like <a href="mailto:EMAIL-HERE">Link text here</a>.

There is even a collection of special standards for email addresses that describe all technical details and limitations – a series of RFC standards related to DNS, POP3, SMTP and IMAP protocols.

To find all email addresses on a page, you need to analyze the text (or the source HTML code) of the page and extract the appropriate sections using a special pattern.

As you can see, the most distinctive email attribute is the "@" symbol. The domain component is a useful signal as well: for example, every word that includes the "@" symbol can be checked for known domain zones like .com, .net etc. The matching strings will be email addresses. However, this is an advanced practice; most often, a regular expression check is sufficient.
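
As a rough sketch of that advanced heuristic, here is what a check against known domain zones might look like in Python (the TLD list below is a tiny placeholder; a real one would be much longer):

KNOWN_TLDS = (".com", ".net", ".org")  # placeholder list; extend as needed

def looks_like_email(word: str) -> bool:
    # Crude heuristic: contains "@" and ends with a known domain zone
    return "@" in word and word.lower().rstrip(".,;:").endswith(KNOWN_TLDS)

print(looks_like_email("info@example.com"))  # True
print(looks_like_email("@midnight"))         # False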

However, this is easier said than done. To start analyzing page content, you first need to access it. Most modern websites use Ajax (loading content in response to user actions) or render content with JavaScript, assembling the final HTML directly in the browser from various fragments and scripts.

Thus, you need to use a full-fledged browser to load the page and then scrape the content. Special browsers that can be automated and connected to your scrapers are called headless browsers. If you need to work with a large number of accounts simultaneously on a single site or web service, you will need anti-detect browsers.
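
As a minimal sketch of this approach, here is how a page can be rendered with Playwright, one popular headless-browser library (the URL is a placeholder; Selenium or similar tools work the same way):

import re
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.your-site.com/contacts")  # placeholder URL
    html = page.content()  # final HTML after JavaScript has run
    browser.close()

emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}", html)
print(emails)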

Another challenge is that some websites and web applications may prohibit debugging and block the display of the source code. In such cases, creating screenshots and using full-fledged computer vision (for screen scraping) may be required.
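
For the screenshot route, here is a hedged sketch using pytesseract, one common OCR choice (it requires the Tesseract binary in addition to the Python package, and assumes you have already saved a page screenshot, e.g. with a headless browser):

import re
from PIL import Image
import pytesseract

# Recognize the text on a previously saved screenshot
text = pytesseract.image_to_string(Image.open("screenshot.png"))
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}", text)
print(emails)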

Large-scale projects dislike parasitic load and quickly identify automated requests to their pages. Bot detection has become a science of its own: some sites check for JavaScript support, others analyze the time between requests, still others check the IP against spam databases and blacklists etc.

See also: how to avoid the most common scraping mistakes and do everything like a pro.

As a result, the simplest task of finding text patterns no longer seems so simple.

It is not surprising that developers create special web services for extracting emails (online email scrapers), browser plugins, standalone desktop programs, server software, libraries, frameworks and provide separate APIs and more.

If you are interested in finding out how to scrape websites for emails, let's discuss the typical methods below.

Option 1. Email Scraping Using Ready-Made Software

Ready-made niche solutions include the following specialized software:

  • ePochta Extractor: A well-established tool that integrates flexibly with other programs from the same developer. It is paid but has a trial version. It can determine the country of the email owner, sort found emails by domain and export lists in various formats. However, additional software or services are required to verify that the emails actually exist, i.e., for validation.
  • LetsExtract Contact Extractor: A utility that is part of a suite of programs for email marketers. It can find and extract not only emails but also other contact details, such as phone numbers and Skype logins, from websites. The search topic can be set via search queries (i.e., keywords). The software runs only on Windows (desktop or server).
  • Email Extractor Pro: A product from an Italy-based team, with versions for Windows and macOS. It can search for and collect email addresses not only on the web but also on local devices.
  • ScrapeBox: A unique "mega-combine" that can scrape emails along with other required information. Email scraping is handled by a special free plugin.

These are only a few examples; the full list of such software could be very long. We have listed the products that are most frequently mentioned and regularly updated.

The main problem with any scraping software is losing compatibility with search engine results, which are crucial for selecting resources for further scanning. Lack of support for dynamic sites is another major issue: in most cases, there is no built-in headless browser under the hood, which leads to many errors when loading sites that rely on JavaScript.

The obvious advantages include quick and ready-made solutions that can search for emails and export them in a convenient format.

However, the more scanning threads you want to launch, the greater the need for methods to bypass blocks. Therefore, such software usually does not work without proxy integration: there is often an interface for loading large proxy lists or a built-in proxy manager (for rotating out non-functional addresses and checking their validity).

Option 2. Ready-Made Online Services

Here, we can highlight two subcategories: services that have already collected their own database of email addresses and are ready to share it for a fee, and services that scrape the required information on demand and return only the processed result.

The first type of online services includes:

  • Hunter (a huge database of corporate email addresses);
  • ZoomInfo (similar to Hunter);
  • Skrapp;
  • Findymail (allows checking addresses through LinkedIn contact databases);
  • AeroLeads etc.

In most cases, such services have many asterisks and footnotes in their terms of use. They operate as intermediaries and disclaim all risks associated with illegal mailings. Their main task is to verify that an email exists, thus increasing the chances of successful contact. However large their databases are, they cannot encompass everything; coverage is frequently a small segment of a specific niche or a particular site/platform.

The second type of online email scrapers includes genuine scrapers in the full sense of the word. Their advantages are obvious: you don't need to install, configure, integrate, monitor errors or bypass security systems yourself. Here, even proxies are built in.

The most accessible example is the set of Froxy scrapers: we offer solutions for scraping popular eCommerce platforms, social networks and search engines. No IP blocks; any locations and regions are available. You can create scheduled tasks and receive prompt notifications about task completion via webhooks.

Option 3. Writing Your Own Scraper

This approach is ideal if you have the capacity and resources to maintain your own script. The main advantage is code that is always under your control: it can be quickly updated and improved, so you never need to wait for updates from sluggish software vendors. You can scrape emails from websites and other services, optimize the scraper and add support for any technologies, including proxy rotation, JavaScript rendering etc.

The drawbacks include the need for specialized knowledge and skills. Naturally, proxies will also be required.

Below, we will consider an example of how to scrape emails using your own scraper written in Python.

Email Web Scraping Using Python

As mentioned above, an email address has a specific structure: username@domain.zone. Detecting such text patterns is best done via a regular expression (regex) like this one:

pattern_email = re.compile(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})')

By the way, web developers frequently use this kind of pattern to pre-validate forms and prevent users from entering invalid characters.

For detailed information on re.compile(), refer to the official Python documentation.
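
Here is a quick way to test the pattern on arbitrary text (the addresses below are made up):

import re

pattern_email = re.compile(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})')

sample = "Contact sales@example.com or support@example.org for details."
print(pattern_email.findall(sample))
# ['sales@example.com', 'support@example.org']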

To scrape HTML documents in Python, you also need the BeautifulSoup4 and HTTPX libraries.

To install them, use the pip installer:

pip install httpx bs4

Here is what the search code for all mailto links looks like (soup is a BeautifulSoup object built from the page HTML; the full script below shows the setup):

matches_email = soup.findAll("a", attrs={"href": re.compile("^mailto:")})

And here is what the search across the entire code looks like (without relying on mailto links):

pattern_email = re.compile(r'([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})')

matches_email = re.findall(pattern_email, page_html)

Note that a direct full-text search, not tied to links, takes longer to complete and requires more resources.

Practically, the lines above are the core of all subsequent scraping operations.

Here is what a sample of the final script written in Python may look like:

import httpx
from bs4 import BeautifulSoup
import re
import json

def parse_site(main_url: str):
    links = []
    response = httpx.get(url=main_url)
    bssoup = BeautifulSoup(response.text, "html.parser")
    # Collect the list of all website pages
    for link_box in bssoup.select("div.info-section.info-primary"):
        # Extract links found on each page and add them to the main website URL
        link = "https://www.your-site.com" + link_box.select_one("a").attrs["href"]
        links.append(link)
    return links

def parse_emails(links: list):
    emails = {}
    for link in links:
        # Send a GET request to each link in the list
        page_response = httpx.get(url=link)
        bssoup = BeautifulSoup(page_response.text, "html.parser")
        # Get the company name (contact name)
        company_name = bssoup.select_one("h1.dockable.business-name").text
        # Find all mailto links and take the email text from them
        for mailto_link in bssoup.findAll("a", attrs={"href": re.compile("^mailto:")}):
            # Get the email address from the mailto attribute
            email = mailto_link.get("href").replace("mailto:", "")
            # Create a new entry for the company if there isn't one yet
            if company_name not in emails:
                emails[company_name] = []
            emails[company_name].append(email)
    return emails

# Parse all links and add them to the list
links = parse_site("https://www.your-site.com/target-page.html")

# Collect all emails from the collected pages
emails = parse_emails(links)

# Output the result in JSON format
print(json.dumps(emails, indent=4))

In this case, all addresses found in mailto links are collected. The list will contain each email together with the company name it belongs to – a kind of ready-made contact database.

The company name is extracted from a container with a specific tag and class. In our example, the container is located via a specific chain of selectors. Therefore, if the structure of your target page differs (which is highly likely on a new website), you need to find and update these selectors for the script to work.
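
For example, if on your target site the company name sits in a different container (say, a hypothetical <h2 class="company-title">), only the selector needs to change:

# Hypothetical markup: <h2 class="company-title">Acme Inc.</h2>
company_name = bssoup.select_one("h2.company-title").text.strip()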

In our script, the BeautifulSoup library (see the official documentation) is responsible for analyzing the tag structure.

You may also need other Python web scraping libraries.

Best Practices for Email Scraping

Developers aware that their pages might be scraped for email addresses can take additional protective measures, including:

  • Obfuscation (emails are encoded using built-in obfuscation functions);
  • Concatenation (emails are assembled from pieces using JavaScript);
  • Tokenization (emails are replaced with tokens);
  • On-demand encryption (emails are fully encrypted and decrypted only on demand);
  • Image embedding (emails are rendered as images);
  • Access protection (email access is guarded by additional checks like CAPTCHA input).

These measures come in addition to standard scraping protection schemes: dynamic code (loading via Ajax), blocking automated requests from a single IP, using trap links etc.
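
To give one concrete illustration of obfuscation: Cloudflare's widely deployed email protection stores the address in a data-cfemail attribute as a hex string whose first byte is an XOR key. A minimal decoder sketch in Python:

def decode_cfemail(cfemail: str) -> str:
    # First byte is the XOR key; the rest are the XOR-encoded characters
    key = int(cfemail[:2], 16)
    return "".join(
        chr(int(cfemail[i:i + 2], 16) ^ key)
        for i in range(2, len(cfemail), 2)
    )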

However, it's not enough just to scrape emails; it's also important to verify the quality and deliverability of the addresses. What other best practices are there in email scraping?

  • Monitor the frequency of requests to pages and avoid identical delays: equal time between requests is the first sign of automated traffic (see the sketch after this list);
  • Validate the obtained emails against special databases. A more complex and costly method is to send test emails and process the server responses. Since public email lists often contain trap mailboxes, we recommend checking email lists against databases and never sending bulk emails from your primary mail servers; otherwise, they might be blocked or their reputation may suffer;
  • Try to collect additional contact information along with email addresses: name, company name, the website where the email was found etc.;
  • Employ headless browsers in scraping. They help bypass many protection systems and support dynamic sites and complex JS scripts;
  • Use rotating proxy servers to ensure multithreading. Rotating the IP for each new request reliably protects against IP blocks.
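
As promised above, here is a hedged sketch of randomized delays combined with a rotating proxy endpoint (the proxy URL is a placeholder; substitute your provider's credentials, and note that older httpx versions use proxies= instead of proxy=):

import time
import random
import httpx

PROXY = "http://user:password@proxy.example.com:8080"  # placeholder endpoint

urls = ["https://www.your-site.com/page-1", "https://www.your-site.com/page-2"]

with httpx.Client(proxy=PROXY) as client:
    for url in urls:
        response = client.get(url)
        # ... extract emails from response.text here ...
        # Randomize the pause so request intervals are never identical
        time.sleep(random.uniform(2.0, 6.0))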

Conclusion and Recommendations

As you may have noticed, all programs (specialized software) and custom parsing scripts require integration with proxy servers. If you don't route requests through intermediary IPs, you will quickly get banned and end up on a blacklist, making it impossible to parse the target site.

You can get high-quality residential and mobile proxies from Froxy. You don't pay for specific IP addresses, but for the consumed traffic only. Therefore, you can endlessly rotate IPs, even with each new request.

The Froxy address pool includes over 8.5 million IPs, with precision down to the required city and ISP.

In addition, we offer ready-made online parsers where you pay for successful requests to target sites only. There is no need to install or configure anything. All you have to do is download the scraping results and proceed with their further processing.