
LinkedIn Data Scraping with Python: A Complete Step-by-Step Guide

Written by Team Froxy | Oct 10, 2024 9:00:00 AM

Data and information have always been important assets. In the age of digitalization and IT technologies, their significance has notably increased, especially if the data is well-structured and can be applied for specific purposes.

LinkedIn is an ideal platform for gathering information about job seekers, potential employers, labor market trends and more. However, LinkedIn does not offer a public API for this kind of data collection, so you will have to write your own scraper.

This article is a step-by-step guide for beginners. Using code examples, we will show how to scrape LinkedIn pages, what you need to get started, and what pitfalls to expect.

Brief Overview of Data Scraping and Its Relevance in Extracting Information from LinkedIn

LinkedIn is a major social network focused on professional connections and networking. It is owned by Microsoft, so it is not surprising that AI-based services are being actively developed within the platform.

It is especially popular in the United States and Europe, with an audience exceeding 1 billion registered users. Its members include job seekers and major employers from more than 200 countries.

With such an audience, LinkedIn has become the number one service for publishing resumes and job vacancies, finding suitable candidates for specific positions, and exchanging personal recommendations and contacts (similar to word-of-mouth referrals).

There is one catch, however: LinkedIn does not provide automation tools or API interfaces for obtaining this data without a browser. To collect the necessary information in large volumes, you will need to automate the process with scraping scripts.

Web scraping LinkedIn lets you accomplish the following goals:

  • Quick search for candidates that meet specified criteria/requirements without the need for manual browsing and studying profile pages;
  • Salary analysis in specific niches;
  • Extraction of business contacts - emails, phone numbers, social networks etc.;
  • Identification of labor market trends;
  • Comparison of one’s own offerings with those of potential competitors (comparative analysis);
  • Bulk sending of responses to job vacancies (to increase the chances of landing a more favorable position);
  • Data export in tabular formats that can be utilized in other programs and tools: business analytics, diagrams, statistics etc.

Understanding LinkedIn’s Data Usage Policy

Does LinkedIn allow scraping? The platform handles users' personal data. Although much of it is publicly visible (after logging in), collecting information about users can have unpleasant consequences under various local data protection laws. Scraping itself generally does not violate any laws; violations arise from how the collected data is used, for example for sending spam.

However, scraping is officially prohibited by LinkedIn's rules; this is explicitly stated in the user agreement.

This raises the question: is LinkedIn scraping legal? The answer largely depends on the purpose of data collection. If you are automating a routine task, such as writing a script to quickly review suitable candidate profiles (something you could do manually but want to save time and effort on), you are unlikely to face legal consequences.

But if the data is collected for purposes that are explicitly prohibited by LinkedIn's policies and legal requirements, you could face not only a ban but also a real lawsuit. Such disputes have actually happened, most notably the hiQ Labs v. LinkedIn litigation, in which LinkedIn invoked the Computer Fraud and Abuse Act (CFAA).

Since LinkedIn cannot know your intentions, it blocks all automated traffic and scraping attempts. The service has one of the most sophisticated anti-fraud systems on the market.

In this respect, we strongly recommend protecting yourself as much as possible: use multiple virtual profiles (to parallelize requests and spread the risk), avoid registering them with your real phone number and personal information, and access the site only through proxies (preferably residential, or better yet, mobile).

Best Practices for Web Scraping Without Getting Blocked

Note: LinkedIn uses dynamic infinite scrolling and relies heavily on AJAX and JavaScript. This means a headless browser is required. There are various ways to control headless browsers: through dedicated libraries, directly via their APIs, or via drivers/local servers (as with Selenium).

Since you will be running many profiles, each should have a unique fingerprint. To manage a large number of browser instances with unique profiles, it is best to use an anti-detect browser such as Dolphin {anty}. It can operate in headless mode (with API access).

As a result of tightening internal policies, LinkedIn now has the following account restrictions:

  • No more than 10 personalized messages per month;
  • No more than 100 new contacts per month;
  • Significant limits on sending requests.

These limitations can only be circumvented by parallelizing accounts.

Basic Steps for LinkedIn Data Scraping

Below we will walk through scraping LinkedIn with Python. Python is a cross-platform programming language; to use it, you need to download the runtime for your operating system and install it in advance. Python can be downloaded from the official website, with installation packages available for Windows, Linux and macOS. Ready-to-use Docker images are even provided specifically for web developers.

Just in case, explore popular Python web scraping libraries.

Now, let's move on to our parser’s code.

Step 1. Installing Libraries

We start by preparing the working environment. First, you need to install a number of Python libraries that make it easier to write LinkedIn scraping scripts.

Open a terminal with the Python environment available and enter the following commands:

pip install playwright

pip3 install requests

pip3 install beautifulsoup4

If you use Windows, you can open the terminal (command prompt) after installing the Python environment and enter the following command:

py -m pip install playwright requests beautifulsoup4

Tools for exporting data in CSV format are built into Python (the csv module), so no extra installation is needed.

If you are feeling ambitious, you can drive a regular browser with the Selenium library instead. For this, you need both the library and a web driver (a standalone web server created specifically for browser automation). Starting from Chrome version 115, the matching ChromeDriver builds are distributed through the Chrome for Testing availability dashboard.

We have chosen a simple route and use Playwright for this purpose.

However, there are nuances here as well. Simply installing Playwright is not enough; you also need to install a browser for it. Even if a browser is already installed on the system, the script will complain and suggest installing Playwright's own build.

Since Chrome is available on 99% of computers, we will install Firefox (this will be our separate headless browser).

The command for the Windows terminal is as follows:

py -m playwright install firefox

Wait for Firefox to finish downloading and unpacking.
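To make sure Playwright and its Firefox build work, you can run a quick sanity check. This is a minimal sketch of our own (example.com is just a neutral test page, not part of the LinkedIn workflow):

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.firefox.launch(headless=True)  # uses the Firefox build Playwright just installed
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())  # should print "Example Domain"
    browser.close()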

Step 2. Logging Into Your LinkedIn Account

Without logging into an account, the platform will not let you view profiles, search results, and many other pages.

To log in, open the login page, locate the corresponding fields, and enter your username and password.

This can be done in advance, manually, in the same browser you will later control via the API.

For example, in our Firefox browser installed with Playwright, this is done as follows:

  • Find and launch the browser. It is installed in the folder: C:\Users\YOUR-USERNAME\AppData\Local\ms-playwright\firefox-1463\firefox (the version number "1463" may vary);
  • Launch the firefox.exe file;
  • The browser will open;
  • You can install plugins and set up a proxy;
  • Access the website linkedin.com and log in as usual.
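Keep in mind that a regular Playwright launch starts with a fresh browser profile each time, so a login performed manually in that window will not automatically carry over to scripted runs. One possible workaround (a sketch of our own, assuming a persistent profile folder whose name is purely hypothetical) is to launch a persistent context:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    # Cookies and the LinkedIn session are stored in this folder between runs
    context = pw.firefox.launch_persistent_context(
        user_data_dir="./linkedin-profile",  # hypothetical folder name
        headless=False,
        viewport={"width": 1366, "height": 720},
    )
    page = context.new_page()
    page.goto("https://www.linkedin.com/login")
    # Log in manually the first time; later runs reuse the saved session
    page.wait_for_timeout(60000)  # leave the window open for a minute to log in
    context.close()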

In code, this is done like this (this is a snippet of a Python script; the full code will be provided further):

# Import Playwright modules
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:

    # Launch Firefox with headless mode disabled so that all actions are visible
    # (do not touch the browser window while the script is running)
    browser = pw.firefox.launch(headless=False)

    # Set window parameters
    context = browser.new_context(viewport={"width": 1366, "height": 720})

    # Open a new tab
    page = context.new_page()
    page.goto('https://www.linkedin.com/login')
    page.wait_for_selector("h2")  # here you can specify any selector that appears together with the login form

    # Fill in the input fields
    # (if the accessible names differ, you can fall back to the field ids
    # used in the Selenium example below: #username and #password)
    page.get_by_role("textbox", name="username").fill("your_email@address.com")
    page.get_by_role("textbox", name="password").fill("YOUR-PASSWORD")
    page.get_by_role("button", name="login").click()

The code above uses Playwright.

If you are using a web driver, the code will be as follows:

# Selenium 4 version (assumes a driver has been created, e.g. driver = webdriver.Firefox())
from selenium.webdriver.common.by import By

# This is the login page
driver.get('https://www.linkedin.com/login')

# Find the required input fields and fill them using the send_keys() method
driver.find_element(By.ID, 'username').send_keys('your_email@example.com')
driver.find_element(By.ID, 'password').send_keys('your_password')
driver.find_element(By.CSS_SELECTOR, '.login__form_action_container button').click()

Step 3. Analyzing the Page Structure (Finding the Classes and Identifiers of Required HTML Elements)

Suppose we need to extract data from a search results page, which is effectively an infinite, dynamically loaded list.

A lot of information is recorded directly in the URL structure. A simple example for job searches:

https://www.linkedin.com/jobs/search?keywords=DevOps%20Engineer&location=United%20States&position=1&pageNum=0&start=0

We searched for “DevOps Engineer” in the United States. The space character is URL-encoded as “%20”.

The pageNum parameter does not affect anything, as the page loads dynamically and always remains the same.

However, the start parameter determines which item in the list the results begin from. In practice, the most convenient step for paging through the list is 25 items. That is, the start parameter can be 0, 25, 50, 75, and so on up to 975 (beyond that, a 404 error is returned instead of results).
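For illustration, here is a rough sketch of our own (reusing the keyword and location from the example URL above) showing how those paginated URLs could be generated:

base = ("https://www.linkedin.com/jobs/search"
        "?keywords=DevOps%20Engineer&location=United%20States")

# start = 0, 25, 50, ..., 975: one URL per batch of 25 job cards
search_urls = [f"{base}&start={start}" for start in range(0, 1000, 25)]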

But where exactly in the HTML code are the job titles, company names, etc. hidden?

By analyzing the page code in the developer console, you can identify the following patterns:

  • All job cards are wrapped in a div container with the class "base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card".
  • Within these cards, you can find identifiers for:
      • Job Title – an h3 header with the class base-search-card__title
      • Company Name – an h4 header with the class base-search-card__subtitle
      • Address/Job Location – a span tag with the class job-search-card__location
      • Link to Full Profile – an "a" tag with the class base-card__full-link

Please note that over time, tags and classes may change, as nothing is eternal. If desired, you can always open the developer console in your web browser and find the tags and classes that correspond to various layout elements on your own.

Step 4. Data Extraction

Once logged in, we go to the search page, wait for a specific selector to appear (to ensure the page has fully loaded), then find the job cards and iterate over them, collecting the required content.

We will simply output the data array to the console.

When creating a full-fledged parser, you can add a module for exporting data to a CSV file with column and field separation.
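As a rough sketch using Python's built-in csv module (the file name is our own placeholder; the field names match the dictionary keys collected below), the export could look like this:

import csv

def save_to_csv(parsed, filename="linkedin_jobs.csv"):
    # "parsed" is the list of dictionaries built by the scraping loop
    fieldnames = ["Job title", "Company name", "Address", "Link"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(parsed)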

Here’s how our code for iterating through job cards will look:

# Open the search page in a new tab (continues from the login code above)
page_two = context.new_page()
page_two.goto('https://www.linkedin.com/jobs/search?keywords=DevOps%20Engineer&location=United%20States&position=1&pageNum=0&start=0')
page_two.wait_for_selector("h3")  # wait for the web page content to load

# First, declare an array and find the job result cards
# (the long class string from Step 3 is turned into a CSS selector by
# keeping a few stable classes and prefixing them with dots)
parsed = []
all_jobs = page_two.query_selector_all('div.base-card.base-search-card.job-search-card')

# Inside each card, locate the required elements and save the data
for job in all_jobs:
    parsed.append({
        # Job title first
        "Job title": job.query_selector("h3.base-search-card__title").inner_text(),
        # Company name next
        "Company name": job.query_selector("h4.base-search-card__subtitle").inner_text(),
        # Address
        "Address": job.query_selector("span.job-search-card__location").inner_text(),
        # Finally, the link to the full job posting
        "Link": job.query_selector("a.base-card__full-link").get_attribute("href"),
    })

# Print everything to the console
for item in parsed:
    print(item)

Step 5. Putting It All Together

Here is what the final script looks like:

# Import Playwright modules
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:

    # Launch Firefox with headless mode disabled so that all actions are visible
    # (do not interact with the browser window while the script is running)
    browser = pw.firefox.launch(headless=False)

    # Set window parameters
    context = browser.new_context(viewport={"width": 1366, "height": 720})

    # Open a new tab and log in
    page = context.new_page()
    page.goto('https://www.linkedin.com/login')
    page.wait_for_selector("h2")  # here you can specify any selector that appears together with the login form

    # Fill out the login form
    page.get_by_role("textbox", name="username").fill("your_email@address.com")
    page.get_by_role("textbox", name="password").fill("YOUR-PASSWORD")
    page.get_by_role("button", name="login").click()
    page.wait_for_timeout(5000)  # give LinkedIn a few seconds to complete the login

    # Open the search page in a second tab
    page_two = context.new_page()
    page_two.goto('https://www.linkedin.com/jobs/search?keywords=DevOps%20Engineer&location=United%20States&position=1&pageNum=0&start=0')
    page_two.wait_for_selector("h3")  # wait for the web page content to load

    # First, find the cards with search results
    parsed = []
    all_jobs = page_two.query_selector_all('div.base-card.base-search-card.job-search-card')

    # Inside each card, locate the required elements and save the data
    for job in all_jobs:
        parsed.append({
            # Job title first
            "Job title": job.query_selector("h3.base-search-card__title").inner_text(),
            # Company name next
            "Company name": job.query_selector("h4.base-search-card__subtitle").inner_text(),
            # Address
            "Address": job.query_selector("span.job-search-card__location").inner_text(),
            # Finally, the link to the full job posting
            "Link": job.query_selector("a.base-card__full-link").get_attribute("href"),
        })

    # Print everything to the console
    for item in parsed:
        print(item)

    # Close the browser when finished
    browser.close()

General Recommendations for LinkedIn Web Scraping

Hypothetically, if you correctly iterate over the starting card numbers in the URL structure of the search page, you can do without a headless browser. You just need to run a loop that increments the start parameter in steps of 25 cards. In this case, the search results page will return new results without repeating previous items, creating an effect similar to pagination.
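As a rough illustration of this idea (a sketch of our own, not part of the original walkthrough: the User-Agent header is a placeholder, and LinkedIn may still demand a logged-in session or block plain HTTP clients), such a loop could look like this:

import requests
from bs4 import BeautifulSoup

base = ("https://www.linkedin.com/jobs/search"
        "?keywords=DevOps%20Engineer&location=United%20States")
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; use a full, realistic browser UA string

job_titles = []
for start in range(0, 1000, 25):  # 0, 25, 50, ... up to 975
    response = requests.get(f"{base}&start={start}", headers=headers, timeout=30)
    if response.status_code != 200:  # e.g. 404 past the last batch, or a block
        break
    soup = BeautifulSoup(response.text, "html.parser")
    for card in soup.select("div.base-search-card"):
        title = card.select_one("h3.base-search-card__title")
        if title:
            job_titles.append(title.get_text(strip=True))

print(len(job_titles), "job titles collected")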

However, this method works with LinkedIn's job search only.

Other LinkedIn pages often use dynamic content loading, so using a headless browser becomes unavoidable.

LinkedIn has one of the most robust security systems on the market. It can recognize and block browser extensions used to bypass restrictions or rotate proxies. Moreover, many VPNs and popular shared proxies are already on its blacklist.

The only reliable tool is using clean residential and mobile proxies with rotation.
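If you are working through Playwright, a proxy can be passed directly at browser launch. The sketch below is only an illustration: the server address and credentials are placeholders, not real endpoints.

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.firefox.launch(
        headless=False,
        proxy={
            "server": "http://proxy.example.com:10000",  # placeholder host and port
            "username": "PROXY-USER",  # placeholder credentials
            "password": "PROXY-PASSWORD",
        },
    )
    page = browser.new_page()
    page.goto("https://www.linkedin.com/login")
    browser.close()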

Even so, proxies only partially solve the problem. The primary restriction is the mandatory profile authorization; without logging into an account, access to the content is impossible.

LinkedIn closely monitors and penalizes any attempts at automation. It's not just the IP addresses that get banned but specific user accounts. Even having a premium subscription does not protect against bans resulting from parsing.

To conclude, it is necessary to have a large number of accounts and diversify risks. To manage and automate work with a large number of accounts, an anti-detect browser is required, not just a headless one.

In short, the full package is required: headless and anti-detect browsers, a large number of profiles, the most advanced protection measures, and, without fail, clean (white) proxies.

Conclusion and Recommendations

Scraping LinkedIn pages requires significant effort. You won't be able to write a scraper in just a few lines of code. To make it work, you need a complete setup that includes a headless or anti-detect browser, supporting libraries, and rotating proxies.

The probability of getting blocked will mostly depend on proxy quality. You can get clean rotating proxies from us: Froxy offers over 10 million IPs, including residential, mobile, and datacenter (server) proxies. You pay for the traffic you consume, not for the number of addresses, and unused traffic carries over to the next month. Targeting is available down to the city and ISP level, with rotation by time or by API/request.