There are many programming languages out there, and almost any of them can be used for web scraping: Java, Python, Go, C#, etc. The choice largely comes down to personal preference and to the set of existing libraries and frameworks, since these are what speed up the development of application programs and make for a quick start.
This article is not about choosing the best platform for development but about a specific language: Java. More precisely, it is about the most popular libraries that help create your own web parsers in Java.
First of all, Java is one of the first cross-platform programming languages. Thanks to the Java Virtual Machine, Java code can run on almost any operating system. Remember the games on feature phones? Most of them were written in Java.
Secondly, Java is a very popular language. It performs well and excels at working with complex logic.
Thirdly, Java offers many ready-made libraries and frameworks for parsing HTML and XML documents.
Fourthly, let's not forget about multithreading. Java is fully equipped in this area: it makes it easy to distribute tasks across multiple threads running simultaneously, which significantly speeds up data collection from many pages or websites.
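As a rough illustration, here is a minimal sketch of downloading several pages in parallel with an ExecutorService and the standard java.net.http.HttpClient (available since Java 11). The URLs and the thread count are placeholder values, not part of any specific library discussed below:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFetch {
    public static void main(String[] args) {
        // Placeholder URLs - replace with the pages you actually need to scrape
        List<String> urls = List.of("https://example.com/page1", "https://example.com/page2");
        HttpClient client = HttpClient.newHttpClient();
        // Fetch pages in parallel on a small thread pool
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                    System.out.println(url + " -> " + response.statusCode());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}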
Libraries are sets of ready-made classes and methods designed for specific tasks. Since Java is used in many fields, plenty of specialized libraries are available for Java web scraping and many other purposes.
Below, we highlight the most relevant libraries for creating web parsers. Let's start with the most popular and almost indispensable Java scraping library, Jsoup, which is similar to Python's BeautifulSoup.
Jsoup is an open-source Java library designed for parsing HTML structures. HTML is the code returned in response to any HTTP/HTTPS request (as a reminder, HTTP stands for HyperText Transfer Protocol). Since Jsoup supports the current WHATWG HTML5 specification, it fetches and parses page code the same way modern browsers, which follow the same specification, do.
The library has been in development since 2009. With Jsoup, you can fetch HTML content from specific URLs or files, extract data based on HTML tags and CSS selectors, set your own attributes, navigate through pages, fill and submit forms, etc. Jsoup has a well-documented API.
Advantages:
Disadvantages:
Code Example:
// Import libraries
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

// A simple sample of Wikipedia parsing, starting with the class creation
public class Wikipedia {
    public static void main(String[] args) throws IOException {
        // Create the document and fill it with content from the Wikipedia homepage
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
        // Output the title
        log(doc.title());
        // Get the list of news headlines
        Elements newsHeadlines = doc.select("#mp-itn b a");
        for (Element headline : newsHeadlines) {
            // Collect titles and links
            log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
        }
    }

    // Print the log
    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}
Here is what parsing through a proxy may look like:
URL url = new URL("http://www.your-target-site.com/");
// Proxy connection details
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080));
// Open the connection through the proxy
HttpURLConnection uc = (HttpURLConnection) url.openConnection(proxy);
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 YaBrowser/24.10.0.0 Safari/537.36");
uc.setRequestProperty("Accept-Language", "en-US");
uc.setRequestMethod("GET");
uc.connect();
// Create the document and proceed with the parsing logic
Document doc = Jsoup.parse(uc.getInputStream(), "UTF-8", url.toString());
…
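As a side note, recent Jsoup versions also let you set the proxy directly on the Connection object, which avoids the manual HttpURLConnection plumbing. A minimal sketch (the proxy address and the target URL are placeholders):

// Hypothetical proxy address and target URL - substitute your own
Document doc = Jsoup.connect("http://www.your-target-site.com/")
        .proxy("127.0.0.1", 8080)
        .userAgent("Mozilla/5.0")
        .get();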
HtmlUnit is a self-sufficient headless browser, originally written in Java to be cross-platform and easy to integrate with other Java scripts and programs. Thanks to its high-level API, it can also be driven from other programming languages. Unlike the headless modes of mainstream browsers, it has no graphical engine at all, so it consumes minimal resources and processes pages very quickly.
It can handle JavaScript, Ajax, and cookies, supports HTTPS, submits forms, and emulates user behavior quite effectively. Development has been ongoing since 2002, and the code is open source under the Apache 2.0 license. The library is well suited for test automation (in web development) and content scraping.
Advantages:
Disadvantages:
Code Example
Here is how connecting via a proxy with authentication may look:
public void homePage_proxy() throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
        // Use a proxy with authentication
        final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        credentialsProvider.addCredentials("login", "password", PROXY_HOST, PROXY_PORT);

        final HtmlPage page = webClient.getPage("https://target-domain.zone");
        Assert.assertEquals("Web page title", page.getTitleText());
    }
}
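For comparison, a basic fetch without a proxy is even shorter. The sketch below is an assumption-laden example: it uses a recent HtmlUnit release (package org.htmlunit; older versions use com.gargoylesoftware.htmlunit) and a placeholder URL:

import org.htmlunit.WebClient;
import org.htmlunit.html.DomNode;
import org.htmlunit.html.HtmlPage;

public class HtmlUnitBasic {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // JavaScript is on by default; CSS processing can be disabled to speed things up
            webClient.getOptions().setCssEnabled(false);
            // Placeholder URL - replace with the page you want to scrape
            HtmlPage page = webClient.getPage("https://target-domain.zone");
            System.out.println(page.getTitleText());
            // Print the text of the first <h1>, if the page has one
            DomNode h1 = page.querySelector("h1");
            if (h1 != null) {
                System.out.println(h1.getTextContent());
            }
        }
    }
}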
This is a suite of libraries and tools for automating almost any browser. Selenium acts as middleware that exposes a unified API through dedicated web drivers. Drivers exist for all major browsers and platforms, including Firefox, Chrome, Safari, Edge, etc. It even supports the aforementioned HtmlUnit and integrates with most anti-detect browsers. Selenium runs on all desktop operating systems: Windows, Linux, and macOS. A dedicated command language, Selenese, was created for the original IDE.
The project has existed since 2004. Currently, it offers web drivers, an IDE system, docker containers, and a system for managing many browsers (including those on remote servers) — Selenium Grid. Multiple programming languages, including Java, are supported.
Read also: Comparison of website testing tools (Selenium, Playwright, Puppeteer, Lighthouse).
Advantages:
Disadvantages:
Code examples
Typical use of a Selenium web driver in a Java parser:
package dev.selenium.hello;

// Import libraries
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

// Launch a browser instance (Chrome opens a visible window unless headless mode is enabled)
public class HelloSelenium {
    public static void main(String[] args) {
        // Connect to Google Chrome
        WebDriver driver = new ChromeDriver();
        // Open the page
        driver.get("https://selenium.dev");
        // Close the browser
        driver.quit();
    }
}
And here is a sample of code that scrapes the text content of an element:
// Find the first element with the H1 tag in the sample and copy its content
String text = driver.findElement(By.tagName("h1")).getText();
System.out.println(text);
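If the browser needs to run without a visible window (for example, on a server), Chrome can be started in headless mode via ChromeOptions. A minimal sketch, assuming Selenium 4 and Chrome's newer --headless=new flag:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessHello {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        // Run Chrome without opening a window
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);
        driver.get("https://selenium.dev");
        // The DOM is fully available even without a visible window
        System.out.println(driver.getTitle());
        driver.quit();
    }
}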
More examples of web scraping in Java can be found in the official repository.
This is a ready-made web parser written in Java, developed and maintained under the Apache team. It is positioned as an open-source web crawler. In some sources, it is referred to as a framework for building your own search engines (theoretically, you can create your own Google-like system). Nutch has a modular and extensible architecture, and the final project can run on a single machine or in a distributed environment like a Hadoop cluster. The collected data can be indexed by search engines like Elasticsearch and Apache Solr.
Note that there was an attempt to develop Apache Nutch 2.0, intended to work with abstract data storage, but it was abandoned. Therefore, the 1.x branch remains the most relevant and is still actively developed.
Advantages:
Disadvantages:
Code examples
Setting up the required environment takes some effort; once it is ready, Nutch is normally driven from the command line. If you wish, however, you can also invoke Nutch from your Java code:
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.Crawl;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.util.NutchConfiguration;

public class NutchScraper {
    public static void main(String[] args) throws Exception {
        // Initiate a Nutch sample
        String url = "https://here.target.url/";
        // Create the configuration
        Configuration conf = NutchConfiguration.create();
        Crawl nutch = new Crawl(conf);
        // Start crawling the URL
        nutch.crawl(url);
        Fetcher fetcher = new Fetcher(conf);
        fetcher.fetch(url);
        // Output the website title in the console
        System.out.println("Website title: " + fetcher.getPage().getTitle());
    }
}
WebMagic is another Java framework designed to speed up the creation of custom parsers. Compared to Apache Nutch, it is much simpler both to implement and to configure.
Advantages:
Disadvantages:
Code examples
Here is an example of Java parser code:
public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // Specify the start URL
            .addUrl("https://target.url")
            .addPipeline(new JsonFilePipeline("D:\\site-files\\"))
            // Use 2 threads
            .thread(2)
            // Launch the spider
            .run();
}
Here is what complete parser code from the official documentation may look like (parsing a GitHub section):
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

// Create the processor with the entire critical logic and regular expressions
public class GithubRepoPageProcessor implements PageProcessor {

    // Set retry attempts and sleep time
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        // Define the structure of requests (URL addresses) and get the HTML code
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Create the "author" field
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        // Define the "name" field based on the H1 header
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        // If the name is blank
        if (page.getResultItems().get("name") == null) {
            // Ignore pages with blank names
            page.setSkip(true);
        }
        // Define the "readme" field
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    // Launch the task and pass it to the processor to parse a particular website section in 5 threads
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
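Since this article revolves around proxies, it is worth noting that WebMagic can also route requests through a proxy via its HttpClientDownloader. A minimal sketch, reusing the GithubRepoPageProcessor defined above and based on the API available in WebMagic 0.7.x (the address and credentials are placeholders):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class ProxySpider {
    public static void main(String[] args) {
        // Configure a downloader that sends requests through a proxy (placeholder credentials)
        HttpClientDownloader downloader = new HttpClientDownloader();
        downloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("127.0.0.1", 8080, "login", "password")));
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                .setDownloader(downloader)
                .thread(5)
                .run();
    }
}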
As mentioned earlier, Java works great for parser development. However, there are challenges you may encounter during the development process. Let’s outline the most significant limitations below:
Additionally, do not forget about the general peculiarities of writing web parsers as software.
Read Also: A Detailed Guide on Parsing Without Getting Blocked
The need to develop custom parsers arises for various reasons. Whatever these reasons are, simplifying an already complex task is always desirable. This is where specialized libraries and frameworks come into play.
Since Java is a widely used programming language, a sufficient number of ready-made solutions and libraries are available for it. The most popular and functional ones for Java web scraping have been listed above.
If you lack high-quality rotating proxies with global coverage to power your parser, consider Froxy. Our service provides access to over 10 million IP addresses, including mobile, residential, and server options, with targeting down to the city and telecom operator level. You pay only for the traffic you use, and up to 1,000 ports are available.