There are many programming languages out there, and almost any of them can be used for web scraping: Java, Python, Go, C#, etc. The choice comes down to personal preference and the available libraries and frameworks, since these enable a quick start and speed up application development.
This article is not about choosing the best platform for development but about a specific language: Java. More precisely, it is about the most popular libraries that help create your own web parsers in Java.
First of all, Java is one of the first cross-platform programming languages. Thanks to the Java Virtual Machine, Java code can run on almost any operating system. Remember the games on old feature phones? Most of them were written in Java.
Secondly, Java is a very popular language. It performs well and excels at working with complex logic.
Thirdly, Java offers many ready-made libraries and frameworks for parsing HTML and XML documents.
Fourthly, let's not forget about multithreading. Java is fully equipped in this area: it can distribute tasks efficiently across multiple threads, which significantly speeds up data collection from many pages or websites at once.
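To illustrate, here is a minimal sketch of fetching several pages in parallel using only the standard java.net.http client, with one asynchronous request per page (the URLs are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelFetch {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder URLs - replace with real targets
        List<String> urls = List.of(
                "https://example.com/page1",
                "https://example.com/page2",
                "https://example.com/page3");
        // One asynchronous request per page; the downloads run concurrently
        List<CompletableFuture<Void>> tasks = urls.stream()
                .map(u -> client.sendAsync(
                                HttpRequest.newBuilder(URI.create(u)).build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenAccept(r -> System.out.println(u + " -> " + r.statusCode())))
                .collect(Collectors.toList());
        // Wait for all downloads to finish
        tasks.forEach(CompletableFuture::join);
    }
}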
Libraries are sets of ready-made classes and methods designed for specific tasks. Since Java is used in many fields, plenty of specialized libraries are available, including ones for Java web scraping.
Below, we highlighted the most relevant libraries for creating web parsers. Let's start with the most popular and almost indispensable Java scraping library – Jsoup, which is similar to Python's BeautifulSoup.
Jsoup is an open-source Java library designed for parsing HTML structures. HTML is the markup returned in response to most HTTP/HTTPS requests. Just as a reminder, HTTP stands for HyperText Transfer Protocol. Since Jsoup supports the current WHATWG HTML5 specification, it fetches and parses page code the same way modern browsers (which follow the same specification) do.
The library has been in development since 2009. With Jsoup, you can fetch HTML content from specific URLs or files, extract data based on HTML tags and CSS selectors, set your own attributes, navigate through pages, fill and submit forms, etc. Jsoup has a well-documented API.
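As a quick illustration of the selector API, here is a minimal sketch (the HTML snippet is made up) showing how Jsoup parses markup from a string and extracts data with CSS selectors:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupBasics {
    public static void main(String[] args) {
        // Parse HTML from a string instead of a live URL
        String html = "<html><body>"
                + "<div class='item'><a href='/a'>First</a></div>"
                + "<div class='item'><a href='/b'>Second</a></div>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);
        // Select elements with a CSS selector and read text and attributes
        for (Element link : doc.select("div.item a")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
        // Set your own attribute value, as mentioned above
        doc.select("a").first().attr("rel", "nofollow");
    }
}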
Advantages:
Disadvantages:
Code Example:
// Import libraries
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

// A simple sample of Wikipedia parsing, starting with the class creation
public class Wikipedia {
    public static void main(String[] args) throws IOException {
        // Create the document and fill it with content from the Wikipedia homepage
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
        // Output the title
        log(doc.title());
        // Get the list of news headlines
        Elements newsHeadlines = doc.select("#mp-itn b a");
        for (Element headline : newsHeadlines) {
            // Collect titles and links
            log("%s\n\t%s", headline.attr("title"), headline.absUrl("href"));
        }
    }

    // Print the log
    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}
Here is what example code for parsing through a proxy may look like:
// Import libraries
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

URL url = new URL("http://www.your-target-site.com/");
// Proxy connection details
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080));
// Open the connection
HttpURLConnection uc = (HttpURLConnection) url.openConnection(proxy);
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 YaBrowser/24.10.0.0 Safari/537.36");
uc.setRequestProperty("Content-Language", "en-US");
uc.setRequestMethod("GET");
uc.connect();
// Create the document and proceed with the parsing logic
// (parsing from a stream requires the charset and a base URL)
Document doc = Jsoup.parse(uc.getInputStream(), "UTF-8", url.toString());
…
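Alternatively, recent Jsoup versions expose proxy settings directly on the connection, so the same thing can be done without HttpURLConnection. A short sketch, using the same placeholder host and port:

// Proxy set directly on the Jsoup connection
Document doc = Jsoup.connect("http://www.your-target-site.com/")
        .proxy("127.0.0.1", 8080)  // proxy host and port
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
        .get();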
This is a self-sufficient headless browser originally written in Java to be cross-platform and easy to integrate with other Java scripts and programs. Thanks to its high-level API, it can also be driven from other programming languages. Unlike mainstream browsers running in headless mode, it has no graphical interface at all, so it consumes minimal resources and processes sites very quickly.
It can handle JavaScript, Ajax, and cookies, supports the HTTPS protocol and form submission, and effectively emulates user behavior. Development has been ongoing since 2002. The code is open source under the Apache 2.0 license. The library is ideal for automating tests (in web development) and for content scraping.
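Before the proxy example below, here is a minimal sketch of a basic HtmlUnit fetch with JavaScript enabled. The package names assume HtmlUnit 2.x; the 3.x releases moved to the org.htmlunit root package:

// Package names are from HtmlUnit 2.x; in 3.x the root package is org.htmlunit
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true); // execute page scripts
            webClient.getOptions().setCssEnabled(false);       // skip CSS for speed
            HtmlPage page = webClient.getPage("https://example.com");
            System.out.println(page.getTitleText());
        }
    }
}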
Advantages:
Disadvantages:
Code Example
Here’s how connecting via a proxy with authentication may look:
// PROXY_HOST and PROXY_PORT are constants defined elsewhere in the test class
public void homePage_proxy() throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME, PROXY_HOST, PROXY_PORT)) {
        // The use of a proxy with authentication
        final DefaultCredentialsProvider credentialsProvider =
                (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        credentialsProvider.addCredentials("login", "password", PROXY_HOST, PROXY_PORT);
        final HtmlPage page = webClient.getPage("https://target-domain.zone");
        Assert.assertEquals("Web page title", page.getTitleText());
    }
}
This is a suite of libraries and tools for automating almost any browser. Selenium acts as middleware, exposing a unified API to browsers through special web drivers. Drivers exist for all major browsers and platforms, including Firefox, Chrome, Safari, Edge, etc. It even supports the aforementioned HtmlUnit and integrates with most anti-detect browsers. Selenium works on all desktop operating systems: Windows, Linux, and macOS. A dedicated command language called Selenese was created for the original IDE system.
The project has existed since 2004. Currently, it offers web drivers, an IDE system, Docker containers, and Selenium Grid, a system for managing many browsers (including those on remote servers). Multiple programming languages, including Java, are supported.
Read also: Comparison of website testing tools (Selenium, Playwright, Puppeteer, Lighthouse).
Advantages:
Disadvantages:
Code examples
Typical use of a Selenium web driver in a Java parser:
package dev.selenium.hello;

// Import libraries
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

// Launch a minimal browser session
public class HelloSelenium {
    public static void main(String[] args) {
        // Start Google Chrome via ChromeDriver
        WebDriver driver = new ChromeDriver();
        // Open the page
        driver.get("https://selenium.dev");
        // Close the browser
        driver.quit();
    }
}
And here is a code sample that scrapes the text content of an element:
// Find the first element with the H1 tag on the page and read its text content
// (By comes from org.openqa.selenium.By)
String text = driver.findElement(By.tagName("h1")).getText();
System.out.println(text);
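For scraping, you will usually want to run Chrome without a visible window. A sketch assuming a recent Chrome build that understands the --headless=new flag:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessExample {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");  // headless flag for recent Chrome builds
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://selenium.dev");
            System.out.println(driver.findElement(By.tagName("h1")).getText());
        } finally {
            driver.quit();  // always release the browser
        }
    }
}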
More examples of web scraping in Java can be found in the official repository.
This is a ready-made web parser written in Java, developed and maintained under the Apache team. It is positioned as an open-source web crawler. In some sources, it is referred to as a framework for building your own search engines (theoretically, you can create your own Google-like system). Nutch has a modular and extensible architecture, and the final project can run on a single machine or in a distributed environment like a Hadoop cluster. The collected data can be indexed by search engines like Elasticsearch and Apache Solr.
Note that there was an attempt to develop Apache Nutch 2.0, intended to work with abstract data storage, but it was abandoned. The 1.x branch therefore remains the relevant and actively developed version.
Advantages:
Disadvantages:
Code examples
If you manage to set up the essential environment, here’s what you need to do:
If you wish, you can also call Nutch from your Java code:
// Schematic illustration only: the Crawl class shown here was removed in newer
// Nutch releases (1.8+), and Nutch is normally driven through its command-line
// tools (bin/nutch), so treat this as a sketch of the idea rather than working code.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.Crawl;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.util.NutchConfiguration;

public class NutchScraper {
    public static void main(String[] args) throws Exception {
        // Initiate the Nutch sample
        String url = "https://here.target.url/";
        // Create the configuration
        Configuration conf = NutchConfiguration.create();
        Crawl nutch = new Crawl(conf);
        // Start crawling the URL
        nutch.crawl(url);
        Fetcher fetcher = new Fetcher(conf);
        fetcher.fetch(url);
        // Output the website title in the console
        System.out.println("Website title: " + fetcher.getPage().getTitle());
    }
}
This is another Java framework designed to accelerate the creation of custom parsers. Compared to Apache Nutch, it is much simpler both in implementation and configuration.
Advantages:
Disadvantages:
Code examples
Here is an example of Java parser code:
public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // Specify the start URL
            .addUrl("https://target.url")
            // Store the results as JSON files
            .addPipeline(new JsonFilePipeline("D:\\site-files\\"))
            // Open 2 threads
            .thread(2)
            // Launch the spider
            .run();
}
Here is what complete parser code from the official documentation may look like (parsing a GitHub section):
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

// Create the processor with all the critical logic and regular expressions
public class GithubRepoPageProcessor implements PageProcessor {

    // Set retry attempts and sleep time
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        // Define the structure of requests (URL addresses) and get the HTML code
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // Create the "author" field
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        // Define the "name" field based on the H1 header
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        // If the name is blank
        if (page.getResultItems().get("name") == null) {
            // Ignore the pages with blank names
            page.setSkip(true);
        }
        // Define the "readme" field
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    // Launch the task and hand it to the processor to parse a particular website section in 5 threads
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                .thread(5)
                .run();
    }
}
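The JsonFilePipeline used earlier writes results to disk; if you need custom output handling, you can implement the Pipeline interface yourself. A minimal sketch (WebMagic also ships a similar built-in ConsolePipeline):

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Map;

// Prints every extracted field to the console instead of writing JSON files
public class MyConsolePipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}

Attach it to the spider with .addPipeline(new MyConsolePipeline()) in place of (or alongside) the JSON pipeline.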
As mentioned earlier, Java works great for parser development. However, there are challenges you may encounter during the development process. Let’s outline the most significant limitations below:
Additionally, do not forget about the peculiarities of writing the web parsers themselves (the software side).
Read Also: A Detailed Guide on Parsing Without Getting Blocked
The need to develop custom parsers arises for various reasons. Whatever these reasons are, simplifying an already complex task is always desirable. This is where specialized libraries and frameworks come into play.
Since Java is a widely used programming language, a sufficient number of ready-made solutions and libraries are available for it. The most popular and functional ones for Java web scraping have been listed above.
If you lack high-quality rotating proxies with global coverage to power your parser, consider Froxy. Our service provides access to over 10 million IP addresses, including mobile, residential, and server options, with targeting down to the city and telecom operator level. You only pay for the traffic used, and you can run up to 1,000 ports.