We’ve covered libraries and frameworks that speed up parser development in Python, Java, and Go many times before. There are various approaches: from the simplest HTTP clients that can only read “raw” HTML to highly complex algorithms incorporating computer vision and AI assistance. In this article, we’ll explore the most advanced frameworks for integrating large language models (LLMs) into your parsers: LangChain vs. LangGraph.
When considering how to build AI-based workflows that manage token limits and maintain conversational context, the LangChain vs LangGraph comparison becomes especially relevant.
Let’s start with a quick reminder: not all AI is the same. Neural networks that work with textual data are known as LLMs (Large Language Models). These models can do more than generate coherent text or give concise answers; they are also capable of performing tasks such as analyzing documents and summarizing their contents, extracting specific data from HTML code, and much more. As you might guess from the context, our focus here is on everything related to web parsing.
Many modern neural networks are available as public web services with accessible API interfaces. Some LLMs have already evolved to handle a variety of document types and media formats, including images, PDFs, presentations, audio, and video. However, despite this versatility, using them via API comes with certain limitations. The main challenge is tokenization.
To explain it simply: every new request sent to an LLM is treated independently, without memory of previous interactions. Moreover, each request is limited to a specific number of tokens (think of tokens as word-like units, though the actual calculation is more nuanced). This leads to a key technical challenge: when sending large requests to an LLM, you need to split the input into chunks that fit within the token limit and re-supply the relevant context with each new request.
This is where the concept of chains comes in: linked sequences of prompts and responses that preserve state and context across interactions.
To automate this process of managing request chains, you need specialized software or libraries. And this is exactly where LangChain and LangGraph come into play.
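Before reaching for a framework, it helps to see the core idea in isolation. Below is a minimal sketch of the chunk-splitting step in plain Python, approximating tokens as whitespace-separated words (real tokenizers count differently, and LangChain ships dedicated text splitters for this; the function name and token budget here are illustrative assumptions):

```python
# Minimal sketch of token-budget chunking (plain Python, no LangChain).
# Tokens are approximated as whitespace-separated words; real LLM
# tokenizers are more granular, so treat max_tokens as a rough budget.

def split_into_chunks(text: str, max_tokens: int = 100) -> list[str]:
    """Greedily pack words into chunks of at most max_tokens words each."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

document = "word " * 250  # a stand-in for a large scraped page
chunks = split_into_chunks(document, max_tokens=100)
print(len(chunks))  # 3 chunks: 100 + 100 + 50 words
```

Each chunk would then be sent as a separate request, with any context the model needs repeated in every prompt — exactly the bookkeeping that chains automate.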
LangChain is a framework designed to create chains (sequences of language model calls) and to integrate those calls with external tools like databases, scrapers, APIs, and more.
Currently, LangChain is available both as open-source libraries (for Python, JavaScript, and TypeScript) and as a ready-to-use platform. The hosted version functions as a web service with an API, offering robust integration capabilities, orchestration tools, and performance evaluation, all hosted in a secure cloud environment that’s well-suited for enterprise applications.
Key Features of LangChain:
In Short:
LangChain is the foundational layer for integrating LLMs into your scripts, tools, and enterprise workflows. It enables advanced use cases like contextual dialogue, tool-augmented reasoning, and AI-powered agents. In fact, more complex orchestration systems like LangGraph are built on top of LangChain to provide even greater control and flexibility. For many developers, LangChain is often the starting point in the LangChain vs LangGraph decision tree, especially when workflows are relatively straightforward.
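To make the “chain” idea concrete, here is a plain-Python sketch of the pattern LangChain formalizes: a prompt template feeding a model call feeding an output parser, run as one sequence. The `fake_llm` function is a hypothetical stand-in for a real LLM API call, not part of any library:

```python
# A plain-Python sketch of the "chain" pattern LangChain builds on:
# prompt template -> model call -> output parser, run as one sequence.
import json

def make_prompt(html_snippet: str) -> str:
    return f"Extract the product title from this HTML and answer in JSON: {html_snippet}"

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real chain would call an LLM API here.
    return '{"title": "Example Product"}'

def parse_output(raw: str) -> dict:
    return json.loads(raw)

def run_chain(html_snippet: str) -> dict:
    # Each step feeds the next -- this sequencing is what a chain is.
    return parse_output(fake_llm(make_prompt(html_snippet)))

result = run_chain("<h1>Example Product</h1>")
print(result["title"])  # Example Product
```

LangChain wraps each of these steps in reusable components (prompt templates, model wrappers, output parsers) so they can be composed, swapped, and combined with external tools.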
LangGraph is a powerful framework built to streamline the design and deployment of LangChain-based agents. It handles all the behind-the-scenes complexity like defining input/output points, background task management, state control, scaling, and fault tolerance, so you can focus on configuring your agent’s logic rather than coding it from scratch.
In essence, LangGraph provides ready-to-use agent templates built on top of modern LLMs. These templates can be quickly customized to suit specific project needs, allowing developers to speed up production and iterate with ease.
Just like LangChain, the LangGraph library is open source and released under the permissive MIT license.
Key Points:
In Summary:
LangGraph takes LangChain to the next level by adding structured orchestration, high scalability, and developer-friendly tools for building LLM agents. Whether you're deploying AI workflows for production or iterating on prototypes, LangGraph helps you get there faster with less manual setup and more reliable results. LangGraph sits at the advanced end of the LangChain vs LangGraph spectrum, designed for teams needing orchestration, modularity, and visual control.
LangChain is a framework for integrating LLMs (large language models) into your scripts and applications. It serves as a foundational layer from which you can begin developing complex data-processing chains. LangChain acts as middleware that connects LLMs with external tools and services.
LangGraph, in contrast, is a tool for designing LLM agents. Technically, it's also a framework, but it operates on top of LangChain and enables faster deployment of common use cases. Its main function is orchestrating a large number of AI agents.
The core architectural difference lies in how workflows are structured. LangChain is built around sequential chains — step-by-step flows. LangGraph, on the other hand, is based on a state graph model. Nodes in this graph can have multiple transitions and conditions.
In LangGraph, loops and conditional logic are built into the architecture. You can define actions for error handling, retries, and transitions based on outcomes. Similar behaviors can be implemented in LangChain, but doing so typically requires more complex custom programming.
In summary, if you just need to connect an LLM and manage the breakdown of large data tasks into chains, LangChain is the right choice.
If you need to rapidly design and deploy your own LLM agent or multiple agents with orchestration, then LangGraph is the better fit.
One of the key evaluation points in the LangChain vs LangGraph debate is how each framework handles parsing workflows and tool orchestration. The LangChain library doesn’t perform parsing by itself; its purpose is to facilitate interaction with neural networks. LangChain scraping relies on external tools and LLMs to handle the actual data extraction. This can include pattern recognition (identifying CSS classes, IDs, and specific attributes that help locate and extract the desired HTML elements and content), image recognition and processing (including detecting objects in images or videos), audio transcription, and more.
For practical examples of AI-based scraping, see the article: AI Scraping with ChatGPT: A Practical Guide.
LangChain Scraping Capabilities:
LangChain Scraping Limitations:
LangChain standardizes the syntax for interacting with LLMs. To support this, it offers built-in integration packages, core abstractions (langchain-core), chains, agents, and tools that help define the cognitive architecture of your application.
Among the tools that simplify the parsing process via LangChain are:
These tools work best when paired with a well-configured LangChain proxy layer, ensuring seamless data flow between the LLMs and target sources.
Since LangChain does not work directly with websites, it does not require proxies. However, for advanced use cases, setting up a LangChain proxy configuration becomes necessary when the framework is used alongside external tools that do require direct website access. Many target websites actively protect themselves from bots and automated traffic. To bypass these protections, you may need various tools and techniques. Rotating proxies with precise geotargeting are among the most reliable solutions and can work effectively in combination with others.
The most logical approach is to manage proxies at the level of the HTTP client you are using to connect to target websites. For example, if you are working in a Node.js environment, you need to configure proxies for node-fetch. If you’re using an anti-detect browser or a headless browser, the proxies should be set up for those using plugins or special variables/APIs.
Below is a fully functional parsing script written in Python that sets up and runs LangChain through a proxy:
import sqlite3
import json
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, Tool, AgentType

# -------------------------------
# Setting up SQLite database
# -------------------------------
conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    title TEXT,
    price TEXT,
    description TEXT
)
''')
conn.commit()

# -------------------------------
# Setting up Selenium with proxy
# -------------------------------
chrome_options = Options()
chrome_options.add_argument("--headless")
proxy = "http://username:password@your_proxy_ip:port"  # If no login: "http://ip:port"
chrome_options.add_argument(f'--proxy-server={proxy}')
driver = webdriver.Chrome(options=chrome_options)

# -------------------------------
# Setting up the LLM
# -------------------------------
llm = OpenAI(model_name="gpt-4o", temperature=0)

# -------------------------------
# Function for parsing HTML via LLM
# -------------------------------
def extract_product_data(html_content: str) -> str:
    prompt = f"""
Here is the HTML content of a product page:
{html_content}
Extract the product title, price, and description from it. Return the answer strictly in JSON format, for example:
{{"title": "...", "price": "...", "description": "..."}}
Respond only with JSON, without any extra text.
"""
    response = llm(prompt)
    return response

# -------------------------------
# Tool description for the agent
# -------------------------------
extract_tool = Tool(
    name="ExtractProductData",
    func=extract_product_data,
    description="Extracts product title, price, and description from an HTML page and returns JSON"
)

# -------------------------------
# Agent initialization
# -------------------------------
agent = initialize_agent(
    tools=[extract_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# -------------------------------
# Function to parse a single page
# -------------------------------
def parse_product(url):
    driver.get(url)
    time.sleep(3)  # Wait for page to render, in seconds
    html = driver.page_source
    # Query the agent
    query = f"Use the ExtractProductData tool to extract product data from this HTML: {html}"
    response = agent.run(query)
    try:
        data = json.loads(response)
        title = data.get("title", "")
        price = data.get("price", "")
        description = data.get("description", "")
    except json.JSONDecodeError:
        print(f"JSON parsing error for {url}")
        return None
    # Save to DB
    cursor.execute('INSERT INTO products (url, title, price, description) VALUES (?, ?, ?, ?)',
                   (url, title, price, description))
    conn.commit()
    print(f"Data saved for page: {url}")
    return data

# -------------------------------
# List of pages to parse
# -------------------------------
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    # Add your URLs here
]

for u in urls:
    parse_product(u)

# -------------------------------
# Cleanup
# -------------------------------
driver.quit()
conn.close()
This script sequentially processes a list of target pages using a proxy and a headless browser via the Selenium web driver. The HTML code of each page is passed to the LLM for analysis, which returns a response in JSON format based on the prompt. The script extracts product names, prices, and descriptions from the pages. The data is saved to an SQLite database.
When comparing LangChain vs LangGraph for scraping use cases, LangGraph offers significantly more control over conditions, state transitions, and branching logic.
LangGraph allows you to create flexible, multi-step workflows that can be easily described using visual flowcharts. In fact, there is dedicated software for this purpose — LangGraph Studio. The application’s architecture can be easily adjusted or extended to meet new requirements.
Key features include:
As with LangChain, proxies do not work out of the box in LangGraph. They need to be configured at the level of the HTTP clients (applications and libraries responsible for connecting to target websites), such as:
Here is a Python code example:
import requests
import random
from langgraph.graph import StateGraph, END

# --- State passed between nodes ---
class ScraperState:
    def __init__(self, url="", html="", proxy_used=""):
        self.url = url
        self.html = html
        self.proxy_used = proxy_used

# --- List of proxies for rotation ---
proxy_list = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    "http://user:pass@ip3:port",
]

# --- LangGraph Node: Request page using a proxy ---
def fetch_page_with_proxy(state: ScraperState):
    # Pick a random proxy
    proxy = random.choice(proxy_list)
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    try:
        print(f"Using proxy: {proxy}")
        response = requests.get(state.url, proxies=proxies, timeout=10)
        state.html = response.text
        state.proxy_used = proxy
    except Exception as e:
        print(f"Request error with proxy {proxy}: {e}")
        state.html = ""
        state.proxy_used = proxy
    return state

# --- Create the graph ---
graph = StateGraph(ScraperState)
graph.add_node("fetch_with_proxy", fetch_page_with_proxy)
graph.set_entry_point("fetch_with_proxy")
graph.add_edge("fetch_with_proxy", END)
compiled_graph = graph.compile()

# --- Run the graph ---
start_state = ScraperState(url="https://example.com")
final_state = compiled_graph.invoke(start_state)

print("Final HTML snippet:", final_state.html[:500])
print("Proxy used:", final_state.proxy_used)
A few words about the logic of this script: on each run, the single graph node picks a random proxy from the pool, fetches the target page through it, and stores the resulting HTML (or an empty string on failure) in the shared state together with the proxy that was used.
After analyzing multiple workflows and use cases, we can summarize the LangChain vs LangGraph dilemma in practical terms. It’s clear that neither LangGraph nor LangChain works with proxies directly, because these libraries do not interact with websites or web resources on their own; they always require intermediaries. Their main strength is working with neural networks and LLMs, and in that area they genuinely boost productivity, especially with complex architectures or a large number of LLMs.
This framework is ideal for building a middleware layer that connects and communicates with any neural networks. It comes with a variety of small, specialized tools available in one place. Need to count tokens, split content into chunks, or use different prompts for different tasks? No problem.
However, LangChain is a foundation, so it’s best suited for linear parsing scenarios.
This framework extends LangChain and simplifies the creation of custom conditions, transitions, and loops between different scenarios and nodes. As such, LangGraph allows you to describe (or design) highly complex parsers.
LangGraph and LangChain can be used together in the same project. In fact, the same developer offers a number of additional commercial products, such as LangSmith and LangGraph Platform (a ready-to-use cloud infrastructure).
Parsing is becoming more complex and often requires AI or neural networks. In the LangChain vs LangGraph landscape, instead of writing your own connectors or dealing with the syntax of individual external services, you can use an abstraction layer provided by the LangChain framework. It simplifies integration with over 600 external tools, including vector databases, storage systems, caching layers, and more. But the most important component is LLMs (Large Language Models).
If you need more than just linear request-processing scripts, you can enhance your parser with the LangGraph framework. It’s responsible for building graphs and allows you to create complex, multi-level workflows with custom transitions and loops.
It’s not accurate to say that one is better in the LangChain vs LangGraph comparison; each library serves its own purpose. Neither LangChain nor LangGraph can work with proxies out of the box, because they don’t have built-in support for direct web access. Still, proxies are critical for any parser. We've explained above how to integrate proxies into LangChain or LangGraph.