We’ve covered libraries and frameworks that speed up parser development in Python, Java, and Go many times before. There are various approaches: from the simplest HTTP clients that can only read “raw” HTML to highly complex algorithms incorporating computer vision and AI assistance. In this article, we’ll explore the most advanced frameworks for integrating large language models (LLMs) into your parsers: LangChain vs. LangGraph.
When considering how to build AI-based workflows that manage token limits and maintain conversational context, the LangChain vs LangGraph comparison becomes especially relevant.
Let’s start with a quick reminder: not all AI is the same. Neural networks that work with textual data are known as LLMs (Large Language Models). These models can do more than generate coherent text or give concise answers; they are also capable of performing tasks such as analyzing documents and summarizing their contents, extracting specific data from HTML code, and much more. As you might guess from the context, our focus here is on everything related to web parsing.
Many modern neural networks are available as public web services with accessible API interfaces. Some LLMs have already evolved to handle a variety of document types and media formats, including images, PDFs, presentations, audio, and video. However, despite this versatility, using them via API comes with certain limitations. The main challenge is tokenization.
To explain it simply: every new request sent to an LLM is treated independently, without memory of previous interactions. Moreover, each request is limited to a specific number of tokens (think of tokens as word-like units, though the actual calculation is more nuanced). This leads to a key technical challenge: when sending large requests to an LLM, you need to split the input into chunks that fit within the token limit and re-supply the relevant context with each new request.
This is where the concept of chains comes in: linked sequences of prompts and responses that preserve state and context across interactions.
To automate this process of managing request chains, you need specialized software or libraries. And this is exactly where LangChain and LangGraph come into play.
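Before reaching for a framework, it helps to see the core idea in isolation. Below is a minimal sketch of the chunk-splitting step in plain Python, approximating tokens as whitespace-separated words (real tokenizers count differently, and LangChain ships dedicated text splitters for this; the function name and token budget here are illustrative assumptions):

```python
# Minimal sketch of token-budget chunking (plain Python, no LangChain).
# Tokens are approximated as whitespace-separated words; real LLM
# tokenizers are more granular, so treat max_tokens as a rough budget.

def split_into_chunks(text: str, max_tokens: int = 100) -> list[str]:
    """Greedily pack words into chunks of at most max_tokens words each."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

document = "word " * 250  # a stand-in for a large scraped page
chunks = split_into_chunks(document, max_tokens=100)
print(len(chunks))  # 3 chunks: 100 + 100 + 50 words
```

Each chunk would then be sent as a separate request, with any context the model needs repeated in every prompt — exactly the bookkeeping that chains automate.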
LangChain is a framework designed to create chains (sequences of language model calls) and to integrate those calls with external tools like databases, scrapers, APIs, and more.
Currently, LangChain is available both as open-source libraries (for Python, JavaScript, and TypeScript) and as a ready-to-use platform. The hosted version functions as a web service with an API, offering robust integration capabilities, orchestration tools, and performance evaluation, all hosted in a secure cloud environment that’s well-suited for enterprise applications.
Key Features of LangChain:
In Short:
LangChain is the foundational layer for integrating LLMs into your scripts, tools, and enterprise workflows. It enables advanced use cases like contextual dialogue, tool-augmented reasoning, and AI-powered agents. In fact, more complex orchestration systems like LangGraph are built on top of LangChain to provide even greater control and flexibility. For many developers, LangChain is often the starting point in the LangChain vs LangGraph decision tree, especially when workflows are relatively straightforward.
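To make the “chain” idea concrete, here is a plain-Python sketch of the pattern LangChain formalizes: a prompt template feeding a model call feeding an output parser, run as one sequence. The `fake_llm` function is a hypothetical stand-in for a real LLM API call, not part of any library:

```python
# A plain-Python sketch of the "chain" pattern LangChain builds on:
# prompt template -> model call -> output parser, run as one sequence.
import json

def make_prompt(html_snippet: str) -> str:
    return f"Extract the product title from this HTML and answer in JSON: {html_snippet}"

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real chain would call an LLM API here.
    return '{"title": "Example Product"}'

def parse_output(raw: str) -> dict:
    return json.loads(raw)

def run_chain(html_snippet: str) -> dict:
    # Each step feeds the next -- this sequencing is what a chain is.
    return parse_output(fake_llm(make_prompt(html_snippet)))

result = run_chain("<h1>Example Product</h1>")
print(result["title"])  # Example Product
```

LangChain wraps each of these steps in reusable components (prompt templates, model wrappers, output parsers) so they can be composed, swapped, and combined with external tools.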
LangGraph is a powerful framework built to streamline the design and deployment of LangChain-based agents. It handles all the behind-the-scenes complexity like defining input/output points, background task management, state control, scaling, and fault tolerance, so you can focus on configuring your agent’s logic rather than coding it from scratch.
In essence, LangGraph provides ready-to-use agent templates built on top of modern LLMs. These templates can be quickly customized to suit specific project needs, allowing developers to speed up production and iterate with ease.
Just like LangChain, the LangGraph library is open source and released under the permissive MIT license.
Key Points:
In Summary:
LangGraph takes LangChain to the next level by adding structured orchestration, high scalability, and developer-friendly tools for building LLM agents. Whether you're deploying AI workflows for production or iterating on prototypes, LangGraph helps you get there faster with less manual setup and more reliable results. LangGraph sits at the advanced end of the LangChain vs LangGraph spectrum, designed for teams needing orchestration, modularity, and visual control.
LangChain is a framework for integrating LLMs (large language models) into your scripts and applications. It serves as a foundational layer from which you can begin developing complex data-processing chains. LangChain acts as middleware that connects LLMs with external tools and services.
LangGraph, in contrast, is a tool for designing LLM agents. Technically, it's also a framework, but it operates on top of LangChain and enables faster deployment of common use cases. Its main function is orchestrating a large number of AI agents.
The core architectural difference lies in how workflows are structured. LangChain is built around sequential chains — step-by-step flows. LangGraph, on the other hand, is based on a state graph model. Nodes in this graph can have multiple transitions and conditions.
In LangGraph, loops and conditional logic are built into the architecture. You can define actions for error handling, retries, and transitions based on outcomes. Similar behaviors can be implemented in LangChain, but doing so typically requires more complex custom programming.
In summary, if you just need to connect an LLM and manage the breakdown of large data tasks into chains, LangChain is the right choice.
If you need to rapidly design and deploy your own LLM agent or multiple agents with orchestration, then LangGraph is the better fit.
One of the key evaluation points in the LangChain vs LangGraph debate is how each framework handles parsing workflows and tool orchestration. The LangChain library doesn’t perform parsing by itself; its purpose is to facilitate interaction with neural networks. LangChain scraping relies on external tools and LLMs to handle the actual data extraction. This can include pattern recognition (identifying CSS classes, IDs, and specific attributes that help locate and extract the desired HTML elements and content), image recognition and processing (including detecting objects in images or videos), audio transcription, and more.
For practical examples of AI-based scraping, see the article: AI Scraping with ChatGPT: A Practical Guide.
LangChain Scraping Capabilities:
LangChain Scraping Limitations:
LangChain standardizes the syntax for interacting with LLMs. To support this, it offers built-in integration packages, core abstractions (langchain-core), chains, agents, and tools that help define the cognitive architecture of your application.
Among the tools that simplify the parsing process via LangChain are:
These tools work best when paired with a well-configured LangChain proxy layer, ensuring seamless data flow between the LLMs and target sources.
Since LangChain does not work directly with websites, it does not require proxies. However, for advanced use cases, setting up a LangChain proxy configuration becomes necessary when the framework is used alongside external tools that do require direct website access. Many target websites actively protect themselves from bots and automated traffic. To bypass these protections, you may need various tools and techniques. Rotating proxies with precise geotargeting are among the most reliable solutions and can work effectively in combination with others.
The most logical approach is to manage proxies at the level of the HTTP client you are using to connect to target websites. For example, if you are working in a Node.js environment, you need to configure proxies for node-fetch. If you’re using an anti-detect browser or a headless browser, the proxies should be set up for those using plugins or special variables/APIs.
Below is a fully functional parsing script written in Python that sets up and runs LangChain through a proxy:
import sqlite3
import json
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from langchain.llms import OpenAI
from langchain.agents import initialize_agent, Tool, AgentType

# -------------------------------
# Setting up SQLite database
# -------------------------------
conn = sqlite3.connect("products.db")
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    title TEXT,
    price TEXT,
    description TEXT
)
''')
conn.commit()

# -------------------------------
# Setting up Selenium with proxy
# -------------------------------
chrome_options = Options()
chrome_options.add_argument("--headless")
proxy = "http://username:password@your_proxy_ip:port"  # If no login: "http://ip:port"
chrome_options.add_argument(f'--proxy-server={proxy}')
driver = webdriver.Chrome(options=chrome_options)

# -------------------------------
# Setting up the LLM
# -------------------------------
llm = OpenAI(model_name="gpt-4o", temperature=0)

# -------------------------------
# Function for parsing HTML via LLM
# -------------------------------
def extract_product_data(html_content: str) -> str:
    prompt = f"""
Here is the HTML content of a product page:
{html_content}
Extract the product title, price, and description from it. Return the answer strictly in JSON format, for example:
{{"title": "...", "price": "...", "description": "..."}}
Respond only with JSON, without any extra text.
"""
    response = llm(prompt)
    return response

# -------------------------------
# Tool description for the agent
# -------------------------------
extract_tool = Tool(
    name="ExtractProductData",
    func=extract_product_data,
    description="Extracts product title, price, and description from an HTML page and returns JSON"
)

# -------------------------------
# Agent initialization
# -------------------------------
agent = initialize_agent(
    tools=[extract_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# -------------------------------
# Function to parse a single page
# -------------------------------
def parse_product(url):
    driver.get(url)
    time.sleep(3)  # Wait for page to render, in seconds
    html = driver.page_source
    # Query the agent
    query = f"Use the ExtractProductData tool to extract product data from this HTML: {html}"
    response = agent.run(query)
    try:
        data = json.loads(response)
        title = data.get("title", "")
        price = data.get("price", "")
        description = data.get("description", "")
    except json.JSONDecodeError:
        print(f"JSON parsing error for {url}")
        return None
    # Save to DB
    cursor.execute('INSERT INTO products (url, title, price, description) VALUES (?, ?, ?, ?)',
                   (url, title, price, description))
    conn.commit()
    print(f"Data saved for page: {url}")
    return data

# -------------------------------
# List of pages to parse
# -------------------------------
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    # Add your URLs here
]

for u in urls:
    parse_product(u)

# -------------------------------
# Cleanup
# -------------------------------
driver.quit()
conn.close()
This script sequentially processes a list of target pages using a proxy and a headless browser via the Selenium web driver. The HTML code of each page is passed to the LLM for analysis, which returns a response in JSON format based on the prompt. The script extracts product names, prices, and descriptions from the pages. The data is saved to an SQLite database.
When comparing LangChain vs LangGraph for scraping use cases, LangGraph offers significantly more control over conditions, state transitions, and branching logic.
LangGraph allows you to create flexible, multi-step workflows that can be easily described using visual flowcharts. In fact, there is dedicated software for this purpose — LangGraph Studio. The application’s architecture can be easily adjusted or extended to meet new requirements.
Key features include:
As with LangChain, proxies do not work out of the box in LangGraph. They need to be configured at the level of the HTTP clients (applications and libraries responsible for connecting to target websites), such as:
Here is a Python code example:
import requests
import random
from langgraph.graph import StateGraph, END

# --- State passed between nodes ---
class ScraperState:
    def __init__(self, url="", html="", proxy_used=""):
        self.url = url
        self.html = html
        self.proxy_used = proxy_used

# --- List of proxies for rotation ---
proxy_list = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    "http://user:pass@ip3:port",
]

# --- LangGraph Node: Request page using a proxy ---
def fetch_page_with_proxy(state: ScraperState):
    # Pick a random proxy
    proxy = random.choice(proxy_list)
    proxies = {
        "http": proxy,
        "https": proxy,
    }
    try:
        print(f"Using proxy: {proxy}")
        response = requests.get(state.url, proxies=proxies, timeout=10)
        state.html = response.text
        state.proxy_used = proxy
    except Exception as e:
        print(f"Request error with proxy {proxy}: {e}")
        state.html = ""
        state.proxy_used = proxy
    return state

# --- Create the graph ---
graph = StateGraph(ScraperState)
graph.add_node("fetch_with_proxy", fetch_page_with_proxy)
graph.set_entry_point("fetch_with_proxy")
graph.add_edge("fetch_with_proxy", END)
compiled_graph = graph.compile()

# --- Run the graph ---
start_state = ScraperState(url="https://example.com")
final_state = compiled_graph.invoke(start_state)

print("Final HTML snippet:", final_state.html[:500])
print("Proxy used:", final_state.proxy_used)
A few words about the logic of this script: on each run, the single graph node picks a random proxy from the pool, fetches the target page through it, and stores the resulting HTML (or an empty string on failure) in the shared state together with the proxy that was used.
After analyzing multiple workflows and use cases, we can summarize the LangChain vs LangGraph dilemma in practical terms. It’s clear that neither LangGraph nor LangChain works with proxies directly, because these libraries do not interact with websites or web resources on their own; they always require intermediaries. Their main strength is working with neural networks and LLMs, and in that area they genuinely boost productivity, especially with complex architectures or a large number of LLMs.
This framework is ideal for building a middleware layer that connects and communicates with any neural networks. It comes with a variety of small, specialized tools available in one place. Need to count tokens, split content into chunks, or use different prompts for different tasks? No problem.
However, LangChain is a foundation, so it’s best suited for linear parsing scenarios.
This framework extends LangChain and simplifies the creation of custom conditions, transitions, and loops between different scenarios and nodes. As such, LangGraph allows you to describe (or design) highly complex parsers.
LangGraph and LangChain can be used together in the same project. In fact, the same developer offers a number of additional commercial products, such as LangSmith and LangGraph Platform (a ready-to-use cloud infrastructure).
Parsing is becoming more complex and often requires AI or neural networks. In the LangChain vs LangGraph landscape, instead of writing your own connectors or dealing with the syntax of individual external services, you can use an abstraction layer provided by the LangChain framework. It simplifies integration with over 600 external tools, including vector databases, storage systems, caching layers, and more. But the most important component is LLMs (Large Language Models).
If you need more than just linear request-processing scripts, you can enhance your parser with the LangGraph framework. It’s responsible for building graphs and allows you to create complex, multi-level workflows with custom transitions and loops.
It’s not accurate to say that one is better in the LangChain vs LangGraph comparison; each library serves its own purpose. Neither LangChain nor LangGraph can work with proxies out of the box, because they don’t have built-in support for direct web access. Still, proxies are critical for any parser. We've explained above how to integrate proxies into LangChain or LangGraph.