Large-scale scraping always involves a huge number of requests and responses transmitted over the Internet. It is worth remembering that many types of data travel over HTTP and HTTPS: text, audio, video, documents, and other files. If the parser restarts, many requests in the queue may have to be reprocessed, and each of them costs traffic, time, computing resources, and ultimately money.
A simple example: if you do not save the parser's progress (intermediate data), such as the final HTML of heavy JavaScript-rendered pages, you will have to start all over again. There is a way out of this situation: caching and saving the parser's state.
This article covers how to use caching in Python for scraping and how to make that process efficient.
What Is Caching in Python and Why It Matters for Web Scraping
If you’re just starting with web scraping, you’ve likely run into two common problems:
- Slow script performance, especially when using headless browsers or anti-detect environments. Rendering each page can take tens of seconds, and when scaled to 1,000 pages or more, that quickly turns into hours of continuous work in a single thread.
- The risk of being blocked due to excessive requests. Website protection systems — for instance, advanced Web Application Firewalls (WAFs) — can flag your scraper’s activity as suspicious and block it at any moment.
Caching helps solve both of these problems.
Caching is the process of temporarily storing data. Such data can live in RAM, on a hard drive, or inside a database. The main task of caching is to speed up access to information so that frequently used working data is always at hand, without repeating expensive operations.
Caching in scraping is a domain-specific application of the same idea: storing the data the parser works with. Here, the purpose of caching is to save the current state of the parsing process so that the program can resume from the point where it stopped due to a block, a user command, or an unexpected error. Less commonly, caching also covers storing “raw” data that has been fetched from a website but not yet analyzed (extracted).
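As a minimal illustration of saving the parser's state, here is a hedged sketch that keeps the set of already processed URLs in a JSON file so a restarted run can skip finished pages; the file name and helper functions are assumptions made for this example, not part of any particular library:
import json, os

STATE_FILE = 'scraper_state.json'  # hypothetical file that stores progress between runs

def load_done_urls():
    # restore the set of URLs processed in previous runs
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'r', encoding='utf-8') as f:
            return set(json.load(f))
    return set()

def save_done_urls(done):
    # persist progress so the scraper can resume after a crash or block
    with open(STATE_FILE, 'w', encoding='utf-8') as f:
        json.dump(sorted(done), f)

done = load_done_urls()
for url in ["https://example.com/page1", "https://example.com/page2"]:
    if url in done:
        continue  # already handled in an earlier run
    # ... fetch and parse the page here ...
    done.add(url)
    save_done_urls(done)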
Common Approaches to Implementing Python Cache

Since we are talking about a scraper written in Python, it makes sense to focus on the standard mechanisms, functions, and libraries that already exist for this language and show how they work.
As mentioned earlier, caching is most often implemented for storing data in RAM, on disk, and in databases. So let's take a look at Python caching tools broken down into these categories.
In-Memory Caching
In Python, in-memory caching (often implemented as memoization, i.e., caching function results) can be built with standard tools such as dictionaries or with external libraries. In some cases, you can also use specialized storage systems such as Memcached, a well-known solution among PHP and web server developers.
The main goal of caching in a computer’s RAM (on a local machine or server) is to speed up repeated access to data, returning results instantly instead of running the same operations again.
1. Using a Regular Dictionary (dict)
A dictionary is the simplest way to implement caching in Python.
The idea is straightforward: the URL of a page acts as the key, and the HTML response as the value.
How it works:
- On the first request, the page is fetched from the website and stored in the dictionary.
- On the second request with the same URL, the data is returned directly from the dictionary (cache) — without sending a new network request.
This approach requires no external dependencies and works perfectly for small datasets or single-threaded scripts.
Here’s a simple example of implementation:
import requests

cache = {}

def get_page(url):
    if url in cache:
        return cache[url]  # return from cache
    response = requests.get(url)
    cache[url] = response.text  # save the result
    return response.text

html = get_page("https://example.com")
2. Caching Decorator — functools.lru_cache
This is a built-in decorator in Python used to cache the results of functions with recurring parameters.
“LRU” stands for Least Recently Used, meaning that when the cache is full, the oldest (least recently accessed) item is automatically removed.
Example (main part of the code):
import requests
from functools import lru_cache

@lru_cache(maxsize=100)  # sets the maximum cache size
def get_page(url):
    return requests.get(url).text

html = get_page("https://example.com")
3. Cachetools
Python also has an external library called cachetools, which extends the functionality of lru_cache and makes cache management easier.
Example of caching pages with cachetools:
import requests
from cachetools import TTLCache

cache = TTLCache(maxsize=100, ttl=3600)

def get_page(url):
    if url not in cache:
        cache[url] = requests.get(url).text
    return cache[url]
Besides LRU, cachetools also supports several other eviction strategies (a short example follows this list):
- LFU (Least Frequently Used)
- RRCache (Random Replacement)
- TTLCache (Time-To-Live based caching)
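For example, here is a minimal sketch (with an arbitrary maxsize) that uses the cached decorator from cachetools together with an LFU cache:
import requests
from cachetools import LFUCache, cached

# keep the 100 most frequently requested pages in memory
lfu_cache = LFUCache(maxsize=100)

@cached(cache=lfu_cache)
def get_page(url):
    return requests.get(url).text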
4. Memcached
Memcached is a high-performance, in-memory key-value store that uses hash tables to cache data. It works as a local caching service with a client-server architecture, allowing multiple applications or scripts to access shared cached data.
In Python, integration is done through the python-memcached library. Memcached is an excellent choice when dealing with large volumes of data or when you need shared caching across different processes or servers.
Example of usage:
import requests
import memcache

mc = memcache.Client(['127.0.0.1:11211'])  # connect to the Memcached server

def get_page(url):
    cached = mc.get(url)
    if cached:
        return cached
    response = requests.get(url)
    mc.set(url, response.text, time=3600)  # cache for 1 hour
    return response.text

html = get_page("https://example.com")
File-Based Caching in Python

RAM is always in short supply: every running application competes for it. If you have a fast disk, you can cache HTML as files instead, since disks usually offer far more space than RAM.
The simplest way to implement a file-based Python cache is to do it yourself: dedicate a directory for saved HTML pages and name files based on their URLs or a hash. Your scraper can then check whether a downloaded version exists and skip re-fetching that page.
Example of file-based caching in Python with URL hashing:
import os, hashlib, requests, time

# Define your folder for storing HTML
CACHE_DIR = 'cache_html'
EXPIRE = 3600 * 24  # cache lifetime — 1 day
os.makedirs(CACHE_DIR, exist_ok=True)

def get_html(url):
    # generate a unique filename based on the URL hash
    filename = hashlib.md5(url.encode()).hexdigest() + '.html'
    path = os.path.join(CACHE_DIR, filename)
    # if the file exists and hasn't expired — read from cache
    if os.path.exists(path) and (time.time() - os.path.getmtime(path)) < EXPIRE:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    # otherwise, fetch again (you can use a headless browser instead of requests)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # save to cache
    with open(path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return response.text
If outdated files take up space or get in the way, delete them with a separate function. Clearing the Python cache:
def clear_old_cache():
    for f in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, f)
        if time.time() - os.path.getmtime(path) > EXPIRE:
            os.remove(path)
Because only a single process is dealing with files in this simple script, conflicts are unlikely. But with a complex, multithreaded scraper running many operations in parallel, you can run into read/write contention, and you won’t have centralized control over cache lifetimes. For more robust setups, consider dedicated libraries.
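Within a single process, a threading.Lock is usually enough to serialize writes to the cache directory. Here is a brief sketch (the helper name is illustrative, and it does not protect against other processes touching the same files):
import threading

cache_lock = threading.Lock()

def save_to_cache(path, html):
    # only one thread at a time is allowed to write a cache file
    with cache_lock:
        with open(path, 'w', encoding='utf-8') as f:
            f.write(html)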
File-Based Python Cache Using pickle
pickle is Python’s standard module for object serialization. It allows you to save and load any data structures (lists, dictionaries, etc.) as binary files, which makes it a simple and compact way to implement caching.
Example of a minimal caching implementation:
import pickle, os

# Function to get cache based on a page URL
def get_cache(url):
    if os.path.exists('cache.pkl'):
        with open('cache.pkl', 'rb') as f:
            cache = pickle.load(f)
    else:
        cache = {}
    if url in cache:
        return cache[url]
    # Emulation of the parsing process
    data = f"Data for {url}"
    cache[url] = data
    with open('cache.pkl', 'wb') as f:
        pickle.dump(cache, f)
    return data
Caching Complex Objects with joblib
joblib is a specialized library for efficiently storing large objects, such as NumPy arrays or the outputs of machine learning models.
It supports compression and is optimized for working with large binary data files.
Typical integration into a Python scraper:
from joblib import dump, load
import os

if os.path.exists('cache.joblib'):
    data = load('cache.joblib')
else:
    data = {"page": "parsing results"}
    dump(data, 'cache.joblib')
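When cached objects are large, joblib's dump() also accepts a compress argument; a quick sketch (the file name is arbitrary):
from joblib import dump, load

data = {"page": "parsing results"}
dump(data, 'cache_compressed.joblib', compress=3)  # zlib compression, level 3
restored = load('cache_compressed.joblib')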
Caching in Python with shelve
shelve is a built-in Python module that works like a persistent dictionary — similar to dict, but it stores data on disk.
It’s a good fit for small to medium-sized caching (up to about 1 GB) between scraper runs.
Example:
import shelve

with shelve.open('cache.db') as cache:
    if 'url1' not in cache:
        cache['url1'] = 'Query results'
    print(cache['url1'])
Requests-Cache with Filesystem Backend
For the popular requests library, there’s a dedicated caching extension — requests-cache.
This is one of the best caching tools for web scraping with Requests: it automatically stores responses to requests.get() calls (and, if enabled via allowable_methods, other methods such as POST).
Depending on the configuration, it can use filesystem, SQLite, or Redis as storage backends.
Example implementation using filesystem storage:
import requests
import requests_cache

# Install a cache with a 1-hour lifetime (in seconds), stored on the filesystem
requests_cache.install_cache('cache_pages', backend='filesystem', expire_after=3600)

# First request — fetched from the network
r1 = requests.get('https://example.com')

# Second request — loaded from cache
r2 = requests.get('https://example.com')

# Verify whether the page was retrieved from cache
print(r2.from_cache)
Diskcache — Powerful File-Based Cache for Multithreaded Web Scraping
Diskcache is a fast and reliable file-based caching library, ideal for large-scale and multithreaded scraping tasks.
It supports TTL (time-to-live) settings, automatic cleanup, and can function as both a queue and a persistent dictionary.
Its API is also compatible with functools.lru_cache.
Example of file-based caching (under the hood, Diskcache keeps its index in SQLite):
import requests
from diskcache import Cache

cache = Cache('cache_disk')

@cache.memoize(expire=3600)
def fetch_data(url):
    return requests.get(url).text

html = fetch_data('https://example.com')
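Diskcache can also be used directly as a persistent dictionary with per-key expiration; a brief sketch (the key and value are illustrative):
from diskcache import Cache

cache = Cache('cache_disk')

# store a value with a 1-hour TTL and read it back
cache.set('https://example.com', '<html>...</html>', expire=3600)
html = cache.get('https://example.com')  # returns None if missing or expired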
Dogpile.cache — “Enterprise-Level” Caching System
Dogpile.cache is a more advanced and flexible caching solution, well-suited for complex scrapers and large-scale applications.
It supports detailed TTL control, customizable refresh logic, and various backends — including files, memory, Redis, and Memcached.
Below is an example using a file-based storage backend:
import requests
from dogpile.cache import make_region

region = make_region().configure(
    'dogpile.cache.dbm',
    arguments={'filename': './file_cache.dbm'}
)

@region.cache_on_arguments()
def get_html(url):
    return requests.get(url).text
Database Caching

When you cache to files, managing large numbers of records becomes tricky: which entries have expired, where they’re stored (especially with nested folder structures), and which process/thread is accessing them. To centralize this metadata and support more complex, large-scale caching — especially when several scraper threads run concurrently — you’ll want a real database.
The most popular databases for Python cache are SQLite and Redis.
Caching in Python with SQLite
SQLite is an embedded relational database that ships as a library for many languages (including Python). It does not require a separate DB server (unlike MariaDB/MySQL). Note that the database is stored as a single file on disk, so handle that file with care.
Example of caching with SQLite:
import sqlite3, time

conn = sqlite3.connect('cache.sqlite')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, data TEXT, ts REAL)')

def get_data(url):
    cur.execute('SELECT data, ts FROM cache WHERE url=?', (url,))
    row = cur.fetchone()
    if row and time.time() - row[1] < 3600:
        return row[0]
    # request emulation
    data = f"Parsing {url}"
    cur.execute('REPLACE INTO cache VALUES (?, ?, ?)', (url, data, time.time()))
    conn.commit()
    return data
Caching with Redis
Redis is one of the fastest in-memory databases, purpose-built for caching. It’s great for multithreaded and asynchronous scraping workflows. It runs as a service, so you connect to it over a host/port.
Example integration:
import redis, json, time

r = redis.Redis(host='localhost', port=6379, db=0)

def get_data(url):
    cached = r.get(url)
    if cached:
        return json.loads(cached)
    data = {"url": url, "time": time.time()}
    r.setex(url, 3600, json.dumps(data))  # store for 1 hour
    return data
Practical Examples of Caching in Python for Web Scraping

Above, we showed only the most simplified constructs. Now, let’s get more specific.
Caching HTTP Responses
import requests
import requests_cache
from bs4 import BeautifulSoup
import time

# Cache configuration
requests_cache.install_cache(
    cache_name='cache_pages',    # SQLite DB file will be cache_pages.sqlite
    backend='sqlite',            # storage type — SQLite
    expire_after=3600 * 24,      # cache lifetime — 24 hours
    allowable_methods=('GET',)   # cache only GET requests
)

def fetch_html(url):
    """Load an HTML page with caching."""
    try:
        response = requests.get(url, timeout=10)
        source = "cache" if response.from_cache else "network"
        print(f"[{time.strftime('%H:%M:%S')}] {url} → source: {source}")
        response.raise_for_status()
        return response.text
    except Exception as e:
        print(f"Error loading {url}: {e}")
        return None

def parse_title(html):
    """Extract the page title."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.text if soup.title else '(no title)'

if __name__ == "__main__":
    urls = [
        "https://www.python.org/",
        "https://www.wikipedia.org/",
        "https://example.com/"
    ]
    for url in urls:
        html = fetch_html(url)
        if html:
            print(f"→ {parse_title(html)}")
After the first run, a file named “cache_pages.sqlite” will appear next to your script. That’s your database. Depending on the requests-cache version, it contains tables such as:
- responses (stores response bodies — HTML, JSON, etc.),
- urls (maps URL → response ID),
- redirects (redirect history),
- metadata (service fields — request time, TTL, etc.).
Each record stores: the URL and method (GET, POST), response status (200, 301, 504, etc.), request date and TTL, and the HTML body.
If you wish, you can clear the cache via requests_cache.clear().
More advanced syntax for removing expired records (the exact method depends on your requests-cache version):
from requests_cache import get_cache

cache = get_cache()
cache.delete(expired=True)  # requests-cache >= 1.0; older releases provide cache.remove_expired_responses()
And here’s how to check from code whether a page exists in the cache:
from requests_cache import get_cache

cache = get_cache()
print(cache.has_url('https://www.python.org/'))  # in requests-cache >= 1.0, use cache.contains(url=...)
Caching with CacheControl
This is an alternative focused on HTTP caching headers (Cache-Control, ETag, Expires). It’s useful when the server itself defines cache rules.
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

sess = CacheControl(requests.Session(), cache=FileCache('.web_cache'))
resp = sess.get("https://example.com")
print(resp.from_cache if hasattr(resp, 'from_cache') else False)
Reusing Parsed HTML (BeautifulSoup / lxml objects)
To avoid re-parsing a page on every run, you can save already parsed HTML or extracted data to a file. File-based caching can be handled by libraries like pickle, joblib, diskcache, etc.
import os, pickle, requests
from bs4 import BeautifulSoup

CACHE_FILE = 'parsed_cache.pkl'

# load saved results if present
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, 'rb') as f:
        cache = pickle.load(f)
else:
    cache = {}

def get_parsed(url):
    if url in cache:
        print("→ from cache")
        return cache[url]
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    title = soup.title.text if soup.title else "(no title)"
    cache[url] = title
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump(cache, f)
    return title

print(get_parsed("https://www.python.org/"))
Combining Proxies with Cache for Efficiency
The idea is simple: caching eliminates redundant requests, while proxies ensure anonymity and distribute the load.
If a response is already found in the cache, no network request is sent — meaning proxy traffic and limits aren’t consumed.
import requests
import requests_cache

# enable caching
requests_cache.install_cache(
    'cache_proxy',
    backend='filesystem',
    expire_after=3600 * 12  # cache valid for 12 hours
)

# proxy list for rotation
proxies_list = [
    {"http": "http://proxy1:8080", "https": "http://proxy1:8080"},
    {"http": "http://proxy2:8080", "https": "http://proxy2:8080"},
]

def get_html(url, proxy):
    response = requests.get(url, proxies=proxy, timeout=10)
    print("Source:", "cache" if response.from_cache else "network")
    return response.text

urls = ["https://example.com", "https://www.python.org"]
for i, url in enumerate(urls):
    proxy = proxies_list[i % len(proxies_list)]
    html = get_html(url, proxy)
Limitations and Pitfalls of Caching in Python
Python cache can greatly reduce traffic and let your scraper preserve state, but there are downsides:
- Stale data. If prices or content change on a page, your scraper won’t see it if it keeps pulling from previously saved files or databases. Set cache lifetimes carefully and ensure “expired” pages are fetched from the site rather than the cache.
- URL parameter variants. Tricky topic. Some sites embed meaningful options in query parameters; others append partner or ad IDs. If you cache only the base URL (dropping parameters), you may lose important data. If you cache the full URL (including parameters), you may end up with endless duplicates of the same page. Choose a strategy based on a detailed analysis of the site (see the sketch after this list).
- Multithreading conflicts. With more parallel scraping processes, the risk of collisions grows: simultaneous reads/writes to the same cache files, accidental overwrites, etc. That’s why you should use databases or specialized libraries with access control instead of (or in addition to) plain file caches.
- Cache growth. The more you collect, the bigger your cache. In-memory caches can fill RAM quickly, and even disk stores can struggle at scale, slowing the scraper. Plan eviction policies up front and set storage limits.
- Content differences when using proxies. Some sites return different content based on device type or location. To avoid mismatches, choose proxies from the same region and with the same type of client IP. See also rotating residential and mobile proxies.
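To illustrate the second point, here is a hedged sketch that normalizes URLs before using them as cache keys by dropping known tracking parameters; the TRACKING_PARAMS set is an assumption and should be adapted to the target site:
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

# parameters assumed to be irrelevant for page content; adjust per site
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign', 'ref', 'partner_id'}

def normalize_url(url):
    # drop tracking parameters and sort the rest so equivalent URLs share one cache key
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(query))))

print(normalize_url("https://example.com/item?id=42&utm_source=ads"))
# prints: https://example.com/item?id=42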
Conclusion

Caching during scraping in Python can both solve certain problems and create new ones. For caching to work properly, you need to think in advance about limits, data storage methods, and update schemes when data becomes outdated.
There are various options, as well as ways to implement them, ranging from simple file storage to a distributed cache based on Redis. The optimal approach will depend on your tasks: the volume of parsing, the frequency of updates, the number of threads, and stability requirements. The main thing to remember is that the cache should not replace the data collection logic, but only speed it up and reduce the load on the network.
Caching is not the only problem. Websites often update their layout, which causes parsers to stop working. Here is some information on how to make your parser more stable.
For our part, we can offer high-quality proxies. They will help with scaling and bypassing blocks. Froxy has over 10 million IP addresses with precise targeting and automatic rotation.

