Large-scale scraping is not just about extracting data from HTML code. It is a continuous process, somewhat similar to moving through streets and houses. To explore the entire area, you need to walk down every street (the categories and sections of the target website) and look into every yard (individual pages). So, where do you get a list of all URLs on a website, and how do you then retrieve the content of all those pages?
In this article, we explain in detail the existing approaches to finding all URLs on a single domain, as well as how to crawl them properly.
Why only on a single domain? The reason is that pages on the target website may contain links to external resources. If you do not limit the collection of pages to one domain, the scraper can fall into recursion: it will keep finding new URLs on new websites, add them to the extraction queue, then discover even more links there, and so on indefinitely — or, more precisely, until the script has crawled the entire internet. And even then, that assumes you prevent it from visiting the same addresses twice.
By the way, there are ready-made tools that handle website crawling for the purpose of collecting all its URLs, such as Screaming Frog.
But in the context of this article, we are more interested in the logic behind building the crawling queue. This is a fundamental task when creating any scraper. The unfortunate part is that it has many possible solutions, and none of them can be called universally best.
So, how can you find all URLs on a specific website?
The scraper accesses the website's sitemap directly and extracts structured data from it. Under normal conditions, the site engine generates these sitemaps automatically, specifically for search bots and crawlers. The format of the stored information is strictly defined — it uses XML markup. At first glance, it seems easy.
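For a sitemap that follows the standard sitemaps.org schema, the extraction step could look roughly like the sketch below. The function name parse_sitemap and the sample XML are illustrative assumptions; a real file would be downloaded from the site and passed in as text or bytes.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemaps) found in one sitemap document."""
    root = ET.fromstring(xml_text)
    # <loc> sits one level below the root's children in both document types
    locs = [
        el.text.strip()
        for el in root.findall("*/sm:loc", SITEMAP_NS)
        if el.text
    ]
    # A <sitemapindex> root lists other sitemaps; a <urlset> root lists pages
    if root.tag.endswith("sitemapindex"):
        return [], locs
    return locs, []


# Hypothetical sample standing in for a downloaded sitemap.xml
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/catalog/</loc></url>
</urlset>"""

pages, children = parse_sitemap(SAMPLE)
print(pages)
```

When the root element is a sitemap index, the second list holds the nested sitemap files, and each of them has to be downloaded and parsed the same way.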
But in practice, things are not always that simple:
To avoid sanctions and blocks, you should also study the bot rules described in the robots.txt directives.
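Python's standard library can evaluate robots.txt rules for you. The sketch below parses a hypothetical robots.txt from a string; the bot name and the rules are assumptions, and in a real crawler the file would be fetched from the target domain first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<domain>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Sitemap: https://example.com/sitemap.xml
"""

USER_AGENT = "SimpleCrawler"  # assumed bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url):
    """True if the parsed rules let USER_AGENT fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


print(is_allowed("https://example.com/catalog/"))
print(is_allowed("https://example.com/admin/users"))
# site_maps() is available in Python 3.8 and later
print(parser.site_maps())
```

A check like is_allowed can be called before a URL is added to the queue, and the Sitemap lines give you an extra source of starting addresses.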
The basic idea is simple: you enter the target domain into a program or an online service, the specialized tool crawls the pages on its own, and in the end it returns a list of the website's URLs in a convenient format such as CSV or XML. Your scraper can then work with that list. Tools in this category include Netpeak Spider, Screaming Frog, Ahrefs, and Semrush.
It sounds ideal, but there are some important caveats:
This approach relies on a search engine's index. You use a special search operator such as site:target-domain.com, and the search engine returns the pages from that domain that are included in its index. Google can be replaced with any other search engine, but it traditionally has broader coverage, so its results are more likely to be complete. It is a clever approach and can be quite practical in certain situations, for example, when you cannot get past the anti-bot protection on the target website.
But again, there are some limitations:
All of the problems and limitations listed above can be addressed by writing your own script — one that extracts URLs and crawls pages according to the logic you need: in public and restricted sections, with or without respecting robots.txt directives, with filtering of unwanted addresses, and so on.
But this approach also comes with its own challenges:
Here we describe the general logic of a scraper designed to find all URLs on a single domain:
The homepage of the domain or a sitemap link such as domain.com/sitemap.xml is usually used as the starting address. In some cases, however, it may be more practical to start with lists of key categories or URL masks based on regular expressions. This makes it possible to define the crawl structure from the beginning instead of relying on a single entry point.
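As a rough illustration, several entry points and regex-based URL masks could be declared like this. The seed list, the patterns, and the helper in_scope are all hypothetical and would need to match the real structure of the target site.

```python
import re

# Hypothetical entry points: homepage plus key category listings
SEED_URLS = [
    "https://example.com/",
    "https://example.com/catalog/",
    "https://example.com/blog/",
]

# Hypothetical masks describing which URLs belong to the crawl
URL_MASKS = [
    re.compile(r"^https://example\.com/catalog/[\w-]+/?$"),
    re.compile(r"^https://example\.com/blog/\d{4}/[\w-]+/?$"),
]


def in_scope(url):
    """True if the URL is a seed or matches at least one mask."""
    return url in SEED_URLS or any(mask.match(url) for mask in URL_MASKS)


print(in_scope("https://example.com/catalog/shoes"))
print(in_scope("https://example.com/cart"))
```

With masks like these, the crawl is limited to the sections you care about from the very first request.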
The script must retrieve the HTML content of the starting page or pages in order to extract the next set of URLs. This is how the queue is built, and it is the queue that allows the scraper to discover all URLs on the domain. The most common crawling logic is FIFO — first in, first out — meaning the earliest discovered URL is processed first. But other approaches are also possible, such as sequential category-based crawling, randomization, and similar strategies.
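To see how the queue discipline changes the crawl order, here is a small sketch that walks a hypothetical in-memory link graph instead of a real website. The SITE dictionary and the crawl_order function are assumptions made for the demonstration.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [], "/a/2": [], "/b/1": [],
}


def crawl_order(start, fifo=True):
    """Return the order in which pages are visited."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        # FIFO takes the oldest entry, LIFO takes the newest one
        page = queue.popleft() if fifo else queue.pop()
        order.append(page)
        for link in SITE[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl_order("/", fifo=True))   # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(crawl_order("/", fifo=False))  # ['/', '/b', '/b/1', '/a', '/a/2', '/a/1']
```

FIFO covers the site level by level, so top-level sections are discovered early. Taking the newest entry first dives into one branch before returning to the others.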
To preserve the progress and status of individual pages, the scraper needs a structure for storing data and visit markers. This can be a hash table or a full database with dedicated fields.
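If you choose a database, one possible layout is a single SQLite table with a status field, as in the sketch below. The table name, the column names, and the helper functions are assumptions for illustration, not a required schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the table on first use; pass a file path to persist between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def add_url(conn, url):
    """Insert a URL as 'pending'; return True only if it was not known before."""
    cur = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def next_pending(conn):
    """Return the earliest-added URL that has not been visited yet, or None."""
    row = conn.execute(
        "SELECT url FROM urls WHERE status = 'pending' ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_visited(conn, url):
    """Set the visit marker for a URL."""
    conn.execute("UPDATE urls SET status = 'visited' WHERE url = ?", (url,))
    conn.commit()


store = open_store()
add_url(store, "https://example.com/")
add_url(store, "https://example.com/about")
mark_visited(store, "https://example.com/")
print(next_pending(store))  # https://example.com/about
```

The primary key on the url column rejects duplicates at the storage level, so the same table serves both as the queue and as the list of visit markers.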
Technically, nothing stops you from finding URLs that lead to other websites, but including them in the crawl is a bad idea — otherwise, the process may never end. In addition, the database may be filled with unnecessary or "junk" URLs such as links to scripts, images, CSS files, or filtered pages with parameters. To exclude them, you need a set of filtering rules.
The most sensible rules usually look like this:
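As an illustration only, rules of this kind could be coded as follows. The host name, the extension list, and the parameter list are assumptions, and is_crawlable is a hypothetical helper that should be adapted to the target site.

```python
import posixpath
from urllib.parse import parse_qs, urlparse

ALLOWED_HOST = "example.com"  # assumed target domain
# Assumed lists; extend them to match the target site
SKIP_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".jpeg", ".gif",
                   ".svg", ".ico", ".pdf", ".zip"}
SKIP_PARAMS = {"sort", "filter", "utm_source", "utm_medium", "utm_campaign"}


def is_crawlable(url):
    """Apply the filtering rules to one absolute URL."""
    parsed = urlparse(url)
    # Rule 1: only web pages (drops mailto:, tel:, javascript: and similar)
    if parsed.scheme not in ("http", "https"):
        return False
    # Rule 2: only the target domain and its subdomains
    host = parsed.hostname or ""
    if host != ALLOWED_HOST and not host.endswith("." + ALLOWED_HOST):
        return False
    # Rule 3: no scripts, styles, images, or other static files
    extension = posixpath.splitext(parsed.path)[1].lower()
    if extension in SKIP_EXTENSIONS:
        return False
    # Rule 4: no sorted, filtered, or tracking variants of a page
    params = set(parse_qs(parsed.query, keep_blank_values=True))
    return not (params & SKIP_PARAMS)


print(is_crawlable("https://example.com/catalog/shoes"))
print(is_crawlable("https://example.com/static/app.js"))
```

Here a URL with an unwanted parameter is rejected outright. An alternative is to strip the parameter and keep the cleaned address, which is what the full script later in this article does.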
At this stage, you define the parsing logic for an individual page, which becomes your main loop. In a typical setup, the scraper takes a URL from the queue that is not marked as "visited," sends an HTTP request to that URL, extracts the full HTML code from the response body, searches the code for all links, cleans and normalizes them according to the filtering rules defined earlier, checks whether they already exist in the database, and if the links are new, adds them to the crawl queue.
In this context, "new" means pages that do not yet have the "visited" marker.
Even if the script is stopped or interrupted, the crawl can always be resumed from the exact point where it left off, thanks to the status-tracking system.
Here are a few other things worth paying attention to:
Typical problems when trying to discover all URLs on a domain include:
In other words, even a "simple" website crawl can easily turn into an enterprise-level task involving distributed queues, load balancing, dashboards, graphs, and more.
Here is an example of a simple script that crawls a specified number of pages on a target website. If you set a very large limit, it can crawl all URLs on the domain:
import time
import os
import csv
import requests
from urllib.parse import urljoin, urlparse, urldefrag, parse_qs, urlencode
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
# -----------------------------
# Configuration
# -----------------------------
START_URLS = ["https://example.com"]
ALLOWED_DOMAIN = "example.com"
STATE_FILE = "crawler_state.csv" # can be changed to .xml
SAVE_EVERY = 20 # save state every N URLs
REQUEST_DELAY = 0.5
MAX_URLS = 1000
TIMEOUT = 10
BLACKLIST_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "filter", "page"}
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; SimpleCrawler/1.1)"
}
# -----------------------------
# URL normalization
# -----------------------------
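def normalize_url(url):
    # Drop the #fragment: it never changes the page content
    url, _ = urldefrag(url)
    parsed = urlparse(url)
    # Accept only the target domain and its subdomains; a plain substring
    # test would also let through hosts such as "notexample.com"
    host = parsed.hostname or ""
    if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
        return None
    # Remove tracking, sorting, filtering, and pagination parameters
    query = parse_qs(parsed.query)
    filtered_query = {k: v for k, v in query.items() if k not in BLACKLIST_PARAMS}
    query_string = urlencode(filtered_query, doseq=True)
    normalized = parsed._replace(query=query_string).geturl()
    # Treat "/path/" and "/path" as the same address
    if normalized.endswith("/"):
        normalized = normalized[:-1]
    return normalized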
# -----------------------------
# Link extraction
# -----------------------------
def extract_links(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for tag in soup.find_all("a", href=True):
        # Resolve relative links against the page they were found on
        full_url = urljoin(base_url, tag.get("href"))
        normalized = normalize_url(full_url)
        if normalized:
            links.add(normalized)
    return links
# -----------------------------
# Save state (CSV)
# -----------------------------
def save_state_csv(queue, visited):
    with open(STATE_FILE, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["type", "url"])
        for url in visited:
            writer.writerow(["visited", url])
        for url in queue:
            writer.writerow(["queue", url])


def load_state_csv():
    queue = []
    visited = set()
    with open(STATE_FILE, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row["type"] == "visited":
                visited.add(row["url"])
            elif row["type"] == "queue":
                queue.append(row["url"])
    return queue, visited
# -----------------------------
# Save state (XML)
# -----------------------------
def save_state_xml(queue, visited):
    root = ET.Element("crawler")
    visited_el = ET.SubElement(root, "visited")
    for url in visited:
        ET.SubElement(visited_el, "url").text = url
    queue_el = ET.SubElement(root, "queue")
    for url in queue:
        ET.SubElement(queue_el, "url").text = url
    tree = ET.ElementTree(root)
    tree.write(STATE_FILE, encoding="utf-8", xml_declaration=True)


def load_state_xml():
    tree = ET.parse(STATE_FILE)
    root = tree.getroot()
    visited = set()
    queue = []
    for url in root.find("visited"):
        visited.add(url.text)
    for url in root.find("queue"):
        queue.append(url.text)
    return queue, visited
# -----------------------------
# Universal save/load
# -----------------------------
def save_state(queue, visited):
    if STATE_FILE.endswith(".xml"):
        save_state_xml(queue, visited)
    else:
        save_state_csv(queue, visited)


def load_state():
    # (None, None) tells the caller that there is no saved state yet
    if not os.path.exists(STATE_FILE):
        return None, None
    if STATE_FILE.endswith(".xml"):
        return load_state_xml()
    else:
        return load_state_csv()
# -----------------------------
# Crawler
# -----------------------------
def crawl():
    queue, visited = load_state()
    if queue is None:
        print("[i] State file not found. Starting fresh.")
        queue = list(START_URLS)
        visited = set()
    else:
        print(f"[i] Loaded state. Queue: {len(queue)}, Visited: {len(visited)}")
    processed_since_save = 0
    while queue and len(visited) < MAX_URLS:
        url = queue.pop(0)  # FIFO: take the earliest discovered URL
        if url in visited:
            continue
        print(f"[+] Visiting: {url}")
        visited.add(url)
        processed_since_save += 1
        try:
            response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
            if response.status_code == 200:
                for link in extract_links(response.text, url):
                    if link not in visited and link not in queue:
                        queue.append(link)
        except requests.RequestException as exc:
            print(f"[!] Request failed: {url} ({exc})")
        # The periodic save and the delay run after every request, including
        # failed ones, so errors cannot skip the checkpoint or the rate limit
        if processed_since_save >= SAVE_EVERY:
            print("[i] Saving state...")
            save_state(queue, visited)
            processed_since_save = 0
        time.sleep(REQUEST_DELAY)
    print("[i] Final save...")
    save_state(queue, visited)
    return visited
# -----------------------------
# Run
# -----------------------------
if __name__ == "__main__":
    urls = crawl()
    print(f"\nTotal URLs collected: {len(urls)}")
The script works without a headless browser and uses the Beautiful Soup parser. If there is no file with saved URLs next to the script when it starts, it creates one and fills it from scratch. If the file already exists, the script loads the visit statuses from it so the process does not restart from the beginning every time.
If needed, you can set your own target URL, as well as your own limits, delays, and parameters that should be removed from the URL structure.
Above, we looked at the most common ways to check all pages of a website and build a final list of URLs for a single domain. The easiest approach is to use ready-made tools. The most flexible one, and the one that gives you full control over the crawl logic, is to write your own crawling script.
Neither specialized tools nor large-scale scrapers can work for long without high-quality proxies. Froxy provides millions of rotating proxies with traffic-based pricing. The number of parallel threads can reach up to 1000. With our help, you can handle even the most complex data collection tasks.