Find All URLs on a Domain: Crawl Website & List Pages

Written by Team Froxy | May 26, 2026 7:00:00 AM

Large-scale scraping is not just about extracting data from HTML code. It is a continuous process, somewhat similar to moving through streets and houses. To explore the entire area, you need to walk down every street (the categories and sections of the target website) and look into every yard (individual pages). So, where do you get a list of all URLs on a website, and how do you then retrieve the content of all those pages?

In this article, we explain in detail the existing approaches to finding all links on a website, as well as how to organize their proper crawling.

Best Ways to Find All URLs on a Domain

Why only on a single domain? Because pages on the target website may link to external resources. If you do not limit the crawl to one domain, the scraper can end up in an endless loop: it keeps finding new URLs on new websites, adds them to the extraction queue, discovers even more links there, and so on indefinitely or, more precisely, until the script has crawled the entire internet. And even that assumes you prevent it from visiting the same addresses twice.

By the way, there are ready-made tools that handle website crawling for the purpose of collecting all its URLs, such as Screaming Frog.

But in the context of this article, we are more interested in the logic behind building the crawling queue. This is a fundamental task when creating any scraper. The unfortunate part is that it has many possible solutions, and none of them can be called universally best.

So, how can you find all URLs on a specific website? There are four main approaches.

Method 1. Access the XML Sitemap Files (sitemap.xml)

The scraper accesses the website's sitemap directly and extracts structured data from it. Under normal conditions, the site engine generates these sitemaps automatically, specifically for search bots and crawlers. The format of the stored information is strictly defined — it uses XML markup. At first glance, it seems easy.

But in practice, things are not always that simple:

The website may not have a sitemap at all.
The sitemap may contain outdated information, especially if it is not generated automatically.
Important sections and pages may be missing from the sitemap.
The sitemap itself may be split into several files for different sections and content types.
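
To illustrate the last point, here is a minimal sketch of sitemap parsing with Python's standard library. It assumes the file follows the sitemaps.org protocol; the function name parse_sitemap and the sample documents are hypothetical. The function reports whether a file is a sitemap index (a list of other sitemaps) or a regular URL set, so the caller knows whether to fetch the nested files next.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, locations) for a sitemap document.

    kind is "index" when the file lists other sitemaps,
    or "urlset" when it lists page URLs.
    """
    root = ET.fromstring(xml_text)
    # The root tag looks like "{namespace}sitemapindex" or "{namespace}urlset".
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    locations = [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return kind, locations


# Hypothetical sample documents used instead of real HTTP responses.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(parse_sitemap(SAMPLE_INDEX))
print(parse_sitemap(SAMPLE_URLSET))
```

In a real crawl, every location returned from an index would be downloaded and passed through the same function until only page URLs remain.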

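To avoid sanctions and blocks, you should also study the crawler rules in the site's robots.txt file. The same file often lists the sitemap locations in Sitemap: directives, which helps when the sitemap is not at the default /sitemap.xml address.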
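
As a rough sketch, the robots.txt rules can be checked with the standard urllib.robotparser module. The rules below and the SimpleCrawler user agent are hypothetical; a real script would download the file from the target domain first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from
# https://<target-domain>/robots.txt before the crawl starts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

USER_AGENT = "SimpleCrawler"  # assumed bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Is the bot allowed to request these addresses?
print(parser.can_fetch(USER_AGENT, "https://example.com/blog/post-1"))
print(parser.can_fetch(USER_AGENT, "https://example.com/admin/login"))

# Requested pause between requests, if the site declares one.
print(parser.crawl_delay(USER_AGENT))

# Sitemap addresses declared in the file (Python 3.8+).
print(parser.site_maps())
```

A crawler would typically call can_fetch before adding a URL to the queue and use the crawl delay as a lower bound for its own pauses.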

Method 2. Use Ready-Made Specialized Tools and Programs

The basic idea is simple: you enter the target domain into a program or an online service, the specialized tool crawls the pages on its own, and in the end it returns a complete list of all website URLs in a convenient format such as CSV or XML. Your scraper can then work with that list. Niche tools in this category include solutions such as Netpeak Spider, Screaming Frog, Ahrefs, Semrush, and others.

It sounds ideal, but there are some important caveats:

Many SEO crawlers and similar SEO services are paid products.
Standalone programs may not work well with certain types of websites, because many modern websites cannot be crawled properly without headless browsers. They may also require high-quality rotating proxies so that your IP does not get banned after a burst of automated requests.
Until the program finishes collecting URLs, you cannot move on to the next stage of processing. And if the target website has tens of thousands of pages, you may have to wait a very long time.
External services and standalone programs are often difficult to integrate with custom automation scripts.

Method 3. Use Google Search

Google can be replaced with any other search engine. However, Google traditionally has broader coverage, so its index is more likely to include the pages you need. The logic is simple: you use a search operator such as site:target-domain.com, and the search engine returns the pages from that domain that are included in its index. It is a clever approach and can be quite practical in certain situations, for example, when you cannot get past the anti-bot protection on the target website.

But again, there are some limitations:

In Google, you can only retrieve the pages that the search engine actually knows about. Coverage is never 100%, since some pages and sections may remain inaccessible to the crawler.
Google's search depth is not unlimited. And considering that search results can no longer be configured to display 100 results at once, you will only be able to collect a small portion of all existing URLs on a single domain.
Search results have long been personalized, so the output may be somewhat inconsistent. Different IP addresses and locations can return different lists of pages for the same domain.
Google also protects itself from scrapers and may periodically show captchas or block certain IP addresses.

See also: How to Scrape Google SERPs

Method 4. Find All URLs on a Domain with a Custom Script

All of the problems and limitations listed above can be addressed by writing your own script — one that extracts URLs and crawls pages according to the logic you need: in public and restricted sections, with or without respecting robots.txt directives, with filtering of unwanted addresses, and so on.

But this approach also comes with its own challenges:

You have to build the script yourself. Without programming knowledge, you may not even be able to adapt ready-made solutions to your specific needs.
Pages may contain traps such as honeypots, and the website itself may be protected by a web firewall, for example Cloudflare WAF.
The pages themselves may be generated dynamically. As a result, crawling may require more complex and resource-intensive solutions, such as anti-detect tools or headless browsers.
It is important to save the progress or status of the crawl so you can resume from the point where an error occurred instead of starting over from scratch.
There is always a risk of a combinatorial explosion — a situation where URLs contain extra parameters that do not carry useful meaning. Even so, they make each URL unique, so the final list can grow endlessly with technical duplicates.
The URL list may end up including not only actual pages, but also all kinds of additional files such as CSS stylesheets, JavaScript scripts, images, documents, and more.

How to Crawl a Website for All URLs

Here we describe the general logic of a scraper designed to find all URLs on a single domain:

Define the starting points (entry points)

The homepage of the domain or a sitemap link such as domain.com/sitemap.xml is usually used as the starting address. In some cases, however, it may be more practical to start with lists of key categories or URL masks based on regular expressions. This makes it possible to define the crawl structure from the beginning instead of relying on a single entry point.
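
A minimal sketch of this idea is shown below. The entry points, the /catalog/ structure, and the helper name matches_mask are all hypothetical and would need to be adapted to the real website.

```python
import re

# Hypothetical entry points: homepage, sitemap, and a key category listing.
START_URLS = [
    "https://example.com/",
    "https://example.com/sitemap.xml",
    "https://example.com/catalog/",
]

# Hypothetical URL masks that describe which addresses belong to the crawl.
ALLOWED_MASKS = [
    re.compile(r"^https://example\.com/catalog/[\w-]+/?$"),          # category pages
    re.compile(r"^https://example\.com/catalog/[\w-]+/[\w-]+/?$"),   # product pages
]


def matches_mask(url):
    """Return True when the URL fits at least one allowed mask."""
    return any(mask.match(url) for mask in ALLOWED_MASKS)


for candidate in (
    "https://example.com/catalog/shoes/",
    "https://example.com/catalog/shoes/red-sneakers",
    "https://example.com/blog/post",
):
    print(candidate, matches_mask(candidate))
```

With masks like these, the crawl structure is fixed up front: anything that does not look like a category or product page never enters the queue.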

Initialize the crawl queue

The script must retrieve the HTML content of the starting page or pages in order to extract the next set of URLs. This is how the queue is built, and it is the queue that allows the scraper to discover all URLs on the domain. The most common crawling logic is FIFO — first in, first out — meaning the earliest discovered URL is processed first. But other approaches are also possible, such as sequential category-based crawling, randomization, and similar strategies.
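
The effect of the queue discipline is easy to see on a toy example. The sketch below uses a hypothetical in-memory link graph (SITE) instead of real pages and compares FIFO order, which crawls level by level, with LIFO order, which dives into the most recently found branch first.

```python
from collections import deque


def crawl_order(graph, start, fifo=True):
    """Return the order in which pages are processed.

    graph maps a page to the links found on it.
    fifo=True takes the oldest queued URL (breadth-first);
    fifo=False takes the newest one (depth-first style).
    """
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        url = queue.popleft() if fifo else queue.pop()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site structure: homepage, two sections, three inner pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}

print(crawl_order(SITE, "/", fifo=True))
# → ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(crawl_order(SITE, "/", fifo=False))
# → ['/', '/b', '/b/1', '/a', '/a/2', '/a/1']
```

collections.deque is used here because removing an item from the left end is cheap, while list.pop(0) has to shift every remaining element.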

Create a structure for storing crawl progress and visited URLs

To preserve the progress and status of individual pages, the scraper needs a structure for storing data and visit markers. This can be a hash table or a full database with dedicated fields.
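
One possible way to organize such storage is a small SQLite table, sketched below. The table layout and the helper names (open_state, enqueue, next_url, mark_visited) are assumptions made for illustration, not a required schema.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the crawl state database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS urls (
               url       TEXT PRIMARY KEY,
               status    TEXT NOT NULL DEFAULT 'queued',
               http_code INTEGER
           )"""
    )
    return conn


def enqueue(conn, url):
    """Add a URL once; return True if it was new."""
    # The primary key rejects duplicates, OR IGNORE turns that into a no-op.
    cur = conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


def next_url(conn):
    """Return the oldest URL that has not been visited yet, or None."""
    row = conn.execute(
        "SELECT url FROM urls WHERE status = 'queued' ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_visited(conn, url, http_code):
    """Store the visit marker together with the response code."""
    conn.execute(
        "UPDATE urls SET status = 'visited', http_code = ? WHERE url = ?",
        (http_code, url),
    )
    conn.commit()


conn = open_state()  # pass a file name instead to keep state between runs
enqueue(conn, "https://example.com/")
enqueue(conn, "https://example.com/about")
mark_visited(conn, "https://example.com/", 200)
print(next_url(conn))
```

Because every change is committed immediately, an interrupted crawl can continue from the first row that still has the 'queued' status.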

Exclude external links and normalize URLs

Technically, nothing stops you from finding URLs that lead to other websites, but including them in the crawl is a bad idea — otherwise, the process may never end. In addition, the database may be filled with unnecessary or "junk" URLs such as links to scripts, images, CSS files, or filtered pages with parameters. To exclude them, you need a set of filtering rules.

The most sensible rules usually look like this:

Only URLs that begin with the target domain are added to the crawl queue or the database. Whether subdomains are included depends on your goals and the website's structure.
Found URLs should be converted into canonical form. That means removing anchors after the # symbol and stripping query parameters (passed after ? and separated with &) that do not change the page content, such as tracking tags. In some cases, you may also need to add or remove trailing slashes / to make all URLs consistent. It is also reasonable to filter out parameter-based pages such as sorting and filtering URLs, which often contain patterns like ?sort= or ?filter=.
Unnecessary patterns should be removed from the crawl list, including URLs that lead to scripts, images, CSS files, documents, and similar resources.
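
The three rules above can be sketched as a single function. The domain, the list of dropped parameters, and the list of skipped extensions are assumptions for illustration; the right values depend on the target website.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

ALLOWED_HOST = "example.com"   # assumption: the target domain
INCLUDE_SUBDOMAINS = False     # set to True to accept www.example.com, shop.example.com, ...
DROP_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "filter"}
SKIP_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf", ".zip")


def canonicalize(url):
    """Return the canonical form of an in-scope page URL, or None to skip it."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None  # mailto:, tel:, javascript: and similar links
    host = (parts.hostname or "").lower()
    in_scope = host == ALLOWED_HOST or (
        INCLUDE_SUBDOMAINS and host.endswith("." + ALLOWED_HOST)
    )
    if not in_scope:
        return None  # external link
    if parts.path.lower().endswith(SKIP_EXTENSIONS):
        return None  # static file, not a page
    # Keep only meaningful parameters and sort them so that
    # ?a=1&b=2 and ?b=2&a=1 map to the same address.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in DROP_PARAMS
    )
    # Unify trailing slashes, but keep "/" for the homepage.
    path = parts.path.rstrip("/") or "/"
    # The empty last element drops the #fragment.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, urlencode(query), ""))


print(canonicalize("https://Example.com/blog/?utm_source=x&id=5#top"))
# → https://example.com/blog?id=5
print(canonicalize("https://example.com/static/app.js"))
# → None
```

Comparing the host exactly, rather than searching for the domain as a substring, prevents look-alike hosts such as notexample.com from slipping into the queue.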

Run the first full crawl iteration

At this stage, you define the processing logic for an individual page, which becomes your main loop. In a typical setup, the scraper takes a URL from the queue that is not marked as "visited," sends an HTTP request to that URL, extracts the full HTML code from the response body, searches the code for all links, cleans and normalizes them according to the rules from the previous step, checks whether they already exist in the database, and adds any new links to the crawl queue.
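
The loop itself can be sketched in a few lines. To keep the example runnable without network access, the HTTP request is replaced by a fetch function passed in as an argument, and FAKE_SITE is a hypothetical in-memory website; link extraction uses the standard html.parser module, and normalization is reduced to dropping the fragment.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one host.

    fetch(url) must return the HTML text, or None when the request failed.
    """
    host = urlsplit(start_url).hostname
    queue = deque([start_url])
    seen = {start_url}   # every URL that was ever queued
    visited = []         # URLs that were actually processed, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop the #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(link).hostname == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical website: three pages and one external link.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">ext</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": "<p>No links here</p>",
}

print(crawl("https://example.com/", FAKE_SITE.get))
# → ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

In production, fetch would wrap a real HTTP client, and the seen set would be replaced by the persistent storage described earlier.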

Finish the crawl only when no new pages can be found

In this context, "new" means pages that do not yet have the "visited" marker.

Even if the script is stopped or interrupted, the crawl can always be resumed from the exact point where it left off, thanks to the status-tracking system.

Here are a few other things worth paying attention to:

Load management. When dealing with a large number of pages, it is important to strike the right balance between crawl depth and breadth. You should think carefully about the multithreading architecture and request speed. The request rate can be controlled through delays, but those delays should not always be identical, otherwise anti-bot systems may become suspicious.
Restrictions and limitations. Consider the crawling rules and logic based on robots.txt directives and the website's usage policy. Scraping should remain ethical.
Handling errors and unusual responses. The standard successful server response is status code 200. But not every link found on a website points to a valid resource. On top of that, an anti-bot system may block your requests at any time or return placeholder pages instead of real content. To protect your scraper from these issues, you need to properly handle error codes and monitor unusual page content. The most important status codes are: 404 (page does not exist), 403 (access denied), 301/302 (redirects, which should be followed to the final URL), 429 (too many requests), 500 (internal server error), and 503 (service unavailable).
Retry logic. This part of the code defines how the scraper behaves when errors are detected, for example when the website or server is unavailable or when your access has been blocked: how many times to retry, how long to wait between attempts, and when to give up. Proxy rotation can be used here as well.
Logging and monitoring. When scraping at a very large scale, it makes sense to introduce metrics that make progress easier to monitor: how many URLs have already been processed, how many remain in the queue, how many errors have been found and what types they are, the current crawl speed, and so on.
Special crawl termination logic. For example, the scraper may stop not only when the queue is empty, but also when certain limits are reached: time limits, crawl depth limits, page count limits, and so on.
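
Several of these points (non-identical delays, status code handling, and retries) can be combined in one helper. The sketch below is only one possible arrangement: the set of retryable codes, the function names, and the backoff settings are assumptions, and the fetch function is injected so the logic can be tried without real requests.

```python
import random
import time

# Assumption: these codes usually signal a temporary problem worth retrying.
RETRYABLE = {429, 500, 502, 503, 504}


def jittered_delay(base, spread=0.5):
    """Return a pause between base*(1-spread) and base*(1+spread) seconds."""
    return base * random.uniform(1 - spread, 1 + spread)


def fetch_with_retries(url, fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or retrying makes no sense.

    fetch(url) must return a (status_code, body) tuple.
    Returns (status_code, body) on success, (status_code, None) otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return status, body
        # 403, 404 and similar codes will not change on retry,
        # and the last attempt has nothing left to wait for.
        if status not in RETRYABLE or attempt == max_attempts:
            return status, None
        # Exponential backoff (1x, 2x, 4x, ...) with random jitter,
        # so the pauses never form a fixed, easily detected pattern.
        sleep(jittered_delay(base_delay * 2 ** (attempt - 1)))
    return None, None  # only reached when max_attempts < 1


# Hypothetical server that fails twice and then answers normally.
responses = iter([(503, ""), (503, ""), (200, "<html>ok</html>")])
pauses = []
print(fetch_with_retries("https://example.com/", lambda u: next(responses), sleep=pauses.append))
print(len(pauses))
```

Passing sleep as a parameter lets the example record the pauses instead of actually waiting. A proxy switch could be added at the same place where the pause is made.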

Typical problems when trying to discover all URLs on a domain include:

loops, where the scraper keeps going in circles;
exponential growth in the number of URLs due to incorrect parameter cleanup, or no cleanup at all;
content that is inaccessible without JavaScript rendering, meaning the complete list of website URLs cannot be collected without a full (headless) browser;
blocks from the website, most often IP bans, which are usually handled through proxies;
inefficient crawling logic, where the most important pages keep getting pushed to the end of the queue;
increasing architectural complexity as the crawl scale grows.
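
Two of these problems, loops and important pages being pushed to the end of the queue, can be reduced with a prioritized crawl frontier. The sketch below is one possible approach; the PriorityFrontier class and the scoring rule (shallower paths first) are hypothetical and should be replaced with whatever defines "important" for a given website.

```python
import heapq
from urllib.parse import urlsplit


def priority(url):
    """Lower number means earlier crawl. Assumed rule: shallow paths first."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


class PriorityFrontier:
    """Crawl queue that returns the most important URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()   # protects against loops and duplicates
        self._counter = 0    # tie-breaker: keeps discovery order for equal priority

    def push(self, url):
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority(url), self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return the URL with the best priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = PriorityFrontier()
for found in (
    "https://example.com/a/b/c",
    "https://example.com/",
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",   # duplicate, ignored
):
    frontier.push(found)

while frontier:
    print(frontier.pop())
```

Here the deep page /a/b/c was discovered first but is crawled last, while the homepage and the top-level sections move to the front.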

In other words, even a "simple" website crawl can easily turn into an enterprise-level task involving distributed queues, load balancing, dashboards, graphs, and more.

Python Example to Find All URLs on a Domain

Here is an example of a simple script that crawls a specified number of pages on a target website. If you set a very large limit, it can crawl all URLs on the domain:

import time
import os
import csv
import requests
from urllib.parse import urljoin, urlparse, urldefrag, parse_qs, urlencode
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

# -----------------------------
# Configuration
# -----------------------------
START_URLS = ["https://example.com"]
ALLOWED_DOMAIN = "example.com"
STATE_FILE = "crawler_state.csv"   # can be changed to .xml
SAVE_EVERY = 20                    # save state every N URLs
REQUEST_DELAY = 0.5
MAX_URLS = 1000
TIMEOUT = 10

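# "page" is deliberately not listed: stripping it would merge paginated
# listings into one URL and hide the links found on later pages
BLACKLIST_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sort", "filter"}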
HEADERS = {
   "User-Agent": "Mozilla/5.0 (compatible; SimpleCrawler/1.1)"
}

# -----------------------------
# URL normalization
# -----------------------------
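def normalize_url(url):
   # Drop the #fragment: it never changes the page content
   url, _ = urldefrag(url)
   parsed = urlparse(url)
   # Skip mailto:, tel:, javascript: and other non-HTTP links
   if parsed.scheme not in ("http", "https"):
       return None
   # Compare the host exactly (or as a subdomain); a substring test
   # would also accept hosts such as "notexample.com"
   host = (parsed.hostname or "").lower()
   if host != ALLOWED_DOMAIN and not host.endswith("." + ALLOWED_DOMAIN):
       return None
   # Remove blacklisted parameters, keep the rest
   query = parse_qs(parsed.query)
   filtered_query = {k: v for k, v in query.items() if k not in BLACKLIST_PARAMS}
   query_string = urlencode(filtered_query, doseq=True)
   normalized = parsed._replace(query=query_string).geturl()
   # Treat /path and /path/ as the same page
   if normalized.endswith("/"):
       normalized = normalized[:-1]
   return normalized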

# -----------------------------
# Link extraction
# -----------------------------
def extract_links(html, base_url):
   soup = BeautifulSoup(html, "html.parser")
   links = set()
   for tag in soup.find_all("a", href=True):
       full_url = urljoin(base_url, tag.get("href"))
       normalized = normalize_url(full_url)
       if normalized:
           links.add(normalized)
   return links

# -----------------------------
# Save state (CSV)
# -----------------------------
def save_state_csv(queue, visited):
   with open(STATE_FILE, "w", newline="", encoding="utf-8") as f:
       writer = csv.writer(f)
       writer.writerow(["type", "url"])
       for url in visited:
           writer.writerow(["visited", url])
       for url in queue:
           writer.writerow(["queue", url])

def load_state_csv():
   queue = []
   visited = set()
   with open(STATE_FILE, newline="", encoding="utf-8") as f:
       reader = csv.DictReader(f)
       for row in reader:
           if row["type"] == "visited":
               visited.add(row["url"])
           elif row["type"] == "queue":
               queue.append(row["url"])
   return queue, visited

# -----------------------------
# Save state (XML)
# -----------------------------
def save_state_xml(queue, visited):
   root = ET.Element("crawler")
   visited_el = ET.SubElement(root, "visited")
   for url in visited:
       ET.SubElement(visited_el, "url").text = url
   queue_el = ET.SubElement(root, "queue")
   for url in queue:
       ET.SubElement(queue_el, "url").text = url
   tree = ET.ElementTree(root)
   tree.write(STATE_FILE, encoding="utf-8", xml_declaration=True)

def load_state_xml():
   tree = ET.parse(STATE_FILE)
   root = tree.getroot()
   visited = set()
   queue = []
   for url in root.find("visited"):
       visited.add(url.text)
   for url in root.find("queue"):
       queue.append(url.text)
   return queue, visited

# -----------------------------
# Universal save/load
# -----------------------------
def save_state(queue, visited):
   if STATE_FILE.endswith(".xml"):
       save_state_xml(queue, visited)
   else:
       save_state_csv(queue, visited)

def load_state():
   if not os.path.exists(STATE_FILE):
       return None, None
   if STATE_FILE.endswith(".xml"):
       return load_state_xml()
   else:
       return load_state_csv()

# -----------------------------
# Crawler
# -----------------------------
def crawl():
   queue, visited = load_state()
   if queue is None:
       print("[i] State file not found. Starting fresh.")
       queue = list(START_URLS)
       visited = set()
   else:
       print(f"[i] Loaded state. Queue: {len(queue)}, Visited: {len(visited)}")

   processed_since_save = 0

   while queue and len(visited) < MAX_URLS:
       url = queue.pop(0)

       if url in visited:
           continue

       print(f"[+] Visiting: {url}")
       visited.add(url)
       processed_since_save += 1

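       try:
           response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
           # Only successful responses are parsed. Failed URLs stay marked
           # as visited, and the periodic save and delay below still run
           if response.status_code == 200:
               # response.url is the final address after any redirects
               links = extract_links(response.text, response.url)
               for link in links:
                   if link not in visited and link not in queue:
                       queue.append(link)
       except requests.RequestException as exc:
           print(f"[!] Request failed: {url} ({exc})")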

       # periodic save
       if processed_since_save >= SAVE_EVERY:
           print("[i] Saving state...")
           save_state(queue, visited)
           processed_since_save = 0

       time.sleep(REQUEST_DELAY)

   print("[i] Final save...")
   save_state(queue, visited)
   return visited

# -----------------------------
# Run
# -----------------------------
if __name__ == "__main__":
   urls = crawl()
   print(f"\nTotal URLs collected: {len(urls)}")

The script works without a headless browser and uses the Beautiful Soup parser. If there is no file with saved URLs next to the script when it starts, it creates one and fills it from scratch. If the file already exists, the script loads the visit statuses from it so the process does not restart from the beginning every time.

If needed, you can set your own target URL, as well as your own limits, delays, and parameters that should be removed from the URL structure.

Conclusion and Recommendations

Above, we looked at the most common ways to check all pages of a website and build a final list of URLs for a single domain. The easiest approach is to use ready-made tools. But the most flexible and technically correct one is to write your own crawling script.

Neither specialized tools nor large-scale scrapers can work for long without high-quality proxies. Froxy provides millions of rotating proxies with traffic-based pricing. The number of parallel threads can reach up to 1000. With our help, you can handle even the most complex data collection tasks.
