Wikipedia is an open encyclopedia created in 2001 by Jimmy Wales and Larry Sanger. It is maintained by a community of volunteers and exists in more than 300 language editions. Over time, the encyclopedia has been expanded with many supporting projects, such as Wiktionary, Wikimedia Commons, Wikiquote, Wikisource, and others. This ecosystem is now unified under the Wikimedia Foundation.
Since Wikimedia projects contain more than 300 million pages of subject-specific information, they form the most comprehensive database that can be used for a wide range of tasks: AI training, building reference sections, clarifying translations and word meanings, selecting images, icons, and other types of content, and more.
This article explains how Wikipedia can be scraped: whether it has an API, where and what ready-made libraries can be found, and which approach is better in different situations — direct scraping or using the API.
Yes, Wikipedia does have an API, but it is important to understand that this is not a purpose-built public web service. It exists because Wikipedia runs on the open-source MediaWiki engine, written in PHP. MediaWiki is somewhat like WordPress, but designed for online encyclopedias rather than blogs, and the necessary programming interfaces are already built into it.
At the moment, the Wikipedia API is available in three formats: the classic MediaWiki Action API (/w/api.php), the newer REST API (/api/rest_v1 and /w/rest.php), and the commercial Wikimedia Enterprise API for high-volume access.
For many common tasks, ready-made client libraries are already available, such as Pywikibot and the Python Wikipedia-API package used later in this article.
Wikipedia’s APIs allow you to retrieve structured data directly, without having to download an HTML page and parse its syntax with BeautifulSoup or similar libraries. This approach is faster, more reliable, and easier.
What exactly can you get out of the box through the MediaWiki Action API (/w/api.php)?
- full article content as HTML or wikitext (action=parse);
- plain-text extracts and summaries (prop=extracts);
- full-text search across articles (list=search);
- page metadata: categories, links, images, and language links;
- the revision history of any page (prop=revisions).
Everything listed above is returned either in JSON format or as HTML/plain text. You do not need to manually search for the necessary tags or clean the text from service elements. The Wikipedia API already does all of this for you. As a result, automated processing of Wiki data becomes simpler and faster.
Let’s look at a few examples of direct HTTP requests for retrieving the data you need. This is roughly how Wikipedia web scraping can be organized without complex scripts or libraries:
GET https://en.wikipedia.org/w/api.php?action=parse&format=json&formatversion=2&page=Python_(programming_language)&prop=text&disableeditsection=1
This request returns the full article content inside JSON, without page editing controls and service elements. Configuration is handled through the parameters format, formatversion, prop, and disableeditsection. The target page is passed in the page parameter.
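The same request can be issued from Python with the requests library. A minimal sketch follows; the helper name build_parse_params and the User-Agent string are ours, not part of any official client:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def build_parse_params(title: str) -> dict:
    """Parameters for action=parse: article HTML without edit-section links."""
    return {
        "action": "parse",
        "format": "json",
        "formatversion": 2,
        "page": title,
        "prop": "text",
        "disableeditsection": 1,
    }

if __name__ == "__main__":
    resp = requests.get(
        API_URL,
        params=build_parse_params("Python_(programming_language)"),
        headers={"User-Agent": "MyWikipediaParser/1.0 (contact@example.com)"},
        timeout=10,
    )
    data = resp.json()
    # data["parse"]["text"] holds the full article body as HTML
    print(data["parse"]["title"])
```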
GET https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=search&srsearch=Python programming language&srlimit=10&srwhat=text
This request returns a list of articles with titles, snippets (text fragments), and result positions.
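In Python, the search request might look like the sketch below; again, build_search_params is our own helper name, and the User-Agent is a placeholder:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def build_search_params(query: str, limit: int = 10) -> dict:
    """Parameters for a full-text search via list=search."""
    return {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "srwhat": "text",
    }

if __name__ == "__main__":
    resp = requests.get(
        API_URL,
        params=build_search_params("Python programming language"),
        headers={"User-Agent": "MyWikipediaParser/1.0 (contact@example.com)"},
        timeout=10,
    )
    # Each hit carries a title, a snippet, and its position in the results
    for hit in resp.json()["query"]["search"]:
        print(hit["title"])
```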
GET https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=Python&limit=10&namespace=0
It returns the classic OpenSearch format — an array in the following structure: [query, [titles], [descriptions], [urls]].
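Because the four OpenSearch arrays are parallel, they are easy to zip back into per-result records. A small illustrative sketch (the sample payload below mimics the response shape; the values are examples, not a live response):

```python
def unpack_opensearch(payload):
    """Split the four parallel OpenSearch arrays into per-result dicts."""
    query, titles, descriptions, urls = payload
    return [
        {"title": t, "description": d, "url": u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

# Payload shaped like an action=opensearch response (illustrative values):
sample = [
    "Python",
    ["Python (programming language)", "Python (snake)"],
    ["", ""],
    ["https://en.wikipedia.org/wiki/Python_(programming_language)",
     "https://en.wikipedia.org/wiki/Python_(snake)"],
]
results = unpack_opensearch(sample)
```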
Wikipedia is very strict about automated requests. If you do not follow the rules, your IP may be temporarily or permanently blocked. This is why it is important to understand how to avoid IP-based blocks.
Here is what you should pay attention to:
Example of a proper User-Agent:
headers = {"User-Agent": "MyWikipediaParser/1.2 (https://example.com/myproject; contact@example.com)"}
All of these requirements and recommendations for bot usage follow directly from Wikimedia policy.
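The recommendations above can be wrapped into a small "polite client" sketch: one session that always sends a descriptive User-Agent, plus a pause after every request. The helper names, the delay value, and the contact string are our assumptions, not prescribed by Wikimedia:

```python
import time
import requests

POLITE_DELAY = 1.0  # seconds between requests; pick a value that fits the policy

def make_session(contact: str) -> requests.Session:
    """A session that always sends a descriptive User-Agent."""
    session = requests.Session()
    session.headers["User-Agent"] = f"MyWikipediaParser/1.2 ({contact})"
    return session

def fetch_politely(session, url, params, delay=POLITE_DELAY):
    """One request, then a pause, so consecutive calls stay spaced out."""
    resp = session.get(url, params=params, timeout=10)
    resp.raise_for_status()
    time.sleep(delay)
    return resp
```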
As you may have noticed, the official Wikipedia API is not well suited for large-scale scraping — requests must be sent one at a time with pauses between them, and increasing the number of parallel threads is not allowed. This is more of a pet-project-level solution, nothing beyond that.
To work around these limitations, there are several possible approaches:
- subscribe to the Wikimedia Enterprise API, which removes the strict rate limits;
- download the official database dumps and process them offline;
- scrape the HTML pages directly, spreading requests across rotating proxies.
When manual scraping really makes sense:
- you need to collect large volumes of data in a short period of time;
- you need content that the API does not expose (non-standard page elements, rendered layout);
- you want full control over request scheduling, parallelism, and infrastructure.
This Wikipedia parsing script takes a search query (a term), finds the most relevant page through Wikipedia search, extracts the first paragraph (the definition), and stores the result in a DataFrame.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# A descriptive User-Agent is required by Wikimedia's bot policy
HEADERS = {"User-Agent": "MyWikipediaParser/1.2 (https://example.com/myproject; contact@example.com)"}

def get_wikipedia_intro(query):
    # 1. Build the search URL
    search_url = "https://en.wikipedia.org/w/index.php"
    params = {"search": query}
    response = requests.get(search_url, params=params, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # 2. Try to find the first article link on the search results page
    result = soup.select_one(".mw-search-result-heading a")
    if result:
        article_url = "https://en.wikipedia.org" + result.get("href")
        # 3. Load the article page
        article_response = requests.get(article_url, headers=HEADERS, timeout=10)
        article_soup = BeautifulSoup(article_response.text, "html.parser")
    else:
        # If the search immediately opened an article page (redirect),
        # the page we already fetched is the article itself
        article_url = response.url
        article_soup = soup

    # 4. Find the first non-empty paragraph inside the article body
    intro_text = None
    for p in article_soup.select("#mw-content-text p"):
        text = p.get_text(strip=True)
        if text:
            intro_text = text
            break

    return {
        "query": query,
        "url": article_url,
        "intro": intro_text
    }

if __name__ == "__main__":
    query = input("Enter a term: ")
    data = get_wikipedia_intro(query)
    df = pd.DataFrame([data])
    print(df)
    # save the DataFrame if needed
    df.to_csv("wikipedia_intro.csv", index=False)
Of course, writing a parser from scratch is always more difficult. Wikipedia encourages open-source development and even promotes a wide range of ready-made libraries and automation tools on its pages. There is even a special community-maintained tool directory called Toolhub.
Do not forget to install the required libraries first: pip install requests Wikipedia-API pandas.
import requests
import wikipediaapi
import pandas as pd

# A descriptive User-Agent is required by Wikimedia's bot policy
HEADERS = {"User-Agent": "my-wiki-parser/1.0 (example@mail.com)"}

def create_wiki_client(lang="en"):
    """
    Initialize the Wikipedia API client
    """
    return wikipediaapi.Wikipedia(
        language=lang,
        extract_format=wikipediaapi.ExtractFormat.WIKI,
        user_agent="my-wiki-parser/1.0 (example@mail.com)"  # Be sure to replace this with your own details!
    )

def search_wikipedia(query, lang="en"):
    """
    Search for a relevant article through the Wikipedia API
    """
    url = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json"
    }
    try:
        response = requests.get(url, params=params, headers=HEADERS, timeout=10)
        response.raise_for_status()
        data = response.json()
        results = data.get("query", {}).get("search", [])
        if results:
            return results[0]["title"]
    except Exception as e:
        print(f"[Search error]: {e}")
    return None

def get_wikipedia_intro(query, wiki, lang="en"):
    """
    Main function:
    - searches for an article
    - retrieves the page
    - extracts the first paragraph
    """
    title = search_wikipedia(query, lang)
    if not title:
        return {
            "query": query,
            "title": None,
            "url": None,
            "intro": None,
            "status": "not_found"
        }

    page = wiki.page(title)
    if not page.exists():
        return {
            "query": query,
            "title": title,
            "url": None,
            "intro": None,
            "status": "page_not_exists"
        }

    intro = page.summary.split('\n')[0]
    return {
        "query": query,
        "title": page.title,
        "url": page.fullurl,
        "intro": intro,
        "status": "ok"
    }

if __name__ == "__main__":
    query = input("Enter the term: ").strip()
    # you can change the language: en / ru / de
    lang = "en"
    # 1. create the client
    wiki = create_wiki_client(lang)
    # 2. retrieve the data
    result = get_wikipedia_intro(query, wiki, lang)
    # 3. output
    df = pd.DataFrame([result])
    print(df)
    # 4. save
    df.to_csv("wikipedia_result.csv", index=False)
Unlike direct HTML parsing, this approach uses the official API, which makes the code more stable and easier to maintain.
Let us first look at the advantages and disadvantages of each approach.
The official Wikipedia API provides maximum stability and predictability for your parser. You can plan the load in advance and run almost any task with minimal coding effort. Data can be collected immediately in the required format, without additional processing. And if you have an active corporate subscription to the Wikimedia Enterprise API, you do not have to worry about rate limits at all. As long as you follow the platform’s recommendations, the risk of being blocked remains minimal. In practice, the Wikipedia API covers about 90% of all possible tasks. If you do not need to modify encyclopedia pages or publish new content, the API can work perfectly well without authorization.
Direct Wikipedia scraping involves a number of risks. If your request rate is too high, you may receive a temporary or permanent IP ban. In addition, page layouts may occasionally change, which means the scraper may need to be partially rewritten. The scraper itself must be developed and maintained, and you also need to monitor its operation on your own infrastructure. However, if you use a system of proxy servers, you can bypass potential bans and request limits across the Wikimedia ecosystem. Another advantage is that you can collect absolutely any data, including information that cannot be accessed through the API.
So, which option should you choose?
If you have a pet project and need to collect a relatively small amount of information on a specific topic, the official Wikipedia API is the more sensible choice. There are ready-made libraries and detailed documentation for working with the encyclopedia and free media repositories. Building a parser in this case will require very little time and effort.
But if you need to collect large volumes of data in a short period of time, or if you need to extract special or non-standard content from pages, then you will have to build your own scraper. It will not work reliably without high-quality rotating proxies.
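With the requests library, routing traffic through a proxy is a matter of passing a proxies mapping. A minimal sketch, assuming a hypothetical list of proxy endpoints (the addresses and credentials below are placeholders; real rotation is often handled by the provider's gateway):

```python
import random
import requests

# Hypothetical proxy endpoints -- replace with your provider's gateway addresses
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def pick_proxy(pool):
    """Choose one proxy at random and build the requests proxies mapping."""
    chosen = random.choice(pool)
    return {"http": chosen, "https": chosen}

def fetch_via_proxy(url, pool=PROXY_POOL):
    """Fetch a page through a randomly selected proxy."""
    return requests.get(
        url,
        proxies=pick_proxy(pool),
        headers={"User-Agent": "MyWikipediaParser/1.2 (contact@example.com)"},
        timeout=15,
    )
```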
Wikipedia does indeed support the open distribution of information. All services within the Wikimedia ecosystem have official API interfaces, and not just one, but several, designed for different tasks and purposes. To quickly and smoothly collect a small amount of data, as well as receive update information, Wikipedia offers a wide range of ready-made tools and libraries. In addition, there are many technical solutions created by third-party developers.
However, when it comes to large-scale scraping, the Wikipedia API may not always be suitable, since the interface has significant request-rate limits and is not appropriate for every type of task. For this reason, it may become necessary to build a custom scraper that works independently of the API.
No large-scale scraper can operate reliably without high-quality proxies. Froxy offers a huge pool of residential, mobile, and datacenter proxies with automatic rotation. With them, you do not have to worry about blocks and can run hundreds of parallel data-collection threads at the same time.