Wikimedia launches dataset on Kaggle to dissuade AI scraping, ease server load

Voltaire Staff
Apr 17, 2025

The Wikimedia Foundation has released a new beta dataset on Kaggle, offering clean, structured Wikipedia content tailored for machine learning workflows, apparently to reduce server strain caused by large-scale AI scraping.

The dataset is designed to serve as an alternative for AI developers who traditionally scrape raw article text, a practice that places significant pressure on Wikimedia's infrastructure.

Wikimedia has claimed that at least 65 per cent of the most resource-intensive traffic to its core data centres comes from bots, stretching its resources thin across the globe.

It claimed that the surge is driven largely by automated programs scraping Wikimedia Commons, home to over 144 million freely licensed images, videos, and files, for use in training AI models.


The dataset, released Tuesday, features structured content from English and French Wikipedia articles in a developer-friendly JSON format. This includes high-utility components such as abstracts, infobox-style key-value data, short descriptions, image links, and clearly segmented article sections—excluding references and non-prose elements for streamlined use, the non-profit said in a release.

By leveraging the Snapshot API's Structured Contents beta, the dataset eliminates the need for cumbersome parsing, allowing data scientists and ML practitioners to jump straight into modeling, benchmarking, fine-tuning, and exploratory analysis.
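As a rough illustration of what working with pre-structured records could look like, here is a minimal Python sketch. It assumes the data arrives as JSON Lines (one article object per line) and that each record has fields named `name`, `abstract`, `image`, and `sections`. Those field names, the file layout, and the sample record are assumptions made for this example, not details confirmed in Wikimedia's release, so check the dataset's own documentation before relying on them.

```python
import io
import json
from typing import Iterator, TextIO


def iter_articles(stream: TextIO) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article: dict) -> dict:
    """Collect a few basic statistics from one record.

    The keys read here ("name", "abstract", "image", "sections") are
    hypothetical field names chosen for illustration.
    """
    return {
        "title": article.get("name", ""),
        "abstract_chars": len(article.get("abstract") or ""),
        "section_count": len(article.get("sections") or []),
        "has_image": bool(article.get("image")),
    }


# Made-up sample record standing in for one line of the real file.
sample_record = {
    "name": "Example article",
    "abstract": "A short abstract.",
    "image": {"content_url": "https://example.org/a.png"},
    "sections": [{"name": "History"}, {"name": "Usage"}],
}
sample_file = io.StringIO(json.dumps(sample_record) + "\n\n")

summaries = [summarize(article) for article in iter_articles(sample_file)]
print(summaries)
```

To run this against a downloaded file, the in-memory `sample_file` would be replaced with a handle from `open(path, encoding="utf-8")`. Reading line by line keeps memory use flat even for very large files.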

"As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data," said Brenda Flynn, Partnerships Lead at Kaggle. "There are few open datasets with more impact than those hosted by Wikimedia."

All data is freely licensed under Creative Commons Attribution-ShareAlike 4.0 and the GNU Free Documentation License, with some content in the public domain or available under alternative licenses.

Because the dataset is a beta release, Wikimedia is inviting feedback from the research and development community to help refine the offering and ensure it meets evolving AI needs without compromising the sustainability of its platforms.