Data drying up for AI firms with increasing use of anti-crawl measures: Report
- Voltaire Staff
- Jul 20, 2024
- 2 min read

The data available to train AI is shrinking as many of the most important web sources used for AI models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an MIT-led research group.
The study looked at 14,000 web domains that were commonly used for training AI models and found an "emerging crisis in consent."
The researchers found that across three data sets, C4, RefinedWeb, and Dolma, 5 per cent of all data and 25 per cent of data from the highest-quality sources have been restricted. The restrictions are set through the Robots Exclusion Protocol, in which website owners use a file called robots.txt to tell automated crawlers which pages they may not access.
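As a rough illustration of how such a restriction works in practice, the sketch below uses Python's standard urllib.robotparser module to check a made-up robots.txt file. The rules, the example.com addresses, the ExampleResearchBot name, and the can_crawl helper are all assumptions for illustration and are not taken from the study. GPTBot is the name OpenAI uses for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (illustrative only).
# It bars the crawler named GPTBot from the whole site and bars every
# other crawler from the /private/ section.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleResearchBot is a made-up crawler name that falls under the "*" rules.
    for agent in ("GPTBot", "ExampleResearchBot"):
        for path in ("/news/story.html", "/private/draft.html"):
            allowed = can_crawl(agent, "https://example.com" + path)
            print(f"{agent:20} {path:22} allowed={allowed}")
```

The protocol is advisory: a well-behaved crawler runs a check like this before fetching a page, but nothing in the file itself technically prevents a crawler from ignoring it.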
"We're seeing a rapid decline in consent to use data across the web that will have ramifications not just for AI companies, but for researchers, academics, and noncommercial entities," Shayne Longpre, the study's lead author, told The New York Times.
Data is crucial for AI systems, which are trained on images, text, and videos. Generative AI tools like ChatGPT, Google's Gemini, and Anthropic's Claude learn from data to write, code, and create images and videos. The more high-quality data is fed into these models, the better their outputs.
For years, AI developers collected data easily, but the industry's recent boom has led to tension with the owners of that data. Some publishers have put up paywalls or changed their terms of service to limit the use of their data for generative AI, while others have blocked the automated web crawlers used by companies such as OpenAI, Anthropic, and Google.
Reddit and Stack Overflow have started charging AI companies for data access. A few publishers have also taken legal action against AI companies.
Of late, OpenAI, Google, and Meta have gone to extreme lengths to gain data, including transcribing YouTube videos and bending their own data policies. Some companies have struck deals with publishers including The Associated Press and News Corp, the owner of The Wall Street Journal, to gain access to their data.
All the same, one researcher said the companies already have all the data they need, and that fencing off data now is akin to bolting the barn door after the horse has left.
Stella Biderman, the executive director of EleutherAI, a nonprofit AI research organisation, echoed that view.
"Major tech companies already have all of the data," she told The New York Times. "Changing the license on the data doesn't retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller start-ups or researchers."
Image Source: Unsplash