OpenAI harvested more than a million hours of YouTube content to train its most advanced large language model, GPT-4, a report has claimed.
Artificial intelligence firms have been scrambling to find new sources of training data for their models, having already harvested most of the traditional repositories of human knowledge, such as books, newspapers, and scientific databases.
Such harvesting has often run afoul of copyright law, with OpenAI and several other AI firms facing lawsuits from writers and publishers alike.
According to a report by The New York Times, OpenAI used its speech recognition software Whisper to transcribe the videos into text for training its AI model. The firm's president, Greg Brockman, was personally involved in collecting the videos, NYT wrote.
OpenAI spokesperson Lindsay Held told The Verge in an email that the company curates "unique" datasets for each of its models to "help their understanding of the world."
The spokesperson added that the company is also now looking into generating its own synthetic data.
Google spokesperson Matt Bryant told The Verge that the company has "seen unconfirmed reports" of OpenAI’s activity.
He said, "Both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content."
YouTube CEO Neal Mohan had also earlier alleged that OpenAI used the video streamer's content to train its text-to-video AI model Sora.
Both, however, stopped short of saying whether OpenAI's actions merit legal action from the company.