
Major tech companies like OpenAI, Google, and Meta have gone to great lengths to acquire sufficient data to train their artificial intelligence (AI) models. As AI technology has advanced, the demand for large volumes of high-quality data has surged, pushing these companies to explore unconventional and sometimes controversial methods of data acquisition. According to a report by The New York Times, OpenAI has used over a million hours of YouTube videos to train its large language model GPT-4.
The report claims that the GPT-4 model developed by OpenAI was trained using a speech recognition tool named Whisper to transcribe YouTube videos, generating new conversational text. This approach, which involved transcribing over one million hours of video content, raised concerns about compliance with YouTube's policies, as YouTube, owned by Google, restricts the use of its videos for independent applications.
This revelation comes days after YouTube CEO Neal Mohan was asked if OpenAI's Sora video generator uses data from YouTube. Mohan, in the interview with WSJ, said he was not aware whether OpenAI had used any YouTube data to train its new video tool. However, he said it would be a problem if OpenAI used YouTube videos to train the new models.
Even Google and Meta (owner of Facebook and Instagram) have been found to use some controversial data. For example, the report claims that Google has transcribed YouTube videos for AI training, potentially breaching copyright laws, and modified its terms of service to tap into more user-generated content. Meta discussed acquiring Simon & Schuster to access a vast library of books and considered using copyrighted data from the internet, despite legal and ethical implications.
Data Volume and AI Performance
The effectiveness of AI models, particularly in generating human-like text, images, sounds, and videos, is significantly enhanced by the volume of data they're trained on. Demand for high-quality data is so great that some speculate tech companies might exhaust available internet data by 2026.
Responses from Companies
OpenAI stated that each of its AI models is trained on a unique dataset to maintain competitiveness in research. Google acknowledged training AI models on some YouTube content under agreements with creators and clarified it doesn't use data from office apps outside an experimental program. Meta emphasized its investment in integrating AI into its services, leveraging billions of publicly shared images and videos.