OpenAI again used paywalled data to train its GPT-4o model: Report

The report claims that OpenAI's new GPT-4o model used information from books behind a paywall without proper authorisation.

Business Today Desk
  • Updated Apr 3, 2025 5:09 PM IST

OpenAI is once again under fire for its data practices, as new allegations suggest that the company may have trained its latest model, GPT-4o, on copyrighted and paywalled content without proper authorisation. The accusations come from the AI Disclosures Project, a non-profit AI watchdog organisation founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss.

Allegations of Unauthorised Training Data Use

The AI Disclosures Project’s recently published study has sparked controversy by alleging that OpenAI’s GPT-4o model demonstrates a strong recognition of copyrighted books published by O’Reilly Media, despite no licensing agreement between OpenAI and the publisher. According to the report, GPT-4o shows significantly higher recognition of paywalled O’Reilly book content compared to older models such as GPT-3.5 Turbo.

The research employed DE-COP, a type of "membership inference attack", to test whether the model could reliably distinguish human-authored texts from AI-generated paraphrases of them. If a model can tell the two apart, it implies prior knowledge of the original text, suggesting its inclusion in the training data. The study tested 13,962 paragraph excerpts from 34 O’Reilly books, concluding that GPT-4o "recognised" more paywalled content than GPT-3.5 Turbo, with an AUROC score of 82% compared to the latter’s score just above 50% (roughly chance level).
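To make the AUROC figure concrete, here is a minimal sketch of how such a score can be computed. The numbers below are invented placeholders, not the study's data: each "member" score stands for the model's confidence in picking out the verbatim passage for an excerpt suspected to be in the training set, and each "non-member" score for an excerpt assumed not to be.

```python
def auroc(member_scores, nonmember_scores):
    """Probability that a randomly chosen member excerpt scores higher
    than a randomly chosen non-member excerpt (ties count as half)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical recognition-confidence scores: a model that has "seen"
# the paywalled text tends to score member excerpts higher.
members = [0.9, 0.8, 0.75, 0.6]
nonmembers = [0.7, 0.5, 0.4, 0.3]

print(auroc(members, nonmembers))  # prints 0.9375
```

An AUROC near 1.0 means the model separates the two groups almost perfectly (as GPT-4o's 82% suggests for O’Reilly content), while a score near 0.5 means it does no better than guessing (as with GPT-3.5 Turbo).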

Despite the compelling results, the co-authors, including AI researcher Sruly Rosenblat, acknowledged potential limitations in their methodology, noting that users might have copied and pasted paywalled excerpts into ChatGPT, which could have introduced the content indirectly. Furthermore, the study did not examine OpenAI’s most recent models, such as GPT-4.5 and the reasoning models o3-mini and o1, leaving questions about whether these models also contain similar data.

A Broader Industry Issue

The report’s findings add to ongoing legal challenges faced by OpenAI, as the company battles multiple lawsuits alleging copyright infringement and unauthorised data usage. OpenAI and other leading AI companies have been advocating for looser restrictions on using copyrighted data for model training, arguing that such practices should fall under the fair use doctrine. Notably, OpenAI has already struck licensing deals with news publishers, social networks, and stock media libraries to secure data, and has been hiring journalists to fine-tune the output of its models.

The AI Disclosures Project highlights a systemic issue that could impact the quality and diversity of internet content. The study argues that the use of copyrighted data without compensation might reduce revenue for professional content creators, potentially diminishing content diversity online. It advocates for increased accountability and transparency in AI companies’ training processes, calling for policies that ensure content creators are compensated when their data is used.

While OpenAI continues to defend its practices, the AI Disclosures Project’s findings have undoubtedly intensified the debate over copyright and data ethics within the rapidly evolving AI industry. As the legal battles continue, the question of how to balance innovation with intellectual property rights remains unresolved.

Published on: Apr 3, 2025 5:09 PM IST