Researchers Claim OpenAI Trained AI Models on Copyrighted O'Reilly Media Books

A recent study published by the non-profit AI Disclosures Project raises concerns about whether OpenAI's artificial intelligence models, including GPT-4o, were trained on copyrighted content. According to the findings, OpenAI's latest large language models (LLMs) showed stronger recognition of copyrighted material than older models, which might suggest that certain copyrighted content was included in their training datasets. Notably, however, GPT-4o mini showed no such recognition, suggesting it was likely not trained on this specific copyrighted content.

DE-COP Method Used to Analyze Training Data

The research paper, titled Beyond Public Access in LLM Pre-Training Data, aimed to determine whether OpenAI's AI models had been trained on non-public book content, particularly focusing on O'Reilly Media—a major online learning platform containing numerous copyrighted books. The study was co-authored by Tim O'Reilly, the founder of O'Reilly Media.

To investigate the potential use of copyrighted material in OpenAI's training data, the researchers employed a testing method called DE-COP, introduced in a 2024 paper. DE-COP is a membership inference attack: it presents an AI model with a multiple-choice test that asks it to distinguish a verbatim copyrighted excerpt from machine-generated paraphrased alternatives. A model that reliably picks out the verbatim text is likely to have seen it during training.
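The multiple-choice setup can be illustrated with a short sketch. The helper names, prompt format, and toy data below are illustrative assumptions, not the paper's actual protocol; `model_pick` stands in for querying the LLM under test:

```python
import random

def build_quiz_item(verbatim: str, paraphrases: list, rng: random.Random):
    """Assemble one DE-COP-style item: the true excerpt hidden among
    machine-generated paraphrases, presented in random order."""
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    answer_index = options.index(verbatim)
    return options, answer_index

def guess_rate(model_pick, items) -> float:
    """Fraction of items where the model identifies the verbatim excerpt.
    `model_pick(options)` is a stand-in for the LLM being tested."""
    correct = sum(1 for options, answer in items
                  if model_pick(options) == answer)
    return correct / len(items)

# Toy usage with a stand-in "model" that always picks option 0.
rng = random.Random(42)
items = [build_quiz_item(f"verbatim excerpt {i}",
                         [f"paraphrase {i}.{j}" for j in range(3)], rng)
         for i in range(100)]
rate = guess_rate(lambda options: 0, items)
```

On 4-option items, a model with no memorization of the text should score near chance (25 percent); rates well above that are the signal DE-COP looks for.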

Claude 3.5 Sonnet Used to Paraphrase Copyrighted Material

For the experiment, the researchers used Anthropic's Claude 3.5 Sonnet model to paraphrase excerpts from the copyrighted material. In total, 3,962 paragraph excerpts from 34 O'Reilly Media books were tested. According to the results, the GPT-4o model exhibited the highest recognition rate of the copyrighted and paywalled O'Reilly books, scoring 82 percent on the Area Under the Receiver Operating Characteristic curve (AUROC), a metric derived from the test's guess rates.
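AUROC has a simple probabilistic reading: the chance that a randomly chosen "member" passage (suspected to be in the training data) scores higher than a randomly chosen non-member control. A minimal sketch, with illustrative guess rates rather than the study's actual data:

```python
def auroc(member_scores, nonmember_scores):
    """Area under the ROC curve, computed as the probability that a
    member passage outscores a non-member one (ties count as half)."""
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(member_scores) * len(nonmember_scores))

# Illustrative guess rates: passages suspected to be in the training data
# tend to be identified more often than unseen control passages.
members = [0.9, 0.8, 0.75, 0.6]
controls = [0.5, 0.4, 0.3, 0.55]
score = auroc(members, controls)  # 1.0 here; 0.5 would be chance level
```

An AUROC of 0.5 means the model cannot tell member passages from controls at all, so the reported 82 percent indicates a substantial, though not total, separation.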

This recognition suggests that GPT-4o may have been trained on content from O'Reilly Media, potentially in violation of copyright protections. Older models like GPT-3.5 Turbo showed lower recognition rates, though still high enough to imply some possible overlap with the copyrighted content.

GPT-4o Mini Excluded from Copyrighted Training Data

Interestingly, the study found that GPT-4o mini showed no recognition of the O'Reilly Media books, raising the possibility that OpenAI intentionally excluded this content when training the smaller model. The researchers also offered an alternative explanation: the DE-COP test may simply be less effective at detecting memorized content in smaller models, which could account for why GPT-4o mini did not exhibit the same recognition patterns as the full version of GPT-4o.

Potential Legal and Ethical Concerns

This study raises important questions about the ethical implications of using copyrighted content to train AI models. If OpenAI’s models have been trained on copyrighted material without proper permission or licensing, it could lead to legal challenges related to intellectual property rights. The results also highlight the growing need for transparency in how training data is sourced and the potential impact on the future development of AI technologies.

As AI models continue to evolve and integrate with different industries, the ongoing debate about copyright, fair use, and the responsibilities of AI developers is expected to intensify. This paper may serve as a catalyst for further scrutiny regarding the use of copyrighted content in AI training and its potential legal ramifications.

Looking Ahead: OpenAI’s Future Plans

As OpenAI continues to advance its AI capabilities, it has also teased plans for releasing its first open-source reasoning AI model, which could address some of the concerns raised by this research. The upcoming launch could provide more clarity on OpenAI’s approach to training data and how it will handle intellectual property in future models.

While the findings of this study are significant, OpenAI has yet to respond to the claims raised by the AI Disclosures Project, leaving room for further investigation into how AI models are trained and the transparency surrounding the datasets they utilize.
