YouTuber Files Class Action Over OpenAI Transcript Scraping

James Emmanuel

1 year ago

A YouTuber is seeking to file a class action lawsuit against OpenAI, alleging that the company used millions of YouTube video transcripts to train its AI models without notifying or compensating the creators.

The attorneys for David Millette, a YouTube user based in Massachusetts, filed a complaint in the U.S. District Court for the Northern District of California on Friday. The complaint alleges that OpenAI surreptitiously transcribed Millette’s and other creators’ videos to train the models that power the company’s AI-powered chatbot platform, ChatGPT, and other generative AI tools and products.

The complaint accuses OpenAI of violating copyright law and YouTube’s terms of service, which prohibit the use of videos for apps independent of its service, while “profiting significantly” from the creators’ work by collecting this data.

The complaint states that the value of OpenAI’s AI products increases as they become more sophisticated through training data sets, which is why prospective and current users purchase subscriptions to access them. “However, a significant portion of the content in OpenAI’s training data sets is derived from works that the company copied without permission, credit, or compensation.”

Millette, represented by the law firm Bursor & Fisher, requests a jury trial and over $5 million in damages for all YouTube users and creators whose data may have been compromised during OpenAI’s training.

Generative AI models such as OpenAI’s have no true intelligence. Fed many examples (e.g., movies, voice recordings, essays), they “learn” how likely data is to occur based on patterns, including the context of any surrounding data.

Most models are trained on data obtained from public websites and datasets across the internet. Companies contend that fair use protects their practice of collecting data indiscriminately and using it to train commercial models. Nevertheless, many copyright holders dispute this assertion and are pursuing legal action to stop the practice.

Video transcriptions have emerged as a critical component of training data due to the depletion of other data sources.

According to data from Originality.AI, over 35% of the world’s top 1,000 websites currently block OpenAI’s web crawler. According to a study conducted by the Data Provenance Initiative at MIT, approximately 25% of data from “high-quality” sources has been excluded from the main datasets used to train AI models.

The research group Epoch AI anticipates that developers will exhaust the data available to train generative AI models between 2026 and 2032 if the current trend of access blocking persists.

The New York Times reported in April that OpenAI developed Whisper, its first speech recognition model, to transcribe audio from videos and accumulate further training data.

According to The Times, an OpenAI team that included the company’s president, Greg Brockman, used Whisper to transcribe over one million hours of video from YouTube. The transcripts were then used to train OpenAI’s text-generating and text-analyzing model GPT-4.

According to The Times, certain OpenAI employees discussed the potential consequences of such an action, including possible violations of YouTube’s policies.

Proof News reported in July that generative AI models were trained using a dataset known as The Pile, which contains transcripts from hundreds of thousands of YouTube videos. This dataset was utilized by companies such as Anthropic, Apple, Salesforce, and Nvidia.

Numerous YouTube creators whose subtitles were in The Pile were unaware of this use and did not authorize it. Apple subsequently issued a statement denying that it intended to employ these models to enable AI features in its products.

Additionally, Google, the parent company of YouTube, has attempted to train its models using transcripts.

Google’s terms of service (ToS) were expanded last year to facilitate the collection of additional user data for training generative AI models. Under the previous ToS, it was unclear whether Google could use YouTube data to develop products beyond the video platform. Not so under the new provisions, which significantly loosen the reins.

We have contacted OpenAI and Google to inquire about the class action suit and will revise this article if they respond.

OpenAI has experienced a challenging beginning to the month.

On Monday, Elon Musk, the CEO of Tesla and X, filed a new lawsuit against OpenAI and its CEO, Sam Altman. The suit accuses the company of forsaking its original nonprofit mission by reserving some of its most advanced technology for commercial customers. Musk made the same allegations in a February lawsuit against OpenAI; however, the new suit also alleges that OpenAI is engaging in racketeering activity.