The New York Times and Daily News are suing OpenAI for allegedly scraping their works to train its AI models without permission, and OpenAI engineers purportedly deleted search data that the publishers’ lawyers had gathered for the case.
OpenAI consented to furnish two virtual machines earlier this autumn to enable counsel for The Times and Daily News to conduct searches for their copyrighted content in its AI training sets.
(Virtual machines are software-based computers that are frequently employed for testing, backing up data, and executing applications within the operating system of another computer.)
The publishers’ attorneys state in a letter that they and the experts they hired have spent more than 150 hours searching OpenAI’s training data since November 1.
However, that letter, filed late Wednesday in the U.S. District Court for the Southern District of New York, says that on November 14 OpenAI engineers deleted all of the publishers’ search data stored on one of the virtual machines.
OpenAI attempted to retrieve the data, and it was largely successful. Nevertheless, the letter states that the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models” due to the irretrievable loss of the folder structure and file names.
Counsel for The Times and Daily News stated that the plaintiffs have been forced to re-create their work from scratch, which has consumed a significant amount of person-hours and computer processing time.
According to the letter, the plaintiffs learned only the day before it was filed that the recovered data is unusable and that an entire week’s worth of attorneys’ and experts’ work must be redone, which is why the supplemental letter was submitted.
The plaintiffs’ counsel say they have no reason to believe the deletion was intentional. They note, however, that the incident underscores that OpenAI is “in the best position to search its own datasets” for potentially infringing content using its own tools.
A spokesperson for OpenAI declined to issue a statement.
OpenAI has consistently maintained that training models with publicly available data, such as articles from The Times and Daily News, is fair use in this instance and others.
In other words, OpenAI believes that when it develops models such as GPT-4o, which “learn” from billions of examples of e-books, essays, and other types of content to produce human-sounding text, it is not obligated to license or pay for those examples, even if it generates revenue from the models.
That said, OpenAI has signed licensing agreements with a number of news publishers, including the Associated Press; Axel Springer, the owner of Business Insider; the Financial Times; Dotdash Meredith, the parent company of People; and News Corp.
OpenAI has not publicly disclosed the terms of these agreements; however, Dotdash Meredith, one of its content partners, is reportedly receiving at least $16 million annually.
OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.