In August, word leaked out that The New York Times was considering joining the growing legion of creators that are suing AI companies for misappropriating their content. The Times had reportedly been negotiating with OpenAI regarding the potential to license its material, but those talks had not gone smoothly. So, eight months after the company was reportedly considering suing, the suit has now been filed.
The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times' paywall and ascribe hallucinated misinformation to the Times.
Journalism is expensive
The suit notes that The Times maintains a large staff that allows it to do things like dedicate reporters to a huge range of beats and engage in important investigative journalism, among other things. Because of those investments, the newspaper is often considered an authoritative source on many matters.
All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall. In addition, each print edition has a copyright notification, the Times' terms of service limit the copying and use of any published material, and it can be selective about how it licenses its stories. In addition to driving revenue, these restrictions also help it to maintain its reputation as an authoritative voice by controlling how its works appear.
The suit alleges that OpenAI-developed tools undermine all of that. "By providing Times content without The Times’s permission or authorization, Defendants’ tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue," the suit alleges.
Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called "Common Crawl," which the suit alleges contains information from 16 million unique records from sites published by The Times. That places the Times as the third most referenced source, behind Wikipedia and a database of US patents.