OpenAI: Give Us Your Content or Die

The Financial Times announced a deal with OpenAI on Monday to license its world-class journalism for training and informing ChatGPT’s models. It joins Axel Springer and the Associated Press, which have struck similar deals in which OpenAI is reportedly offering millions for the right to use content. However, ChatGPT has also been trained on a lot of other content scraped from the web that OpenAI has not paid for. So why does OpenAI pay for some datasets and not others?


OpenAI’s licensing agreements seem to send a clear message: We are going to use your content anyway, so sign a contract with us or be left behind. The main benefit of a licensing agreement seems to be a prominent place in ChatGPT’s answers. Some publishers may also want to cement a relationship with the next major information distribution channel before it takes over. However, it seems that OpenAI uses a lot of content from publishers whether or not they sign.

OpenAI already trains its AI models in part on “publicly available data,” says CTO Mira Murati, a phrase that seems intentionally vague. What is publicly available data anyway? The phrase assumes that anything that can be read for free on the Internet can also be fed into ChatGPT for free. For example, Gizmodo is part of OpenAI’s “publicly available data.” Our website appears 34,000 times in GPT-2’s WebText dataset, the most recent dataset OpenAI has released that was used to train an AI model.

Gizmodo is free for readers primarily because of the advertising on this website. Allowing readers to access our content through ChatGPT instead would undermine that business model. The New York Times, which appears significantly more often in GPT-2’s WebText dataset, sued OpenAI for copyright infringement over this very issue.

A content licensing agreement with OpenAI seems to be the only way for publishers to stay relevant in the age of AI. In a press release, John Ridding, CEO of the Financial Times Group, said the deal will “expand the reach” of the company’s work while providing “early insights into how content is surfaced through AI.”

“The thing about AI is that it’s not really artificial intelligence,” said Matthew Butterick, a lawyer who represents Sarah Silverman and other book authors suing OpenAI, in an interview with Gizmodo. “It is human intelligence harvested from one place and separated from its creators. This big tech company then puts a price tag on it and sells it to someone else.”

Butterick is a plaintiffs’ attorney in six copyright lawsuits against AI companies. He is also an author, programmer and designer, and says he understands how AI can threaten these industries. In general, his cases claim that AI exploits the work of creators while simultaneously endangering their livelihoods.

OpenAI’s licensing agreements have raised questions about the content ChatGPT uses for free. Tech companies have argued that generative AI constitutes a “fair use” of copyrighted works because it transforms them into something new. The AI world has also argued that it follows a model similar to Google Search, which caches copyrighted content to create a useful information search tool. Like Google, AI chatbots have recently started incorporating hyperlinks to their sources. Ultimately, a court must decide whether generative AI constitutes “fair use.”

OpenAI did not immediately respond to Gizmodo’s request for comment.

Book authors and publishers aren’t the only ones from whom OpenAI appears to be taking content. The New York Times recently reported that OpenAI trained GPT-4 on more than a million hours of transcribed YouTube videos. Days before the report’s release, YouTube’s CEO said that using its videos for AI training would be a “clear violation” of its policies.

OpenAI’s content licensing agreements muddy the discussion. The company uses some Internet content for free while simultaneously paying other creators for their work. Other tech companies, such as Apple, have reportedly taken a more proactive approach and pay for all of their training data. Adobe reportedly paid $3 per minute of video to train its AI video generator.

However, it is unclear whether even a one-time payment is adequate compensation for AI training data. We’re talking about a tool that could potentially revolutionize the media industry for writers, audio and video producers, and more. Signing a deal with OpenAI could guarantee you a good spot in ChatGPT’s results, but it seems the AI chatbot would have used your content anyway. At least for now, AI companies are interested in using everything on the internet and asking questions about the legality of it all later.
