The AI Copyright Wars: OpenAI, Google, and the Future of Training Data

If you use tools like ChatGPT, Gemini, or Claude, you are sitting on top of one of the biggest legal questions of this decade: is it legal to train AI models on copyrighted material scraped from the internet?

For a while, AI companies acted as if the answer was obviously “yes”. They quietly gathered text from news sites, books, forums, and code repositories, then used it to train models that could write, draw, and code on demand. Now, authors, news publishers, and regulators are pushing back hard.

Since 2023, a wave of copyright lawsuits has hit OpenAI, Google, Meta, Anthropic, xAI, Adobe, Nvidia, and others. In 2025 and into 2026, we finally started to see real court decisions, including a significant U.S. ruling that training Anthropic’s models on legally obtained works was fair use. The result is a messy, fast-evolving landscape that will shape how future AI systems are built and who gets paid.

You do not need to be a lawyer to care. These cases could decide whether open-source models remain viable, whether smaller startups can afford to compete, and whether your own content can be used to train AI without your permission.

What are these AI copyright lawsuits actually about?

Most of the current lawsuits center on a simple question: when a company copies your work into a training dataset for a large model, is that copyright infringement, or is it a transformative fair use?

To get there, plaintiffs usually make three main claims:

The company copied and stored their works (books, articles, images, audio) without permission.
The model sometimes spits out text or images that are “substantially similar” or nearly verbatim to the originals.
This harms their market, because AI can generate cheap substitutes or summaries that reduce demand for the real thing.

On the other side, AI companies argue that:

Training is a statistical process: the model learns patterns, not a usable copy of any one work.
Using large collections of publicly available content to learn those patterns is highly transformative and therefore fair use.
The economic benefits and innovation outweigh the potential harms, especially when individual works are a tiny part of the training corpus.

We are now seeing those high-level arguments tested against specific facts – and OpenAI and Google (plus Meta) are at the center of it.

OpenAI vs. The New York Times and the news industry

The most famous AI copyright case today is The New York Times vs. OpenAI and Microsoft. The Times sued in December 2023, arguing that models like ChatGPT and Microsoft Copilot could reproduce Times articles or close summaries without permission. The lawsuit claims both training on and regurgitation of Times content violate copyright.

Key developments so far:

In March 2025, a federal judge in New York declined to throw out the case, allowing core copyright claims about training and output to go forward, while narrowing some of the broader allegations. The court rejected OpenAI’s argument that the Times waited too long to sue.
OpenAI maintains that using publicly available data, including news articles, for training is protected by fair use. Its public explainer on the case lays out that theory, emphasizing that models “learn deep mathematical patterns” and are not intended to reproduce copyrighted content verbatim.
Discovery in related media lawsuits has been messy. In other publisher cases, judges have ordered OpenAI to hand over millions of chat logs to test whether its “we don’t regurgitate” claims really hold up in practice, highlighting just how central memorization and exact copying are to these disputes.
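To make “regurgitation” concrete, here is a minimal sketch of one way to measure verbatim overlap between a model output and a source text, using shared word n-grams. It is an illustration only, not the method any court or party has used; the function names and the choice of n are assumptions made for this example.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the source.

    Hypothetical helper: 0.0 means no shared n-word runs, 1.0 means every
    n-word run in the output also occurs in the source.
    """
    out_grams = word_ngrams(output, n)
    if not out_grams:
        # Output is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(out_grams & word_ngrams(source, n)) / len(out_grams)


source = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
copied = verbatim_overlap(source, source, n=4)
unrelated = verbatim_overlap(
    "completely unrelated words appear in this short sentence here", source, n=4
)
print(copied, unrelated)
```

A larger n only catches long copied runs, while a smaller n also flags common phrases, so any threshold you pick is a judgment call rather than a legal standard.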

Meanwhile, The New York Times has also sued Perplexity, an AI search startup, alleging that its answer engine regurgitates Times content and undercuts traffic by showing AI-generated summaries above links. That second lawsuit suggests news organizations are pursuing a broader strategy to protect their content.

Whatever happens in the Times case will send a strong signal: if training on news without a license is ruled infringing, expect a rapid shift from “scrape now, ask later” to structured licensing deals.

Google, Gemini, and AI Overviews under fire

Google is facing a different flavor of copyright and competition pushback.

First, there are lingering concerns about datasets like Books3 and alleged scraping of “shadow libraries” such as LibGen and Z-Library. A 2024 lawsuit accuses Google, along with xAI and OpenAI, of training on pirated books from these sources without paying authors, and the complaint treats the use of pirated repositories as a core issue.

Second, Google’s AI Overviews – the AI summary boxes now appearing at the top of many searches – are triggering lawsuits from publishers who say Google is copying their text and cannibalizing their traffic. In 2025, Penske Media, the parent of outlets like Rolling Stone, sued Google, arguing that AI Overviews “regurgitate” their content and appear above links, reducing incentives for users to click through. The case claims that AI Overviews are both copyright infringement and an abuse of search dominance.

For you as a user, this is about more than legal theory. If publishers succeed:

Google might have to dramatically reduce how much of an article AI Overviews can quote or closely paraphrase.
Gemini’s training data could lean more heavily on licensed and synthetic content, potentially changing its behavior and costs.
Smaller sites might gain leverage to demand compensation when their content is used to train foundation models or power search features.

Meta, Anthropic, and the first big fair use wins

While OpenAI and Google fight in the headlines, some of the most important legal signals have come from cases against Meta and Anthropic.

In 2025, a U.S. federal court ruled that Anthropic’s use of lawfully obtained copies of books to train its Claude models was fair use under copyright law. The court granted summary judgment for Anthropic on that point, while treating reliance on pirated copies very differently. The judge emphasized:

Training is highly transformative because the model learns patterns and correlations, not expressive content you can read like a book.
When training data is acquired legally, the balance of the four fair use factors can favor the AI developer.
However, using pirated copies (for example, via unauthorized online libraries) may not enjoy the same protection.

That last point matters because many complaints argue that AI labs pulled from illegal repositories. Courts may be drawing a line: fair use can protect “how” you use data, but not “where” you got it.

On the Meta side, a new class action filed in May 2026 by five major publishers and author Scott Turow accuses Meta and Mark Zuckerberg personally of authorizing training Llama models on millions of pirated books and articles. The complaint claims Meta reproduced and distributed works without permission or compensation. That case is early, but it will test how far courts are willing to extend the Anthropic logic when the alleged source is explicitly illegal.

Artists, images, and the Stability AI lawsuits

Text is only half the story. Visual artists were some of the first to sue generative AI companies over training data.

In Andersen v. Stability AI, a group of artists sued Stability AI, Midjourney, and DeviantArt, alleging that Stable Diffusion and related tools copied billions of copyrighted images into a training dataset without permission. A California federal court dismissed many of their theories but allowed a core direct copyright infringement claim against Stability AI to proceed, focusing on the mass copying of images for training. The court signaled that, if plaintiffs can prove copying of protected images into the training process, liability is at least plausible.

Since then:

Discovery battles in Andersen have revolved around forcing companies to reveal exactly what datasets they used, and whether those included stock libraries, scraped websites, or licensed sets.
A similar suit targets Adobe: a 2025 class action claims the company trained its models on pirated books without permission, despite marketing Firefly as “commercially safe”.

For creative professionals, the message is mixed: courts are not treating “AI magic” as a legal black box anymore, but they are also not automatically siding with artists. How much copying is acceptable in the training pipeline remains an open question.

Where this is heading: licensing markets and data hygiene

If you zoom out, a pattern is emerging:

Courts are more sympathetic to training on lawfully acquired data, especially when the use is clearly transformative and outputs do not regurgitate works.
Courts are much less sympathetic when plaintiffs can show use of clearly pirated sources or systematic verbatim reproduction.
Companies are responding by cleaning up their pipelines and signing content deals.

By 2024–2025, you could already see the market reacting:

OpenAI, Google, and others began signing licensing agreements with major publishers and stock libraries.
Specialized “data licensing” startups appeared, offering structured deals between rights holders and AI labs. Recent trackers point to a growing ecosystem of training-data licensors and direct publisher–AI lab agreements.
Some labs now emphasize “opt-out” or “no training” flags for user content, though how faithfully these are implemented is not always clear.

Long term, you should expect:

More data provenance controls: clear logs of where training data came from, and the ability to filter out or pay for specific sources.
A split between “clean” models trained on fully licensed or public domain data (likely more expensive, used for enterprise) and “open” models with less perfect lineage (used for research or lower-risk tasks).
Stronger product features in ChatGPT, Gemini, Claude, and others that avoid reproducing long passages from specific works, reducing the risk of “smoking gun” examples for plaintiffs.
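As a rough picture of what a provenance control might look like, the sketch below keeps one record per training source and splits the log into sources cleared for a “clean” run and sources flagged for review. The record fields, status labels, and sample entries are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry in a hypothetical training-data provenance log."""
    source_id: str
    origin: str          # where the data was obtained
    license_status: str  # "licensed", "public_domain", or "unknown"


# Statuses this sketch treats as acceptable for a "clean" training run.
ALLOWED_STATUSES = {"licensed", "public_domain"}


def split_by_provenance(records):
    """Split records into (clean, flagged) lists by license status."""
    clean = [r for r in records if r.license_status in ALLOWED_STATUSES]
    flagged = [r for r in records if r.license_status not in ALLOWED_STATUSES]
    return clean, flagged


# Made-up sample entries, not real datasets.
log = [
    SourceRecord("news-001", "publisher licensing deal", "licensed"),
    SourceRecord("book-042", "public-domain archive", "public_domain"),
    SourceRecord("book-777", "unverified web mirror", "unknown"),
]
clean, flagged = split_by_provenance(log)
print([r.source_id for r in clean])
print([r.source_id for r in flagged])
```

Anything with an unknown origin lands in the flagged list by default, which mirrors the line courts appear to be drawing between lawfully acquired and pirated sources.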

What it means for you – builder, creator, or just user

So where does all this leave you?

If you are building on top of models like ChatGPT, Gemini, or Claude, you probably will not be sued personally over their training data – that risk sits with the model providers. But you are not completely off the hook:

If you fine-tune a model on proprietary client content, you need to be clear about who owns that data and how it can be used.
If you ship products that heavily quote or remix specific works, you could face your own copyright questions, separate from the base model.

If you are a publisher, author, artist, or developer whose work is online, these cases will determine:

Whether “being on the public web” automatically means “available for AI training”.
Whether you can realistically charge for including your content in training corpora.
How easy or hard it is to enforce “do not train on my data” in practice.

And if you are “just a user”, the outcomes will affect:

The cost and availability of powerful general-purpose models.
How much your AI assistant can access in real time from the open web versus gated, licensed sources.
Whether we end up in a world of a few heavily licensed, very powerful systems, or a broader ecosystem of open, messy, but cheaper models.

How to navigate the AI copyright storm today

The law around AI training data is still unsettled, but it is no longer a total free-for-all. Courts have begun to bless certain kinds of training as fair use, especially on lawfully obtained data, while signaling that pirated sources and regurgitation are red lines.

In the next few years, you can expect:

More big-ticket settlements (think OpenAI, Google, Meta quietly paying major publishers).
New regulations, especially in the EU, requiring disclosure of training sources.
A growing compliance industry around “data-clean” models and auditable training pipelines.

If you want to stay on the right side of this as AI becomes part of your work, here are concrete next steps:

Audit your own usage. If you are feeding client or proprietary data into tools like ChatGPT, Gemini, Claude, or custom fine-tuned models, read the providers’ data use policies and consider whether you need enterprise or “no training” options for sensitive content.
Treat training data like a supply chain. If you build or buy models, start asking where the data came from, whether it was licensed, and what indemnities (if any) your vendor offers. This will matter in contracts sooner than you think.
If you create content, set your preferences and watch the courts. Use available opt-out mechanisms where they exist, pay attention to major rulings (especially in the OpenAI/Times and Meta book cases), and be ready to join or negotiate licensing programs as they mature.
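One opt-out mechanism available to site owners today is robots.txt. The sketch below uses Python’s standard `urllib.robotparser` to check which crawlers a sample robots.txt blocks. GPTBot and Google-Extended are user-agent tokens published by OpenAI and Google; the sample file, URL, and helper function are assumptions for illustration. Keep in mind that robots.txt is a voluntary convention, so it signals your preference rather than enforcing it.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt that blocks two AI-related crawlers and allows the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt: str, agents, url: str) -> dict:
    """Return {agent: allowed} for each user agent (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT,
    ["GPTBot", "Google-Extended", "Googlebot"],
    "https://example.com/articles/1",
)
print(access)
```

In this sample, regular search crawling stays allowed while the two AI-related tokens are disallowed, which is the trade-off many publishers are aiming for.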

The AI copyright lawsuits are not just about the past training of GPT-4 or Gemini. They are about the rules that will govern every frontier model that comes next.
