
Many companies won’t say if they’ll comply with California’s AI training transparency law

On Sunday, California Governor Gavin Newsom signed a bill, AB-2013, requiring companies developing generative AI systems to publish a high-level summary of the data that they used to train their systems. Among other points, the summaries must cover who owns the data and how it was procured or licensed, as well as whether it includes any copyrighted or personal info.

Few AI companies are willing to say whether they’ll comply.

TechCrunch reached out to major players in the AI space, including OpenAI, Anthropic, Microsoft, Google, Amazon, Meta, and startups Stability AI, Midjourney, Udio, Suno, Runway and Luma Labs. Fewer than half responded, and one vendor — Microsoft — explicitly declined to comment.

Only Stability, Runway and OpenAI told TechCrunch that they’d comply with AB-2013.

“OpenAI complies with the law in jurisdictions we operate in, including this one,” an OpenAI spokesperson said. A spokesperson for Stability said the company is “supportive of thoughtful regulation that protects the public while at the same time doesn’t stifle innovation.”

To be fair, AB-2013’s disclosure requirements don’t take effect immediately. While they apply to systems released in or after January 2022, including ChatGPT and Stable Diffusion, companies have until January 2026 to begin publishing training data summaries. The law also applies only to systems made available to Californians, leaving some wiggle room.

But there may be another reason for vendors’ silence on the matter, and it has to do with the way most generative AI systems are trained.

Training data frequently comes from the web. Vendors scrape vast quantities of images, songs, videos and other media from websites and train their systems on that material.

Years ago, it was standard practice for AI developers to list the sources of their training data, typically in a technical paper accompanying a model’s release. Google, for example, once revealed that it had trained an early version of Imagen, its family of image generation models, on the public LAION data set. Many older papers mention The Pile, an open-source collection of training text that includes academic studies and codebases.

In today’s cut-throat market, the makeup of training data sets is considered a competitive advantage, and companies cite this as one of the main reasons for their nondisclosure. But training data details can also paint a legal target on developers’ backs. LAION links to copyrighted and privacy-violating images, while The Pile contains Books3, a library of pirated works by Stephen King and other authors.

There are already a number of lawsuits over training data misuse, and more are being filed each month.

Authors and publishers claim that OpenAI, Anthropic and Meta used copyrighted books — some from Books3 — for training. Music labels have taken Udio and Suno to court for allegedly training on songs without compensating musicians. And artists have filed class-action lawsuits against Stability and Midjourney for what they say are data scraping practices amounting to theft.

It’s not tough to see how AB-2013 could be problematic for vendors trying to keep courtroom battles at bay. The law mandates that a range of potentially incriminating details about training data sets be made public, including a notice indicating when the sets were first used and whether data collection is ongoing.

AB-2013 is quite broad in scope. Any entity that “substantially modifies” an AI system (i.e., fine-tunes or retrains it) is also compelled to publish info on the training data it used to do so. The law has a few carve-outs, but they mostly apply to AI systems used in cybersecurity and defense, such as those used for “the operation of aircraft in the national airspace.”

Of course, many vendors believe the doctrine known as fair use provides legal cover, and they’re asserting this in court and in public statements. Some, such as Meta and Google, have changed their platforms’ settings and terms of service to allow them to tap more user data for training.

Spurred by competitive pressures and betting that fair use defenses will win out in the end, some companies have liberally trained on IP-protected data. Reporting by Reuters revealed that Meta at one point used copyrighted books for AI training despite its own lawyers’ warnings. There’s evidence that Runway sourced Netflix and Disney movies to train its video-generating systems. And OpenAI reportedly transcribed YouTube videos without creators’ knowledge to develop models, including GPT-4.

As we’ve written before, there’s an outcome in which generative AI vendors get off scot-free, training data disclosures or not. The courts may end up siding with fair use proponents and decide that generative AI is sufficiently transformative, and not the plagiarism engine The New York Times and other plaintiffs allege it is.

In a more dramatic scenario, AB-2013 could lead to vendors withholding certain models in California, or releasing versions of models for Californians trained only on fair use and licensed data sets. Some vendors may decide that the safest course of action with AB-2013 is the one that avoids compromising — and lawsuit-spawning — disclosures.

Assuming the law isn’t challenged or stayed, we’ll have a clear picture of vendors’ plans by AB-2013’s January 2026 deadline, just over a year from now.
