The Most Important Copyright Battle of Our Lives

In the last 12 months, what the lay person would call artificial intelligence has taken a major jump forward. OpenAI launched ChatGPT on November 30, 2022, and it probably has the fastest-growing user base of any product in history. I say probably, because OpenAI hasn’t disclosed any usage data. But it went from a thing that didn’t exist to the thing that everyone was using in just a few months.

ChatGPT is more properly categorized as a large language model, and there are many new implementations of these models which can generate images, video, and music. All these models had to be trained on something, and they were all trained on data from the internet, which creates a legal dilemma under current US law.

Just because this blog post is available on the web doesn’t mean that I have relinquished the copyright. Neither the statute nor the case law is clear on whether using a copyrighted work to train a large language model is fair use. In the short few months since these models became wildly popular, this has been mostly a thought exercise. Getty Images brought the first notable cases in this area of the law by suing Stability AI in both the US and the UK. Getty alleges:

Stability AI has copied more than 12 million photographs from Getty Images’ collection, along with the associated captions and metadata, without permission from or compensation to Getty Images, as part of its efforts to build a competing business.

Court cases take a long time. DMCA takedown notices get processed much faster. Nilay Patel explains:

The AI Drake track that mysteriously went viral over the weekend is the start of a problem that will upend Google in one way or another — and it’s really not clear which way it will go.

Here’s the basics: there’s a new track called “Heart on My Sleeve” by a TikTok user called @ghostwriter877 with AI-generated vocals that sound like Drake and The Weeknd.

This track was posted to YouTube, and then Google got a DMCA takedown notice from Universal Music Group:

But then TikTok and YouTube also pulled the track. And YouTube, in particular, pulled it with a statement that it was removed due to a copyright notice from UMG. And this is where it gets fascinatingly weedsy and probably existentially difficult for Google: to issue a copyright takedown to YouTube, you need to have… a copyright on something. Since “Heart on my Sleeve” is an original song, UMG doesn’t own it — it’s not a copy of any song in the label’s catalog.

So what did UMG claim? I have been told that the label considers the Metro Boomin producer tag at the start of the song to be an unauthorized sample, and that the DMCA takedown notice was issued specifically about that sample and that sample alone.

Nilay explains Google’s predicament with “Heart on My Sleeve”, but it’s merely another skirmish in the broader war. We have large language models that can pass the bar exam, but only when trained on a corpus of data with legally disputed provenance. The lawyers are gonna make a lot of money over the next decade as we sort this all out.