Thoughts About a Mouse
The Limits of Copyright in Shaping AI
Hi all! Before we get to this week’s post, there’s a new video art piece linked at the end, or you can jump straight to reading about it here. More on that next week!
Walt Disney’s 1928 film, “Steamboat Willie,” doesn’t really touch on self-regulating systems, or neural nets, or generative AI. But that film, Disney’s first to include Mickey Mouse, entered the public domain in 2024. That means copyright has lapsed on this particular style of Mickey just as our understanding of copyright is being subjected to massive challenges and potential upheavals.
Mickey’s entry into the public domain led to the usual responses, as Kate Knibbs noted in Wired:
“Within days, an explosion of homebrewed Steamboat Willie art hit the internet, including a horror movie trailer, a meme coin—and, of course, a glut of AI-generated Willies.”
One of these included an image dataset: Pierre-Carl Langlais uploaded 96 stills from the film to Hugging Face as the Mickey-1928 dataset, which was then used to fine-tune Stable Diffusion. If that’s gibberish to you, imagine that Stable Diffusion’s training set of billions of images is Play-Doh coming out of a can: that would be normal image generation. Now imagine putting a little plastic mold over the opening, shaped like a star, and squeezing that Play-Doh through it. It would come out shaped like a star.
Basically, fine-tuning a model on a narrower dataset creates a narrower set of constraints on what it will produce. In this case, the “mold” is 96 images of Mickey Mouse from 1928. This dataset caught some attention, mostly because it’s in dialogue with copyright rules around generative AI.
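For the technically inclined, here is roughly what squeezing the Play-Doh through the mold looks like. This is a minimal sketch, not Langlais’s actual pipeline: the Hugging Face repo IDs and the dataset’s “image” column are assumptions on my part, and I’ve stripped out the batching, gradient accumulation, and checkpointing a real training run would need.

```python
# A rough sketch of fine-tuning Stable Diffusion on the 96 stills.
# NOTE: repo IDs and the "image" column name are assumptions, not verified.
import torch
from datasets import load_dataset
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from torchvision import transforms
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"                      # assumed base model
dataset = load_dataset("Pclanglais/Mickey-1928", split="train")  # assumed repo ID

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Only the UNet is updated; the text encoder and VAE stay frozen.
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

for epoch in range(10):
    for example in dataset:
        pixels = preprocess(example["image"].convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            # Compress the still into the model's latent space.
            latents = vae.encode(pixels).latent_dist.sample() * 0.18215
            # Encode a fixed caption to condition the denoiser on.
            ids = tokenizer("a 1928 cartoon mouse", padding="max_length",
                            max_length=tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").input_ids
            caption = text_encoder(ids)[0]
        # Make a noisy copy of the latents at a random timestep.
        noise = torch.randn_like(latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps, (1,))
        noisy_latents = scheduler.add_noise(latents, noise, t)
        # Train the UNet to predict (and so be able to remove) that noise.
        pred = unet(noisy_latents, t, encoder_hidden_states=caption).sample
        loss = torch.nn.functional.mse_loss(pred, noise)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The thing to notice is the loop: the base model’s weights get nudged, still by still, toward those 96 frames. That loop is the whole of the “mold.”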
The Stable Diffusion training data, an English-language subset of the LAION-5B dataset, contains 2.3 billion images. Big Tech argues that copyright shouldn’t apply to the images in that training data, because “copies” weren’t made, though this depends very much on your definition of a “copy.” Factually, the datasets were made without artists’ consent or permission, and these models can then produce images that replicate particular artists’ styles.
Mickey Mouse is no stranger to copyright controversy, and this dataset is in dialogue with a long history of legal arguments by the Disney corporation that have weakened the Commons and lengthened copyright protections on works like Mickey.
Under the Copyright Act of 1976, copyright lasted 50 years after the death of the author; for corporate works, it lasted 75 years after publication or 100 years after creation, whichever was shorter. In 1998, the Sonny Bono Copyright Term Extension Act was passed, which extended copyright protections for their holders by an additional 20 years. (This was far short of Bono’s original intent to have copyright protection last “forever minus one day,” a bit of legal wrangling intended to circumvent the Constitution.)
One of the main lobbyists in favor of the 1998 bill was Disney. Under the law as it stood before 1976, Disney’s mouse would have entered the public domain in 1984. After the 1976 act, Steamboat Willie was set to enter the public domain in 2003. In 1998, with a mere five years left to keep the mouse under its paws, Disney actively lobbied for the Bono act, and when it passed, Disney’s copyright was extended until January 1, 2024. So active were Disney’s lawyers that the bill was nicknamed “The Mickey Mouse Protection Act.”
There were other reasons to extend copyright — one of which was to match European copyright protection terms. Without this law, European and US copyrights would have been unaligned, creating all kinds of weird loopholes.
Nonetheless, these extensions, especially as they coincided with an era of cheap digital copies and online distribution, drove a backlash to copyright that has muddied the waters of protection in the AI era. After the RIAA’s campaign in the 2000s to bankrupt college students who downloaded MP3s, piracy became punk: a way of sticking it to the money-hungry industries that had access to lawyers and lobbyists when their own interests were at stake, yet seemed to offer little reward to the artists they represented.
Today, that attitude toward copyright faces a complex reversal. A part of me knows there is nothing less cool than standing up for corporate copyright lawyers. But I blame the way this power is being wielded: Big Tech is now in the position of exploiting independent, small artists in order to make profits they don’t share. The result of this will be a contraction of freely given and shared imagery, as humans start to worry about how their art and ideas are going to be commercialized against their will.
That, to be juvenile about it, is pretty uncool. Generative AI companies are the ones asking to weaken copyright protections, so that they can make use of more imagery from the Web without the permission of individual artists. That is: it’s still a bunch of rich people making laws that serve their interests.
If piracy and copyright violation were once punk rock, for me they’ve lost their edge. Today, like a lot of 1990s counterculture aimed at challenging corporate power, that position helps cultivate the very power it meant to resist. I’ll be honest: the chief principle I hold is curbing corporate concentrations of power, especially when that power comes at the expense of human beings. Analyzing this stuff through that lens means the pendulum can swing from one side to the other, and it’s meant acknowledging a lot of contradictions in my own thinking about these issues.
The legal issues at stake don’t really change. But it’s symbolically interesting that Mickey Mouse entering the public domain means we can now build an AI to replicate him, the same way that generative AI has replicated a vast variety of artists’ works. Of course, that was happening before Mickey entered the public domain anyway.
The phenomenon of generative AI reproducing copyrighted material is well known, but this weekend, Gary Marcus and Reid Southen published a report documenting numerous examples of direct infringement by generative AI, ranging from Star Wars to Super Mario Brothers, shown alongside the originals from which they were derived.
On the text front, The New York Times is suing OpenAI for duplicating its content in text output. It’s clear that generative AI as an industry has never really cared about copyright. The industry’s argument is that copyright doesn’t apply to it at all because, by its own estimation, it doesn’t copy things. There is a common analogy it uses: the machine looks at things, just as people do, and there is no law against looking at things. Once trained, there are no images in the models, just mathematical patterns of pixel clusters, tied to categories, that get applied to random noise.
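For what it’s worth, that technical claim is easy to see in code. Here’s a minimal sketch of generation using the diffusers library; the model ID is an assumption, but the relevant point is that sampling begins from random noise rather than a stored picture:

```python
# A minimal sketch of "patterns applied to random noise" at generation time.
# The model ID is an assumption; any Stable Diffusion checkpoint behaves alike.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Sampling starts from pure Gaussian noise. The prompt steers which learned
# statistical patterns are used to denoise it; no stored image is retrieved.
image = pipe("a 1928-style cartoon mouse piloting a steamboat",
             num_inference_steps=30).images[0]
image.save("steamboat_mouse.png")
```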
As The Times’s lawyers put it in their lawsuit (and this is about LLMs, not image models):
“Publicly, Defendants insist that their conduct is protected as “fair use” because their unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose. But there is nothing “transformative” about using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it. Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.”
A similar argument could be made about image-making tools, though the competition aspect of the output is less clear. It’s true that the model doesn’t contain images. But to train one of these models, you have to make a copy of an image, and that copy is copied (with noise introduced) and then copied again (with more noise introduced). Without this copying and introduction of noise, you don’t get the information you need out of the image. Without that information, you can’t generate images. This “information” is a constantly modified copy, and in certain circumstances, generating a new picture can reproduce an image from the training data.
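Here’s a toy illustration of that chain of noisier and noisier copies. It’s my simplification, not how production models are implemented (they use carefully tuned noise schedules and operate on compressed latents rather than raw pixels):

```python
# A toy illustration: every training step begins by copying the image
# and blending that copy with noise. Simplified for demonstration.
import torch

def noisy_copy(image: torch.Tensor, t: int, num_steps: int = 1000) -> torch.Tensor:
    """Return a copy of the image blended with Gaussian noise (bigger t = noisier)."""
    alpha = 1.0 - t / num_steps          # simple linear schedule, for illustration
    noise = torch.randn_like(image)      # fresh random noise each call
    return alpha ** 0.5 * image + (1 - alpha) ** 0.5 * noise

image = torch.rand(3, 64, 64)            # stand-in for one training image
barely_noisy = noisy_copy(image, t=50)   # still recognizably the image
mostly_noise = noisy_copy(image, t=950)  # close to pure static
# Training teaches the model to reverse this corruption, so the "information"
# it extracts comes entirely from these progressively noisier copies.
```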
There’s no real denying that copies are made in training. Traditionally, the distinction sought by AI companies was that the retention of those images was highly constrained, and that their influence over the images these models put out was not directly infringing, even if it was highly derivative.
Defenders of AI argue that these images are like an Internet cache, held in short-term memory just long enough to analyze and then discarded. That’s a point I’ll leave to the lawyers. But there’s certainly one key distinction: nobody treats Google’s website summaries as a news source. People do go to GPT-4 for news, and Microsoft is marketing it (and using it) for that purpose. And image generation tools market themselves as replacements for the very illustrators whose data they were trained on.
But probably, you know all that.
I don’t really care about Nintendo’s or George Lucas’s money. But I do care about the rest of us, who make and share things, sometimes for free, as a form of personal expression. I care about childhood photos absorbed by a machine and how they might appear in unsavory content generated by AI. I care about images of trauma in the training data. I care about responsibility.
Copyright only gets us part of the way to sorting out the relationship between our digital lives and the massive data analytics operation that sorts us into audiences for specific content, opportunities, and liberties.
I want to live online without being studied and analyzed like a lab animal by automated sentiment analysis. I think it’s weird that we are expected to live in a world that has been organized in such a way that any form of expression will be absorbed into a money-making machine. I hate that it makes us subordinate to extractive datafication regimes, constantly lapping up our digital residue like hungry mites.
I know I’m supposed to have an opinion or insight into how this is all going to go down, but I don’t. I suspect that copyright is an imprecise mechanism for deciding how we build digital infrastructure, especially generative media systems. To be clear, I understand and support the copyright concerns of independent artists, who have little recourse for protecting their creative work. But in isolation, copyright is only one prong.
If AI companies continue to limit access to their datasets, we have no mechanism for enforcing these arguments aside from what the models produce. Liability would then shift to the ways the products are used: if I generate SpongeBob and circulate it, I would be liable for “publishing” it, a weird variation on the legal definition of “publishing.” Some companies put the onus on the user, through terms of service agreements that say “don’t make SpongeBob.” But it’s clear that we can accidentally make SpongeBob.
Of course, we also sell cars that can go 120 mph even though it’s not legal to drive that fast anywhere in the United States. There’s no “responsible speeding.” Putting the onus on consumers rather than corporations is what America does (and has tricked us into calling that “freedom”). To that extent, placing copyright concerns on users of the tech is still a dangerous foundation on which to build future arguments. Copyright is not, in and of itself, a way to regulate the AI industry.
Likewise, if we allow copyright to exert this outsized weight on the shape of that industry, we need to be mindful that copyright enforcement and intellectual property systems can be gamed and weaponized. We need to think through the principles we build upon from a vision of the future that centers justice for artists, not only the powerhouse agencies that can afford the lawyers.
We need proactive data rights: controls and limits over our electronic communication, at the consumer level, that go beyond “don’t use the service if you don’t like it.” Today, when we sign up for social media platforms, we’re asked to give away the rights to what we share to these companies, which can then use that data for AI training or sell it for the same purpose. The only alternative is not to use them.
Which is a problem, because we’ve placed too many community resources in the hands of big tech companies. Look at all the local businesses that require Facebook accounts to view their menus, or the local emergency services that relied on Twitter to disseminate information. All of this stems from a lack of concern in policy, and probably in American culture as a whole, about our rights to privacy with regard to online information.
I have a dream that we start funding libraries, or what remains of public access TV, to host decentralized digital infrastructure: things like web hosting and social media accounts that local businesses, school boards, non-profits, or citizens could use to build out digital homes. There would be no need for the fire department to use X to warn people about a disaster if a trained community media organizer could set them up with a widely accessible Mastodon account.
It’s the 21st century, and we can build technology that works for communities. I don’t know how we get there from here. But copyright is not going to solve everything, and we need to craft a defined, proactive vision of what meaningful digital infrastructure should look like.
New Piece: SWIM
I’ll post more about it next week, but I wanted to share a new piece called “SWIM.” It’s a video art piece, not a narrative film, about the datafication of memory. I’m thinking about the relationship between archives and datasets, and the distinctions between them. You can read more, and watch the piece, on my website.