Entropy Studies
Thinking Through Temporal Mode Collapse
I've been doing a series of experiments: taking diffusion-based image glitches and applying them to diffusion video models to see what they do. What they should do is animate them. But I quickly found that this is more complicated than it seemed.
What's a diffusion-based image glitch? Quickly: it's the result of a prompt that gets the image model stuck in a process of adding structure to its own noise, producing images that never reference the training data and simply come out of the model as a result of its mechanics.
The image model produces abstractions by trying to create an image that contains both noise and structure, and by removing noise from noise to do so. If that's confusing, that's OK: the whole point is that it's a paradoxical instruction, and the model responds by producing something between its random starting point and what it perceives as plausible structure.
The TLDR: I force the model's assumptions to fall apart, then take the image it produces and use it for other things.
One of those other things is to set them into a video model to see how the model responds to these failure-images. The video model basically takes an image and tries to animate it. It looks like this:
(The cello soundtrack is part of the model: LTX 2.3 generates audio, and after a few experiments I just asked it to produce a cello drone).
I liked these, and I wanted to push them further. The model is optimized for 10-20 seconds of video, and after pushing past the optimization window, diffusion video models... stop. That is, they generate video with plausible motion for several seconds, but pushed beyond that, they slow down. Motion locks up: everything stabilizes and smaller sections of the frame show barely any action at all.
That is somewhat intuitive: you ask a video diffusion model to produce beyond its optimized window, you give it a vague image, and it doesn't have any direction. So it pushes pixels to spread within a constrained boundary. Once it hits those edges, there's not much else it can do. Longer video generation is a real question in the industry at the moment (see people grappling with it here, here, here, and here).
Because the model is optimized for 20-second videos, you'd think a solution would be easy: take the last frame of the video and run it as a new source for a new video. But that seems not to work for these either, and that, frankly, is weird.
In the following video, the first thirty seconds comes from a noise-glitch image. The last frame of that is used to generate the next thirty seconds, and the last frame of that is used to produce the next thirty seconds. You will see jumps and hear shifts in the cello at these points. But we can also observe the shift in what I will call liveliness: the amount of, and sweep of, motion across the frame.
The first thirty seconds are very lively, lots of action. The following two segments less so. Why is this? Shouldn't the model just do whatever it does to the first image over again, resulting in a similar same sweep of motion? In other words, if I give it one colorful smear to animate, why is it lively, whereas the resulting colorful smear is slow?
Temporal Mode Collapse
The technical name for this phenomenon is temporal mode collapse, and some of it is fairly straightforward. It's a bit like the copy-of-a-copy problem, where one generation of video loses detail, and so if you make a copy of that copy you lose even more detail, etc. That explains why videos can become more saturated with each extension: local details are missing, the model overemphasizes broader swatches, and so things get more uniform. This makes some degree of sense in terms of color and structure, but it's weird to think about this in terms of time: copying a videotape doesn't make it move slower.
Temporal mode collapse comes about because the directionality of motion is derived from a reference to a sampled distribution: in other words, there are only so many directions a thing can plausibly move. Models are constrained by flow: it is very difficult for an object to disappear and reappear elsewhere, for example, because the motion has to be tracked across the frame. As objects become more saturated, and motion options get exhausted, they equalize and hold in a low-entropy state: they get stuck.
The only way to shake loose from this equilibrium would be to inject new prompts, which shuffles up the action. I didn't do that. I ran the same prompt on all three sweeps. That exhausted the options of that prompt, for that image. Re-generation uses the last frame as the new seed, and the last frame was already at its final resting point. What is weird is that the model seemed to inherit the exhausted state of the previous image.
In other words, it couldn't activate it because the image itself had already been extended as far as it could go. There is something about the last frame of that process that seems to represent that to the model: there is no future in this image!
It's helpful to think of the nature of a diffusion model. It is designed to turn forms of media into noise (audio, video, images) and then reversing that noise into media that resembles the patterns learned by dissolving them. You have to activate lots of uncertainty to do this, and then you have to resolve that uncertainty to reverse it. The origin point of the process is stability (an image, a song, a video) and so is the end point. The entire process is structured around moving you from uncertainty (noise) to certainty (image).
Once you arrive at that stable point, there is nowhere else to go, unless you bring new forms of uncertainty in (change the prompt, change the image, etc).
An early paper about this problem notes that the goal of the system is to capture the dynamics of motion. This is why these things are touted as world models, or physics engines (though we've heard less about that lately). The bleak underlying assumption of these models as they stand is that without external activations, dynamics slow to a halt, and things stop moving until externally motivated.
It's basically video as spilled water: it flows out of a mug onto the table and spreads, and you can see this in these videos. But without gravity to pull it or pressure to push it, water stabilizes in a puddle. That's what we have here: the diffusion video puddle, settled at its stable state, neither pushed nor pulled.
What's interesting, then, is that there are images that represent this end state, that enter the model and are immediately interpreted as that end state even when presented as source images. It is plausible that aspects of the image that suggest motion paths have all been stripped away, as if the final frame has had all the juice squeezed out.
To be clear, this is an open question, and a very preliminary theory. It's a thread I haven't had time to pursue. But I think it is a dead frame, constrained from any possible future action.
I have been thinking a lot about foreclosure lately, and the idea that generative AI assumes a certain foreclosure of the future. But foreclosure assumes certainty in the sense of there being nowhere else to go: no push, no pull, just an uninterrupted arrival into a steady state. The dead frame isn't foreclosure, it's a result of passive observation. It's dead because we don't give it anything to work with.
A Short Note On My Own Temporal Collapse
I want to acknowledge that it has been a while, and this post is short. I hope you'll bear with me: the steady weekly rhythm of this newsletter is probably not likely to survive the rhythm of the PhD life, but the newsletter isn't going away. I am setting a more realistic goal of once a month, but hopefully as things gear up with external publications, conferences and artwork, there will be more to share.
In the meantime, you are a hero if you're a paid subscriber, but if you'd like to adjust your subscriptions based on the new schedule, I get it. I want to make it easy for you: you should be able to go here to modify your subscription.
Thanks for sticking with me so far!