Reading "The Market in the Model"

Image: abstract blue and green swirls with a dappled texture, produced by asking for "Gaussian Noise" in Stable Diffusion 1.5.
Nerd Alert: I have a new pre-print up, called "The Market in the Model." It was written as a preliminary outline of my PhD dissertation, which means it's dense. This post presents a stripped-down version of it.

I have a new paper! Here's the citation and link, followed by a very brief overview of what it has to say.

Salvaggio, Eryk. 2026. “The Market in the Model: Latent Diffusion as Neural Economy.” arXiv [Cs.CY]. http://arxiv.org/abs/2606.19151.


The Model in the Middle

If you spend time thinking about "AI images" you might notice something of a split. On one side, we think about the images themselves: slop, deep fakes, biased representations. On the other side, we think about the datasets that train these models – the biases and ideologies they contain – and then see how that data manifests in the images.

There's nothing wrong with this process, but there's also something missing. If image data is what the system eats, and AI images are, well, what comes out, then we're missing the entire digestive system: the automated processes that transform images into a model, and then the systems that go into action to produce images from it.

This metaphor of consumption is intentional. In the paper, I lay the foundation for thinking about the entire operation of training an "AI" system as a process of transferring the social world into a commodity form. The "model" doesn't do this; companies do. The model is the tool they use to make it happen. Training a model basically means abstracting information from public, social communication into mathematical points, called vectors. These vectors are defined by their position on a map, and each position is meaningful only relative to the other positions on that map.

In their book "Vector Media," Impett and Offert talk about the idea of commensurability. To put their argument on a bumper sticker: the goal of a model like CLIP is to figure out where everything is, in relation to everything else, so that it can translate images into text and text into images and images into sound and so forth. Essentially: for AI to work, everything has to be abstracted into a point where everything is mapped to everything else, rather than assigned a price.
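
To make "position relative to other positions" concrete, here is a minimal sketch of a shared vector space. Everything in it is an assumption for illustration: the three-number vectors and the labels are invented, and real systems like CLIP learn vectors with hundreds of dimensions. It only shows the basic move of measuring how close two points are, whatever medium they came from.

```python
import math

def cosine_similarity(a, b):
    """Closeness of two vectors by angle: near 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written embeddings. Text and images sit in the same space,
# which is what makes them "commensurable."
embeddings = {
    "text:dog": [0.9, 0.1, 0.0],
    "image:dog_photo": [0.8, 0.2, 0.1],
    "text:invoice": [0.0, 0.2, 0.9],
}

def nearest(query, space):
    """Return the label of the closest other point to `query` in `space`."""
    others = (label for label in space if label != query)
    return max(others, key=lambda label: cosine_similarity(space[query], space[label]))

print(nearest("text:dog", embeddings))
```

Nothing in the toy space records why the dog photo was posted or who it was for. The only thing that survives is where the point sits in relation to the others.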

Exchange Value

Here's a concrete example: You post a picture of your dog to a website because you want people to understand how cute your dog is. That's social exchange, arguably the entire reason we have language and images in the first place: to communicate with each other.

This image of a dog is then absorbed into a computer system. In diffusion training, the image is literally broken apart – abstracted into noise, or "diffused" – so that the patterns that emerge as it's being stripped down can be translated into those vector positions. In that gesture, the machine is stripping down the social function of your dog's photograph into a neural exchange value.
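
For readers who want to see the "breaking apart" step, here is a rough sketch of forward noising as it is commonly written for diffusion models: at step t, the image is scaled down and Gaussian noise is mixed in. The linear schedule, its default values, and the function names are assumptions chosen for illustration, and a short list of numbers stands in for an image (or for its compressed form).

```python
import math
import random

def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule: a little noise early, more later."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta): how much of the original survives at each step."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def noise_image(pixels, t, abar, rng):
    """Mix the original values with Gaussian noise at step t.

    Returns the noisy values and the noise that was added; the added noise
    is what a denoising network is trained to predict.
    """
    signal_scale = math.sqrt(abar[t])
    noise_scale = math.sqrt(1.0 - abar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * e for p, e in zip(pixels, eps)]
    return noisy, eps

abar = alpha_bars(linear_betas(1000))
dog_photo = [0.5, -0.5, 1.0]  # hypothetical stand-in for pixel values
slightly_noisy, _ = noise_image(dog_photo, 10, abar, random.Random(0))
mostly_noise, _ = noise_image(dog_photo, 999, abar, random.Random(0))
```

By the last step almost none of the original signal remains. What training keeps is the relationship between each noisy version and the noise that produced it.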

Finally, there's the commodity value of the image: how much is a photograph of your dog worth? What's the value of the labor of taking that photo and uploading it to a website to show it to people? What is that, relative to its role in producing the abstract concept of a "dog" in the latent space of a diffusion model, accessed for $20 a month? What's the value of an image generated by that model, in circulation against the picture of your very cute dog?

A lot of the critique of the AI economy, for very good reasons, focuses on these commodity value questions. But the neural exchange is distinct from financial value. It's a unique layer that moves social exchange and commodity exchange into a single space. When we respond by leaning on questions of AI's financial impact, we may lose sight of the fact that much of what the models scrape, train on, and transform was initially made by people who meant to do nothing more than talk to each other in digital spaces.

Follow the Noise

The latent diffusion model can be understood as a system for moving the residue of our social exchange (images, text, etc) into the latent space (the "model") and then out of it again (the generated image). But how?

I look closely at the mechanisms involved. The latent diffusion model is not "a" model, but at least three, designed to automate decisions about sight and seeing. Rather than closing these components up in a "black box" metaphor, we can look at them – the creators of latent diffusion literally wrote a paper about it – and study where they came from.

In doing so, we can ask:

  1. Where did these component pieces come from?
  2. What questions or problems were they designed to address?
  3. How does their way of answering those problems influence the way they solve the problem of producing an image?

This allows us to do a kind of ideological audit of these models, which is precisely what I do, in more detail, in the paper. Below is a very compressed, newsletter-sized speed run.

A note on scope: I look at the latent diffusion model, which is the model that shaped Stable Diffusion, even though it is dated. I do this, in fact, because it is dated. The goal is not to explain how the technology works, or to say how well it works.

My point is to ask, instead: what were the foundational logics of the system that allowed it to produce these images at all? But it turns out that even newer, state-of-the-art models follow a similar logic, even if the precise components that carry out that logic are different. (And to be frank, many are still locked in.)

Training

  • Training data is heavily filtered, not only at the point at which it becomes a dataset, but at the point at which it comes to be understood as data. Training data for latent diffusion depended on text and images that were heavily associated among popular websites – popular, itself, being an ideologically defined logic inscribed into an algorithm as "the sites that had the most inbound links."
  • The autoencoder then compresses these images into a smaller latent form. Diffusion training introduces random noise into that compressed image, and the model learns the path back to the corrected, original image: each "fix" will later be used in generation. The tool used to assess the compression is a perceptual loss function, originally designed to ensure that streaming media and compressed images didn't look compressed to the human eye. This prioritizes the details in an image that are most likely to be observed, according to some computational model of where a human eye would look and what it would register first. What that computer eye assumes we do not see is discarded more readily. The limits of this model of our perception set the limits of how realistic an image can be: the generated image can only ever be as good to our eyes as it is to our eyes' computational proxy.
  • The U-Net is tied to the autoencoder. It's part of the training and generation process. It was originally meant to solve a problem in medical imaging – how do you detect an anomaly within a human body? The system turned its gaze to cells, and was adopted for tumor segmentation. Initially, it solved the problem of the singular body: I have one body, but you would have needed lots and lots of data about it before you could find something amiss in some part of it, at an unknown location and scale. The U-Net solved this problem: you could take a few images, and you could move in and out of that image. Move out, you get the structure of the body; move in, you get the local detail. The latent diffusion model shifted this gaze to the vast sum of e-commerce images from its training data: moving from a body to a body of images, from the corporeal to the corpus. It's the moment the body is most viscerally replaced, and we can ask what happens when the model shifts its gaze from a singular focus on one body in a waiting room toward a collective body of highly linked e-commerce images.
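
The "move out, move in" idea can be sketched in a few lines. This is not a U-Net: there are no learned filters here, and the averaging, the function names, and the one-dimensional "image" are all assumptions for illustration. It only shows the shape of the gesture, where a coarse view carries structure and a separate channel carries the local detail that the coarse view lost.

```python
def downsample(signal):
    """Move out: average neighboring pairs to keep only coarse structure.

    Assumes an even number of values.
    """
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]

def upsample(signal):
    """Move back in: repeat each coarse value to restore the original length."""
    return [value for value in signal for _ in range(2)]

def coarse_and_detail(signal):
    """Split a signal into its coarse structure and the local detail left over.

    The detail plays the role that skip connections play in a U-Net:
    it carries fine information around the compressed middle.
    """
    coarse = upsample(downsample(signal))
    detail = [s - c for s, c in zip(signal, coarse)]
    return coarse, detail

scan = [1.0, 3.0, 5.0, 7.0]  # hypothetical one-dimensional "image"
coarse, detail = coarse_and_detail(scan)
restored = [c + d for c, d in zip(coarse, detail)]
```

Adding the detail back onto the coarse view recovers the original values, which is why neither scale is enough on its own.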

Generation

  • To prompt is to pre-translate our desire to see into machine-legible language. This introduces an interesting dynamic: as Amoore has written about at length, to prompt means to learn the language of the machine. Language also misses: the image that we conjure is not quite right, and this is part of the appeal for the person who uses the model. The machine is in some way an engine of failed address, always giving us something slightly apart from what we ask. Of course, it could never not do this! The point isn't that the tech fails because it's never the image in our head. The point is, this is the dynamic the system emphasizes, the thing that lures the user to keep prompting.
  • The transformer / attention mechanism maps that text to the composition of the image, a way to ensure that an eye is where an eye is and not where the mouth should be. The mechanism was designed to aid with translation between languages; it's the same attention mechanism used in language models. Only here, it remembers clusters of pixels. It moves from translating one language into another to translating language into vision, presuming that everything seen can be said, and everything said can be depicted, if you find the right translation. The problem is that the neural value of these images and texts is entirely about the close position of concepts in the vector space: as such, "fancy" and "African" and "house", for example, introduce enormous biases based on where those positions have been linked to images in the model.
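
Here is a minimal sketch of the attention step in its usual scaled dot-product form, with a region of the image asking which words of the prompt matter to it. The two-number vectors and the token names are invented for illustration, and real models add learned projections and multiple heads that this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query (an image region) scores every key (a prompt token),
    then takes a weighted blend of the matching values.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors: one image region, two prompt tokens.
image_regions = [[1.0, 0.0]]
prompt_keys = [[1.0, 0.0], [0.0, 1.0]]      # e.g. "fancy", "house"
prompt_values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = cross_attention(image_regions, prompt_keys, prompt_values)
```

The weights depend only on how closely the vectors line up. Whatever associations put those vectors near each other in training are carried straight through into the image.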

Conclusions

This is a very light gloss of the paper, and I skipped entire arguments and components of the base model. But I hope it can orient you to the ideas it contains, and that you'll have a closer look if any of this is of interest!

I'll end here with the paper's conclusion:

These systems are positioned to process information from a diminished world, narrowed by a particular ideology of vision: images mediated by platform and attention economies, measured by their suitability for literal description, become subjected to a medical gaze that was once aimed at finding symptoms within a literal body. The user of these systems conjures these symptoms as images but is offered only perpetual misdiagnosis: they believe they see the social world of image circulation and communication. But in the place of the body is a market whose transactions are obscured.