Cinema Without Cameras
Revisiting Gene Youngblood's 1989 Essay on Digital Storytelling in 2023
I’ve been working with AI x Design and SubLab on the <Story&Code> residency, pairing six storytellers with six technologists to grapple with the newest wave of generative media tools. We’re asking what we can make — but also how we keep our creative visions in focus when the tools seem so eager to pull us in so many directions.
Between that program and my class on AI Images at Bradley, it’s been exciting to contemplate generativity beyond static images and into moving ones. Right now a number of generative video tools are popping up, though they lag slightly behind the rapid pace we’ve seen with generative images.
Chiefly, there is Gen-1, from RunwayML. Gen-1 can create three-second video clips from text, and it can transfer any video into the style of any other video or image. Here’s an ad for it, if you’re into ads.
So how exactly does Gen-1 work? You might assume that it simply takes a video, breaks it down into frames, and then applies a style transformation to every frame. That is one way to make these videos — that’s how I made The Salt and The Women — but the consistency between frames fluctuates pretty wildly.
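To make that contrast concrete, here’s a minimal sketch of the naive frame-by-frame approach, with a toy function standing in for the style model (none of this is Runway’s code): because each frame is restyled independently, the results drift apart, which reads as flicker on playback.

```python
import numpy as np

# Naive per-frame approach, as a toy sketch: restyle every frame independently.
# Because no information is shared between frames, small per-frame differences
# show up as flicker when the clip plays back.

def restyle_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a per-frame style model: the output jitters a little each call."""
    return np.clip(frame + 0.05 * np.random.rand(*frame.shape), 0.0, 1.0)

video = np.random.rand(30, 64, 64, 3)                  # 30 fake RGB frames
restyled = np.stack([restyle_frame(f) for f in video])

# Even if two source frames were identical, their restyled versions differ:
print(float(np.abs(restyled[1] - restyled[0]).mean()))
```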
The researchers behind Gen-1 broke this process down into two conceptual approaches in their white paper: one deals with content, the other with structure.
“By structure, we refer to characteristics describing its geometry and dynamics, e.g. shapes and locations of subjects as well as their temporal changes. We define content as features describing the appearance and semantics of the video, such as the colors and styles of objects and the lighting of the scene. The goal of our model is then to edit the content of a video while retaining its structure.”
In the demo, a person walking down the street is turned into claymation. The person, or the geometrical shape of the person, as well as the motion and location of objects moving around them in the video, are part of the video’s structure. The content, as defined by the paper, might be thought of as the style elements: in the case of a video of someone walking down the street, that style is photorealistic. But you can render that into the style of claymation, or animation, or even a different kind of photorealism using Gen-1, which they call editing the content.
There are differences between generative video and generative images. With still images, you don’t have to deal with objects moving through space. With Gen-1, one process compresses information about spatial relationships, such as depth estimates and edge recognition, which make up the structure of the video, while another process generates information about the content; the two are then mapped back together in the final product.
In this way, as they write, “To correctly model a distribution over video frames, the architecture must take relationships between frames into account.”
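As a rough sketch of that idea, and not the actual Gen-1 architecture, here’s a toy version in which a per-frame “structure” signal is extracted and preserved while a single “content” signal derived from the prompt is shared across every frame. All of the functions are stand-ins I made up for illustration.

```python
import numpy as np

def estimate_structure(frames: np.ndarray) -> np.ndarray:
    """Toy depth-like map per frame: here, just the luminance gradient magnitude."""
    gray = frames.mean(axis=-1)                      # (T, H, W)
    gy, gx = np.gradient(gray, axis=(1, 2))
    return np.sqrt(gx**2 + gy**2)

def embed_content(prompt: str, dim: int = 64) -> np.ndarray:
    """Toy content embedding: a pseudo-random vector derived from the prompt."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def generate_video(frames: np.ndarray, prompt: str) -> np.ndarray:
    """Sketch of the idea: keep structure fixed frame-to-frame, swap in new content."""
    structure = estimate_structure(frames)           # preserved across the edit
    content = embed_content(prompt)                  # shared by every frame
    out = []
    for t in range(frames.shape[0]):
        # A real model would denoise jointly over frames, taking relationships
        # between them into account; here we just modulate each structure map
        # by a scalar derived from the content embedding.
        out.append(structure[t] * (1 + 0.1 * content.mean()))
    return np.stack(out)

video = np.random.rand(8, 64, 64, 3)                 # 8 fake RGB frames
edited = generate_video(video, "claymation street scene")
print(edited.shape)                                   # (8, 64, 64)
```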
Another distinct thing about training these models is that rather than working back from noise, as we saw with Stable Diffusion and DALL-E 2’s image-generating diffusion models, Gen-1 relies on a deblurring process.
Diffusion builds images around the breakdown of images into noise, and then walks pure noise backward toward the image described by your prompt. For Gen-1, Runway’s researchers integrated video information into the training data starting with blurry abstractions rather than static. In their paper they suggest that blurry movements are easier to map and track, and that object shapes stay general and broad. Once you know the general movement of the objects in space — the structure of the video — you can de-blur it in the direction of the prompt, or map the video to the style of an existing image, shaping its content elements such as style and lighting.
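Here’s a toy illustration of that coarse-to-fine idea, my own sketch rather than the training procedure from the paper: start from a heavily blurred frame and repeatedly pull it toward what the “prompted” content looks like at progressively sharper blur levels.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def coarse_to_fine(start_frame: np.ndarray, target_frame: np.ndarray, steps: int = 8) -> np.ndarray:
    """Toy schedule: the blur level shrinks each step, and at each level we move
    partway toward the target, which stands in for the prompt/content guidance."""
    sigmas = np.linspace(8.0, 0.5, steps)            # blur level shrinks each step
    x = gaussian_filter(start_frame, sigma=sigmas[0])
    for sigma in sigmas:
        blurred_target = gaussian_filter(target_frame, sigma=sigma)
        x = 0.5 * x + 0.5 * blurred_target           # de-blur in the direction of the "prompt"
    return x

start = np.random.rand(64, 64)                         # arbitrary starting frame
target = np.zeros((64, 64)); target[16:48, 16:48] = 1  # the "prompted" content: a bright square
result = coarse_to_fine(start, target)

# The result ends up far closer to the target than the random start was:
print(float(np.abs(start - target).mean()), float(np.abs(result - target).mean()))
```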
No-Camera Cinema
AI images don’t use cameras at all. Images come from text. And that should open up new ways of thinking about how we tell these stories. These images don’t come from nowhere, of course: they are re-rendered composites of pictures in a dataset. Those pictures were taken by people with cameras, or a scanner, or some other optical input device. So “No-Camera Cinema” is not so much about where the pictures come from as about what might be possible in terms of what we do with those images — and how we make sense of them.
The Brazilian-Czech philosopher Vilém Flusser wrote “Towards a Philosophy of Photography” in 1983. It is bafflingly prescient. As a philosopher with an interest in technology, Flusser wrote not just one but two entire books explaining a world where photorealistic images were made by technical systems: the camera, yes, but abstracted in such a way that the basics of images were understood as the result of scientific thought, and all that entails for the way we see and document the world.
Forty years later, that seems to be exactly where we are. Long before the invention of DeepDream, GANs, or DALL-E, Flusser described what he called “technical images.” Roughly speaking, technical images are those produced by a technological apparatus, which he defines as a “plaything that simulates thought.” Any technological apparatus — from a camera to AI — is a product of applied science, and therefore, he suggests, reflects a scientific frame of thought. In cameras, you have the sciences of chemistry and optics; with AI-generated images, you have information and computer sciences, and so on.
Flusser also wrote about the connection of images to magic and myth. Images paint visions of the world reflecting those scientific origins. They allow us to store and share information that changes our way of encountering events. Technology constrains the ways we can create those myths and tell those stories. For Flusser, this means that whenever we build a technology, we first build it as a model of the way people think. Soon enough, people think like the technology. You can look beyond photography to assembly lines: we build them based on the way people put parts together, but the technology becomes so fast and efficient that we start to demand people become faster and more efficient, too.
When we talk about cinema, we usually think about the invention of the camera, but the movie camera or projector more or less emerged from a handful of other, often overlooked precursors. So if we want to talk about cinema without cameras, maybe we start there.
Magic Lanterns
Consider the Magic Lantern, here illustrated in 1720. This was a box with a lantern inside it that could project images onto surfaces, and was often used to add surreal effects to performances, such as introducing a demonic or angelic presence into a space. You would paint the image onto glass and insert it into the lantern and the image would be cast outward. Earlier than that, we have shreds of sketches created for a kind of mirror projection system, a series of images in which a skeleton removes its head.
These magic lanterns were designed as an augmenting technology for stories. They were dim, candle-lit projections that showed up on stage alongside human performers. They gave rise to what eventually became known as projectors, where images could be cast forward with powerful light bulbs, flipping between images at 15, then 30, then 60 frames per second, bringing humans to life within the projection rather than merely beside it.
The glass paintings made way for film strips, and film strips gave us specific ways of shaping the content of the stories we told. Camera lenses let us do close-ups or shoot scenes of big, enveloping landscapes. We could cut film — literally, slice it up with scissors and rearrange it — to traverse time and space, seamlessly shifting where we were in the arc of a story.
And if magic lanterns were projections with people beside the images, and cinema was projections of people within them, well, what exactly is an AI image — or AI cinema? It’s a projection without people at all, or perhaps with multiple people composited into one face.
In any case, it is cinema without cameras. The camera no longer has to be present at any scene in order to generate a compelling story of that scene. These images can be coupled together in sequences.
AI isn’t the first time we’ve explored the language of cinema without cameras, though. One famous example of no-camera cinema comes from the experimental filmmaker Stan Brakhage, who made a film called Mothlight in 1963.
Mothlight is a film, run through a projector. But instead of capturing images on that film, Stan Brakhage collected an assortment of flowers, leaves and moth wings, stuck them to a film spool with transparent tape, and then played the assemblage through a projector, which he then recorded. The technology interacted with the images in unexpected ways in this performance, with its flapping of celluloid and wings creating the soundtrack. It is still a film, dependent on the technologies of cinema, but used in absolutely the wrong way.
It’s exciting to think about what this would mean for AI cinema: is there any way that Mothlight could be possible? Is each frame of film a piece of data? What if you cut out representation from an AI system? What might it generate then?
The way we tell stories is shaped not just by the technology we have at our fingertips, but by the way people actually use that technology. When people first started to use film cameras, they didn’t know how to tell stories with them. People had to try things: somebody had to take a razor blade and cut the film up and present what they’d shot out of order. Someone, somewhere, tried that to see if it worked. And that’s something fun about where AI images are at the moment. From a historical perspective, we simply don’t know how to use them yet.
We don’t have to start with super contemporary thinkers or artists. In 1989, Gene Youngblood wrote an essay, “Cinema and the Code,” examining what computer cinema could be. Computers could replicate the language and tools of cinema: we can cut with a scissor icon in movie editing apps, rather than cut with real scissors. We can move from 30 frames per second to 60.
But is there anything that computers might do for movies that film could not do? Can that help us see the future of AI storytelling?
Cinema and the Code
Youngblood looked at a number of experiments done with computers and code at the time and came up with categories worth revisiting today, since a lot of the AI tools only now becoming widely accessible draw on them. Youngblood mentioned four specifically — Image Transformation, Parallel Event-Streams, Temporal Perspectives, and Images as Objects. So let’s look at some of those.
Image Transformation
Youngblood noted that Image Transformation was distinct from image transitions. Film transitions are constrained by the fact that film is a strip of pictures in a sequence. AI cinema can transcend that. Instead of flipping quickly between images, you can transform from one image to the next: interpolation. Youngblood wrote about it in 1989:
“... In digital image synthesis, where the image is a database... One can begin to imagine a movie composed of thousands of scenes with no cuts, wipes or dissolves, each image metamorphosing into the next.”
So what does that look like? Well, right now RunwayML offers something called Interpolation, which allows us to upload images, or a sequence of images, and the model will analyze the information in each frame and render a transition that fuses each pair of adjacent images together instead of cutting between them. They metamorphose.
This is a sample film made using interpolation. It’s just a few pictures of flowers put together, and then the machine draws the transitions.
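The mechanics are easy to sketch in code if we swap the learned model for a plain crossfade (much cruder than what Runway’s tool actually does): generate the in-between frames by blending two stills.

```python
import numpy as np

# A minimal illustration of interpolation as "drawing the in-between frames":
# a plain crossfade between two images. A learned interpolation model predicts
# far more plausible intermediates, but the idea is the same.

def crossfade(img_a: np.ndarray, img_b: np.ndarray, n_frames: int) -> np.ndarray:
    ts = np.linspace(0.0, 1.0, n_frames)             # 0 = first image, 1 = second
    return np.stack([(1 - t) * img_a + t * img_b for t in ts])

flower_a = np.random.rand(64, 64, 3)                  # stand-ins for two stills
flower_b = np.random.rand(64, 64, 3)
sequence = crossfade(flower_a, flower_b, n_frames=24) # 24 generated "fake" frames
print(sequence.shape)                                  # (24, 64, 64, 3)
```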
Interpolation comes from the French. It used to mean a kind of forgery: to interpolate would be to find a larger archive or book and create a fake page in that book’s style. You would then insert that page into the book to make the book larger. The most notable early use of this word is connected to the work of a scribe, now known as Pseudo-Isidore, who deliberately inserted false laws into books in order to help advance land claims and allow some corrupt bishops to get away with crimes.
So interpolation is a useful concept when we think about AI and GANs, too: it’s adding false information to an existing body of information. In this case, it’s adding fake frames to connect two real ones, and then moving on to the next. It’s also a perpetual flow of information: there’s no cut, like you might see in the cinema.
In my own work, I’m interested in how this reveals the shifting relationships between people and nature: by merging the two, I hope to suggest the interconnectedness and entanglements between people and the natural world.
If you have to cut between a person and a mushroom, you’re forced to present them as two distinct things. But if you can flow from one to another continuously, the viewer sees that relationship as blending, not superimposed. It’s a small, subtle difference in what is possible, and while Youngblood acknowledges that this could be done with animation (Mothlight, for example, is one continuous, unsegmented strip), it is significantly more accessible as a strategy now that we have these tools.
Temporal Perspective
“We arrive at two possibilities: first, cinema looks from one point to infinity in a spherical point of view. That is one vector, we shall say. The other is the opposite: one looks from each point in space towards a single point. If all these points are in motion around one point, that is the space in which ideal cinema operates. But as long as we are talking about psychological realism we will be bound to an eye-level cinema.” — Gene Youngblood, 1989
We might imagine this as two aspects of machine learning. Basically, you can look out from your own eyes at the stars, and you can see all of infinity from wherever you are standing. That’s the spherical point of view. One way to make sense of this in digital cinema today is the latent space walk. When you train a model, the latent space is that huge realm of possibilities: an algorithmically generated space where every possible permutation, or variation, of the images in your dataset can be seen.
Looking at this space is a bit like standing on Earth and looking out at the stars. But instead of looking at a vast universe, you’re looking at the vast possibilities of the universe described by the dataset. This is an interesting tool for telling stories, and many artists have tapped into this space, though they’re still pretty constrained by the capability of these tools.
Imagine a map of this space as a grid, with every square representing a variant on something in the data. It’s interpolated multiple ways, moving in all kinds of directions between every possible combination of images. While you probably won’t find a narrative in that data, you can find stories, patterns, and connections. Put these images together in a sequence and you end up with a video that moves through all of the possibilities of a single moment in time.
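Mechanically, a latent space walk is simple to sketch, here with random vectors standing in for points chosen in a trained model’s latent space: pick a few latents and glide between them, one frame per step.

```python
import numpy as np

# Toy latent-space walk: four random latent vectors stand in for points chosen
# in a trained model's latent space. In a real pipeline each vector along the
# path would be decoded into an image; here we only build the path itself.

rng = np.random.default_rng(0)
waypoints = rng.standard_normal((4, 512))                # four "interesting" latents
path = []
for a, b in zip(waypoints[:-1], waypoints[1:]):
    for t in np.linspace(0.0, 1.0, 30, endpoint=False):  # 30 in-between latents per leg
        path.append((1 - t) * a + t * b)
path = np.stack(path)
print(path.shape)                                        # (90, 512): one latent per video frame
```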
Refik Anadol’s Latent History Stockholm is assembled through the exploration of all possible images in a dataset of images of Stockholm. It’s like our eyes are peering into the latent space. The video animates this, so we experience all of these possibilities over the course of 2 minutes. When we look at any depiction of possibilities in a GAN model, we see all of these frames laid out for us simultaneously. And that is, to say the least, not something that happens with traditional animation or cinema.
Youngblood suggests that this offers some very interesting ways of thinking about time, space, and possibility beyond the idea of showing us a single image from the perspective of a single camera.
Deniz Kurt’s piece (seen as a still, above) is one such story: the video moves us through portions of datasets depicting embryos all the way up to seeds and flowers and mammalian life, to cities and beyond. By connecting these permutations of possibility and piecing them together, there is a blossoming potential inherent to the whole piece that begins to tell a story of evolution. This is expressed by seeing all the possibilities of datasets combining various stages of human development.
That’s one aspect of this: one person looking out and seeing the whole world of possibilities that exist from a small series of images.
The other possibility that Gene Youngblood gets at is the idea of fusing many different views of time and space into a single film. In essence, this is what Diffusion is: it takes all kinds of pictures that people have taken of apples, or the Eiffel Tower, and it shows us a composite of those images. Whenever we look at an image from a Diffusion model, we’re basically seeing the views of hundreds, if not millions, of people presented simultaneously.
With cinema, this is an interesting affordance of the technology, and a slice of its composition with a lot of potential. Generate an image of a street scene and you are generating a street scene built on a composite of many streets, seen by many people, with many cameras.
We might take this in many interesting directions. Imagine a documentary film literally rendered from the perspective of every subject involved: a work where everyone interviewed has their own camera, and the “film” (if we still call it that) is generated between all of these collected images.
Cinema today is centered on the person telling the story: not even the person speaking to the camera, but the person deciding where the camera points. Could an AI cinema change this? Who would control the camera then?
Images as Objects
Technology allows us to tell stories in certain ways. Aspect ratios, for example, are a constraint. If your screen is widescreen, or vertical like a phone, or nearly square like a TV in the 1980s, we tell stories that fit that frame. Change the shape of a screen and you change the shape of the story.
The artist duo known as the Vasulkas were part of Youngblood’s thinking. These artists were using computers as early as the 1970s to break television images out of these constrictions. What you are about to see is a series of works they made. The TV sets aren’t showing these as videos or computer animations. Instead, the Vasulkas used computers to hack a TV, so to speak, and render the light beam of the cathode ray tube in visually compelling ways. Their images were not computer generated the way a video game might be generated from a program, and they aren’t recordings. They’re live pictures, created by twisting the way images are rendered on a screen. Still images like the one below don’t do the work justice — it was meant to move and evolve.
Youngblood also talks about Ed Emshwiller’s Sunstone, a work of computer animation from 1979. In this video, one of the earliest computer-animated experimental stories, you can see this transition at work: the screen gets smaller, reducing the image area, which then joins other image areas in an arrangement that cycles through and shifts.
Today, that’s a bit of cheesy office animation that can be done in any Google Slides deck. But the concept of the story frame is an interesting one to consider: it got cheesy through overuse, but someone had to develop it. The potential of that kind of shifting plane — shifting between planes on a single screen or across multiple screens — is worth noting.
Youngblood wrote all of this in 1989, before data became fully integrated into the way we think about pictures and films. This new era of AI is happening all around us as we speak. The future of digital cinema as Vilém Flusser and Gene Youngblood envisioned it has arrived, and now we get to go beyond even those horizons.
In the early days of any new technology, so much creativity in these tools emerges from creative misuse. Sometimes that creative misuse makes something ugly or incomprehensible. But it could also be part of an emerging language of images and cinema that we haven’t discovered yet. So I am always eager to see the way that people push boundaries on what these tools can do, and bend these technologies in ways that give stories new shapes.
So on the topic of no-camera cinema, there is also some interesting potential in reinventing the camera itself using AI. One example is Ross Goodwin’s Word Camera from 2016, which pieced together various image and text models. If you pointed a camera at an object, it would classify that object, then send its classification to a text generator, which would then expand on that word in the form of a poem. Here’s one example.
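That pipeline is simple enough to sketch with toy stand-ins (both stages below are placeholders I invented, not Goodwin’s actual models): classify the image, then hand the label to a text generator.

```python
# A toy pipeline in the spirit of Word Camera: image → label → generated text.
# Both functions are placeholders; the original chained real vision and
# language models together.

def classify(image_path: str) -> str:
    """Stand-in classifier: pretend every photo contains a streetlight."""
    return "streetlight"

def expand_to_poem(word: str) -> str:
    """Stand-in text generator: a fixed template instead of a language model."""
    return f"the {word} hums all night,\nholding the dark at arm's length."

label = classify("photo.jpg")                          # hypothetical input file
print(expand_to_poem(label))
```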
Zen for Film
So in the end, we’ve seen how emerging tools like RunwayML’s Gen-1 can be used to map style and content to tell stories, in ways that correspond to cinema as we now know it. Right now we’re watching one of the earliest 3D renderings ever made, in 1972 at the University of Utah. Rendering wireframes for hands and human faces, these images were, at one point, a major leap forward in storytelling. We take them for granted today: every Pixar film, every video game you play, and most of the effects in most Hollywood films are a result of these hands and faces traced by scan lines on the surface of a cathode ray tube.
Our stories — from magic lanterns to Hollywood blockbusters — may seem like they follow clear patterns, that there’s always a beginning, a middle, and an end — though, as Jean-Luc Godard would say, perhaps not always in that order. Technology is always pushing us to tell stories in new ways, and sometimes those new shapes are radical, or hold radical possibilities. Sometimes they augment and extend the stories we’ve already told.
Computers alone have already changed the way we tell stories: you’re streaming a video at home, edited on a computer, with clips pulled from all kinds of people who generously shared them. Afterward you might open up a phone that you’ve never used for a phone call to touch a screen and tell a story interactively. All of that has come about because somebody, somewhere, thought about doing something in a new way.
In an exciting recent blog post from _blank (go follow!), I learned about a short piece of cinema, Zen for Film by Nam June Paik.
The author describes it this way:
In 1964, Nam June Paik released an empty film: Zen for Film. This 16mm film is just a roll of transparent film, without sound, projected on a loop. The only thing you see on the screen—or wall, as it is usually shown at museums, not at film theatres—is the light passing through the film. It is like staring at nothing, like looking at a wall. This may sound like pure asceticism, non-cinema or the negation of cinema, but it is quite the opposite. Zen for Film is a film in constant evolution that captures its surroundings. You can see, among other things, the dust particles that have adhered to the film and the scratches provoked by the inner mechanisms of the film projector. The footsteps, voices, and coughs from the public create the soundtrack.
Is this even a movie? Is it a film at all?
I’m reminded of light and its role in the history of computers: light passes through a punch card, or else light stops. If there’s a hole in the punch card, light moves through it, representing a one: a yes, the presence of information. Darkness, or zero, is information’s absence. So in watching this film, in which light passes through a projector onto a wall, I want to see it as a giant cinematic “yes!” — a starting point for asking what more we can do to push and shape that yes with the tools we have on hand, or those about to arrive.
A Short Note About Twitter
Recently, Substack — the platform where I write this newsletter — announced a feed-based service that allows people to share ideas in small, bite-sized pieces, a bit like Twitter. In response, Elon Musk is now blocking Twitter links to any Substack domain, including this newsletter, marking them as “potentially malicious, misleading, or violent.”
It would be very cool if you would share this link on Twitter anyway, if you are so inclined, with a note letting people know that it’s safe.
I don’t know how long I will stay on Twitter if this kind of censorship continues, so you can find me on Mastodon or Instagram.
Class is in Session
This week I’m sharing the Artist Talk from Merzmensch, an artist and author with a contagious enthusiasm for making AI work, influenced by Dada and surrealism. Merz’s work is as experimental as it is playful. Hope you enjoy it!