Complete Accuracy Collapse

A new paper from Apple proves what I've discussed before: so-called "reasoning" models aren't doing much reasoning at all, and now that is clear even by the industry's own bizarre definitions of the term.
The researchers created a series of puzzles that rely on similar problem-solving strategies but at differing levels of complexity. At low complexity, these models do less well than standard LLMs. At medium complexity, they do better. But as complexity rises, they become far less efficient and less able to solve the problems, before dissolving into garbage: "complete accuracy collapse."
A History of Reasoning
Reasoning models are part of a shift in strategy for AI companies. Data scaling has become more difficult: "fresh" data is harder to obtain, processing it demands ever more compute, and even in the best-case scenario, the incremental gains from scaling both can no longer match the rapid improvements that scaling delivered in the past.
Instead, model developers have turned to different architectures or strategies for how the models construct text. This is part of a long pattern: ChatGPT was not much more capable than GPT-3, but packaging it as a chatbot made it feel exponentially different. Reasoning models were a similar kind of development, though they delivered better results on some tasks (more on that later).
These models worked by first generating an outline from your prompt, then generating text from that outline. As the model moved through the outline, it revisited each step to strengthen attention on that part, allowing more consistency over long stretches of text generation. It also, some argued, allowed for more complex problem solving.
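For a rough picture of that process, here is a minimal, purely illustrative sketch of the "outline first, then expand" loop. The `generate()` function is a hypothetical stand-in for a language-model call; this is not any vendor's actual implementation.

```python
# Illustrative only: a toy version of "generate an outline, then expand it,
# revisiting the outline at every step." `generate` is a hypothetical
# placeholder for a language-model call, not a real API.

def generate(prompt: str) -> str:
    """Placeholder for a language model; returns dummy text."""
    return f"[model text for: {prompt[:50]}...]"

def reasoning_style_answer(user_prompt: str) -> str:
    # Step 1: produce an outline (the internal "to-do list") from the prompt.
    outline = generate(f"List the steps needed to solve: {user_prompt}")
    steps = [s for s in outline.splitlines() if s.strip()]

    # Step 2: expand each step, feeding the outline back in every time so the
    # generation is repeatedly anchored to the original plan.
    sections = []
    for step in steps:
        sections.append(generate(
            f"Task: {user_prompt}\nPlan: {outline}\nNow expand this step: {step}"
        ))

    return "\n\n".join(sections)
```

The point is not the code but the shape of the loop: every section is generated with the plan re-inserted into the prompt.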
Why Reasoning Collapsed
On the surface, this makes sense: generating text, revisiting the outcome of that text, and then generating new text, section by section, keeps the model's "attention" where it needs to be. Attention needs clarification here: it's basically just a matter of how many words back the model can reference internally. As it moves beyond that context window, it loses the originating context and starts to drift into increasingly "hallucinatory" patterns of text.
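As a crude illustration of that limit, here is a sketch that assumes a fixed token budget. Real attention mechanisms are far more sophisticated, but the effect of early context falling out of reach is similar.

```python
# Illustrative only: the "context window" modeled as a simple token budget.
# Once generation runs long enough, the earliest tokens, including the
# original prompt, are no longer available to reference.

CONTEXT_WINDOW = 8  # toy size; real models use thousands of tokens

def visible_context(tokens: list[str]) -> list[str]:
    # Only the most recent tokens fit; everything earlier falls away.
    return tokens[-CONTEXT_WINDOW:]

history = "the original prompt plus a long run of newly generated text".split()
print(visible_context(history))
# The words "the", "original", and "prompt" have already dropped out.
```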
Reasoning models are still only capable of generating text, but they generate more text before the text they show you. This unseen text adds internal reminders: flags that steer the model's inevitable drift of "attention" back to the start of the process.
These models also introduce a self-verification process intended to reduce wrong answers or "hallucinations." But OpenAI's own reports show that its o3 model hallucinated 33% of the time, and o4-mini 48% of the time. That's an increase from 16% in its simpler models.
In addition, they found the models delivered this incorrect information with greater confidence, because the models could explain "how" they arrived at the wrong answer. But because all LLM-generated text is designed to be plausible rather than accurate, the models themselves cannot tell the difference. An AI reasoning model even fools itself.
In simple terms:
- Words generated earlier tend to be more reflective of the prompt than words generated later.
- Reasoning models generate longer text at the start, like a to-do list summarizing the problem-solving process the model needs to take on.
- That to-do list is a set of new internal prompts the model returns to as it generates more words from each part of it.
- In theory, this to-do list, paired with giving models more time to process each step, should create something like manually re-writing the prompt over a series of steps, making them better at "reasoning."
But they aren't. As the authors of the Apple paper observed, "when problems reach high complexity with longer compositional depth, both [standard LLMs and reasoning] model types experience complete performance collapse." An earlier study found that top-of-the-line "reasoning models" introduced errors on general questions 51% to 79% of the time.
Running Errands in Circles
There is a world of difference between revisiting steps on a to-do list for the sake of reasoning your way through a problem, and revisiting those steps to extend the text based on the words it contains. The latter is just extending language – while reasoning means developing a working memory for solving a problem.
Here's the proof. Reasoning models operate just as well on simple problems at the outset. The issue is that the to-do list becomes mandatory. As a scaffold, it introduces new opportunities to revisit and revise the correct answer over time. It's like navigating a city: you can walk a straight line to quickly arrive somewhere. But if you have a bunch of errands to run, you create opportunities to get lost.
"Reasoning" architecture was meant to serve as a simulation of doubt, but because it is regimented, this doubt is forced. That makes them bad at easy tasks – they must explore wrong answers – and more complex ones, because they have more errands to run. They create longer text than they need, threads that end up never being resolved.
In the end, the paper found that:
"state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments."
Anecdotal Evidence
I keep coming back to a core question about LLMs: how do people actually use them? I don't think the people who claim to find them useful are dupes, or lying, or delusional. But the failure rates measured for these models are pretty astounding. I think the way these things are "useful" has less to do with how they are being tested and more to do with what people are actually doing with them.
Sometimes, the quality of the material matters less than the speed with which it is produced: cynical examples include AI slop's role in spamming your Instagram with weird ideas & products, or public health reports the government can produce when it doesn't really give a shit.
But I also want to resist the idea that all LLM use is cynical. I think there are disconnects in what we mean by "using" LLMs, particularly in cases where the models are being asked to generate ideas or brainstorm.
In my previous post about defining "use," I framed it this way:
Perhaps missing the mark is one thing people are doing with this type of AI. LLMs can generate scaffolding of a text that isn't complete, can generate images that show us what we don't want but help steer us toward something we do want.
A core affordance of generative AI is that it separates wounded pride from mistakes. It shifts accountability in the iteration of ideas: "the AI made this," and if your boss or coworkers say it sucks, no problem. The AI pitched a bad idea and that can help steer you to a good idea.
But I want to believe that many people who use them do so because they're insecure about their own capacity for creativity and about their professional reputations.
When creativity is called for in professional settings, it nearly always exists in unsupportive environments. If you're the person hired to bring creativity to the office, they never really let you do it, and you never really know where the acceptable boundaries are. Rejection stings, and if you're in the industry, this rejection is relentless. Generate 25 pitch ideas, and 24 and a half get struck out. AI can give you 250, and you don't have to care about any of them.
Lots of people in the intense pressure cooker of marketing, design or media production clearly benefit from showing people text they didn't write framed as "might be kind of ok but yeah no haha it's AI, haha, sure, no, yeah, I can do much better than that obviously but what exactly do you think we should change I mean I know but just open to suggestions while I have you, haha!"
Students, too, are encouraged to be creative but often imagine creativity as a trade-off with rigor. When they're graded for the content instead of the creativity, they don't understand the connection, as if it were an either/or. Generative AI creates illusions of rigor or creativity in assignments. It gets a good grade without pushing the limits of the risks a student learns to take. It's true: students are graded on the product, rarely on the creative risk they took to get there.
The ultimate sales case for AI is removing accountability, a tool for granting social permission to make proposals without ownership, a safe space distanced from the vulnerability of risk. Maybe this should make me nervous – that we get out of the practice of taking risks, become more conservative, and collectively shift our ideas toward an ever-narrowing acceptable mean.
But people get bored of new technologies and narrow means. That gives me some hope. Generative AI has a market because people are anxious about taking creative risks. That makes these risks, and the people willing to take them, more valuable than ever. The rapid convergence of ideas generated by creative industries will burn itself out. No creative industry can survive on the rapid delivery of its competitors' accumulated averages.
Upcoming In-Person Events
June & July :: Rome & Melbourne

Rome, 26 June: Is AI Art Net Art?
Gen AI’s images are a distillation of the internet, inheriting the text categories assigned to images alongside the images themselves. How do artists work with, or resist, these competing systems of powers, logics, and communication?
In this exploratory discussion, Vladan Joler, Valentina Tanni and Eryk Salvaggio will examine how AI has changed the way we imagine and frame these systems today. We look to Internet Art, a movement anchored in critique and resistance, to find paths relevant to critical artistic engagement with AI and uncover what has been elided from the movement from net.art to AI art. Free & open to the public!

Melbourne, 3 July: Human Movie (Performance!)
With JODI (NL, BE) & Debris Facility Pty Ltd (AUS)
I'll perform Human Movie as part of a series of performances including the net.art legends JODI and the Australian "para-corporate and parasitic entity," Debris Facility Pty Ltd. Open to the public, details below!

Melbourne, 7-8 July: Noisy Joints: Embodying the AI Glitch
With Camila Galaz
The entire conference is going to be great. Here's our part:
Artists and researchers Eryk Salvaggio and Camila Galaz present a participatory workshop on interrupting and reframing the outputs of generative AI systems. Drawing from a critical AI puppetry workshop originally developed at the Mercury Store in Brooklyn, New York, Noisy Joints invites participants to think through the body—its categorisation, misrecognition, and noise—within AI image-generation systems. How do our physical movements interact with machine perception? How can choreographies of shadow, gesture, and failure unsettle the logic of automated categorisation?
Across the session, participants will explore these questions through short talks, collaborative video-making, glitch-puppetry exercises, and experimental use of tools like Runway’s GEN3 model. Using shadows, projections, and improvised movement, the workshop will trace a playful and critical path through the interfaces and assumptions that shape AI perception. No technical experience is required.
Convened by Joel Stern (RMIT), Thao Phan (ANU), and Christopher O’Neill (Deakin).