Unreasonable Effectiveness
Seriously: Machines Do Not Need to Think to Be Thought About
This week, Gary Lupyan and Blaise Agüera y Arcas published a paper, "The unreasonable effectiveness of pattern matching," in which they describe "an astonishing ability of large language models (LLMs) to make sense of 'Jabberwocky' language in which most or all content words have been randomly replaced by nonsense strings."
The authors (Lupyan is from UW-Madison, Agüera y Arcas is from Google) create a series of tests for the machine to make sense of this language, and then "consider whether a system capable of succeeding in such sense-making is best thought of in terms of familiar technologies like databases and search engines, a strange “alien” form of intelligence, or in terms of processes that take place in our own minds."
They reference Jabberwocky, from Lewis Carroll's Through the Looking-Glass, a poem made up mostly of nonsensical words:
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
In the paper they take this nonsense approach to language and test it out on the machine. The goal is to see how an LLM might interpret and produce text in response to something like “The gostak distims the doshes.”
Before we go forward, think about that sentence and see what you can make of it. In the paper, they lay out some of what can be inferred from this nonsense alone: "Although we do not know what the specific words mean, we can—after assuming that it is “really” English—make many meaningful inferences: doshes are things that can be counted (hence the plural marker) and can be distimmed; a gostak is something (though not necessarily the only thing) that is capable of distimming doshes."
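As a quick illustration of how much can be read off the form alone (my own sketch, not something from the paper), a purely structural tool will happily assign parts of speech and grammatical relations to the gostak sentence. The snippet below assumes spaCy with its small English model installed; the exact labels it assigns to nonsense words are not guaranteed, but the determiners, word order, and -s endings give it plenty to work with.

```python
# Sketch: parse the nonsense sentence with spaCy to see how much structure
# survives when the content words mean nothing. (My illustration, not the paper's.)
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("The gostak distims the doshes.")

for token in doc:
    # pos_ is the coarse part of speech, dep_ the syntactic relation to its head,
    # and morph carries features such as Number=Plur for a plural noun.
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} {token.morph}")
```

Whatever labels come out, the point is the one the authors are making: the scaffolding of English is enough to tell us who is doing what to whom, before we know what any of it means.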
In their paper, the authors "Jabberwockify" (my phrase) various texts, replacing content words with nonsense words. The texts are then given to an LLM, which is asked to translate the nonsense back into plain English. In one example, the LLM is able to identify the Supremacy Clause of the US Constitution even when most of the words have been replaced with the word "BLANK":
In the BLANK BLANK, BLANK BLANK has BLANK over any BLANK BLANK’s BLANK. If a BLANK BLANK BLANKs with BLANK BLANK, the BLANK BLANK is BLANKed and the BLANK BLANK is BLANKed. For BLANK, BLANK BLANK BLANK BLANKs what BLANK should be BLANKed on the BLANKs of BLANK BLANKs. When BLANK BLANKed a BLANK that BLANKed BLANK BLANK on BLANK BLANK BLANKs, BLANK BLANKs BLANKed BLANK from BLANKing this BLANK.
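The paper's actual replacement procedure is more careful than this, but the basic move is easy to sketch: keep the function words and punctuation, and swap every content word for a pronounceable nonsense string. The stopword list and helper names below are mine, invented for illustration; a fuller version would also preserve inflectional endings like -s and -ed, which the paper's examples keep.

```python
# Rough sketch of "Jabberwockifying" a text: keep function words and punctuation,
# replace everything else with pronounceable nonsense. Illustration only; not the
# procedure used in the paper.
import random
import re

FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "if", "of", "in", "on", "to", "over",
    "is", "are", "was", "were", "has", "have", "be", "that", "this", "it",
    "any", "for", "what", "when", "should", "with", "from", "so", "very", "i",
}

def nonsense_word(rng):
    """Build a pronounceable nonsense string from consonant-vowel syllables."""
    consonants, vowels = "bcdfghjklmnpqrstvwz", "aeiou"
    return "".join(rng.choice(consonants) + rng.choice(vowels)
                   for _ in range(rng.randint(1, 3)))

def jabberwockify(text, seed=0):
    rng = random.Random(seed)
    cache = {}  # reuse the same nonsense word for repeated content words
    def replace(match):
        word = match.group(0)
        if word.lower() in FUNCTION_WORDS:
            return word
        return cache.setdefault(word.lower(), nonsense_word(rng))
    return re.sub(r"[A-Za-z]+", replace, text)

print(jabberwockify("Federal law has priority over any state law."))
```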
Reading these examples, I was impressed, but an explanation seemed close at hand: the model retained the structure of the Supremacy Clause's wording, and that structure was evoked even though the words were nonsense. I have encountered this in the past: when I asked models to solve word problems about nonsensical terms, they would find the noun-like words and count them up.
But the secondary example is more interesting: a Reddit post published on the same day as the experiment, and therefore beyond the reach of the model's training data.
Reddit: Ok, so probably a very dumb question, but I was wondering if ontbijtspek is eaten raw in the Netherlands?
Jabberwockified: Gharp, so phrev a very chelp chusp, but I was smeighthing if psive is veich sprebb in the Splud?
Translation: Hi—sorry for such a basic question, but I was wondering whether MSG is considered safe to eat in the United States.
There's a frustrating paragraph in this paper, though. Here is the argument:
"Wikipedia—no matter its size—is not the type of thing capable of thinking, much less understanding anything. Equating LLMs to technologies like Wikipedia or card catalogs turns any claim that LLMs “think” into a category error because, clearly, Wikipedia cannot think. But then again, Wikipedia also cannot diagnose patients, translate languages or complete your homework assignments, while LLMs can."
I am confused as to why "thinking" is required for these tasks. The model does not think about your diagnosis; it produces text that is loosely associated with a constellation of listed symptoms. It does complete homework assignments, again through statistical filling-in-the-blanks. The authors fall back into the trap of believing that thought is reducible to calculating the most statistically likely arrangements of words. (To be fair, they would suggest I am falling into the trap of believing that it is not.)
The authors follow the trend of beating a dead parrot, that is, dismissing Emily Bender et al.'s description of the LLM as a stochastic parrot. As I have written before, people tend to emphasize the parrot and miss the modifier, "stochastic." The metaphor of the stochastic parrot describes LLMs quite well: a machine that parrots words, drawing from a pool of words loosely constrained by probabilities. As Bender & Koller wrote in 2020, "the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning." That is to say: Bender already accepts that machines learn form, that is, the structures of language. What is transmitted through those structures is acknowledged to be an open question.
As I have written, an LLM operates as if every word were a grammatical marker, indicating its position. As each word is set down in the reply window, the softmax distribution collapses into a single choice, cutting off all other possibilities. This is less a matter of "selection" than a flood filling a reservoir until a drop spills out.

What sets the conditions of this spill-out? It's the combination of previous words that have flowed through the model (the ones in your prompt and the ones already generated).
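To make that image concrete, here is a minimal sketch of the spill-out, assuming a toy vocabulary and made-up scores rather than a real model: softmax turns the context-conditioned scores into a probability distribution, and sampling commits to a single word.

```python
# Minimal sketch of the "spill-out": softmax turns context-conditioned scores
# into probabilities, and sampling lets a single next word spill out.
# The vocabulary and logits are invented; a real model computes the logits
# from the prompt plus everything it has generated so far.
import math
import random

vocab = ["rain", "umbrella", "wet", "sun", "cancel"]
logits = [2.1, 0.3, 1.4, -0.5, 0.0]  # pretend scores after reading the context

def softmax(scores, temperature=1.0):
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
next_word = random.choices(vocab, weights=probs, k=1)[0]

for word, p in zip(vocab, probs):
    print(f"{word:10} {p:.3f}")
print("spilled out:", next_word)
```

Every other possibility is cut off the moment the sample is drawn, and the next pass starts from a context that now includes the word that spilled out.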
Nonetheless, the paper raises a more sophisticated question. They ask us to imagine we are "about to leave your house and you are wondering whether it might rain." They suggest that we do not care if it is raining; rather, we care if we get wet. If we get wet, consequences will follow: "We do not need to consider the ultimate causes of rain or the geometry of umbrellas." Instead, this word, "rain," does the work of that association. If it rains, we get wet, or we get an umbrella; the soccer game gets canceled: "Because language is in the business of communicating these thoughts, it tends to reflect much of their (deeply relational) structure."
So then they suggest that my argument – the "grammatical position" of all language – describes a structural property that allows for inferences about what a word means. Or, put more simply: you can write garbage, and the structure of the garbage will still imply a kind of meaning. What's fascinating about this is that we do not know how this meaning might arise from specific structures: how does "a blank blank blanks on the blank" convey a different meaning than "when blanks blank, a blank blanks too"?
In their paper, the authors note: "Chiang invoked the analogy of a blurry JPEG as a disparagement. Why would we settle for a blurry approximation when we can have the real thing? But what we are seeing in the example of LLMs making sense of highly degraded texts is that what they have learned is an astonishingly powerful compression scheme. LLMs have not learned a blurry version of the web, they have learned patterns that allow them to deblur it."
This hit me in the gut a bit, as I have a paper under consideration for a conference dealing entirely with the idea that LLMs should be considered edge-detection algorithms. In that paper, I document how applying the softmax function to (literally) blurry images produces an effect akin to the Find Edges filter in Photoshop. This suggests that LLMs find the blurry edges of language and sharpen them. But this sharpening is the result of abstracting the language of the training data and then sharpening it against the user's prompt.
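That experiment isn't reproduced here, so the following is only a stripped-down sketch of the general idea, under my own simplifying assumptions: apply a low-temperature softmax over each local pixel window of a blurred grayscale image and keep the center pixel's deviation from a uniform share of the probability. In flat regions every pixel in the window gets roughly the same share; across a blurred transition the shares swing, and plotting that deviation picks out something edge-like.

```python
# Toy construction: a softmax over each local pixel window behaves like a
# crude edge detector on a blurred image. A simplified illustration, not the
# actual procedure from the conference paper described above.
import numpy as np

def softmax_edges(image, window=3, temperature=0.1):
    """Return |p_center - uniform| for a softmax over each local window."""
    pad = window // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    n = window * window
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            scores = padded[i:i + window, j:j + window].ravel() / temperature
            scores -= scores.max()                    # numerical stability
            probs = np.exp(scores) / np.exp(scores).sum()
            out[i, j] = abs(probs[n // 2] - 1.0 / n)  # deviation from uniform
    return out

# A blurred vertical step edge: the response is largest around the transition
# and falls toward zero in the flat regions on either side.
x = np.linspace(0, 1, 32)
blurry = np.tile(1.0 / (1.0 + np.exp(-(x - 0.5) * 10)), (32, 1))
edges = softmax_edges(blurry)
print(f"near the step: {edges[16, 15]:.4f}   far from it: {edges[16, 2]:.4f}")
```

The numbers matter less than the shape of the response: a blurred step comes back as a peak at the transition, which is all "find edges" means here.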
The model is capable of transforming text, not writing it. The prompt, then, provides the underlying structure of the response: the prompt is where the meaning is made, and it is what every subsequent inference is compared against. The prompt, notably, comes from a human being capable of using language, establishing the rules of the response. All of the possibilities of the latent space must fit into the structure set by the prompt. This is a mathematical process, a process of probabilities, not reason. It is nonetheless remarkable that so many inferences of meaning can be made from so little.
What frustrates me in the ongoing conversation about AI is this: we continue to have this weird argument over the limits of the stochastic parrot metaphor, mostly because of misplaced fixations on what parrots do. The fundamental mechanism the metaphor describes is appropriate and correct, but what emerges from that mechanism can still be complicated and deserving of attention. The fixation on proving that LLMs are conducting something "akin to thought" could just as easily be reframed as a question of why they are capable of resembling thought, and this paper raises some interesting questions in that arena – questions that remain interesting even if we dismiss the idea of thinking machines in favor of thinking about them.