AI & Analogies: What Works, What Breaks
A 2025 status check
Analogies are one of the most frequently used types of arguments. Hofstadter, as quoted by Melanie Mitchell, defines analogy as “…the perception of the common essence between two things.” Philosophers are less settled on how to define analogies, but what matters is that they are essential to our ability to make sense of the world.
Humans are very, very good at analogical reasoning. Take this famous and beautiful example:
But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon,
Who is already sick and pale with grief,
That thou her maid art far more fair than she:
It is remarkable that this comparison (young woman = burning ball of plasma) makes any sense at all, and yet, this imagery has crystallized into shorthand for besotted love.
Our ability to extract meaning from comparisons such as this is rooted in an innate capacity. It is, according to Melanie Mitchell (and others), one of the defining features of human intelligence. Because of its centrality to our success, Mitchell argues that without it, LLMs cannot be considered intelligent, let alone scale the heights of Artificial General Intelligence (AGI). Understanding the ability of LLMs to reason analogically, then, is key to determining whether they are intelligent.
Mitchell’s informative post offers a good overview of our understanding of LLMs’ ability to reason analogically as of May 2024. She begins by describing one of the gold standards for testing this ability: Hofstadter’s letter-string problems. These include questions such as: If abc changes to abcd, what would be the analogous change for pqr? You may think that the answer is pqrs, but these questions do not have a “correct” answer. (For example, another potential answer is pqrd.) Nevertheless, humans tend to converge on certain solutions by recognizing abstract roles or properties to apply (here, for example, append the successor of the last letter). In fact, we do this so easily that analogies barely register as thinking.
LLMs typically fare worse. At least they did until 2023, when a major study found that GPT-3 did pretty well answering these questions, sometimes outperforming undergraduates. The results were encouraging for OpenAI, but Mitchell and her collaborator, Martha Lewis, were skeptical, and they set out to replicate the study. On the simplest problems, puzzles that use the standard English alphabet in its usual order, the replication produced mixed results: GPT-3, GPT-3.5, and even the then state-of-the-art GPT-4 did reasonably well, but they did not outperform humans.
But Mitchell and Lewis went further. They designed a series of tasks meant to probe whether models reason abstractly or merely call up patterns from their training data. The most telling experiment involved replacing the familiar alphabet with a completely invented set of symbols (e.g., [% * &]). Humans adapted quickly, applying general concepts like “successor” to solve the problem. LLMs, however, flubbed it. Their accuracy dropped sharply once the problems involved these made-up alphabets.
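To make the abstraction concrete, here is a minimal sketch of the “successor” rule as a procedure over an arbitrary ordered alphabet. It is my own illustration, not code from either study, and the function name and invented symbols are placeholders. The point is that the rule refers only to positions in an ordering, so it applies unchanged whether the alphabet is a–z or a set of made-up symbols.

```python
# Minimal illustration of the abstract "successor" rule behind letter-string
# analogies ("abc -> abcd, so pqr -> ?"). The rule is defined over positions
# in an ordering, not over particular symbols, so the same function handles
# the Latin alphabet and an invented one.

def append_successor(s: str, alphabet: list[str]) -> str:
    """Return s with the successor (in `alphabet`) of its last symbol appended."""
    idx = alphabet.index(s[-1])
    return s + alphabet[idx + 1]  # assumes the last symbol isn't the alphabet's final one

latin = list("abcdefghijklmnopqrstuvwxyz")
invented = ["%", "*", "&", "#", "@", "!"]  # made-up symbols in a made-up order

print(append_successor("pqr", latin))     # -> pqrs
print(append_successor("%*&", invented))  # -> %*&#
```

Humans grasp this positional rule almost immediately; what Mitchell and Lewis found is that the models often failed to carry it over once the symbols changed.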
The authors of the 2023 study did not sit still. They responded with a working draft (since published) showing that if GPT-4 is allowed to use code that lets it count and keep track of steps, its performance improves. This may not mirror how humans solve the task, but it does suggest that LLMs are not relying purely on training patterns. Mitchell and Lewis remain unconvinced: the workaround, they argue, doesn’t test what we are really interested in, namely the ability of LLMs to abstract relationships in general.
Their takeaway: claims about AI reasoning need to be stress-tested with carefully designed, out-of-distribution (OOD) problems. To be “out of distribution” means that the examples differ in form, context, or causal structure from the model’s training data. Pattern-matching collapses here, but real understanding, based on conceptual generalization, does not.
Mitchell’s verdict in 2024 was that GPT-3 and its successors lacked a genuine capacity for analogical thinking. By that measure, they were not intelligent, and would not soon become so.
But that was more than a year ago, and quite a lot has happened since the publication of those results. We now have GPT-5 and a slew of other updated models, such as Claude Sonnet-4 and Gemini-2.5. I was curious to see whether these newer models were any better than GPT-3 at analogical reasoning.
I had an LLM look into the question. It turned up glimmers of hope here and there, but the bottom line is that there’s no clear evidence that newer models have improved.
There doesn’t seem to be any follow-up research on the letter-string problems since the exchange noted above. (If you’re aware of some, please let me know!) That said, analogical reasoning remains an active area of study. For example, Musker et al. (2025) find that models such as GPT-4o and Claude Sonnet-3.5 do as well as or better than the average child on simple verbal analogies. Two caveats, though. Models can get very far on this task just by picking out the word that commonly goes with the target in their training data. And the errors they make are different from children’s. So although their performance is good, there is reason to believe that these models are not abstracting from general concepts.
Taken together, studies like these show why researchers remain fascinated by analogical reasoning: it sits at the heart of the debate over whether LLMs are truly intelligent. That’s important for many in the AI community, but it’s not really the question many of us care about. As users, we’re less interested in whether these models think like humans and more in how they can help us think better. The real question isn’t “Are they intelligent?” but “How can I use them to make my arguments more intelligent?”
My read is that LLMs can be genuinely helpful for constructing analogies, but only if you guide them carefully. For common analogies, the type that many of us traffic in, chatbots are a great resource. Their better-than-average ability to make associations can help you tighten an analogy or craft a new one. The place to be careful is out-of-the-box thinking.
The ability of models to reason analogically is brittle. Once the items being compared differ too much in form, content, or causal structure, you’re out of luck. You have strayed “out of distribution,” and are in terra incognita. When OOD, do not trust the results of any model, even the most advanced.
Ah, but how can you tell whether your analogy is out of distribution? You don’t know what’s in these models’ training data. This is where testing matters. If you are wandering away from common associations in language, probe the robustness of your analogy. Here are some moves you can make with your thinking partner (a minimal scripted version of the first three appears after the list):
Explicitly ask the chatbot for analogies that are conceptually near, medium, and far. Continue until you identify a candidate that you like. This can help you assess how likely it is that your analogy is OOD.
Ask for an analysis of your candidate, prompting the bot to create a table listing information such as the entities under consideration, concepts at play, relations, and constraints.
Be crafty. List your candidate and then prompt the bot to provide three disanalogies, i.e., examples where the mapping would break. If it fails this task, its grasp of the original analogy is likely weak.
Best of all, give it a relevant example before asking it to generate its own. Tell the bot that your example is a schema, not precise wording. Ask it to state exactly what changes it has made to the relations and why the analogy still holds.
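If you want to run these checks repeatably, here is a minimal sketch of the first three moves wired up as prompts. It assumes the OpenAI Python SDK and a placeholder model name; the prompt wording and the example analogy are my own, and any other chat provider or interface would do just as well.

```python
# Sketch of the analogy stress tests described above, wired up as prompts.
# Assumes the OpenAI Python SDK (`pip install openai`) with an API key in the
# environment; the model name, prompt wording, and example analogy are
# placeholders to adapt.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # substitute whichever model you are working with

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

candidate = "The national economy is like a household budget."  # your analogy here

# 1. Near / medium / far alternatives, to gauge how far your candidate sits
#    from the common associations the model has seen.
print(ask(f"Consider this analogy: '{candidate}'. Offer three alternative "
          "analogies for the same idea: one conceptually near, one medium, "
          "one far. Label each and explain the mapping briefly."))

# 2. A mapping table: entities, concepts, relations, constraints.
print(ask(f"For the analogy '{candidate}', produce a table with columns "
          "Entity, Concept, Relation, and Constraint showing how the two "
          "domains map onto each other."))

# 3. Three disanalogies, i.e., places where the mapping breaks.
print(ask(f"Give three disanalogies for '{candidate}': specific respects in "
          "which the mapping fails, and why each failure matters."))
```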
In short, test the output thoroughly. The process of questioning suggestions not only safeguards against weak analogies but also deepens your understanding of what it is you are trying to argue.
