The ability to generalize from individual instances or to reason inductively is a key part of human intelligence. It allows us to reason about everything around us, and just as importantly, about what could be.
Unlike with deductive reasoning, it is hard to determine how well LLMs reason inductively. Understood one way, LLMs are the superstars of inductive reasoning. They are fed trillions of pieces of data, neither sorted, tagged, nor otherwise organized, and they identify patterns within it, patterns that far exceed our own limited understanding of the same data. If induction is defined as nothing more than generalizing from discrete instances, then this extraordinary feat of pattern detection shows that LLMs are superb inductive reasoners.
Pattern detection alone, however, is not what is generally meant by inductive reasoning. Inductive reasoning requires the ability to generalize from novel instances, not only those previously experienced. Moreover, some understanding of why the generalization is explanatory is also necessary, especially if any causal explanations offered by LLMs are to be trusted.
But little rides on whether pre-training itself counts as inductive reasoning. What matters is what trained models can do. Here, researchers are surprisingly divided on the ability of LLMs to reason inductively. To begin with, there is blanket skepticism about the ability of LLMs to reason at all. Because we don’t know how a generalization is produced, we don’t know whether it is merely the product of subtle patterns detected in the training data, and, as with pre-training, that would not be evidence of inductive reasoning.
Skepticism of this order requires more than gesturing at the possibility; it requires evidence. The real problem with determining whether LLMs can reason inductively is … well… us. We haven’t really decided what counts as proof of inductive reasoning. The reflex of the LLM research community is to seek out a benchmark to test the skill, but even within this narrow definition of what counts as proof, the answer to the question is surprisingly messy.
Yang et al. (2022) dive deep into the question of whether LLMs can generalize from instances, focusing on the problem of inducing a rule from a set of facts stated in natural language. They develop a new test (DEER) and introduce a framework (COLM) for improving performance on it by combining a rule-generation model with filtering mechanisms. The results are meh. Some models can manage simple generalizations but fail at those that require abstraction or complex reasoning.
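To make that setup concrete, here is a minimal sketch of what a generate-then-filter loop for natural-language rule induction might look like. It follows the description above (a rule-generation step plus a filtering step) rather than the paper’s actual pipeline; the prompts, the toy facts, and the `complete` callable are my own illustrative choices.

```python
# Hypothetical sketch of generate-then-filter rule induction from
# natural-language facts; prompts and facts are illustrative, not DEER's.

FACTS = [
    "Robins have feathers and can fly.",
    "Sparrows have feathers and can fly.",
    "Swallows have feathers and can fly.",
]

def generate_rules(complete, facts, n_candidates=5):
    """Ask the model (via `complete`, any prompt -> text function) for candidate rules."""
    prompt = ("Facts:\n" + "\n".join(f"- {f}" for f in facts) +
              "\nState one general rule that explains all of these facts.")
    return [complete(prompt) for _ in range(n_candidates)]

def rule_survives(complete, rule, facts):
    """Crude filter: check, fact by fact, that the rule is consistent."""
    verdicts = [complete(f"Rule: {rule}\nFact: {f}\n"
                         "Is the fact consistent with the rule? Answer yes or no.")
                for f in facts]
    return all(v.strip().lower().startswith("yes") for v in verdicts)

def induce(complete, facts=FACTS):
    """Generate candidate rules, then keep only those that survive filtering."""
    return [r for r in generate_rules(complete, facts)
            if rule_survives(complete, r, facts)]
```

Even in this toy form, notice what the test rewards: producing a rule that covers the given facts, nothing more.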
More recent research, from March of 2024, is equally ambivalent about the ability of LLMs to reason inductively. Bowen et al. (2024) set out a more ambitious definition of what induction requires, invoking not only generalization but also the ability to apply and validate the generalization and to integrate the newly discovered information into what the model already knows. Consequently, the benchmark they develop is more sophisticated, testing these multiple dimensions of inductive reasoning: rule generation, application, validation, and finally integration.
Remarkably, they find that rule generalization is impacted by the number of instances. Presented with few instances of a phenomenon—say polygons of a specific shape or color—LLMs can successfully induce a rule, but presented with a lot of polygons, even in situations where they have already “learned” the relevant rule, LLMs fail to deliver the expected generalization.
And of course without the ability to robustly generate a rule, the other dimensions of inductive reasoning fare no better. It is awfully hard to consistently apply, validate or integrate a rule that you can’t consistently generate.
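A toy probe in the spirit of that finding would hold the hidden rule fixed and vary only the number of examples the model sees. The data generator, prompt wording, and scoring below are my own illustrative choices, not Bowen et al.’s benchmark.

```python
# Hypothetical probe: same hidden rule ("everything is red"), different
# numbers of example polygons. Scores how often the induced rule still
# mentions the target property as the example count grows.
import random

SHAPES = ["triangle", "square", "pentagon", "hexagon"]

def make_examples(n, rule_color="red"):
    """Every positive example shares one color; shape is a distractor."""
    return [f"a {rule_color} {random.choice(SHAPES)}" for _ in range(n)]

def build_prompt(examples):
    listing = "\n".join(f"- {e}" for e in examples)
    return (f"The following objects all satisfy a hidden rule:\n{listing}\n"
            "State the rule in one sentence.")

def probe(model_call, counts=(3, 10, 50), trials=20):
    """model_call: any function taking a prompt and returning the model's text."""
    results = {}
    for n in counts:
        hits = sum("red" in model_call(build_prompt(make_examples(n))).lower()
                   for _ in range(trials))
        results[n] = hits / trials
    return results
```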
In contrast to these naysayers, Cheng et al. (2024) are quite chuffed about the ability of LLMs to reason inductively. They argue that what previous testers have failed to do is properly isolate deductive from inductive reasoning. Once the two are separated, the real problem, according to these authors, turns out to be the weak ability of LLMs to reason deductively, especially in counterfactual situations. Inductively? LLMs perform superbly.
Cheng and his co-authors introduce a new framework, SolverLearner, that isolates the ability of an LLM to learn an input-output mapping function from a series of instances. In other words, generalization is operationalized as the identification of the correct mapping function, something that LLMs evidently excel at.
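In rough outline (the toy task, prompt wording, and helper names here are my own illustration, not the paper’s code), the idea can be sketched like this: the model is asked only to propose a mapping function from example pairs, and the Python interpreter, not the model, applies it to held-out inputs, so the deductive step is taken out of the model’s hands.

```python
# Minimal sketch of induction operationalized as learning a mapping
# function: the LLM proposes f from examples; Python, not the LLM,
# executes f on held-out inputs. Task and prompts are illustrative.

EXAMPLES = [(2, 4), (5, 10), (7, 14)]    # toy mapping: y = 2 * x
HELD_OUT = [(9, 18), (12, 24)]

def build_prompt(examples):
    pairs = "\n".join(f"f({x}) = {y}" for x, y in examples)
    return (f"{pairs}\nWrite a Python function f(x) that reproduces this "
            "mapping. Return only the code.")

def evaluate(model_call):
    """model_call: any function taking a prompt and returning generated code."""
    code = model_call(build_prompt(EXAMPLES))
    namespace = {}
    exec(code, namespace)                # in practice, sandbox untrusted code
    f = namespace["f"]
    return all(f(x) == y for x, y in HELD_OUT)
```

Success here shows only that the model can name a function that fits the examples; whether that counts as induction is precisely what is at issue.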
Cheng et al.’s results are not exactly surprising: we already know from Bowen et al. that LLMs are pretty good at rule generation in limited contexts. But, like Chen and his co-authors, I too think that induction is much more than rule generation, and definitely more than the identification of mapping functions.
The core of the issue is that we do not have an agreed-upon method to determine whether something is proof of inductive reasoning. We have the Turing test problem.
Prior to the arrival of generative AI, the Turing test was often touted as the way to determine whether a machine was intelligent. The test is simple: a human must hold a conversation in writing with a computer on any topic for any length of time. If the human interlocutor is unable to determine whether their partner is human or machine, then the computer has passed the test for intelligence.
The test is based on how you determine whether those around you are human or machine. You talk to your neighbour, their responses seem fine, you think, “Human”.
The problem is that generative AI blows the test out of the water. Generative AI passes the test a lot of the time. That is why many turn to chatbots for entertainment, to discuss problems, to seek advice, and even for simple companionship. It is also why chatbots have been a source of heartache and why some think they might actually be “sentient”.
It’s not that generative AI can pass the Turing test every time. As Bender and Koller (2020) argue, communication will eventually break down if the topic requires too much specialized knowledge about how the world works. But frankly, you can’t pass the Turing test a lot of the time either. In an emergency, such as a nuclear reactor meltdown, your conversational powers may be reduced to babble as you panic and draw on what little you know about what to do.
We can’t require that generative AI pass the Turing test all of the time, nor is it enough to ask that it pass only some of the time (as it already does). It just doesn’t work as a test of intelligence.
Similarly, benchmarking just isn’t a good test of inductive reasoning. A test, with its requirement of an objectively right answer, is inevitably limited to probing only certain aspects of the complex behaviour that is induction. It’s great that LLMs can identify a mapping function, but it proves only one thing: that LLMs can identify mapping functions.
Moreover, much as with the Turing test, failing the test doesn’t mean that the technology cannot reason inductively. It may be a failure of knowledge. Perhaps the question pulls on knowledge, whether applied or common sense, that the machine lacks. The same is true of you. Perhaps the induction required is ingenious, something that many humans would fail to pull off, and yet no one believes that humans are unable to reason inductively.
So what is needed? I don’t know, but at the very least a method to determine whether generative AI can reason inductively would need to be:
Complex - it should measure the range of attributes associated with induction.
Novel - it should require inducing generalizations from data the model has not seen before.
Plausible - explanatory adequacy matters some of the time, so any causal mechanism must make sense.
Without at least these three qualities, no test of inductive reasoning will determine whether generative AI can generalize over instances.