The Thinking Machines That Weren’t

Brian J. Dellinger

In 1955, months before anyone had heard of “artificial intelligence,” Herbert Simon and Allen Newell began work on a new kind of computer program. Their algorithm, called “Logic Theorist,” could prove mathematical theorems — at times more elegantly than the humans who had proved them first. The result was incredible, but so was Simon’s description: “Over Christmas, Allen Newell and I invented a thinking machine.”

The field of AI has seen brilliant discoveries, and Logic Theorist was one of the first. But AI scientists have always been over-eager to declare victory — to announce that, at last, we’ve built machines that “really” think.

That brings us to the latest version of generative AI (or “genAI”): the Large Reasoning Model. Most people are familiar with Large Language Models (LLMs) like GPT-4 or Claude 3. Underneath the technical complexity, LLMs work on a simple principle. First, programmers train a model on examples of human writing. Over time, the AI identifies words that go together, until it statistically “learns” the patterns of human language: if these words have been said, then that word comes next. Start a sentence with “Once upon…,” and the computer predicts “a time.” (RELATED: Mom, Meet My New AI Girlfriend)
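To make the word-prediction idea concrete, here is a toy sketch in Python: a tiny frequency table built over a made-up corpus. The corpus and function names are invented for illustration, and a real LLM uses a neural network over tokens rather than a literal lookup table, but the principle is the same: given the words so far, pick the statistically likely next word.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, standing in for the vast text an LLM is trained on.
corpus = (
    "once upon a time there was a princess . "
    "once upon a time there was a dragon . "
    "once upon a midnight dreary i pondered ."
).split()

# Count which word follows each two-word context (a simple trigram model).
follows = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    follows[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the most frequently observed next word for a two-word prompt."""
    counts = follows.get((w1, w2))
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("once", "upon"))  # -> 'a'
print(predict_next("upon", "a"))     # -> 'time' (seen twice, vs. 'midnight' once)
```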

But word-prediction has problems. LLMs tend to hallucinate, stringing together plausible chains of words about events that never happened. Similarly, a genAI can regurgitate familiar mathematical arguments, but might struggle to solve fresh problems that use the same reasoning.

Users noticed that LLMs did better if they were asked not just for answers, but for the logic behind those answers. In effect, the models were asked to generate part of an answer, then evaluate their own solution. If the model rejects its first draft, it writes another, and so on. The resulting chain of thought often looks shockingly human, complete with statements like “Hm, that won’t work,” or “Aha, but what if I…”
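As a rough illustration of that draft-and-check loop, here is a minimal Python sketch. The names (model_complete, solve_with_self_check) and the canned replies are assumptions made for illustration only; a real reasoning model folds this behavior into a single stream of generated text rather than an external loop.

```python
import random

def model_complete(prompt: str) -> str:
    """Stand-in for an LLM's text generator (hypothetical, not a real API).

    Returns canned text so the loop below actually runs."""
    if "Is this correct?" in prompt:
        return random.choice(["yes", "no, the arithmetic looks off"])
    return "Try 6 * 7 = 42."

def solve_with_self_check(question: str, max_drafts: int = 3) -> str:
    """Generate a draft, ask the model to judge it, and retry on rejection."""
    transcript = f"Question: {question}\nLet's think step by step.\n"
    draft = ""
    for _ in range(max_drafts):
        draft = model_complete(transcript + "Draft answer: ")
        verdict = model_complete(transcript + draft + "\nIs this correct? ")
        transcript += draft + "\n"
        if verdict.lower().startswith("yes"):
            return draft                      # the model accepted its own draft
        transcript += "Hm, that won't work. Let me try again.\n"
    return draft                              # best effort after max_drafts tries

print(solve_with_self_check("What is 6 times 7?"))
```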

The new generation of Large Reasoning Models (LRMs) — programs like OpenAI’s o1 or DeepSeek’s R1 — were explicitly designed to give chain-of-thought responses. As expected, they scored higher on reasoning benchmarks like math puzzles or the LSAT. Increasingly, researchers describe LRMs in anthropomorphic terms like thought and reasoning. (It’s hard to avoid that language even here!) (RELATED: Is This the Stupidest Sentence of 2025?)

Now, a paper called “The Illusion of Thinking” threatens those descriptions. Researchers from Apple guessed that LRMs might suffer from an old AI problem: data contamination. To test an AI’s reasoning ability, we generally want it to work through new problems. In a contaminated data set, the machine is accidentally exposed to the answers to those problems during training. What looks like reasoning, then, might be the algorithm echoing examples it’s “memorized” — complete with justifications along the way.

To test their theory, the researchers gave the LRMs logic puzzles at increasing levels of difficulty. At low levels, the LRM wasted resources as it searched for complicated solutions. In a midrange, it came into its own, answering problems that stumped traditional LLMs. With high-complexity problems, though, all of the AIs collapsed; the LRMs sometimes simply refused to answer, or offered to answer different questions instead.
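The general shape of such an experiment can be sketched in a few lines of Python: generate one family of puzzles at growing sizes, ask the model for a solution, and score it against a known answer. The ask_model stub below is hypothetical, and this is not the Apple team's actual harness or puzzle set; Tower of Hanoi is used here only because its difficulty scales cleanly, with the optimal solution doubling for every added disk.

```python
def hanoi_solution(n, src="A", aux="B", dst="C"):
    """Optimal Tower of Hanoi move list for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    return (hanoi_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_solution(n - 1, aux, src, dst))

def ask_model(puzzle_prompt):
    """Hypothetical model call; swap in a real API to run the experiment."""
    return []  # placeholder answer

def score(n):
    prompt = f"Solve Tower of Hanoi with {n} disks; list the moves."
    answer = ask_model(prompt)
    return answer == hanoi_solution(n)

for n in range(3, 12):  # difficulty grows exponentially with n
    print(n, 2**n - 1, "correct" if score(n) else "failed")
```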

That result fits well with the original guess. If an LRM suffers from data contamination, it might memorize solutions to smaller, more common versions of a problem. When the problems scale up to more complex versions, for which the AI may be missing good examples, it suddenly fails.

There are still questions around the paper, with some critiquing its methods or conclusions. (For instance, in some cases, the AI may have rightly refused to solve impossible problems.) Yet the authors show the same failures under different conditions, and other researchers have described similar issues with LRMs.

Perhaps all this should be expected. At heart, chain-of-thought AIs are still doing the same thing that LLMs did before: they look at the words generated so far and predict what a human would say next. It’s just that, with an LRM, the system makes several attempts along the way. To a point, that seems to work; maybe the words of a decent draft push the AI toward a better final answer.

But this verisimilitude can make the chain seem more reliable — more human — than it is. An AI might declare “Aha,” not because of some internal breakthrough, but because its human readers expect an exclamation at that point. In fact, there’s some evidence that human readers may depend on the chain more than the AI itself does. At least one paper has shown that LRMs often reach conclusions for reasons that never appear in their explanations. (RELATED: Regarding AI, Is Sin Contagious?)

The thesis of LRMs seems to be that the outward signs of reasoning are interchangeable with reason itself. But that’s at odds with our own experience of thinking: of working toward justified beliefs, not merely toward a likely chain of words. If Apple is right, it may turn out that even the most sophisticated illusion is no substitute for the real thing.

READ MORE from Brian Dellinger:

The Promise and Peril of DeepSeek

Making Friends: AI and Companionship

An Eye on AI: Five New Things to Watch in October