Their reasoning is fake
Yesterday at work someone sent me the PDF of a spreadsheet. I had asked them for the spreadsheet so I could get at the data, but they sent me a PDF, a file format that essentially captures data as sent to a printer. Now I had two options:
- Bother them again and wait for a real spreadsheet.
- Have ChatGPT reconstitute the spreadsheet for me; I had heard LLMs can make sense of PDF files.
I chose the second option because it seemed essentially free: just type something like "Extract a CSV containing the data in the attached PDF of a spreadsheet" and visually check a fraction of the result.
One hour and many attempts later, I gave up. The news that LLMs can read PDFs is terribly exaggerated.
- Gemini hallucinated half of the data. It hallucinated so much that I thought it might be sabotaging me. It hallucinated so hard that I don't think I can trust an LLM to do a well-defined data conversion operation ever again.
- ChatGPT complained that the job was big and gave me a sample that looked all right. The sample renewed my hope, so I told it to finish the job. ChatGPT broke the process down into several tasks so it would have enough memory/tokens to finish. The result was a spreadsheet that contained only &%#2&)%)6*, meaning it probably lost track of the source format at some point along those broken-down tasks.
I had to give the idea up. It was taking too long, and every result was hope-shattering.
But the thing to notice is this: if one-shotting was bad because of hallucinations, trusting the thing to self-manage and break the job into tasks led to even worse results. As if, over the longer process, the hallucinations compounded.
Watch managers and politicians talk about "AI" today: the trust placed in the tech could not be higher. They are amazed by its miraculous power – enough to fire thousands of workers (who are bound to make a triumphant comeback). I wonder if these managers have ever tried to do actual work with these things.
LLMs now produce text that looks like reasoning, not because they are reasoning – they aren't – but as part of a "fake it till you make it" sort of strategy: the hope is that training the things on "reasoning" text might lead them to sorta kinda reason. What is that text like? It is chock-full of things like "But wait,", "Aha!", "I need to go back and reconsider", etc. But really, that's just fakery: it's not really thinking, it's just yet again predicting the next word – this time, in a new literary genre.
There is a fundamental difference between thinking and fake-thinking, and it dramatically affects the correctness of the final outcome. For instance, an LLM can invert its conclusion/result/answer if you add a single irrelevant sentence to the prompt.
So I have been trying to find a way to explain that "reasoning models" only fake-reason. He who searches... finds.
Of all places, it was in a YouTube comment that I came upon an explanation, much better and more comprehensive than my own attempts, of the fundamental difference between a human's reasoning and an LLM's.
On this video, @user-pw2ro7gt4r commented:
Yes, and the use of terminology in some schools of thought to justify calling them reasoning agents is a real pet peeve of mine, because while one can make a technical argument on that basis, those doing so understand full well that isn't how the general public intuitively understands the term "reasoning" and why these models don't actually reason in the ways we mean.
In the human intuitive or personal level sense, reasoning means something much richer than what any LLM can do. It means a subject with a point of view deliberately attends to its own thoughts, evaluates them, notices uncertainty, manipulates internal models, checks itself, changes strategy, and understands itself as trying to get something right. This is not just "outputs follow from inputs." It includes real agency, conscious access, metacognitive control, intentional evaluation, and a stable first-person relation to one's own thought process. That's what most people mean when we say we "reason" about things.
In the technical/computational sense though, some will argue that reasoning means something like: producing outputs that functionally serve as conclusions, intermediate steps, plans, classifications, proofs, or decisions by operating over representations according to some procedure. In that strict sense, an LLM can appear to "reason" if it can take premises, transform them, combine them, infer consequences, produce output that functionally appears to solve problems, or approximate valid patterns of inference.
That's the sense used in phrases like "automated reasoning," "reasoning benchmarks," "chain-of-thought reasoning," or "reasoning models." It is largely a functional or behavioral notion: does the system produce outputs that look like the result of inference? It's that "look like" bit that is the whole ball game though, and the fact that they call these systems "reasoning models" without making the distinction clear is a real problem. The distinction matters not just philosophically, but practically, because the two kinds of reasoning have different failure modes, and different causes for them.
A human reasoner can still be wrong, biased, confused, or self-deceived of course. But that's not the point. When a human says, "I used this assumption, then realized it was false," there is at least a normal causal connection between report, attention, memory, and deliberate self-correction. With an LLM, a stated "reason" is often just another generated output. It may be useful, but it is not guaranteed to be a faithful transcript of an inner deliberative process. That matters for debugging, trust, accountability, and confidence calibration.
For example, when an LLM gives a plausible explanation for an answer, the explanation may be entirely post hoc. It may not reveal why the model produced that answer. This is not merely philosophical. It affects whether you should trust the explanation, whether you can audit the system, whether you can rely on its self-assessment, and whether asking it "are you sure?" meaningfully probes its actual basis for confidence.
It also matters for robustness. Human-style reasoning involves flexible goal tracking: "What am I trying to prove?" "Does this answer the question?" "Did I change definitions halfway through?" LLMs can simulate these checks and often do them in ways that appear successful, but the checks are themselves generated behavior, not governed by a stable executive understanding. So a model can perform impressively on one version of a problem and fail on a slightly altered version (or even the very same problem in a different instance) because it is pattern-completing around the surface form rather than deliberately maintaining the same abstract object of thought.
It matters for metacognition. A human reasoner can often distinguish, however imperfectly, between "I know," "I infer," "I am guessing," "I remember," and "I am confused." LLMs can produce those labels, but the labels are not grounded in the same kind of first-person access. Their confidence expressions are outputs to be calibrated, not introspective reports to be trusted at face value.
It matters for planning and agency. A human can form an intention, pursue it across time, notice conflict between subgoals, and revise priorities. An LLM in a chat can generate plans and act through tools and claim to be doing the same things, but unless embedded in a larger agentic system (and often even then), it does not have an enduring practical standpoint. Even when embedded in such a system, the agency is engineered through scaffolding: memory, tool calls, reward signals, monitoring, state management, and external control loops. That may produce useful agent-like behavior, but it is not stably binding referents in the same way a human reasoner does, and epistemic drift even on the same task can happen.
TLDR: There is nothing there to "do the reasoning" or the understanding in the first place, in the way humans intuitively understand those terms, and so these models' failure modes differ from the ways humans fail at cognitive tasks, and those differences have real world consequences for what these models can and can't do on a practical level. Yet people are prematurely placing their trust in them for "reasoning tasks" nevertheless.
That is a brilliant and fair explanation. It clarifies one way in which the "AI" product is fake – it reasons as much as a parrot does – and highlights why the way people are now trying to use "AI" is bound to fail.
It explains why parrots fail at a data conversion operation that can be well-defined in a single line of text, even though they have encyclopedic "knowledge" of the file formats involved.
It explains why LLM creators are desperate to infringe every human right in their thirst for more data. Give me all your works; that wasn't enough, so now give me all your keystrokes and mouse clicks; give me all your thoughts if physically possible. Because the illusion of a miracle comes only from the training data: as soon as something is not covered by the training data, the parrot says something random and useless.
LLMs are good for reworking text. Nothing else. Certainly not reasoning or self-control/self-management.
Data conversion appears to be just "reworking text", but often involves much more.
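To make "much more" concrete: even the deterministic route has to recover structure, not just text. Here is a minimal sketch, assuming the PDF still has a real text layer (not a scan, which would need OCR first) and using the open-source pdfplumber library; the file names are hypothetical.

```python
# A sketch of deterministic PDF-to-CSV table extraction.
# Assumes: the PDF has a real text layer and the third-party
# pdfplumber library is installed; file names are hypothetical.
import csv

import pdfplumber

with pdfplumber.open("spreadsheet.pdf") as pdf:
    with open("spreadsheet.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for page in pdf.pages:
            # extract_tables() infers rows and cell boundaries from the
            # positions of ruling lines and text on the page; each table is
            # a list of rows, each row a list of cell strings (or None).
            for table in page.extract_tables():
                for row in table:
                    writer.writerow(["" if cell is None else cell for cell in row])
```

Reconstructing those rows and cell boundaries from positions on a page is the real work of the conversion, and it is precisely the part the parrot can only imitate.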
