I again played a little with my code that implements a functional user interface to play chess with language models.
This time around, I tried to play chess with GPT-5. The model played reasonably, roughly at my level as an amateur: it knows the rules, but its reasoning is superficial and it loses even against a weak machine opponent (GNU Chess at its lowest level).
Tellingly, it is strong in the opening, where it can rely on its vast knowledge of the chess literature, but it becomes weak in the middlegame.
In my implementation, the model is asked to reason and then move. It comments as it reasons. When I showed the result to another instance of GPT-5, it made an important observation: language models have rhetorical competence, but little tactical competence.
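For concreteness, a minimal sketch of such a reason-then-move loop might look like the following. This is not the actual code behind my interface; it is a sketch that assumes the python-chess library and a hypothetical ask_model() wrapper around whatever chat-completion API is in use.

```python
import chess

def ask_model(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion call.
    The reply is expected to contain free-form commentary and to end
    with a move in UCI notation (e.g. "e2e4") on its last line."""
    raise NotImplementedError("wire this to the LLM API of your choice")

def reason_then_move(board: chess.Board) -> chess.Move:
    """Ask the model to comment on the position, then extract its move."""
    prompt = (
        f"You are playing chess. Current position (FEN): {board.fen()}\n"
        "Briefly explain your reasoning, then give your move in UCI "
        "notation on the last line."
    )
    reply = ask_model(prompt)
    print(reply)  # the model's running commentary
    move = chess.Move.from_uci(reply.strip().splitlines()[-1].strip())
    if move not in board.legal_moves:
        raise ValueError(f"model proposed an illegal move: {move.uci()}")
    return move

if __name__ == "__main__":
    board = chess.Board()
    while not board.is_game_over():
        board.push(reason_then_move(board))  # the model's move
        # the opponent's reply (e.g. from GNU Chess) would be pushed here
        break  # placeholder: this sketch stops after one ply
    print(board)
```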
This, actually, is a rather damning statement. It implies that efforts to turn language models into autonomous “reasoning agents” are likely misguided.
This should come as no surprise. Language models learn, well, they learn language. They have broad knowledge and can be extremely useful assistants at a wide variety of tasks, from business writing to code generation. But their knowledge is not grounded in experience. Just as they cannot reliably track the state of a chessboard, they cannot analyze the consequences of a chain of decisions. The models produce plausible narratives, but these narratives are often hollow shells: there is no real understanding of the consequences of decisions.
This is well in line with recent accounts of LLMs failing at complex coordination or problem-solving tasks. The same LLM that writes a flawless subroutine under the expert guidance of a seasoned software engineer often produces subpar results in a “vibe coding” exercise when asked to deliver a turnkey solution.
My little exercise using chess offers a perfect microcosm. The top-of-the-line LLM, GPT-5, knows the rules of chess, “understands” chess. Its moves are legal. But it lacks the ability to analyze the outcome of its planned moves to any meaningful depth: thus, it pointlessly sacrifices its queen, loses pieces in reckless moves, and ultimately loses the game even against a machine opponent set to its lowest level. The model’s rhetorical strength is exemplary; its tactical abilities are effectively non-existent.
This reflects a simple fact: LLMs are designed to produce continuations of text. They are not designed to perform in-depth analysis of decisions and consequences.
The inevitable conclusion is that attempts to use LLMs as high-level agents, as orchestrators of complex behavior without external grounding, are bound to fail. Treating language models as autonomous agents is a mistake: they should serve as components of autonomous systems, but the autonomy itself must come from something other than a language model.
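As an illustration of that division of labor, in the chess setting the language model can be reduced to a move proposer, while legality checking and the fallback, the actual autonomy, live in a conventional, grounded component. Again a sketch, not my interface code; propose_move_llm and engine_move are hypothetical names.

```python
import chess

def propose_move_llm(board: chess.Board) -> str:
    """Hypothetical: ask the language model for a move in UCI notation.
    Its answer may be illegal or tactically unsound."""
    raise NotImplementedError

def engine_move(board: chess.Board) -> chess.Move:
    """Grounded fallback; in a real system this would call a chess engine.
    Here it returns the first legal move to keep the sketch self-contained."""
    return next(iter(board.legal_moves))

def grounded_move(board: chess.Board) -> chess.Move:
    """The LLM is only a component: legality checks and the fallback
    (the actual autonomy) live outside the language model."""
    try:
        move = chess.Move.from_uci(propose_move_llm(board))
        if move in board.legal_moves:
            return move
    except (NotImplementedError, ValueError):
        pass  # no usable proposal from the model
    return engine_move(board)
```

The point of the sketch is only the shape of the design: the model supplies language and suggestions; the surrounding system supplies state tracking, validation, and decisions.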