Jun 18, 2025
 

The other day, I came across a post, one that has since appeared on several media sites: An Atari 2600 from 1977, running a chess game, managed to beat ChatGPT in a game of chess.

Oh my, I thought. Yet another example of people misconstruing a language model’s capabilities. Of course the Atari beat ChatGPT in a game of chess. Poor ChatGPT was likely asked to keep track of the board’s state in its “head”, and accurately track that state across several moves. That is not what an LLM is designed to do. It is fundamentally a token generator: You feed it text (such as the transcript of the conversation up to the latest prompt) and it generates additional text.

The fact that the text it generates is coherent, relevant, even creative and information-rich is a minor miracle in its own right, but it can be quite misleading. It is easy to sense a personality behind the words, even without the deceptively fine-tuned “alignment” features of ChatGPT. But personality traits notwithstanding, GPT is not hiding a secret chessboard somewhere, one that it could use to keep track of, and replay, the moves.

But that does not mean GPT cannot play chess, at least at an amateur level. All it needs is a chessboard.

So I ran an experiment: I supplied GPT with a chessboard. Or, to be more precise, I wrote a front-end that fed GPT the current state of the board in a standard notation, FEN (the Forsyth-Edwards Notation). Furthermore, I invoked GPT with only minimal prompting: the current state of the board and up to two recent moves, instead of the entire chat history.
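
The idea is simple enough to sketch in a few lines of Python. The sketch below assumes the python-chess library and the OpenAI chat completions client; the prompt wording, the model name, and the ask_gpt_for_move helper are illustrative rather than a verbatim copy of my front-end.

    import chess
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask_gpt_for_move(board: chess.Board, recent_moves: list[str]) -> str:
        # Hand the model only a FEN snapshot and the last couple of moves;
        # the board object remains the authoritative record of the game.
        prompt = (
            "You are playing chess as White.\n"
            f"Current position (FEN): {board.fen()}\n"
            f"Recent moves: {' '.join(recent_moves) or '(none)'}\n"
            "Reply with your next move in standard algebraic notation only."
        )
        response = client.chat.completions.create(
            model="o3",  # or "gpt-4.1" for the non-reasoning experiment
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()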

I used OpenAI’s o3 reasoning model for this purpose, which turned out to be a bit of overkill; GPT pondered some of the moves for several minutes, even exceeding five minutes in one case. Although it lost in the end, it delivered a credible game against its opponent, GNU Chess playing at “level 2”.
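
The opponent’s side can be driven the same way by running GNU Chess as a UCI engine under python-chess. This, too, is only a sketch: recent GNU Chess versions accept a UCI mode switch (the exact flag may differ by version), and the fixed search depth standing in for “level 2” is an assumption rather than the exact setting my interface used.

    import chess
    import chess.engine

    # Start GNU Chess in UCI mode (the flag may vary by version).
    engine = chess.engine.SimpleEngine.popen_uci(["gnuchess", "--uci"])

    def engine_reply(board: chess.Board) -> chess.Move:
        # A shallow fixed search depth roughly approximates a low playing level.
        result = engine.play(board, chess.engine.Limit(depth=2))
        return result.move

With both sides wrapped in a function, the main loop simply alternates the two, pushing each returned move onto the board until the game ends.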

In fact, for a while, GPT seemed to be ahead: I was expecting it to win when it finally made a rather colossal blunder, sacrificing its queen for no good reason. That move was a profound outlier: whereas GPT prefaced all its other moves with commentary and analysis, this one came with no commentary whatsoever. It’s almost as if it simply decided to sabotage its own game. Or perhaps it was just a particularly bad luck of the draw from the RNG in what is fundamentally a stochastic reasoning process?

Still, it managed to last for 56 moves against capable chess software. The only other blunder it made during the game was one attempted illegal move that would have left its king in check. A retry attempt yielded a valid move.
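
That retry is easy to automate in the front-end: validate the model’s reply against the current position and re-prompt when it does not parse as a legal move. Again only a sketch, building on the hypothetical ask_gpt_for_move helper above.

    def get_valid_gpt_move(board: chess.Board, recent_moves: list[str],
                           max_retries: int = 3) -> chess.Move:
        for _ in range(max_retries):
            san = ask_gpt_for_move(board, recent_moves)
            try:
                # parse_san raises ValueError for illegal, ambiguous
                # or unparseable moves in the current position.
                return board.parse_san(san)
            except ValueError:
                continue  # re-prompt with the same position
        raise RuntimeError("no legal move produced after retries")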

I am including a transcript of the game below, as recorded by my interface software. GPT’s queen was lost on move 44.

As I mentioned, I thought that using the o3 model and its reasoning capability was excessive. Since then, I ran another experiment, this time using plain GPT4.1. The result was far less impressive. The model made several attempts at illegal moves, and the legal moves it made were not exactly great; it lost its queen early on, and lost the game in 13 moves. Beginner level, I guess.

Oh well. So perhaps the reasoning model is required to make GPT play credibly.

Nonetheless, I think these examples demonstrate that while these models are no chess grandmasters, they are not stochastic parrots either. The fact that a language model can, in fact, offer a reasonable game against a dedicated chess software opponent speaks for itself.

Here is the transcript of the 56-move game that the o3 model, playing White, would likely have won had it not squandered its queen.

1. e4 e6
2. d4 d5
3. Nc3 Nf6
4. Bg5 Be7
5. e5 Nd7
6. h4 Bxg5
7. hxg5 Qxg5
8. Nf3 Qd8
9. Bd3 h6
10. O-O a5
11. Nb5 Nb6
12. Rc1 O-O
13. c4 dxc4
14. Bxc4 Bd7
15. Bd3 Bxb5
16. Bxb5 f6
17. Qb3 Qe7
18. exf6 Rxf6
19. Ne5 Qd6
20. Rd1 Kf8
21. Nc4 Nxc4
22. Qxc4 c6
23. Rd3 Rf7
24. Ba4 b5
25. Bxb5 Qd7
26. Qc5+ Ke8
27. d5 cxb5
28. dxe6 Qxe6
29. Re3 Qxe3
30. Qxe3+ Re7
31. Qf3 Ra7
32. Qh5+ Kd8
33. Rd1+ Rd7
34. Qf3 Rxd1+
35. Qxd1+ Nd7
36. Qd2 b4
37. a3 Re4
38. f3 Rc4
39. b3 Rc3
40. axb4 axb4
41. Qd5 Rc1+
42. Kh2 g5
43. Qg8+ Kc7
44. Qc8+ Kxc8
45. Kg3 Rc3
46. Kg4 Rxb3
47. Kh5 Rxf3
48. gxf3 b3
49. Kxh6 b2
50. Kxg5 b1=Q
51. Kh6 Qh1+
52. Kg7 Qxf3
53. Kg8 Ne5
54. Kh8 Qb7
55. Kg8 Qf7+
56. Kh8 Ng6#

I can almost hear a character from one of the old Simpsons episodes, in a scene set in Springfield’s Russian district, yelling loudly as he upturns the board: “Хорошая игра!” (“Good game!”)


  2 Responses to “Yes, GPT can play chess”

  1. Interesting experiment. Hopefully it is not as dangerous as it was for Moxon the inventor (you may remember the story “Moxon’s Master” by Ambrose Bierce).

    Inspired by this, I tried experimenting with DeepSeek and some simpler games, but it generally ends up either with the model already knowing the relevant Sprague-Grundy numbers and the like, or in situations where it is tricky to explain the game rules.

    I guess it is hard to extrapolate to other chess variants, but perhaps, for something fancier, you could ask it to play on a “cylindrical” board, or try Fischer random chess.

    scene set in Springfield’s Russian district, yelling loudly

    Funny. It sounds unnatural, like a not-quite-ideal calque translation: we would normally say “thanks for the game” rather than “good game” (probably due to complicated historical/linguistic reasons). But overall it closely resembles the end-of-game chess scene in the old Soviet movie “Gentlemen of Fortune”, in which the fellows turn chess into a gambling game and try to win some clothes from a stranger (having lost their own after being transported, by mistake, in a cistern of liquefied cement).