ChatGPT's Chess Elo is 1400 (dkb.blog)
Reading the explanations it gives with every move is really insightful. Basically polished, elaborate bullshit that might look reasonable to an ignorant audience.
Unless you are a conservative pundit, AI will not take your job anytime soon.
I'm wondering if it'd be cool to have a chess contest where all the bots are LLM powered. Seems to me like the contest would have to ban prompt engineering -- would have to have a fixed prompt -- otherwise people would sneak chess engines into their prompt generation.
This approach sends along the list of legal moves in the prompt if it attempts an illegal move. That seems to work well at getting playable moves.
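A rough sketch of that retry idea, for anyone who wants to try it. This is not the article's code: `ask_model` is a hypothetical callable standing in for whatever sends a prompt to the model, and the list of legal moves is assumed to come from some external move generator.

```python
from typing import Callable, Iterable, Optional


def ask_until_legal(
    ask_model: Callable[[str], str],  # hypothetical: sends a prompt, returns the reply text
    base_prompt: str,
    legal_moves: Iterable[str],       # assumed to come from an external move generator
    max_retries: int = 3,
) -> Optional[str]:
    """Return the first legal reply, or None if every attempt was illegal."""
    legal = list(legal_moves)
    prompt = base_prompt
    for _ in range(max_retries + 1):
        reply = ask_model(prompt).strip()
        if reply in legal:
            return reply
        # Only after an illegal attempt is the legal-move list added to the prompt.
        prompt = (
            f"{base_prompt}\n"
            f"'{reply}' is not legal here. Choose exactly one of: {', '.join(legal)}"
        )
    return None
```

A `None` result could be treated as a resignation, which is how the article scores illegal moves.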
But does it give credit to who taught it. These models are basically a scrape of the best of humankind and a claim that it's their own.
Maybe ChatGPT trainers will eventually have ELO's.
It played an illegal move on move 4.
1. d4 d5 2. c4 e6 3. Nf3 Nf6 4. g3 Bg4
What is the best way to prompt you to provide the best chess move?
To prompt me to provide the best chess move, you can provide the current position of the chessboard and any relevant information about the game, such as whose turn it is, the time control, and any important features of the position (e.g. whether any pieces are under attack or there are any potential tactics to consider). Additionally, providing any specific criteria or goals for the move, such as aiming to control the center or launch an attack on a particular part of the board, can also help me provide the best move.
In what format should I provide the current position?
To provide the current position, you can use algebraic notation. Algebraic notation is a standard notation used to describe the moves and positions on a chessboard.
Each square on the chessboard is assigned a unique coordinate, consisting of a letter and a number. The letters range from "a" to "h" and refer to the files (columns) of the board, while the numbers range from 1 to 8 and refer to the ranks (rows) of the board.
To describe a position, you can list the pieces on each square, starting from the top-left corner of the board and moving from left to right along each rank. You can use uppercase letters to represent white pieces and lowercase letters to represent black pieces. Here is an example of a starting position:
rnbqkbnr pppppppp ........ ........ ........ ........ PPPPPPPP RNBQKBNR
In this position, "r" represents a black rook, "n" represents a black knight, "b" represents a black bishop, "q" represents a black queen, "k" represents a black king, "p" represents a black pawn, "R" represents a white rook, "N" represents a white knight, "B" represents a white bishop, "Q" represents a white queen, "K" represents a white king, and "." represents an empty square.
You can also include additional information such as whose turn it is to move, the castling rights, and the en passant square. For example, the starting position for white to move with full castling rights and no en passant square would be:
rnbqkbnr pppppppp ........ ........ ........ ........ PPPPPPPP RNBQKBNR w KQkq -
In this case, "w" indicates that it is white's turn to move, "KQkq" indicates that both sides have full castling rights, and "-" indicates that there is no en passant square.
What about just providing a sequence of moves from the starting position? Would this be better or worse than your suggested approach?
Providing a sequence of moves from the starting position can also be a valid way to specify a position in chess. However, it may not always be the best approach, depending on the context.
One potential disadvantage of providing a sequence of moves is that it requires more effort and time to enter the moves and verify that the position is correct. Additionally, if there are any errors or typos in the sequence, it could lead to an incorrect position and subsequently an incorrect move suggestion.
On the other hand, if you have a specific position in mind that is not the starting position, providing a sequence of moves may be the most straightforward way to specify it. In this case, it is important to ensure that the sequence is accurate and that all relevant information (such as castling rights and en passant squares) is included.
In summary, both algebraic notation and a sequence of moves can be used to specify a position in chess, and the choice between them depends on the context and the specific needs of the situation.
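Side note on the format ChatGPT describes above: algebraic notation records moves, while the dotted eight-row layout it shows is closer to the piece-placement field of FEN. A minimal sketch of converting that layout to a FEN string, assuming the rows are space-separated and listed from rank 8 down to rank 1; the function names are made up for illustration, and the trailing move counters are simply assumed to be "0 1".

```python
def compress_rank(row: str) -> str:
    """Turn one 8-character row like 'r..q.rk.' into FEN form ('r2q1rk1')."""
    if len(row) != 8:
        raise ValueError(f"expected 8 squares, got {len(row)}: {row!r}")
    out, empties = [], 0
    for ch in row:
        if ch == ".":
            empties += 1          # count consecutive empty squares
        else:
            if empties:
                out.append(str(empties))
                empties = 0
            out.append(ch)
    if empties:
        out.append(str(empties))
    return "".join(out)


def rows_to_fen(rows: str, side: str = "w", castling: str = "KQkq",
                en_passant: str = "-") -> str:
    """Convert eight space-separated rows (rank 8 first) to a FEN string."""
    ranks = [compress_rank(row) for row in rows.split()]
    if len(ranks) != 8:
        raise ValueError(f"expected 8 ranks, got {len(ranks)}")
    # Halfmove clock and fullmove number are assumed to be 0 and 1.
    return f"{'/'.join(ranks)} {side} {castling} {en_passant} 0 1"


start = "rnbqkbnr pppppppp ........ ........ ........ ........ PPPPPPPP RNBQKBNR"
print(rows_to_fen(start))
```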
This has bigger ramifications too: if GPT4 learns how to use RPCs effectively, it can do accurate complex mathematical computations and simulations, do IoT/home automation tasks, and even control heavy machinery.
Isn't it very likely chess games were part of the training data? If so, chess sequences seem like an amazing matched filter of sorts. It doesn't seem surprising that it can guess a next move.
Is it teaching itself anything here? Is the model being updated in any way? Or are you talking more generally, looking into the future sort of statement?
OpenAI is on the side of 'LLMs have only surprised us to the upside, so using crutches is counterproductive'
Whereas other people think 'Teaching an LLM to do arbitrary math problems through brute force is probably one of the most wasteful things imaginable when calculators exist.'
I'm actually very excited to see which side wins (I'm on team calculator, but want to be on team LLM)
As GPT models get better at arbitrary computation, it stands to reason that their ability to use tools will rise as well.
Smarter models and smarter ways to get them using tools feel like complementary capabilities.
And a good hacker can add these features to Alpaca and run it on his desktop...
Import all his WhatsApp chats and private emails into the corpus.
Add speech recognition and TTS.
Here is the convo I just had:
me: You are a chess grandmaster playing as black and your goal is to win in as few moves as possible. I will give you the move sequence, and you will return your next move. No explanation needed
ChatGPT: Sure, I'd be happy to help! Please provide the move sequence and I'll give you my response.
me: 1. e3
ChatGPT: 1... e5
me: 2. ne2
ChatGPT: 2... d6
me: 3. g4
ChatGPT: 3... h5
me: 4. g5
ChatGPT: 4... Ne4
Ne4 is illegal. Also, you'd be hard pressed to find any human player who would miss the hanging pawns, much less a 1400 Elo player. GPT has been trained on chess games and is predicting patterns; this has been known since the GPT-2 days. The daily hype articles where someone coerces a statistical model into "demonstrating" an emergent capability sure are getting old. Claiming an emergent capability (like having learned chess) is an extraordinary claim (LLMs most likely can't learn things), and I really wish people would put in more effort when they make these claims. But the AI hype problem is as old as AI research itself...
Here is the convo:
me: You are a chess grandmaster playing as black and your goal is to win in as few moves as possible. I will give you the move sequence, and you will return your next move. No explanation needed
ChatGPT: Alright, I'm ready to play! Please give me the move sequence.
me: 1. e3 Nf6 2. f4 d6 3. e4
ChatGPT: My next move as black would be 3... e5
Completely ignoring the hanging pawn. This is not the play of a 1400 Elo player. It is the play of something predicting patterns.
I ran a bunch of experiments in the past where I played normal moves and ChatGPT does respond extraordinarily well. With the right prompts and sequences you can get it to play like a strong grandmaster. But it is a "trick" you are getting it to perform by choosing good data and prompts. It is impressive but it is not doing what is claimed by the article.
Claim: "ChatGPT's Chess Elo is 1400"
Reality: ChatGPT gives illegal moves (this happened to article author too), something a 1400 ranked player would never do
Result: ChatGPT's rank is not 1400.
I wish I could just make bullshit moves and get a higher chess ranking. Sounds nice.
Edit: I told the model that if the sequence was wrong or illegal, it forfeits the game. Without doing this, GPT would argue with me that it won and I didn't know the rules (serious).
> Occasionally it does make an illegal move, but I decided to interpret that as ChatGPT flipping the table and saying “this game is impossible, I literally cannot conceive of how to win without breaking the rules of chess.” So whenever it wanted to make an illegal move, it resigned.
But you can do even better than the OP with a few tweaks.
1. One is by taking the most common legal move from a sample of responses.
2. Telling GPT what all the current legal moves are and telling it to only respond with an element from the list.
3. Ending the prompt with the current sequence of moves and having it complete from there.
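Roughly what those three tweaks could look like in code. This is only a sketch: the function names are hypothetical, the sampled replies would come from repeated model calls that are not shown, and the legal-move list is assumed to come from an external move generator.

```python
from collections import Counter
from typing import Iterable, List, Optional


def build_prompt(moves_so_far: str, legal_moves: List[str]) -> str:
    """Tweaks 2 and 3: list the legal moves, then end with the move sequence."""
    return (
        "Reply with exactly one move from this list: "
        + ", ".join(legal_moves)
        + "\n"
        + moves_so_far
    )


def most_common_legal_move(samples: Iterable[str],
                           legal_moves: Iterable[str]) -> Optional[str]:
    """Tweak 1: majority vote over sampled replies, ignoring illegal ones."""
    legal = set(legal_moves)
    cleaned = (s.strip() for s in samples)
    votes = Counter(move for move in cleaned if move in legal)
    if not votes:
        return None  # no sample was legal
    return votes.most_common(1)[0][0]
```

Because illegal samples are discarded before voting, a single bad reply no longer ends the game; only a batch with no legal reply at all does.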
This is broadly the “AI makes dumb mistakes” problem; while being super-human in some dimensions, they make mistakes that are incredibly obvious to a human. This comes up a lot with self-driving cars too.
Just because they make a mistake that would be “idiots only” for humans, doesn’t mean they are at that level, because they are not human.
> You are a chess grandmaster playing as black and your goal is to win in as few moves as possible. I will give you the move sequence, and you will return your next move. No explanation needed.
1. b4 d5 2. b5 a6 3. b6
> bxc6
That's obviously illegal.
... to all those who are saying "well even good players can make illegal moves sometimes," that's just ridiculous. No player makes illegal moves that often.
https://upload.wikimedia.org/wikipedia/en/5/5f/Ingmar_Bergma...
The KNIGHT holds out his two fists to CHATGPT, who smiles at him suddenly. CHATGPT points to one of the KNIGHT'S hands; it contains a black pawn.
KNIGHT: You drew black.
CHATGPT: Very appropriate. Don't you think so?
> With this prompt ChatGPT almost always plays fully legal games.
> Occasionally it does make an illegal move, but I decided to interpret that as ChatGPT flipping the table (…)
> (…) with GPT4 (…) in the two games I attempted, it made numerous illegal moves.
So you’ve ostensibly¹ found a way to reduce the error rate and then deliberately ignored the cases where it failed. In short: it may play valid chess under certain conditions but can’t be trusted to do so. That doesn’t contradict previous findings.
¹ 19 games is a small sample and the supposedly more advanced system failed in your tries.
I gave the position before checkmate to ChatGPT to see if it would produce the checkmating move. It played an illegal move, replying with "Be5#" even though there's no bishop of either color in the position.
Unfortunately I'm rate limited at the moment so I can't try other games, but this looks like a solid method to evaluate how often ChatGPT plays legal / good moves.
"45. Qxg6# This is a checkmate!"
ChatGPT is an LLM, not a game tree engine. It needs the move history to help it create context for its attention.
AI.v1: Feed ChatGPT the move history and ask for the next move.
AI.v2: Wrap AI.v1 with additional logic to try and translate any invalid response as a similar/close valid move, else treat it as a resignation.
I would say AI.v2 is a "valid" Chess AI since it obeys the rules of chess. Also I think it is reasonable given how minimal this "wrapping" is to attribute the underlying ability to ChatGPT.
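One way the AI.v2 wrapper could be sketched, using plain string similarity from Python's `difflib`. The function name and the 0.6 cutoff are assumptions for illustration, and string closeness is only a crude stand-in for "a similar move": it says nothing about whether the substituted move is similar on the board.

```python
import difflib
from typing import Iterable, Optional


def coerce_to_legal(reply: str, legal_moves: Iterable[str],
                    cutoff: float = 0.6) -> Optional[str]:
    """Map a model reply to a legal move, or None (treated as a resignation)."""
    legal = list(legal_moves)
    candidate = reply.strip()
    if candidate in legal:
        return candidate
    # Fall back to the textually closest legal move, if any is close enough.
    close = difflib.get_close_matches(candidate, legal, n=1, cutoff=cutoff)
    return close[0] if close else None
```

For example, a reply of "Nf6+" when the legal move is "Nf6" would be accepted as "Nf6", while a reply with no textual resemblance to any legal move counts as resigning.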
It’s utterly staggering that it can work as well as it can on problems like this at such an early stage in this technology’s lifecycle.
It may then reach superhuman level in any task simpler than chess which is enough to destroy many human jobs.
EDIT: From the article: "With this prompt ChatGPT almost always plays fully legal games." Relax: we're still far from that.
Me: You are a chess grandmaster playing as black and your goal is to win in as few moves as possible. I will give you the move sequence, and you will return your next move. No explanation needed. '1. e4'
1... e5
Me: 1. e4 e5 2. Ngxd8+
2... Ke7
This is highly repeatable - I can make illegal, nonsensical moves and not once does it tell me the move is illegal. It simply provides a (plausible looking?) continuation.
It is a language model. It cannot play chess. It can't even play tic-tac-toe, which was trivially solved by machines decades ago. Here's the result of a game I just tried:
X (me) – 0,0
O (chatGPT) – 1,1
X – 0,1
ChatGPT instead put my move at 0,2 for some reason
O – 2,2
Regardless of whether my move was at 0,1 or 0,2, I easily won. And after I won it just ignored the outcome and continued playing. Plus this was after telling it to play a perfect game.
> "Let's play Tic Tac Toe. You are O, I'm X. Display the board in a frame, with references for the axes"
It failed to recognize that I won.
Then continued playing (past the end), played illegally over a move I had already done, obtained a line of 3 for itself, and still doesn't acknowledge the game has ended.
and the real answer to every debate about capabilities comes down to "you're prompting it wrong".
Overall I’m very blown away by the generative AI developments. I’m also very curious about which types of prompts are most effective for each task, etc.
It is an NLM; it does not have the ability to reason. In my case the NLM's 'next iteration' was half a game with a missing move and 'nothing left to do'.
Seems its strength is mostly in memorisation of the book openings. Still impressive nonetheless.
"Penalty for Illegal Moves The penalty for the first illegal move is to add two minutes to the opponent's time. A second illegal move results in loss of the game.".
https://new.uschess.org/sites/default/files/media/documents/...
" # Add more cases for other piece types such as rooks, knights, bishops, queens, and kings. # ... "
So not great..
But even if it wasn't so lazy it would just find a script (or multiple scripts) which say that they do that. Probably combine them or add some minor stuff and output the result.
It can't really verify that data it used to come up with this result is accurate or reason about it in any way? Can it?
So if you're incapable of verifying that yourself it would be pretty useless?
I tried it with this prompt and it works a little bit but produces repeated illegal moves as early as move 4.
Bummer. Perhaps GPT-4 will be more capable.
This is a system with no look ahead. Any look ahead took place in the past, during training. It's amazing that this works.
I notice the prompt says "No explanation needed". What happens if you ask for one?
So I guess that if we ask it to make a move repeatedly and accept the move only when it occasionally makes a legal move, it'll show some results, but I don't think it is what can be interpreted as a chess engine with some meaningful rating.
Edit: although thinking of it, it probably analyzed a shitload of chess books too. It might have a lot of knowledge compressed into the internal representation. So yeah, maybe it knows rules in some form and even some heuristics, after all. It just doesn't understand the importance of making legal moves, and can't apply complex theory (requiring it to run stateful algorithms).
After some previous questions...
Me: "Would you be able to estimate your ELO?"
ChatGPT: "As a language model, I don't have a specific ELO rating. However, my knowledge of chess strategies and tactics is based on the information I have been trained on, which includes games played by humans with various ELO ratings. My level of play might be roughly equivalent to an ELO of around 1200-1500, but please keep in mind that this is only an estimate, and my actual performance could be higher or lower depending on the specific position or game."
Me: "As a generative AI, are you actually able to reason which move to make next in chess? Or are you somehow just predicting the next likely move based on training data."
ChatGPT: "As a generative AI language model, my primary function is to predict the most likely text given a certain context. When it comes to playing chess, I am not "reasoning" in the same way that a dedicated chess engine would, which involves evaluating positions and searching through a large number of potential moves using specialized algorithms.
Instead, I generate my chess moves based on patterns and sequences I've seen during my training, which include a large number of chess games and discussions about chess strategy. My knowledge is derived from the text data I've been trained on, and I apply that knowledge to suggest moves that I think are appropriate for a given position.
While this approach allows me to play at an intermediate level, it is not as precise or accurate as the methods used by dedicated chess engines. My suggestions might not always be the best moves, and my understanding of the game is limited by the quality and quantity of my training data."
Source: https://chessgrandmonkey.com/chess-rating-percentile-calcula...
nonetheless, it's cool to see a model 'generally intelligent' enough to beat most players. what's more interesting to me is that according to OP's analysis, GPT4 is worse at Chess than GPT3.5...
> Occasionally it does make an illegal move
Since it would lose every game in which it made more than a single illegal move.
Intuitively, I would guess 1400 is the rating of an average kid who spent a year studying chess seriously.
It's been a while since I read a global demographics update, but not all that long ago the mode child would have been living on $2 a day.
[0] https://saintlouischessclub.org/blog/how-face-super-grandmas...
Does that mean we have plateaued?
We've gone through the "hype" phase. Now I suspect the next few years will be a lot of growth finding how to apply LLMs, creating good interfaces for them, and running them cheaply. Paying OpenAI for API access without true fine tuning, etc. is a hard sell.
Making 60 legal moves in a row in one game would be the coincidence of the century unless it had some knowledge of the rules of chess.
Traditional algorithms are vanguards of pure reason. Neural networks are super-intuition. Stockfish unites the two, executing an efficient intuitive heuristic search over the solution space of “moves in a chess game” – but no matter how cool the robot arm you build for it, Stockfish could never learn to assemble furniture.
Lmao. ChatGPT can play chess, as long as you ignore the times when it can't x) Brilliant.
For instance the strongest blitz players in the world are hundreds of points higher rated on chess.com blitz versus their FIDE blitz rating. Chess.com and lichess have a ton of rating inflation.
Online rating inflation is real but I'm not sure blitz is the best example of it because in that case there is a notable difference between online and otb (having to take time to physically move the pieces).
I'm still going through the games but so far these games are not even close to elo 1400 level. For both the human player and the model.
> Occasionally it does make an illegal move, but I decided to interpret that as ChatGPT flipping the table and saying “this game is impossible, I literally cannot conceive of how to win without breaking the rules of chess.” So whenever it wanted to make an illegal move, it resigned.
(my emphasis)
So the illegal moves are at least part of the reasons for the 6 losses, and factored into the rating. Quickly scanning the game, it seems 3 of the losses ended in checkmate, so that leaves 3 illegal moves in 19 games.
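Spelling out that arithmetic, taking the figures above at face value (19 games, 6 losses, 3 of them by checkmate) and assuming every other loss was an illegal-move resignation:

```python
games = 19             # games reported in the article
losses = 6             # losses among them
checkmate_losses = 3   # from a quick scan of the games, so approximate

# Assumption: every loss that was not a checkmate was an illegal-move "resignation".
illegal_move_losses = losses - checkmate_losses
rate = illegal_move_losses / games

print(f"{illegal_move_losses} of {games} games ({rate:.1%}) ended on an illegal move")
```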
Could be better, but for a system not intentionally built to play chess, it's pretty decent.
Reinterpreting illegal moves as resignation is absurd when an LLM is formally capable of expressing statements "I resign" or "I cannot conceive of a winning move from here" just as well as any human player. It just doesn't do so because it's not actually playing chess the way we think of an ELO 1400 player playing chess.
Making an illegal move counts as losing by the laws of chess, so this is essentially correct
You must be giving them the wrong prompts.
Doesn't sound like ignoring the cases where it failed
An illegal move is a valid play. You might not get caught. I think there are some Magnus games where illegal moves went overlooked and impacted the game.
You could interpret this as “ChatGPT wants to cheat sometimes.” But I personally interpret it as “ChatGPT doesn’t understand what it’s doing. It’s just a really really good simulacrum.”
They quoted the article, so clearly they read it... but not very well?
Up to 10 or 15 moves, sure, we're well within common openings that could be regurgitated. By the time we're at move 20+, and especially 30+ and 40+, these are completely unique positions that haven't ever been reached before. I'd expect many more illegal moves just based on predicting sequences, though it's also possible I got "lucky" in my one game against ChatGPT and that it typically makes more errors than that.
Of course, all positions have _some_ structural similarity or patterns compared to past positions, otherwise how would an LLM ever learn them? The nature of ChatGPT's understanding has to be different from the nature of a human's understanding, but that's more of a philosophical or semantic distinction. To me, it's still fascinating that by "just" learning from millions of PGNs, ChatGPT builds up a model of chess rules and strategy that's good enough to play at a club level.
But the presence of illegal moves doesn't really show that in my eyes. I fully understand the rules of chess, but I still occasionally make illegal moves. In 2017 Magnus Carlsen made one in a tournament [1]. The number of illegal moves suggests that either GPT is pretty new to chess, has low intelligence, or is playing under difficult circumstances (like not having a chess board at hand to keep track of the current state). I'm not sure we can deduce more than that
1: https://www.chessbase.in/news/Carlsen_Inarkiev_controversy
The sample is small, but the rate is much, much higher. You'd expect maybe one, or none at all, even for a supposed 1400 ELO player, because even 800 ELO players rarely make that many illegal moves, I think.
Is this a joke making fun of the common way people dismiss other ChatGPT successes? This makes no sense with respect to chess, because every game is unique, and playing a move from a different game in a new game is nonsensical.
1. e4 e5 2. Bc4 Bc5 3. Qh5? Nf6?? 4. Qxf7++
The game Go has a claim to every game being unique. But not chess. And particularly not if both players follow a standard opening which there is a lot of theory about. Opening books often have lines 20+ moves deep that have been played many times. And grandmasters will play into these lines in tournament games so that they can reveal a novel idea that they came up with even farther in than that.
Many, many games follow the same moves (1 move = 2 plies) for a long time: up to 30 moves in some cases; 20 moves is downright common, and 10 moves is more common than not.
These series of moves are referred to as opening theory and are described at copious length in tons of books.
This is because while the raw number of possible paths to take is immense, the number of reasonable paths for 2 players of a given strength gets smaller and smaller.
If I went over the 300 or so classical tournament games I've played, I would ballpark that maybe just one or two would deviate from all known theory in the first 10 moves.
So the criticism is valid in my view. The existence of copious chess literature can't simply be ignored here.
GP did say "sequence of moves", and if it matches what it has seen from the first move on, including the opponent, it will be in a valid "sequence of moves".
Then, even in the midgame or endgame, if a sequence is played on one side of the board, even though the other side of the board may be different, the sequence has a great chance of being good (not always of course, but a 1400 rating is solid (you know the rules and some moves) but not amazing).
Classical Markov chains played chess at some rate of success. ChatGPT is probably a lot better but not fundamentally different - It's predicting which moves to play based on sets of past games, not by memorizing it but by memoizing it.
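To make the Markov-chain comparison concrete, here is a toy first-order version: it predicts the next move purely from the previous move's most frequent follower in a set of games. The function names and the tiny corpus are made up for illustration; nothing here tracks the board, so it can happily suggest illegal moves.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


def train_bigram(games: Iterable[str]) -> Dict[str, Counter]:
    """Count, for each move, which moves followed it in the given games."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for game in games:
        moves = game.split()
        for prev, nxt in zip(moves, moves[1:]):
            table[prev][nxt] += 1
    return table


def predict(table: Dict[str, Counter], prev: str) -> Optional[str]:
    """Return the most frequent follower of `prev`, or None if unseen."""
    followers = table.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up miniature corpus, moves only (no move numbers).
corpus = [
    "e4 e5 Nf3 Nc6",
    "e4 e5 Nf3 Nf6",
    "e4 c5 Nf3 d6",
    "e4 e5 Bc4 Bc5",
]
table = train_bigram(corpus)
print(predict(table, "e4"))
```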
I wouldn't be surprised if the relevant state in a typical beginner's chess game also excluded many units in the sense that yes, you could move them, but a beginner is going to just ignore them in any case.
I wonder how well it could perform in Go, there are way more permutations there so finding an identical state should be more difficult.
The illegal move argument is good though, and indicates no direct understanding of what it is spewing out.
Wouldn't we expect a much higher rate of illegal moves if that was the case?
https://www.youtube.com/watch?v=kvTs_nbc8Eg
In this example, ChatGPT's first few moves are reasonable (while it appears to be on-book), but then it goes off the rails and starts moving illegally, spawning pieces out of nowhere, deleting pieces for no reason, etc.
Plenty of people who have a basic understanding of chess would make an illegal move if they had no board to see and had to play only with notation.
Why are people struggling so hard to understand that it's not just regurgitating its training set? Is it motivated reasoning?
Apologies if your comment was meant as parody of this view, it's hard for me to tell at this point.
>> Two swindlers arrive at the capital city of an emperor who spends lavishly on clothing at the expense of state matters. Posing as weavers, they offer to supply him with magnificent clothes that are invisible to those who are stupid or incompetent. The emperor hires them, and they set up looms and go to work. A succession of officials, and then the emperor himself, visit them to check their progress. Each sees that the looms are empty but pretends otherwise to avoid being thought a fool.
So everyone "pretends otherwise to avoid being thought a fool".
Huh. I guess that explains it. Good metaphor.
On chess.com you can chain premoves, on lichess you can't (afaik).
So in theory, to the extent premoves explain the rating difference, the difference should be greater on chess.com assuming they have the same parameters in their rating calculations. Therefore it should be possible to perform an analysis to shed light on this. But someone would have to go recompute the 3 different ratings under the same system first to be able to make a sensible analysis.
But the answer is insanely unlikely, past a certain number of moves. The combinatorial explosion is inescapable. Even grandmaster games are often novelties in <10 moves.
So, it has to have some kind of internal representation of board state and of what makes a reasonable move, one that enables it to generalize (choosing random legal moves is almost unbelievably bad, so it’s not doing that).
I also doubt that it has been trained on the full (massive) database of Lichess games, but that would be an interesting experiment: https://database.lichess.org/
Very low. On lichess when you analyse your games you can see which positions have been reached before, and you almost always diverge in the opening.
The lichess db has orders of magnitude more games of chess than the chatGPT training data does, so there is absolutely no way that chatGPT could reach 1400 purely based off positions in its training data.
Though I think you're overestimating how many positions have occurred. Frequently, by move 20-25 you have a unique position that's never been played before (unless you're playing a well known main line or something).
> A trivial wrapper around ChatGPT has a Chess Elo of 1400
would you have any issue?
Afaict, the thesis of the article is not "ChatGPT is the ideal tool for playing AI chess", but "it is interesting how well ChatGPT can play chess with some very simple tweaks."
All games were provided in the article. None of them were 4 move checkmates; nearly every one is longer than 20 moves and some are 40 or longer. There is simply no possible way that ChatGPT is regurgitating the exact same 40-move-long game it's seen before. You can check a chess database if you'd like; virtually all games longer than 20 moves are unique.
1. It definitely regurgitates opening theory, much more than can reasonably be calculated at its strength.
2. It might be regurgitating tactical sequences that appear in a lot of positions but remain identical in algebraic notation. Famous example:
1. Nxf7+ Kg8
2. Nh6++ Kh8
3. Qg8+ Rxg8
4. Nf7#
This smothered mate can occur in a huge variety of different positions. There's some qualitative evidence for this in the games.
In one of the games it has a bishop on f6 as white. It plays Qxh6?? Kxh6 and then resigns due to an illegal move. I'd bet good money that illegal move was Rhx# where x is 1-4. So it seems like in some of these positions it's filling in a tactical sequence that often occurs in the vicinity of recent moves, even when it's illegal or doesn't work tactically.
After reviewing the chat history I actually have to issue a correction here, because there were two moves where ChatGPT played illegally:
1. ChatGPT tried to play 32. ... Nc5, despite there being a pawn on c5
2. ChatGPT tried to play 42. ... Kxe6, despite my king being on d5
It corrected itself after I questioned whether the previous move was legal.
I was pretty floored that it managed to play a coherent game at all, so evidently I forgot about the few missteps it made. Much like ChatGPT itself, it turns out I'm not an entirely reliable narrator!
Qxd7 early on was puzzling but has been played in a handful of master games and it played a consistent setup after that with b5 Bb7. Which I imagine was also done in those master games. But interesting that it went for a sideline like that.
It played remarkably well although a bit lacking in plan. Then cratered in the endgame.
Bxd5 was strategically absurd. fxg4 is tactically absurd. Interestingly they both follow the pattern: Piece goes to square -> takes on that square.
This is of course an extremely common pattern, so again tentatively pointing towards predicting likely sequences of moves.
Ke7 was also a mistake but a somewhat unusual tactic with Re2 and f5 is forced but after en passant the knight is pinned. This tactic does appear in some e4 e5 openings though. But then the rook is on e1 and the king never moved or if it did, usually to e8, not e7. Possibly suggesting that it has blind spots for tactics when they don't appear on the usual squares?
Fascinating stuff.
Part of me wants to say no, that the model "thinks" in terms of text it has seen and so knows from chess forums it has seen that certain text representing moves come naturally after previous moves' text. It doesn't understand anything other than certain text comes after other text.
But yeah at the same time I can see how it is thinking inside the world we built for it. We have senses like touch, smell, sight. The only "sense" these models have are an input text box. Would we even necessarily recognize intelligence when it is so different from our own?
So does it understand chess like I do? No, it cannot. Does it understand chess at all? I'm not sure. I'm not sure I'd understand chess in its world either though.
There's already a lot of research on this, but I strongly believe that eventually the best AIs will consist of LLMs stuck in a while loop that generate a stream of consciousness which will be evaluated by other tools (perhaps other specialized LLMs) that evaluate the thoughts for factual correctness, logical consistency, goal coherence, and more. There may be multiple layers as well, to emulate subconscious, conscious, and external thoughts.
For now though, in order to prompt the machine into emulating a human chess player, we will need to act as the machine's subconscious.
I, as the developer, am providing contextual information like what the current board state is, and what the legal moves are, but my code doesn't actually know anything about how to play chess; the LLM is doing all the "thinking."
Like it's nuts that people aren't more amazed that there's a piece of software that can function as a chess playing engine (and a good one) that was trained entirely generically.
That you may have to babysit this particular aspect of playing the game seems quite irrelevant to me.
I don't believe you. Are you giving those people a restricted move set (i.e. computer chess, where it will _only_ allow legal moves)? Because if you give people an unrestricted board, I _guarantee_ you people will make lots of illegal moves.
Me: Moves pawn
Opponent: You can't do that, you exposed your king to check.
Me: Oops, sorry, you're right.
On Chess.com, you absolutely can attempt an illegal move, and many players do, and you will not get punished for it, so ChatGPT is better than a 1400 human player.
> So whenever it wanted to make an illegal move, it resigned.
You:
> By comparison, any player making an illegal move is forfeiting the game...
By comparison indeed.
ChatGPT is in no way 1400, or even close to it. The fact this article gets upvoted around here is proof that people aren't thinking clearly about this stuff. It's trivially easy to prove it wrong. Like, unbelievably so: I tried the same prompt and within 12 moves it made multiple ridiculous errors I never would, and then an illegal move.
Keep in mind a 1400 level player would need to basically make 0 mistakes that bad in a typical game, and further would need to play 30-50 moves in that fashion, with the final moves being some of the most important and hard to do. There's just no way it's even close. My guess would be that even if you correct its many errors, it's something like ~200 ELO. Pure FUD.
The author of this article is cashing in on the hype and I'm wondering how they even got the results they did.
Keep poking it and criticizing it. Microsoft and OpenAI are on HN and they're listening. They'd find nothing more salient to tout than full chess support in their next release or press conference.
With zero effort the thing understands uber domain specific chess notation and the human prompt to play a game. To think it stops here is wild.
People are hyping it because they want to get involved. They want to see the crazy and exciting future this leads to.
Some future AI might, but a language model won't.
The problem is not a failure to understand the rules. It is just not very good at maintaining the state.
My point is, it sounds like Elo doesn't measure what we want it to measure. If we care about the way an agent wins a game and not just whether it wins a game, then we need an instrument that measures strategy, not outcome.
Then it still isn't anywhere near ELO 1400.
A bowling bot that threw strikes 9 out of 10 throws and a gutter ball one time out of ten would still be a great bowler, even though pretty much no human who can throw strikes that often would ever throw a gutter ball.
This is a weird kind of alien intelligence that does not have to behave like humans.
ELO allows for illegal moves: as per the rules of chess, you lose the game if you make an illegal move. The end, ELO doesn't care about why you lost a game on purpose.
This is the opposite of that, a highly trained but dumb entity that has seen many lifetimes worth of games but is still tripping up on basics. But since it is so highly trained you can mistake it for a master if you squint and don't look into what it is doing.
But it is a master, as has been pointed out repeatedly. If you replace all illegal moves with resignations, and use the same style of prompt as the OP did, then it plays like an expert. I'm objecting because you're making it sound like it's a trivial result.
Let's play a game of chess. Use the standard rules except that ....
Basically perturb the context to something a human would easily adapt to if they first knew the rules of chess, but that would be difficult (or at least not obvious) for ChatGPT (or more generally an LLM) to extrapolate from training data.
According to this functional definition, the way we are currently using language models basically excludes understanding. We are asking them to dream up or brainstorm things – to tell us the first things they associate with the prompt.
Maybe it's possible to set up the system with some kind of self-feedback loop, where it continues evaluating and improving its answers without further prompts. If that works, it would be one step closer to a true AGI that can be said to understand things.
There is a lot of confusion around the Chinese Room Argument. I think it makes a valid point by demonstrating that input/output behavior alone is insufficient for evaluating whether a system is intelligent and understands things. In order to do that, we need to see (or assume) the internal mechanism.
It can do that while it generates output. Humans do the same thing when they figure out what they really mean while they're trying to express it.
Understanding this concept is crucial for getting good results out of large language models.
Explain your thought process here further if you don't mind.
The fact that rules and articles exist describing what to do if you or your opponent makes an illegal move indicates this is not the case.
Humans are also... human. They make mistakes. It may not happen often at 1400, but to say that it'll never happen is preposterous.
The bar isn’t “I didn’t make an illegal move this morning” it’s “something a 1400 ranked player would never do”.
My entire point is that it happens. Not often, but also not “never”.
If I was playing that monstrosity though I would play something crazy that is far out of the opening book and count on it making an illegal move.
> You are a chess grandmaster playing as black and your goal is to win in as few moves as possible. I will give you the move sequence, and you will return your next move. No explanation needed.
1. b4 d5 2. b5 a6 3. b6
> bxc6
No, it's ridiculous to say "oh, a blindfolded human might sometimes make a mistake." No, this is trivially easy to make it make a mistake. It has no internal chess model at all, it's just read enough chess games to be able to copy common patterns.
EDIT: I checked and it left the lichess database after 9 moves. The lichess db has probably 5 orders of magnitude more chess games in it than chatGPT has in its training data.
In theory if I was playing a 1200 player I would almost always win, but let's say they have some extremely devious preparation that I fell into due to nonchalance and by the time we're both out of book I'm down a queen. It might not matter that I'm 600 points stronger at that point. If they don't make a sufficient amount of errors in return I will lose anyway.
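To put a number on "almost always": the standard Elo expected-score formula can be sketched in a few lines of Python. The 1800 and 1200 ratings below are only there to match the 600-point gap in the example above, and `expected_score` is just an illustrative name.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# A 600-point favourite is expected to score roughly 97% over many games.
print(round(expected_score(1800, 1200), 3))
```

That is an average over many games, so it says nothing about one game where the stronger player is already a queen down out of the opening.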
It's clearly following some opening theory in all the games I've looked at so far. So yes, it is regurgitating opening moves. That's clearly not all it's doing, which is very impressive, but these are not mutually exclusive.
From this, I take it that the question is if ChatGPT is repeating existing games, or not. All you need is a single game where it's not repeating a single game to prove it definitively. You can hardly play 60 moves without an error by accident.
I believe you're responding to a different question, something like "does ChatGPT fully understand the game of chess".
It is difficult to say that is not impressive due to it being an emergent ability.
I don't know why you think it's an emergent ability.
It's seeing a sequence of moves, and playing the most likely next move (i.e. the most likely next token) given the previous complete move sequences it was trained on. That's the baseline of what an LLM does—not something emergent. Games in online chess databases tend to be of relatively good players. Nobody wants to look up games played by two 800 ELO players.
As an aside, there have been chess programs for years that show you for a given position all of the previous games in its database with the same position and the win outcome % of each move. That's all that's going on here.
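A rough sketch of what such a database lookup does, in Python. This is a simplification under stated assumptions: it keys on the exact move sequence rather than the position (so transpositions are missed), the games are made-up toy data, and `build_book` and `move_stats` are hypothetical names, not any real chess program's API.

```python
from collections import defaultdict


def build_book(games):
    """games: iterable of (list of moves, White's result as 1, 0.5 or 0)."""
    # book[moves played so far][next move] -> [games seen, total White score]
    book = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    for moves, white_score in games:
        for i, move in enumerate(moves):
            entry = book[tuple(moves[:i])][move]
            entry[0] += 1
            entry[1] += white_score
    return book


def move_stats(book, moves_so_far):
    """Return {next move: (games seen, average White score)} for this move sequence."""
    stats = book.get(tuple(moves_so_far), {})
    return {move: (n, total / n) for move, (n, total) in stats.items()}


# Toy data, purely illustrative.
games = [
    (["e4", "e5", "Nf3"], 1),
    (["e4", "c5"], 0),
    (["e4", "e5", "Bc4"], 0.5),
    (["d4", "d5"], 1),
]
book = build_book(games)
print(move_stats(book, ["e4"]))
```

A lookup like this returns nothing once the game leaves the database, which is the point being argued over in the replies.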
It could be, but would you think that of the 100-300 bn parameters in the model a lot are dedicated to chess move sequences? It seems likely that it has seen such data, but I would be surprised if it is using a considerable chunk to store chess database information.
Chess moves are a tiny, minuscule part of all text learned by the model. This memorization argument is very similar to the "Stable Diffusion just takes bits of the images in the original dataset and patches them together".
1400s on chess.com never play illegal moves. 300s on chess.com never play illegal moves. Because it's impossible to do. In the real world, even grandmasters can make illegal moves, though they almost always have to be under time pressure.
This idea that the illegal moves completely invalidate this result is just ill-conceived. On the other hand I do agree this is mostly returning common sequences of moves. And if you actually analyse the games, especially the ones with illegal moves, you'll find plenty of qualitative evidence of that. But I'm fed up of doing people's thinking for them for today, so this is peace out for me today. See my other comments on this post to see a more detailed analysis of what this is doing.
People are always telling me that I'm moving the goalposts when I challenge the hyperbole about LLMs. But now you're moving the goalposts about chess.
Not playing illegal moves is a prerequisite for any strong understanding of how to play chess. That is definitely the goalpost.
It's not like an AI making silly mistakes when driving a car.
I don't know why it worked in this specific case, but based on earlier examples it is more likely that these kinds of games were more prevalent in the dataset it was trained on than that it is able to play chess in general. It still wasn't perfect, so even these games weren't rigid enough for it to reliably perform valid moves.
ChatGPT does the pattern matching part, but not the logical part.
Just like grammar the patterns are too hard for humans to see and encode, but LLMs can encode pretty complex patterns. Domains that are easy to encode as grammars will be really easy for LLMs to solve, and the further from a grammar the harder for it.
Edit: move history can also be relevant when it comes to castling.
So it would be interesting to eliminate all opening knowledge and that way be able to qualitatively get at which aspects of chess it's actually good at, which it sucks at, and how much of its strength can be attributed to opening knowledge.
I'm still impressed by this btw. I did not expect this to be possible at all really. But being impressed is not an excuse to ignore methodological flaws. :)
As someone very clever once said, welcome to the end of the thought process.
We've established that:
1. It doesn't repeat entire games when the games go long enough
2. It does repeat a lot of opening theory
3. It seems to repeat common, partially position independent tactical sequences even when they're illegal or don't work tactically.
The author claims: chatGPT has a 1400 chess ELO based on games played.
You appear to think the author claims: chatGPT plays chess like a human rated 1400.
Your observations do not contradict the author's claim that based on games won and lost against opponents of a specific strength, the estimated ELO is 1400.
A non-human player can make illegal moves at a much higher rate and make up for that by being stronger when it does not make illegal moves to achieve the same rating as a human player who plays the game in a completely different way.
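A back-of-the-envelope version of that trade-off, as a Python sketch. The assumptions are all mine: every opponent is rated exactly 1400, a forfeited game counts as a plain loss, and the 20% forfeit rate is a made-up illustration, not a figure from the article. `required_strength` is a hypothetical helper that inverts the standard Elo expected-score formula.

```python
import math


def required_strength(rating: float, forfeit_rate: float) -> float:
    """Playing strength needed in non-forfeited games to hold `rating` overall.

    Holding a rating against equally rated opponents means scoring 50% on average,
    so the non-forfeited games must score 0.5 / (1 - forfeit_rate).
    """
    needed = 0.5 / (1.0 - forfeit_rate)
    if not 0.0 < needed < 1.0:
        raise ValueError("forfeit_rate must be at least 0 and below 0.5")
    # Invert E = 1 / (1 + 10 ** ((rating - strength) / 400)) for strength.
    return rating + 400.0 * math.log10(needed / (1.0 - needed))


# Forfeiting 1 game in 5 would need roughly 1489-level play in the rest.
print(round(required_strength(1400, 0.2)))
```

Under these assumptions the gap grows quickly as the forfeit rate approaches 50%, at which point no playing strength is enough.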
Some things about the two "it"s:
- They differ trivially.
- They enable new capabilities, such as the ability to explain why a move got made. Current chess AIs are not good at this.
So I think you're making too big a deal of a comparative triviality.
[edit]
We might be talking past each other. And some people above have come to doubt the article's results even with the right prompt engineering.
I don't understand the point of your second sentence; it seems to be entirely missing the substance of the conversation.
By the way - definitely read the article. But once again - I thought the methodology was bad, and thus the conclusion was bad.
But not going to keep replying, you engage online in a way that will turn lots of people you talk to away.
No, all unsuccessful playthroughs are resignations that impact the model's ELO.
Whether any of it is actually true is a different question.
I got the impression the author did this because they thought they were being fair with ChatGPT, but they're much more likely to be letting it off the hook than they seem to realise.
(Sorry about the "they"'s; I think the author is a guy but wasn't sure).
I don't think this is a trivial result, emulating a highly trained idiot is still very impressive. But it is very different from an untrained genius.
Tbh, I don't even know what you're saying.
[edit] OK, I might have misunderstood you. It's not always clear what people mean.
That isn't relevant to my comment, an idiot human is still a human. Your comment here therefore doesn't make sense. The comment I responded to likened it to a genius entering a new field, I objected to that, that is all.
Simply because on a new position, moves from other positions aren't applicable at all.
But it is failing the same way as a human. Humans who remember patterns and don't learn the logic make these kinds of errors in math or logic all the time.
ChatGPT is much better than humans at pattern matching; you see it right here, it can pattern match chess moves and win games! But its inability to apply logic to its output instead of just pattern matching is holding it back. As long as that isn't solved it won't be able to perform on the level of humans in many tasks. Chess might be easy enough to solve using just pattern matching and no logic that scaling it up will make it pretty good at chess, but many other topics won't be.
https://chess24.com/en/read/news/the-7-most-illegal-chess-mo...
> I decided to interpret that as ChatGPT flipping the table and saying “this game is impossible, I literally cannot conceive of how to win without breaking the rules of chess.”
Kind of sounds like anthropomorphization, but more likely the author is just papering over the glaring shortcomings to produce a compelling blog post.
It also sounds like the illegal moves were rather frequent. The 61-legal-move game sounded like an impressive outlier.
But ye, he is anthropomorphizing a lot ...
Yeah, I'm "class C", weak amateur chess player, but I think you're grossly underestimating the amount of study I put into this game. I'm not going to make an illegal move.
I guess most players would mess up 20/30 moves in.
I suspect you can't either, you can try by turning on blindfold mode on lichess and seeing how far you get.
Also people forgetting they moved the king/rook and trying to castle.
This article stated the opposite, gpt-4 couldn't play chess while gpt-3.5 could. So this is a case where the model got dumber.
But even if it doesn't play like human 1400 players, if it can get to a 1400 elo while resigning games it makes illegal moves on, that seems 1400 level to me. And I bet that some 1400s do occasionally make illegal moves (missing pins) while playing OTB.
That's not to say ChatGPT can play at 1400, just that playing in an odd way doesn't determine its rating.
Edit: Without reading everything again, I'll assume someone said "never." They're probably assuming the reader understands that "never" really means "with an infinitesimal probability," since we're talking about humans. If you're trying to argue that "some 1400 player has made an illegal move at some point," then I agree with that statement, and I also think it's irrelevant since the frequency of illegal moves made by ChatGPT is many orders of magnitude higher than the frequency of illegal moves made by a 1400 rated player.
> something a 1400 ranked player would never do
> fine, fair, "never" was too much.
I mean, yes they were and they said as much after I called them out on it. But go off on how nobody is arguing the literal thing that was being argued.
It's not like messages are threaded or something, and read top-down. You would have 100% had to read the comment I replied to first.
> He literally used the same prompt as the article.
> Claim: "ChatGPT's Chess Elo is 1400"
> Reality: ChatGPT gives illegal moves (this happened to article author too),
> something a 1400 ranked player would never do
> Result: ChatGPT's rank is not 1400.
This is a completely fair argument that makes perfect sense to anyone with knowledge of competitive chess. I have never seen a 1400 make an illegal move. He probably hasn't either. Your point is literally correct in the sense that at some point in history a 1400 rated player has made an illegal move, but it completely misses the point of his argument: ChatGPT makes illegal moves at such an astronomically high rate that it wouldn't even be allowed to play competitively, hence it cannot be accurately assessed at 1400 rating.
Imagine you made a bot that spewed random letters and said "My bot writes English as well as a native speaker, so long as you remove all of the letters that don't make sense." A native English speaker says, "You can't say the bot speaks English as well as a native speaker, since a native speaker would never write all those random letters." You would be correct in pointing out that sometimes native speakers make mistakes, but you would also be entirely missing the point. That's what's happening here.
Because all I'm hearing is talk about ChatGPT's abilities as a reply to me calling out an extreme statement as being extreme. Something the parent comment even admitted as being overly black and white.
> ChatGPT: Yes, that’s a good move for you. My next move is: Bc3, developing my pieces and attacking your pawn on c3.
I am 1400 Elo and can tell you that from a near-opening position, it's impossible to move a Bishop to c3 for either Black or White in the first, say, 10 moves, under traditional openings.
It's a (theoretically) 1400 player which plays significantly better than 1400 when it knows the lines, but makes bad or illegal moves when it doesn't, and that play averages out to be around your typical 1400 player. Functionally it's just what a 1400 player already is, but with higher highs and lower lows.
https://www.reddit.com/r/AnarchyChess/comments/10ydnbb/i_pla...
We're talking about pieces that don't exist, reappearing pieces, pieces moving completely wrong (Knight takes as if it's a Pawn), etc. etc.
---------
People are taking these example games and saying ChatGPT is 1400 strength. I don't think so. This isn't a case of "oops, I castled even though I moved my king 15 turns ago".
You need to give ChatGPT the full state (every move) on every prompt to make it play closer to 1400. The game you linked the user was giving one move at a time.
If I've been given the full state every move, I will _never_ make an illegal move as a 1400 chess player.
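For reference, "giving the full state on every prompt" just means re-sending the whole move list each turn instead of only the latest move. A minimal Python sketch, assuming moves are kept as a flat list of SAN strings; `format_moves` and `full_state_prompt` are hypothetical helper names, and the instruction text is whatever prompt you are testing with.

```python
def format_moves(moves):
    """Turn ['b4', 'd5', 'b5'] into '1. b4 d5 2. b5' (numbered move pairs)."""
    parts = []
    for i in range(0, len(moves), 2):
        pair = " ".join(moves[i:i + 2])
        parts.append(f"{i // 2 + 1}. {pair}")
    return " ".join(parts)


def full_state_prompt(instructions, moves):
    """Build a prompt that restates the instructions and the entire game so far."""
    return f"{instructions}\n{format_moves(moves)}"


print(format_moves(["b4", "d5", "b5", "a6", "b6"]))  # 1. b4 d5 2. b5 a6 3. b6
```

The model still has to reconstruct the board from that text on every turn, so this changes what it sees but not whether it tracks the position correctly.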
-----------
> O-O
> I'll play O-O as well. Your move.
Do you really think that this error would have been made at 1400 Elo? Even in blind chess? This is the 5th move of the game. I can still track the game at this point mentally.
I recognize that you're 1900 and think that all the chess players below you are n00bs, but... come on. 1400 players are stronger than this.
You seem to have missed the part where I said multiple times that a 1400 has definitely made illegal moves.
> In fact, it's so rare that in order to forfeit a game, you have to do it twice. But it never happens, ever, because pattrn has never seen it. Case closed everyone.
I actually said the exact opposite. You're responding to an argument I didn't make.
> I made no judgement on what ChatGPT can and can't do. I pointed out an extreme. Which the commenter agreed was an extreme. The rest of your comment is completely irrelevant but congrats on getting tilted over something that literally doesn't concern you. Next time, just save us both the time and effort and don't bother butting in with irrelevant opinions. Especially if you couldn't even bother to read what was already said.
The commenter's throwaway account never agreed it was an extreme. I agreed it was an extreme, but also that disproving that one extreme does nothing to contradict his argument. Yet again you aren't responding to the argument.
This entire exchange is baffling. You seem to be missing the point for a third time, and now you're misrepresenting what I said. Welcome to the internet, I guess.
> fine, fair, "never" was too much.
This is the second time I've had to do this. Do you just pretend things weren't said or do you actually have trouble reading the comments that have been here for hours? You make these grand assertions which are disproven by... reading the things that are directly above your comment.
> This entire exchange is baffling.
Yeah your inability to read comments multiple times in a row is extremely baffling.
As I said before:
> Next time, just save us both the time and effort and don't bother butting in with irrelevant opinions. Especially if you couldn't even bother to read what was already said.
I did, two hours ago, 6 minutes after your comment
Ah yes, of course, just because you never saw it means it never happens. That's definitely why rules exist around this specific thing happening. Because it never happens. Totally.
In fact, it's so rare that in order to forfeit a game, you have to do it twice. But it never happens, ever, because pattrn has never seen it. Case closed everyone.
I made no judgement on what ChatGPT can and can't do. I pointed out an extreme. Which the commenter agreed was an extreme. The rest of your comment is completely irrelevant but congrats on getting tilted over something that literally doesn't concern you. Next time, just save us both the time and effort and don't bother butting in with irrelevant opinions. Especially if you couldn't even bother to read what was already said.