Deep learning for chess(erikbern.com) |
Deep learning for chess(erikbern.com) |
When sunfish (with its just 111 python lines of chess logic) 'evaluates' a position, it uses the perhaps simplest known effective method: A piece-square table. The method ignores any interplay between pieces on the board and calculates the sum of Table[coord][type] for each piece on the table. E.g. a white knight on a1 may be worth 73 'units', and on f3 it may be worth 98 'units'. That's all there is to it. Any program which has greater than this level of precision, and equally precise searching, should be able to beat sunfish.
The above may sound naive - and it is - but actually most of the advanced ideas used in chess evaluation functions, can be generalized from this method. "Rook connection" is just a measure that includes two pieces instead of one, and "pawn shield" is the generalization to three pieces. Experiments with grandmasters reveal they "recall" positions in 'chunks' of connected pieces. And this memory is what they use to guide their search. (Papers like 'Perception in chess' and lots of newer research).
So, the role of machine learning in modern engines is to tune the parameters for evaluation and search pruning (deciding what positions are worth examining deeper). For the actual decision of which piece to move to where, you still need search algorithms to crunch millions of positions per second.
You're right that everything else equal, a better evaluation function should lead to a better chess engine. However in practice I think better evaluation functions means slower evaluation function. So there's some really interesting trade-off there. I doubt humans evaluate more than a few thousand positions, so it seems like a slow but more accurate evaluation function could play chess pretty well
Anyhow, it's fun to see how engines like these battle it out. It may also be that your approach can yield a more 'fun to play' engine for us mortals.
The author seems to have a not-very-deep understanding of computer chess. Some examples:
>Better search algorithm. I’m currently using Negamax with alpha-beta pruning, whereas Sunfish uses MTD-f
MTD-F is not better, just a different way to accomplish more-or-less the same thing. MTD-F is a binary-search equivalent of the alpha-beta family of search. In fact, naively switching to MTD-F will probably result in worse playing ability. It takes some time to get it tuned right, and even then it is not objectively better.
>Better evaluation function...By generating “harder” training examples (ideally fed from mistakes it made) it should learn a better model
This is what every beginning chess programmer on the Computer Chess Club message boards and rec.games.chess.computer has wanted to try for the last 20+ years. It has been empirically demonstrated that for best results, the evaluation function should remain simple and fast. Improving evaluation rarely fixes "dumb mistakes". That's what search is for. Efficient search makes up for a multitude of evaluation mistakes.
>Faster evaluation function: It might be possible to train a smaller (but maybe deeper) version of the same neural network
If the evaluation function was reduced to literally take zero time to execute, it would not help significantly. It's a linear improvement being thrown at an exponential problem.
I would LOVE if there was a new approach to computer chess, but the current "smart brute force" approach is so far advanced and successful, it is hard to imagine another approach being competitive.
This is far from true. An AI that only looks one-two moves ahead to make sure it doesn't hang a piece or allow mate in one would beat many amateur players (at least with ~5min time control). That's essentially what Sunfish, what he's comparing against, does. Note that Sunfish isn't a particular "good" AI, it would be more interesting comparing it to a "proper" chess AI.
Saying it can beat a terrible player doesn't mean much.
Saying it can beat Sunfish (a python program with grade of ?maybe ELO 1100? (i.e. not at all strong)) sometimes, when it has a time advantage, is not impressive.
I'd really like to know how much better (if at all) the evaluation function is - e.g. can the program beat itself, if one side uses a 'standard' evaluation function?
Machine Learning is big on measuring outcomes. It is odd that the one outcome that is important here is not measured!
Some caveats: I realise this is someone's hobby project - I do not mean to rubbish it. I'm just saying that the work&writeup could have been much improved by adding this information.
I think the take away here is how relatively easy it is to make a decent AI for a complex game using neural networks.
Is your python code e.g. 10x times slower than C code? With a branching factor of e.g. 10, that would mean you only search to depth 9 ply whilst C would search to 10 ply [a]. You are not losing that much! A ballpark ELO grade would not vary qualitatively between python and C.
[a] With alpha-beta search, maybe it is 8 ply vs 10 ply. Also the branching factor is larger than 10. But the point remains valid.
if you had infinite computing capacity,
you could actually solve chess.
The statement is correct. But you do not necessarily need infinite capacity to solve chess. Just enough capacity.Would be interesting to estimate how much capacity.
So effectively infinite.
The number of states a chess field can have is 13. It's either empty or has one of the 6 different pieces in black or white on it.
So 13^64 is an upper limit for the number of positions.
We could solve chess if we could put these 13^64 positions into a tree, right?
13^64 = 10^71.
The number of atoms in the observable universe is estimated to be 10^80.
So even the observable universe might be big enough to form this tree. Even if we use big clunky objects like atoms.
We do not have any idea of the size of the unobservable universe.
And I don't know how many states an atom can have. Who knows, maybe a single atom can solve chess if it's programmed correctly? According to quantum theory, pretty small objects like electrons can store and process an amazing (infinite?) amount of information in certain ways.
Chess requires 100% accuracy and in which just because positions are similar, doesn't mean that best moves in these positions have to be in any way similar too.
On the other hand, it sort of mimics the way human player thinks, in terms of recognizing certain patterns. After all, even grandmasters do not bruteforce their way through all possible combinations. We use a hybrid approach: recognize certain strategic patterns first (to drastically narrow down the search tree), and perform calculations on the top of that.
Chess engines can wipe the floor with any player where tactics is involved; the trick of beating a computer is to close the game and take advantage of the fact that it's not able to formulate a long-term PLAN (whose consequences are beyond its horizon).
See how Nakamura repeatedly beat Rybka in blitz games a few years ago, eg.: http://www.chessgames.com/perl/chessgame?gid=1497429 - very instructive :)
To alleviate this, one can add more abstract/heuristic information about the position to the input (indicators for complex relations between several pieces). This kind of high-dimensional vector would be more robust to small changes, and make the objective function more smooth. Perhaps the non-linearities introduced by the three layers cannot do this as effectively.
Most interesting to me is that it really isn't that hard to create a program to solve chess (i.e. the logic behind it), it just would take too much time/money to actually do it.
It's much more difficult to create AIs and approximations like this.
Kinda weird once you realize that fact...approximating a solution to chess is much more difficult, logic wise, than actually solving chess.
Though I wouldn't be surprised if chess is solved in the next couple decades or so.
This would be a very easy way of getting more training data, and is actually very nice theoretically. Assuming enough training time and a complex enough evaluation function, etc, you'd eventually solve chess.
I may check out the code and try this myself...
See [2] for a twist on the DeepMind Atari player. They use Monte Carlo Tree Search (MCTS of automated Go playing fame) to generate training data. By feeding that more carefully generated gameplay data into the deep q-learning net, they exceed DeepMind's (non-MCTS-coupled) performance.
1. http://karpathy.github.io/2014/07/03/feature-learning-escapa...
2. http://www-personal.umich.edu/~rickl/pubs/guo-singh-lee-lewi...
My babble: if DeepPink can gage its uncertainty on a move, it'd be cool to see a hybrid system in-action. Plus, "DeepFish" has a cool name.
Either way - nice! And thanks for putting the source on GitHub; I will have a goof with it!
Deciding whether a single chess position is reachable can be really hard. There is a whole class of so called "retrograde" chess problems focussed on that.
For example, is the following position reachable?
White: Kc3 Ba4 Black: Kd1 Rb5 Bd5
That being said, I'm not convinced there's anything "magic" about the fast deep search. There's a broad trend of big complex hand-tuned models being replaced by simple models with more parameters and many many orders of magnitude more training data (eg. machine translation, computer vision). There's probably a lot of domains where this approach doesn't work though. Maybe chess is one of those applications. We don't really know until we try :)
I've been associated with computer chess for a while, and was even given a shout-out by Vas Rajlich for optimization discussion prior to the release of Rybka 3.
I think this critic's view on the primacy of fast evaluation, at the expense in accuracy is off base. Looking at the two top engines today, Stockfish and Komodo, Stockfish has a great search function, but a poor eval, while Komodo's eval is better, but it's search isn't as good. It's pretty clear that bad evaluation is limiting, even at the leaves of a deep search.
Anyway, I have more basic questions about why you did what you did. First, it seems that your use of 12 x 64 sets of values to represent state information is non-optimal and even not sufficient. It seems that you are introducing a lot of unnecessary and undesirable additional degrees of freedom by duplicating the piece-square state inputs for white and black. When I've done this in the past, I used 6 x 64 sets with positive ones for squares with a white piece and negative ones for squares with a black piece. Do you see any advantage in not taking advantage of this symmetry?
Second, you really need side to move, castling state for each side, and the pesky en passant indicator (just as is required in a FEN representation). Luckily these don't significantly increase the number of inputs.
I worked on a similar project with a friend at University of Warsaw a number of years back. We generated static evals using FANN based on the complete input state, and trained with the value of the search plus eval after 30 seconds per move We used Rybka 2.2n for this, which was the strongest engine available at the time.
We both moved on after some preliminary success, mainly because there wasn't any way to conceivably make a top engine due to the closed nature of the top engines at that time. This is no longer an issue, and if someone had a black box evaluation function that produced a better eval than the current Stockfish static eval, it would be trivial to do the substitution.
Best Regards, Alan
Landauer's principle says we need at least 10^-21 Joules to change a single bit. That means we need 10^99 Joules to run through all our games. The mass-energy of the observable universe is 10^69 Joules.
Therefore, if we could convert one million trillion trillion observable universes entirely into energy, then we'd have enough to run through all our chess games on our theoretically perfect computing machine.
I'm not sure we'll get that done in the next couple of decades.
[of course, there may well be shortcuts that we can take to cut down that number a bit!]
You don't need to know the best move in every position of every possible chess game in order to play perfectly.
Eg. playing with White it is enough for you to always know what your best move is. And once you stick to it, then 99.9999... of these possible games will NEVER occur, because that would require White making at least one non-perfect move on the way.
I'm curious to see how the solution to chess will play out (hoping it happens in my lifetime).
As for what the acronym stands for, heck if I know.
The problem with deep models is when you end up having more than 1 hidden layers, you have a big matrix multiplication to get between the layers. If your hidden layers are a few thousand units, that's still pretty slow. Doing things in minibatches or on the GPU speeds it up significantly, but I'm guessing it's still orders of magnitudes slower than whatever Stockfish uses
> Faster evaluation function: I didn’t use the GPU for playing, only for training.
So indeed it could probably be optimized. But first I would profile as it can be the case that the bottleneck is Matrix Matrix multiplication in numpy which already delegates computation to an optimized third party library (e.g. OpenBLAS). Maybe it's worth using the GPU at play time as well.
The difficulty there is that sending data to and from the GPU is slow. You need to avoid data transfers, which might mean trying to do everything on the GPU. But the SIMD cores on the GPU are likely to perform poorly due to all the branch statements in chess code.
[1]:https://chessprogramming.wikispaces.com/Forsyth-Edwards+Nota...
Generally you aim to give the GPU a 'large' task (or many small tasks), then ask for the answer(s) in 1 batch and wait while it is transferred across.
If you have many small tasks where each answer is sent back separately, and you need that answer before requesting the next task, then you will be very slow, even if the data sent/received is small. A naive implementation of the chess algorithm here (with alpha-beta pruning (which has many if-then branches)) would be like this :/
Your conclusion is correct by a huge margin but the argument is flawed. There is state that is not in the board position:
- white or black to move?
- is an en passant capture possible?
- is castling still possible?
- how many moves since the last lawn move or capture?That brings us to an upper limit of 10^77 states the board can be in.
Still enough atoms in the observable universe to assign one to each state.
I think you need fewer than 17 bits: 1 for "who is to move", 4 for castling, 6.6 for the 50 moves (100 ply) rule would leave less than 6 for en passant. There potentially are 7 possible en passant moves (white pawns at a5, c5, e5, g5 and black ones at b5, d5, f5, and h5 or vice versa), but all you need to store is "the n-th pawn just moved" with n in 0..8. That is just over 3 bits.
There is a way more significant flaw in your logic, though: the "three times the same board position with the same player to move is a draw" rule. That explodes your estimate, as there could be (if we do not call in additional chess rules) up to 10^71 positions hat have been visited earlier.
With that in mind, I am not sure your limit is correct. There are at most 96 pawn moves in a game (16 pawns walk in 6 moves to the other side of the board) and 30 captures (7 pieces each plus 8 promoted pawns each). With the 50-move rule, that gives at most 126 x 50 = 6300 moves in a game. Each move can be encoded in 12 bits (starting and ending square for white's and black's move), so we need 75600 bits. Taking 2^10 = 1000, that gives me 10^22680 possible games as an upper limit.
That is different from what http://www.chess.com/article/view/the-open-file---is-chess-i... or http://members.iinet.net.au/~ray/Chessgames.htm (several estimates) claim, but (if we assume that zero is a misprint) close to Hardy's upper limit, so at least I am in good company.
The tree would start from the starting position and then branch out to all legal moves in each step. So each node in the tree would have the previous moves encoded in it's position in the tree.
So my initial upper limit of 10^71 nodes might hold true. No need to encode information about castling, black or white to move, en passant etc.
Repeat positions are another issue though. Do we have to encode them? My 10^71 tree would not contain them. At first I thought we can leave them out. Now I'm not so sure anymore.
A move that leads to a repeat position is certainly not necessary to win a game. You could play the winning moves right away.
But it can be necessary to force a draw. Hmm...
You might be right. Maybe the "repeat position" issue kills the upper limit of 10^71 nodes in the tree of chess moves. So we have to resort to your upper limit of 10^22680.