

Welcome to the AlphaGo Replication Wiki

Here you can learn more about how this project is structured, and a little about how DeepMind's AlphaGo itself works.

How AlphaGo Works

https://www.deepmind.com/alpha-go.html

DeepMind's AlphaGo is a combination of Monte Carlo Tree Search (MCTS) with convolutional neural networks. Three networks actively contribute to finding a move; the first two are listed here, and the third (a fast rollout policy) is described further down:

  • the policy network guesses where an expert would play. Think of this network as memorizing patterns that it has seen others play but without any sense of the value of playing there.
  • the value network estimates the probability of winning from the current position. It is perhaps analogous to an expert's 'intuition' of preferring one position to another.

Together, these can be used to choose a move: play out a short series of reasonable moves, evaluate the resulting positions, and pick the best option. This is what AlphaGo's MCTS does - it follows the search tree in the most promising directions, where how 'promising' a branch is combines the expert policy's suggestion with the estimated value. A rough sketch of such a selection rule is shown below.
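To make that concrete, here is a minimal, illustrative sketch of selecting the most promising child in a search tree that mixes a policy prior with an average value. The names (`TreeNode`, `c_puct`, `select_child`) are placeholders for this wiki, not the project's actual implementation.

```python
import math

class TreeNode(object):
    """Illustrative search-tree node: stores the policy network's prior
    for the move leading here, plus running evaluation statistics."""

    def __init__(self, prior):
        self.prior = prior        # P(s, a) suggested by the policy network
        self.visits = 0           # N(s, a): how often this branch was explored
        self.total_value = 0.0    # sum of evaluations backed up through this node
        self.children = {}        # move -> TreeNode

    def mean_value(self):
        # Q(s, a): average evaluation seen so far (0 if never visited)
        return self.total_value / self.visits if self.visits else 0.0

def select_child(node, c_puct=5.0):
    """Pick the most 'promising' child: exploit branches with high average
    value, but bias exploration toward moves the policy network likes and
    that have not been visited much yet."""
    total_visits = sum(child.visits for child in node.children.values())

    def score(child):
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.mean_value() + exploration

    return max(node.children.items(), key=lambda item: score(item[1]))
```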

The final 'network' is a faster policy function. It is not what you would think of as a typical neural network - it is closer to a multinomial logistic regression for choosing the next move. This rollout policy is not nearly as good as the deeply-learned policy, but it is much, much faster. Since the good policy function is so slow, it is only used to search ahead about 20 moves. The value network alone could evaluate how good the board looks after those 20 moves, but AlphaGo augments that estimate by also using the fast rollout policy to play the position out all the way to the end of the game and simply seeing who wins; the two evaluations are then blended. A sketch of this blending follows.
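Here is a minimal sketch of that leaf evaluation. It assumes a generic board object with `copy()`, `is_over()`, `play()`, `winner()` and `current_player`, plus callables for the value network and rollout policy; all of these names are hypothetical stand-ins, and `mixing` plays the role of AlphaGo's mixing parameter.

```python
def fast_rollout(board, rollout_policy):
    """Play the position out to the end of the game using only the fast
    rollout policy, and report +1 for a win / -1 for a loss from the
    point of view of the player to move in the original position."""
    current = board.copy()
    while not current.is_over():
        move_probs = rollout_policy(current)      # move -> probability
        move = max(move_probs, key=move_probs.get)  # greedy here; sampling also works
        current.play(move)
    return 1.0 if current.winner() == board.current_player else -1.0

def evaluate_leaf(board, value_network, rollout_policy, mixing=0.5):
    """Blend the value network's estimate with the result of a fast
    rollout played to the end of the game."""
    v = value_network(board)                      # estimated outcome in [-1, 1]
    z = fast_rollout(board, rollout_policy)       # actual rollout outcome
    return (1 - mixing) * v + mixing * z
```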

The networks here aren't just processing stone positions or images of boards. A given board position is preprocessed into features that the neural network can make better use of. Part of the success of AlphaGo is attributable to carefully choosing which features to use. More complex features can be informative but at the expense of having to compute them for each position in the search tree.
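As an illustration of what 'preprocessing into features' means, here is a sketch of one very simple feature: three binary planes marking the current player's stones, the opponent's stones, and empty points. The board interface (`board[(x, y)]` returning 'b', 'w' or None, and `board.current_player`) is assumed for this example only; the real feature set also includes things like liberty counts and capture information.

```python
import numpy as np

def stone_colour_planes(board, size=19):
    """Encode a position as three binary size-by-size planes:
    the current player's stones, the opponent's stones, and empty points."""
    planes = np.zeros((3, size, size), dtype=np.float32)
    for x in range(size):
        for y in range(size):
            stone = board[(x, y)]         # 'b', 'w', or None (assumed interface)
            if stone is None:
                planes[2, x, y] = 1.0
            elif stone == board.current_player:
                planes[0, x, y] = 1.0
            else:
                planes[1, x, y] = 1.0
    return planes
```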

The other big development contributing to AlphaGo's success is how these networks are trained, using a combination of supervised and reinforcement learning: the policy network is first trained to predict moves from expert games, then improved through self-play against earlier versions of itself, and the value network is trained on positions generated by that self-play.
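To give a flavour of the supervised half of that training, here is a sketch of a small convolutional policy network trained to predict the expert's move with a cross-entropy loss. It assumes Keras is installed; the layer sizes and optimizer settings are illustrative only, not the architecture from the paper or this project.

```python
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

def build_policy_network(num_feature_planes, board_size=19):
    """Toy convolutional policy network: feature planes in, a probability
    distribution over the board_size * board_size points out."""
    model = Sequential([
        Conv2D(64, (5, 5), padding='same', activation='relu',
               data_format='channels_first',
               input_shape=(num_feature_planes, board_size, board_size)),
        Conv2D(64, (3, 3), padding='same', activation='relu',
               data_format='channels_first'),
        Flatten(),
        Dense(board_size * board_size, activation='softmax'),
    ])
    # Supervised phase: maximize the likelihood of the expert's move, i.e.
    # minimize categorical cross-entropy against one-hot move labels.
    model.compile(optimizer='sgd', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Usage (with positions preprocessed into feature planes and moves one-hot encoded):
# model = build_policy_network(num_feature_planes=3)
# model.fit(expert_positions, expert_moves, batch_size=16, epochs=1)
```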

Milestones and Initial Development Timeline

Keep in mind that each training phase depends on the previous one, and each phase of training took DeepMind on the order of weeks to complete with access to significant computing power. Training may be a bigger bottleneck than implementing new functionality.

3/21 Training Phase 1: supervised training of policy

4/4 Training Phase 2: self-play training of policy

4/18 Training Phase 3: augmented self-play training of value

5/? First full prototype
