# Understanding AlphaGo

Datetime: 2017-03-29 05:31:42 · Topic: AlphaGo

## How AI beat us in Go — game of profound complexity

One of the required skills of an Artificial Intelligence engineer is the ability to understand and explain highly technical research papers in the field. One of my projects as a student in the AI Nanodegree program is an analysis of a seminal paper in game playing. The target of my analysis was the Nature paper on the technical side of AlphaGo, the Google DeepMind system which, for the first time in history, beat an elite professional Go player, winning 5 games to 0 against the European Go champion, Fan Hui.

The goal of this summary (and my future publications) is to make this knowledge widely understandable, especially for those who are just starting their journey in the field of AI or those who don't have any experience in this area at all.

The original paper is Mastering the game of Go with deep neural networks and tree search.

### AlphaGo and the numbers

AlphaGo is a narrow AI created by the Google DeepMind team to play (and win) the board game Go. Before it was presented publicly, predictions based on the state of the art said we were about a decade away from having a system with AlphaGo's skills (the capability to beat a professional human Go player).

But why is it such a big deal to win the oldest Chinese board game, whose rules are simpler than those of Chess, a game AI has dominated for about 20 years now?

Go and Chess are games of perfect information, which means that each player is fully informed of all the events that have previously occurred. Such games have an optimal value function, which determines the outcome from every board state/position, assuming that each player makes the best possible move. To solve a game of perfect information, an agent has to evaluate each possible move by recursive simulation. The simulations form a search tree containing approximately b^d (b raised to the power d) possible sequences of moves, where b is the number of legal moves per position and d is the game length. The approximate numbers are b ≈ 35, d ≈ 80 for Chess (about 10¹²³ different games) and b ≈ 250, d ≈ 150 for Go (about 10³⁶⁰ different games). An average multi-core gigahertz processor can do about 10⁹ operations per second, so evaluating all possible moves in Go would take longer than the Universe will exist. Math has no mercy.
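These orders of magnitude are easy to reproduce. A minimal sketch of the b^d arithmetic (the b and d values are the paper's rough figures; the helper name is mine):

```python
import math

# Rough game-tree sizes: b = branching factor, d = typical game length;
# the full tree has about b**d distinct move sequences.
def tree_size_log10(b, d):
    return d * math.log10(b)

chess = tree_size_log10(35, 80)    # ~10^123 games
go = tree_size_log10(250, 150)     # ~10^360 games

# At 10^9 operations per second, enumerating ~10^360 sequences would take
# roughly 10^(360 - 9) seconds -- vastly longer than the age of the Universe.
seconds_needed_log10 = go - 9
print(round(chess), round(go), round(seconds_needed_log10))
```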

The highest-level goal for AlphaGo and other AI game agents is to reduce the search space effectively, down to a number of possible games (move sequences leading to the end of the game) that is computable in a reasonable time (5 s of computation per move for AlphaGo).

To estimate the value of each game state, AlphaGo uses the Monte Carlo tree search (MCTS) heuristic: an analysis of the most promising moves that expands the search tree based on random sampling of the search space. The application of MCTS to games is based on many playouts, in which the game is played out to the very end by selecting moves at random. The result of each playout is then used to weight the nodes in the game tree, so that better nodes are more likely to be chosen in future playouts.
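To make the playout idea concrete, here is a minimal MCTS on a toy game (a Nim variant: remove 1-3 stones, taking the last stone wins). This is a generic textbook MCTS with random playouts, not AlphaGo's version; all names and constants are mine:

```python
import math, random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones, self.player = stones, player
        self.parent, self.move = parent, move
        self.children, self.wins, self.visits = [], 0, 0

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in (1, 2, 3) if m <= self.stones and m not in tried]

def uct(node, c=1.4):
    # Pick the child maximizing win rate plus an exploration bonus.
    return max(node.children, key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def playout(stones, player):
    # Play random moves to the very end; return the winner.
    while True:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return player          # the player who took the last stone wins
        player = -player

def mcts(root_stones, iters=2000):
    root = Node(root_stones, player=1)
    for _ in range(iters):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.untried_moves() and node.children:
            node = uct(node)
        # 2. Expansion: add one untried child.
        moves = node.untried_moves()
        if moves:
            m = random.choice(moves)
            child = Node(node.stones - m, -node.player, node, m)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node.
        if node.stones == 0:
            winner = -node.player  # the player who just moved won
        else:
            winner = playout(node.stones, node.player)
        # 4. Backup: a node's wins count wins for the player who moved into it.
        while node:
            node.visits += 1
            if winner == -node.player:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts(3))  # from 3 stones, taking all 3 wins immediately
```

Even with purely random playouts, the win statistics steer the search toward the immediately winning move; AlphaGo replaces the random move choice with learned policies.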

MCTS enhanced with additional policies (e.g. predicting human expert moves) made it possible to reach strong-amateur-level play.

Further improvements are based on well-pretrained deep convolutional networks, which are currently widely used in image classification, face recognition, and playing Atari games. Side note: Facebook's intro-to-AI videos can be a good starting point for understanding how deep learning and convolutional networks work.

The goals for the neural networks used in AlphaGo are effective position evaluation (the value network) and action sampling (the policy network).

### Policy network

The policy network is trained in a couple of stages of machine learning.

The first stage is Supervised Learning (SL); it extends MCTS solutions for predicting human expert moves. The SL policy network was trained on 30 million positions from the KGS Go Server. The trained network was able to predict expert moves with 57% accuracy (the best result from other research teams at the time was 44.4%).
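The input/output contract of a policy network is simple to sketch: a board encoding goes in, a probability distribution over the 19×19 = 361 points comes out. The real SL network is a deep CNN over 48 input feature planes; in this toy sketch a single random linear layer stands in for the whole stack, just to show the shapes involved (all names and numbers besides 48 and 19 are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
board_planes = rng.random((48, 19, 19))     # stone colours, liberties, ...

# Stand-in for the trained CNN: one linear layer from features to 361 scores.
W = rng.standard_normal((361, 48 * 19 * 19)) * 0.01
logits = W @ board_planes.ravel()

# Softmax turns the scores into move probabilities p(a | s).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

best_move = divmod(int(probs.argmax()), 19)  # (row, col) of the likeliest move
print(best_move)
```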

While the improved accuracy increased playing strength, evaluation became significantly slower. Because of this, another policy was trained: the fast rollout (FR) policy. It is less accurate (24.2%) but much faster (2 μs vs 3 ms per move), and it is used to run rollouts to the end of the game when evaluating nodes in MCTS.

The second stage of training is Reinforcement Learning (RL). While the SL policy network is good at predicting the most likely next moves, RL helps with predicting the best possible (winning) moves. At this stage AlphaGo played 1.2 million games against randomly selected previous iterations of itself (this increases stability by preventing overfitting to a single opponent).

The Reinforcement Learning policy was able to win more than 80% of games against the Supervised Learning policy and about 85% of games against Pachi, an open-source program based on Monte Carlo tree search heuristics, ranked at 2nd amateur dan on KGS. Previous SL-only solutions were able to win about 11% of games against Pachi.
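The core of the RL stage is a REINFORCE-style update: after a self-play game, the log-probability gradient of every move played is scaled by the outcome z = +1 (win) or -1 (loss). A toy sketch under my own setup, with a 3-action softmax "policy" and one action that secretly always wins, shows the rule concentrating probability on the winning move:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(3)                    # policy parameters, one logit per move

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.1
for game in range(500):
    p = softmax(theta)
    a = rng.choice(3, p=p)             # sample a move from the policy
    z = 1.0 if a == 2 else -1.0        # pretend move 2 always wins the game
    grad_log = -p
    grad_log[a] += 1.0                 # gradient of log p(a | theta)
    theta += lr * z * grad_log         # reinforce moves from won games

print(softmax(theta))                  # mass concentrates on action 2
```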

### Value network

The final stage of training focuses on position evaluation: estimating the probability that the current position leads to a win. Training on the KGS dataset led to overfitting (the value network memorized game outcomes rather than generalizing to new positions), so to prevent this, a new training set was generated from self-play data (30 million distinct positions, each sampled from a separate game).

The trained value function was more accurate than Monte Carlo rollouts using the rollout policy, and a single evaluation approached the accuracy of Monte Carlo rollouts using the Reinforcement Learning policy, while using about 15,000 times less computation.
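The value network's objective is a regression: fit a scalar v(s) to the game outcome z ∈ {-1, +1} by minimising the squared error. A toy sketch with my own synthetic data, where the winner is hidden in the first "position feature", shows gradient descent on this loss recovering the signal:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 8))     # one row of position features per game
z = np.sign(X[:, 0])                   # outcome hidden in feature 0

w = np.zeros(8)
lr = 0.01
for _ in range(300):
    v = np.tanh(X @ w)                 # value estimate in (-1, 1)
    # Gradient of mean (v - z)^2 through the tanh.
    grad = X.T @ ((v - z) * (1 - v ** 2)) / len(X)
    w -= lr * grad

accuracy = np.mean(np.sign(np.tanh(X @ w)) == z)
print(accuracy)
```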

### Policy and value networks search

AlphaGo combines the policy and value networks in Monte Carlo tree search. The game tree is searched in simulations composed of four phases:

1. Selection: the simulation traverses the tree by selecting edges with the maximum action value Q (how good the move is).
2. Expansion: when a leaf node is expanded, it is processed once by the SL policy network to get prior probabilities for each legal action.
3. Evaluation: each leaf node is evaluated both by the value network and by a rollout played with the FR policy.
4. Backup: action values Q are updated with the values collected during the evaluation step.

When the time dedicated to the game agent runs out, it chooses the move with the maximum visit count, which is less sensitive to outliers than picking the highest action value Q directly.
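Two formulas tie the phases above together: the leaf evaluation mixes the value network with the rollout result using a weight λ = 0.5 (the paper's value), and the selection rule adds to Q a bonus that favours moves with a high prior but few visits. A sketch with toy numbers (the constant c and all the example values are mine):

```python
import math

def leaf_value(value_net_v, rollout_z, lam=0.5):
    # Evaluation phase: blend the value network with the FR-policy rollout.
    return (1 - lam) * value_net_v + lam * rollout_z

def selection_score(Q, P, N_parent, N_edge, c=5.0):
    # Selection phase: action value Q plus a bonus that favours moves with a
    # high prior P (from the policy network) but a low visit count N_edge.
    return Q + c * P * math.sqrt(N_parent) / (1 + N_edge)

v = leaf_value(value_net_v=0.6, rollout_z=1.0)
# A rarely visited move with a strong prior can outrank a well-explored one:
a = selection_score(Q=0.50, P=0.40, N_parent=100, N_edge=2)
b = selection_score(Q=0.55, P=0.05, N_parent=100, N_edge=60)
print(v, a > b)
```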

### Results

AlphaGo's performance is widely known. In 2015 it won against the professional human player Fan Hui, a multiple-time winner of the European Go championship. It was the very first time a machine defeated a professional in the full game of Go. In a simulated tournament AlphaGo also won 99.8% of matches against the other best Go programs (Crazy Stone, Pachi, Zen, and others). The distributed version of AlphaGo wins 77% of games against single-machine AlphaGo, which means it is not only performant but also scales very well.

During the match with Fan Hui, AlphaGo evaluated thousands of times fewer positions than Deep Blue did in its chess match against Kasparov, which suggests that its positions are selected more intelligently.

All of these conclusions clearly show that, thanks to AlphaGo, we are another step closer to general AI and the Singularity.

The future is coming.