Reinforcement learning agents have mastered games like Go that provide complete information about the state of the game to both players. They’ve also excelled at Texas Hold ’Em poker, which provides incomplete information, since each player’s hole cards remain hidden. Recent work trained an agent to excel at a popular board game that, like poker, provides incomplete information but, unlike poker, involves long-term strategy.
What’s new: Julien Perolat, Bart De Vylder, Karl Tuyls, and colleagues at DeepMind teamed up with former Stratego world champion Vincent de Boer to conceive DeepNash, a reinforcement learning system that reached expert-level capability at Stratego.
Stratego basics: Stratego is played by two opposing players. The goal is to capture the opponent’s flag by moving a piece onto the space that contains it. The game starts with a deployment phase, in which each player places 40 pieces on the board: pieces that represent military ranks, along with a flag and bombs. The pieces face away from the opposing player, so neither player knows the other’s starting formation. The players move their pieces in turns and can attack by moving a piece onto a space occupied by an opponent’s piece, which reveals the rank of the opponent’s piece. If the attacking piece has a higher rank, the attack succeeds and the opponent’s piece is removed from the board. If the attacking piece has a lower rank, the attack fails and the attacking piece is removed.
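The attack rule lends itself to a tiny illustration. Below is a minimal sketch (not DeepMind’s code) that assumes ranks are encoded as integers; special cases such as bombs and equal-rank clashes, which the article doesn’t cover, are omitted.

```python
# Minimal sketch of the attack rule described above (not DeepMind's code).
# Ranks are assumed to be integers; bombs and equal-rank clashes are omitted
# because the article doesn't cover them.
def resolve_attack(attacker_rank: int, defender_rank: int) -> str:
    """Return which piece is removed when an attacker moves onto a defender's square."""
    if attacker_rank > defender_rank:
        return "defender removed"   # attack succeeds
    return "attacker removed"       # attack fails

print(resolve_attack(7, 4))  # "defender removed"
```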
Key insight: A reinforcement learning agent like AlphaGo learns to play games through self-play; that is, it plays iteratively against a copy of itself, adjusts its weights according to the rewards it receives, and, after an interval of learning, adopts the weights of the better-performing copy. Typically, each copy predicts the potential outcome of every possible action and chooses the one that’s most likely to confer an advantage. However, this approach can go awry if one of the copies learns to win by exploiting a vulnerability that’s idiosyncratic to the agent but not shared by human players. That’s where regularization can help: To prevent such overfitting and encourage a more general strategy, previous work showed that it helps to reward an agent not only for good moves and winning but also for assigning action probabilities close to those of an earlier version of itself. Updating this earlier version periodically enables the agent to keep improving.
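A rough sense of this regularized reward can be given in a few lines. The sketch below illustrates the idea rather than DeepMind’s exact formulation; the function name regularized_reward and the coefficient eta are placeholders.

```python
import math

# Sketch of the regularization idea described above (not DeepMind's exact
# formulation): the usual game reward is augmented with a term that is higher
# when the current policy's action probability stays close to that of a
# periodically updated reference ("earlier") policy. eta is a hypothetical
# hyperparameter controlling the strength of the regularization.
def regularized_reward(game_reward: float,
                       prob_current: float,
                       prob_reference: float,
                       eta: float = 0.1) -> float:
    # Penalize the agent in proportion to how far its action probability
    # drifts above the reference policy's probability for the same action.
    drift = math.log(prob_current / prob_reference)
    return game_reward - eta * drift

# Example: the agent captured a piece (+1) but assigned the action a much
# higher probability than the reference did, so the reward is reduced slightly.
print(regularized_reward(1.0, prob_current=0.6, prob_reference=0.3))
```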
How it works: DeepNash comprised five U-Net convolutional neural networks. One produced an embedding based on the current state of the game board and the most recent 40 previous states. The remaining four U-Nets used the embedding as follows: (i) during training, to estimate the total future reward to be expected after executing a deployment or move, (ii) during the game’s deployment phase, to predict where each piece should be deployed, (iii) during the play phase, to select which piece to move and (iv) to decide where that piece should move.
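The sketch below shows how a single embedding network can feed four separate heads, roughly mirroring the layout described above. The layer sizes, board encoding (one plane per board state), and plain convolutional stacks are assumptions; the actual system uses U-Net architectures.

```python
import torch
import torch.nn as nn

# Simplified sketch of the five-network layout described above (layer sizes,
# board encoding, and head shapes are assumptions; DeepNash uses U-Nets,
# replaced here by plain convolutional stacks for brevity).
BOARD = 10          # Stratego is played on a 10x10 board
CHANNELS = 41       # current state plus 40 recent states, one plane each (assumed encoding)

def conv_stack(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, out_ch, 3, padding=1),
    )

class DeepNashSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = conv_stack(CHANNELS, 32)             # (1) shared embedding of board history
        self.value = nn.Sequential(nn.Flatten(),          # (2) estimate of total future reward
                                   nn.Linear(32 * BOARD * BOARD, 1))
        self.deploy = conv_stack(32, 1)                   # (3) where to place each piece (deployment phase)
        self.select = conv_stack(32, 1)                   # (4) which piece to move (play phase)
        self.target = conv_stack(32, 1)                   # (5) where the selected piece should move

    def forward(self, board_history):
        z = self.embed(board_history)
        return (self.value(z),
                self.deploy(z).flatten(1).softmax(-1),    # probability per board square
                self.select(z).flatten(1).softmax(-1),
                self.target(z).flatten(1).softmax(-1))

# Example forward pass on a random board encoding.
model = DeepNashSketch()
value, deploy_probs, select_probs, target_probs = model(torch.randn(1, CHANNELS, BOARD, BOARD))
```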
- The authors copied DeepNash’s architecture and weights to use as a regularization system, which was updated periodically.
- DeepNash played a game against a copy of itself. It recorded the game state, actions (piece positions and moves), rewards for actions, and probabilities that those actions would be advantageous. It received a reward for taking an opponent's piece and a higher reward for winning. It also received a reward based on how well its probabilities matched the regularization system’s.
- The authors trained DeepNash for a fixed number of steps to estimate the total future reward for a given action and take actions likely to bring higher total future rewards.
- They updated the regularization system using DeepNash’s latest weights, then repeated the self-play process. They stopped when the regularization system’s weights no longer changed, a signal that, in game-theoretic terms, the system had converged to an approximate Nash equilibrium (a strategy that can’t be exploited). A simplified sketch of this loop follows.
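A high-level sketch of the training loop, under toy assumptions, might look like the following. The policy network, the self-play routine play_self_play_game, the step counts, and the coefficient eta are all placeholders; only the loop structure follows the description above.

```python
import copy
import torch

# High-level sketch of the training loop described above (toy policy, toy
# self-play routine, and hyperparameters are placeholders; only the structure
# follows the article).
policy = torch.nn.Linear(8, 4)                      # stand-in for DeepNash's networks
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
reg_policy = copy.deepcopy(policy)                  # regularization copy of the agent
eta = 0.1                                           # weight of the regularization reward (assumed)

def play_self_play_game(policy):
    """Placeholder: return (states, actions, rewards) from a game against a copy of itself."""
    states = torch.randn(16, 8)
    actions = torch.distributions.Categorical(logits=policy(states)).sample()
    rewards = torch.randn(16)                       # rewards for captures and for winning
    return states, actions, rewards

for iteration in range(100):                        # each iteration updates the regularization copy
    for step in range(50):                          # fixed number of self-play training steps
        states, actions, rewards = play_self_play_game(policy)
        logp = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
        with torch.no_grad():
            logp_ref = torch.distributions.Categorical(logits=reg_policy(states)).log_prob(actions)
        # Extra reward for matching the regularization copy's action probabilities.
        shaped_rewards = rewards - eta * (logp.detach() - logp_ref)
        loss = -(logp * shaped_rewards).mean()      # push up probability of well-rewarded actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Update the regularization copy with the latest weights; stop once it barely changes.
    with torch.no_grad():
        change = sum((p - q).abs().sum() for p, q in zip(policy.parameters(), reg_policy.parameters()))
    reg_policy = copy.deepcopy(policy)
    if change < 1e-3:
        break
```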
Results: DeepNash beat the most powerful Stratego bots on the Gravon game platform, winning 97.1 percent of 800 games. It beat Gravon’s human experts 84 percent of the time, ranking third as of April 22, 2022. Along the way, it developed deceptive tactics, fooling opponents by moving less-powerful pieces as though they were more powerful and vice-versa.
Why it matters: Reinforcement learning is a computationally inefficient way to train a model from scratch, yet it can find good solutions among a plethora of possibilities. It mastered Go, a game with 10^360 possible states, and it predicts protein shapes from among 10^300 possible configurations of amino acids. DeepNash sends the message that reinforcement learning can also handle Stratego’s astronomical number of states, roughly 10^535, even when much of each state is hidden from the players.
We’re thinking: DeepNash took advantage of Stratego’s imperfect information by bluffing. Could it have developed a theory of mind?