Dots-and-Boxes (with AlphaZero)

The Dots-and-Boxes game
Dots-and-Boxes is a popular combinatorial game that can be played with only paper and pencil. It is played on a rectangular grid of dots forming r × c squares (“boxes”). The two players take turns drawing a horizontal or vertical line between a pair of neighboring dots. If a player closes a box, that box is assigned to that player and she must draw another line. The player who owns the most boxes wins the game. While the game has simple rules, it has a large number of possible states (more specifically, ∣S∣ = 2^(r(c+1)+(r+1)c) states), and an even larger number ((log₂ ∣S∣)!) of unique games can be played. No general strategy for optimal play is known. Currently, the 4 × 5 game is the largest instance that has been solved.

Goal
Solve a tiny version of Dots-and-Boxes using the minimax algorithm. The minimax algorithm is a basic algorithm in game theory for deciding which move is optimal for the current player. By recursively exploring the entire tree of possibilities, the minimax algorithm computes, for each state, a value indicating how good it is to reach that position. You can start from a naive template available in the repository. You will notice that this game has a very wide game tree, which makes minimax slow even for tiny grids. Your task is to optimize the provided template for the minimax algorithm. This optimization requires two steps:

1. Implement transposition tables for the minimax algorithm.
Transposition tables are a standard technique to improve the minimax algorithm. The intuition is that the same state can be reached via different trajectories. Caching the value of a state therefore avoids re-searching the game tree below that state when it is reached again along a different trajectory. Pay particular attention to what constitutes a state in Dots-and-Boxes and how to represent it (i.e., what key will you use for the cache?); a minimal sketch of this idea is given after step 2 below.

2. Exploit symmetries.
The previous improvement exploited an equivalence among different trajectories to speed up the search. Now, we want to exploit the fact that multiple states are completely symmetric. Think, for instance, of two states where one is a rotation of the other. They necessarily share the same value, since one can simply play the same strategy with the actions rotated accordingly. By exploiting such symmetries, you can avoid re-searching states that are symmetric to states already searched; one way to canonicalize states is sketched below.
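The following is a minimal sketch of minimax with a transposition table, written against the generic OpenSpiel-style state interface (is_terminal, returns, current_player, legal_actions, child) that the template is assumed to expose; the cache key used here (current player plus the state's string rendering) is only illustrative, and you should replace it with a key that encodes exactly the drawn lines and the box ownership, nothing trajectory-dependent.

```python
def minimax_tt(state, maximizing_player, cache=None):
    """Value of `state` for `maximizing_player`, with a transposition table."""
    if cache is None:
        cache = {}
    if state.is_terminal():
        return state.returns()[maximizing_player]
    # Illustrative cache key: player to move plus the board rendering. In the
    # assignment you should design a key that captures the drawn lines and the
    # box ownership (and nothing about the trajectory that led there).
    key = (state.current_player(), str(state))
    if key in cache:
        return cache[key]
    child_values = (minimax_tt(state.child(action), maximizing_player, cache)
                    for action in state.legal_actions())
    value = (max(child_values) if state.current_player() == maximizing_player
             else min(child_values))
    cache[key] = value
    return value
```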
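One common way to exploit these symmetries is to map every state to a canonical representative of its symmetry class and to use that representative as the transposition-table key. The sketch below does this for a square board whose drawn lines are encoded in a 0/1 numpy grid; the encoding itself is a hypothetical choice, so you must map your own state representation to and from such a grid.

```python
import numpy as np

def canonical_key(edge_grid):
    """Lexicographically smallest byte string over the 8 symmetric variants.

    `edge_grid` is assumed to be a square 0/1 array marking which lines have
    been drawn (a hypothetical encoding). For square boards all 4 rotations
    and their mirror images give equivalent positions.
    """
    variants = []
    grid = edge_grid
    for _ in range(4):
        grid = np.rot90(grid)             # next rotation by 90 degrees
        variants.append(grid)
        variants.append(np.fliplr(grid))  # its mirror image
    return min(variant.tobytes() for variant in variants)
```

Note that for non-square boards only the 180° rotation and the two axis reflections preserve the board shape, so the set of variants has to be restricted accordingly.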

Learning strategies
The Dots-and-Boxes games used in the previous section are too small to be interesting to play. In this task you are expected to find a more advanced strategy that works for any board size. The following approaches are some starting points:

1. Reinforcement Learning (RL) can be used to learn a policy for playing games. Often, given the very large number of state-action combinations, standard model-free techniques such as Q-learning with a tabular representation of state-action pairs will not work; instead, some kind of generalization is needed (predicting the Q-value of unseen state-action pairs), often achieved with deep learning. You will need to design or learn features describing states and actions that correlate with the quality of state-action pairs (a small sketch of Q-learning with function approximation is given after this list).

2. Monte Carlo tree search (MCTS) is a stochastic version of minimax search. It has been found to work better than standard minimax with α-β pruning, and many modern game-playing programs use it. Contrary to reinforcement learning, MCTS explores the state space during play. Like minimax, MCTS may use an evaluation function V that models how promising a state is (e.g., AlphaGo used deep learning to represent this function); see the MCTS sketch after this list.
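To make the generalization idea of the first approach concrete, here is a small, hedged sketch of semi-gradient Q-learning with a linear function approximator over state-action features; the feature construction is a deliberately naive placeholder that you would replace with features of your own (or with a neural network).

```python
import numpy as np

def features(state_vec, action, num_actions):
    """Hypothetical state-action features: the state encoding is copied into
    the block of the weight vector that belongs to `action`."""
    phi = np.zeros(len(state_vec) * num_actions)
    phi[action * len(state_vec):(action + 1) * len(state_vec)] = state_vec
    return phi

def q_value(weights, state_vec, action, num_actions):
    return float(weights @ features(state_vec, action, num_actions))

def q_learning_update(weights, s, a, reward, s_next, legal_next, num_actions,
                      alpha=0.01, gamma=1.0, terminal=False):
    """One semi-gradient Q-learning step: w <- w + alpha * td_error * phi(s, a)."""
    target = reward
    if not terminal:
        target += gamma * max(q_value(weights, s_next, a2, num_actions)
                              for a2 in legal_next)
    td_error = target - q_value(weights, s, a, num_actions)
    return weights + alpha * td_error * features(s, a, num_actions)
```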
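For the second approach, the sketch below shows how an MCTS player could be instantiated with OpenSpiel's MCTSBot and a random-rollout evaluator; the game name and its num_rows/num_cols parameters are assumptions about how Dots-and-Boxes is registered in your OpenSpiel build, so check them against your version.

```python
import numpy as np
import pyspiel
from open_spiel.python.algorithms import mcts

# Assumed registration name and parameters; verify against your OpenSpiel build.
game = pyspiel.load_game("dots_and_boxes(num_rows=3,num_cols=3)")
rng = np.random.RandomState(42)
evaluator = mcts.RandomRolloutEvaluator(n_rollouts=20, random_state=rng)
bot = mcts.MCTSBot(game, uct_c=1.4, max_simulations=500,
                   evaluator=evaluator, random_state=rng)

# Self-play: the bot searches from the current state for whichever player moves.
state = game.new_initial_state()
while not state.is_terminal():
    action = bot.step(state)
    state.apply_action(action)
print(state.returns())
```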

For all approaches you can use implementations from OpenSpiel (DQN, policy gradient, MCTS, etc.) or write your own. For all strategies you will need to represent states and state-action pairs. Simple games such as matrix games are stateless, but for more complex settings you will need models that can generalize over states. To this end, you will have to think about features that accurately represent the game state and that allow for learning models that generalize well. The features can be designed by an expert or learned automatically. The Q or V function will be expressed in terms of these features. Learning this function can be done with almost any of the machine learning techniques you have seen in the machine learning course; one possible state encoding is sketched below.
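As one concrete (and deliberately simple) possibility for such a representation, the sketch below encodes a position as a binary entry per possible line plus the two players' normalized box counts; the drawn_lines/boxes_owned inputs are hypothetical and have to be adapted to however your framework exposes the state (OpenSpiel states, for games that implement it, can also provide an observation tensor).

```python
import numpy as np

def encode_state(drawn_lines, boxes_owned, num_rows, num_cols):
    """Encode a position as a flat feature vector.

    drawn_lines: set of line indices already drawn (hypothetical indexing:
                 0 .. num_lines-1).
    boxes_owned: dict mapping player id (0 or 1) to the number of boxes owned.
    """
    num_lines = num_rows * (num_cols + 1) + (num_rows + 1) * num_cols
    vec = np.zeros(num_lines + 2, dtype=np.float32)
    for line in drawn_lines:
        vec[line] = 1.0
    # Normalized box counts so their scale matches the binary line features.
    total_boxes = num_rows * num_cols
    vec[num_lines] = boxes_owned.get(0, 0) / total_boxes
    vec[num_lines + 1] = boxes_owned.get(1, 0) / total_boxes
    return vec
```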

Goal
You will develop an agent that has learned to play Dots-and-Boxes, making use of the code infrastructure provided in OpenSpiel. Your agent should work for any board size. If this is too challenging, you can train an agent that can only play a 7×7 board (this reduces your maximal score by 2 points). You are free to use any technique you prefer (e.g., deep Q-learning, Monte Carlo tree search, minimax). In the case of Dots-and-Boxes it is no longer possible to evaluate your agent exactly with respect to Nash optimality or other metrics, due to the size of the game. Therefore, we expect you to evaluate your agent head-to-head against simple baselines, including but not limited to random play, first open edge, … Use the results of your evaluation and the literature to motivate your design choices. Being able to run and experiment with your agent (and for us to be able to reproduce this) is part of the evaluation. Thus, explicitly describe your model (e.g., network structure, input tensor and exploited symmetries), what types of gameplay you observe (e.g., does your model learn to recognize the concept of chains, a popular strategy in Dots-and-Boxes), and discuss the learning statistics you used; a minimal head-to-head evaluation loop is sketched below.
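A head-to-head evaluation against a random baseline could look like the sketch below; the dots_and_boxes game string and the my_agent_action hook are assumptions that you must adapt to your own agent and OpenSpiel version.

```python
import numpy as np
import pyspiel

def my_agent_action(state):
    """Placeholder for your trained agent; here it just plays the first legal move."""
    return state.legal_actions()[0]

def play_match(game, agent_player, rng):
    state = game.new_initial_state()
    while not state.is_terminal():
        if state.current_player() == agent_player:
            action = my_agent_action(state)
        else:
            action = int(rng.choice(state.legal_actions()))
        state.apply_action(action)
    return state.returns()[agent_player]

# Assumed registration name and parameters; verify against your OpenSpiel build.
game = pyspiel.load_game("dots_and_boxes(num_rows=7,num_cols=7)")
rng = np.random.default_rng(0)
# Alternate which seat the agent takes to cancel out first-move advantage.
results = [play_match(game, episode % 2, rng) for episode in range(100)]
print("average return vs. random:", np.mean(results))
```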

Our solution