MCTS vs Minimax: when to explore, when to exploit
When I built my chess AI last year, I started with the Minimax algorithm: it's the standard choice for board-game AI and what every tutorial begins with. Later I added Monte-Carlo Tree Search on top of it. The combination turned out to be more interesting than either algorithm alone, and figuring out why taught me something deeper about the relationship between these two approaches.
Minimax: the exhaustive reasoner
Minimax is clean in a brute-force sort of way. You build a game tree, alternate between maximising your score and minimising your opponent's, and pick the move at the root that leads to the best outcome (the highest evaluation assuming both sides play their best). Alpha-beta pruning helps considerably: by skipping branches that cannot affect the result, it reduces the effective branching factor to roughly the square root of the original, letting you search nearly twice as deep in the same time; in my case it made depth-3 search comfortable. Add a good evaluation function, and this plays decent chess.
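In code, the skeleton looks like this. This is a generic sketch over a toy tree of nested lists, not the chess-specific version; in a real engine, the leaf case is where the static evaluation function would run:

```python
def minimax(node, depth, alpha, beta, maximizing):
    """Alpha-beta minimax over a toy tree of nested lists; leaves are numbers."""
    if depth == 0 or isinstance(node, (int, float)):
        # Leaf (or depth limit reached): a real engine would call the static
        # evaluation here. In this toy, leaves are already plain scores.
        return node if isinstance(node, (int, float)) else 0
    if maximizing:
        best = float("-inf")
        for child in node:
            best = max(best, minimax(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:
                break                # prune: the opponent will avoid this branch
        return best
    best = float("inf")
    for child in node:
        best = min(best, minimax(child, depth - 1, alpha, beta, True))
        beta = min(beta, best)
        if beta <= alpha:
            break                    # prune: we will avoid this branch
    return best

# Root maximizes over two replies; the opponent minimizes at the next ply.
tree = [[3, 5], [2, 9]]
print(minimax(tree, 2, float("-inf"), float("inf"), True))  # -> 3
```

Note the pruning: once the second subtree yields a 2, the maximizer already knows that line can't beat the 3 it has guaranteed, so the 9 is never even examined.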
The problem is that chess has a branching factor of around 35. At depth 5, you're evaluating roughly 35^5 ≈ 52 million positions. At depth 7, 35^7 ≈ 64 billion. Running that many evaluations for every move of the game quickly becomes computationally infeasible. The evaluation function is also a load-bearing assumption: if your heuristic is wrong, Minimax will still confidently find the best move toward a bad outcome. Is there a way to decide which paths are actually worth exploring?
MCTS: the gambler
Monte-Carlo Tree Search takes a different approach. Rather than applying a static evaluation to every possible position, it runs playouts: simulated games played to completion, whose results (win/loss) are used to estimate which moves are promising. The key to this strategy is the UCB1 formula, which balances exploration and exploitation:
UCB1 = (w / n) + c * sqrt(ln(N) / n)

where:
w = wins recorded at this node
n = number of times this node was visited
N = number of times the parent was visited
c = exploration constant (typically sqrt(2))
The first term is exploitation: it prefers nodes that have won often. The second is exploration: it prefers nodes that haven't been tried much and are therefore still worth investigating. Set c too high and you explore almost at random; set it too low and you exploit greedily and miss better moves. The constant exists precisely to hold that balance.
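As a quick sketch of the formula in code (the visit counts below are made up purely for illustration):

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCB1: average win rate plus an exploration bonus that shrinks
    as a node is visited more often."""
    if visits == 0:
        return math.inf   # unvisited children are always tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# With the parent visited 100 times, a barely-explored move can still outrank
# a proven one, because its exploration bonus is large:
proven = ucb1(60, 80, 100)   # 75% win rate, well explored
fresh  = ucb1(1, 2, 100)     # 50% win rate, barely explored
print(fresh > proven)        # -> True
```

Once the fresh node has been visited enough, its bonus decays and the proven node wins the comparison again, which is exactly the self-correcting behaviour you want.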
However, pure MCTS with random playouts is bad at chess. Random chess games are mostly noise: which side wins a random playout is almost unrelated to which side is winning under good play.
Why the combination works
The approach I ended up with is: use MCTS to decide which parts of the tree to explore, and replace random playouts with shallow Minimax evaluations (depth 1-3).
MCTS handles the strategic decision of which branches are worth spending time on. It grows the tree asymmetrically, allocating more computation to positions that actually matter rather than spreading it evenly over paths that don't.
Minimax handles the tactical decisions within a branch MCTS has decided to explore: it can calculate forcing lines, spot checkmates, and avoid blunders.
MCTS asks "which direction should I think harder about?" whereas Minimax asks "given that I'm thinking about this direction, what's the best move?"
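Here is a minimal, runnable sketch of that hybrid on a toy game (take 1 or 2 stones from a pile; whoever takes the last stone wins). The game, the Node layout, and the depth-3 cutoff are illustrative choices, not my engine's actual code; the point is the shape of the loop, with the random-playout step swapped for a depth-limited negamax evaluation:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def moves(state):                     # toy game: take 1 or 2 stones
    return [m for m in (1, 2) if m <= state]

def shallow_minimax(state, depth):
    """Depth-limited negamax, used in place of a random playout.
    Score is from the perspective of the side to move."""
    if state == 0:
        return -1                     # no stones left: the side to move has lost
    if depth == 0:
        return 0                      # horizon reached: call it unknown
    return max(-shallow_minimax(state - m, depth - 1) for m in moves(state))

def mcts(root_state, iterations=300, c=math.sqrt(2)):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCB1 while the node is fully expanded.
        while node.children and len(node.children) == len(moves(node.state)):
            node = max(node.children,
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one untried child, if any remain.
        tried = {ch.state for ch in node.children}
        untried = [m for m in moves(node.state) if node.state - m not in tried]
        if untried:
            child = Node(node.state - random.choice(untried), node)
            node.children.append(child)
            node = child
        # 3. Evaluation: shallow minimax instead of a random playout,
        #    sign-flipped to score the player who just moved into `node`.
        score = -shallow_minimax(node.state, depth=3)
        # 4. Backpropagation: alternate the sign at each ply up the tree.
        while node is not None:
            node.visits += 1
            node.value += score
            score = -score
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return root_state - best.state    # number of stones to take

print(mcts(4))  # -> 1: leave a multiple of 3, the known winning strategy
```

In the real engine, `moves` is the legal-move generator, `state` is a board, and `shallow_minimax` is the alpha-beta search with the static evaluation at its leaves; the loop is otherwise the same.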
The evaluation function, revisited
Here is an example of a simple static evaluation function for chess: material count and piece-square tables, plus bonuses for castling and connected rooks (the bonuses are omitted below for brevity).
import chess  # python-chess

# PIECE_VALUES and PIECE_SQUARE_TABLES are lookup tables defined elsewhere:
# material value per piece type, and a per-square positional bonus.
def evaluate(board):
    score = 0
    for square, piece in board.piece_map().items():
        value = PIECE_VALUES[piece.piece_type]
        if piece.color == chess.WHITE:
            score += value + PIECE_SQUARE_TABLES[piece.piece_type][square]
        else:
            # mirror the square so black reads the same table from its side
            mirrored = chess.square_mirror(square)
            score -= value + PIECE_SQUARE_TABLES[piece.piece_type][mirrored]
    return score
No matter how correctly Minimax and every other module are implemented, everything is held back by the strength of the static evaluation: a weak evaluation makes a weak agent, and a strong evaluation makes a strong agent. The problem is that no hardcoded, unchanging evaluation can accurately capture every position that arises in a game of chess.
Interestingly, MCTS compensates substantially for a weak evaluation function, because it selects which positions to evaluate rather than evaluating everything: a mediocre heuristic applied to the right positions outperforms the same heuristic applied uniformly. This is part of why AlphaZero was such a leap: it paired the search with a far more sophisticated learned evaluator in place of random playouts.
Things I'd do differently
The exploration constant c was 1.4 (close to the theoretical default of sqrt(2) ≈ 1.414). In practice the right value depends on your evaluation function and time budget. I never tuned it systematically, which is a mistake when you can generate objective win/loss data to tune against.
I also never implemented transposition tables for the MCTS tree. When you reach the same position via different move orders, you should reuse the accumulated statistics rather than treating it as a new node and redoing the same work. A transposition table caches those statistics per position, keyed on the position itself rather than on the path that reached it, which saves cost throughout the game.
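The core idea fits in a few lines. This is a hypothetical sketch; in a real chess engine the key would be something like `board.fen()` or a Zobrist hash, and the table would live inside the MCTS node store:

```python
# Key MCTS statistics by position, not by the path that reached it.
stats = {}  # position key -> {"visits": int, "value": float}

def record(key, value):
    """Merge a playout/evaluation result into the shared entry for a position."""
    entry = stats.setdefault(key, {"visits": 0, "value": 0.0})
    entry["visits"] += 1
    entry["value"] += value

# The same position reached via two different move orders shares one entry:
record("pos-e4-e5-Nf3-Nc6", +1.0)  # reached via 1. e4 e5 2. Nf3 Nc6
record("pos-e4-e5-Nf3-Nc6", -1.0)  # reached via 1. Nf3 Nc6 2. e4 e5
print(stats["pos-e4-e5-Nf3-Nc6"])  # -> {'visits': 2, 'value': 0.0}
```

The subtlety, which is part of why I never got around to it, is that shared statistics turn the search tree into a graph, so backpropagation has to be careful not to double-count a result along converging paths.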
Lastly, the biggest weakness of any board-game agent is an evaluation that does not adapt. A dynamic evaluation beats a static one because it tackles the horizon effect: static eval scores a position as-is, which can be wildly misleading mid-tactics, say a capture sequence where you're temporarily up material but about to lose more. A better approach is to keep searching "forcing" moves (captures, checks, promotions) via quiescence search, which yields a stable, realistic score and avoids tactical blunders. The way this works: run normal Minimax to a fixed depth limit such as n=3, then at the leaves switch over to a quiescence search. This makes the engine far sharper while adapting to the situation on the board.
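Here is a sketch of that leaf handoff on a toy position model. The dict layout and the numbers are invented for illustration; a real engine would walk an actual board and generate the capture moves:

```python
def quiescence(state, alpha, beta):
    """Fail-soft quiescence search in negamax form, over forcing moves only.
    `state` is a toy dict: 'eval' is the static score from the side to move's
    perspective, 'captures' lists the forcing replies available."""
    best = state["eval"]              # "stand pat": we may decline every capture
    if best >= beta:
        return best
    alpha = max(alpha, best)
    for child in state["captures"]:
        score = -quiescence(child, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                     # opponent won't allow this line
    return best

INF = float("inf")
# The greedy capture looks like +1 at the horizon, but the piece gets
# recaptured, and the true outcome of the exchange is -2 for us:
after_recapture = {"eval": -2, "captures": []}                 # our view
after_capture   = {"eval": -1, "captures": [after_recapture]}  # opponent's view
position        = {"eval": 0,  "captures": [after_capture]}    # our move
print(quiescence(position, -INF, INF))  # -> 0: stand pat, decline the trade
```

A depth-limited search that stopped at `after_capture` would happily report +1 and walk into the trade; quiescence keeps following captures until the position is quiet, so it sees the -2 and chooses to stand pat at 0.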
An even more sophisticated approach would be to take a large dataset of grandmaster games and a population of randomised evaluation functions that test for different features, score each candidate by how well it mimics the grandmaster moves, and assign it a fitness accordingly. Something like simulated annealing or a genetic algorithm can then refine the evaluation toward grandmaster-level judgment. The catch is that this process is extremely compute-heavy and takes a long time to get right.
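A toy sketch of the fitness idea, where everything is invented for illustration: a "position" is just a map from candidate moves to feature vectors, and the "grandmaster" move is whatever a hidden target weighting prefers. The search here is plain greedy hill climbing; simulated annealing would additionally accept some downhill steps to escape local optima:

```python
import random

def agreement(weights, dataset, pick_move):
    """Fitness: fraction of positions where this weighting picks the move
    the reference (grandmaster) game actually played."""
    hits = sum(pick_move(weights, pos) == gm for pos, gm in dataset)
    return hits / len(dataset)

def hill_climb(start, dataset, pick_move, steps=1000, scale=0.1):
    """Greedy local search over evaluation weights."""
    best, best_fit = list(start), agreement(start, dataset, pick_move)
    for _ in range(steps):
        cand = [w + random.gauss(0, scale) for w in best]
        fit = agreement(cand, dataset, pick_move)
        if fit >= best_fit:
            best, best_fit = cand, fit
    return best, best_fit

# Toy data: moves are scored by a dot product of weights and move features;
# the "grandmaster" prefers a hidden target weighting of [1.0, 0.5].
random.seed(0)
def pick_move(weights, pos):
    return max(pos, key=lambda mv: sum(w * f for w, f in zip(weights, pos[mv])))

positions = [{m: [random.random(), random.random()] for m in "abc"}
             for _ in range(50)]
dataset = [(pos, pick_move([1.0, 0.5], pos)) for pos in positions]

weights, fit = hill_climb([0.0, 1.0], dataset, pick_move)
print(round(fit, 2))  # climbs well above the starting agreement
```

Replace the dot product with a full static evaluation and the dataset with real games, and the shape of the method is the same; the compute cost comes from having to run the evaluator over millions of positions per fitness measurement.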
What I came away with is a better intuition for when each approach is right. Minimax is right when the search space is small enough to reason through completely, or when you need precise tactical calculation (it's a perfect fit for simpler games like Connect 4 and checkers). MCTS is right when the space is too large to enumerate, or when you want to allocate computation dynamically. Combining them is right when you need both. Interestingly, the explore-exploit tradeoff that UCB1 balances shows up elsewhere too: A/B testing, recommendation systems, and reinforcement learning broadly.
The code for this project can be found here: https://github.com/physicsKnight/MCTS-Minimax