Exploring Monte Carlo Tree Search (MCTS) Algorithm in Online Planning

Monte Carlo Tree Search (MCTS) is an intelligent tree search algorithm that balances exploration and exploitation by using random sampling in the form of simulations. It is widely used in AI applications such as games (e.g., AlphaGo), scheduling, planning, and optimization. The algorithm repeats four steps (selection, expansion, simulation, and backpropagation) and returns the most promising move based on the accumulated playout statistics, iteratively improving decision-making in these domains.


Presentation Transcript


  1. Monte Carlo Tree Search (MCTS). An online planning algorithm: an intelligent tree search that balances exploration and exploitation, using random sampling in the form of simulations (playouts) and storing statistics of actions to improve choices in the future. [Image: Casino de Monte-Carlo, Monaco.] Combined with neural nets, it is behind many major successes of AI applications such as games (e.g., AlphaGo), scheduling, planning, and optimization. * Figures/images are from the textbook site (or by the instructor) unless the sources are given otherwise.
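
To make the stored statistics concrete, here is a minimal sketch (not from the slides) of the bookkeeping an MCTS tree node might carry; the names Node, wins, playouts, and mover are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the MCTS tree: a game state plus playout statistics."""
    state: object                    # game state reached at this node
    parent: object = None            # None for the root
    move: object = None              # move that led from parent to this state
    mover: object = None             # player who made that move
    children: list = field(default_factory=list)
    wins: int = 0                    # U(n): playouts won by `mover`
    playouts: int = 0                # N(n): playouts that passed through this node
```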

  2. Step 1: Selection. We need to decide a move for Black, who is to move at the root. The root is the state just after a move by White, who has won 37 out of the 100 playouts at that node so far. [Figure: the tree of hypothetical executions; the child reached by a Black move shows that Black has won 60 out of 79 playouts at that node, and it would also be reasonable to select a less promising child for the purpose of exploration.] Unlike in the minimax game tree, a directed edge represents a move by the player at the edge's destination. At each node a move is selected according to some policy (e.g., the upper confidence bound to be introduced).
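
A sketch of how selection might be coded, assuming the Node fields from the sketch above; the UCB1 formula used here is the one introduced on slide 8, with unvisited nodes treated as having an unbounded score.

```python
import math

def ucb1(node, c=math.sqrt(2)):
    # Unvisited nodes get an unbounded score so they are tried first.
    if node.playouts == 0:
        return float("inf")
    exploit = node.wins / node.playouts
    explore = c * math.sqrt(math.log(node.parent.playouts) / node.playouts)
    return exploit + explore

def select(root):
    # Descend from the root, always moving to the child with the highest UCB1
    # score, until reaching a leaf (a node that has not been expanded yet).
    node = root
    while node.children:
        node = max(node.children, key=ucb1)
    return node
```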

  3. Steps 2 & 3: Expansion & Simulation Create one or more child nodes of the selected node. White You may generate a child node by executing every feasible action. Black Or you may generate child nodes by executing a subset of feasible actions. White But you shouldn t always generate just one child node when there is more than one feasible action. Otherwise, the search tree within each iteration will have just one leaf which will then be selected and the tree will degenerate into a chain. One of the child nodes, say, ?, is (randomly) chosen. Black selected (27 wins for Black out of 35 playouts) White Perform a playout from the node ?.

  4. Step 4: Backpropagation. Update all the nodes upward along the path to the root. Black wins this playout, so at a White node only #playouts is incremented, while at a Black node both #wins and #playouts are incremented. [Figure: the counts along the path, e.g., 27/35 and 60/79 and the root's 100 playouts, are updated accordingly.]
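
The update rule described on this slide could be coded as the sketch below (same assumed Node fields): every node on the path gets its playout count incremented, and the win count is incremented only where the playout's winner is the player who moved into that node.

```python
def backpropagate(node, playout_winner):
    # Walk from the newly expanded node back up to the root.
    while node is not None:
        node.playouts += 1                    # every node on the path gets a playout
        if node.mover == playout_winner:      # e.g., a Black node when Black wins
            node.wins += 1                    # only the winner's nodes gain a win
        node = node.parent
```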

  5. Termination. MCTS repeats the four steps (selection, expansion, simulation, backpropagation) in order until a set number of iterations has been performed or the allotted time has expired. It then returns the move with the highest number of playouts. Why not the highest win ratio? Since better moves are more likely to be chosen during the search, the most promising move is expected to have the highest number of playouts; a node with 65/100 wins is better than one with 2/3 wins (which carries a lot of uncertainty).
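
Choosing the final move by playout count, as the slide recommends, might look like this small sketch:

```python
def best_move(root):
    # Return the move whose child accumulated the most playouts
    # (not the best win ratio, which can be dominated by small samples).
    best_child = max(root.children, key=lambda c: c.playouts)
    return best_child.move
```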

  6. Monte Carlo Tree Search Algorithm. The procedure decides a move at state. The tree is initialized with state at the root, and each iteration expands the tree by one node: the node to be expanded must be a leaf, and the tree is expanded to a child of that leaf. A playout is then performed from the child (its moves are not recorded in the tree), and the result is used to update the nodes on the path upward to the root; finally, one of the moves in ACTIONS(state) is returned. Computing one playout takes time linear in the length of the path from child to the terminal position where the result (utility) is determined.
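
One possible way to wire the four steps into a main loop, built on the sketches above (the time budget and game-interface arguments are assumptions):

```python
import time

def monte_carlo_tree_search(state, legal_moves, apply_move, to_move, winner,
                            time_budget_s=1.0):
    root = Node(state=state)
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:                            # IS-TIME-REMAINING()
        leaf = select(root)                                       # 1. selection
        child = expand(leaf, legal_moves, apply_move, to_move)    # 2. expansion
        result = simulate(child, legal_moves, apply_move, winner) # 3. simulation
        backpropagate(child, result)                              # 4. backpropagation
    return best_move(root)
```

A fixed iteration count can replace the wall-clock test without changing anything else.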

  7. Three Issues. (1) EXPAND(leaf): create one or more child nodes of leaf and choose node child from one of them. (2) SIMULATE(child): a playout may be as simple as choosing uniformly random moves until the game is decided. (3) IS-TIME-REMAINING(): the loop runs as long as time remains; pure Monte Carlo search performs N simulations instead.

  8. Ranking of Possible Moves. At a node n in the expanding MCTS tree rooted at the real state, the upper confidence bound formula is

UCB1(n) = U(n) / N(n) + C · √( log N(PARENT(n)) / N(n) )

where U(n) is the number of wins, among all playouts through n, for the player making the move into n; N(n) is the number of playouts through n; and N(PARENT(n)) is the number of playouts through the parent of n. The first term (exploitation) is the average utility of n; the second term (exploration) takes a high value if n has been explored only a few times; the constant C balances exploitation and exploration, and the choice C = √2 is suggested by a theoretical argument. Example: with U(n) = 27, N(n) = 35, and N(PARENT(n)) = 53, UCB1(n) = 27/35 + C · √( log 53 / 35 ).
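
As a quick numerical check of the slide's example (with the theoretically motivated C = √2):

```python
import math

C = math.sqrt(2)
U_n, N_n, N_parent = 27, 35, 53       # wins, playouts, parent's playouts (from the slide)
ucb = U_n / N_n + C * math.sqrt(math.log(N_parent) / N_n)
print(round(ucb, 3))                  # about 0.771 + 0.476 = 1.248
```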

  9. Constant C. Balances exploitation and exploration. Multiple values are tried and the one that performs best is chosen. With C = 1.4, the 60/79 node has the highest score; with C = 1.5, the 2/11 node has the highest score.
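
To see how the constant changes the ranking, the sketch below recomputes the UCB1 scores of the two nodes mentioned here (60/79 and 2/11), assuming both are children of the root with its 100 playouts from slide 2:

```python
import math

def ucb1(wins, playouts, parent_playouts, c):
    return wins / playouts + c * math.sqrt(math.log(parent_playouts) / playouts)

for c in (1.4, 1.5):
    print(c, round(ucb1(60, 79, 100, c), 3), round(ucb1(2, 11, 100, c), 3))
# C = 1.4: 1.098 vs 1.088 -> the 60/79 node ranks highest
# C = 1.5: 1.122 vs 1.152 -> the 2/11 node ranks highest
```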

  10. Example: The Pair Game. (From "Monte Carlo Tree Search: A Guide".) There are four circular holes in the board. Two players (Blue and Red) take turns placing disks of their colors to cover the holes; Blue goes first. Whoever first covers two adjacent holes wins the game. [Figure: example final positions labelled Tie, Red wins, and Blue wins.]
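
The rules are simple enough that a sketch of the game interface fits in a few lines; the tuple-of-four-holes representation and the function names are illustrative choices.

```python
EMPTY = None

def initial_state():
    return (EMPTY, EMPTY, EMPTY, EMPTY)            # four holes in a row

def legal_moves(state):
    return [i for i, hole in enumerate(state) if hole is EMPTY]

def place(state, hole, colour):                    # colour is "Blue" or "Red"
    board = list(state)
    board[hole] = colour
    return tuple(board)

def winner(state):
    # A player wins by covering two adjacent holes; a full board with no such
    # pair is a tie; otherwise the game is still in progress.
    for i in range(3):
        if state[i] is not EMPTY and state[i] == state[i + 1]:
            return state[i]
    return "Tie" if EMPTY not in state else None
```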

  11. Complete Game Tree (up to Symmetry)

  12. Iteration 1: Selection & Expansion 0/0 Selected randomly 0/0 0/0 Leaf nodes UCB UCB

  13. Simulation. [Figure: a playout is performed from the selected child; all node statistics are still 0/0.]

  14. Backpropagation. [Figure: the playout result is propagated upward; the selected child and the root now show 0/1, while the unvisited sibling remains at 0/0.]

  15. Iteration 2: Selection. [Figure: the root now shows 0/1. Its visited child (0/1) has UCB = 0, since its win ratio is 0 and, with log 1 = 0, its exploration term is also 0; the unvisited child (0/0) has an unbounded UCB value and is therefore selected.]

  16. More on MCTS. The time to compute a playout is linear in the depth of the game tree, and many playouts are computed before taking one move. E.g., with branching factor b = 32, about 100 plies per game on average, and enough computing power to consider 10^9 states: minimax can search about 6 plies deep (32^6 ≈ 10^9), alpha-beta can search up to about 12 plies, and Monte Carlo can do about 10^7 playouts. MCTS has an advantage over alpha-beta when b is high, is less vulnerable to a single error, and can be applied to brand-new games via training by self-play. MCTS is less desirable than alpha-beta on a game like chess, with a low b and a good evaluation function.
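
The slide's orders of magnitude can be checked directly (a small sketch; the alpha-beta figure follows the standard best-case argument that it reaches roughly twice the minimax depth for the same cost, i.e., about b^(d/2) nodes):

```python
b = 32                       # branching factor
budget = 10 ** 9             # number of states we can afford to examine
print(b ** 6)                # 1073741824: minimax exhausts the budget at about 6 plies
print(b ** (12 // 2))        # alpha-beta's best case reaches ~12 plies for the same cost
print(budget // 100)         # ~10**7 playouts of ~100 plies each with the same budget
```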
