Exploring Monte Carlo Tree Search (MCTS) Algorithm in Online Planning

Monte Carlo Tree Search (MCTS)

- Online planning algorithm.
- Intelligent tree search that balances exploration and exploitation.
- Random sampling in the form of simulations (playouts).
- Stores the statistics of actions to improve choices in the future.
- Combined with neural nets, it is behind many major successes of AI applications such as games (e.g., AlphaGo), scheduling, planning, and optimization.

(Image: Casino de Monte-Carlo, Monaco.)

* Figures/images are from the textbook site (or by the instructor) unless the sources are given otherwise.
Step 1: Selection

We need to decide a move for black (at the root) to make. The root is the state just after the move by white, who has won 37 out of the 100 playouts at the node so far. Unlike in the minimax game tree, a directed edge represents a move by the player at the edge's destination, so the path alternates between moves by black and moves by white.

At each node a move is selected according to some policy (e.g., the upper confidence bound, to be introduced). In the figure's hypothetical execution, selection follows promising nodes, such as one where black has won 60 out of 79 playouts at the node; a less-visited node is also reasonable to select for the purpose of exploration. A sketch of this step follows.
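A minimal Python sketch of the selection step, assuming a hypothetical Node class (the parent, children, wins, playouts, player, state, and move field names are illustrative, not from the slides). Selection descends from the root, at each level taking the child with the highest UCB1 score (the formula is introduced later in these slides), until it reaches a leaf:

    import math

    class Node:
        # Hypothetical MCTS tree node; field names are illustrative.
        def __init__(self, parent=None):
            self.parent = parent
            self.children = []
            self.wins = 0        # playout wins for the player this node tracks
            self.playouts = 0    # playouts that passed through this node
            self.player = None   # whose wins this node tracks (set on expansion)
            self.state = None    # game state at this node
            self.move = None     # move that led to this node

    def ucb1(node, c=math.sqrt(2)):
        # Unvisited nodes get an infinite score so they are tried first.
        if node.playouts == 0:
            return math.inf
        exploit = node.wins / node.playouts
        explore = c * math.sqrt(math.log(node.parent.playouts) / node.playouts)
        return exploit + explore

    def select(root):
        # Descend until reaching a leaf (a node with no children yet).
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        return node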
Steps 2 & 3: Expansion & Simulation

Create one or more child nodes of the selected node (in the figure, the node selected in Step 1 has 27 wins for black out of 35 playouts). You may generate a child node by executing every feasible action, or generate child nodes by executing a subset of the feasible actions. But you shouldn't always generate just one child node when there is more than one feasible action: otherwise, the search tree within each iteration will have just one leaf (which will then be selected), and the tree will degenerate into a chain. One of the child nodes is then chosen (say, at random), and a playout is performed from it. Both steps are sketched below.
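A sketch of expansion and simulation, reusing the hypothetical Node class from the selection sketch and assuming a game object with actions, result, is_terminal, utility, and to_move methods (an invented interface, not from the slides); terminal leaves are ignored for brevity:

    import random

    def expand(leaf, game):
        # Create one child per feasible action. (A subset would also do, but
        # never always just one child when several actions are available.)
        for action in game.actions(leaf.state):
            child = Node(parent=leaf)
            child.state = game.result(leaf.state, action)
            child.move = action
            # By the edge convention above, the child tracks wins for the
            # player who made the move into it, i.e., the mover at the leaf.
            child.player = game.to_move(leaf.state)
            leaf.children.append(child)
        return random.choice(leaf.children)  # the node to run the playout from

    def simulate(node, game):
        # Playout: choose uniformly random moves until the game is decided.
        # Moves made here are not recorded in the tree.
        state = node.state
        while not game.is_terminal(state):
            state = game.result(state, random.choice(game.actions(state)))
        return game.utility(state)  # e.g., the winning player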
Step 4: Back Propagation

Update all the nodes upward along the path until the root. Black wins this playout, so:
- At a black node, increment #wins and #playouts.
- At a white node, increment #playouts only.

(Figure: the updated path from the new node to the root, with counts such as 27/35 and 60/79.)
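A sketch of back-propagation under the same assumptions, using the node's player field to decide whether the playout counts as a win at that node (a hypothetical convention matching the black/white bookkeeping above):

    def backpropagate(node, winner):
        # Walk upward along the path to the root. Every node counts the
        # playout; a node also counts a win when the playout was won by
        # the player whose statistics it stores.
        while node is not None:
            node.playouts += 1
            if node.player == winner:
                node.wins += 1
            node = node.parent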
Termination

MCTS repeats the four steps (selection, expansion, simulation, back-propagation) in order until a set number N of iterations has been performed, or the allotted time has expired. It then returns the move with the highest number of playouts.

Why not the highest win ratio? Since better moves are more likely to be chosen, the most promising move is expected to have the highest number of playouts, and the ratio can mislead: a node with 65/100 wins is better than one with 2/3 wins (which has a lot of uncertainty), as the comparison below makes concrete.
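A toy comparison (with made-up counts) showing why the final move is picked by playout count rather than win ratio:

    # (wins, playouts) for two hypothetical candidate moves at the root
    candidates = {"move_a": (65, 100), "move_b": (2, 3)}

    by_playouts = max(candidates, key=lambda m: candidates[m][1])
    by_ratio = max(candidates, key=lambda m: candidates[m][0] / candidates[m][1])
    print(by_playouts, by_ratio)  # move_a move_b: ratio favors the noisy 2/3 node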
Monte Carlo Tree Search Algorithm

function MONTE-CARLO-TREE-SEARCH(state) returns an action    // decide a move at state
    tree <- NODE(state)                  // initialize the tree with state at the root
    while IS-TIME-REMAINING() do         // iterations, each expanding the tree by one node
        leaf <- SELECT(tree)             // the node to be expanded must be a leaf
        child <- EXPAND(leaf)            // tree is expanded to a child of leaf
        result <- SIMULATE(child)        // playout: moves are not recorded in the tree
        BACK-PROPAGATE(result, child)    // update nodes on the path upward to the root
    return the move in ACTIONS(state) whose node has the highest number of playouts

Computing one playout takes time linear in the length of the path from child to the utility node (result).
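Gluing the earlier per-step sketches together gives a loop matching this pseudocode; a wall-clock deadline plays the role of IS-TIME-REMAINING() (all names remain the hypothetical ones introduced above):

    import time

    def monte_carlo_tree_search(state, game, budget_seconds=1.0):
        root = Node(parent=None)   # root.player stays None; only its playouts matter
        root.state = state
        deadline = time.monotonic() + budget_seconds
        while time.monotonic() < deadline:            # IS-TIME-REMAINING()
            leaf = select(root)                       # Step 1: selection
            if game.is_terminal(leaf.state):          # nothing to expand here
                node, winner = leaf, game.utility(leaf.state)
            else:
                node = expand(leaf, game)             # Step 2: expansion
                winner = simulate(node, game)         # Step 3: playout
            backpropagate(node, winner)               # Step 4: back-propagation
        # Return the move whose node has the highest number of playouts.
        return max(root.children, key=lambda ch: ch.playouts).move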
Three Issues

1. EXPAND(leaf): create one or more child nodes and choose node child from one of them.
2. SIMULATE(child): a playout may be as simple as choosing uniformly random moves until the game is decided.
3. IS-TIME-REMAINING(): pure Monte Carlo search does N simulations instead of using a time budget; a sketch of this variant follows.
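For contrast, a sketch of pure Monte Carlo search: it keeps no tree and simply runs a fixed number N of random playouts per candidate move (same hypothetical game interface as above):

    import random

    def pure_monte_carlo(state, game, me, n=1000):
        # No tree: for each legal move, run n random playouts and pick the
        # move with the best average outcome for player `me`.
        def playout(s):
            while not game.is_terminal(s):
                s = game.result(s, random.choice(game.actions(s)))
            return 1 if game.utility(s) == me else 0

        def score(action):
            after = game.result(state, action)
            return sum(playout(after) for _ in range(n)) / n

        return max(game.actions(state), key=score)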
Ranking of Possible Moves

At a node n in the expanding MCTS tree rooted at the real state, the upper confidence bound formula is:

    UCB1(n) = U(n)/N(n) + C * sqrt( log N(PARENT(n)) / N(n) )

- U(n): #wins, for the player making a move at n, out of all playouts through the node.
- N(n): #playouts through the node.
- N(PARENT(n)): #playouts through the parent of n.

The first term (exploitation) is the average utility of n; the second term (exploration) has a high value if n has been explored only a few times. The constant C balances exploitation and exploration; C = sqrt(2) is the choice by a theoretical argument.

Example: for the node with U(n) = 27, N(n) = 35, and N(PARENT(n)) = 53:

    UCB1(n) = 27/35 + sqrt(2) * sqrt( log 53 / 35 )
Constant C

Balances exploitation and exploration. In practice, multiple values are tried and the one that performs the best is chosen. The choice matters: in the tree shown on the slide, with C = 1.4 the 60/79 node has the highest score, while with C = 1.5 the 2/11 node has the highest score.
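Plugging the slide's numbers into the formula (a quick check, using the natural log as is standard for UCB1):

    import math

    def ucb1(wins, playouts, parent_playouts, c=math.sqrt(2)):
        if playouts == 0:
            return math.inf  # unvisited nodes are tried first
        return wins / playouts + c * math.sqrt(math.log(parent_playouts) / playouts)

    # The slide's node: 27 wins out of 35 playouts, 53 playouts at the parent.
    print(ucb1(27, 35, 53))  # ~0.771 (exploitation) + ~0.476 (exploration) = ~1.25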
Example: The Pair Game

From "Monte Carlo Tree Search: A Guide".
- Four circular holes in the board.
- Two players (Blue and Red) take turns to place disks in their colors to cover the holes.
- Whoever first covers two adjacent holes wins the game.
- Blue goes first.

(Figure: example final positions labeled "Blue wins", "Red wins", and "Tie".)
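A minimal model of the Pair Game, assuming the four holes lie in a row so that the adjacent pairs are (0,1), (1,2), and (2,3); the string-based board encoding is an invention for illustration:

    EMPTY, BLUE, RED = ".", "B", "R"

    def actions(board):
        # A move places a disk in any empty hole.
        return [i for i, hole in enumerate(board) if hole == EMPTY]

    def result(board, hole, player):
        return board[:hole] + player + board[hole + 1:]

    def winner(board):
        # A player wins by covering two adjacent holes.
        for i in range(len(board) - 1):
            if board[i] != EMPTY and board[i] == board[i + 1]:
                return board[i]
        return None  # no winner yet; a full board with no winner is a tie

    # Example: Blue covers hole 0 and then hole 1 and wins.
    board = result(result("....", 0, BLUE), 1, BLUE)
    print(winner(board))  # B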
Complete Game Tree (up to Symmetry)
 
Iteration 1: Selection & Expansion

The root's children are created, each an unvisited leaf node at 0/0. Every unvisited node has an infinite UCB value, so one of them is selected randomly.
Simulation

A playout with uniformly random moves is run from the randomly selected node; all counts remain 0/0 until the result is propagated back.
Backpropagation

The result is propagated upward: the selected node and the root are updated to 0/1, while the unexplored siblings remain at 0/0.
Iteration 2: Selection

The visited 0/1 node now has UCB = 0 (its exploitation term is 0/1 = 0, and the exploration term vanishes since the root has one playout and log 1 = 0), while an unvisited 0/0 node still has an infinite UCB, so an unvisited node is selected.
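Checking the two scores numerically (a quick sanity check, matching the ucb1 helper defined in the ranking section):

    import math

    # Visited child: 0 wins out of 1 playout, root has 1 playout.
    print(0 / 1 + math.sqrt(2) * math.sqrt(math.log(1) / 1))  # 0.0
    # Unvisited child: 0 playouts, so its UCB is taken to be infinite.
    print(math.inf)  # inf -- selected first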
More on MCTS

- Time to compute a playout is linear in the depth of the game tree, so many playouts can be computed before taking one move. E.g., with branching factor b = 32, 100 plies on average per game, and enough computing power to consider 10^9 states: minimax can search 6 plies deep (32^6 is about 10^9), alpha-beta can search up to 12 plies, and Monte Carlo can do 10^7 playouts.
- MCTS has an advantage over alpha-beta when b is high.
- MCTS is less vulnerable to a single error.
- MCTS can be applied to brand-new games via training by self-play.
- MCTS is less desirable than alpha-beta on a game like chess, with low b and a good evaluation function.
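The slide's back-of-the-envelope numbers, checked quickly:

    # Branching factor b = 32, ~100 plies per game, budget of ~10**9 states.
    b, plies, budget = 32, 100, 10**9
    print(b ** 6)           # 1073741824, about 10**9: minimax reaches ~6 plies
    print(budget // plies)  # 10**7: about ten million random playouts fit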