Finite-time Analysis of the Multiarmed Bandit Problem in Advanced Algorithms Seminar
This presentation covers the Stochastic Multiarmed Bandit Problem and how to achieve logarithmic regret uniformly over time. It walks through the problem setting, notation, previous work, objective, results, and proofs, centered on the UCB1 policy and the theorem bounding its expected regret after any number of plays.
Presentation Transcript
Finite-time Analysis of the Multiarmed Bandit Problem. Advanced Algorithms Seminar, Fall 2017. Instructed by Professor Haim Kaplan. Presented by Guy Inbar.
Contents: Introduction; Problem Settings & Notations; Previous Work; Objective; Results & Proofs; Experiments; Conclusions.
Introduction: Problem Settings. Stochastic Multiarmed Bandit Problem (no adversary). $K$ arms (machines). Playing machine $i$ yields rewards $X_i = \{X_{i,1}, X_{i,2}, \ldots\}$, identically distributed with unknown expectation $\mu_i$, and all rewards are completely independent across machines and plays.
Introduction: Notations. $T_i(n)$: the number of times machine $i$ has been played during the first $n$ plays. $\mu^* = \max_{1 \le i \le K} \mu_i$: the expected reward of the best machine. Regret after $n$ plays: $\mu^* n - \sum_{i=1}^{K} \mu_i\, \mathbb{E}[T_i(n)]$.
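To make the notation concrete, here is a small Python sketch (not from the original slides) of the regret formula; the reward means and play counts are hypothetical illustration values.

```python
# Minimal sketch of the regret definition: mu* n - sum_i mu_i E[T_i(n)].

def regret(mu, expected_plays):
    """Regret after n plays, given per-machine means and expected play counts."""
    n = sum(expected_plays)
    mu_star = max(mu)
    return mu_star * n - sum(m * t for m, t in zip(mu, expected_plays))

# Example: two machines with means 0.9 and 0.6, the second played 50 of 1000 times.
print(regret([0.9, 0.6], [950, 50]))  # 0.3 * 50 = 15.0
```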
Introduction: Previous Work. Lai & Robbins (1985): any suboptimal machine $i$ is played at most $\mathbb{E}[T_i(n)] \le \left(\frac{1}{D(p_i \| p^*)} + o(1)\right) \ln n$ times, where $D(p_i \| p^*)$ is the Kullback-Leibler divergence between the reward distribution of machine $i$ and that of the best machine. Hence the optimal machine is played exponentially more often than any other machine, at least asymptotically. Based on an Upper Confidence Index. Agrawal (1995): the index is expressed as a simple function of the total reward obtained from the machine.
Introduction: Objective. Achieve logarithmic regret uniformly over time, not only asymptotically.
UCB1. Derived from the index-based policy of Agrawal. Deterministic policy: UCB1. Initialization: play each machine once. Loop: play the machine $j$ that maximizes $\bar{x}_{j,T_j(t-1)} + \sqrt{\frac{2\ln(t-1)}{T_j(t-1)}}$, where $\bar{x}_{j,T_j(t-1)}$ is the average reward of machine $j$ after turn $t-1$ and $t-1$ is the total number of plays so far.
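A minimal, self-contained Python sketch of the UCB1 loop described above; the Bernoulli reward model, the horizon of 1000 plays, and the specific means are illustration assumptions (the 0.9/0.6 pair reappears in the experiments below).

```python
import math
import random

def ucb1(means, n_rounds, seed=0):
    """Run UCB1 on Bernoulli machines with the given means; return play counts."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # T_j: times machine j was played
    sums = [0.0] * K          # cumulative reward of machine j

    def pull(j):
        return 1.0 if rng.random() < means[j] else 0.0

    # Initialization: play each machine once.
    for j in range(K):
        sums[j] += pull(j)
        counts[j] += 1

    # Loop: play the machine maximizing average reward + sqrt(2 ln(t-1) / T_j).
    for t in range(K + 1, n_rounds + 1):
        ucb = [sums[j] / counts[j] + math.sqrt(2 * math.log(t - 1) / counts[j])
               for j in range(K)]
        best = max(range(K), key=lambda j: ucb[j])
        sums[best] += pull(best)
        counts[best] += 1
    return counts

print(ucb1([0.9, 0.6], 1000))
```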
Theorem 1. Policy UCB1's expected regret after any number $n$ of plays is at most
$$\left[\, 8 \sum_{i:\,\mu_i < \mu^*} \frac{\ln n}{\Delta_i} \,\right] + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{K} \Delta_j,$$
where $\Delta_i = \mu^* - \mu_i$.
Theorem 1 Proof. Let $\ell$ be an arbitrary positive integer, let $I_t$ be the machine played at turn $t$, and write $\bar{X}^*_s$ (resp. $\bar{X}_{i,s}$) for the average reward of the optimal machine (resp. machine $i$) over its first $s$ plays. For any suboptimal machine $i$,
$$T_i(n) = 1 + \sum_{t=K+1}^{n} \mathbb{1}\{I_t = i\} \le \ell + \sum_{t=K+1}^{n} \mathbb{1}\{I_t = i,\; T_i(t-1) \ge \ell\}.$$
Remember that UCB1 selects the machine maximizing $\bar{x}_{j,T_j(t-1)} + \sqrt{2\ln(t-1)/T_j(t-1)}$, so playing machine $i$ at turn $t$ means its index is at least that of the optimal machine:
$$T_i(n) \le \ell + \sum_{t=K+1}^{n} \mathbb{1}\!\left\{\bar{X}^*_{T^*(t-1)} + \sqrt{\tfrac{2\ln(t-1)}{T^*(t-1)}} \le \bar{X}_{i,T_i(t-1)} + \sqrt{\tfrac{2\ln(t-1)}{T_i(t-1)}},\; T_i(t-1) \ge \ell\right\}$$
$$\le \ell + \sum_{t=K+1}^{n} \mathbb{1}\!\left\{\min_{0 < s < t}\left(\bar{X}^*_s + \sqrt{\tfrac{2\ln(t-1)}{s}}\right) \le \max_{\ell \le s_i < t}\left(\bar{X}_{i,s_i} + \sqrt{\tfrac{2\ln(t-1)}{s_i}}\right)\right\}$$
$$\le \ell + \sum_{t=K+1}^{n}\; \sum_{s=1}^{t-1}\; \sum_{s_i=\ell}^{t-1} \mathbb{1}\!\left\{\bar{X}^*_s + \sqrt{\tfrac{2\ln(t-1)}{s}} \le \bar{X}_{i,s_i} + \sqrt{\tfrac{2\ln(t-1)}{s_i}}\right\}$$
$$\le \ell + \sum_{t=1}^{\infty}\; \sum_{s=1}^{t}\; \sum_{s_i=\ell}^{t} \mathbb{1}\!\left\{\bar{X}^*_s + \sqrt{\tfrac{2\ln t}{s}} \le \bar{X}_{i,s_i} + \sqrt{\tfrac{2\ln t}{s_i}}\right\}.$$
Theorem 1 Proof. Observe that $\bar{X}^*_s + \sqrt{2\ln t / s} \le \bar{X}_{i,s_i} + \sqrt{2\ln t / s_i}$ implies that at least one of the following must hold:
$$\bar{X}^*_s \le \mu^* - \sqrt{\tfrac{2\ln t}{s}} \quad (7) \qquad \bar{X}_{i,s_i} \ge \mu_i + \sqrt{\tfrac{2\ln t}{s_i}} \quad (8) \qquad \mu^* < \mu_i + 2\sqrt{\tfrac{2\ln t}{s_i}} \quad (9)$$
This is proved by contradiction: if (7), (8) and (9) were all false, then by the negations of (7) and (9) we would have $\bar{X}^*_s + \sqrt{2\ln t / s} > \mu^* \ge \mu_i + 2\sqrt{2\ln t / s_i}$, and by the negation of (8), $\mu_i + \sqrt{2\ln t / s_i} > \bar{X}_{i,s_i}$, so altogether $\bar{X}^*_s + \sqrt{2\ln t / s} > \bar{X}_{i,s_i} + \sqrt{2\ln t / s_i}$, contradicting the assumed inequality.
Fact 1 (Chernoff-Hoeffding bound). Let $X_1, \ldots, X_n$ be i.i.d. random variables with range $[0,1]$ and $\mathbb{E}[X_t] = \mu$, and let $\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$. Then for all $a \ge 0$,
$$\Pr\{\bar{X}_n \ge \mu + a\} \le e^{-2na^2} \quad \text{and} \quad \Pr\{\bar{X}_n \le \mu - a\} \le e^{-2na^2}.$$
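A quick Monte-Carlo sanity check of the Hoeffding bound above, not part of the original slides; the Bernoulli(0.5) rewards, $n = 50$, and $a = 0.1$ are arbitrary illustration values.

```python
import math
import random

# Empirically check P(mean(X_1..X_n) >= mu + a) <= exp(-2 n a^2) for [0,1] rewards.
rng = random.Random(0)
n, mu, a, trials = 50, 0.5, 0.1, 100_000

exceed = 0
for _ in range(trials):
    mean = sum(1.0 if rng.random() < mu else 0.0 for _ in range(n)) / n
    if mean >= mu + a:
        exceed += 1

print("empirical frequency:", exceed / trials)        # roughly 0.10 for these values
print("Hoeffding bound:", math.exp(-2 * n * a * a))   # exp(-1), about 0.368
```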
Theorem 1 Proof. Condition (9) can be rewritten as $\frac{\Delta_i}{2} < \sqrt{\frac{2\ln t}{s_i}}$ (9.1). We bound the probability of events (7) and (8) using Fact 1 with $a = \sqrt{2\ln t / s}$:
$$\Pr\!\left\{\bar{X}^*_s \le \mu^* - \sqrt{\tfrac{2\ln t}{s}}\right\} \le e^{-4\ln t} = t^{-4}, \qquad \Pr\!\left\{\bar{X}_{i,s_i} \ge \mu_i + \sqrt{\tfrac{2\ln t}{s_i}}\right\} \le e^{-4\ln t} = t^{-4}.$$
For $s_i \ge \lceil 8\ln n / \Delta_i^2 \rceil$, condition (9) is false:
$$\mu^* - \mu_i - 2\sqrt{\tfrac{2\ln t}{s_i}} \ge \mu^* - \mu_i - \Delta_i = 0.$$
We therefore take $\ell = \lceil 8\ln n / \Delta_i^2 \rceil$.
Theorem 1 Proof. By the union bound,
$$\Pr\!\left\{\bar{X}^*_s \le \mu^* - \sqrt{\tfrac{2\ln t}{s}} \ \text{ or }\ \bar{X}_{i,s_i} \ge \mu_i + \sqrt{\tfrac{2\ln t}{s_i}}\right\} \le 2t^{-4},$$
and therefore
$$\mathbb{E}[T_i(n)] \le \left\lceil \frac{8\ln n}{\Delta_i^2} \right\rceil + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i=\lceil 8\ln n/\Delta_i^2\rceil}^{t} 2t^{-4} \le \frac{8\ln n}{\Delta_i^2} + 1 + \sum_{t=1}^{\infty} 2t^{-2} = \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}.$$
Theorem 1 Proof. Since $\sum_{i=1}^{K} \mathbb{E}[T_i(n)] = n$, the regret equals $\sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(n)]$, so plugging in the bound above gives
$$\text{Regret}(n) = \sum_{i=1}^{K} \Delta_i\, \mathbb{E}[T_i(n)] \le \left[\, 8 \sum_{i:\,\mu_i < \mu^*} \frac{\ln n}{\Delta_i} \,\right] + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{K} \Delta_j,$$
which completes the proof.
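To get a feel for the bound, here is a short sketch (not from the slides) evaluating the Theorem 1 guarantee on the easy two-machine instance (parameters 0.9 and 0.6) used in the experiments; the horizon $n = 1000$ is an arbitrary illustration value.

```python
import math

def ucb1_regret_bound(means, n):
    """Theorem 1: 8 * sum_{i: mu_i < mu*} ln(n)/Delta_i + (1 + pi^2/3) * sum_j Delta_j."""
    mu_star = max(means)
    gaps = [mu_star - m for m in means]
    log_term = 8 * sum(math.log(n) / d for d in gaps if d > 0)
    const_term = (1 + math.pi ** 2 / 3) * sum(gaps)
    return log_term + const_term

print(ucb1_regret_bound([0.9, 0.6], 1000))  # about 185.5
```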
UCB2. Deterministic policy: UCB2 (parameter $0 < \alpha < 1$). Initialization: play each machine once; set $r_j = 0$ for $j = 1, \ldots, K$. Loop: select the machine $j$ that maximizes $\bar{x}_j + a_{n, r_j}$, where $a_{n,r} = \sqrt{\frac{(1+\alpha)\ln(e n / \tau(r))}{2\tau(r)}}$ and $\tau(r) = \lceil (1+\alpha)^r \rceil$; play machine $j$ exactly $\tau(r_j + 1) - \tau(r_j)$ times (an epoch growing exponentially); then increment $r_j$.
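A rough Python sketch of the UCB2 epoch structure above; the Bernoulli reward model and the choice $\alpha = 0.5$ are illustration assumptions.

```python
import math
import random

def tau(r, alpha):
    return math.ceil((1 + alpha) ** r)

def ucb2(means, n_rounds, alpha=0.5, seed=0):
    """Run UCB2 on Bernoulli machines; epochs grow as tau(r) = ceil((1+alpha)^r)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums, epochs = [0] * K, [0.0] * K, [0] * K
    t = 0

    def pull(j):
        nonlocal t
        t += 1
        sums[j] += 1.0 if rng.random() < means[j] else 0.0
        counts[j] += 1

    for j in range(K):          # Initialization: play each machine once.
        pull(j)

    while t < n_rounds:
        # Index of machine j: average reward + a_{t, r_j}.
        def bonus(j):
            tr = tau(epochs[j], alpha)
            return math.sqrt((1 + alpha) * math.log(math.e * t / tr) / (2 * tr))
        best = max(range(K), key=lambda j: sums[j] / counts[j] + bonus(j))
        # Play the chosen machine for a whole epoch: tau(r+1) - tau(r) plays.
        for _ in range(tau(epochs[best] + 1, alpha) - tau(epochs[best], alpha)):
            if t >= n_rounds:
                break
            pull(best)
        epochs[best] += 1
    return counts

print(ucb2([0.9, 0.6], 1000))
```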
Theorem 2. For every $0 < \alpha < 1$, policy UCB2's expected regret after any number $n \ge \max_{i:\,\mu_i < \mu^*} \frac{1}{2\Delta_i^2}$ of plays is at most
$$\sum_{i:\,\mu_i < \mu^*} \left( \frac{(1+\alpha)(1+4\alpha)\ln\!\left(2 e n \Delta_i^2\right)}{2\Delta_i} + \frac{c_\alpha}{\Delta_i} \right),$$
where $c_\alpha$ is a constant depending only on $\alpha$.
Randomized Policy: $\varepsilon_n$-GREEDY. Parameters: $c > 0$ and $0 < d < 1$. Initialization: define the sequence $\varepsilon_n \in (0,1]$, $n = 1, 2, \ldots$ by $\varepsilon_n = \min\{1, \frac{cK}{d^2 n}\}$. Loop: for each $n = 1, 2, \ldots$, let $i_n$ be the machine with the highest current average reward; with probability $1 - \varepsilon_n$ play $i_n$, and with probability $\varepsilon_n$ play a random arm.
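A minimal sketch of the $\varepsilon_n$-GREEDY loop above; the Bernoulli reward model and the parameter values $c = 5$, $d = 0.3$ are illustration choices only.

```python
import random

def epsilon_greedy(means, n_rounds, c=5.0, d=0.3, seed=0):
    """Run eps_n-GREEDY with eps_n = min(1, c*K / (d^2 * n)) on Bernoulli machines."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K

    def pull(j):
        sums[j] += 1.0 if rng.random() < means[j] else 0.0
        counts[j] += 1

    for n in range(1, n_rounds + 1):
        eps = min(1.0, c * K / (d * d * n))
        # Greedy arm: highest current average reward (unplayed arms count as 0).
        averages = [sums[j] / counts[j] if counts[j] > 0 else 0.0 for j in range(K)]
        greedy = max(range(K), key=lambda j: averages[j])
        # Explore a uniformly random arm with probability eps, exploit otherwise.
        pull(rng.randrange(K) if rng.random() < eps else greedy)
    return counts

print(epsilon_greedy([0.9, 0.6], 1000))
```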
Theorem 3. If policy $\varepsilon_n$-GREEDY is run with $0 < d \le \min_{i:\,\mu_i < \mu^*} \Delta_i$, then the probability that it chooses a suboptimal machine $j$ after any number $n \ge \frac{cK}{d}$ of plays is at most
$$\frac{c}{d^2 n} + 2\left(\frac{c}{d^2}\ln\frac{(n-1)\,d^2 e^{1/2}}{cK}\right)\left(\frac{cK}{(n-1)\,d^2 e^{1/2}}\right)^{c/(5d^2)} + \frac{4e}{d^2}\left(\frac{cK}{(n-1)\,d^2 e^{1/2}}\right)^{c/2}.$$
Experiments: Settings (plots omitted).
Experiments: Comparison between Policies. UCB1-TUNED: the Theorem 1 index fine-tuned with small changes that could not be proven, used for practical purposes. UCB2. $\varepsilon_n$-GREEDY with $\varepsilon_n = \min\{1, \frac{cK}{d^2 n}\}$; we will see different fine-tuning values of the parameters.
Experiments: Intuition? UCB1-TUNED vs. UCB2 vs. $\varepsilon_n$-GREEDY ($\varepsilon_n = \min\{1, \frac{cK}{d^2 n}\}$).
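The slides do not spell out the UCB1-TUNED index; the sketch below assumes the form given in the underlying Auer, Cesa-Bianchi, and Fischer paper, where the exploration term uses an empirical variance estimate capped at 1/4 (the maximal variance of a $[0,1]$-valued reward). The example input values are hypothetical.

```python
import math

def ucb1_tuned_index(avg, sq_avg, n_j, t):
    """UCB1-TUNED index for one machine, following the form in the Auer et al. paper.
    avg    - average reward of the machine so far
    sq_avg - average of the squared rewards so far
    n_j    - number of times the machine was played
    t      - total number of plays so far
    """
    # Empirical variance estimate with its own exploration term.
    v = sq_avg - avg ** 2 + math.sqrt(2 * math.log(t) / n_j)
    # Variance-aware exploration term, capped at 1/4.
    return avg + math.sqrt((math.log(t) / n_j) * min(0.25, v))

# Hypothetical illustration values: a machine played 20 of 100 rounds.
print(ucb1_tuned_index(avg=0.7, sq_avg=0.7, n_j=20, t=100))  # about 0.94
```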
Easy: 2 machines with parameters 0.9, 0.6
Medium: 2 machines with parameters 0.9, 0.8
Hard: 2 machines with parameters 0.55, 0.45
Easy: 10 machines with parameters 0.9, 0.6, ..., 0.6
Medium: 10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6
Medium: 10 machines with parameters 0.9, 0.8, ..., 0.8
Hard: 10 machines with parameters 0.55, 0.45, ..., 0.45
Experiments: Results. UCB1-TUNED: in most cases performs comparably to a well-tuned $\varepsilon_n$-GREEDY, and is not sensitive to the variance of the machines. UCB2: performs similarly to UCB1-TUNED, but always slightly worse. $\varepsilon_n$-GREEDY: performs almost always best, but is sensitive to the number of non-optimal machines (it explores uniformly over all arms), and if not well tuned its performance degrades rapidly.
Conclusion. We have shown simple and efficient policies for the bandit problem that achieve logarithmic regret uniformly over time, and demonstrated some strengths and weaknesses of the different algorithms.