Finite-time Analysis of the Multiarmed Bandit Problem (Advanced Algorithms Seminar)

This presentation covers the stochastic multiarmed bandit problem and how to achieve logarithmic regret uniformly over time. It goes over the problem setting, notation, previous work, the objective, and results with proofs, including the UCB1, UCB2, and ε_n-GREEDY policies and a bound on the expected regret of UCB1 after any number of plays, followed by experiments and conclusions.



Presentation Transcript


  1. Finite-time Analysis of the Multiarmed Bandit Problem. Advanced Algorithms Seminar, Fall 2017, instructed by Professor Haim Kaplan. Presenter: Guy Inbar.

  2. Contents: Introduction; Problem Settings & Notations; Previous Work; Objective; Results & Proofs; Experiments; Conclusions.

  3. Introduction: Problem Setting. Stochastic multiarmed bandit problem (no adversary). There are K arms (machines); each machine i has an unknown reward expectation μ_i, and successive plays of machine i yield rewards X_i = {X_{i,1}, X_{i,2}, ...} that are identically distributed with mean μ_i. Rewards are completely independent, both across machines and across plays.

  4. Introduction: Notation. T_i(n) is the number of times machine i has been played during the first n turns, and μ* = max_i μ_i. The regret after n plays is
      \mu^* \, n \;-\; \sum_{i=1}^{K} \mu_i \, \mathbb{E}[T_i(n)].
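To make the regret definition concrete, here is a small Python illustration (not from the slides; the arm means and pull counts are made-up example values):

```python
# Illustration of the regret definition mu* n - sum_i mu_i E[T_i(n)].
# The means and pull counts below are made-up example values.
mus = [0.9, 0.6, 0.6]      # true expected rewards mu_i
pulls = [850, 80, 70]      # T_i(n): how often each machine was played
n = sum(pulls)             # total number of plays

mu_star = max(mus)
regret = mu_star * n - sum(mu * t for mu, t in zip(mus, pulls))
print(regret)              # 0.9*1000 - (0.9*850 + 0.6*80 + 0.6*70) = 45.0
```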

  5. Introduction: Previous Work. Lai & Robbins (1985): constructed policies for which, for every suboptimal machine i, E[T_i(n)] grows like (ln n) / D(p_i || p*), where D is the Kullback-Leibler divergence between the reward densities; the optimal machine is played exponentially more often than any other machine, at least asymptotically. Based on an upper confidence index. Agrawal (1995): the index can be expressed as a simple function of the total reward of the machine.

  6. Introduction: Objective. Achieve logarithmic regret uniformly over time.

  7. UCB1. Derived from the index-based policy of Agrawal. Deterministic policy UCB1:
      Initialization: play each machine once.
      Loop: play the machine j that maximizes
          \bar{x}_{j,T_j(t-1)} + \sqrt{\frac{2\ln(t-1)}{T_j(t-1)}},
      where \bar{x}_{j,T_j(t-1)} is the average reward of machine j after turn t-1, and t-1 is the total number of plays so far.
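A minimal Python sketch of the UCB1 rule as stated on this slide; the `arms` interface (a list of zero-argument callables returning rewards in [0, 1]) and the Bernoulli test machines are illustrative assumptions, not part of the original slides:

```python
import math
import random

def ucb1(arms, horizon):
    """UCB1 sketch: play each machine once, then repeatedly play the machine
    maximizing average reward + sqrt(2 * ln(t - 1) / plays)."""
    k = len(arms)
    counts = [0] * k           # T_j(t): number of plays of machine j so far
    sums = [0.0] * k           # cumulative reward of machine j

    for j in range(k):         # initialization: play each machine once
        sums[j] += arms[j]()
        counts[j] += 1

    for t in range(k + 1, horizon + 1):
        # index of machine j: current average reward plus confidence radius
        j = max(range(k),
                key=lambda i: sums[i] / counts[i]
                              + math.sqrt(2 * math.log(t - 1) / counts[i]))
        sums[j] += arms[j]()
        counts[j] += 1
    return counts

# Hypothetical usage: two Bernoulli machines with parameters 0.9 and 0.6.
counts = ucb1([lambda: float(random.random() < 0.9),
               lambda: float(random.random() < 0.6)], horizon=10_000)
print(counts)                  # the 0.9 machine should dominate the play counts
```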

  8. Theorem 1. The expected regret of policy UCB1 after any number n of plays is at most
          8 \sum_{i:\,\mu_i<\mu^*} \frac{\ln n}{\Delta_i} \;+\; \Big(1+\frac{\pi^2}{3}\Big) \sum_{i=1}^{K} \Delta_i,
      where \Delta_i = \mu^* - \mu_i.
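As a sanity check, the bound can be evaluated numerically; a small Python sketch (the two-machine instance is the "easy" setting that appears later in the experiments):

```python
import math

def ucb1_regret_bound(mus, n):
    """Theorem 1 bound: 8 * sum_{i: mu_i < mu*} ln(n)/Delta_i
    + (1 + pi^2/3) * sum_i Delta_i."""
    mu_star = max(mus)
    deltas = [mu_star - mu for mu in mus]
    log_term = 8 * sum(math.log(n) / d for d in deltas if d > 0)
    const_term = (1 + math.pi ** 2 / 3) * sum(deltas)
    return log_term + const_term

print(ucb1_regret_bound([0.9, 0.6], n=10_000))   # roughly 247
```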

  9. Theorem 1: Proof. Let ℓ be an arbitrary positive integer and write c_{t,s} = \sqrt{2\ln t / s}; a star denotes the optimal machine. For any suboptimal machine i,
          T_i(n) = 1 + \sum_{t=K+1}^{n} \{I_t = i\} \le \ell + \sum_{t=K+1}^{n} \{I_t = i,\; T_i(t-1) \ge \ell\}.
      Remember that UCB1 selects the machine maximizing \bar{x}_{j,T_j(t-1)} + c_{t-1,T_j(t-1)}, so
          T_i(n) \le \ell + \sum_{t=K+1}^{n} \{\bar{x}^*_{T^*(t-1)} + c_{t-1,T^*(t-1)} \le \bar{x}_{i,T_i(t-1)} + c_{t-1,T_i(t-1)},\; T_i(t-1) \ge \ell\}.

  10. Theorem 1: Proof (continued). Since 0 < T^*(t-1) < t and ℓ ≤ T_i(t-1) < t, we can only increase the bound by taking the worst case over these counts:
          T_i(n) \le \ell + \sum_{t=K+1}^{n} \Big\{ \min_{0<s<t} \big(\bar{x}^*_s + c_{t-1,s}\big) \le \max_{\ell \le s_i < t} \big(\bar{x}_{i,s_i} + c_{t-1,s_i}\big) \Big\}.

  11. Theorem 1: Proof (continued). Replacing the min and max by a union over all admissible values of s and s_i, and then extending the ranges,
          T_i(n) \le \ell + \sum_{t=K+1}^{n} \sum_{s=1}^{t-1} \sum_{s_i=\ell}^{t-1} \{\bar{x}^*_s + c_{t-1,s} \le \bar{x}_{i,s_i} + c_{t-1,s_i}\}
                 \le \ell + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i=\ell}^{t} \{\bar{x}^*_s + c_{t,s} \le \bar{x}_{i,s_i} + c_{t,s_i}\}.

  12. Theorem 1: Proof (continued). The event \bar{x}^*_s + c_{t,s} \le \bar{x}_{i,s_i} + c_{t,s_i} implies that at least one of the following holds:
          (7)  \bar{x}^*_s \le \mu^* - c_{t,s}
          (8)  \bar{x}_{i,s_i} \ge \mu_i + c_{t,s_i}
          (9)  \mu^* < \mu_i + 2\,c_{t,s_i}
      We prove this by contradiction.

  13. Theorem 1: Proof (continued). Suppose none of (7), (8), (9) holds:
          not (7):  \bar{x}^*_s > \mu^* - c_{t,s}
          not (8):  \bar{x}_{i,s_i} < \mu_i + c_{t,s_i}, i.e. (8.1)  \mu_i > \bar{x}_{i,s_i} - c_{t,s_i}
          not (9):  \mu^* \ge \mu_i + 2\,c_{t,s_i}
      Combining not (7) with not (9): \bar{x}^*_s + c_{t,s} > \mu^* \ge \mu_i + 2\,c_{t,s_i}, and by (8.1) this last quantity exceeds \bar{x}_{i,s_i} + c_{t,s_i}. Hence \bar{x}^*_s + c_{t,s} > \bar{x}_{i,s_i} + c_{t,s_i}, contradicting the event above.

  14. Theorem 1: Proof (recap). So the event \bar{x}^*_s + c_{t,s} \le \bar{x}_{i,s_i} + c_{t,s_i} implies
          (7)  \bar{x}^*_s \le \mu^* - c_{t,s}   or   (8)  \bar{x}_{i,s_i} \ge \mu_i + c_{t,s_i}   or   (9)  \mu^* < \mu_i + 2\,c_{t,s_i}.

  15. Fact 1 (Chernoff-Hoeffding bound). Let X_1, ..., X_n be independent random variables with values in [0,1] and E[X_t] = μ, and let \bar{X} = (X_1 + ... + X_n)/n. Then for all a ≥ 0,
          P(\bar{X} \ge \mu + a) \le e^{-2na^2}   and   P(\bar{X} \le \mu - a) \le e^{-2na^2}.

  16. Theorem 1: Proof (continued). We bound the probabilities of (7) and (8) using Fact 1 with a = c_{t,s} = \sqrt{2\ln t / s}:
          P(\bar{x}^*_s \le \mu^* - c_{t,s}) \le e^{-4\ln t} = t^{-4}, and likewise P(\bar{x}_{i,s_i} \ge \mu_i + c_{t,s_i}) \le t^{-4}.
      For ℓ = \lceil 8\ln n / \Delta_i^2 \rceil and s_i \ge ℓ, condition (9) is false, since (9) is equivalent to (9.1) s_i < 8\ln t / \Delta_i^2.
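Written out, the algebra behind the remark that s_i ≥ ℓ rules out condition (9) (a worked restatement of the slide's step (9.1); the final observation assumes plays within the horizon, t ≤ n):

```latex
% Condition (9) rewritten in terms of s_i:
\mu^* < \mu_i + 2 c_{t,s_i}
\;\Longleftrightarrow\;
\Delta_i < 2\sqrt{\frac{2\ln t}{s_i}}
\;\Longleftrightarrow\;
s_i < \frac{8\ln t}{\Delta_i^2}.
% Hence, for t \le n and s_i \ge \ell = \lceil 8\ln n / \Delta_i^2 \rceil,
% condition (9) cannot hold.
```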

  17. Theorem 1: Proof (continued). Therefore
          P\{\bar{x}^*_s + c_{t,s} \le \bar{x}_{i,s_i} + c_{t,s_i}\} \le P\{(7)\} + P\{(8)\} \le 2\,t^{-4},
      and taking expectations in the bound on T_i(n), with ℓ = \lceil 8\ln n / \Delta_i^2 \rceil,
          E[T_i(n)] \le \Big\lceil \frac{8\ln n}{\Delta_i^2} \Big\rceil + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i=1}^{t} 2\,t^{-4}.

  18. Theorem 1: Proof (continued). The triple sum evaluates to
          \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_i=1}^{t} 2\,t^{-4} = \sum_{t=1}^{\infty} 2\,t^{-2} = \frac{\pi^2}{3},
      so E[T_i(n)] \le \frac{8\ln n}{\Delta_i^2} + 1 + \frac{\pi^2}{3}.

  19. Theorem 1: Proof (conclusion). Summing over the machines,
          \text{Regret}(n) = \sum_{i=1}^{K} \Delta_i \, E[T_i(n)] \le 8 \sum_{i:\,\mu_i<\mu^*} \frac{\ln n}{\Delta_i} + \Big(1+\frac{\pi^2}{3}\Big) \sum_{i=1}^{K} \Delta_i,
      which is the bound of Theorem 1. ∎

  20. UCB2. Deterministic policy UCB2, with parameter 0 < α < 1 and epoch lengths τ(r) = \lceil (1+α)^r \rceil:
      Initialization: play each machine once; set r_j = 0 for j = 1, ..., K.
      Loop: select the machine j that maximizes
          \bar{x}_j + \sqrt{\frac{(1+\alpha)\,\ln\!\big(e\,n / \tau(r_j)\big)}{2\,\tau(r_j)}}
      (n is the current total number of plays), play machine j for an epoch of exponentially growing length (τ(r_j+1) - τ(r_j) plays), and increment the epoch counter r_j of machine j.
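A Python sketch of UCB2 along these lines; the index and epoch schedule τ(r) = ⌈(1+α)^r⌉ follow the description above, while the default α = 0.1 and the `arms` interface are illustrative assumptions:

```python
import math

def tau(r, alpha):
    """Epoch length schedule tau(r) = ceil((1 + alpha)^r)."""
    return math.ceil((1 + alpha) ** r)

def ucb2(arms, horizon, alpha=0.1):
    """UCB2 sketch: pick the machine with the largest index, play it for a
    whole (exponentially growing) epoch, then increment its epoch counter."""
    k = len(arms)
    counts, sums, epochs = [0] * k, [0.0] * k, [0] * k
    n = 0

    for j in range(k):                     # initialization: play each machine once
        sums[j] += arms[j]()
        counts[j] += 1
        n += 1

    while n < horizon:
        def bonus(j):
            t = tau(epochs[j], alpha)
            return math.sqrt((1 + alpha) * math.log(math.e * n / t) / (2 * t))
        j = max(range(k), key=lambda i: sums[i] / counts[i] + bonus(i))
        plays = max(1, tau(epochs[j] + 1, alpha) - tau(epochs[j], alpha))
        for _ in range(plays):             # play machine j for one epoch
            if n >= horizon:
                break
            sums[j] += arms[j]()
            counts[j] += 1
            n += 1
        epochs[j] += 1
    return counts
```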

  21. Theorem 2. For every 0 < α < 1, the expected regret of policy UCB2 after any number n ≥ \max_{i:\,\mu_i<\mu^*} \frac{1}{2\Delta_i^2} of plays is at most
          \sum_{i:\,\mu_i<\mu^*} \Big( \frac{(1+\alpha)(1+4\alpha)\,\ln\!\big(2e\,n\,\Delta_i^2\big)}{2\,\Delta_i} + \frac{c_\alpha}{\Delta_i} \Big),
      where c_\alpha is a constant that depends only on α.

  22. Randomized policy: ε_n-GREEDY. Parameters: c > 0 and 0 < d < 1.
      Initialization: define the sequence ε_n ∈ (0,1], n = 1, 2, ..., by ε_n = \min\{1, \frac{cK}{d^2 n}\}.
      Loop: for each n = 1, 2, ..., let i_n be the machine with the highest current average reward; with probability 1 - ε_n play i_n, and with probability ε_n play a machine chosen uniformly at random.
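A Python sketch of ε_n-GREEDY as described on this slide; the default parameter values c = 0.1 and d = 0.1 are arbitrary illustrations, not the tuned values used in the experiments:

```python
import random

def epsilon_n_greedy(arms, horizon, c=0.1, d=0.1):
    """eps_n-GREEDY sketch: at turn n, explore a uniformly random machine with
    probability eps_n = min(1, c*K / (d^2 * n)), otherwise play the machine
    with the highest current average reward."""
    k = len(arms)
    counts, sums = [0] * k, [0.0] * k

    for n in range(1, horizon + 1):
        eps = min(1.0, c * k / (d * d * n))
        if random.random() < eps:
            j = random.randrange(k)        # explore: uniformly random machine
        else:                              # exploit: best average so far
            j = max(range(k),
                    key=lambda i: sums[i] / counts[i] if counts[i] else 0.0)
        sums[j] += arms[j]()
        counts[j] += 1
    return counts
```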

  23. Theorem 3. If 0 < d ≤ \min_{i:\,\mu_i<\mu^*} \Delta_i, then the probability that policy ε_n-GREEDY chooses a suboptimal machine j after any number n ≥ cK/d of plays is at most
          \frac{c}{d^2 n} + 2\Big(\frac{c}{d^2}\,\ln\frac{(n-1)\,d^2\,e^{1/2}}{cK}\Big)\Big(\frac{cK}{(n-1)\,d^2\,e^{1/2}}\Big)^{c/(5d^2)} + \frac{4e}{d^2}\Big(\frac{cK}{(n-1)\,d^2\,e^{1/2}}\Big)^{c/2}.

  24. Experiments

  25. Experiments: Settings. [Table of the experiment settings, with easy (E) and hard (H) instances; the concrete 2-machine and 10-machine instances are listed on the following slides.]

  26. Experiments: Compared Policies.
      UCB1-TUNED: the Theorem 1 index fine-tuned with small changes that could not be proven; intended for practical purposes.
      UCB2.
      ε_n-GREEDY with ε_n = \min\{1, \frac{cK}{d^2 n}\}; we will see different fine-tuning values.

  27. Experiments: Intuition? Before looking at the results: how would you expect UCB1-TUNED, UCB2, and ε_n-GREEDY (ε_n = \min\{1, cK/(d^2 n)\}) to compare? A small simulation sketch is given below.
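A minimal simulation harness in the spirit of these experiments, assuming Bernoulli reward distributions and reusing the hypothetical `ucb1`, `ucb2`, and `epsilon_n_greedy` sketches from earlier slides:

```python
import random

def bernoulli_arms(params):
    """Machines whose rewards are Bernoulli with the given parameters."""
    return [lambda p=p: float(random.random() < p) for p in params]

def average_regret(policy, params, horizon, repetitions=20, **kwargs):
    """Average regret of `policy` over several repetitions on one instance."""
    mu_star = max(params)
    total = 0.0
    for _ in range(repetitions):
        counts = policy(bernoulli_arms(params), horizon, **kwargs)
        total += sum((mu_star - p) * t for p, t in zip(params, counts))
    return total / repetitions

# Hypothetical comparison on the "easy" two-machine instance (0.9 vs 0.6).
for name, policy in [("UCB1", ucb1), ("UCB2", ucb2),
                     ("eps_n-GREEDY", epsilon_n_greedy)]:
    print(name, average_regret(policy, [0.9, 0.6], horizon=10_000))
```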

  28. Easy 2 machines with parameters 0.9, 0.6

  29. Easy 2 machines with parameters 0.9, 0.6

  30. Medium 2 machines with parameters 0.9, 0.8

  31. Medium 2 machines with parameters 0.9, 0.8

  32. Hard 2 machines with parameters 0.55, 0.45

  33. Hard 2 machines with parameters 0.55, 0.45

  34. Easy 10 machines with parameters 0.9, 0.6, . . . , 0.6

  35. Easy 10 machines with parameters 0.9, 0.6, . . . , 0.6

  36. Medium 10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6

  37. Medium 10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6

  38. Medium 10 machines with parameters 0.9, 0.8, . . . , 0.8

  39. Medium 10 machines with parameters 0.9, 0.8, . . . , 0.8

  40. Hard 10 machines with parameters 0.55, 0.45, . . . , 0.45

  41. Hard 10 machines with parameters 0.55, 0.45, . . . , 0.45

  42. Experiments: Results.
      UCB1-TUNED: in most cases performs comparably to a well-tuned ε_n-GREEDY; not sensitive to the variance of the machines.
      UCB2: performs similarly to UCB1-TUNED, but always slightly worse.
      ε_n-GREEDY: almost always performs best; sensitive to the non-optimal machines (it explores them uniformly); if not well tuned, performance degrades rapidly.

  43. Conclusion. We have shown simple and efficient policies for the bandit problem that exhibit uniform logarithmic regret, and we have seen some strengths and weaknesses of the different algorithms.
