Reinforcement Learning for Queueing Systems

Natural Policy Gradient is explored as an algorithm for optimizing Markov Decision Processes in queueing systems with unknown parameters. The challenges of unknown system dynamics and policy optimization are addressed through reinforcement learning techniques such as Actor-critic and Trust Region Policy Optimization. Recent variance reduction methods have shown improved empirical effectiveness for countably-infinite queues. Key questions revolve around the convergence to the global optimum and the speed of convergence, especially in scenarios with average cost and countably-infinite state space.


Presentation Transcript


  1. Convergence for Natural Policy Gradient on Infinite-State Average-Reward Markov Decision Processes (Natural Policy Gradient for Queues). Izzy Grosof (Isaac) (they/she), Georgia Tech / UIUC / Northwestern IEMS; Siva Theja Maguluri (Georgia Tech); R. Srikant (UIUC). Reinforcement Learning for Stochastic Networks, June 17.

  2. Problem: Optimize a Queueing System. Setting: a queueing system with unknown parameters (the generalized switch). Goal: optimize the policy, e.g. minimize the mean queue length. Challenge 1: the system dynamics are unknown. Challenge 2: even if we knew the dynamics, how do we pick an optimal policy? Answer: reinforcement learning.

  3. Reinforcement Learning for Queueing Systems. Framework: model the queueing system as a Markov Decision Process (MDP) with unknown parameters. Policy gradient methods: actor-critic, Trust Region Policy Optimization, Proximal Policy Optimization [Schulman et al. 15, 17]. These were developed for finite state-space settings, yet are empirically highly effective for countably-infinite queues. Recent variance-reduction work does even better empirically [Dai & Gluzman 22]. What we want to analyze: do these methods converge to the global optimum, and at what rate? Their common underpinning: Natural Policy Gradient.

  4. Natural Policy Gradient for Queueing Systems. Natural Policy Gradient (NPG): an algorithm for MDP optimization. We study the simplest setting: perfect information and a tabular policy space. Q: Does NPG converge to the globally optimal policy for queueing MDPs, and if so, how fast? This is an open problem. Key challenges: average cost and a countably-infinite state space.

  5. Outline: Setting: RL for queueing systems. Define the NPG algorithm. Challenges & prior work. The expert-advice framework. Result: first convergence guarantee for NPG on queues.

  6. Natural Policy Gradient: Definition. Relative value function $Q^{\pi}(s,a)$: the solution to the Poisson equation $Q^{\pi}(s,a) = \|s\|_1 - \mathbb{E}_{\pi}[\|s\|_1] + \mathbb{E}_{s',a'}[Q^{\pi}(s',a')]$, where $\|s\|_1$ is the total queue length (the per-step cost) and $\mathbb{E}_{\pi}[\|s\|_1]$ is the policy's average cost. Natural Policy Gradient algorithm: given an initial policy $\pi_0$ and learning rate $\eta$, for each iteration $t = 0$ to $T-1$ and each state $s$, update the action distribution $\pi(\cdot \mid s)$: $\pi_{t+1}(a \mid s) \propto \pi_t(a \mid s)\, e^{-\eta\, Q^{\pi_t}(s,a)}$. Q: Does NPG converge to the globally optimal policy for queueing MDPs?
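
To make the update concrete, here is a minimal sketch of the tabular NPG step for a small finite toy MDP (the talk's setting is countably-infinite, but the per-state update has the same form). The function name, array shapes, and example Q values are illustrative assumptions, and the relative values are taken as exactly known, matching the perfect-information tabular setting above.

```python
import numpy as np

def npg_update(pi, Q, eta):
    """One tabular NPG step: pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(-eta * Q(s,a)).

    pi : (num_states, num_actions) array of action probabilities per state.
    Q  : (num_states, num_actions) array of relative action values Q^{pi_t}(s,a).
    eta: learning rate; a scalar, or an array of shape (num_states, 1) for
         state-dependent rates as used later in the talk.
    """
    # Multiplicative-weights / softmax update; subtracting the per-state max
    # exponent is only for numerical stability and does not change the result.
    logits = np.log(pi) - eta * Q
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Hypothetical example: two states, two actions, uniform initial policy.
pi0 = np.full((2, 2), 0.5)
Q0 = np.array([[1.0, 3.0],
               [2.0, 0.5]])
pi1 = npg_update(pi0, Q0, eta=0.1)
print(pi1)  # probability mass shifts toward the lower-cost action in each state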

  7. Prior Work on NPG. Discounted cost: $O(1/T)$ convergence [Geist et al. 19], [Agarwal et al. 21]; to transfer these results to the average-cost setting, one needs better dependence on the discount rate. Average cost, finite state: Even-Dar et al. 09 show $O(1/\sqrt{T})$ convergence, assuming a bounded mixing time; Murthy & Srikant 23 show $O(1/T)$ convergence, assuming a mixing time bounded uniformly over policies. Average cost, infinite state: results only for specialized settings [Kunnumkal & Topaloglu 08], [Fazel et al. 14]; open in general, including queueing systems. This is the case to investigate further.

  8. Expert-Advice Framework [Even-Dar et al. 09]. Performance difference lemma: $J^{\pi} - J^{\pi^*} = \mathbb{E}_{s \sim \pi^*}\big[ Q^{\pi}(s, \pi(s)) - Q^{\pi}(s, \pi^*(s)) \big]$. So if, in each state, the action $\pi(s)$ is close to the optimal action $\pi^*(s)$ as evaluated by $Q^{\pi}(s, \cdot)$, then $\pi$ is near optimal. Add up all NPG iterations and look at a particular state $s$: the key quantity is $\sum_{t=1}^{T} \big( Q^{\pi_t}(s, \pi_t(s)) - Q^{\pi_t}(s, \pi^*(s)) \big)$. This is an instance of the expert-advice problem with the specific cost function $c_t(a) = Q^{\pi_t}(s, a)$.

  9. Expert Advice Problem: Background (with $c_t(a) = Q^{\pi_t}(s,a)$). At each time step $t = 1$ to $T$: 1. Select an action $a_t$. 2. The costs $c_t(a)$ are revealed. Goal: minimize regret relative to the best single action: $\sum_{t=1}^{T} c_t(a_t) - \min_{a} \sum_{t=1}^{T} c_t(a)$. Compare with the per-state quantity from the performance difference lemma: $\sum_{t=1}^{T} \big( Q^{\pi_t}(s, \pi_t(s)) - Q^{\pi_t}(s, \pi^*(s)) \big)$.

  10. NPG and the Expert-Advice Framework (with $c_t(a) = Q^{\pi_t}(s,a)$). In the expert-advice framework, NPG is exactly the Weighted Majority algorithm: $\pi_{t+1}(a) \propto \pi_t(a)\, e^{-\eta\, c_t(a)}$. Thm (Cesa-Bianchi et al. 97): if $|c_t(a)| \le M$ and $\eta$ is chosen as a function of $M$, Weighted Majority achieves regret $O(M\sqrt{T})$. Even-Dar: a bounded mixing time gives bounded $Q^{\pi_t}(s,a)$, hence $O(\sqrt{T})$ regret. Our plan: 1. Bound the growth rate of $Q^{\pi_t}(s,a)$ as a function of the state $s$. 2. Set the learning rate $\eta_s$ as a function of $s$. 3. Obtain a state-dependent regret bound. 4. Prove convergence.
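
For reference, a small sketch of the exponential-weights / Weighted Majority update that this slide identifies NPG with, run on a synthetic bounded cost sequence. The function name, the random cost data, and the particular tuning $\eta = \sqrt{\log n / T}$ (a standard choice when costs lie in $[0,1]$) are assumptions for illustration, not the talk's exact parameters; in the talk, each queue state effectively runs its own copy of this update with costs $c_t(a) = Q^{\pi_t}(s,a)$ and a state-dependent learning rate.

```python
import numpy as np

def weighted_majority(costs, eta):
    """Exponential-weights (Weighted Majority) on a cost sequence.

    costs: (T, num_actions) array; costs[t, a] is the cost of action a at round t,
           revealed after round t's distribution is chosen.
    eta  : learning rate.
    Returns the sequence of action distributions played.
    """
    T, n = costs.shape
    log_w = np.zeros(n)              # log-weights, start uniform
    dists = np.empty((T, n))
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        dists[t] = w / w.sum()       # play the normalized weights
        log_w -= eta * costs[t]      # downweight actions that incurred high cost
    return dists

# Regret check on random costs bounded by M = 1 (hypothetical data):
rng = np.random.default_rng(0)
T, n = 2000, 4
costs = rng.uniform(0, 1, size=(T, n))
dists = weighted_majority(costs, eta=np.sqrt(np.log(n) / T))
alg_cost = (dists * costs).sum()
best_single = costs.sum(axis=0).min()
print(alg_cost - best_single)  # regret; bounded by O(sqrt(T log n))
```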

  11. Outline: Setting: RL for queueing systems. Define the NPG algorithm. Challenges & prior work. The expert-advice framework. Bound the relative value. Analyze MaxWeight. Result!

  12. Bound on the Relative Value $Q^{\pi_t}(s,a)$. Naive idea: bound the relative value $Q^{\pi}(s,a)$ of an arbitrary policy $\pi$? No: an arbitrary policy could be unstable or pathological. Intuition: NPG is trying to improve the policy, so $Q^{\pi_t}(s,a)$ shouldn't be much worse than $Q^{\pi_{t-1}}(s,a)$. We prove that $Q^{\pi_t}(s,a)$ exceeds $Q^{\pi_0}(s,a)$ by at most a term that grows with the iteration count $t$. This is the most technically challenging step, based on the relationship between the relative value $Q^{\pi_t}(s,a)$ and the mixing time toward the empty state $0$. Next: bound the relative value $Q^{\pi_0}$ of the initial policy.

  13. Initial Policy: MaxWeight. For the initial policy, we need to bound the relative value $Q^{\pi_0}(s,a)$. We choose MaxWeight as the initial policy; it is known to have the optimal stability region in the generalized switch [Stolyar 04]. Definition of MaxWeight: let $\{j\}$ be the service options and let $\mu_{i,j}$ be the completion rate of class-$i$ jobs under service option $j$. Then $\mathrm{MaxWeight}(s) = \arg\max_j \sum_i s_i\, \mu_{i,j}$.
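
A minimal sketch of the MaxWeight decision rule defined above; the function name and the example rate matrix are hypothetical, and the sketch assumes the rates $\mu_{i,j}$ are known, which MaxWeight requires.

```python
import numpy as np

def maxweight(queue_lengths, service_rates):
    """MaxWeight: pick the service option maximizing sum_i s_i * mu_{i,j}.

    queue_lengths: (num_classes,) vector s of current queue lengths.
    service_rates: (num_classes, num_options) matrix; service_rates[i, j] is the
                   completion rate mu_{i,j} of class-i jobs under option j.
    Returns the index of the chosen service option.
    """
    weights = queue_lengths @ service_rates   # weight of option j: sum_i s_i * mu_{i,j}
    return int(np.argmax(weights))

# Hypothetical example: 2 job classes, 3 service options.
s = np.array([5.0, 1.0])
mu = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.5]])
print(maxweight(s, mu))  # -> 0: serve the long class-0 queue
```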

  14. Relative Value for MaxWeight. Key MaxWeight property [Srikant & Ying 13]: good Lyapunov stability with the quadratic test function $\|s\|_2^2$, i.e. a drift bound of the form $\mathbb{E}\big[\|s(t+1)\|_2^2 - \|s(t)\|_2^2 \mid s(t)=s\big] \le -2\epsilon\|s\|_1 + C$, where $\epsilon$ is the stability gap: the distance to the boundary of the stability region. This drift condition is typically used to prove that MaxWeight has the optimal stability region. We use it instead to bound MaxWeight's relative value function, obtaining a bound quadratic in the queue lengths, of the form $Q^{\pi_{\mathrm{MW}}}(s,a) \le \frac{\|s\|_2^2}{2\epsilon} + c_1\|s\|_1 + c_2$ for constants $c_1, c_2$.

  15. Putting the Result Together. 1. MaxWeight achieves a quadratic relative value: $Q^{\pi_{\mathrm{MW}}}(s,a) = O(\|s\|_2^2 + \|s\|_1)$. 2. Under NPG initialized with MaxWeight, every iterate policy $\pi_t$ also achieves a quadratic relative value, $Q^{\pi_t}(s,a) = O(\|s\|_2^2 + \|s\|_1 + t)$. 3. Using Even-Dar's expert-advice + weighted-majority framework, set the state-dependent learning rate $\eta_s$ to bound the regret in each state: $\sum_{t=1}^{T}\big(Q^{\pi_t}(s,\pi_t(s)) - Q^{\pi_t}(s,\pi^*(s))\big) \le c_3(s)\sqrt{T}$, with $c_3(s)$ quadratic in the queue length.

  16. Putting the Result Together (continued). 3. Using Even-Dar's expert-advice + weighted-majority framework, set the learning rate $\eta_s$ to bound the regret in each state: $\sum_{t=1}^{T}\big(Q^{\pi_t}(s,\pi_t(s)) - Q^{\pi_t}(s,\pi^*(s))\big) \le c_3(s)\sqrt{T}$. 4. Use the performance difference lemma to bound the average optimality gap: $\frac{1}{T}\sum_{t=1}^{T}\big(J^{\pi_t} - J^{\pi^*}\big) = \frac{1}{T}\,\mathbb{E}_{s\sim\pi^*}\Big[\sum_{t=1}^{T}\big(Q^{\pi_t}(s,\pi_t(s)) - Q^{\pi_t}(s,\pi^*(s))\big)\Big] \le \frac{\mathbb{E}_{s\sim\pi^*}[c_3(s)]}{\sqrt{T}}$; this expectation is finite and small, which is why we needed bounds in terms of $\|s\|_1$. 5. Use the monotonicity of NPG to bound the final policy: $J^{\pi_T} - J^{\pi^*} \le \frac{c_4}{\sqrt{T}}$.

  17. Main Result (new!). Theorem: Given any generalized switch queueing system, the Natural Policy Gradient algorithm, with initial policy MaxWeight and learning rate function $\eta_s$ chosen according to the quadratic relative value bound, achieves an $O(1/\sqrt{T})$ convergence rate to the globally optimal policy: $J^{\pi_T} - J^{\pi^*} \le \frac{c_4}{\sqrt{T}}$. This is the first such result, and it generalizes to cost functions beyond $\|s\|_1$.

  18. Follow-up Work (the focus of Srikant's keynote). Add function approximation: rather than the exact $Q^{\pi}(s,a)$, use an approximate $\hat{Q}^{\pi}(s,a)$ within the same expert-advice framework. Narrow the focus to policies with "Uniform Lyapunov Stability": the class of policies satisfying the same drift condition as MaxWeight, and converge to the optimal policy within that class.

  19. Future Directions. $O(1/T)$ convergence? Tighter dependence on the parameters of the queueing system? More general classes of MDPs? Sampling/learning?

  20. Conclusion. In the generalized switch queueing system, the Natural Policy Gradient algorithm, initialized with the MaxWeight policy and using a specific learning rate function $\eta_s$, achieves an $O(1/\sqrt{T})$ convergence rate to the optimal policy.
