Reinforcement Learning for Queueing Systems

Natural Policy Gradient is explored as an algorithm for optimizing Markov Decision Processes in queueing systems with unknown parameters. The challenges of unknown system dynamics and policy optimization are addressed through reinforcement learning techniques such as Actor-critic and Trust Region Policy Optimization. Recent variance reduction methods have shown improved empirical effectiveness for countably-infinite queues. Key questions revolve around the convergence to the global optimum and the speed of convergence, especially in scenarios with average cost and countably-infinite state space.


Presentation Transcript


  1. Convergence for Natural Policy Gradient on Infinite-State Average-Reward Markov Decision Processes (Natural Policy Gradient for Queues). Izzy Grosof (Isaac) (they/she), Georgia Tech / UIUC / Northwestern IEMS; Siva Theja Maguluri (Georgia Tech); R. Srikant (UIUC). Reinforcement Learning for Stochastic Networks, June 17.

  2. Problem: Optimize a Queueing System. Setting: a queueing system with unknown parameters (the generalized switch). Goal: optimize the policy, e.g. minimize the mean queue length. Challenge 1: the system dynamics are unknown. Challenge 2: even if we knew the dynamics, how do we pick an optimal policy? Answer: reinforcement learning.

  3. Reinforcement Learning for Queueing Systems. Framework: model the queueing system as a Markov Decision Process (MDP) with unknown parameters. Policy gradient methods: actor-critic, Trust Region Policy Optimization, Proximal Policy Optimization [Schulman et al. 15, 17]. These were developed for finite state-space settings, yet are empirically highly effective for countably-infinite queues. Recent variance-reduction work does even better empirically [Dai & Gluzman 22]. What we want to analyze: do these methods converge to the global optimum, and at what rate? Their common underpinning: Natural Policy Gradient.

  4. Natural Policy Gradient for Queueing Systems. Natural Policy Gradient (NPG): an algorithm for MDP optimization. We study the simplest setting: perfect information and a tabular policy space. Q: Does NPG converge to the globally optimal policy for queueing MDPs, and if so, how fast? This is an open problem. Key challenges: average cost and a countably-infinite state space.

  5. Outline: Setting: RL for queueing systems. Define the NPG algorithm. Challenges & prior work. The expert-advice framework. Result: first convergence guarantee for NPG on queues.

  6. Natural Policy Gradient: Definition. Relative value function $Q^{\pi}(s,a)$: the solution to the Poisson equation $Q^{\pi}(s,a) = \|s\|_1 - \mathbb{E}_{\pi}[\|s\|_1] + \mathbb{E}_{s',a'}[Q^{\pi}(s',a')]$, where $\|s\|_1$ is the total queue length (the per-step cost) and $\mathbb{E}_{\pi}[\|s\|_1]$ is the policy's average cost. Natural Policy Gradient algorithm: given an initial policy $\pi_0$ and learning rate $\eta$, for each iteration $t = 0$ to $T-1$ and each state $s$, update the action distribution $\pi(\cdot \mid s)$: $\pi_{t+1}(a \mid s) \propto \pi_t(a \mid s)\, e^{-\eta\, Q^{\pi_t}(s,a)}$. Q: Does NPG converge to the globally optimal policy for queueing MDPs?
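
To make the update concrete, here is a minimal sketch of the tabular NPG step for a small finite toy MDP (the talk's setting is countably-infinite, but the per-state update has the same form). The function name, array shapes, and example Q values are illustrative assumptions, and the relative values are taken as exactly known, matching the perfect-information tabular setting above.

```python
import numpy as np

def npg_update(pi, Q, eta):
    """One tabular NPG step: pi_{t+1}(a|s) proportional to pi_t(a|s) * exp(-eta * Q(s,a)).

    pi : (num_states, num_actions) array of action probabilities per state.
    Q  : (num_states, num_actions) array of relative action values Q^{pi_t}(s,a).
    eta: learning rate; a scalar, or an array of shape (num_states, 1) for
         state-dependent rates as used later in the talk.
    """
    # Multiplicative-weights / softmax update; subtracting the per-state max
    # exponent is only for numerical stability and does not change the result.
    logits = np.log(pi) - eta * Q
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Hypothetical example: two states, two actions, uniform initial policy.
pi0 = np.full((2, 2), 0.5)
Q0 = np.array([[1.0, 3.0],
               [2.0, 0.5]])
pi1 = npg_update(pi0, Q0, eta=0.1)
print(pi1)  # probability mass shifts toward the lower-cost action in each state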

  7. Prior Work on NPG. Discounted cost: $O(1/T)$ convergence [Geist et al. 19], [Agarwal et al. 21]; to transfer these results to the average-cost setting, one needs better dependence on the discount rate. Average cost, finite state: Even-Dar et al. 09 show $O(1/\sqrt{T})$ convergence, assuming a bounded mixing time; Murthy & Srikant 23 show $O(1/T)$ convergence, assuming a mixing time bounded uniformly over policies. Average cost, infinite state: results only for specialized settings [Kunnumkal & Topaloglu 08], [Fazel et al. 14]; open in general, including queueing systems. This is the case to investigate further.

  8. Expert-Advice Framework [Even-Dar et al. 09]. Performance difference lemma: $J^{\pi} - J^{\pi^*} = \mathbb{E}_{s \sim \pi^*}\big[ Q^{\pi}(s, \pi(s)) - Q^{\pi}(s, \pi^*(s)) \big]$. So if, in each state, the action $\pi(s)$ is close to the optimal action $\pi^*(s)$ as evaluated by $Q^{\pi}(s, \cdot)$, then $\pi$ is near optimal. Add up all NPG iterations and look at a particular state $s$: the key quantity is $\sum_{t=1}^{T} \big( Q^{\pi_t}(s, \pi_t(s)) - Q^{\pi_t}(s, \pi^*(s)) \big)$. This is an instance of the expert-advice problem with the specific cost function $c_t(a) = Q^{\pi_t}(s, a)$.

  9. Expert Advice Problem: Background (with $c_t(a) = Q^{\pi_t}(s,a)$). At each time step $t = 1$ to $T$: 1. Select an action $a_t$. 2. The costs $c_t(a)$ are revealed. Goal: minimize regret relative to the best single action: $\sum_{t=1}^{T} c_t(a_t) - \min_{a} \sum_{t=1}^{T} c_t(a)$. Compare with the per-state quantity from the performance difference lemma: $\sum_{t=1}^{T} \big( Q^{\pi_t}(s, \pi_t(s)) - Q^{\pi_t}(s, \pi^*(s)) \big)$.

  10. NPG and the Expert-Advice Framework (with $c_t(a) = Q^{\pi_t}(s,a)$). In the expert-advice framework, NPG is exactly the Weighted Majority algorithm: $\pi_{t+1}(a) \propto \pi_t(a)\, e^{-\eta\, c_t(a)}$. Thm (Cesa-Bianchi et al. 97): if $|c_t(a)| \le M$ and $\eta$ is chosen as a function of $M$, Weighted Majority achieves regret $O(M\sqrt{T})$. Even-Dar: a bounded mixing time gives bounded $Q^{\pi_t}(s,a)$, hence $O(\sqrt{T})$ regret. Our plan: 1. Bound the growth rate of $Q^{\pi_t}(s,a)$ as a function of the state $s$. 2. Set the learning rate $\eta_s$ as a function of $s$. 3. Obtain a state-dependent regret bound. 4. Prove convergence.
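
For reference, a small sketch of the exponential-weights / Weighted Majority update that this slide identifies NPG with, run on a synthetic bounded cost sequence. The function name, the random cost data, and the particular tuning $\eta = \sqrt{\log n / T}$ (a standard choice when costs lie in $[0,1]$) are assumptions for illustration, not the talk's exact parameters; in the talk, each queue state effectively runs its own copy of this update with costs $c_t(a) = Q^{\pi_t}(s,a)$ and a state-dependent learning rate.

```python
import numpy as np

def weighted_majority(costs, eta):
    """Exponential-weights (Weighted Majority) on a cost sequence.

    costs: (T, num_actions) array; costs[t, a] is the cost of action a at round t,
           revealed after round t's distribution is chosen.
    eta  : learning rate.
    Returns the sequence of action distributions played.
    """
    T, n = costs.shape
    log_w = np.zeros(n)              # log-weights, start uniform
    dists = np.empty((T, n))
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        dists[t] = w / w.sum()       # play the normalized weights
        log_w -= eta * costs[t]      # downweight actions that incurred high cost
    return dists

# Regret check on random costs bounded by M = 1 (hypothetical data):
rng = np.random.default_rng(0)
T, n = 2000, 4
costs = rng.uniform(0, 1, size=(T, n))
dists = weighted_majority(costs, eta=np.sqrt(np.log(n) / T))
alg_cost = (dists * costs).sum()
best_single = costs.sum(axis=0).min()
print(alg_cost - best_single)  # regret; bounded by O(sqrt(T log n))
```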

  11. Outline: Setting: RL for queueing systems. Define the NPG algorithm. Challenges & prior work. The expert-advice framework. Bound the relative value. Analyze MaxWeight. Result!

  12. Bound on the Relative Value $Q^{\pi_t}(s,a)$. Naive idea: bound the relative value $Q^{\pi}(s,a)$ of an arbitrary policy $\pi$? No: an arbitrary policy could be unstable or pathological. Intuition: NPG is trying to improve the policy, so $Q^{\pi_t}(s,a)$ shouldn't be much worse than $Q^{\pi_{t-1}}(s,a)$. We prove that $Q^{\pi_t}(s,a)$ exceeds $Q^{\pi_0}(s,a)$ by at most a term that grows with the iteration count $t$. This is the most technically challenging step, based on the relationship between the relative value $Q^{\pi_t}(s,a)$ and the mixing time toward the empty state $0$. Next: bound the relative value $Q^{\pi_0}$ of the initial policy.

  13. Initial Policy: MaxWeight. For the initial policy, we need to bound the relative value $Q^{\pi_0}(s,a)$. We choose MaxWeight as the initial policy; it is known to have the optimal stability region in the generalized switch [Stolyar 04]. Definition of MaxWeight: let $\{j\}$ be the service options and let $\mu_{i,j}$ be the completion rate of class-$i$ jobs under service option $j$. Then $\mathrm{MaxWeight}(s) = \arg\max_j \sum_i s_i\, \mu_{i,j}$.
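
A minimal sketch of the MaxWeight decision rule defined above; the function name and the example rate matrix are hypothetical, and the sketch assumes the rates $\mu_{i,j}$ are known, which MaxWeight requires.

```python
import numpy as np

def maxweight(queue_lengths, service_rates):
    """MaxWeight: pick the service option maximizing sum_i s_i * mu_{i,j}.

    queue_lengths: (num_classes,) vector s of current queue lengths.
    service_rates: (num_classes, num_options) matrix; service_rates[i, j] is the
                   completion rate mu_{i,j} of class-i jobs under option j.
    Returns the index of the chosen service option.
    """
    weights = queue_lengths @ service_rates   # weight of option j: sum_i s_i * mu_{i,j}
    return int(np.argmax(weights))

# Hypothetical example: 2 job classes, 3 service options.
s = np.array([5.0, 1.0])
mu = np.array([[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.5]])
print(maxweight(s, mu))  # -> 0: serve the long class-0 queue
```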

  14. Relative Value for MaxWeight. Key MaxWeight property [Srikant & Ying 13]: good Lyapunov stability with the quadratic test function $\|s\|_2^2$, i.e. a drift bound of the form $\mathbb{E}\big[\|s(t+1)\|_2^2 - \|s(t)\|_2^2 \mid s(t)=s\big] \le -2\epsilon\|s\|_1 + C$, where $\epsilon$ is the stability gap: the distance to the boundary of the stability region. This drift condition is typically used to prove that MaxWeight has the optimal stability region. We use it instead to bound MaxWeight's relative value function, obtaining a bound quadratic in the queue lengths, of the form $Q^{\pi_{\mathrm{MW}}}(s,a) \le \frac{\|s\|_2^2}{2\epsilon} + c_1\|s\|_1 + c_2$ for constants $c_1, c_2$.

  15. Putting the Result Together. 1. MaxWeight achieves a quadratic relative value: $Q^{\pi_{\mathrm{MW}}}(s,a) = O(\|s\|_2^2 + \|s\|_1)$. 2. Under NPG initialized with MaxWeight, every iterate policy $\pi_t$ also achieves a quadratic relative value, $Q^{\pi_t}(s,a) = O(\|s\|_2^2 + \|s\|_1 + t)$. 3. Using Even-Dar's expert-advice + weighted-majority framework, set the state-dependent learning rate $\eta_s$ to bound the regret in each state: $\sum_{t=1}^{T}\big(Q^{\pi_t}(s,\pi_t(s)) - Q^{\pi_t}(s,\pi^*(s))\big) \le c_3(s)\sqrt{T}$, with $c_3(s)$ quadratic in the queue length.

  16. Putting the Result Together (continued). 3. Using Even-Dar's expert-advice + weighted-majority framework, set the learning rate $\eta_s$ to bound the regret in each state: $\sum_{t=1}^{T}\big(Q^{\pi_t}(s,\pi_t(s)) - Q^{\pi_t}(s,\pi^*(s))\big) \le c_3(s)\sqrt{T}$. 4. Use the performance difference lemma to bound the average optimality gap: $\frac{1}{T}\sum_{t=1}^{T}\big(J^{\pi_t} - J^{\pi^*}\big) = \frac{1}{T}\,\mathbb{E}_{s\sim\pi^*}\Big[\sum_{t=1}^{T}\big(Q^{\pi_t}(s,\pi_t(s)) - Q^{\pi_t}(s,\pi^*(s))\big)\Big] \le \frac{\mathbb{E}_{s\sim\pi^*}[c_3(s)]}{\sqrt{T}}$; this expectation is finite and small, which is why we needed bounds in terms of $\|s\|_1$. 5. Use the monotonicity of NPG to bound the final policy: $J^{\pi_T} - J^{\pi^*} \le \frac{c_4}{\sqrt{T}}$.

  17. Main Result (new!). Theorem: Given any generalized switch queueing system, the Natural Policy Gradient algorithm, with initial policy MaxWeight and learning rate function $\eta_s$ chosen according to the quadratic relative value bound, achieves an $O(1/\sqrt{T})$ convergence rate to the globally optimal policy: $J^{\pi_T} - J^{\pi^*} \le \frac{c_4}{\sqrt{T}}$. This is the first such result, and it generalizes to cost functions beyond $\|s\|_1$.

  18. Follow-up Work (the focus of Srikant's keynote). Add function approximation: rather than the exact $Q^{\pi}(s,a)$, use an approximate $\hat{Q}^{\pi}(s,a)$ within the same expert-advice framework. Narrow the focus to policies with "Uniform Lyapunov Stability": the class of policies satisfying the same drift condition as MaxWeight, and converge to the optimal policy within that class.

  19. Future Directions. $O(1/T)$ convergence? Tighter dependence on the parameters of the queueing system? More general classes of MDPs? Sampling/learning?

  20. Conclusion. In the generalized switch queueing system, the Natural Policy Gradient algorithm, initialized with the MaxWeight policy and using a specific learning rate function $\eta_s$, achieves an $O(1/\sqrt{T})$ convergence rate to the optimal policy.
