Challenges in Model-Based Nonlinear Bandit and Reinforcement Learning
This talk covers provable model-based nonlinear bandit and reinforcement learning: why state-of-the-art analyses do not cover RL with even one-layer neural net approximation, why such bandits are statistically hard for global regret, and a new paradigm based on converging to local maxima with sample complexity governed by the complexity of the model class.
Presentation Transcript
Provable Model-based Nonlinear Bandit (and RL)
Tengyu Ma (Stanford University), Kefan Dong (Stanford University), Jiaqi Yang (Tsinghua University)
Toward a Theory for Deep Reinforcement Learning?
Prior comprehensive studies cover the tabular case, linear function approximation, abstract complexity measures (Bellman rank, witness rank, etc.), and state-action aggregation. Nearly nothing exists for neural nets in particular. Is this the right timing? Here we assume an optimization/supervised-learning oracle and focus on sample efficiency and sequential extrapolation.
State-of-the-art Analyses for RL
Claim: none of the existing analyses applies even to RL with one-layer neural net approximation of the dynamics (more evidence later). [Bilinear Classes: A Structural Framework for Provable Generalization in RL. Du-Kakade-Lee-Lovett-Mahajan-Sun-Wang '21]
Neural Net Bandit: A Simplification of RL with H = 1
Reward function η(θ, a), where θ ∈ Θ is the model parameter and a ∈ 𝒜 is a continuous action.
Ex1: linear bandit: η(θ, a) = θᵀa.
Ex2: neural net bandit: η(θ, a) = NN_θ(a).
Realizable and deterministic reward setting: there is a ground-truth θ⋆, and we observe the ground-truth reward η(θ⋆, a_t) after playing a_t.
Goal: find the best arm a⋆ = argmax_a η(θ⋆, a).
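To fix notation concretely, here is a minimal sketch (Python/NumPy) of the realizable, deterministic bandit protocol above; the class name, the specific one-layer architecture, and the dimensions are illustrative assumptions, not code from the talk.

```python
import numpy as np

class NeuralNetBandit:
    """Deterministic, realizable bandit: reward eta(theta*, a) = w2^T relu(W1 a).

    The ground-truth parameters (W1, w2) are hidden; the learner only sees
    the reward of each action it plays."""

    def __init__(self, W1, w2):
        self.W1, self.w2 = W1, w2                 # ground-truth theta* = (W1, w2)

    def reward(self, a):
        return float(self.w2 @ np.maximum(self.W1 @ a, 0.0))

# One round of interaction: the learner plays an action and observes the exact reward.
rng = np.random.default_rng(0)
d, m = 10, 4
bandit = NeuralNetBandit(rng.standard_normal((m, d)), rng.standard_normal(m))
a_t = rng.standard_normal(d); a_t /= np.linalg.norm(a_t)   # unit-norm action
r_t = bandit.reward(a_t)
```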
Even One-layer Neural Net Bandit is Statistically Hard!
Θ and 𝒜 are unit ℓ₂-balls in ℝᵈ, and η(θ, a) = relu(θᵀa − 0.9).
The best arm is a⋆ = argmax_{||a||₂ ≤ 1} relu(θ⋆ᵀa − 0.9) = θ⋆: the reward is maximized by aligning the action with θ⋆.
Random actions have zero reward and give no information about θ⋆, because the set {a : θ⋆ᵀa ≥ 0.9} has exp(−Ω(d)) probability mass: a needle in a haystack!
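A quick numerical illustration of the needle-in-a-haystack point, assuming for the demo that actions are drawn uniformly from the unit sphere: random actions essentially never land in {a : θ⋆ᵀa ≥ 0.9}, so every observed relu reward is zero and carries no signal about θ⋆.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 100_000

theta = rng.standard_normal(d); theta /= np.linalg.norm(theta)   # hidden theta*
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)                    # random unit-norm actions

rewards = np.maximum(A @ theta - 0.9, 0.0)     # relu(theta*^T a - 0.9)
print((rewards > 0).mean())                    # ~0.0: random actions earn zero reward
```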
Hard Instances Can Also Have Smooth and Non-Sparse Rewards
η((w, θ), a) = wᵀa + w₀ · φ(θᵀa − 0.9), where φ is a smooth bump around the hidden direction θ.
Random actions can only help learn the linear part w: the linear part exposes a local maximum, while the global maximum hidden in the bump remains a needle in a haystack.
Extendable to RL with a nonlinear family of dynamics and known reward.
State-of-the-art Analyses for RL
Claim: none of these applies even to RL with one-layer neural net approximations of the dynamics, because they all give polynomial sample complexity guarantees, which the hardness result above rules out.
A New Paradigm Toward Nonlinear Bandit / RL Theory
1. General convergence to local maxima (the focus of this talk).
2. Analysis of the quality of the local maxima of the ground-truth η(θ⋆, ·): are all local maxima global (or satisfactory)?
Baselines for Converging to Local Maxima: Zero-order Optimization for Bandit and Policy Gradient for RL
Let f(a) = η(θ⋆, a). Zero-order optimization (evolution strategies) estimates the gradient ∇f(a) at every a individually: for δ ∼ N(0, σ²I), E[δ · (f(a + δ) − f(a))] ≈ E[δδᵀ∇f(a)] = σ²∇f(a). This yields a sample complexity that scales with the ambient dimension d.
Q: can we leverage model extrapolation to improve sample efficiency?
Inspiration: model-based methods are more sample-efficient than model-free methods (theoretically for tabular cases, empirically for deep RL as well).
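A minimal sketch of this zero-order baseline, under the assumption that the reward f can be queried at arbitrary actions; the test function and sample counts are placeholders.

```python
import numpy as np

def zero_order_gradient(f, a, sigma=0.01, n_samples=2000, rng=None):
    """Estimate grad f(a) via E[delta * (f(a + delta) - f(a))] / sigma^2,
    with delta ~ N(0, sigma^2 I). Each sample costs one extra reward query."""
    rng = rng or np.random.default_rng()
    d = a.shape[0]
    grad = np.zeros(d)
    fa = f(a)
    for _ in range(n_samples):
        delta = sigma * rng.standard_normal(d)
        grad += delta * (f(a + delta) - fa)
    return grad / (n_samples * sigma**2)

# Sanity check on a quadratic: grad of f(a) = -||a - 1||^2 at a = 0 is 2*(1 - a) = (2,...,2).
f = lambda a: -np.sum((a - 1.0) ** 2)
print(zero_order_gradient(f, np.zeros(5), rng=np.random.default_rng(0)))
```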
Our Main Results on Bandit
Theorem (informal): Suppose the reward η is O(1)-Lipschitz and smooth. There is a bandit algorithm that converges to an ε-approximate local maximum with Õ(𝒞/ε⁴) samples, where 𝒞 is a complexity measure of the model class ℱ = {η(θ, ·) : θ ∈ Θ}.
Linear bandit: η(θ, a) = ⟨θ, a⟩.
- When Θ is finite: O(log|Θ|) samples.
- When Θ contains s-sparse vectors or has s degrees of freedom: Õ(s) samples.
- Local maxima are also global maxima because η(θ⋆, ·) is concave.
Neural net bandit: η(θ, a) = θ₂ᵀ relu(θ₁a), assuming O(1) norm bounds on ||θ₂||₁ and ||θ₁||₁.
- The algorithm finds a local maximum with Õ(1) samples.
- If all output weights are negative, then η(θ, ·) is concave in a, so local maxima are also global maxima.
Does Classical Model-based UCB Converge to Local Max?
UCB: (θ_t, a_t) = argmax_{θ ∈ confidence region, a ∈ 𝒜} η(θ, a).
On the hard instance η((w, θ), a) = wᵀa + w₀ φ(θᵀa − 0.9): given random actions, the confidence region can pin down w⋆ but has no clue about θ⋆. UCB keeps optimistically guessing θ_t and chooses a_t aligned with that guess.
UCB fundamentally aims for the global maximum and keeps exploring. It also fails for deep RL empirically, because the optimistic model fantasizes too much.
Where Does UCB's Analysis Break? (virtual reward: η(θ_t, ·); real reward: η(θ⋆, ·))
1. High virtual reward: by definition of optimism, η(θ_t, a_t) ≥ η(θ⋆, a⋆).
2. Extrapolation (on average): Σ_{t=1}^T (η(θ_t, a_t) − η(θ⋆, a_t))² ≲ dim(ℱ), where dim is the Eluder dimension or the dimension of the linear span.
1 + 2 ⇒ η(θ⋆, a_t) → η(θ⋆, a⋆).
Step 2 fails for neural net models because dim_Eluder(ℱ) ≥ exp(d): the learner may estimate θ_t suboptimally, since we only know that θ_t fits past data.
[Russo-Van Roy '13, "Eluder Dimension and the Sample Complexity of Optimistic Exploration"]
Our Idea: Re-prioritizing the Two Steps
2. Extrapolation by an online learning (OL) oracle: Σ_{t=1}^T (η(θ_t, a_t) − η(θ⋆, a_t))² ≲ SRC_T(ℱ), the sequential Rademacher complexity [Rakhlin-Sridharan-Tewari '15].
- For a finite hypothesis class ℱ: SRC_T(ℱ) ≲ √(T log|ℱ|).
- For neural nets: SRC_T(ℱ) = poly(weight norms) · √T, versus Eluder dimension exp(d). SRC can be dimension-free and depend only on the weight norms.
The gain stems from choosing θ_t via the OL oracle, instead of an optimistic point in the confidence region.
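To make "online learning oracle" concrete for the finite-class case mentioned above, here is one standard choice (a sketch, not the talk's algorithm): exponential weights over a finite model class under squared loss, whose regret against the best fixed model scales with log|ℱ|, mirroring the finite-class bound quoted above. The interface and learning rate are illustrative assumptions.

```python
import numpy as np

def exp_weights_oracle(models, reward_fn, actions, true_rewards, eta_lr=1.0):
    """Online learner over a finite class {theta_1, ..., theta_K}.
    At each round it outputs a distribution p_t over models (a 'randomized model'),
    then suffers squared loss between each model's prediction and the observed reward."""
    K = len(models)
    log_w = np.zeros(K)
    history = []
    for a_t, r_t in zip(actions, true_rewards):
        p_t = np.exp(log_w - log_w.max()); p_t /= p_t.sum()
        history.append(p_t)
        preds = np.array([reward_fn(theta, a_t) for theta in models])
        log_w -= eta_lr * (preds - r_t) ** 2      # multiplicative-weights update
    return history                                # p_t for every round
```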
OL Oracle Extrapolates Better
Consider the loss (η(θ_t, a_t) − η(θ⋆, a_t))² against the ground truth η(θ⋆, ·).
OL: the loss is ≈ 0 at the played action a_t, since the online learner's regret bound controls the prediction error at the actions actually played.
UCB: the loss can be large at the chosen optimistic action a_t, since the confidence region only guarantees a fit on past data.
Our Idea: Re-prioritizing the Two Steps
(Caveat set aside: the OL oracle should be randomized; θ_{t+1} can only depend on the distribution of a_t.)
1. High virtual reward, best attempt: a_t = argmax_a η(θ_t, a).
2. Extrapolation by an online learning (OL) oracle, as above: Σ_{t=1}^T (η(θ_t, a_t) − η(θ⋆, a_t))² ≲ SRC_T(ℱ).
Issues with Lack of Optimism
Unlike UCB, which picks optimistic models, the best-attempt action a_t = argmax_a η(θ_t, a) need not be optimistic. Risk: getting stuck at a virtual local maximum of η(θ_t, ·) that is not a local maximum of the real reward η(θ⋆, ·) (the slide illustrates this with the curves η(θ⋆, ·), η(θ_t, ·), and the models θ_{t−1}, θ_t).
Diagnosis and Fix
We need the online learner to work harder to guarantee an increasing virtual reward. Estimate the curvature: learn θ_t such that, for a ∈ {past actions},
1. η(θ_t, a) ≈ η(θ⋆, a)
2. ∇_a η(θ_t, a) ≈ ∇_a η(θ⋆, a)
3. ∇²_a η(θ_t, a) ≈ ∇²_a η(θ⋆, a)
Otherwise the learner can get stuck at a virtual local maximum (slide figure: η(θ_t, ·) vs. η(θ⋆, ·)).
Diagnosis and Fix (continued)
With the gradient matched at a_t, ascending the virtual reward η(θ_{t+1}, ·) from a_t produces a new action a_{t+1} that also improves the real reward η(θ⋆, ·) (slide figure: matching gradients at a_t).
Curvature Estimates ⇒ Improvements on Real/Virtual Rewards
Assume η(θ, ·) is smooth. If a_t is not a critical point of the real reward, we can improve:
η(θ⋆, a_{t+1}) ≈ η(θ_{t+1}, a_{t+1}) = max_a η(θ_{t+1}, a)
  ≥ η(θ_{t+1}, a_t) + Ω(||∇_a η(θ_{t+1}, a_t)||₂²)   (descent lemma)
  ≈ η(θ⋆, a_t) + Ω(||∇_a η(θ⋆, a_t)||₂²)   (OL guarantee: values and gradients match on past actions)
Telescoping back to a_0, the real reward keeps increasing above η(θ⋆, a_0) until ∇_a η(θ⋆, a_t) is small.
(Caveat set aside: the OL oracle should be randomized.)
Learning Gradients with Model-based Extrapolation
Idealized OL loss: ||∇_a η(θ, a_t) − ∇_a η(θ⋆, a_t)||₂². But we don't observe the real gradient ∇_a η(θ⋆, a_t), and a model-free gradient estimate takes Ω(d) samples.
Surrogate OL loss via Johnson-Lindenstrauss: for ξ ∼ N(0, I),
  ℓ(θ; a_t, ξ) := ⟨∇_a η(θ, a_t) − ∇_a η(θ⋆, a_t), ξ⟩².
Unbiased estimator: E_ξ[ℓ(θ; a_t, ξ)] = ||∇_a η(θ, a_t) − ∇_a η(θ⋆, a_t)||₂².
The directional gradient ⟨∇_a η(θ⋆, a_t), ξ⟩ can be computed from the rewards of two actions:
  ⟨∇_a η(θ⋆, a_t), ξ⟩ ≈ (η(θ⋆, a_t + δξ) − η(θ⋆, a_t)) / δ   (δ → 0).
In the spirit of JL, the sample complexity depends on the family of possible gradients {∇_a η(θ, a) : θ ∈ Θ, a ∈ 𝒜}, but not on the ambient dimension.
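A small numerical check of the two ingredients above, with placeholder functions and vectors: (i) E_ξ[⟨v, ξ⟩²] = ||v||₂² for ξ ∼ N(0, I), so the projected loss is an unbiased surrogate for the idealized gradient-matching loss, and (ii) the directional derivative ⟨∇f(a), ξ⟩ can be read off from two reward queries.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 100

# (i) Unbiasedness of the JL-style surrogate: E[<v, xi>^2] = ||v||^2.
v = rng.standard_normal(d)               # stands in for grad_a eta(theta, a) - grad_a eta(theta*, a)
xis = rng.standard_normal((100_000, d))
proj = xis @ v
print(np.mean(proj ** 2), np.linalg.norm(v) ** 2)   # the two numbers should be close

# (ii) Directional derivative from two reward queries:
# <grad f(a), xi> ~= (f(a + delta*xi) - f(a)) / delta for small delta.
f = lambda a: np.sin(a).sum()            # placeholder smooth reward, standing in for eta(theta*, .)
a, xi, delta = rng.standard_normal(d), rng.standard_normal(d), 1e-5
finite_diff = (f(a + delta * xi) - f(a)) / delta
exact = np.cos(a) @ xi                   # grad f(a) = cos(a)
print(finite_diff, exact)                # should agree closely
```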
Formal Algorithm and Theorem for Converging to Critical Points
OL loss: ℓ(θ; a, ξ) = (η(θ, a) − η(θ⋆, a))² + ⟨∇_a η(θ, a) − ∇_a η(θ⋆, a), ξ⟩², where the ground-truth quantities are obtained from reward queries at and near a.
ViOL (Virtual Ascent with Online model Learner): at each round,
1. Use the OL oracle on the losses ℓ_s(·) = ℓ(·; a_s, ξ_s) from previous rounds, with ξ_s ∼ N(0, I), to obtain a randomized model θ_t.
2. Take a_t = argmax_a E_{θ_t}[η(θ_t, a)].
Theorem: Suppose the sequential Rademacher complexity of the loss class 𝒢 = {ℓ(θ; ·) : θ ∈ Θ} satisfies SRC_T(𝒢) ≲ √(𝒞T). Then, for T ≥ Õ(𝒞/ε⁴), ViOL converges to an ε-approximate critical point of η(θ⋆, ·): ||∇_a η(θ⋆, a_t)||₂ ≤ ε for some t ≤ T.
Extendable to convergence to approximate local maxima.
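The following is a schematic sketch of the ViOL loop as described on this slide, not the authors' implementation. The callables bandit_reward, eta, grad_a_eta, ol_update, and argmax_action are hypothetical placeholders: any no-regret online learner and any (approximate) virtual-reward maximizer could be plugged in.

```python
import numpy as np

def viol(bandit_reward, eta, grad_a_eta, ol_update, argmax_action,
         theta0, T, delta=1e-4, rng=None):
    """Schematic ViOL loop (Virtual Ascent with Online model Learner).

    bandit_reward(a): plays action a and returns the real reward eta(theta*, a).
    eta(theta, a), grad_a_eta(theta, a): the learner's model class (placeholders).
    ol_update(theta, loss): one step of any no-regret online learner (placeholder).
    argmax_action(theta): approximately maximizes the virtual reward eta(theta, .) (placeholder).
    """
    rng = rng or np.random.default_rng()
    theta = theta0
    actions = []
    for t in range(T):
        # 1. Virtual ascent: best attempt at a high virtual reward.
        a_t = argmax_action(theta)
        actions.append(a_t)

        # 2. Query the environment: the reward at a_t and a two-point directional derivative.
        xi = rng.standard_normal(a_t.shape[0])
        r_t = bandit_reward(a_t)
        dir_deriv = (bandit_reward(a_t + delta * xi) - r_t) / delta   # ~ <grad_a eta(theta*, a_t), xi>

        # 3. Online-learning loss: match values and JL-projected gradients on played actions.
        def loss(th):
            value_err = (eta(th, a_t) - r_t) ** 2
            grad_err = (grad_a_eta(th, a_t) @ xi - dir_deriv) ** 2
            return value_err + grad_err

        theta = ol_update(theta, loss)
    return actions
```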
A First-Cut Extension to Model-based RL
Dynamics T_θ and policy π_β; η(θ, β) = total expected return of policy π_β on dynamics T_θ. Goal: find a local maximum of η(θ⋆, ·).
Challenge: how does learning the dynamics help estimate η(θ⋆, ·) and its gradient?
A result for stochastic policies: the value, gradient, and Hessian errors
  |η(θ, β) − η(θ⋆, β)|, ||∇_β η(θ, β) − ∇_β η(θ⋆, β)||, ||∇²_β η(θ, β) − ∇²_β η(θ⋆, β)||
are all controlled by the on-policy model error E_{(s,a) ∼ π_β, T_θ⋆}[ ||T_θ(s, a) − T_θ⋆(s, a)||² ].
With many assumptions: value functions are Lipschitz/smooth in states and policy parameters; log π_β is bounded in various ways.
Not vacuous: e.g., one-layer relu dynamics together with a linear policy can satisfy the assumptions.
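To illustrate how a learned dynamics model can stand in for the real environment when estimating the return η(θ, β) and its policy gradient, here is a small sketch using the score-function (REINFORCE) estimator on rollouts simulated from the learned model T_θ. The Gaussian linear policy, horizon, and reward signature are illustrative assumptions, not the construction analyzed in the talk.

```python
import numpy as np

def model_based_policy_gradient(T_theta, reward, beta, horizon=20, n_rollouts=256,
                                sigma=0.1, s0=None, rng=None):
    """Estimate eta(theta, beta) and grad_beta eta(theta, beta) by simulating rollouts
    under the *learned* dynamics T_theta with a Gaussian linear policy
    a ~ N(beta @ s, sigma^2 I), using the score-function (REINFORCE) estimator."""
    rng = rng or np.random.default_rng()
    returns, grads = [], []
    for _ in range(n_rollouts):
        s = np.zeros(beta.shape[1]) if s0 is None else s0.copy()
        G, score = 0.0, np.zeros_like(beta)
        for _ in range(horizon):
            mean = beta @ s
            a = mean + sigma * rng.standard_normal(mean.shape[0])
            score += np.outer(a - mean, s) / sigma**2    # grad_beta log pi_beta(a | s)
            G += reward(s, a)
            s = T_theta(s, a)                            # learned dynamics, not the real env
        returns.append(G)
        grads.append(G * score)
    return np.mean(returns), np.mean(grads, axis=0)
```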
Summary
- Global regret for nonlinear models is statistically intractable.
- ViOL converges to a local maximum with sample complexity that depends only on the model class complexity.
Open questions: bandits with stochastic rewards; faster convergence rates / smaller regret; more realistic RL results.
Thank you!