
A STOCHASTIC TRUST-REGION FRAMEWORK FOR POLICY OPTIMIZATION

Mingming Zhao, Yongfeng Li, Zaiwen Wen   

  1. Beijing International Center for Mathematical Research, Peking University, Beijing 100871, China
  • Received: 2021-01-07 Online: 2022-11-15 Published: 2022-11-18
  • Contact: Zaiwen Wen, Email: wenzw@pku.edu.cn

Mingming Zhao, Yongfeng Li, Zaiwen Wen. A STOCHASTIC TRUST-REGION FRAMEWORK FOR POLICY OPTIMIZATION[J]. Journal of Computational Mathematics, 2022, 40(6): 1004-1030.

In this paper, we study several challenging theoretical and numerical issues in the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to the policy. The trust region subproblem is constructed with a surrogate function coherent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient method with a line search scheme to ensure that each step increases the model function and stays within the trust region. To overcome the bias in the function estimates caused by sampling in the stochastic setting, we add the empirical standard deviation of the total expected reward to the predicted increase in the ratio used to update the trust region radius and to decide whether the trial point is accepted. Moreover, for the Gaussian policy commonly used for continuous action spaces, the maximization with respect to the mean and the covariance is performed separately to control the entropy loss. Our theoretical analysis shows that the deterministic version of the proposed algorithm tends to generate a monotonic improvement of the total expected reward, and global convergence is guaranteed under moderate assumptions. Comparisons with state-of-the-art methods demonstrate the effectiveness and robustness of our method on robotic control and game playing tasks from OpenAI Gym.
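To make the construction concrete, the following display is a minimal sketch, written in generic trust-region notation rather than the paper's exact symbols, of the subproblem and sample-based ratio test described above. Here $L_{\theta_k}$ denotes the surrogate of the total expected reward around the current policy $\pi_{\theta_k}$, $D$ a general distance, $\delta_k$ the trust region radius, $\hat J$ a Monte Carlo estimate of the total expected reward, and $\hat\sigma_k$ its empirical standard deviation; all of these symbols are placeholders introduced for illustration.

% Sketch only: generic trust-region notation, not the paper's exact formulation.
\begin{align*}
\text{subproblem:}\quad & \max_{\theta}\; L_{\theta_k}(\theta)
  \quad\text{s.t.}\quad D\big(\pi_{\theta},\,\pi_{\theta_k}\big)\le \delta_k,\\
\text{ratio test:}\quad & \rho_k \;=\;
  \frac{\hat J(\theta_k^{+})-\hat J(\theta_k)}
       {L_{\theta_k}(\theta_k^{+})-L_{\theta_k}(\theta_k)+\hat\sigma_k},
\end{align*}

where $\theta_k^{+}$ is the trial point produced by the inner preconditioned stochastic gradient iteration with line search; the trial point is accepted and the radius $\delta_k$ enlarged when $\rho_k$ is sufficiently large, and $\delta_k$ is shrunk otherwise, mirroring the classical trust-region update.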

