Continuous-time mean–variance portfolio selection: A reinforcement learning framework
Haoran Wang
CAI Data Science and Machine Learning, The Vanguard Group, Inc., Malvern, Pennsylvania
Xun Yu Zhou
Department of Industrial Engineering and Operations Research, and Data Science Institute, Columbia University, New York, New York
Correspondence
Xun Yu Zhou, Department of Industrial Engineering and Operations Research, and Data Science Institute, Columbia University, New York, NY 10027.
Email: xz2574@columbia.edu
Abstract
We approach the continuous-time mean–variance portfolio selection problem with reinforcement learning (RL). The problem is to achieve the best trade-off between exploration and exploitation, and is formulated as an entropy-regularized, relaxed stochastic control problem. We prove that the optimal feedback policy for this problem must be Gaussian, with time-decaying variance. We then prove a policy improvement theorem, based on which we devise an implementable RL algorithm. We find that our algorithm and its variant outperform both traditional and deep neural network-based algorithms in our simulation and empirical studies.
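To illustrate the qualitative result stated above, the sketch below samples from a Gaussian feedback policy whose mean is linear in the wealth gap and whose variance decays as time approaches the investment horizon. This is a minimal illustration of the stated structure only, not the paper's exact formulas; the parameter names (`lam` for the exploration temperature, `rho` for a Sharpe-ratio-like quantity, `sigma` for volatility, `w` for a target-wealth/Lagrange-multiplier term) and the specific functional forms are assumptions chosen for the example.

```python
import math
import random

def policy_params(t, x, T=1.0, lam=0.1, rho=0.5, sigma=0.2, w=1.0):
    """Illustrative sketch (assumed forms, not the paper's exact result):
    mean linear in the wealth gap (x - w); variance shrinks as t -> T,
    i.e., exploration decays toward the horizon."""
    mean = -(rho / sigma) * (x - w)
    var = lam / (2 * sigma ** 2) * math.exp(rho ** 2 * (T - t))
    return mean, var

def sample_allocation(t, x, **kw):
    """Draw one exploratory allocation from the Gaussian policy."""
    m, v = policy_params(t, x, **kw)
    return random.gauss(m, math.sqrt(v))
```

Because the variance term is largest at `t = 0` and smallest at `t = T`, the policy explores aggressively early on and concentrates around its mean near the horizon, which is the "time-decaying variance" behavior the abstract describes.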