Continuous-time mean–variance portfolio selection: A reinforcement learning framework
Haoran Wang
CAI Data Science and Machine Learning, The Vanguard Group, Inc., Malvern, Pennsylvania
Xun Yu Zhou
Department of Industrial Engineering and Operations Research, and Data Science Institute, Columbia University, New York, New York
Correspondence
Xun Yu Zhou, Department of Industrial Engineering and Operations Research, and Data Science Institute, Columbia University, New York, NY 10027.
Email: xz2574@columbia.edu
Abstract
We approach the continuous-time mean–variance portfolio selection problem with reinforcement learning (RL). The problem is to achieve the best trade-off between exploration and exploitation, and is formulated as an entropy-regularized, relaxed stochastic control problem. We prove that the optimal feedback policy for this problem must be Gaussian, with time-decaying variance. We then prove a policy improvement theorem, based on which we devise an implementable RL algorithm. We find that our algorithm and its variant outperform both traditional and deep neural network-based algorithms in our simulation and empirical studies.
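To illustrate the qualitative result stated above, the sketch below samples from a Gaussian feedback policy whose mean is linear in the wealth gap and whose variance decays as time approaches the investment horizon. This is a minimal illustration of the stated structure only, not the paper's exact formulas; the parameter names (`lam` for the exploration temperature, `rho` for a Sharpe-ratio-like quantity, `sigma` for volatility, `w` for a target-wealth/Lagrange-multiplier term) and the specific functional forms are assumptions chosen for the example.

```python
import math
import random

def policy_params(t, x, T=1.0, lam=0.1, rho=0.5, sigma=0.2, w=1.0):
    """Illustrative sketch (assumed forms, not the paper's exact result):
    mean linear in the wealth gap (x - w); variance shrinks as t -> T,
    i.e., exploration decays toward the horizon."""
    mean = -(rho / sigma) * (x - w)
    var = lam / (2 * sigma ** 2) * math.exp(rho ** 2 * (T - t))
    return mean, var

def sample_allocation(t, x, **kw):
    """Draw one exploratory allocation from the Gaussian policy."""
    m, v = policy_params(t, x, **kw)
    return random.gauss(m, math.sqrt(v))
```

Because the variance term is largest at `t = 0` and smallest at `t = T`, the policy explores aggressively early on and concentrates around its mean near the horizon, which is the "time-decaying variance" behavior the abstract describes.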