Machine Learning Lab PhD Student Seminar Series (Session 9) — Off-Policy Evaluation: Problem and Methods
Speaker: Erdong Wang (PKU)
Venue: Main Conference Room, 1st Floor, Jingyuan Courtyard 6, Peking University & Tencent Meeting 498 2865 6467
Abstract: Off-Policy Evaluation (OPE) is one of the most important problems in offline reinforcement learning, with a wide range of applications in both theoretical research and industry. We give a brief review of some representative methods for the OPE problem. For finite-horizon MDPs, the earliest and most classic method is Importance Sampling (IS), which yields an unbiased and consistent estimator, but one with high variance. Variants of the IS estimator trade a small bias for lower variance, yet still perform poorly in practice. For infinite-horizon MDPs and more general settings, model-based methods and the idea of function approximation offer different views of the problem through the Bellman equation. While the model-based estimator has much lower variance than the IS estimator, it suffers from high bias instead; the Doubly Robust (DR) estimator combines the forms and the advantages of the two estimators, controlling both bias and variance well, as verified in experiments. From the traditional statistical point of view, it seems hard to obtain new results beyond these three classic methods, but new ideas and methods have been proposed in recent years. As a representative example, the paper "Off-policy Evaluation via the Regularized Lagrangian" turns the original statistical problem into an optimization problem, which yields a reliable confidence-interval algorithm for OPE.
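To make the estimators mentioned in the abstract concrete, here is a minimal sketch of the trajectory-wise IS estimator and a recursive per-step DR estimator on a toy discrete MDP. The function names and the tabular interface (policies as probability functions, trajectories as `(state, action, reward)` lists) are illustrative assumptions, not code from the talk:

```python
import numpy as np

def is_estimate(trajs, pi, mu, gamma=1.0):
    """Trajectory-wise importance sampling: unbiased, but the product of
    per-step ratios makes the variance grow quickly with the horizon.

    trajs: list of trajectories [(s, a, r), ...] collected under behavior mu
    pi, mu: (s, a) -> action probability under the target/behavior policy
    """
    values = []
    for traj in trajs:
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi(s, a) / mu(s, a)   # cumulative importance weight
            ret += (gamma ** t) * r      # discounted return
        values.append(rho * ret)
    return float(np.mean(values))

def dr_estimate(trajs, pi, mu, q_hat, actions, gamma=1.0):
    """Doubly robust estimator: uses an approximate Q-function q_hat as a
    control variate on top of step-wise importance sampling. With q_hat = 0
    it reduces to step-wise IS; with an accurate q_hat the variance shrinks.
    """
    values = []
    for traj in trajs:
        v = 0.0
        for s, a, r in reversed(traj):   # backward recursion over the horizon
            v_hat = sum(pi(s, b) * q_hat(s, b) for b in actions)
            rho = pi(s, a) / mu(s, a)
            v = v_hat + rho * (r + gamma * v - q_hat(s, a))
        values.append(v)
    return float(np.mean(values))
```

On a one-step bandit example (uniform behavior policy, deterministic target policy), both estimators recover the target value, and DR with the true Q-function has zero variance across trajectories, illustrating the bias/variance trade-off discussed above.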