Wei Lin @ PKU

00102892: Statistical Learning

Course Description

This is an introductory statistical machine learning course for graduate and upper-level undergraduate students in statistics, applied mathematics, computer science, and other fields that involve learning from data. The course covers fundamental principles of machine learning and major topics in supervised, unsupervised, and semi-supervised learning, including linear regression and classification, spline and kernel smoothing, model selection and regularization, additive models, tree-based methods, support vector machines, clustering, principal component analysis, nonnegative matrix factorization, and graphical models.

Syllabus

Final project

Lectures and Assignments

Week | Date | Topics | References | Assignments | Notes and Further Reading
1 | 9/19 | Basics of machine learning, Occam's razor and no free lunch theorems | ML Chap. 1 | ML 1.4 | –
1 | 9/21 | Linear regression and least squares | ESL Secs. 2.2, 3.2 | ESL 3.3, 3.4 | –
2 | 9/26 | Multivariate linear regression, subset selection, ridge regression | ESL Secs. 3.2.4, 3.3, 3.4.1 | ESL 3.5, 3.6, 3.11, 3.12; Homework 1 complete, due 10/10 | For seemingly unrelated regressions, see Zellner (1962); for mixed integer optimization, see Bertsimas et al. (2016).
3 | 10/3, 10/5 | National Day | – | – | –
4 | 10/10 | Lasso and its variants, model selection consistency of the Lasso | ESL Secs. 3.4.2, 3.4.3, 3.8.3, 3.8.5 | ESL 3.16, 3.28, 3.30 | For model selection consistency of the Lasso, see Zhao and Yu (2006) and Wainwright (2009); for the MCP, see Zhang (2010).
5 | 10/17 | More theory and algorithms for the Lasso | ESL Secs. 3.4.4, 3.8.6, 3.9 | ESL 3.23, 3.24 | For comparisons of conditions for the Lasso, see van de Geer and Bühlmann (2009); for ADMM, see Boyd et al. (2011).
5 | 10/19 | Group Lasso, regularized multivariate linear regression, linear and quadratic discriminant analysis | ESL Secs. 3.8.4, 3.7, 4.1–4.3 | ESL 4.2, 4.3; Homework 2 complete, due 10/24 | For nuclear-norm regularized multivariate linear regression, see Yuan et al. (2007); for sparse discriminant analysis, see Mai et al. (2012).
6 | 10/24 | Logistic regression, separating hyperplanes | ESL Secs. 4.4, 4.5 | ESL 4.5, 4.7 | –
7 | 10/31 | Regression splines | ESL Secs. 5.1–5.3 | ESL 5.4, 5.7 | For nonlinear interaction models, see Radchenko and James (2010); for the use of piecewise constant approximation in survival models, see Zeng and Lin (2007).
7 | 11/2 | Smoothing splines, multidimensional splines | ESL Secs. 5.4–5.7 | ESL 5.13; Homework 3 complete, due 11/7 | –
8 | 11/7 | Reproducing kernel Hilbert spaces, wavelets | ESL Secs. 5.8, 5.9 | ESL 5.15 | –
9 | 11/14 | Kernel smoothing, local polynomial regression | ESL Secs. 6.1–6.5 | ESL 6.2, 6.3, 6.5 | For generalized partially linear single-index models, see Carroll et al. (1997).
9 | 11/16 | Midterm 1; kernel density estimation | ESL Sec. 6.6.1 | – | The treatment of asymptotic properties of kernel density estimators is adapted from Tsybakov (2009), Sec. 1.2. Midterm 1: mean = 46, median = 44, Q1 = 33, Q3 = 58, high score = 89.
10 | 11/21 | Kernel density classification and naive Bayes, model assessment and selection | ESL Secs. 6.6.2, 6.6.3, 7.1–7.3 | ESL 6.8, 7.2; Lab 1 (due 12/5); Homework 4 complete, due 11/28 | –
11 | 11/28 | Estimation of generalization error, information criteria | ESL Secs. 7.4–7.9 | ESL 7.6, 7.7 | For the AIC–BIC dilemma, see Yang (2005) and van Erven et al. (2012).
11 | 11/30 | Cross-validation and the bootstrap, generalized additive models, classification and regression trees | ESL Secs. 7.10–7.12, 9.1, 9.2 | – | For a review of diversity indices, see Morris et al. (2014).
12 | 12/5 | Bump hunting, multivariate adaptive regression splines, hierarchical mixtures of experts, boosting | ESL Secs. 9.3–9.5, 10.1–10.4 | ESL 10.2, 10.5 | Schapire and Freund (2012) is a book-length treatment of boosting.
13 | 12/12 | More on boosting, boosting trees, gradient boosting | ESL Secs. 10.5, 10.6, 10.9–10.12 | ESL 10.8; Homework 5 complete, due 12/14 | –
13 | 12/14 | Support vector machines for classification and regression | ESL Secs. 12.1–12.3; ML Chap. 6 | ESL 12.1, 12.2 | For multiclass SVMs, see Lee et al. (2004).
14 | 12/19 | Clustering, principal component analysis | ESL Secs. 14.3, 14.5 | ESL 14.2, 14.7 | Consistency of K-means clustering was studied by Pollard (1981).
15 | 12/26 | Spectral clustering, nonnegative matrix factorization | ESL Secs. 14.5.3, 14.6 | ESL 14.21, 14.23; Homework 6 complete, due 1/2 | For consistency of spectral clustering and its application to community detection in social network models, see von Luxburg et al. (2008) and Rohe et al. (2011).
15 | 12/28 | Midterm 2 | – | – | Midterm 2: mean = 57, median = 56, Q1 = 45, Q3 = 67, high score = 93.
16 | 1/2 | Ensemble learning, random forests, Gaussian graphical models | ESL Chaps. 15–17 | – | For recent theoretical and methodological developments on random forests, see Biau and Scornet (2016).