Wei Lin @ PKU

00102892: Statistical Learning

Course Description

This is an introductory statistical machine learning course for graduate and upper-level undergraduate students in statistics, applied mathematics, computer science, and other fields that involve learning from data. The course covers fundamental principles of machine learning and major topics in supervised, unsupervised, and semi-supervised learning, including linear regression and classification, spline and kernel smoothing, model selection and regularization, additive models, tree-based methods, support vector machines, clustering, principal component analysis, nonnegative matrix factorization, and graphical models.

Syllabus

Final project

Lectures and Assignments

Week | Date | Topics | References | Assignments | Notes and Further Reading
1 | 9/19 | Basics of machine learning, Occam's razor and no free lunch theorems | ML Chap. 1 | ML 1.4 | –
1 | 9/21 | Linear regression and least squares | ESL Secs. 2.2, 3.2 | ESL 3.3, 3.4 | –
2 | 9/26 | Multivariate linear regression, subset selection, ridge regression | ESL Secs. 3.2.4, 3.3, 3.4.1 | ESL 3.5, 3.6, 3.11, 3.12; Homework 1 complete, due 10/10 | For seemingly unrelated regressions, see Zellner (1962); for mixed integer optimization, see Bertsimas et al. (2016).
3 | 10/3, 10/5 | National Day | – | – | –
4 | 10/10 | Lasso and its variants, model selection consistency of the Lasso | ESL Secs. 3.4.2, 3.4.3, 3.8.3, 3.8.5 | ESL 3.16, 3.28, 3.30 | For model selection consistency of the Lasso, see Zhao and Yu (2006) and Wainwright (2009); for the MCP, see Zhang (2010).
5 | 10/17 | More theory and algorithms for the Lasso | ESL Secs. 3.4.4, 3.8.6, 3.9 | ESL 3.23, 3.24 | For comparisons of conditions for the Lasso, see van de Geer and Bühlmann (2009); for ADMM, see Boyd et al. (2011).
5 | 10/19 | Group Lasso, regularized multivariate linear regression, linear and quadratic discriminant analysis | ESL Secs. 3.8.4, 3.7, 4.1–4.3 | ESL 4.2, 4.3; Homework 2 complete, due 10/24 | For nuclear-norm regularized multivariate linear regression, see Yuan et al. (2007); for sparse discriminant analysis, see Mai et al. (2012).
6 | 10/24 | Logistic regression, separating hyperplanes | ESL Secs. 4.4, 4.5 | ESL 4.5, 4.7 | –
7 | 10/31 | Regression splines | ESL Secs. 5.1–5.3 | ESL 5.4, 5.7 | For nonlinear interaction models, see Radchenko and James (2010); for the use of piecewise constant approximation in survival models, see Zeng and Lin (2007).
7 | 11/2 | Smoothing splines, multidimensional splines | ESL Secs. 5.4–5.7 | ESL 5.13; Homework 3 complete, due 11/7 | –
8 | 11/7 | Reproducing kernel Hilbert spaces, wavelets | ESL Secs. 5.8, 5.9 | ESL 5.15 | –
9 | 11/14 | Kernel smoothing, local polynomial regression | ESL Secs. 6.1–6.5 | ESL 6.2, 6.3, 6.5 | For generalized partially linear single-index models, see Carroll et al. (1997).
9 | 11/16 | Midterm 1; kernel density estimation | ESL Sec. 6.6.1 | – | The treatment of asymptotic properties of kernel density estimators is adapted from Tsybakov (2009), Sec. 1.2. Midterm 1: mean = 46, median = 44, Q1 = 33, Q3 = 58, high score = 89.
10 | 11/21 | Kernel density classification and naive Bayes, model assessment and selection | ESL Secs. 6.6.2, 6.6.3, 7.1–7.3 | ESL 6.8, 7.2; Lab 1 (due 12/5); Homework 4 complete, due 11/28 | –
11 | 11/28 | Estimation of generalization error, information criteria | ESL Secs. 7.4–7.9 | ESL 7.6, 7.7 | For the AIC–BIC dilemma, see Yang (2005) and van Erven et al. (2012).
11 | 11/30 | Cross-validation and the bootstrap, generalized additive models, classification and regression trees | ESL Secs. 7.10–7.12, 9.1, 9.2 | – | For a review of diversity indices, see Morris et al. (2014).
12 | 12/5 | Bump hunting, multivariate adaptive regression splines, hierarchical mixtures of experts, boosting | ESL Secs. 9.3–9.5, 10.1–10.4 | ESL 10.2, 10.5 | Schapire and Freund (2012) is a book-length treatment of boosting.
13 | 12/12 | More on boosting, boosting trees, gradient boosting | ESL Secs. 10.5, 10.6, 10.9–10.12 | ESL 10.8; Homework 5 complete, due 12/14 | –
13 | 12/14 | Support vector machines for classification and regression | ESL Secs. 12.1–12.3; ML Chap. 6 | ESL 12.1, 12.2 | For multiclass SVMs, see Lee et al. (2004).
14 | 12/19 | Clustering, principal component analysis | ESL Secs. 14.3, 14.5 | ESL 14.2, 14.7 | Consistency of K-means clustering was studied by Pollard (1981).
15 | 12/26 | Spectral clustering, nonnegative matrix factorization | ESL Secs. 14.5.3, 14.6 | ESL 14.21, 14.23; Homework 6 complete, due 1/2 | For consistency of spectral clustering and its application to community detection in social network models, see von Luxburg et al. (2008) and Rohe et al. (2011).
15 | 12/28 | Midterm 2 | – | – | Midterm 2: mean = 57, median = 56, Q1 = 45, Q3 = 67, high score = 93.
16 | 1/2 | Ensemble learning, random forests, Gaussian graphical models | ESL Chaps. 15–17 | – | For recent theoretical and methodological developments on random forests, see Biau and Scornet (2016).