PKU

Mathematical Introduction to Data Science (数据中的数学)
Fall 2012


Course Information

Synopsis (摘要)

This course is open to graduate and senior undergraduate students in applied mathematics and statistics who deal with data. It covers topics in high-dimensional statistics, manifold learning, diffusion geometry, random walks on graphs, concentration of measure, random matrix theory, geometric and topological methods, etc.
Prerequisites: linear algebra, basic probability and multivariate statistics, basic stochastic processes (Markov chains); familiarity with Matlab or R.

Lecture Notes

[pdf download]

Time and Place:

Monday 3:10-5:00pm;
Thursday (odd weeks) 1:00-2:50pm
 
Rm 316, Classroom Building No. 2 (二教 316)
Biweekly seminars (Fridays) will be announced separately.

Homework and Projects:

We plan weekly homework, monthly mini-projects, and a final major project. There is no final exam. Scribes will get bonus credit for their work!

Teaching Assistant (助教):

LV, Yuan (吕渊) Email: shirleybluely (add "AT gmail DOT com" afterwards)

Schedule (时间表)

Date Topic Instructor Scribe
09/10/2012, Mon Lecture 01: Introduction: Data Representation, Sample Mean, Variance, and PCA [lecture note 1.pdf]
09/11/2012, Tue Seminar: Recent Progress on Linear Programming and the Simplex Method
    Abstract: Linear programming (LP), together with the simplex method, has remained a core topic in Operations Research, Computer Science, and Mathematics since 1947. Thanks to relentless research effort, a linear program can be solved today one million times faster than thirty years ago. Businesses, large and small, now use LP models to control manufacturing inventories, price commodities, design civil/communication networks, and plan investments. LP has even become a popular subject in undergraduate, graduate, and MBA curricula, advancing human knowledge and promoting science education. The aim of the talk is to describe several recent exciting advances on LP and the simplex method, including counterexamples to the Hirsch conjecture, pivoting rules and their exponential behavior, strongly polynomial-time bounds for the simplex and policy-iteration methods for solving Markov decision processes (MDPs) and turn-based zero-sum games with any constant discount factor, the strongly polynomial-time complexity of the simplex method for solving deterministic MDPs regardless of discounts, etc.
    Bio: Yinyu Ye is currently a full Professor in Management Science and Engineering and the Institute for Computational and Mathematical Engineering, and the Director of the MS&E Industrial Affiliates Program, at Stanford University. He received the B.S. degree in System Engineering from the Huazhong University of Science and Technology, China, and the M.S. and Ph.D. degrees in Engineering-Economic Systems and Operations Research from Stanford University. His current research interests include Continuous and Discrete Optimization, Algorithm Design and Analysis, Computational Game/Market Equilibrium, Metric Distance Geometry, Dynamic Resource Allocation, and Stochastic and Robust Decision Making. He is an INFORMS (the Institute for Operations Research and the Management Sciences) Fellow, and has received several research awards, including the inaugural 2012 ISMP Tseng Lectureship Prize for outstanding contributions to continuous optimization, the 2009 John von Neumann Theory Prize for fundamental sustained contributions to theory in Operations Research and the Management Sciences, the inaugural 2006 Farkas Prize in optimization, and the 2009 IBM Faculty Award. He has supervised numerous doctoral students at Stanford who received the 2008 Nicholson Prize and the 2006 and 2010 INFORMS Optimization Prizes for Young Researchers.
09/13/2012, Thu Lecture 02: Stein's Phenomenon and Shrinkage [lecture note 2.pdf]
    [Reference]:
  • [Tsybakov09] Chapter 3.4 [pdf] describes Stein's phenomenon and derives the James-Stein estimator.
  • Charles Stein (1956) [Stein56] first defined inadmissibility and found estimators better than the sample mean, with uniformly smaller mean squared error, starting a new odyssey toward high-dimensional inference.
  • [EfronMorris74] [pdf] gives three data examples in which the James-Stein estimator is significantly better than the sample mean (a toy Monte Carlo comparison is sketched below).
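    [Matlab sketch]:
  • A minimal Monte Carlo comparison (my own illustration, not from the lecture notes) of the maximum likelihood estimator X with the positive-part James-Stein estimator for a single observation X ~ N(theta, I_p), p >= 3; the true mean theta here is a hypothetical choice:

      p = 10; ntrials = 10000;
      theta = ones(p,1);                   % hypothetical true mean
      mse_mle = 0; mse_js = 0;
      for t = 1:ntrials
          X = theta + randn(p,1);          % one draw from N(theta, I_p)
          c = max(0, 1 - (p-2)/(X'*X));    % positive-part James-Stein shrinkage
          mse_mle = mse_mle + norm(X - theta)^2;
          mse_js  = mse_js  + norm(c*X - theta)^2;
      end
      fprintf('MSE(MLE)=%.3f  MSE(JS)=%.3f\n', mse_mle/ntrials, mse_js/ntrials);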
    [Homework 1]:
  • Homework 1 [pdf]. Deadline: 09/24/2012, Monday. Two ways to submit:
  • Email your electronic version to the TA before the deadline; or
  • Hand in your paper version to the TA in class on Monday, 09/24/2012.
  • Mark your name and student ID at the top of your homework.
Y.Y.
09/17/2012, Mon Lecture 03: Random Matrix Theory and PCA [lecture note 3.pdf]
    [Reference]:
  • [Johnstone06] uses random matrix theory to show that in high-dimensional inference, when the ratio p/n is fixed, PCA is inconsistent.
  • [Nadakuditi10] gives a brief treatment of the phase transitions that appear in PCA when p/n is fixed (see the Matlab sketch below).
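    [Matlab sketch]:
  • A small simulation (my own, not from the notes) of the phase transition under the spiked covariance model Sigma = I + lambda*u*u' with gamma = p/n fixed: the top sample eigenvalue escapes the Marchenko-Pastur bulk edge (1+sqrt(gamma))^2 only when the spike strength lambda exceeds sqrt(gamma):

      n = 1000; p = 500; gamma = p/n;
      u = [1; zeros(p-1,1)];                         % spike direction
      for lambda = [0.3, sqrt(gamma), 2.0]           % below, at, above the threshold
          Sh = eye(p) + (sqrt(1+lambda)-1)*(u*u');   % square root of I + lambda*u*u'
          X = randn(n,p) * Sh;                       % rows ~ N(0, Sigma)
          S = (X'*X)/n; S = (S+S')/2;                % sample covariance, symmetrized
          fprintf('lambda=%.2f: top eig=%.3f, MP edge=%.3f\n', ...
              lambda, max(eig(S)), (1+sqrt(gamma))^2);
      end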
Y.Y. Tengyuan Liang;
Bowei Yan
09/21/2012, Fri Seminar: The Amazing Growth of the Real-time Advertising Ecosystem
  • Speaker: Dr. Xuehua Shen
  • Location: 1560 Science Building 1st (理科一号楼 1560)
  • Time: 2-3pm, Friday 9/21
    Abstract: We have been witnessing the amazing growth of real-time bidding in the USA over the past two years, and in China since late 2011. In this talk, I discuss three things about real-time bidding in display advertising: first, what real-time bidding (RTB) is and the important players in this ecosystem, such as DSPs and ad exchanges; second, the current status of RTB in the USA and China and its technical challenges; third, how we can grow the RTB ecosystem even bigger. In the third part, I particularly point out the differences between big-data and cloud-computing work in academia and in industry, and describe three challenging research problems in computational advertising.
    Bio: Xuehua Shen received a B.S. in Computer Science from Nanjing University, China, and a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign, USA. His Ph.D. thesis was on personalized search. After his Ph.D., he worked on Google search quality in Mountain View, CA, doing personalized search and a search-quality live experiment platform based on real user interactions. He then worked at BlueKai, the biggest data exchange and data management platform (DMP) in Silicon Valley, using the Hadoop cloud-computing platform for personalized ads and predictive modeling. Now, he is co-founder and CTO of iPinyou (www.ipinyou.com), the leader in real-time advertising and audience targeting in China.
09/24/2012, Mon Lecture 04: Random Matrix Theory and PCA (continued)
    [Reference]:
  • [Johnstone06] uses random matrix theory to show that in high-dimensional inference, when the ratio p/n is fixed, PCA is inconsistent.
  • [Nadakuditi10] gives a brief treatment of the phase transitions that appear in PCA when p/n is fixed.
    [Mini-Project 1]:
  • Mini-Project 1 [pdf]. Deadline: 10/8/2012, Monday.
  • Mark your name and student ID at the top of your report.
  • Submit your report with source code (as an appendix in the report or as a .zip file).
09/26/2012, Wed Seminar: Geometric Inference Using Distance-like Functions [slides]
  • Speaker: Prof. Frederic Chazal (INRIA Saclay -- Ile-de-France)
  • Location: 1st Science Building, Rm 1273
  • Time: 10:30-11:30am, Wednesday 9/26
    Abstract: Data often comes in the form of a point cloud sampled from an unknown compact subset of Euclidean space. The general goal of geometric inference is then to recover geometric and topological features of this subset from the approximating point cloud data. In recent years, it has become apparent that the study of distance functions makes it possible to address many of these questions successfully. However, one of the main limitations of this framework is that it does not cope well with outliers or with background noise. In this talk, we will show how to extend the framework of distance functions to overcome this problem. Replacing compact subsets by probability measures, we will introduce a notion of distance-to-measure functions. These functions share many properties with classical distance functions, which makes them suitable for inference purposes. In particular, by considering appropriate level sets of these distance functions, it is possible to associate topological and geometric features to a probability measure in a robust way.
    Bio: Frederic Chazal is a Senior Researcher in the GEOMETRICA team at INRIA Saclay, France. He obtained his Ph.D. in Mathematics from the University of Burgundy. His research interests are in computational geometry and topology, where he has made significant contributions. His work includes geometric inference for probability measures, sampling theory for compact sets in Euclidean spaces, the stability of persistence diagrams, etc.
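    [Matlab sketch]:
  • A toy empirical distance-to-measure, under my reading of the talk (the usual k-nearest-neighbor estimator with mass parameter m = k/N; the data, query points, and use of pdist2 from the Statistics Toolbox are my own assumptions). Unlike the distance to the nearest sample point, it is robust to a few outliers:

      X = [randn(200,2); 10*rand(10,2)];   % samples plus a few hypothetical outliers
      k = 20;                              % k = m*N nearest neighbors
      q = [0 0; 5 5];                      % query points
      D = pdist2(q, X);                    % distances from queries to all samples
      Ds = sort(D, 2);                     % sort each row ascending
      dtm = sqrt(mean(Ds(:,1:k).^2, 2));   % average squared distance to k nearest
      disp(dtm);                           % small near the data, large in the noise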
09/27/2012, Thu Lecture 5: Multidimensional Scaling
    [Reference]:
  • [Bavaud10]: a survey of MDS and the Schoenberg transformation in data analysis (a minimal classical MDS sketch follows).
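    [Matlab sketch]:
  • A minimal classical MDS sketch (my own illustration; the data are hypothetical): double-center the squared-distance matrix and embed with the top eigenvectors of the resulting Gram matrix:

      X0 = randn(20, 5);                        % hypothetical data in R^5
      D2 = squareform(pdist(X0)).^2;            % squared pairwise distances
      n = size(D2, 1);
      J = eye(n) - ones(n)/n;                   % double-centering matrix
      B = -0.5 * J * D2 * J;                    % Gram matrix of centered data
      [V, E] = eig((B + B')/2);
      [e, ord] = sort(diag(E), 'descend');      % largest eigenvalues first
      Y = V(:, ord(1:2)) * diag(sqrt(e(1:2)));  % 2-dimensional embedding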
10/08/2012, Mon Lecture 6: Random Projections and Almost Isometry: Johnson-Lindenstrauss Lemma
    [Reference]:
  • [Dasgupta99]: an elementary proof of the Johnson-Lindenstrauss Lemma (a quick empirical check is sketched after this list).
  • [Achlioptas01]: easy-to-operate random projections in database search.
  • [Indyk98]: random projections for approximate nearest neighbors.
  • [Jones11]: randomized approximate nearest neighbors.
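    [Matlab sketch]:
  • A quick empirical check of the Johnson-Lindenstrauss Lemma (my own toy example, using pdist from the Statistics Toolbox): a Gaussian random projection to roughly k = 4*log(n)/eps^2 dimensions nearly preserves all pairwise distances:

      n = 50; d = 1000; eps0 = 0.2;
      k = ceil(4*log(n)/eps0^2);         % target dimension from the JL bound
      X = randn(n, d);                   % n points in R^d
      R = randn(d, k)/sqrt(k);           % Gaussian random projection
      Y = X*R;                           % projected points in R^k
      ratio = pdist(Y)./pdist(X);        % pairwise distance distortions
      fprintf('distance ratios in [%.3f, %.3f]\n', min(ratio), max(ratio));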
Y.Y. Y.Y.
10/11/2012, Thu Lecture 7: Robust and Sparse PCA: SDP Extensions
Y.Y. Y.Y.
10/12/2012, Fri Seminar: Topological Landscape of Complex Networks
  • Speaker: Yuan Yao (PKU)
  • Location: 1114 Science Building 1st (理科一号楼 1114)
  • Time: 10-11am, Friday 10/12
    Abstract: Topological landscape is introduced for networks with functions defined on the nodes. By extending the notion of gradient flows to the network setting, critical nodes of different indices are defined. This leads to a concise and hierarchical representation of the network. Persistent homology from computational topology is used to design efficient algorithms for performing such analysis. Applications to some examples in social and biological networks are demonstrated, which show that critical nodes carry important information about structures and dynamics of such networks. This is joint work with Weinan E, Jianfeng Lu, et al.
10/15/2012, Monday Presentation of the first project
    [Homework 4]:
  • Homework 4 [pdf]. Deadline: 10/22/2012, Monday. (problems marked by * are optional)
10/16/2012, Tuesday Seminar: Learning Theory for Kernel and Metric Learning
  • Speaker: Prof. Yiming Ying (University of Exeter, UK)
  • Location: 1st Science Building, Rm 1560
  • Time: 3-4pm, Tuesday 10/16
    Abstract: The performance of many machine learning algorithms largely depends on the data representation via the choice of kernel function or distance metric. Hence, one central issue is the problem of learning a kernel and metric from data. In this talk, I will present our work on theoretical analysis of kernel learning methods, and also present our recent results on the analysis of metric and similarity learning.
    Bio: Dr. Yiming Ying received his B.S. degree in mathematics from Hangzhou University, Hangzhou, China, in 1997, and his Ph.D. degree in mathematics from Zhejiang University, Hangzhou, China, in 2002. Currently he is a Lecturer (Assistant Professor) in Computer Science in the School of Engineering, Computing and Mathematics at the University of Exeter, United Kingdom. His research interests include machine learning, learning theory, optimization, probabilistic graphical models, and applications to computer vision, bioinformatics, and multimedia data analysis.
10/18/2012, Thursday Seminar: Multi-component models for object detection
  • Speaker: Dr. Chunhui Gu (Google)
  • Location: 1st Science Building, Rm 1114
  • Time: 2-3pm, Thursday 10/18
    Abstract: In this talk, I will present a multi-component approach for object detection. Rather than attempting to represent an object category with a monolithic model, or pre-defining a reduced set of aspects, we form visual clusters from the data that are tight in appearance and configuration spaces. We train individual classifiers for each component, and then learn a second classifier that operates at the category level by aggregating responses from multiple components. In order to reduce computation cost during detection, we adopt the idea of object window selection, and our segmentation-based selection mechanism produces fewer than 500 windows per image while preserving high object recall. When compared to the leading methods on the challenging PASCAL VOC 2010 dataset, our multi-component approach obtains highly competitive results. Furthermore, unlike monolithic detection methods, our approach allows the transfer of finer-grained semantic information from the components, such as keypoint location and segmentation masks.
    Bio: Chunhui Gu's research focuses on computer vision and machine learning, specifically object detection and segmentation. He joined Google in January 2012 and works on applying computer vision techniques to various Google products. Before that, he received his Ph.D. in Electrical Engineering and Computer Sciences from UC Berkeley in 2012, and his bachelor's degree in Electrical Engineering from the California Institute of Technology in 2006.
10/22/2012, Monday Lecture 8: Random Projections and Compressed Sensing (Chapter 3.4)
    [Homework 5]:
  • Homework 5 [pdf]. Deadline: 10/29/2012, Monday. (problems marked by * are optional)
10/25/2012, Thursday Lecture 9: MDS with uncertainty: SDP embedding [Chapter 4.5-4.7]
    [Reference]:
  • [Ye06]: a semidefinite programming (SDP) approach to MDS with missing values (Sensor Network Localization).
  • [MVU]: another use of SDP in manifold learning, Maximum Variance Unfolding (MVU).
  • [Ye11]: Yinyu Ye's talk at the Fields Institute (2011) on universal rigidity and SDP, with some state-of-the-art open problems.
    [Matlab]:
  • SNLSDP: SDP solver for the SNL problem with up to 200 sensors
  • DISCO: SDP solver for the anchor-free SNL problem with a few thousand sensors
10/29/2012, Mon Lecture 10: Manifold Learning (Nonlinear Dimensionality Reduction): ISOMAP vs. LLE [my earlier slides]
    [Reference]:
  • [ISOMAP]: a Science (2000) paper on MDS with geodesic distances (graph shortest-path distances); a rough pipeline is sketched below;
  • [LLE]: a Science (2000) paper on Locally Linear Embedding, i.e., local PCA (on the complement) with global alignment.
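    [Matlab sketch]:
  • A rough ISOMAP pipeline on a Swiss-roll sample (my own condensation, assuming the kNN graph is connected; pdist/squareform are from the Statistics Toolbox): approximate geodesics by graph shortest paths, then run classical MDS on them:

      N = 500;
      t = 3*pi/2*(1 + 2*rand(N,1)); h = 10*rand(N,1);
      X = [t.*cos(t), h, t.*sin(t)];               % Swiss-roll point cloud in R^3
      k = 8;
      D = squareform(pdist(X));                    % Euclidean distances
      [~, idx] = sort(D, 2);
      W = inf(N); W(1:N+1:end) = 0;
      for i = 1:N                                  % kNN graph edge lengths
          W(i, idx(i,2:k+1)) = D(i, idx(i,2:k+1));
      end
      W = min(W, W');                              % symmetrize the graph
      for m = 1:N                                  % Floyd-Warshall shortest paths
          W = min(W, bsxfun(@plus, W(:,m), W(m,:)));
      end
      J = eye(N) - ones(N)/N;                      % classical MDS on geodesics
      B = -0.5 * J * (W.^2) * J;
      [V, E] = eigs((B + B')/2, 2);                % top 2 eigenpairs
      Y = V * sqrt(E);                             % 2-D embedding coordinates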
Sun, Jian (Tsinghua)
11/05/2012, Mon Lecture 11: Other Manifold Learning Techniques: Laplacian, Diffusion, Hessian, LTSA [slides]
    [Matlab]:
  • hlle.m : Hessian LLE (eigenmap) Matlab code.
  • mani.m : Todd Wittman's manifold learning demo (MDS, ISOMAP, LLE, Hessian LLE, Laplacian, Diffusion, LTSA)
    [Mini-Project 2]:
  • Mini-Project 2 [pdf]. Deadline: 11/19/2012, Monday.
  • Mark your name and student ID at the top of your report.
  • Submit your report with source code (as an appendix in the report or as a .zip file).
11/12/2012, Mon Lecture 12: Vector Laplacian and Diffusion Map [lecture notes]
    [Reference]:
  • [VDM]: Singer and Wu, Vector diffusion maps and the connection Laplacian, Comm. Pure Appl. Math. 65(8):1067-1144, 2012;
  • [ABT]: Aswani, Bickel and Tomlin, Regression on manifolds: Estimation of the exterior derivative, Ann. Stat. 39(1):48-81, 2011;
  • [MALLER]: Cheng and Wu, Local linear regression on manifolds and its geometric interpretation, arxiv.org/abs/1201.0327.
Yuwei Jiang; Chendi Huang
11/19/2012, Mon Lecture 13: Random Walk on Graphs: Perron-Frobenius Vector and PageRank
    [Reference]:
  • [PageRank]: a book on the mathematics of Google's PageRank; Perron-Frobenius theory is on pp. 168-174 (a power-iteration sketch follows this list).
  • [Meyer]: Chapter 8 Perron-Frobenius Theory for Nonnegative Matrices.
  • [BrinPage98] Have you ever read the original paper by Brin-Page on PageRank?
  • [Kleinberg99] Another important class of ranking of authorities and hubs, based on singular value decomposition of link matrix, by Jon Kleinberg.
  • [Chang08]: Chang-Pearson-Zhang generalizes this to nonnegative tensors. Can you develop it into an application as "PageRank" on hypergraphs?
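    [Matlab sketch]:
  • A minimal PageRank power iteration (the standard algorithm; the 4-node link matrix is my own toy example): PageRank is the Perron-Frobenius left eigenvector of the damped random-walk matrix:

      A = [0 1 1 0; 0 0 1 0; 1 0 0 1; 0 0 1 0];  % A(i,j)=1 if page i links to page j
      n = size(A, 1); alpha = 0.85;              % damping factor
      P = bsxfun(@rdivide, A, sum(A, 2));        % row-stochastic transition matrix
      G = alpha*P + (1-alpha)*ones(n)/n;         % add uniform teleportation
      pr = ones(1, n)/n;                         % start from the uniform distribution
      for t = 1:100
          pr = pr * G;                           % power iteration toward Perron vector
      end
      disp(pr);                                  % stationary distribution = PageRank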
Y.Y.
11/22/2012, Thu Lecture 14: Random Walk on Graphs: Fiedler Theory and Cheeger inequality
Y.Y.
11/26/2012, Mon Lecture 15: Random Walk on Graphs: Lumpability (metastability) and MNcut
    [Reference]:
  • [KemenySnell76]: Chapters 6.3 and 6.4 give the definitions of lumpability.
  • [MeilaShi01]: relationship between lumpability and multiple spectral clustering (MNcut).
  • [ShiMalik00]: spectral clustering and image segmentation (Ncut); a minimal sketch follows this list.
  • [Luxburg07]: a tutorial on spectral clustering.
  • [Hochbaum10]: shows that a variant of Ncut without the 1/2 volume constraint is, surprisingly, of polynomial complexity, although the original problem is NP-hard.
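    [Matlab sketch]:
  • A minimal normalized spectral clustering sketch in the spirit of [ShiMalik00] and [Luxburg07] (my own toy data; kmeans is from the Statistics Toolbox): embed with the top eigenvectors of the random-walk matrix D^{-1}W, i.e. the bottom eigenvectors of the random-walk Laplacian, then cluster:

      X = [randn(50,2); randn(50,2) + 5];         % two well-separated toy clusters
      W = exp(-squareform(pdist(X)).^2 / 2);      % Gaussian similarity matrix
      P = bsxfun(@rdivide, W, sum(W, 2));         % random-walk matrix D^{-1} W
      [V, E] = eig(P);
      [~, ord] = sort(real(diag(E)), 'descend');  % eigenvalues of P are real here
      U = real(V(:, ord(1:2)));                   % top 2 eigenvectors
      labels = kmeans(U, 2);                      % k-means in the spectral embedding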
Y.Y.
11/28/2012, Wed Seminar: Laplacian and Commute Time of Directed Graphs [lecture note.pdf]
Tianqi Wu (Tsinghua) Foling Zou
Liying Li
12/3/2012, Mon Lecture 16: Diffusion Map and Diffusion Distance
Y.Y.
12/6/2012, Thu Lecture 17: Commute Time Map and Distance
    [Reference]:
  • [QiuHancock07]: diffusion distance vs. commute time distance, in applications of spectral embedding and clustering.
  • [Fouss06-CommuteDistance]: shows applications of the average commute time distance, computed from the pseudoinverse of the graph Laplacian, in measuring stochastic similarity between nodes of large graphs (see the Matlab sketch below).
  • [RadLuxHei09]: shows, however, that in large geometric random graphs the commute distance converges to something meaningless as a similarity measure!
  • [GobelJagers74-CommuteDistance]: shows that the average commute time derived from mean first passage times is in fact a Euclidean distance metric.
  • [KleinRandic93]: shows that effective resistance is a distance, which is, up to a constant, equivalent to the average commute time distance.
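    [Matlab sketch]:
  • Commute time from the pseudoinverse of the graph Laplacian, as in [Fouss06-CommuteDistance] (the standard formula C(i,j) = vol(G)*(L+(i,i) + L+(j,j) - 2*L+(i,j)); the 4-node graph is my own toy example):

      W = [0 1 1 0; 1 0 1 0; 1 1 0 1; 0 0 1 0];     % toy adjacency matrix
      L = diag(sum(W,2)) - W;                       % graph Laplacian
      Lp = pinv(L);                                 % Moore-Penrose pseudoinverse
      vol = sum(W(:));                              % graph volume (sum of degrees)
      dLp = diag(Lp);
      C = vol * (bsxfun(@plus, dLp, dLp') - 2*Lp);  % commute time matrix
      disp(C);                                      % sqrt(C) is a Euclidean metric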
Y.Y.
12/10/2012, Mon Lecture 18: Introduction to Topic Models [slides]
    [Homework 6]:
  • Homework 6 [pdf]. Deadline: 12/17/2012, Monday. (problems marked by * are optional)
Xin Zhao (PKU)
12/17/2012, Mon Lecture 19: Smoothed Sparse Optimization and Parallel LASSO
    [Reference]:
  • As the slides contain some unpublished content, they are withheld from the public at the speaker's request. Please email me if you attended the class and would like the slides.
Wotao Yin (Rice)
12/20/2012, Thu Lecture 20: Project 2 presentation
12/24/2012, Mon Lecture 21: Final Project
    [Final Project]:
  • Final Project [pdf]. Deadline: 1/3/2013, Thursday. You may send me the report after the presentation on Jan 3rd, 2013.
1/3/2013, Thu Lecture 22: Final Project Presentation

Reference


by YAO, Yuan.