Peking University 2016 Symposium on Data Science, Probability, and Statistics

Schedule

Date: December 17, 2016

Venue: Room 1114, Science Building No. 1, Peking University

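| Time | Speaker | Content | Chair | Venue |
|---|---|---|---|---|
| 8:30-9:30 | | Registration | | Science Building No. 1 |
| 9:30-10:10 | | Opening ceremony and guest remarks | Xiangzhong Fang (房祥忠) | Room 1114, Science Building No. 1 |
| 10:10-10:30 | | Tea break and group photo | Ying Yang (杨瑛) | |
| 10:30-11:00 | Bin Yu (郁彬) | Theory to Gain Insight and Inform Practice | Minghua Deng (邓明华) | Room 1114, Science Building No. 1 |
| 11:00-11:30 | PengLiang Zhao (赵彭亮) | Considerations for Pediatric Trial Designs and Analyses | Minghua Deng (邓明华) | Room 1114, Science Building No. 1 |
| 11:30-12:00 | Quanshui Zhao (赵泉水) | Applications of Probability and Statistics in Finance | Minghua Deng (邓明华) | Room 1114, Science Building No. 1 |
| 12:00-14:00 | | | | |
| 14:00-14:30 | Jun Liu (刘军) | Evaluating parallelizable Markov chain Monte Carlo algorithms via waste-recycling | Zhi Geng (耿直) | Room 1114, Science Building No. 1 |
| 14:30-15:00 | Paul Yip | The Use of Geospatial Analysis for Suicide Prevention | Zhi Geng (耿直) | Room 1114, Science Building No. 1 |
| 15:00-15:30 | Jiayang Sun (孙嘉阳) | Subsampling for Feature Selection in Large Regression Data | Zhi Geng (耿直) | Room 1114, Science Building No. 1 |
| 15:30-16:00 | Jiading Chen (陈家鼎) | On the Selection of Regression Variables | Zhi Geng (耿直) | Room 1114, Science Building No. 1 |
| 16:00-16:10 | | Tea break | | Room 1556, Science Building No. 1 |
| 16:10-17:00 | | Discussion | Liping Liu (刘力平) | Room 1556, Science Building No. 1 |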

Abstracts

 

1.  December 17, 2016 (morning), 10:30-11:00. Venue: Room 1114, Science Building No. 1, Peking University

 

Theory to gain insight and inform practice

 

Prof. Bin Yu, statistics.berkeley.edu/~binyu

 

Abstract: Henry L. Rietz, the first president of IMS, published his book “Mathematical Statistics” in 1927. One review wrote in 1928: “Professor Rietz has developed this theory so skillfully that the ‘workers in other fields’, provided only that they have a passing familiarity with the grammar of mathematics, can secure a satisfactory understanding of the points involved.” In this lecture, I would like to promote the good tradition of mathematical statistics as expressed in Rietz’s book in order to gain insight and inform practice. In particular, I will recount the beginning of our theoretical study of dictionary learning (DL) as part of a multi-disciplinary project to “map a cell’s destiny” in the Drosophila embryo. I will share insights gained regarding local identifiability of primal and dual formulations of DL. Furthermore, comparing the two formulations is leading us down the path of seeking confidence measures of the learned dictionary elements (corresponding to biologically meaningful regions in the Drosophila embryo). Finally, I will present preliminary work using our confidence measures to identify potential knockout (or gene editing) experiments in an iterative interaction between biological and data sciences.

 

2.  December 17, 2016 (morning), 11:00-11:30. Venue: Room 1114, Science Building No. 1, Peking University

Considerations for Pediatric Trial Designs and Analyses

Dr. PengLiang Zhao, Sanofi Ltd.

 

Abstract: Pediatric trials are often conducted to obtain extended marketing exclusivity or to satisfy regulatory requirements. There are many challenges in designing and analyzing pediatric trials arising from special ethical issues and the relatively small accessible patient population. The application of conventional phase 3 trial designs to pediatrics is not realistic in some therapeutic areas. To address this issue, we propose various approaches for designing pediatric trials that incorporate data available from adult studies, and we also apply the concept of consistency used in multi-regional trials. The performance of these methods is assessed through simulations.

 

3.  December 17, 2016 (morning), 11:30-12:00. Venue: Room 1114, Science Building No. 1, Peking University

Applications of Probability and Statistics in Finance

Dr. Quanshui Zhao

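Abstract: This talk uses a large number of examples to illustrate how probability and statistics are applied in financial practice.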

 

4.  December 17, 2016 (afternoon), 14:00-14:30. Venue: Room 1114, Science Building No. 1, Peking University

 

Evaluating parallelisable Markov chain Monte Carlo algorithms via waste-recycling

Prof. Jun Liu, Harvard University

 

Abstract: Parallelisable Markov chain Monte Carlo algorithms generate multiple proposals and parallelise the evaluations of the likelihood functions on different cores at each iteration. Here we give simple-to-use criteria for evaluations and comparisons of general (parallelisable) waste-recycling Markov chain Monte Carlo algorithms. We give a formula for the effective sample size of multiple-proposal algorithms, which is easy to implement using moment estimators.

(Joint work with Espen Bernton, Yang Chen, Shihao Yang, and Neil Shephard)

 

5.  December 17, 2016 (afternoon), 14:30-15:00. Venue: Room 1114, Science Building No. 1, Peking University

The use of Geospatial analysis for suicide prevention

Paul Yip, The University of Hong Kong

Abstract:

 

6.  December 17, 2016 (afternoon), 15:00-15:30. Venue: Room 1114, Science Building No. 1, Peking University

Subsampling for Feature Selection in Large Regression Data

Prof. Jiayang Sun, Case Western Reserve University

Abstract: Feature selection from a large number of features in a regression analysis remains a challenge to data science. One popular approach to feature selection in large regression data with sparse features is to use a penalized likelihood or a shrinkage estimation, such as the LASSO, SCAD, elastic net, and MCP penalties. We present a different approach using a new subsampling method, called the Subsampling Winner Algorithm (SWA), for feature selection in large regression data. The central idea of our approach is analogous to that for the selection of National Merit Scholars. SWA uses a “base procedure” on each of the subsamples, computes the scores of all features according to their performance in each of the subsample analyses, then obtains the “semifinalists” by ranking the resulting scores, and finally determines the “finalists,” i.e., the important features, from the semifinalists. Due to its subsampling nature, SWA applies to data of any dimension in principle, including data that are too large for an existing software package to run a statistical procedure on the full data. We provide paneling plots for choosing the subsample size, compare our method with ElasticNet (a generalization of LASSO), SCAD, MCP, and RandomForest, and illustrate an application of SWA to genomic data on ovarian cancer. (Joint work with Y. Richard Fan)
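
The pipeline described in this abstract (base procedure on subsamples, feature scores, semifinalists, finalists) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the function name `subsampling_winner`, subsampling over feature columns, ordinary least squares as the base procedure, the mean absolute coefficient as the score, and the parameters `s`, `m`, `q`, and `threshold`.

```python
import numpy as np


def subsampling_winner(X, y, s, m, q, threshold, rng):
    """Illustrative subsample-score-rank-select loop (not the authors' SWA code).

    s: features per subsample, m: number of subsamples,
    q: number of semifinalists, threshold: cutoff used to pick finalists.
    """
    n, p = X.shape
    score_sum = np.zeros(p)
    counts = np.zeros(p)
    for _ in range(m):
        # Draw a random subsample of s distinct feature columns (assumed scheme).
        cols = rng.choice(p, size=s, replace=False)
        # Base procedure (assumed): ordinary least squares on the subsample.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        # Score each sampled feature by its absolute coefficient (assumed score).
        score_sum[cols] += np.abs(coef)
        counts[cols] += 1
    # Average score over the subsamples in which each feature appeared.
    scores = score_sum / np.maximum(counts, 1)
    # Semifinalists: the q features with the highest average score.
    semifinalists = np.sort(np.argsort(scores)[::-1][:q])
    # Finalists: semifinalists whose coefficient in a joint fit is large.
    coef, *_ = np.linalg.lstsq(X[:, semifinalists], y, rcond=None)
    finalists = semifinalists[np.abs(coef) > threshold]
    return semifinalists, finalists


# Synthetic data with three truly active features (indices 3, 7, 12).
rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[3, 7, 12]] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

semifinalists, finalists = subsampling_winner(
    X, y, s=5, m=200, q=6, threshold=0.5, rng=rng
)
print("semifinalists:", semifinalists.tolist())
print("finalists:", finalists.tolist())
```

Because each fit touches only `s` columns, the cost per subsample does not grow with the total number of features, which is the property the abstract attributes to the subsampling design.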

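7.  December 17, 2016 (afternoon), 15:30-16:00. Venue: Room 1114, Science Building No. 1, Peking University

On the Selection of Regression Variables

Prof. Jiading Chen (陈家鼎), Peking University

Abstract: In linear regression modeling, the selection of predictors is a widely studied problem with an extensive literature and considerable theoretical and practical significance. The consistency of the selected predictor subset is one of the key issues: a selection method is desirable if the subset it selects is consistent as the sample size tends to infinity and its mean squared prediction error is small. The BIC criterion can select a consistent subset of predictors, but its computational cost is too high when the number of predictors is large. The adaptive lasso is computationally efficient and can also find a consistent subset. This paper proposes a simpler selection method that requires only two ordinary linear regressions: a first regression on the full set of predictors gives coefficient estimates for the full set; these estimates are then used to select a subset; and one more ordinary linear regression on the selected subset gives the final regression result. Consider the regression model $Y_n = X_n\beta^* + \varepsilon^{(n)}$, where $J_O$ is the index set of the nonzero components of the coefficient vector $\beta^*$, $J_n$ is the index set of the predictor subset selected by the proposed method, and $\beta^{(n)}$ is the coefficient vector estimated by the proposed method (the coefficients of unselected predictors are zero). The paper proves that, under suitable conditions, [...], where $(\beta^{(n)}-\beta^*)_{J_O}$ denotes the vector formed by the components of $\beta^{(n)}-\beta^*$ whose indices lie in $J_O$, $\sigma^2$ is the error variance, and $\Sigma$ and $c$ are a matrix and a constant related to the limit of the matrix $X_n^T X_n/n$. Numerical simulations show that the proposed method has good small- and medium-sample properties.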
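
The two-regression procedure described in this abstract (full regression, subset selection from the estimated coefficients, regression on the subset) can be sketched in Python as follows. The abstract does not state the rule used to pick the subset, so the fixed cutoff `threshold` on the absolute coefficient estimates is a hypothetical stand-in, as are the function name `two_pass_ols` and the synthetic data.

```python
import numpy as np


def two_pass_ols(X, y, threshold):
    """Full OLS fit, select predictors from its coefficients, refit on them.

    The selection rule |estimate| > threshold is an assumption made for
    illustration; the abstract does not specify the rule.
    """
    # Pass 1: ordinary least squares on the full set of predictors.
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Selection step (hypothetical rule): keep predictors with large estimates.
    selected = np.flatnonzero(np.abs(beta_full) > threshold)
    # Pass 2: ordinary least squares on the selected predictors only;
    # unselected predictors keep a coefficient of exactly zero.
    beta = np.zeros(X.shape[1])
    if selected.size > 0:
        beta_sel, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        beta[selected] = beta_sel
    return selected, beta


# Synthetic data: predictors 0, 2, and 5 have nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 500, 8
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

selected, beta_hat = two_pass_ols(X, y, threshold=0.3)
print("selected predictors:", selected.tolist())
print("refitted coefficients:", np.round(beta_hat, 3).tolist())
```

The sketch makes the computational point of the abstract concrete: only two least-squares fits are needed, regardless of how many candidate subsets exist.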