Causal inference is a central goal of much scientific research and one of the most challenging topics in statistics. Statistical approaches to causal inference are used to remove spurious associations, to evaluate causal effects, and to discover causal relationships. Researchers from Peking University have made major contributions to causal inference in recent years.
Another important problem is causal inference with missing data. Missing data arise in many applied studies, and if the missingness mechanism is nonignorable, the model of interest is often not identifiable without further assumptions. In a recent work [Miao-Ding-Geng, JASA (2016)], Geng and his collaborators investigated the identifiability of causal effects and statistical approaches for nonignorable missing data under a number of important model setups, such as normal and normal mixture models. This is a major advance for the field because, unlike earlier works, identifiability is obtained without an instrumental variable, which is often infeasible to find in real applications.
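The bias that nonignorable missingness induces is easy to see in a small simulation. The sketch below (Python; the logistic missingness mechanism and all parameter values are illustrative assumptions, and it demonstrates the problem rather than the identification strategy of the cited work) draws a normal outcome, makes larger values less likely to be observed, and compares the naive complete-case mean with an oracle inverse-probability-weighted mean that uses the true mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
y = rng.normal(1.0, 1.0, size=n)        # outcome, true mean = 1.0

# Assumed nonignorable mechanism: the chance of being observed depends
# on the value of y itself, so missingness cannot be ignored.
p_obs = 1.0 / (1.0 + np.exp(y - 1.0))   # larger y -> less likely observed
observed = rng.uniform(size=n) < p_obs

naive = y[observed].mean()              # complete-case mean: biased downward
# Oracle correction: reweight observed values by 1/p_obs (self-normalized).
ipw = np.average(y[observed], weights=1.0 / p_obs[observed])
print(f"true 1.00, naive {naive:.2f}, oracle IPW {ipw:.2f}")
```

In practice the observation probabilities are unknown; the point of the cited work is to show when the full-data law remains identifiable in such settings without resorting to an instrumental variable.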
High-dimensional data are widespread in many applications, especially in genomic studies. Efforts to use high-dimensional genomic data to dissect the causal genetic mechanisms of complex traits, however, have not always been successful and are often compromised by confounding. Many factors, such as unmeasured variables, experimental conditions, and environmental perturbations, may lead to spurious associations or distortion of true effects. Instrumental variable models provide an ideal framework for joint analysis and control of confounding in genomic studies, but the high dimensionality of both covariates and instruments poses great challenges. Lin and his collaborators recently developed a class of two-stage regularization methods for identifying and estimating important covariate effects while selecting and estimating optimal instruments [Lin-Feng-Li, JASA (2015)]. The proposed methodology extends the classical two-stage least squares method to high dimensions by imposing sparsity-inducing penalties in both stages.
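The classical two-stage least squares baseline that this work generalizes can be sketched in a few lines. The toy example below (Python; the single-covariate, single-instrument setup and all variable names are illustrative assumptions, not the paper's high-dimensional estimator) shows ordinary least squares being biased by an unmeasured confounder, while 2SLS, which first regresses the covariate on the instrument and then regresses the outcome on the fitted values, recovers the true effect. The cited method replaces each least-squares fit with a sparsity-penalized fit so that both stages remain feasible when covariates and instruments are high dimensional.

```python
import numpy as np

def two_stage_ls(y, x, z):
    """Classical 2SLS with one endogenous covariate x and one instrument z."""
    Z = np.column_stack([np.ones_like(z), z])
    stage1, *_ = np.linalg.lstsq(Z, x, rcond=None)   # stage 1: regress x on z
    x_hat = Z @ stage1                               # confounder-free part of x
    X = np.column_stack([np.ones_like(x_hat), x_hat])
    stage2, *_ = np.linalg.lstsq(X, y, rcond=None)   # stage 2: regress y on fitted x
    return stage2[1]

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)                   # instrument: affects x but not y directly
u = rng.normal(size=n)                   # unmeasured confounder
x = 0.8 * z + u + 0.3 * rng.normal(size=n)
y = 2.0 * x + u + rng.normal(size=n)     # true causal effect is 2.0

X_ols = np.column_stack([np.ones_like(x), x])
beta_ols = np.linalg.lstsq(X_ols, y, rcond=None)[0][1]   # biased by u
beta_2sls = two_stage_ls(y, x, z)
print(f"OLS {beta_ols:.2f} (biased), 2SLS {beta_2sls:.2f} (true 2.00)")
```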
2. Experimental Design
(AI Mingyao)
Experimental design is a statistical procedure for planning experiments so that the collected data yield valid and objective conclusions efficiently. Statistical design procedures have broad applications in many scientific and applied areas. Researchers (mainly Ai's group) from Peking University have been very active in experimental design research. In recent years, they have made major contributions to optimal designs for interference models and to the theory of Latin hypercube sampling.
In many agricultural experiments, the treatment assigned to a particular plot may also affect neighboring plots. To adjust for the biases caused by these neighbor effects, the interference model is widely adopted, and identifying optimal designs for it is a fundamental problem for such experiments. In [Li-Zheng-Ai, AoS (2015)], Ai and his collaborators studied optimal circular designs for the proportional interference model, in which the neighbor effects of a treatment are proportional to its direct effect. Kiefer's equivalence theorems were established for both the directional and the undirectional model, and computer programs for finding optimal designs can easily be developed from these theorems. In [Zheng-Ai-Li, AoS (2017)], Ai and his collaborators studied optimal circular designs for the general interference model. Circular neighbor-balanced designs at distances 1 and 2 (CNBD2) are major designs for estimating direct treatment effects and can be viewed as two special classes of pseudo-symmetric designs. Ai and his collaborators showed that CNBD2 designs are highly efficient among all possible designs when the error terms are homoscedastic and uncorrelated, but not when the error terms are correlated. They further established equivalent conditions for any design, pseudo-symmetric or not, to be universally optimal for any experiment size and any covariance structure of the error terms.
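The data structure behind such models is simple to write down. The sketch below (Python; the treatment sequence, single-block circular layout, and separate left/right neighbor parameters are illustrative assumptions, not the proportional model of the cited papers) builds the design matrix in which each circular plot receives a direct effect from its own treatment and neighbor effects from the treatments to its left and right, with wrap-around at the ends.

```python
import numpy as np

def circular_interference_matrix(seq, t):
    """Design matrix for a one-block circular interference model.

    Columns 0..t-1 hold direct effects, t..2t-1 left-neighbor effects,
    and 2t..3t-1 right-neighbor effects; the block is circular, so the
    left neighbor of plot 0 is the last plot.
    """
    n = len(seq)
    X = np.zeros((n, 3 * t))
    for j, trt in enumerate(seq):
        X[j, trt] = 1.0                         # direct effect of own treatment
        X[j, t + seq[(j - 1) % n]] = 1.0        # left-neighbor effect
        X[j, 2 * t + seq[(j + 1) % n]] = 1.0    # right-neighbor effect
    return X

seq = [0, 1, 2, 0, 2, 1]   # a 6-plot circular sequence of 3 treatments
X = circular_interference_matrix(seq, 3)
print(X)
```

In the proportional interference model of [Li-Zheng-Ai, AoS (2015)], the neighbor effects are not free parameters as above but fixed proportions of the direct effects; choosing the sequence to maximize the information about the direct effects is the optimal design problem the equivalence theorems address.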
Orthogonal array based Latin hypercube sampling (LHS) is widely used for computer experiments. Because it stratifies multivariate margins in addition to ensuring univariate uniformity, the associated samples may provide better estimators of the overall mean of a complex function on a domain. In [Ai-Kong-Li, SS (2016)], Ai and his collaborators derived a unified expression for the variance of the sample mean under LHS based on an orthogonal array of strength t. They also established an approximate estimator of this variance, which is helpful for constructing confidence intervals for the overall mean, and extended these statistical properties to three further types of LHS: strong orthogonal array based LHS, nested orthogonal array based LHS, and correlation-controlled orthogonal array based LHS.
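Plain Latin hypercube sampling, the building block these orthogonal array based variants refine, can be sketched directly. The example below (Python; the integrand, sample size, and replication count are illustrative assumptions, and it implements ordinary LHS rather than the orthogonal array based schemes of the cited paper) places one point in each of n equal strata per coordinate, so every one-dimensional margin is covered evenly and the variance of the sample mean typically shrinks relative to i.i.d. uniform sampling.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """n points in [0,1)^d: each coordinate visits each of n strata exactly once."""
    u = rng.uniform(size=(n, d))                          # jitter within each stratum
    strata = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (strata + u) / n

f = lambda pts: (pts ** 2).sum(axis=1)   # integrand; true mean over [0,1]^3 is 1.0

rng = np.random.default_rng(2)
est_lhs = [f(latin_hypercube(50, 3, rng)).mean() for _ in range(200)]
est_iid = [f(rng.uniform(size=(50, 3))).mean() for _ in range(200)]
print(f"variance of estimate, iid: {np.var(est_iid):.2e}  LHS: {np.var(est_lhs):.2e}")
```

Orthogonal array based LHS additionally stratifies t-dimensional margins, which is precisely what the unified variance expression in the cited paper quantifies.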
3. Statistical Methods in Computational Biology and Bioinformatics
(DENG Minghua, XI Ruibin)
Recent breakthroughs in biological technologies allow biologists to accumulate large amounts of data in a short time. Statistical analysis of such large-scale biological data plays a critical role in much biological research. Recently, researchers from Peking University have developed a series of statistical tools for analyzing these data, and the tools have been widely used in biological studies. Their work focuses mainly on statistical methods for biological network analysis and genomic analysis.
Many biological problems are modelled as networks, and network analysis plays an important role in such studies. In recent years, Deng's group and Xi's group have published a series of papers on biological network analysis in top-tier journals. In gene co-expression network analysis, Deng and his colleagues developed a vector-based co-expression network construction method, VCNet [Wang-Fang-Tang-Deng, Bioinformatics (2017)]. A unique advantage of VCNet is that it can handle cases with fewer samples than exons, and it is significantly more powerful than existing methods. In metagenomic studies, one can only observe the relative abundances of different microbial taxa, while biologists are often interested in the correlation network among the taxa themselves. Such data are called compositional data, and directly applying the traditional Pearson correlation to them leads to spurious correlations. In [Fang-Huang-Zhao-Deng, Bioinformatics (2016)], Deng and his collaborators developed a lasso-based method, CCLasso, that directly estimates the correlation matrix of the latent absolute abundances; this method can have wide applications in metagenomic studies. In another work [Yuan-Xi-Chen-Deng, Biometrika (2017)], Deng, Xi, and their collaborators studied differential networks and developed a new loss function, the D-trace loss, for estimating them. Many real biological networks change under different genetic and environmental conditions, and investigating the differential network helps to gain insight into biological systems. In this work, Deng and Xi modelled the networks as Gaussian graphical models and the differential network as the difference of two precision matrices. The paper showed that, with the lasso penalty and under a number of regularity conditions, the D-trace loss yields consistent estimators even when the network size grows with the sample size. An efficient algorithm based on the alternating direction method of multipliers was also developed.
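The spurious-correlation problem that motivates CCLasso is easy to reproduce. In the sketch below (Python; the lognormal abundance model and sample size are illustrative assumptions, and it demonstrates only the problem, not CCLasso's estimator), three taxa have independent absolute abundances, yet after each sample is normalized to relative abundances the Pearson correlations turn strongly negative, purely because the proportions must sum to one.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
abs_abund = rng.lognormal(mean=0.0, sigma=0.5, size=(n, 3))   # independent taxa

# Compositional closure: only the proportions within each sample are observed.
rel_abund = abs_abund / abs_abund.sum(axis=1, keepdims=True)

r_abs = np.corrcoef(abs_abund, rowvar=False)   # off-diagonals near 0
r_rel = np.corrcoef(rel_abund, rowvar=False)   # off-diagonals strongly negative
print(f"absolute corr(1,2): {r_abs[0, 1]:+.2f}   relative corr(1,2): {r_rel[0, 1]:+.2f}")
```

For three exchangeable parts the closure constraint forces a pairwise correlation of about -0.5 even though the underlying taxa are independent; CCLasso instead targets the correlation matrix of the latent absolute abundances.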
Researchers at Peking University have also developed a series of statistical methods for genomic studies, published in top bioinformatics and computational biology journals. In [Wang-Zheng-Zhao-Deng, PG (2013)], Deng and his collaborators developed a new method for expression quantitative trait loci (eQTL) analysis. Unlike earlier works, which focused on the effect of a single genomic variant on gene expression, this work considered the synergistic effects of pairs of genomic variants and can therefore detect previously unexplored eQTL effects. The method is based on a bivariate model together with an efficient screening statistic that speeds up computation. In [Xi-Lee-Xia-Park, NAR (2016)] and [Xia-Liu-Deng-Xi, Bioinformatics (2016)], Xi, Deng, and their collaborators developed two new algorithms, BIC-seq2 and SVmine, for detecting copy number variations (CNVs) and structural variations (SVs) from high-throughput sequencing data. SVs and CNVs are widespread in normal as well as diseased genomes, and their accurate detection is a critical step for biological and biomedical research and for clinical applications. The two methods significantly outperform existing tools in sensitivity, specificity, detection resolution, and reproducibility.