Seminar overview
Spring Semester 2013
Date & Time | Speaker | Title | Location
Thu 21.02.2013, 16:15-17:15
Mark van de Wiel, VU University Medical Center, Amsterdam
Abstract
Next generation sequencing is quickly replacing microarrays as a technique to probe different molecular levels of the cell, such as DNA or mRNA. The technology provides higher resolution while reducing biases, in particular at the lower end of the spectrum. mRNA sequencing (RNAseq) data consist of counts of pieces of RNA called tags. This type of data imposes new challenges for statistical analysis. We present a novel approach to model and analyze these data.
Methods and software for differential expression analysis usually use a generalization of the Poisson or Binomial distribution that accounts for overdispersion. A popular choice is the negative binomial (i.e. Poisson-Gamma) model. However, there is no consensus on which model fits RNAseq data best, and this may depend on the technology used. With RNAseq, the number of features vastly exceeds the sample size. This implies that shrinkage of variance-related parameters may lead to more stable estimates and inference. Methods to do so are available, but only for a single parameter and in the context of restrictive study designs, e.g. two-group comparisons or fixed-effect designs.
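The Poisson-Gamma mixture mentioned above is easy to make concrete in a short simulation; the parameter names below are illustrative and not taken from any of the packages discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Negative binomial counts as a Poisson-Gamma mixture: the Poisson rate is
# itself Gamma-distributed, which inflates the variance above the mean
# (overdispersion). Here mu is the mean and "size" the dispersion parameter.
mu, size, n = 10.0, 5.0, 200_000
rates = rng.gamma(shape=size, scale=mu / size, size=n)  # E[rate] = mu
counts = rng.poisson(rates)

# A pure Poisson would satisfy variance == mean; here
# Var = mu + mu^2 / size = 30 > 10 -- the overdispersion RNAseq methods model.
print(counts.mean())   # close to 10
print(counts.var())    # close to 30
```

This is exactly the extra-Poisson variation that a single-parameter shrinkage method stabilizes when estimated per tag.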
We present a Bayesian framework that allows for a) various count models b) flexible designs c) random effects and d) multi-parameter shrinkage. The latter is implemented using Empirical Bayes principles by several procedures that estimate hyper-parameters of (mixture) priors or nonparametric priors. Moreover, the framework provides Bayesian multiplicity correction, thereby providing solid inference. In data-based simulations, we show that our method outperforms other popular methods (edgeR, DESeq, baySeq, NOISeq). Moreover, we illustrate our approach on three data sets. The first is a CAGE data set containing 25 samples representing five regions of the human brain from seven individuals. The design is incomplete and a batch effect is present. The data motivates use of the zero-inflated negative binomial as a powerful alternative to the negative binomial, because it leads to less bias of the overdispersion parameter and improved detection power for the low-count tags. The second is a large, standard two-sample RNAseq data set that we repeatedly split into a small data set and its large complement. Compared to other methods, our results from the small sample data sets validate much better on their large sample complements, illustrating the importance of the type of shrinkage.
The methodology and these results are available in Van de Wiel et al. (2012).
The framework is not restricted to RNAseq data nor to differential expression analysis. It is currently being extended towards analysis of proteomics, microRNAs, methylation, and high-throughput screening data. In addition, we currently study multivariate, graphical applications using Bayesian ridge regression. If time permits, some of these extensions will be discussed. The R software package, termed ShrinkBayes, is built upon INLA, which provides the machinery for computing marginal posteriors in a variety of models.
Co-authors:
Gwenael Leday (i), Luba Pardo (iii), Havard Rue (iv), Aad van der Vaart (ii), Wessel van Wieringen (i,ii)
Affiliations:
i. Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam
ii. Department of Mathematics, VU University, Amsterdam
iii. Department of Clinical Genetics, VU University Medical Center, Amsterdam
iv. Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
Reference:
Van de Wiel MA, Leday GGR, Pardo L, Rue H, Van der Vaart AW, Van Wieringen WN (2012). Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics, 14, 113-128.
ZüKoSt Zürcher Kolloquium über Statistik: Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors
HG G 19.2
Fri 22.02.2013, 15:15-16:15
Thanh Mai Pham Ngoc, Université de Paris Sud, Orsay
Abstract
In astrophysics, a burning issue is understanding the behaviour of the so-called Ultra High Energy Cosmic Rays (UHECR). These are cosmic rays with an extreme kinetic energy and the rarest particles in the universe. The source of these most energetic particles remains a mystery. Finding out more about the probability law of their incoming directions is crucial to gain insight into the mechanisms generating the UHECR.
Astrophysicists have at their disposal directional data, which are measurements of the incoming directions of the UHECR on Earth. Unfortunately, their trajectories are deflected by Galactic and intergalactic fields. A first way to model the deflection in the incoming directions is the following model with random rotations: Z_i = ε_i X_i, i = 1, ..., N. We define a nonparametric test procedure to distinguish H_0: "the density f of the X_i is the uniform density f_0 on the sphere" from the alternative H_1. We show that an adaptive procedure cannot have a faster rate of separation than ψ_ad(s) = (N / log log N)^(−2s/(2s+2ν+1)), and we provide a procedure which reaches this rate. We illustrate the theory by implementing our test procedure for various kinds of noise on SO(3) and by comparing it to other procedures. Applications to real data in astrophysics and paleomagnetism are provided.
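To make the testing problem concrete, here is a classical Rayleigh test of uniformity on the sphere — a much simpler procedure than the adaptive minimax test of the talk, included only as an illustrative sketch:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

def rayleigh_test(X):
    # X: (n, 3) array of unit vectors on S^2. Under H0 (uniformity) the
    # statistic 3 * n * ||mean direction||^2 is asymptotically chi2 with 3 df.
    n = X.shape[0]
    stat = 3.0 * n * float(np.sum(X.mean(axis=0) ** 2))
    return stat, chi2.sf(stat, df=3)

def normalize(V):
    return V / np.linalg.norm(V, axis=1, keepdims=True)

uniform = normalize(rng.standard_normal((500, 3)))          # H0 holds
clustered = normalize(rng.standard_normal((500, 3)) * 0.3
                      + np.array([0.0, 0.0, 3.0]))          # H1: one hot spot

print(rayleigh_test(uniform)[1])    # typically large: no evidence against H0
print(rayleigh_test(clustered)[1])  # essentially zero: uniformity rejected
```

The Rayleigh test only has power against mean-direction alternatives; the point of the adaptive procedure above is to detect far more general departures after deconvolving the random-rotation noise.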
Research Seminar in Statistics: Goodness of fit tests for noisy directional data
HG G 19.1
Thu 28.02.2013, 16:15-17:15
Steve Smith, University of Oxford
Abstract
For nearly 20 years researchers have been studying spontaneous correlations in the brain with resting-state Functional MRI. However, most work has simply characterised "spatial maps" of correlation, rather than attempting to precisely model brain networks. More recently there has been increased interest in attempting to define functional network nodes, estimate the direct connections between these (as opposed to potentially indirect correlations), and even estimate causality (directionality). Approaches such as Bayes Nets (graphical models), Granger causality and neurobiological Bayesian modelling have been proposed. Some of these approaches seem reasonable, while others seem doomed to failure. I will discuss the issues of network modelling specific to such data, discuss some of the various methods being developed for brain network modelling, and present some exciting new data coming out of the Human Connectome Project.
ZüKoSt Zürcher Kolloquium über Statistik: Brain network modelling from resting-state Functional MRI data: More than just correlations?
HG G 19.1
Fri 01.03.2013, 15:15-16:15
Sébastien Loustau, Université d'Angers, France
Abstract
We consider the problem of statistical learning when we observe a contaminated sample. More precisely, we state fast rates of convergence in classification with errors in variables for deconvolution empirical risk minimizers. These rates depend on the ill-posedness, the margin, and the complexity of the problem. The cornerstone of the proof is a bias-variance decomposition of the excess risk.
After a theoretical study of the problem, we turn to more practical considerations by presenting a new algorithm for noisy finite-dimensional clustering called noisy K-means.
Research Seminar in Statistics: Inverse Statistical Learning
HG G 19.1
Fri 08.03.2013, 15:15-16:15
Alexei Onatski, University of Cambridge
Abstract
In this paper, we obtain asymptotic approximations to the mean squared error of the least squares estimator of the common component in large approximate factor models with a possibly misspecified number of factors. The approximations are derived under both strong and weak factors asymptotics, assuming that the cross-sectional and temporal dimensions of the data are comparable. We develop consistent estimators of these approximations and propose to use them as new criteria for selection of the number of factors. We show that the estimators of the number of factors that minimize these criteria are asymptotically loss efficient in the sense of Shibata (1980), Li (1987), and Shao (1997).
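The least squares estimator of the common component under k factors is the rank-k truncated SVD of the data matrix, which makes the cost of misspecifying k easy to simulate. A sketch with made-up dimensions, not the paper's asymptotic analysis:

```python
import numpy as np

rng = np.random.default_rng(7)

# Factor model X = F Lam' + e; the least squares estimator of the common
# component under k factors is the rank-k truncated SVD of X.
T, N, k_true = 100, 100, 2
F = rng.standard_normal((T, k_true))
Lam = rng.standard_normal((N, k_true))
common = F @ Lam.T
X = common + 0.5 * rng.standard_normal((T, N))

def common_component(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Squared estimation error under an under-, correctly, and over-specified k.
mse = {k: np.mean((common_component(X, k) - common) ** 2) for k in (1, 2, 6)}
print(min(mse, key=mse.get))   # 2: the true number of factors minimizes the error
```

Underspecification omits a whole factor (large bias); overspecification fits extra noise directions (extra variance) — the trade-off the paper's selection criteria quantify.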
Research Seminar in Statistics: Asymptotic Analysis of the Squared Estimation Error in Misspecified Factor Models
HG G 19.1
Wed 27.03.2013, 15:15-16:15
Patrik Guggenberger, University of California, San Diego
Abstract
In the linear instrumental variables model we are interested in testing a hypothesis on the coefficient of an exogenous variable when one right hand side endogenous variable is present. Under the assumption of conditional homoskedasticity but no restriction on the reduced form coefficient vector, we derive the asymptotic size of the subset Lagrange multiplier (LM) test and provide the nonrandom size corrected (SC) critical value that ensures that the resulting SC subset LM test has correct asymptotic size. We introduce an easy-to-implement generalized moment selection plug-in SC (GMS-PSC) subset LM test that uses a data-dependent critical value. We compare the local power properties of the GMS-PSC subset LM and subset AR test and also provide a Monte Carlo study that compares the finite-sample properties of the two tests. The GMS-PSC is shown to have competitive power properties.
Research Seminar in Statistics: Subset inference in the linear IV model
HG G 19.1
Tue 02.04.2013, 15:15-16:15
Joris M. Mooij, Radboud University Nijmegen
Abstract
Causal feedback loops play important roles in many biological systems. In the absence of time series data, inferring the structure of cyclic causal systems can be extremely challenging. An example of such a biological system is a cellular signalling network that plays an important role in human immune system cells (Sachs et al., Science 2005), consisting of several interacting proteins and phospholipids. The protein concentration data measured by Sachs et al. using flow cytometry have been analyzed by different researchers in order to evaluate various causal inference methods. Most of these methods only consider acyclic causal structures, even though the data show strong evidence that feedback loops are present. In this talk I will propose a new method for cyclic causal discovery from a combination of observational and interventional equilibrium data. I will show that the method indeed finds evidence for feedback loops in the flow cytometry data and that it gives a more accurate quantitative description of the data at comparable model complexity.
Research Seminar in Statistics: Cyclic Causal Discovery from Equilibrium Data
HG G 19.2
Thu 11.04.2013, 16:15-17:15
Thomas Kneib, Universität Göttingen
Abstract
Usual exponential family regression models focus on only one designated quantity of the response distribution, namely the mean. While this entails easy interpretation of the estimated regression effects, it may often lead to incomplete analyses when more complex relationships are indeed present, and it also bears the risk of false conclusions about the significance / importance of covariates. We will therefore give an overview of extended types of regression models that allow us to go beyond mean regression. More specifically, we will study generalized additive models for location, scale and shape as well as semiparametric quantile and expectile regression.
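The move beyond the mean is visible in the loss function alone: replacing squared error by the asymmetric "pinball" loss shifts the minimizer from the mean to a quantile. A minimal illustration, not the semiparametric models of the talk:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(scale=1.0, size=50_000)   # skewed: mean 1.0, median log 2

def pinball_loss(q, y, tau):
    # Asymmetric absolute loss; its minimizer over q is the tau-quantile of y.
    u = y - q
    return np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

# Grid-minimize the pinball loss for tau = 0.5 (the median).
grid = np.linspace(0.0, 2.0, 1501)
q_hat = grid[np.argmin([pinball_loss(q, y, 0.5) for q in grid])]

print(abs(q_hat - np.median(y)) < 0.01)   # True: tau = 0.5 recovers the median
print(abs(q_hat - y.mean()) > 0.2)        # True: clearly different from the mean
```

For a skewed response the two summaries disagree markedly, which is exactly the information a mean-only regression throws away.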
ZüKoSt Zürcher Kolloquium über Statistik: Beyond Mean Regression
HG G 19.1
Fri 19.04.2013, 15:15-16:15
Alexander Sokol, University of Copenhagen
Abstract
We define a notion of interventions in a stochastic differential equation based on simple substitution in the SDE. We prove that this notion of intervention is the same as that obtained by making do()-interventions in the Euler scheme for the SDE and taking the limit. We show that when the driving semimartingale is a Lévy process and there are no latent variables, the post-intervention distribution is always identifiable from the observational distribution. We also relate our results to the literature on weak conditional local independence by Gégout-Petit and Commenges.
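The substitution notion of intervention can be mimicked directly in an Euler scheme: at every step, overwrite the intervened coordinate with its do()-value before updating. A toy two-dimensional Ornstein-Uhlenbeck example with made-up coefficients, not the paper's general semimartingale setting:

```python
import numpy as np

rng = np.random.default_rng(3)

# dX_t = A X_t dt + dW_t, with X^1 driving X^2 (A[1, 0] != 0). The
# do(X^1 = c) intervention substitutes the constant c for X^1 in the SDE,
# i.e. in every step of the Euler scheme.
A = np.array([[-1.0, 0.0],
              [ 2.0, -1.0]])
dt, n_steps, c = 0.01, 1_000, 1.0

def euler_endpoint(intervene):
    x = np.zeros(2)
    for _ in range(n_steps):
        if intervene:
            x[0] = c                 # do(X^1 = c): substitution before the update
        x = x + A @ x * dt + np.sqrt(dt) * rng.standard_normal(2)
    if intervene:
        x[0] = c
    return x

obs = np.mean([euler_endpoint(False)[1] for _ in range(300)])
post = np.mean([euler_endpoint(True)[1] for _ in range(300)])
print(obs)    # near 0: observational stationary mean of X^2
print(post)   # near 2: under do(X^1 = 1), X^2 relaxes to A[1, 0] * c = 2
```

Under the intervention the second coordinate follows dX² = (2c − X²)dt + dW², so its stationary mean moves from 0 to 2c — the post-intervention distribution is obtained without ever observing the intervened system.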
Research Seminar in Statistics: Stochastic differential equations as causal models
HG G 19.1
Fri 19.04.2013, 16:30-17:30
Johanna G. Neslehova and Christian Genest, McGill University, Montréal, Canada
Abstract
New statistics are proposed for testing the hypothesis that arbitrary random variables are mutually independent. These tests are consistent and well-behaved for any type of data, even for sparse contingency tables and tables whose dimension depends on the sample size. The statistics are Cramér-von Mises and Kolmogorov-Smirnov type functionals of the empirical checkerboard copula. The asymptotic behavior of the corresponding empirical process will be characterized and illustrated; it will also be shown how replicates from the limiting process can be generated using a multiplier bootstrap procedure. As will be seen through simulations, the new tests are considerably more powerful than those based on the Pearson chi-squared, likelihood ratio, and Zelterman statistics often used in this context.
Research Seminar in Statistics: Tests of independence for sparse contingency tables and beyond
HG G 19.1
Thu 25.04.2013, 16:15-17:15
David Ginsbourger, Universität Bern
Abstract
Gaussian field models have become commonplace in the design and analysis of costly experiments. Thanks to convenient properties of the associated conditional distributions (Gaussianity, interpolation in the case of deterministic responses, etc.), Gaussian field models not only allow predicting black-box responses for untried input configurations, but can also be used as a basis for evaluation strategies dedicated to optimization, inversion, uncertainty quantification, probability of failure estimation, and more.
After an introduction to Gaussian field modeling and some of its popular applications in adaptive design of deterministic numerical experiments, we will present two recent contributions. First, an extension of the Expected Improvement criterion dedicated to Monte-Carlo simulations with controlled precision will be presented, with application to an online resource allocation problem in safety engineering.
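The standard Expected Improvement criterion that the talk extends has a closed form under a Gaussian predictive distribution. A generic sketch for minimization, not the controlled-precision extension presented in the talk:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_min):
    # Expected Improvement for minimization when the model's prediction at a
    # candidate point is N(mu, sigma^2) and f_min is the best value observed
    # so far: EI = (f_min - mu) * Phi(z) + sigma * phi(z), z = (f_min - mu)/sigma.
    sigma = np.maximum(sigma, 1e-12)
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# A candidate predicted below the current best with high uncertainty beats a
# confident prediction sitting at the current best value.
print(expected_improvement(0.0, 1.0, 1.0) > expected_improvement(1.0, 0.1, 1.0))  # True
```

The criterion automatically balances exploitation (low mu) against exploration (high sigma), which is why it is a natural backbone for the adaptive designs discussed above.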
Second, we will focus on a high-dimensional application of Gaussian field modeling to an inversion problem in water sciences, where an original non-stationary covariance kernel relying on fast proxy simulations is used.
ZüKoSt Zürcher Kolloquium über Statistik: Gaussian field models for the adaptive design of costly experiments
HG G 19.1
Fri 03.05.2013, 15:15-16:15
Niels Richard Hansen, University of Copenhagen
Abstract
A main challenge in neuroscience is to model the dynamic activity of the brain and how it responds to external stimuli. We present models of neuronal network activity based on multichannel spike data. The models form a class of point process models with spike rates determined through linear filters of the spike histories. The filters are given in terms of filter functions that are estimated non-parametrically as elements in, e.g., a reproducing kernel Hilbert space. We discuss how the models can be used to infer network connectivity and to predict stimulus (intervention) effects. The methods used are available via the R package ppstat.
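The model class — a spike rate given by a linear filter of the spike history — can be sketched in discrete time. The exponentially decaying filter function below is a hypothetical choice for illustration; the talk estimates it nonparametrically, e.g. in an RKHS:

```python
import numpy as np

rng = np.random.default_rng(4)

# Discrete-time sketch: the spike rate is the baseline plus a linear filter
# of the spike history, lambda(t) = base + sum over past spikes t_i of h(t - t_i).
dt, T = 0.001, 30.0
n = int(T / dt)
lags = np.arange(dt, 0.1, dt)
h = 2.0 * np.exp(-lags / 0.02)   # fixed, self-exciting filter function
base = 5.0                       # baseline rate in spikes per second

spikes = np.zeros(n)
for t in range(1, n):
    past = spikes[max(0, t - len(h)):t][::-1]    # most recent spike first
    rate = base + float(past @ h[: len(past)])   # linear filter of the history
    spikes[t] = rng.random() < rate * dt         # Bernoulli approximation of the point process

# The filter mass is integral(h) ~ 0.04 < 1, so the process is stable; the
# history feedback lifts the mean rate slightly above the baseline.
print(spikes.sum() / T)   # empirical rate, a bit above 5 spikes per second
```

With several channels, cross-channel filter functions play the role of directed connectivity, which is what the estimated filters are used to infer.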
Research Seminar in Statistics: Non-parametric estimation of linear filters for point processes
HG G 19.1
Thu 16.05.2013, 16:15-17:15
Carolin Strobl, Universität Zürich
Abstract
The main aim of educational and psychological testing is to provide a means for objective and fair comparisons between the test takers. However, in practice a phenomenon called differential item functioning (DIF) can lead to an unfair advantage or disadvantage for certain groups of test takers.
A variety of statistical methods has been suggested for detecting DIF in the Rasch model, which is increasingly used in educational and psychological testing. However, most of these methods are designed for the comparison of pre-specified focal and reference groups, such as males and females, whereas the actual groups of advantaged or disadvantaged test takers may be formed by (complex interactions of) several covariates, as in the case of females up to a certain age.
In this talk a new method for DIF detection based on model-based recursive partitioning is presented that can detect groups of test takers exhibiting DIF in a data-driven way. The talk outlines the statistical methodology behind the new approach as well as its practical application by means of an illustrative example.
ZüKoSt Zürcher Kolloquium über Statistik: Detecting Differential Item Functioning in Psychological Tests
HG G 19.1
Thu 23.05.2013, 16:15-17:15
Yves Rozenholc, University Paris Descartes, Paris
Abstract
In the context of anti-angiogenic cancer treatments, a major issue is to follow the drug effect. Although parametric models have been developed to achieve this goal, they suffer from being tissue-related; moreover, if their pertinence is already questionable in healthy tissue, they are certainly wrong in tumors, where cell growth changes the nature of the tissue. To face these problems, nonparametric modeling of the blood flow exchanges was conceived as early as the 80's and came into use in the second half of the 90's with the availability of high-frequency imaging techniques. Unfortunately, to date, estimation in such nonparametric models is highly unstable due to a high level of ill-posedness.
After recalling the medical context which motivated our study and describing the associated models, I will present two new nonparametric estimators for Laplace deconvolution in the regression setting. The first estimator is derived from the statistical analysis of Volterra equations of the first kind, which are intimately linked to Laplace deconvolution. This pointwise estimate is shown to be adaptive in the sense that it achieves the optimal rate of convergence corresponding to the regularity of the unknown function even if this regularity is unknown. Because this estimator requires knowledge of the roots of a polynomial, it is hardly usable from a practical point of view. The second estimator relies on a decomposition of the functions of interest on the basis of the Laplace functions. This global estimator, tuned by model selection, satisfies an oracle inequality and is easily implementable. The theoretical study is completed by simulations which show the proper behavior of these two estimators.
Collaboration with Charles-A. Cuénod (MD-PhD), Felix Abramovich, Fabienne Comte and Marianna Pensky.
ZüKoSt Zürcher Kolloquium über Statistik: Laplace deconvolution in regression - Application to angiogenesis follow-up in cancer
HG G 19.1
Thu 20.06.2013, 15:15-16:15
Andrew B. Nobel, University of North Carolina at Chapel Hill
Abstract
The problem of finding large average submatrices of a real-valued matrix arises in the exploratory analysis of data from a variety of disciplines, ranging from genomics to social sciences. This talk details several new theoretical results concerning the asymptotic behavior of large average submatrices of an n x n Gaussian random matrix. The first result identifies the average and joint distribution of the (globally optimal) k x k submatrix having the largest average value. We then turn our attention to submatrices with dominant row and column sums, which arise as the local optima of a useful iterative search procedure for large average submatrices. Paralleling the result for global optima, the second result identifies the average and joint distribution of a typical locally optimal k x k submatrix. The last part of the talk considers the *number* of locally optimal k x k submatrices, L_n(k), beginning with the asymptotic behavior of its mean and variance for fixed k and increasing n. The final result is a Gaussian central limit theorem for L_n(k) that is based on a new variant of Stein's method for normal approximation.
Joint work with Shankar Bhamidi and Partha S. Dey
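The iterative search behind the locally optimal submatrices can be sketched in a few lines: alternate between taking the k rows with the largest sums over the current columns and the k columns with the largest sums over the current rows; fixed points are exactly submatrices with dominant row and column sums. A plain-numpy sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(5)

def local_search_submatrix(X, k, n_iter=50):
    # Alternating maximization: given columns, pick the k rows with the largest
    # sums restricted to those columns; then update the columns symmetrically.
    n = X.shape[0]
    cols = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        rows = np.argsort(X[:, cols].sum(axis=1))[-k:]
        cols = np.argsort(X[rows, :].sum(axis=0))[-k:]
    return rows, cols

n, k = 200, 5
X = rng.standard_normal((n, n))
rows, cols = local_search_submatrix(X, k)
avg = X[np.ix_(rows, cols)].mean()

print(avg)   # well above the overall matrix mean of about 0
```

Each update can only increase the submatrix sum, so the procedure converges to a local optimum; the talk's results characterize how many such optima a Gaussian matrix has and how large their averages are.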
Research Seminar in Statistics: Large Average Submatrices of a Gaussian Random Matrix: Landscapes and Local Optima
HG G 19.1
Tue 02.07.2013, 15:15-16:00
Rajen Shah, Statistical Laboratory, University of Cambridge, UK
Abstract
The “Big Data” era in which we are living has brought with it a combination of statistical and computational challenges that often must be met with approaches drawing on developments from both statistics and computer science. In this talk I will present a method for performing regression where the n by p design matrix may have both n and p in the millions, but where the design matrix is sparse, that is, most of its entries are zero; such sparsity is common in many large-scale applications such as text analysis.
In this setting, performing regression using the original data can be computationally infeasible. Instead, we first map the design matrix to an n by L matrix with L << p, using a modified version of a scheme known as b-bit min-wise hashing in computer science. From a statistical perspective, we study the performance of regression using this compressed data, and give finite sample bounds on the prediction error. Interestingly, despite the loss of information through the compression scheme, we will see that ordinary least squares or ridge regression applied to the reduced data can actually allow us to fit a model containing interactions in the original data.
This is joint (and ongoing) work with Nicolai Meinshausen.
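The compression step can be sketched as follows — an illustrative 1-bit variant of min-wise hashing, with a random permutation standing in for the hash function. The dimensions and the choice b = 1 are arbitrary; the paper's b-bit scheme and its regression guarantees are more refined:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sparse binary design: row i is represented by the set of its nonzero
# feature indices, as is natural for, e.g., text data.
n, p, L = 1_000, 50_000, 64
rows = [rng.choice(p, size=10, replace=False) for _ in range(n)]

# One output column per random permutation (a stand-in for a random hash):
# record the lowest bit of the smallest permuted index among the row's
# nonzero features -- a 1-bit min-wise hash of that row.
Z = np.empty((n, L))
for l in range(L):
    perm = rng.permutation(p)
    for i, nz in enumerate(rows):
        Z[i, l] = perm[nz].min() & 1

print(Z.shape)   # (1000, 64): OLS or ridge now runs on n x L instead of n x p
```

Rows sharing many nonzero features tend to agree on their min-hashes, so the compressed columns preserve similarity between observations — the property that lets regression on Z approximate regression on the original sparse matrix.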
Research Seminar in Statistics: Large-scale regression with sparse data
HG G 19.1