Seminar overview


Spring Semester 2013

Date & Time Speaker Title Location
Thu 21.02.2013
16:15-17:15
Mark van de Wiel
VU University Medical Center, Amsterdam
Abstract
Next generation sequencing is quickly replacing microarrays as a technique to probe different molecular levels of the cell, such as DNA or mRNA. The technology has the advantage of providing higher resolution, while reducing biases, in particular at the lower end of the spectrum. mRNA sequencing (RNAseq) data consist of counts of pieces of RNA called tags. This type of data imposes new challenges for statistical analysis. We present a novel approach to model and analyze these data. Methods and software for differential expression analysis usually use a generalization of the Poisson or binomial distribution that accounts for overdispersion. A popular choice is the negative binomial (i.e. Poisson-Gamma) model. However, there is no consensus on which model fits RNAseq data best, and this may depend on the technology used. With RNAseq, the number of features vastly exceeds the sample size. This implies that shrinkage of variance-related parameters may lead to more stable estimates and inference. Methods to do so are available, but only for a single parameter and in the context of restrictive study designs, e.g. two-group comparisons or fixed-effect designs. We present a Bayesian framework that allows for a) various count models, b) flexible designs, c) random effects and d) multi-parameter shrinkage. The latter is implemented using empirical Bayes principles by several procedures that estimate hyper-parameters of (mixture) priors or nonparametric priors. Moreover, the framework provides Bayesian multiplicity correction, thereby providing solid inference. In data-based simulations, we show that our method outperforms other popular methods (edgeR, DESeq, baySeq, NOISeq). Moreover, we illustrate our approach on three data sets. The first is a CAGE data set containing 25 samples representing five regions of the human brain from seven individuals. The design is incomplete and a batch effect is present. The data motivate the use of the zero-inflated negative binomial as a powerful alternative to the negative binomial, because it leads to less bias in the overdispersion parameter and improved detection power for the low-count tags. The second is a large, standard two-sample RNAseq data set that we repeatedly split into a small data set and its large complement. Compared to other methods, our results from the small sample data sets validate much better on their large sample complements, illustrating the importance of the type of shrinkage. The methodology and these results are available in Van de Wiel et al. (2012). The framework is not restricted to RNAseq data nor to differential expression analysis. It is currently being extended towards analysis of proteomics, microRNAs, methylation, and high-throughput screening data. In addition, we currently study multivariate, graphical applications using Bayesian ridge regression. If time permits, some of these extensions will be discussed. The R software package, termed 'ShrinkBayes', is built upon INLA, which provides the machinery for computing marginal posteriors in a variety of models.
Co-authors: Gwenael Leday (i), Luba Pardo (iii), Havard Rue (iv), Aad van der Vaart (ii), Wessel van Wieringen (i,ii)
Affiliations: i. Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam; ii. Department of Mathematics, VU University, Amsterdam; iii. Department of Clinical Genetics, VU University Medical Center, Amsterdam; iv. Department of Mathematical Sciences, Norwegian University of Science and Technology, Trondheim, Norway
Reference: Van de Wiel MA, Leday GGR, Pardo L, Rue H, Van der Vaart AW, Van Wieringen WN (2012). Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics, 14, 113-128.
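For readers unfamiliar with the zero-inflated negative binomial mentioned in the abstract, a standard parametrisation (not necessarily the exact one used in ShrinkBayes) is
\[
P(Y = y) = \pi\,\mathbf{1}\{y = 0\} + (1-\pi)\,\mathrm{NB}(y;\mu,\phi),\qquad
\mathrm{NB}(y;\mu,\phi) = \frac{\Gamma(y+\phi^{-1})}{\Gamma(\phi^{-1})\,y!}\left(\frac{1}{1+\phi\mu}\right)^{\phi^{-1}}\left(\frac{\phi\mu}{1+\phi\mu}\right)^{y},
\]
where $\pi$ is the zero-inflation probability, $\mu$ the mean of the negative binomial component and $\phi$ the overdispersion parameter; it is parameters such as $\phi$ and $\pi$ that benefit from multi-parameter shrinkage across tags.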
ZüKoSt Zürcher Kolloquium über Statistik
Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors
HG G 19.2
Fri 22.02.2013
15:15-16:15
Thanh Mai Pham Ngoc
Université de Paris Sud Orsay
Abstract
In astrophysics, a burning issue is to understand the behaviour of the so-called Ultra High Energy Cosmic Rays (UHECR). The latter are cosmic rays with an extreme kinetic energy and the rarest particles in the universe. The source of these most energetic particles remains a mystery. Finding out more about the law of probability of their incoming directions is crucial to gain an insight into the mechanisms generating the UHECR. Astrophysicists have at their disposal directional data, which are measurements of the incoming directions of the UHECR on Earth. Unfortunately, their trajectories are deflected by Galactic and intergalactic magnetic fields. A first way to model the deflection in the incoming directions is the following model with random rotations: $Z_i = \varepsilon_i X_i$, $i = 1, \dots, N$. We define a nonparametric test procedure to distinguish $H_0$: "the density $f$ of $X_i$ is the uniform density $f_0$ on the sphere" from the alternative $H_1$. We show that an adaptive procedure cannot have a faster rate of separation than $\psi_{ad}(s) = (N/\log\log N)^{-2s/(2s+2\nu+1)}$ and we provide a procedure which reaches this rate. We illustrate the theory by implementing our test procedure for various kinds of noise on SO(3) and by comparing it to other procedures. Applications to real data in astrophysics and paleomagnetism are provided.
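A minimal simulation sketch of the random-rotation model; the small-angle noise law on SO(3) and all parameter values are illustrative only:

import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
N = 1000

# True incoming directions X_i on the unit sphere S^2; uniform under the null H0.
X = rng.normal(size=(N, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Noise eps_i in SO(3): small random rotations concentrated around the identity.
sigma = 0.2
eps = Rotation.from_rotvec(sigma * rng.normal(size=(N, 3)))

# Observed noisy directions Z_i = eps_i X_i.
Z = eps.apply(X)

The test only sees the Z_i and must detect departures of the law of the X_i from uniformity after the smoothing induced by the random rotations.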
Research Seminar in Statistics
Goodness of fit tests for noisy directional data
HG G 19.1
Thu 28.02.2013
16:15-17:15
Steve Smith
University of Oxford
Abstract
For nearly 20 years researchers have been studying spontaneous correlations in the brain with resting-state Functional MRI. However, most work has simply characterised "spatial maps" of correlation, rather than attempting to precisely model brain networks. More recently there has been increased interest in attempting to define functional network nodes, estimate what are the direct connections between these (as opposed to potentially indirect correlations), and even to estimate causality (directionality). Approaches such as Bayes Nets (graphical models), Granger causality and neurobiological Bayesian modelling have been proposed. Some of these approaches seem reasonable, while others seem doomed to failure. I will discuss the issues of network modelling specific to such data, discuss some of the various methods that are being developed for brain network modelling, and present some exciting new data, coming out of the Human Connectome Project.
ZüKoSt Zürcher Kolloquium über Statistik
Brain network modelling from resting-state Functional MRI data: More than just correlations?
HG G 19.1
Fri 01.03.2013
15:15-16:15
Sébastien Loustau
Université d'Angers, France
Abstract
We propose to consider the problem of statistical learning when we observe a contaminated sample. More precisely, we state fast rates of convergence in classification with errors in variables for deconvolution empirical risk minimizers. These rates depend on the ill-posedness, the margin and the complexity of the problem. The cornerstone of the proof is a bias-variance decomposition of the excess risk. After a theoretical study of the problem, we turn to more practical considerations by presenting a new algorithm for noisy finite-dimensional clustering called noisy K-means.
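As background, the classical deconvolution kernel on which such empirical risk minimizers are typically built, when one observes $Z_i = X_i + \epsilon_i$ with known noise density $\eta$ (a textbook construction, stated here only as context), is
\[
\tilde K_h(x) = \frac{1}{2\pi}\int e^{-\mathrm{i}tx}\,\frac{\mathcal{F}[K](t)}{\mathcal{F}[\eta](t/h)}\,dt,
\]
where $K$ is an ordinary kernel, $h$ a bandwidth and $\mathcal{F}$ the Fourier transform; the ill-posedness entering the rates is governed by the decay of $\mathcal{F}[\eta]$.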
Research Seminar in Statistics
Inverse Statistical Learning
HG G 19.1
Fri 08.03.2013
15:15-16:15
Alexei Onatski
University of Cambridge
Abstract
In this paper, we obtain asymptotic approximations to the mean squared error of the least squares estimator of the common component in large approximate factor models with possibly misspecified number of factors. The approximations are derived under both strong and weak factors asymptotics assuming that the cross-sectional and temporal dimensions of the data are comparable. We develop consistent estimators of these approximations and propose to use them as new criteria for selection of the number of factors. We show that the estimators of the number of factors that minimize these criteria are asymptotically loss efficient in the sense of Shibata (1980), Li (1987), and Shao (1997).
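For concreteness, the model and loss in question can be written in generic factor-model notation (the paper's exact assumptions are not reproduced here):
\[
X_{it} = \lambda_i' F_t + e_{it},\qquad i = 1,\dots,n,\; t = 1,\dots,T,
\]
with common component $\chi_{it} = \lambda_i' F_t$, its least squares (principal components) estimate $\hat\chi_{it}(k)$ based on $k$ factors, and mean squared error loss
\[
L_{nT}(k) = \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T}\bigl(\hat\chi_{it}(k) - \chi_{it}\bigr)^2 .
\]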
Research Seminar in Statistics
Asymptotic Analysis of the Squared Estimation Error in Misspecified Factor Models
HG G 19.1
Wed 27.03.2013
15:15-16:15
Patrik Guggenberger
University of California, San Diego
Abstract
In the linear instrumental variables model we are interested in testing a hypothesis on the coefficient of an exogenous variable when one right hand side endogenous variable is present. Under the assumption of conditional homoskedasticity but no restriction on the reduced form coefficient vector, we derive the asymptotic size of the subset Lagrange multiplier (LM) test and provide the nonrandom size corrected (SC) critical value that ensures that the resulting SC subset LM test has correct asymptotic size. We introduce an easy-to-implement generalized moment selection plug-in SC (GMS-PSC) subset LM test that uses a data-dependent critical value. We compare the local power properties of the GMS-PSC subset LM and subset AR test and also provide a Monte Carlo study that compares the finite-sample properties of the two tests. The GMS-PSC is shown to have competitive power properties.
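A minimal statement of the setting, in generic notation assumed here for illustration:
\[
y = Y\theta + x\beta + u,\qquad Y = Z\pi + v,
\]
where $Y$ is the single right-hand-side endogenous variable, $x$ the exogenous variable whose coefficient is of interest, $Z$ the instruments, and the subset hypothesis $H_0:\beta = \beta_0$ treats $\theta$ as a nuisance parameter estimated under the null.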
Research Seminar in Statistics
Subset inference in the linear IV model
HG G 19.1
Tue 02.04.2013
15:15-16:15
Joris M. Mooij
Radboud University Nijmegen
Abstract
Causal feedback loops play important roles in many biological systems. In the absence of time series data, inferring the structure of cyclic causal systems can be extremely challenging. An example of such a biological system is a cellular signalling network that plays an important role in human immune system cells (Sachs et al., Science 2005), consisting of several interacting proteins and phospholipids. The protein concentration data measured by Sachs et al. utilizing flow cytometry has been analyzed by different researchers in order to evaluate various causal inference methods. Most of these methods only consider acyclic causal structures, even though the data shows strong evidence that feedback loops are present. In this talk I will propose a new method for cyclic causal discovery from a combination of observational and interventional equilibrium data. I will show that the method indeed finds evidence for feedback loops in the flow cytometry data and that it gives a more accurate quantitative description of the data at comparable model complexity.
Research Seminar in Statistics
Cyclic Causal Discovery from Equilibrium Data
HG G 19.2
Thu 11.04.2013
16:15-17:15
Thomas Kneib
Universität Göttingen
Abstract
Usual exponential family regression models focus on only one designated quantity of the response distribution, namely the mean. While this entails easy interpretation of the estimated regression effects, it may often lead to incomplete analyses when more complex relationships are indeed present and also bears the risk of false conclusions about the significance / importance of covariates. We will therefore give an overview of extended types of regression models that allow us to go beyond mean regression. More specifically, we will study generalized additive models for location, scale and shape as well as semiparametric quantile and expectile regression.
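To make the distinction concrete, the asymmetric losses underlying quantile and expectile regression (standard definitions, with residual $u = y - \eta(x)$ for a semiparametric predictor $\eta$) are
\[
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)\quad\text{(quantile check loss)},\qquad
w_\tau(u) = \bigl|\tau - \mathbf{1}\{u < 0\}\bigr|\,u^2\quad\text{(expectile loss)},
\]
where $\tau = 0.5$ recovers the median and the mean, respectively.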
ZüKoSt Zürcher Kolloquium über Statistik
Beyond Mean Regression
HG G 19.1
Fri 19.04.2013
15:15-16:15
Alexander Sokol
University of Copenhagen
Abstract
We define a notion of interventions in a stochastic differential equation based on simple substitution in the SDE. We prove that this notion of intervention is the same as can be obtained by making do()-interventions in the Euler scheme for the SDE and taking the limit. We show that when the driving semimartingale is a Lévy process and there are no latent variables, the post-intervention distribution is always identifiable from the observational distribution. We also relate our results to the literature on weak conditional local independence by Gégout-Petit and Commenges.
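As a one-line illustration of the substitution idea, consider the Euler scheme for $dX_t = a(X_t)\,dt + \sigma(X_t)\,dZ_t$ (generic notation assumed here):
\[
X^i_{t+\Delta} = X^i_t + a_i(X_t)\,\Delta + \sigma_i(X_t)\,(Z_{t+\Delta} - Z_t);
\]
the do()-intervention on coordinate $m$ replaces the $m$-th update by the constant intervention value, and the SDE-level substitution studied in the talk is obtained as the limit of such intervened Euler schemes.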
Research Seminar in Statistics
Stochastic differential equations as causal models
HG G 19.1
Fri 19.04.2013
16:30-17:30
Johanna G. Neslehova and Christian Genest
McGill University, Montréal, Canada
Abstract
New statistics are proposed for testing the hypothesis that arbitrary random variables are mutually independent. These tests are consistent and well-behaved for any type of data, even for sparse contingency tables and tables whose dimension depends on the sample size. The statistics are Cramér-von Mises and Kolmogorov-Smirnov type functionals of the empirical checkerboard copula. The asymptotic behavior of the corresponding empirical process will be characterized and illustrated; it will also be shown how replicates from the limiting process can be generated using a multiplier bootstrap procedure. As will be seen through simulations, the new tests are considerably more powerful than those based on the Pearson chi-squared, likelihood ratio, and Zelterman statistics often used in this context.
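For orientation, the generic form of such functionals: with $\hat C_n$ the empirical checkerboard copula and $\Pi(u) = \prod_{j} u_j$ the independence copula, Cramér-von Mises and Kolmogorov-Smirnov type statistics read (normalisations may differ from those used in the talk)
\[
S_n = n\int_{[0,1]^d}\bigl(\hat C_n(u) - \Pi(u)\bigr)^2\,du,\qquad
T_n = \sqrt{n}\,\sup_{u\in[0,1]^d}\bigl|\hat C_n(u) - \Pi(u)\bigr| .
\]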
Research Seminar in Statistics
Tests of independence for sparse contingency tables and beyond
HG G 19.1
Thu 25.04.2013
16:15-17:15
David Ginsbourger
Universität Bern
Abstract
Gaussian field models have become commonplace in the design and analysis of costly experiments. Thanks to convenient properties of associated conditional distributions (Gaussianity, interpolation in the case of deterministic responses, etc.), Gaussian field models not only allow predicting black-box responses for untried input configurations, but can also be used as a basis for evaluation strategies dedicated to optimization, inversion, uncertainty quantification, probability of failure estimation, and more. After an introduction to Gaussian field modeling and some of its popular applications in adaptive design of deterministic numerical experiments, we will present two recent contributions. First, an extension of the Expected Improvement criterion dedicated to Monte-Carlo simulations with controlled precision will be presented, with application to an online resource allocation problem in safety engineering. Second, we will focus on a high-dimensional application of Gaussian field modeling to an inversion problem in water sciences, where an original non-stationary covariance kernel relying on fast proxy simulations is used.
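For reference, the classical Expected Improvement criterion that the first contribution extends (standard formula for deterministic responses) is
\[
\mathrm{EI}(x) = \mathbb{E}\bigl[\max\bigl(f_{\min} - Y(x),\,0\bigr)\mid \text{observations}\bigr]
= \bigl(f_{\min} - m(x)\bigr)\,\Phi\!\left(\frac{f_{\min} - m(x)}{s(x)}\right) + s(x)\,\varphi\!\left(\frac{f_{\min} - m(x)}{s(x)}\right),
\]
where $m(x)$ and $s(x)$ are the conditional (kriging) mean and standard deviation, $f_{\min}$ the best observed value, and $\Phi$, $\varphi$ the standard normal cdf and pdf.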
ZüKoSt Zürcher Kolloquium über Statistik
Gaussian field models for the adaptive design of costly experiments
HG G 19.1
Fri 03.05.2013
15:15-16:15
Niels Richard Hansen
University of Copenhagen
Abstract
A main challenge in neuroscience is to model the dynamic activity of the brain and how it responds to external stimuli. We present models of neuron network activity based on multichannel spike data. The models form a class of point process models with spike rates determined through linear filters of the spike histories. The filters are given in terms of filter functions that are estimated non-parametrically as elements in e.g. a reproducing kernel Hilbert space. We discuss how the models can be used to infer network connectivity and to predict stimulus (intervention) effects. The methods used are available via the R package ppstat.
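A generic form of such a model for channel $i$ (notation assumed here; the specification in the talk may differ in detail) is
\[
\lambda_i(t) = \phi\Bigl(\beta_{i0} + \sum_{j}\int_0^{t-} h_{ij}(t-s)\,dN_j(s)\Bigr),
\]
where $N_j$ is the spike counting process of channel $j$, $\phi$ is a link-type function, and the filter functions $h_{ij}$ are the objects estimated nonparametrically, e.g. in a reproducing kernel Hilbert space.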
Research Seminar in Statistics
Non-parametric estimation of linear filters for point processes
HG G 19.1
Thu 16.05.2013
16:15-17:15
Carolin Strobl
Universität Zürich
Abstract
The main aim of educational and psychological testing is to provide a means for objective and fair comparisons between the test takers. However, in practice a phenomenon called differential item functioning (DIF) can lead to an unfair advantage or disadvantage for certain groups of test takers. A variety of statistical methods has been suggested for detecting DIF in the Rasch model, which is increasingly used in educational and psychological testing. However, most of these methods are designed for the comparison of pre-specified focal and reference groups, such as males and females, whereas the actual groups of advantaged or disadvantaged test takers may be formed by (complex interactions of) several covariates, as in the case of females up to a certain age. In this talk a new method for DIF detection based on model-based recursive partitioning is presented that can detect groups of test takers exhibiting DIF in a data-driven way. The talk outlines the statistical methodology behind the new approach as well as its practical application by means of an illustrative example.
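For context, the Rasch model for person $p$ and item $i$ is
\[
P(Y_{pi} = 1 \mid \theta_p, \beta_i) = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)},
\]
where $\theta_p$ is the person ability and $\beta_i$ the item difficulty; DIF means that item parameters $\beta_i$ differ between groups of test takers, and the recursive-partitioning approach searches for covariate splits across which the estimated item parameters are unstable.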
ZüKoSt Zürcher Kolloquium über Statistik
Detecting Differential Item Functioning in Psychological Tests
HG G 19.1
Thu 23.05.2013
16:15-17:15
Yves Rozenholc
University Paris Descartes, Paris
Abstract
In the context of anti-angiogenic cancer treatments, a major issue is to follow the drug effect. While parametric models have been developed to achieve this goal, they suffer from being tissue-related and, moreover, if their pertinence is already questionable in healthy tissue, they are certainly wrong in tumors, where cell growth changes the nature of the tissue. In order to face these problems, nonparametric modeling of the blood flow exchanges was conceived early in the 80's and started to be used in the second half of the 90's with the availability of high-frequency imaging techniques. Unfortunately, to date the estimation in such nonparametric models is highly unstable due to the high level of ill-posedness. After recalling the medical context which has motivated our study and describing the associated models, I will present two new nonparametric estimators for Laplace deconvolution in the regression setting. The first estimator is derived from the statistical analysis of Volterra equations of the first kind, which are intimately linked to Laplace deconvolution. This pointwise estimator is shown to be adaptive in the sense that it achieves optimal rates of convergence up to the regularity of the unknown function, even if this regularity is also unknown. Because this estimator needs the knowledge of the roots of a polynomial, it remains hardly usable from a practical point of view. The second estimator relies on a decomposition of the functions of interest on the basis of the Laplace functions. This global estimator, tuned by model selection, satisfies an oracle inequality and is easily implementable. This theoretical study is completed by simulations which show the proper behavior of these two estimators. Collaboration with Charles-A. Cuénod (MD-PhD), Felix Abramovich, Fabienne Comte and Marianna Pensky.
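The underlying observation model is a convolution (Volterra) equation of the first kind; in generic notation assumed here,
\[
Y_i = \int_0^{t_i} g(t_i - s)\,f(s)\,ds + \sigma\,\varepsilon_i,\qquad i = 1,\dots,n,
\]
where $g$ is a known kernel (e.g. an input function measured on the images), $f$ is the unknown tissue response to be recovered and the $\varepsilon_i$ are noise terms; recovering $f$ from noisy values of its Laplace-type convolution with $g$ is the ill-posed problem addressed by the two estimators.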
ZüKoSt Zürcher Kolloquium über Statistik
Laplace deconvolution in regression - Application to angiogenesis follow-up in cancer
HG G 19.1
Thu 20.06.2013
15:15-16:15
Andrew B. Nobel
University of North Carolina at Chapel Hill
Abstract
The problem of finding large average submatrices of a real-valued matrix arises in the exploratory analysis of data from a variety of disciplines, ranging from genomics to social sciences. This talk details several new theoretical results concerning the asymptotic behavior of large average submatrices of an n x n Gaussian random matrix. The first result identifies the average and joint distribution of the (globally optimal) k x k submatrix having largest average value. We then turn our attention to submatrices with dominant row and column sums, which arise as the local optima of a useful iterative search procedure for large average submatrices. Paralleling the result for global optima, the second result identifies the average and joint distribution of a typical locally optimal k x k submatrix. The last part of the talk considers the *number* of locally optimal k x k submatrices, L_n(k), beginning with the asymptotic behavior of its mean and variance for fixed k and increasing n. The final result is a Gaussian central limit theorem for L_n(k) that is based on a new variant of Stein's method for normal approximation. Joint work with Shankar Bhamidi and Partha S. Dey.
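A minimal sketch of an alternating search for a locally optimal k x k submatrix, in the spirit of the iterative procedure mentioned above; the function name, initialisation and stopping rule are illustrative, not the speaker's code:

import numpy as np

def local_search_kxk(W, k, n_iter=100, seed=0):
    """Alternately pick the k rows (columns) with largest sums over the current columns (rows)."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    cols = rng.choice(m, size=k, replace=False)           # random initial column set
    rows = np.argsort(W[:, cols].sum(axis=1))[-k:]        # best k rows for these columns
    for _ in range(n_iter):
        new_cols = np.argsort(W[rows, :].sum(axis=0))[-k:]
        new_rows = np.argsort(W[:, new_cols].sum(axis=1))[-k:]
        if set(new_rows) == set(rows) and set(new_cols) == set(cols):
            break                                          # locally optimal: dominant row and column sums
        rows, cols = new_rows, new_cols
    return np.sort(rows), np.sort(cols), W[np.ix_(rows, cols)].mean()

# Example on an n x n Gaussian random matrix
W = np.random.default_rng(1).standard_normal((200, 200))
print(local_search_kxk(W, k=5))

Counting the fixed points of this alternating update over many random starts is one way to probe the landscape quantity L_n(k) empirically.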
Research Seminar in Statistics
Large Average Submatrices of a Gaussian Random Matrix: Landscapes and Local Optima.
HG G 19.1
Tue 02.07.2013
15:15-16:00
Rajen Shah
Statistical Laboratory, University of Cambridge, UK
Abstract
The “Big Data” era in which we are living has brought with it a combination of statistical and computational challenges that often must be met with approaches that draw on developments from both the fields of statistics and computer science. In this talk I will present a method for performing regression where the n by p design matrix may have both n and p in the millions, but where the design matrix is sparse, that is most of its entries are zero; such sparsity is common in many large-scale applications such as text analysis. In this setting, performing regression using the original data can be computationally infeasible. Instead, we first map the design matrix to an n by L matrix with L << p, using a modified version of a scheme known as b-bit min-wise hashing in computer science. From a statistical perspective, we study the performance of regression using this compressed data, and give finite sample bounds on the prediction error. Interestingly, despite the loss of information through the compression scheme, we will see that ordinary least squares or ridge regression applied to the reduced data can actually allow us to fit a model containing interactions in the original data. This is joint (and ongoing) work with Nicolai Meinshausen.
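A minimal sketch of plain b-bit min-wise hashing for a sparse binary design matrix, assuming the standard construction (L random hash functions, lowest b bits of the min-hash over each row's nonzero columns, one-hot encoded into L * 2^b columns); the modified scheme described in the talk maps to an n by L matrix and may differ in its details:

import numpy as np
from scipy import sparse

def bbit_minhash(X, L=64, b=2, seed=0):
    """Map an n x p sparse binary matrix to an n x (L * 2^b) binary matrix."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = X.tocsr()
    out = np.zeros((n, L * (1 << b)), dtype=np.int8)
    for l in range(L):
        perm = rng.permutation(p)                   # random permutation acts as the hash of column indices
        for i in range(n):
            nz = X.indices[X.indptr[i]:X.indptr[i + 1]]
            if nz.size == 0:
                continue
            h = perm[nz].min() & ((1 << b) - 1)     # lowest b bits of the min-hash
            out[i, l * (1 << b) + h] = 1
    return out

# Usage: compress, then run ordinary least squares / ridge on the reduced matrix instead of X.
X = sparse.random(1000, 50000, density=1e-3, format='csr', dtype=np.float64)
X.data[:] = 1.0
Z = bbit_minhash(X, L=32, b=2)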
Research Seminar in Statistics
Large-scale regression with sparse data
HG G 19.1