Seminar overview
Autumn Semester 2020
Date & Time | Speaker | Title | Location
Fri 18.09.2020 15:00-16:00 |
Hongzhou Lin MIT (Massachusetts Institute of Technology) |
Abstract
We investigate stochastic optimization problems under relaxed assumptions on the distribution of noise that are motivated by empirical observations in neural network training. Standard results on optimal convergence rates for stochastic optimization assume either there exists a uniform bound on the moments of the gradient noise, or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms used in neural network training where the noise level in stochastic gradients could even increase with time. We address this behavior by studying convergence rates of stochastic gradient methods subject to changing second moment (or variance) of the stochastic oracle as the iterations progress. When the variation in the noise is known, we show that it is always beneficial to adapt the step-size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp. Consequently, our results reveal an important scenario where adaptive stepsize methods outperform SGD.
Young Data Science Researcher Seminar Zurich: Stochastic Optimization with Non-stationary Noise |
Zoom Call |
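As a concrete flavor of the adaptive step-size idea in Lin's abstract above, here is a minimal sketch in the spirit of the RMSProp-like scheme mentioned there (not the authors' actual algorithm; all names below are illustrative): the step size is scaled by an online estimate of the gradient second moment, so the method automatically takes smaller steps as the noise grows.

```python
import numpy as np
from itertools import count

rng = np.random.default_rng(0)
step_id = count(1)

def noisy_grad(x):
    """Gradient of f(x) = 0.5 * ||x||^2, corrupted by noise whose level grows over time."""
    sigma = 0.05 * np.sqrt(next(step_id))
    return x + sigma * rng.standard_normal(x.shape)

def noise_adaptive_sgd(grad_fn, x0, steps=500, base_lr=0.5, beta=0.9, eps=1e-8):
    x = np.asarray(x0, dtype=float)
    v = 0.0                                          # online estimate of E[||g||^2]
    for _ in range(steps):
        g = grad_fn(x)
        v = beta * v + (1 - beta) * float(g @ g)
        x = x - base_lr / (np.sqrt(v) + eps) * g     # shrink the step when the noise is large
    return x

print(noise_adaptive_sgd(noisy_grad, np.ones(5)))
```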
Fri 25.09.2020 17:00-18:00 |
Raaz Dwivedi UC Berkeley |
Abstract
In this talk, I will present some recent work in which we introduce a novel methodology, Stable Discovery of Interpretable Subgroups via Calibration (StaDISC), for identifying subgroups with large heterogeneous treatment effects in randomized experiments. Building on Yu and Kumbier's PCS framework, StaDISC was developed during our re-analysis of the 1999-2000 VIGOR study, an 8,076-patient randomized controlled trial that compared the risk of adverse events from a then newly approved drug, Rofecoxib (Vioxx), with that from an older drug, Naproxen. On average, and in comparison to Naproxen, Vioxx was found to reduce the risk of gastrointestinal events but increase the risk of thrombotic cardiovascular events. Applying StaDISC, we fit 18 popular conditional average treatment effect (CATE) estimators for both outcomes and use calibration to demonstrate their poor global performance. However, we find that CATE methods are locally well-calibrated and stable, thereby enabling the identification of clinically interpretable patient groups with larger than (estimated) average treatment effects. External validation of the discovered subgroups provides further evidence for the promise of the proposed methodology.
Based on joint work with Briton Park, Yan Shuo Tan, Mian Wei, Kevin Horgan, David Madigan, and Bin Yu
arXiv preprint: https://arxiv.org/abs/2008.10109 (in submission to International Statistical Review)
Young Data Science Researcher Seminar Zurich: StaDISC: Stable discovery of interpretable subgroups via calibration |
Zoom Call |
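The calibration step described in the abstract above can be pictured with a toy check: bin units by predicted CATE and compare the mean prediction in each bin with a difference-in-means estimate of the effect. This is only an illustrative sketch of that idea, with hypothetical names; it is not the StaDISC procedure itself.

```python
import numpy as np

def cate_calibration_by_decile(tau_hat, y, treated):
    """Within deciles of the predicted CATE, compare the mean prediction with the
    difference-in-means estimate of the treatment effect (toy calibration check)."""
    tau_hat, y, treated = map(np.asarray, (tau_hat, y, treated))
    edges = np.quantile(tau_hat, np.linspace(0, 1, 11))
    bins = np.clip(np.digitize(tau_hat, edges[1:-1]), 0, 9)
    report = []
    for b in range(10):
        idx = bins == b
        t, c = idx & (treated == 1), idx & (treated == 0)
        if t.any() and c.any():
            report.append((b, tau_hat[idx].mean(), y[t].mean() - y[c].mean()))
    return report   # (decile, mean predicted CATE, observed effect in that decile)
```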
Fri 02.10.2020 15:00-16:00 |
Mona Azadkia ETH Zürich |
Abstract
There are numerous problems where one needs to quantify the dependence between two random variables and how this dependence changes after conditioning on a third random variable. Correlated random variables might become independent when we observe a third random variable, or two independent random variables might become dependent after conditioning on the third one. Thanks to the wide range of potential applications (e.g., bioinformatics, economics, psychology), finding efficient measures of conditional dependence has been an active area of research in many subareas of statistics and machine learning. However, the literature on measures of conditional dependence is not so large, especially in the non-parametric setting. We introduce two novel measures of conditional dependence and propose estimators based on i.i.d. samples. Using these statistics, we devise a new variable selection algorithm, called Feature Ordering by Conditional Independence (FOCI). FOCI is model-free with no tuning parameters and is provably consistent under sparsity assumptions. We provide a number of example analyses on both synthetic and real datasets.
Young Data Science Researcher Seminar Zurich: A Simple Measure of Conditional Dependence |
Zoom Call |
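The coefficients in this line of work generalize simple rank-based measures. As one concrete special case in the same family (Chatterjee's correlation coefficient, to which the abstract's measures are related), here is a minimal implementation for continuous data without ties; the conditional measures and the nearest-neighbour estimators behind FOCI are more involved and are not reproduced here.

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation: sort by x, then measure how 'jumpy' the
    ranks of y are along that ordering (assumes no ties)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x, kind="stable")
    ranks = np.argsort(np.argsort(y[order])) + 1
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
print(chatterjee_xi(x, np.sin(8 * x)))            # close to 1: y is a function of x
print(chatterjee_xi(x, rng.uniform(size=2000)))   # close to 0: independent
```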
Fri 09.10.2020 15:00-16:00 |
Alex Wein NYU, New York University |
Abstract
One fundamental goal of high-dimensional statistics is to detect or recover planted structure (such as a low-rank matrix) hidden in noisy data. A growing body of work studies low-degree polynomials as a restricted model of computation for such problems. Many leading algorithmic paradigms (such as spectral methods and approximate message passing) can be captured by low-degree polynomials, and thus, lower bounds against low-degree polynomials serve as evidence for computational hardness of statistical problems.
Prior work has studied the power of low-degree polynomials for the detection (i.e. hypothesis testing) task. In this work, we extend these methods to address problems of estimating (i.e. recovering) the planted signal instead of merely detecting its presence. For a large class of "signal plus noise" problems, we give a user-friendly lower bound for the best possible mean squared error achievable by any degree-D polynomial. These are the first results to establish low-degree hardness of recovery problems for which the associated detection problem is easy. As applications, we study the planted submatrix and planted dense subgraph problems, resolving (in the low-degree framework) open problems about the computational complexity of recovery in both cases.
Joint work with Tselil Schramm, available at: https://arxiv.org/abs/2008.02269
Young Data Science Researcher Seminar Zurich: Computational Barriers to Estimation from Low-Degree Polynomials |
Zoom Call |
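For readers unfamiliar with the low-degree framework, the quantity bounded in the abstract above can be written as follows (notation assumed here for orientation, not necessarily the paper's): the best mean squared error achievable for estimating a scalar quantity x_1 of the planted signal by any polynomial of the observations Y of degree at most D.

```latex
% Low-degree minimum mean squared error for estimating one signal coordinate x_1
% from observations Y (illustrative notation):
\mathrm{MMSE}_{\le D}
  \;=\; \inf_{\substack{f \in \mathbb{R}[Y] \\ \deg f \le D}}
        \mathbb{E}\big[\big(f(Y) - x_1\big)^2\big]
```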
Fri 16.10.2020 15:00-16:00 |
Ramya Korlakai Vinayak University of Wisconsin-Madison |
Abstract
Many scientific domains such as social sciences and epidemiology study heterogeneous populations (varying demographics) with only a few (small) observations available at the level of individuals. Limited observations prohibit accurate estimation of parameters of interest for any given individual. In this small data regime, the key question is: how accurately can we estimate the distribution of parameters over the population? In this talk, we investigate this fundamental and practically relevant problem of learning from a heterogeneous population with small data in the Binomial observation setting. While the maximum likelihood estimator (MLE) is widely used for this problem, its optimality and sample complexity in the small data regime were not well understood. We prove that the MLE is optimal even in the small data regime, resolving a question that had been open since the 1960s. We then use these results to construct new, optimal estimators for learning the change in the parameters over the population.
Young Data Science Researcher Seminar Zurich: Learning from Small Data |
Zoom Call |
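To make the setting above concrete: each individual i has an unknown parameter p_i drawn from a population distribution G, and we only observe X_i ~ Binomial(t, p_i) with small t. One common computational route to the MLE over distributions is EM on a fixed grid of support points; the sketch below is an illustrative approximation of that route, not the estimator or the analysis from the talk.

```python
import numpy as np
from scipy.stats import binom

def grid_mle_binomial(xs, t, grid_size=101, iters=500):
    """Approximate MLE of the mixing distribution G over p in [0, 1], given
    X_i ~ Binomial(t, p_i) with p_i ~ G, via EM over a fixed grid of support points."""
    xs = np.asarray(xs)
    grid = np.linspace(0.0, 1.0, grid_size)
    weights = np.full(grid_size, 1.0 / grid_size)
    lik = binom.pmf(xs[:, None], t, grid[None, :])        # n x grid likelihood table
    for _ in range(iters):
        post = lik * weights
        post /= post.sum(axis=1, keepdims=True)           # E-step: responsibilities
        weights = post.mean(axis=0)                       # M-step: reweight support points
    return grid, weights
```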
Fri 23.10.2020 15:00-16:00 |
Yuting Wei Carnegie Mellon University |
Abstract
Modern scientific discovery and decision making require the development of trustworthy and informative inferential procedures, which are particularly challenging when coping with high-dimensional data. This talk presents two vignettes on the theme of reliable high-dimensional inference.
The first vignette considers performing inference based on the Lasso estimator when the number of covariates is of the same order or larger than the number of observations. Classical asymptotic statistics theory fails due to two fundamental reasons: (1) The regularized risk is non-smooth; (2) The discrepancy between the estimator and the true parameter vector cannot be neglected. We pin down the distribution of the Lasso, as well as its debiased version, under a broad class of Gaussian correlated designs with non-singular covariance structure. Our findings suggest that a careful degree-of-freedom correction is crucial for computing valid confidence intervals in this challenging regime.
The second vignette investigates the model-X knockoffs framework, a general procedure that can leverage any feature importance measure to produce a variable selection algorithm. Model-X knockoffs rely on the construction of synthetic random variables and are, therefore, random. We propose a method for derandomizing, and hence stabilizing, model-X knockoffs. By aggregating the selection results across multiple runs of the knockoffs algorithm, our method provides stable decisions without compromising statistical power. Our approach, when applied to the multi-stage GWAS of prostate cancer, reports locations on the genome that have been replicated in other studies.
The first vignette is based on joint work with Michael Celentano and Andrea Montanari, whereas the second one is based on joint work with Zhimei Ren and Emmanuel Candes.
Young Data Science Researcher Seminar Zurich: Reliable hypothesis testing paradigms in high dimensions |
Zoom Call |
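The derandomization idea in the second vignette can be sketched very simply: run the (randomized) knockoffs filter many times and keep the features that are selected sufficiently often. The function below is only a schematic of that aggregation; `run_knockoffs_once`, `n_runs`, and the threshold `eta` are placeholders, and the actual method in the talk uses a particular aggregation rule with statistical guarantees.

```python
import numpy as np

def derandomized_selection(run_knockoffs_once, p, n_runs=30, eta=0.5):
    """Aggregate many runs of a randomized selection procedure.
    run_knockoffs_once() should return a boolean mask of length p."""
    counts = np.zeros(p)
    for _ in range(n_runs):
        counts += np.asarray(run_knockoffs_once(), dtype=float)
    return counts / n_runs >= eta      # keep features selected in at least a fraction eta of runs
```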
Fri 30.10.2020 15:00-16:00 |
Merle Behr University of California, Berkeley |
Abstract
Many data problems, in particular in biogenetics, come with a highly complex underlying structure, which often makes it difficult to extract interpretable information. In this talk we want to demonstrate that these complex structures are often well approximated by a composition of a few simple parts, which provides very descriptive insights into the underlying data-generating process. We illustrate this with two examples.
In the first example, the single components are finite alphabet vectors (e.g., binary components), which encode some discrete information. For instance, in genetics a binary vector of length n can encode whether or not a mutation (e.g., a SNP) is present at location i = 1,…,n in the genome. On the population level studying genetic variations is often highly complex, as various groups of mutations are present simultaneously. However, in many settings a population might be well approximated by a composition of a few dominant groups. Examples are Evolve and Resequence experiments where the outer supply of genetic variation is limited and thus, over time, only a few haplotypes survive. Similarly, in a cancer tumor, often only a few competing groups of cancer cells (clones) come out on top.
In the second example, the single components relate to separate branches of a tree structure. Tree structures, showing hierarchical relationships between samples, are ubiquitous in genomic and biomedical sciences. A common question in many studies is whether there is an association between a response variable and the latent group structure represented by the tree. Such a relation can be highly complex, in general. However, often it is well approximated by a simple composition of relations associated with a few branches of the tree.
For both of these examples we first study theoretical aspects of the underlying compositional structure, such as identifiability of single components and optimal statistical procedures under probabilistic data models. Based on this, we find insights into practical aspects of the problem, namely how to actually recover such components from data.
Young Data Science Researcher Seminar Zurich: Learning Compositional Structures |
Zoom Call |
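The first example above can be written as a finite-alphabet mixture: each observed value is (approximately) a convex combination of a few unknown binary source vectors. In notation assumed here purely for illustration,

```latex
% Observed value at genome position i as a mixture of K binary haplotypes h_k
% with relative abundances omega_k (illustrative notation):
y_i \;=\; \sum_{k=1}^{K} \omega_k\, h_{k,i} + \varepsilon_i,
\qquad h_{k,i} \in \{0, 1\}, \quad \omega_k \ge 0, \quad \sum_{k=1}^{K} \omega_k = 1
```

Identifiability then asks when the weights and the binary vectors can be recovered from y, which is the kind of question addressed in the talk.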
Fri 06.11.2020 15:00-16:00 |
Eugene Katsevich University of Pennsylvania |
Abstract
For testing conditional independence of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this talk, we study the power of conditional independence testing under MX, focusing our analysis on the conditional randomization test (CRT). The validity of the CRT conditionally on Y, Z allows us to view it as a test of a point null hypothesis involving the conditional distribution of X, from which we can use the Neyman-Pearson lemma to derive the most powerful CRT statistic against a point alternative. We obtain an analogous result for MX knockoffs as well. Next, we derive expressions for the power of the CRT against local semiparametric alternatives, establishing a direct link between the power of the CRT and the performance of the machine learning method on which it is based. If time permits, we will discuss a recent computational acceleration of the CRT that permits its application to large-scale datasets.
Young Data Science Researcher Seminar Zurich: Finite-sample optimality and large-sample power analysis of the conditional randomization test |
Zoom Call |
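The conditional randomization test itself has a compact description: holding (Y, Z) fixed, resample X from its (assumed known) conditional distribution given Z and compare the observed test statistic with the resampled ones. A minimal sketch, with the model for X given Z and the statistic supplied by the user:

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, statistic, n_resamples=500, seed=0):
    """Conditional randomization test p-value (one-sided): the rank of the observed
    statistic among statistics computed on resampled copies of X drawn from X | Z."""
    rng = np.random.default_rng(seed)
    t_obs = statistic(x, y, z)
    t_null = np.array([statistic(sample_x_given_z(z, rng), y, z)
                       for _ in range(n_resamples)])
    return (1 + np.sum(t_null >= t_obs)) / (1 + n_resamples)
```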
Fri 13.11.2020 15:00-16:00 |
Nabarun Deb Columbia University |
Abstract
In this work, we propose a class of simple, nonparametric, yet interpretable measures of association between two random variables X and Y taking values in general topological spaces. These nonparametric measures, defined using the theory of reproducing kernel Hilbert spaces, capture the strength of dependence between X and Y and have the property that they are 0 if and only if the variables are independent and 1 if and only if one variable is a measurable function of the other. Further, these population measures can be consistently estimated using the general framework of geometric graphs, which includes k-nearest neighbor graphs and minimum spanning trees. Moreover, a sub-class of these estimators is also shown to adapt to the intrinsic dimensionality of the underlying distribution. Some of these empirical measures can also be computed in near-linear time. If X and Y are independent, these empirical measures (properly normalized) have a standard normal limiting distribution and hence can be readily used to test for independence. In fact, as far as we are aware, these are the only procedures that possess all the above-mentioned desirable properties. The correlation coefficients proposed in Dette et al. [31], Chatterjee [22], and Azadkia and Chatterjee [7] can be seen as special cases of this general class of measures. If time permits, I will also describe how the same ideas can be effectively used to measure the strength of conditional dependence.
The talk is based on this paper https://arxiv.org/abs/2010.01768.
Young Data Science Researcher Seminar Zurich: Measuring Association on Topological Spaces Using Kernels and Geometric Graphs |
Zoom Call |
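To give a feel for the geometric-graph construction described above, here is a deliberately simplified caricature (not the estimator from the paper): compare the average kernel similarity of Y-values across X-nearest-neighbour pairs with the similarity over all pairs, normalized so that the score is near 0 under independence and near 1 when Y is a function of X. The kernel and all names below are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_kernel_dependence(x, y, kernel):
    """Caricature of a 1-NN-graph dependence score: kernel similarity of Y across
    X-neighbour pairs, relative to all pairs (illustrative only)."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y)
    nbr = cKDTree(x).query(x, k=2)[1][:, 1]                 # nearest neighbour in X, excluding self
    k_nn = np.mean([kernel(a, b) for a, b in zip(y, y[nbr])])
    k_all = np.mean([kernel(a, b) for a in y for b in y])
    k_self = np.mean([kernel(a, a) for a in y])
    return (k_nn - k_all) / (k_self - k_all)

rbf = lambda a, b: np.exp(-np.sum((np.atleast_1d(a) - np.atleast_1d(b)) ** 2))
rng = np.random.default_rng(0)
x = rng.normal(size=300)
print(nn_kernel_dependence(x, np.tanh(x), rbf))             # noticeably large: y is a function of x
print(nn_kernel_dependence(x, rng.normal(size=300), rbf))   # near 0: independent
```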
Thu 19.11.2020 16:00-17:00 |
Ivan Dokmanić Universität Basel |
Abstract
Injectivity plays an important role in generative models, where it enables inference; in inverse problems and compressed sensing with generative priors it is a precursor to well-posedness. We establish sharp characterizations of injectivity of fully-connected and convolutional ReLU layers and networks. We begin with a layerwise analysis and show that an expansivity factor of two is necessary and sufficient for injectivity, by constructing appropriate weight matrices. We show that global injectivity with iid Gaussian matrices, a commonly used tractable model, requires a larger expansivity, between 3.4 and 5.7. We also characterize the stability of inverting an injective network via worst-case Lipschitz constants of the inverse. Next, we use arguments from differential topology to study injectivity of deep networks and prove that any Lipschitz map can be approximated by an injective ReLU network. Finally, using an argument based on random projections, we show that an end-to-end, rather than layerwise, doubling of the dimension suffices for injectivity. We close with numerical experiments on injective generative models showing that injectivity improves inference.
ETH-FDS seminar: Injective Neural Networks for Inference and Inverse Problems |
Zoom Call |
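The sufficiency direction of the "expansivity factor of two" statement above has a simple illustration: stack an injective matrix B on top of -B, so the layer output contains both ReLU(Bx) and ReLU(-Bx), whose difference recovers Bx exactly. This toy construction is consistent with the layerwise result quoted in the abstract; whether it matches the paper's construction is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))          # any full-rank (hence injective) matrix
W = np.vstack([B, -B])                   # expansivity factor two: maps R^4 to R^8

def relu_layer(x):
    return np.maximum(W @ x, 0.0)

def invert(z):
    """Recover x from z = ReLU(Wx): the two halves are ReLU(Bx) and ReLU(-Bx),
    and their difference equals Bx."""
    return np.linalg.solve(B, z[:4] - z[4:])

x = rng.standard_normal(4)
assert np.allclose(invert(relu_layer(x)), x)
```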
Fri 20.11.2020 15:00-16:00 |
Martin Wahl HU Berlin |
Abstract
In settings where the number of observations is comparable to the dimension, principal component analysis (PCA) reveals some unexpected phenomena, ranging from eigenprojector inconsistency to eigenvalue upward bias. While such high-dimensional phenomena are now well understood in the spiked covariance model, the goal of this talk is to discuss some extensions to the case of PCA in infinite dimensions. In particular, we will introduce a new perturbation-theoretic framework that allows us to characterize the behavior of eigenvalues and eigenprojectors of empirical covariance operators in terms of the so-called "relative ranks". If time permits, we will also present some corresponding minimax lower bounds for the estimation of eigenprojectors. These are obtained by a van Trees inequality for invariant statistical models.
Young Data Science Researcher Seminar Zurich: Upper and lower bounds for the estimation of principal components |
Zoom Call |
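The "eigenvalue upward bias" mentioned in the abstract is easy to see numerically: when the dimension is comparable to the sample size, the top sample eigenvalue of an identity-covariance Gaussian sample sits far above 1. A quick illustration of this standard random-matrix behaviour (not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 200                          # sample size comparable to the dimension
X = rng.standard_normal((n, p))          # true covariance is the identity
top = np.linalg.eigvalsh(X.T @ X / n).max()
print(top)    # about (1 + sqrt(p/n))^2 = 4, although every true eigenvalue equals 1
```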
Fri 27.11.2020 16:00-17:00 |
Matus Telgarsky University of Illinois Urbana-Champaign |
Abstract
The implicit bias of gradient descent has arisen as a promising explanation for the good generalization properties of deep networks (Soudry-Hoffer-Nacson-Gunasekar-Srebro, 2018). The purpose of this talk is to demonstrate the effectiveness of a certain dual problem in the analysis of this implicit bias. Concretely, this talk will develop this dual, as well as a variety of consequences in linear and nonlinear settings.

In the linear case, this dual perspective firstly allows a characterization of the implicit bias, even outside the standard setting of exponentially-tailed losses; in this sense, it is gradient descent, and not a particular loss structure, which leads to implicit bias. Secondly, invoking duality in the margin convergence analysis yields a fast 1/t rate; by contrast, all prior analyses never surpassed 1/sqrt(t), even in the well-studied boosting setting.

In the nonlinear case, duality enables the proof of a gradient alignment property: asymptotically, the parameters and their gradients become collinear. Although abstract, this property in turn implies various existing and new margin maximization results.
Joint work with Ziwei Ji.
Bio:
Matus Telgarsky is an assistant professor at the University of Illinois, Urbana-Champaign, specializing in deep learning theory. He was fortunate to receive a PhD at UCSD under Sanjoy Dasgupta. Other highlights include: co-founding, in 2017, the Midwest ML Symposium (MMLS) with Po-Ling Loh; receiving a 2018 NSF CAREER award; and organizing a Simons Institute summer 2019 program on deep learning with Samy Bengio, Aleksander Madry, and Elchanan Mossel. In 2020, meanwhile, he's thankful to be alive and employed.
ETH-FDS seminar: The dual of the margin: improved analyses and rates of gradient descent's implicit bias |
Room Details |
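For orientation, the classical linear-case statement behind this line of work (Soudry et al., 2018) says that on linearly separable data, gradient descent on the logistic loss converges in direction to the maximum-margin separator; in the notation assumed here,

```latex
% Implicit bias of gradient descent on separable data (linear case, logistic loss):
\frac{w(t)}{\lVert w(t)\rVert_2} \;\longrightarrow\;
\hat{w} \;=\; \operatorname*{arg\,max}_{\lVert w\rVert_2 \le 1}\; \min_{i}\, y_i \langle w, x_i \rangle
\qquad \text{as } t \to \infty
```

The abstract's dual viewpoint concerns how fast the normalized margin approaches its maximum value, improving the rate from 1/sqrt(t) to 1/t.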
Fri 04.12.2020 15:00-16:00 |
Zhou Fan Yale University |
Abstract
This talk will be divided into two halves. In a first more applied half, I will describe a new empirical Bayes procedure for principal components analysis in high dimensions, which aims to learn a prior distribution for the PCs from the observed data. Its ideas are based around the Kiefer-Wolfowitz NPMLE, some basic results in asymptotic random matrix theory, and Approximate Message Passing (AMP) algorithms for Bayesian inference. I will explain the interplay between these ideas and demonstrate the method on several genetics examples. In a second more theoretical half, motivated by this application, I will then describe a general extension of AMP algorithms to a class of rotationally invariant matrices. The usual bias correction and state evolution in AMP are replaced by forms involving the free cumulants of the spectral law. I hope to explain the main ideas behind this algorithm, and connect this back to the PCA application.
This is joint work with Xinyi Zhong and Chang Su.
Young Data Science Researcher Seminar Zurich: Empirical Bayes and Approximate Message Passing algorithms for PCA in high dimensions |
Zoom Call |
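As a rough picture of the AMP structure mentioned above: the algorithm alternates a matrix multiplication with an entrywise "denoiser", subtracting a memory ("Onsager") term computed from the derivative of the denoiser. The sketch below is schematic only; the correct normalizations, the choice of denoiser, and the free-cumulant corrections for rotationally invariant matrices discussed in the talk are not reproduced here.

```python
import numpy as np

def amp_rank_one(Y, denoiser, denoiser_deriv, iters=30):
    """Schematic rank-one AMP iteration: x_{t+1} = Y f(x_t) - b_t f(x_{t-1}),
    with Onsager coefficient b_t equal to the average derivative of the denoiser."""
    n = Y.shape[0]
    x, f_prev = np.ones(n), np.zeros(n)
    for _ in range(iters):
        f = denoiser(x)
        b = denoiser_deriv(x).mean()     # Onsager memory coefficient
        x, f_prev = Y @ f - b * f_prev, f
    return x
```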
Fri 11.12.2020 15:00-16:00 |
Kweku Abraham Université Paris-Sud |
Abstract
Multiple testing has become a very important statistical problem in the age of high-dimensional data sets. Abstractly, the goal is as follows: given measurements of very many covariates (for example, the nucleotides forming someone's DNA), return those covariates which are predictive of some output data (such as health outcomes).
A typical design aim for the statistician is to give a procedure of controlled "size", in the sense that the so-called False Discovery Rate (FDR) is kept below some target level, while maximising the number of true discoveries. I will explain that a commonly used method based on posterior ("smoothing") probabilities achieves this goal when the covariates have a Markov dependence structure.
Young Data Science Researcher Seminar Zurich: Optimal false discovery rate control of a nonparametric hidden Markov model multiple testing procedure |
Zoom Call |
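The posterior-probability rule referred to above has a standard form: order hypotheses by one minus their posterior probability of being a signal (the local FDR) and reject the largest initial set whose average stays below the target level. Below is a minimal sketch of that rule; computing the posterior probabilities themselves from the hidden Markov model (the "smoothing" step) is the part the talk analyses and is not shown.

```python
import numpy as np

def reject_by_local_fdr(prob_signal, alpha=0.1):
    """Reject the hypotheses with the largest posterior signal probabilities such that
    the average of (1 - probability) over the rejected set stays below alpha."""
    lfdr = 1.0 - np.asarray(prob_signal, dtype=float)
    order = np.argsort(lfdr)
    running = np.cumsum(lfdr[order]) / np.arange(1, lfdr.size + 1)
    k = int(np.max(np.nonzero(running <= alpha)[0]) + 1) if np.any(running <= alpha) else 0
    rejected = np.zeros(lfdr.size, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```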