Seminar overview

Spring Semester 2021

Date & Time Speaker Title Location
Fri 12.02.2021
15:00-16:00
Tengyao Wang
UCL
Abstract
We introduce a new method for two-sample testing of high-dimensional linear regression coefficients without assuming that those coefficients are individually estimable. The procedure works by first projecting the matrices of covariates and response vectors along directions that are complementary in sign in a subset of the coordinates, a process which we call "complementary sketching". The resulting projected covariates and responses are aggregated to form two test statistics, which are shown to have essentially optimal asymptotic power under a Gaussian design when the difference between the two regression coefficients is sparse and dense, respectively. Simulations confirm that our methods perform well in a broad class of settings.
Young Data Science Researcher Seminar Zurich
Two-sample testing of high-dimensional linear regression coefficients via complementary sketching
Zoom Call
Fri 19.02.2021
15:00-16:00
Evgenii Chzhen
Université Paris-Saclay
Abstract
A theoretical framework is proposed for the problem of learning a real-valued function which meets fairness requirements. This framework is built upon the notion of alpha-relative (fairness) improvement of the regression function, which we introduce using the theory of optimal transport. Setting alpha = 0 corresponds to the regression problem under the Demographic Parity constraint, while alpha = 1 corresponds to the classical regression problem without any constraints. For alpha between 0 and 1, the proposed framework allows one to continuously interpolate between these two extreme cases and to study partially fair predictors. Within this framework we precisely quantify the cost in risk induced by the introduction of the fairness constraint. We put forward a statistical minimax setup and derive a general problem-dependent lower bound on the risk of any estimator satisfying the alpha-relative improvement constraint. We illustrate our framework on a model of linear regression with Gaussian design and systematic group-dependent bias, deriving matching (up to absolute constants) upper and lower bounds on the minimax risk under the introduced constraint. This talk is based on joint work with Nicolas Schreuder; see arXiv:2007.14265.
Young Data Science Researcher Seminar Zurich
A minimax framework for quantifying risk-fairness trade-off in regression
Zoom Call
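For reference, the Demographic Parity constraint that the alpha = 0 end of this interpolation enforces can be stated as follows. This is the standard formulation for a real-valued predictor f and a sensitive attribute S; the precise alpha-relative relaxation studied in the talk is defined in arXiv:2007.14265.

```latex
% Exact Demographic Parity (the alpha = 0 case): the prediction f(X) has the
% same conditional law in every group, i.e. it is statistically independent
% of the sensitive attribute S.
\mathrm{Law}\bigl(f(X) \mid S = s\bigr) \;=\; \mathrm{Law}\bigl(f(X) \mid S = s'\bigr)
\qquad \text{for all groups } s, s'.
```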
Fri 26.02.2021
16:00-17:00
Feng Ruan
UC Berkeley
Abstract
We tackle the problem of variable selection with a focus on discovering interactions between variables. With p variables, there are O(p^k) possible interactions of order k, making exhaustive search infeasible. We show that it is nonetheless possible to identify the variables involved in interactions (of any order) with only linear computation cost, O(p), and in a nonparametric fashion. Our algorithm is based on minimizing a non-convex objective, carefully designed to have a favorable landscape. We provide finite sample guarantees on both false positives (we show all stationary points of the objective exclude noise variables) and false negatives (we characterize the sample sizes needed for gradient descent to converge to a "good" stationary point).
Young Data Science Researcher Seminar Zurich
Searching for Interactions in Linear Time
Zoom Call
Fri 05.03.2021
14:30-15:30
Aaditya Ramdas
Carnegie Mellon University
Abstract
We derive confidence intervals (CI) and time-uniform confidence sequences (CS) for the classical problem of estimating an unknown mean from bounded observations. We present a general approach for deriving concentration bounds that can be seen as a generalization (and improvement) of the celebrated Chernoff method. At its heart, it is based on deriving a new class of composite nonnegative martingales, with strong connections to betting and the method of mixtures. We show how to extend these ideas to sampling without replacement, another heavily studied problem. In all cases, our bounds are adaptive to the unknown variance, and empirically vastly outperform competing approaches based on Hoeffding or empirical Bernstein inequalities and their recent supermartingale generalizations. In short, we establish a new state of the art for four fundamental problems: CSs and CIs for bounded means, with and without replacement. This work is joint with Ian Waudby-Smith; a preprint is available at https://arxiv.org/abs/2010.09686.
Young Data Science Researcher Seminar Zurich
Estimating means of bounded random variables by betting
Zoom Call
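A minimal sketch of the betting construction behind such confidence sequences, assuming observations bounded in [0, 1] and using a fixed bet size rather than the variance-adaptive bets of the paper. Candidate means whose hedged capital stays below 1/alpha remain in the set, and Ville's inequality gives time-uniform coverage. All parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=2000)           # bounded observations in [0, 1], true mean = 2/7
alpha, lam = 0.05, 0.5                  # miscoverage level and a fixed (non-adaptive) bet size
grid = np.linspace(0.001, 0.999, 999)   # candidate values of the mean

# Hedged capital process: for each candidate mean m, bet a fixed fraction lam of
# the current capital on "X > m" and on "X < m" in two separate accounts.  Under
# the true mean both accounts are nonnegative martingales with initial value 1,
# so by Ville's inequality P(sup_t K_t >= 1/alpha) <= alpha.
log_plus = np.zeros_like(grid)
log_minus = np.zeros_like(grid)
lower, upper = [], []
for xt in x:
    log_plus += np.log1p(lam * (xt - grid))
    log_minus += np.log1p(-lam * (xt - grid))
    capital = 0.5 * np.exp(log_plus) + 0.5 * np.exp(log_minus)
    keep = grid[capital < 1 / alpha]    # candidate means not yet rejected
    lower.append(keep.min())
    upper.append(keep.max())

print(f"CS after  100 obs: [{lower[99]:.3f}, {upper[99]:.3f}]")
print(f"CS after 2000 obs: [{lower[-1]:.3f}, {upper[-1]:.3f}]")
print(f"true mean: {2/7:.3f}")
```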
Fri 12.03.2021
16:30-17:30
Christos Thrampoulidis
University of British Columbia
Abstract
State-of-the-art deep neural networks generalize well despite being exceedingly overparameterized and trained without explicit regularization. Understanding the principles behind this phenomenon (termed benign overfitting or double descent) poses a new challenge to modern learning theory as it contradicts classical statistical wisdom. Key questions include: What are the fundamental mechanics behind double descent? How do its features, such as the transition threshold and global minima, depend on the training data and on the algorithms used for training? While increasing overparameterization can improve classification accuracy, it also comes with larger, thus slower and computationally more expensive, architectures that can be prohibitive in resource-constrained applications. Is overparameterization then only relevant in training large networks, or can it also benefit training smaller models when combined with appropriate model pruning techniques? What are the generalization dynamics of pruning overparameterized models? Finally, while overparameterization leads to lower misclassification error, what is its effect on fairness performance metrics, such as balanced error and equal opportunity? Can we design better loss functions, compared to standard losses such as cross entropy, which provably improve the fairness performance of large models in the presence of label-imbalanced and/or group-sensitive datasets? This talk will shed light on the questions raised above. At the heart of the results presented lies a powerful analysis framework for precise high-dimensional statistical analysis. This so-called convex Gaussian min-max theorem (CGMT) framework builds on Gordon's Gaussian comparison inequality and is rooted in the study of sharp phase transitions in compressed sensing.
Young Data Science Researcher Seminar Zurich
Blessings and Curses of Overparameterization: A Precise High-dimensional Approach
Zoom Call
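The double-descent curve discussed in the abstract above can be reproduced in a few lines with minimum-norm least squares on Gaussian data. This is only a toy simulation of the phenomenon, not the CGMT-based analysis from the talk; all sizes and noise levels below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_test = 100, 0.5, 2000          # training size, noise level, test size

def test_risk(p: int, trials: int = 20) -> float:
    """Average test risk of the minimum-norm least-squares fit with p features."""
    risks = []
    for _ in range(trials):
        beta = rng.normal(size=p) / np.sqrt(p)      # fixed total signal strength
        X = rng.normal(size=(n, p))
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat = np.linalg.pinv(X) @ y            # OLS if p < n, min-norm interpolant if p > n
        X_test = rng.normal(size=(n_test, p))
        y_test = X_test @ beta + sigma * rng.normal(size=n_test)
        risks.append(np.mean((X_test @ beta_hat - y_test) ** 2))
    return float(np.mean(risks))

for p in [20, 50, 90, 100, 110, 150, 300, 1000]:
    print(f"p = {p:4d}  test risk = {test_risk(p):8.2f}")
# The risk blows up near the interpolation threshold p = n = 100 and then
# decreases again as p grows: the double-descent shape discussed above.
```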
Fri 19.03.2021
16:00-17:00
Stephen Bates
UC Berkeley
Abstract
To enable valid statistical inference in prediction tasks, we show how to generate set-valued predictions with black-box models that control various notions of statistical error. Our approach guarantees that the expected loss on future test points falls below a user-specified level, for any predictive model and underlying distribution. Building on conformal prediction, we use a holdout set to calibrate the size of the prediction sets, generalizing the approach to control error notions such as the false rejection rate. We demonstrate our procedure in four large-scale problems: (1) multi-label classification, where each observation has multiple associated labels; (2) classification problems where the labels have a hierarchical structure; (3) image segmentation, where we wish to predict a set of pixels containing an object of interest; and (4) protein structure prediction.
Young Data Science Researcher Seminar Zurich
Distribution-Free, Risk-Controlling Prediction Sets
Zoom Call
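A minimal sketch of the calibrate-on-a-holdout idea for the simplest error notion, miscoverage, assuming generic softmax-style scores from some black-box classifier. The scores, thresholds, and Hoeffding bound below are illustrative stand-ins; the procedure in the talk handles more general risks such as the false rejection rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy black-box classifier: synthetic softmax scores over K classes, where the
# true class tends to receive a high score.  Any real model's scores could be used.
def make_scores(n: int, K: int = 10):
    y = rng.integers(K, size=n)
    logits = rng.normal(size=(n, K))
    logits[np.arange(n), y] += 2.0
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True), y

alpha, delta = 0.1, 0.05                 # target risk level and confidence
scores_cal, y_cal = make_scores(2000)    # held-out calibration set
scores_test, y_test = make_scores(5000)

# Nested prediction sets T_lambda(x) = {y : score_y(x) >= 1 - lambda}.  The loss
# 1{Y not in T_lambda(X)} shrinks as lambda grows, so we pick the smallest lambda
# whose Hoeffding upper confidence bound on the calibration risk is <= alpha.
n = len(y_cal)
hoeffding = np.sqrt(np.log(1 / delta) / (2 * n))
for lam in np.linspace(0, 1, 501):
    miss = scores_cal[np.arange(n), y_cal] < 1 - lam
    if miss.mean() + hoeffding <= alpha:
        break

sets = scores_test >= 1 - lam
covered = sets[np.arange(len(y_test)), y_test].mean()
print(f"lambda = {lam:.3f}, test coverage = {covered:.3f}, avg set size = {sets.sum(1).mean():.2f}")
```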
Fri 26.03.2021
15:00-16:00
Dana Yang
Duke University
Abstract
Motivated by the application of tracking moving particles from snapshots, we study the problem of reconstructing a perfect matching hidden in a randomly weighted n x n bipartite graph. The edge set includes every node pair in the hidden matching and each of the other n(n-1) node pairs independently with probability d/n. The weight of each edge is independently drawn from distributions P or Q, depending on whether the edge is in the hidden matching. In this talk, we establish the information-theoretic threshold for recovering almost all the edges of the hidden matching. We show that the sharp threshold occurs at \sqrt{d}B(P,Q)=1, where B(P,Q) stands for the Bhattacharyya coefficient, resolving the conjecture in [Moharrami et al. 2019, Semerjian et al. 2020]. Furthermore, in the special case of complete exponentially weighted graphs with d=n, P=\exp(\lambda), and Q=\exp(1/n), for which the sharp threshold simplifies to \lambda=4, we prove that when \lambda <= 4-\epsilon, the optimal reconstruction error (the average number of misclassified edges) is \exp( \Theta(1/\sqrt{\epsilon})), confirming the conjectured infinite-order phase transition in [Semerjian et al. 2020]. This is joint work with Jian Ding, Yihong Wu and Jiaming Xu. The preprint of this work is available at https://arxiv.org/abs/2103.09383.
Young Data Science Researcher Seminar Zurich
The planted matching problem: Sharp threshold and infinite-order phase transition
Zoom Call
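For reference, the Bhattacharyya coefficient B(P,Q) appearing in the threshold above is the standard overlap measure between the two weight distributions (mu below denotes any common dominating measure, and H the Hellinger distance):

```latex
B(P, Q) \;=\; \int \sqrt{\frac{dP}{d\mu}\,\frac{dQ}{d\mu}}\, d\mu
        \;=\; 1 - H^2(P, Q),
\qquad 0 \le B(P, Q) \le 1 .
```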
Fri 16.04.2021
16:30-17:30
Spencer Frei
UCLA
Abstract
Can overparameterized neural networks trained by SGD provably generalize when the labels are corrupted with substantial random noise? We answer this question in the affirmative by showing that for a broad class of distributions, one-hidden-layer networks trained by SGD generalize when the distribution is linearly separable but corrupted with adversarial label noise, despite the capacity to overfit. Equivalently, such networks have classification accuracy competitive with that of the best halfspace over the distribution. Our results hold for networks of arbitrary width and for arbitrary initializations of SGD. In particular, we do not rely upon the approximations to infinite width networks that are typically used in theoretical analyses of SGD-trained neural networks.
Young Data Science Researcher Seminar Zurich
Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise
Zoom Call
Fri 23.04.2021
15:00-16:00
Yuqi Gu
Duke University
Abstract
High dimensional categorical data are routinely collected in biomedical and social sciences. It is of great importance to build interpretable models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in such scenarios, yet is challenging to address when there are complex latent structures. We propose a class of interpretable discrete latent structure models for discrete data and develop a general identifiability theory. Our theory is applicable to various types of latent structures, ranging from a single latent variable to deep layers of latent variables organized in a sparse graph (termed a Bayesian pyramid). The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate identifiability and estimability of the model parameters. Applications of the methodology to DNA nucleotide sequence data uncover discrete latent features that are both interpretable and highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data, and can be a useful alternative to popular machine learning methods.
Young Data Science Researcher Seminar Zurich
Bayesian pyramids: Identifying Interpretable Discrete Latent Structures from Discrete Data
Zoom Call
Fri 30.04.2021
16:00-17:00
Dominik Rothenhäusler
Stanford
Abstract
Statistical uncertainty has many sources. P-values and confidence intervals usually quantify the overall uncertainty, which may include variation due to sampling and uncertainty due to measurement error, among others. Practitioners might be interested in quantifying only one source of uncertainty. For example, one might be interested in the uncertainty of a regression coefficient of a fixed set of subjects, which corresponds to quantifying the uncertainty due to measurement error and ignoring the variation induced by sampling the covariates. In causal inference, it is common to infer treatment effects for a certain set of subjects, only accounting for uncertainty due to random treatment assignment. Motivated by these examples, we consider conditional inference for conditional parameters in parametric and semi-parametric models, where we condition on observed characteristics of a population. We discuss methods to obtain conditionally valid p-values and confidence intervals. Conditional p-values can be used to construct a hierarchy of statistical evidence that may help clarify the generalizability of a statistical finding. In addition, we will discuss preliminary results on how to conduct transfer learning of conditional parameters, with rigorous conditional guarantees. This is ongoing work with Ying Jin.
Young Data Science Researcher Seminar Zurich
Conditional inference: Towards a hierarchy of statistical evidence
Zoom Call
Tue 04.05.2021
17:00-18:00
Manfred K. Warmuth
University of California, Santa Cruz / Google Research
Abstract
It was conjectured that any neural network of any structure and arbitrary differentiable transfer functions at the nodes cannot learn the following problem sample-efficiently when trained with gradient descent: the instances are the rows of a \(d\)-dimensional Hadamard matrix and the target is one of the features, i.e. very sparse. We essentially prove this conjecture: we show that after receiving a random training set of size \(k < d\), the expected square loss is still \(1 - k/(d - 1)\). The only requirement needed is that the input layer is fully connected and the initial weight vectors of the input nodes are chosen from a rotation-invariant distribution. Surprisingly, the same type of problem can be solved drastically more efficiently by a simple 2-layer linear neural network in which the \(d\) inputs are connected to the output node by chains of length 2 (now the input layer has only one edge per input). When such a network is trained by gradient descent, it has been shown that its expected square loss is \(\frac{\log d}{k}\). Our lower bounds essentially show that a sparse input layer is needed to sample-efficiently learn sparse targets with gradient descent when the number of examples is less than the number of input features.
ETH-FDS seminar
A case where a spindly two-layer linear network whips any neural network with a fully connected input layer
Zoom Call
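A toy numerical check of the fully connected lower bound in its simplest instance: a single linear unit trained by gradient descent from the all-zeros initialization (one rotation-invariant choice) converges to the minimum-norm interpolant, whose loss can be computed directly. The sizes below are arbitrary, and this illustrates only the linear special case, not the general conjecture for arbitrary architectures.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
d, k = 256, 64                        # input dimension (power of two) and training-set size
H = hadamard(d)                       # instances are the rows of the d x d Hadamard matrix
j = rng.integers(d)                   # the target is feature j, i.e. y_i = H[i, j]

train = rng.choice(d, size=k, replace=False)
X, y = H[train], H[train, j]
w = np.linalg.pinv(X) @ y             # min-norm interpolant = limit of GD from the zero init

loss = np.mean((H @ w - H[:, j]) ** 2)   # square loss averaged over all d instances
print(f"empirical loss {loss:.3f}  vs  1 - k/(d-1) = {1 - k/(d-1):.3f}")
```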
Fri 07.05.2021
15:00-16:00
Cheng Mao
Georgia Tech
Abstract
Graph matching, also known as network alignment, refers to the problem of matching the vertices of two unlabeled, edge-correlated graphs. The problem finds applications in many areas, such as computational biology, network privacy, and computer vision. In this talk, I will discuss several recent advances in efficient recovery and detection of the latent matching between the two graphs. In particular, under a correlated Erdős–Rényi model, we can recover the exact matching with high probability if the correlation between the two graphs on n vertices is at least 1-1/polyloglog(n), while the associated detection problem can be efficiently solved if the correlation is at least a constant. On the other hand, there is evidence for computational hardness when the correlation is small. Moreover, I will discuss the theoretical significance of the graph matching problem, its connection to other problems with a planted signal, and many future directions. The talk is based on joint works with Zhou Fan, Yihong Wu, Jiaming Xu, Mark Rudelson, Konstantin Tikhomirov, and Sophie H. Yu.
Young Data Science Researcher Seminar Zurich
Random Graph Matching: Efficient Algorithms for Recovery and Detection
Zoom Call
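A minimal sketch of a correlated Erdős–Rényi pair of the kind studied above, generated by subsampling a common parent graph and hiding the vertex correspondence with a random permutation. The function name and parameterization are illustrative (in this subsampling model the edge correlation is roughly s when the graph is sparse), not taken from the papers.

```python
import numpy as np

def correlated_erdos_renyi(n: int, p: float, s: float, seed: int = 0):
    """Two edge-correlated G(n, p*s) graphs plus the hidden vertex matching."""
    rng = np.random.default_rng(seed)
    # Parent graph G(n, p): upper-triangular Bernoulli adjacency, symmetrized later.
    parent = np.triu(rng.random((n, n)) < p, k=1)
    # Each child keeps every parent edge independently with probability s, so the
    # two children are positively correlated through the shared parent graph.
    A = parent & np.triu(rng.random((n, n)) < s, k=1)
    B = parent & np.triu(rng.random((n, n)) < s, k=1)
    A, B = A | A.T, B | B.T
    # Relabel the vertices of the second graph by a uniform random permutation;
    # graph matching asks to recover `perm` from (A, B_shuffled) alone.
    perm = rng.permutation(n)
    B_shuffled = B[np.ix_(perm, perm)]
    return A.astype(int), B_shuffled.astype(int), perm

A, B, perm = correlated_erdos_renyi(n=500, p=0.05, s=0.9)
print(A.sum() // 2, B.sum() // 2, "edges in each observed graph")
```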
Fri 14.05.2021
15:00-16:00
Boris Muzellec
ENS Paris-Saclay
Abstract
It is well-known that plug-in statistical estimation of optimal transport suffers from the curse of dimensionality. While recent works were able to leverage smoothness to improve the rate of estimation, the computational complexity of the resulting methods still degrades exponentially with the dimension. In this talk, we show how to leverage smoothness using a kernel sum-of-squares representation of the dense set of inequalities satisfied by optimal transport. Using this technique, we propose a polynomial-time algorithm that results in estimation rates that do not depend on the dimension – at the price of constants that may still depend exponentially on the dimension, in the worst case.
Young Data Science Researcher Seminar Zurich
Breaking the curse of dimensionality in smooth optimal transport
Zoom Call
Thu 20.05.2021
16:15-17:15
Gitta Kutyniok
Ludwig-Maximilians-Universität München
Abstract
The tremendous importance of graph-structured data due to recommender systems or social networks led to the introduction of graph convolutional neural networks (GCNs). These split into spatial and spectral GCNs, where in the latter case filters are defined as elementwise multiplication in the frequency domain of a graph. Since the dataset often consists of signals defined on many different graphs, the trained network should generalize to signals on graphs unseen in the training set. One instance of this problem is the transferability of a GCN, which refers to the condition that, if two graphs describe the same phenomenon, a single filter or the entire network has similar repercussions on both graphs. However, for a long time it was believed that spectral filters are not transferable. In this talk, by modelling graphs that mimic the same phenomenon in a very general sense, also taking the novel graphon approach into account, we will debunk this misconception. In general, we will show that spectral GCNs are transferable, both theoretically and numerically. This is joint work with R. Levie, S. Maskey, W. Huang, L. Bucci, and M. Bronstein.
ETH-FDS seminar
Graph Convolutional Neural Networks: The Mystery of Generalization
Zoom Call
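A minimal sketch of what a spectral filter is in this context: a function applied to the eigenvalues of a graph Laplacian, acting on a graph signal by elementwise multiplication in the frequency domain. The example below (a ring graph and a heat-kernel-style low-pass filter) is purely illustrative; spectral GCN layers learn such filter functions from data.

```python
import numpy as np

def spectral_filter(adj: np.ndarray, signal: np.ndarray, g) -> np.ndarray:
    """Apply the spectral filter g to a graph signal: U g(Lambda) U^T x."""
    deg = adj.sum(axis=1)
    # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    L = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(L)             # graph frequencies and graph Fourier basis
    return U @ (g(lam) * (U.T @ signal))   # filtering = multiplication in frequency

# Example: a low-pass filter on a ring graph, smoothing a noisy sinusoidal signal.
n = 64
adj = np.zeros((n, n))
idx = np.arange(n)
adj[idx, (idx + 1) % n] = 1
adj[(idx + 1) % n, idx] = 1
rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * idx / n)
x = clean + 0.3 * rng.normal(size=n)
x_smooth = spectral_filter(adj, x, lambda lam: np.exp(-5 * lam))  # heat-kernel-style low pass
print(f"noise energy before {np.var(x - clean):.3f}, after {np.var(x_smooth - clean):.3f}")
```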
Fri 21.05.2021
15:00-16:00
Johannes Wiesel
Columbia University
Abstract
A number of researchers have independently introduced topologies on the set of laws of stochastic processes that extend the usual weak topology. Depending on the respective scientific background, this was motivated by applications and connections to various areas (e.g. Pflug-Pichler - stochastic programming, Hellwig - game theory, Aldous - stability of optimal stopping, Hoover-Keisler - model theory). Remarkably, all these seemingly independent approaches define the same adapted weak topology in finite discrete time. Our first main result is to construct an adapted variant of the empirical measure that consistently estimates the laws of stochastic processes in full generality. A natural compatible metric for the adapted weak topology is given by an adapted refinement of the Wasserstein distance, as established in the seminal works of Pflug-Pichler. Specifically, the adapted Wasserstein distance allows one to control the error in stochastic optimization problems, pricing and hedging problems, optimal stopping problems, etc. in a Lipschitz fashion. The second main result of this talk yields quantitative bounds for the convergence of the adapted empirical measure with respect to the adapted Wasserstein distance. Surprisingly, we obtain virtually the same optimal rates and concentration results that are known for the classical empirical measure with respect to the Wasserstein distance. Lastly, we construct a coefficient of association as an application of the above theory. This talk is based on joint work with Julio Backhoff, Daniel Bartl and Mathias Beiglböck.
Young Data Science Researcher Seminar Zurich
Estimating processes in adapted Wasserstein distance
Zoom Call
Thu 27.05.2021
16:00-17:00
Rina Foygel Barber
The University of Chicago
Abstract
In data analysis problems where we are not able to rely on distributional assumptions, what types of inference guarantees can still be obtained? Many popular methods, such as holdout methods, cross-validation methods, and conformal prediction, are able to provide distribution-free guarantees for predictive inference, but the problem of providing inference for the underlying regression function (for example, inference on the conditional mean E[Y|X]) is more challenging. If X takes only a small number of possible values, then inference on E[Y|X] is trivial to achieve. At the other extreme, if the features X are continuously distributed, we show that any confidence interval for E[Y|X] must have non-vanishing width, even as sample size tends to infinity - this is true regardless of smoothness properties or other desirable features of the underlying distribution. In between these two extremes, we find several distinct regimes - in particular, it is possible for distribution-free confidence intervals to have vanishing width if and only if the effective support size of the distribution of X is smaller than the square of the sample size. This work is joint with Yonghoon Lee. Bio: Rina Foygel Barber is a Louis Block Professor in the Department of Statistics at the University of Chicago. She was a NSF postdoctoral fellow during 2012-13 in the Department of Statistics at Stanford University, supervised by Emmanuel Candès. She received her PhD in Statistics at the University of Chicago in 2012, advised by Mathias Drton and Nati Srebro, and a MS in Mathematics at the University of Chicago in 2009. Prior to graduate school, she was a mathematics teacher at the Park School of Baltimore from 2005 to 2007.
ETH-FDS seminar
Distribution-free inference for regression: discrete, continuous, and in between
Room Details
Fri 28.05.2021
15:00-16:00
Morgane Austern
Microsoft Research New England
Abstract
Classical statistical inference relies on numerous tools from probability theory to study the properties of estimators. However, these same tools are often inadequate to study modern machine learning problems that frequently involve structured data (e.g. networks) or complicated dependence structures (e.g. dependent random matrices). In this talk, we extend universal limit theorems beyond the classical setting. Firstly, we consider distributionally "structured" and dependent random objects, i.e. random objects whose distributions are invariant under the action of an amenable group. We show, under mild moment and mixing conditions, a series of universal second and third order limit theorems: central limit theorems, concentration inequalities, the Wigner semi-circular law and Berry-Esseen bounds. The utility of these will be illustrated by a series of examples in machine learning, network and information theory. Secondly, by building on these results, we establish the asymptotic distribution of the cross-validated risk with the number of folds allowed to grow at an arbitrary rate. Using this, we study the statistical speed-up of cross-validation compared to a train-test split procedure, which reveals surprising results even when used on simple estimators.
Young Data Science Researcher Seminar Zurich
Asymptotics of learning on dependent and structured random objects
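For context, the cross-validated risk whose asymptotic distribution is studied in the second part of the talk, in its most basic form, compared with a single train-test split; the model, data, and fold count below are arbitrary illustrations.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
n, p = 400, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# K-fold cross-validated risk: average held-out error over K folds.
K = 10
cv_risks = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    cv_risks.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
cv_risk = np.mean(cv_risks)

# Single train-test split estimate of the same risk, with the same holdout fraction.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / K, random_state=0)
split_risk = mean_squared_error(y_te, Ridge(alpha=1.0).fit(X_tr, y_tr).predict(X_te))

print(f"{K}-fold CV risk: {cv_risk:.3f}   train-test split risk: {split_risk:.3f}")
```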
Thu 03.06.2021
17:00-18:00
Sewoong Oh
Allen School of Computer Science & Engineering, University of Washington
Abstract
Differential privacy has emerged as a standard requirement in a variety of applications ranging from the U.S. Census to data collected in commercial devices, initiating an extensive line of research in accurately and privately releasing statistics of a database. An increasing number of such databases consist of data from multiple sources, not all of which can be trusted. This leaves existing private analyses vulnerable to attacks by an adversary who injects corrupted data. We are in dire need of algorithms that guarantee privacy and robustness (to a fraction of the data being corrupted) simultaneously. However, even the simplest questions remain open. For the canonical problem of estimating the mean (and the covariance) from i.i.d. samples under both privacy and robustness, I will present an algorithm that is minimax optimal but requires exponential run-time. This is followed by an efficient algorithm that requires a factor of $d^{1/2}$ more samples when the samples are in $d$ dimensions. It remains an open question whether this computational gap can be closed, either with a more sample-efficient algorithm or a tighter lower bound.
ETH-FDS seminar
Robust and Differentially Private Estimation
Zoom Call
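Not the robust-and-private estimator from the talk, but a minimal sketch of the standard non-robust baseline it improves on: clip each sample to a ball and release the mean with Gaussian noise calibrated to the clipping radius (the Gaussian mechanism). The function name and parameters are illustrative; a constant fraction of adversarial samples inside the clipping ball can still bias this estimate, which is what robust private estimation has to address.

```python
import numpy as np

def private_mean(x: np.ndarray, radius: float, eps: float, delta: float,
                 rng: np.random.Generator) -> np.ndarray:
    """(eps, delta)-DP mean of n points in R^d via clipping plus Gaussian noise."""
    n, d = x.shape
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    clipped = x * np.minimum(1.0, radius / np.maximum(norms, 1e-12))  # project onto the ball
    # Replacing one sample moves the clipped sum by at most 2 * radius in L2 norm,
    # so the mean has L2 sensitivity 2 * radius / n.
    sensitivity = 2 * radius / n
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps     # Gaussian mechanism noise
    return clipped.mean(axis=0) + rng.normal(scale=sigma, size=d)

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, size=(10_000, 5))
est = private_mean(x, radius=5.0, eps=1.0, delta=1e-6, rng=rng)
print(np.round(est, 3))   # close to the true mean (1, ..., 1) for large n
```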
Fri 04.06.2021
16:30-17:30
Lihua Lei
Stanford University
Abstract
Recent progress in machine learning provides us with myriad powerful tools for prediction. When they are deployed for high-stakes decision-making, it is also crucial to have valid uncertainty quantification, which is challenging for complex predictive algorithms. The challenge is more pronounced in situations where the predicted targets are not fully observed in the data. This talk introduces conformal inference-based approaches to generate calibrated prediction intervals for two types of partially observed outcomes: (1) counterfactuals characterized by potential outcomes, observable only for those in a particular treatment arm, and (2) time-to-event outcomes, observable only for those whose event has occurred. When the missing data mechanism is known, as in randomized experiments, both approaches achieve desired coverage in finite samples without any assumption on the distribution of the outcome conditional on the covariates or the accuracy of the predictive algorithm. When the missing data mechanism is unknown, both approaches satisfy a doubly robust guarantee of coverage. We demonstrate on both simulated and real datasets that our prediction intervals are calibrated and relatively tight.
Young Data Science Researcher Seminar Zurich
Conformal Inference of Counterfactuals and Time-to-event Outcomes
Zoom Call
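For context, a minimal sketch of the split conformal baseline that these procedures extend, for a fully observed outcome and an arbitrary black-box regressor; the counterfactual and time-to-event versions in the talk additionally account for the missingness mechanism. The data and model below are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n, p, alpha = 3000, 5, 0.1
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=n)

# Split the data: fit the black-box model on one half, calibrate on the other.
fit, cal = np.arange(n) < n // 2, np.arange(n) >= n // 2
model = RandomForestRegressor(random_state=0).fit(X[fit], y[fit])

# Conformity scores on the calibration half: absolute residuals.
scores = np.abs(y[cal] - model.predict(X[cal]))
n_cal = scores.size
# Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

# Prediction interval for new points: model prediction +/- q, with marginal
# coverage >= 1 - alpha under exchangeability, for any predictive model.
X_new = rng.normal(size=(1000, p))
y_new = X_new[:, 0] ** 2 + X_new[:, 1] + rng.normal(size=1000)
pred = model.predict(X_new)
coverage = np.mean((y_new >= pred - q) & (y_new <= pred + q))
print(f"empirical coverage: {coverage:.3f} (target {1 - alpha})")
```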
Fri 11.06.2021
15:00-16:00
Yuhao Wang
Tsinghua University
Abstract
We consider estimation of average treatment effects given observational data with high-dimensional pretreatment variables. Existing methods for this problem typically assume some form of sparsity for the regression functions. In this work, we introduce a debiased inverse propensity score weighting (DIPW) scheme for average treatment effect estimation that delivers \sqrt{n}-consistent estimates of the average treatment effect when the propensity score follows a sparse logistic regression model; the regression functions are permitted to be arbitrarily complex. Our theoretical results quantify the price to pay for permitting the regression functions to be unestimable, which shows up as an inflation of the variance of the estimator compared to the semiparametric efficient variance by at most O(1) under mild conditions. Given the lack of assumptions on the regression functions, averages of transformed responses under each treatment may also be estimated at the \sqrt{n} rate, and so for example, the variances of the potential outcomes may be estimated. We show how confidence intervals centred on our estimates may be constructed, and also discuss an extension of the method to estimating projections of the heterogeneous treatment effect function.
Young Data Science Researcher Seminar Zurich
Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders
Zoom Call
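Not the debiased estimator from the talk, but a minimal sketch of the plain inverse propensity score weighting estimate it builds on, with a low-dimensional logistic propensity model on synthetic data; all names and parameters below are illustrative. The DIPW scheme adds a debiasing step so that a sparse high-dimensional propensity model still yields a sqrt(n)-consistent estimate of the average treatment effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n, p, tau = 5000, 10, 2.0                      # sample size, covariates, true ATE
X = rng.normal(size=(n, p))
propensity = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, propensity)                # treatment assignment depends on covariates
y = tau * T + X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)   # outcome with confounding

# Fit a logistic propensity model and form the inverse propensity weighting estimate:
#   ATE_hat = mean(T * Y / e(X)) - mean((1 - T) * Y / (1 - e(X))).
e_hat = LogisticRegression(C=1e3).fit(X, T).predict_proba(X)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)             # guard against extreme weights
ate_ipw = np.mean(T * y / e_hat) - np.mean((1 - T) * y / (1 - e_hat))
naive = y[T == 1].mean() - y[T == 0].mean()    # ignores confounding
print(f"naive difference in means: {naive:.2f}   IPW estimate: {ate_ipw:.2f}   truth: {tau:.2f}")
```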