Seminar overview


Autumn Semester 2022

Date & Time Speaker Title Location
Thu 08.09.2022
16:15-17:15
Aaditya Ramdas
Carnegie Mellon University
Abstract
Conformal prediction is a popular, modern technique for providing valid predictive inference for arbitrary machine learning models. Its validity relies on the assumptions of exchangeability of the data, and symmetry of the given model fitting algorithm as a function of the data. However, exchangeability is often violated when predictive models are deployed in practice. For example, if the data distribution drifts over time, then the data points are no longer exchangeable; moreover, in such settings, we might want to use an algorithm that treats recent observations as more relevant, which would violate the assumption that data points are treated symmetrically. This paper proposes a new methodology to deal with both aspects: we use weighted quantiles to introduce robustness against distribution drift, and design a new technique to allow for algorithms that do not treat data points symmetrically. Our algorithms are provably robust, with substantially less loss of coverage when exchangeability is violated due to distribution drift or other challenging features of real data, while also achieving the same coverage guarantees as existing conformal prediction methods if the data points are in fact exchangeable. Finally, we demonstrate the practical utility of these new tools with simulations and real-data experiments. This is joint work with Rina Barber, Emmanuel Candes and Ryan Tibshirani. A preprint is at https://arxiv.org/abs/2202.13415.

Bio: Aaditya Ramdas (PhD, 2015) is an assistant professor at Carnegie Mellon University, in the Departments of Statistics and Machine Learning. He was a postdoc at UC Berkeley (2015–2018) and obtained his PhD at CMU (2010–2015), receiving the Umesh K. Gavaskar Memorial Thesis Award. His undergraduate degree was in Computer Science from IIT Bombay (2005–2009). Aaditya was an inaugural inductee of the COPSS Leadership Academy, and a recipient of the 2021 Bernoulli New Researcher Award.
His work is supported by an NSF CAREER Award, an Adobe Faculty Research Award (2020), an ARL Grant on Safe Reinforcement Learning, the Block Center Grant for election auditing, a Google Research Scholar award (2022) for structured uncertainty quantification, amongst others. Aaditya's main theoretical and methodological research interests include selective and simultaneous inference, game-theoretic statistics and safe anytime-valid inference, and distribution-free uncertainty quantification for black-box ML. His areas of applied interest include privacy, neuroscience, genetics and auditing (elections, real-estate, financial), and his group's work has received multiple best paper awards.
ETH-FDS seminar
Conformal prediction beyond exchangeability (= quantifying uncertainty for black-box ML without distributional assumptions)
OAS J 10
ETH AI Center, OAS, Binzmühlestrasse 13, 8050 Zürich
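As a rough illustration of the weighted-quantile idea in the abstract, here is a minimal split-conformal sketch with exponentially decaying weights on the calibration scores, so that recent observations count more under drift. The decay scheme, parameter values, and function name are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def weighted_conformal_interval(cal_preds, cal_y, test_pred, alpha=0.1, decay=0.99):
    """Split-conformal interval with exponentially decaying weights on the
    calibration points (most recent point gets weight 1).  A simplified
    sketch of weighted conformal prediction; decay rate is illustrative."""
    scores = np.abs(cal_y - cal_preds)           # nonconformity scores
    n = len(scores)
    w = decay ** np.arange(n - 1, -1, -1)        # recent points -> larger weight
    w = np.append(w, 1.0)                        # weight for the test point
    w = w / w.sum()
    order = np.argsort(scores)
    cum = np.cumsum(w[:-1][order])               # weighted CDF of sorted scores
    idx = np.searchsorted(cum, 1 - alpha)        # smallest score reaching level 1-alpha
    q = scores[order][min(idx, n - 1)]           # conservative fallback at the top
    return test_pred - q, test_pred + q
```

With `decay=1.0` every calibration point gets equal weight and the construction reduces to ordinary split conformal prediction.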
Fri 09.09.2022
14:15-15:15
Bin Yu
UC Berkeley
Abstract
Occam's razor is a general principle for science: pursue the simplest explanation or model when the empirical support is the same for the explanations or models under consideration. To quantify simplicity, a complexity measure is necessary, and many such measures have been used in the literature, including uniform stability. Both complexity and stability are central to interpretable machine learning. In this talk, we first give an overview of interpretable machine learning and then delve into our recent work on decision trees, which are especially useful interpretable methods in high-stakes applications such as medicine and public policy. In particular, we show that decision trees are sub-optimal for additive regression models. To improve upon decision trees, we introduce a new method called Fast Interpretable Greedy-Tree Sums (FIGS) that fits additive trees while controlling the total number of splits. The state-of-the-art performance of FIGS will be illustrated through case studies for clinical decision rules.
Research Seminar in Statistics
Complexity, simplicity, and decision trees
HG G 19.1
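A boosting-style caricature of the tree-sum idea: greedily add depth-one trees (stumps) on residuals under a total split budget. This is not FIGS itself, which considers growing all trees in the sum jointly under the shared budget; the sketch only shows why a sum of small trees fits additive structure better than one deep tree:

```python
import numpy as np

def fit_stump(X, r):
    """Best single split (feature, threshold, leaf means) for residual r."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:        # all but the max leave both sides nonempty
            left = X[:, j] <= t
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if err < best_err:
                best_err, best = err, (j, t, r[left].mean(), r[~left].mean())
    return best

def tree_sum_fit(X, y, n_splits=5):
    """Greedy sum of stumps under a total-split budget (illustrative only)."""
    r = y.astype(float).copy()
    stumps = []
    for _ in range(n_splits):
        j, t, lv, rv = fit_stump(X, r)
        r -= np.where(X[:, j] <= t, lv, rv)      # fit the next stump to what remains
        stumps.append((j, t, lv, rv))
    return stumps

def tree_sum_predict(stumps, X):
    out = np.zeros(len(X))
    for j, t, lv, rv in stumps:
        out += np.where(X[:, j] <= t, lv, rv)
    return out
```

On data with additive structure, e.g. y = 1{x0 > 0} + 1{x1 > 0}, a small sum of stumps matches the target with far fewer splits than a single deep tree would need.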
Fri 30.09.2022
15:15-16:15
Alexander Henzi
ETH, Seminar for Statistics
Abstract
Statistical predictions should provide a quantification of forecast uncertainty. Ideally, this uncertainty quantification is in the form of a probability distribution for the outcome of interest conditional on the available information. Isotonic distributional regression (IDR) is a nonparametric method that allows one to derive probabilistic forecasts from a training data set of point predictions and observations, solely under the assumption of stochastic monotonicity. IDR does not require parameter tuning, and it has interesting properties when analyzed under the paradigm of maximizing sharpness subject to calibration. The method can serve as a natural benchmark for postprocessing forecasts both from statistical models and external sources, which is illustrated through applications in weather forecasting and medicine.
Research Seminar in Statistics
Isotonic distributional regression
HG G 19.1
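One way to see the core idea: under stochastic monotonicity, the conditional CDF P(Y ≤ t | X = x) is decreasing in x for each threshold t, so it can be fit by antitonic regression of the indicators 1{y_i ≤ t}. The sketch below does exactly this with a hand-rolled pool-adjacent-violators routine; it illustrates the per-threshold fit only, not the full IDR estimator and its properties:

```python
import numpy as np

def pav(y, increasing=True):
    """Pool-adjacent-violators: isotonic least-squares fit to the sequence y."""
    if not increasing:
        return -pav(-np.asarray(y))
    blocks = []                                   # [value, count] pools
    for v in map(float, y):
        blocks.append([v, 1.0])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()                 # merge adjacent violators
            blocks[-1][0] = (blocks[-1][0] * blocks[-1][1] + v2 * w2) / (blocks[-1][1] + w2)
            blocks[-1][1] += w2
    out = []
    for v, w in blocks:
        out.extend([v] * int(w))
    return np.array(out)

def idr_cdf(x, y, thresholds):
    """Simplified IDR-style fit at the training points: for each threshold t,
    P(Y <= t | X = x_i) is the antitonic regression of 1{y_i <= t} on x_i."""
    order = np.argsort(x)
    cdf = np.empty((len(x), len(thresholds)))
    for k, t in enumerate(thresholds):
        ind = (y[order] <= t).astype(float)
        cdf[order, k] = pav(ind, increasing=False)
    return cdf                                    # rows: observations, columns: thresholds
```

Because pool-adjacent-violators is order-preserving in its input, the fitted CDF values are automatically monotone across thresholds, so each row is a valid (discretized) distribution function.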
Fri 07.10.2022
17:15-18:15
Daniel A. Spielman
Yale University
Abstract
In randomized experiments, we randomly assign the treatment that each experimental subject receives. Randomization can help us accurately estimate the difference in treatment effects with high probability. It also helps ensure that the groups of subjects receiving each treatment are similar. If we have already measured characteristics of our subjects that we think could influence their response to treatment, then we can increase the precision of our estimates of treatment effects by balancing those characteristics between the groups. We show how to use the recently developed Gram-Schmidt Walk algorithm of Bansal, Dadush, Garg, and Lovett to efficiently assign treatments to subjects in a way that balances known characteristics without sacrificing the benefits of randomization. These allow us to obtain more accurate estimates of treatment effects to the extent that the measured characteristics are predictive of treatment effects, while also bounding the worst-case behavior when they are not. This is joint work with Chris Harshaw, Fredrik Sävje, and Peng Zhang.
ETH-FDS Stiefel Lectures
Balancing covariates in randomized experiments
HG F 30
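To make the balance objective concrete, here is a crude rerandomization sketch. Note this is NOT the Gram-Schmidt Walk design from the talk (which achieves balance with provable worst-case guarantees, without discarding randomness); it only illustrates the imbalance quantity that a balanced design drives down. Function name and parameters are illustrative:

```python
import numpy as np

def balanced_assignment(X, n_draws=1000, seed=0):
    """Draw random +/-1 treatment assignments and keep the one minimizing
    the covariate imbalance ||X^T z||_2 between the two groups.  A crude
    rerandomization baseline, shown only to illustrate the objective."""
    rng = np.random.default_rng(seed)
    best_z, best_imb = None, np.inf
    for _ in range(n_draws):
        z = rng.choice([-1.0, 1.0], size=len(X))  # +/-1 treatment labels
        imb = np.linalg.norm(X.T @ z)             # covariate imbalance
        if imb < best_imb:
            best_imb, best_z = imb, z
    return best_z, best_imb
```

When the covariates in X are predictive of outcomes, a lower value of ‖Xᵀz‖ translates into a lower-variance difference-in-means estimate of the treatment effect.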
Fri 21.10.2022
15:15-16:15
Mona Azadkia
ETH Zürich
Abstract
Consider the regression problem where the response Y ∈ ℝ and the covariate X ∈ ℝ^d for d ≥ 1 are unmatched. Under this scenario, we do not have access to pairs of observations from the distribution of (X, Y); instead, we have separate datasets {Y_i}_{i=1}^n and {X_j}_{j=1}^m, possibly collected from different sources. We study this problem assuming that the regression function is linear and the noise distribution is known or can be estimated. We introduce an estimator of the regression vector based on deconvolution and demonstrate its consistency and asymptotic normality under an identifiability assumption. In the general case, we show that our estimator (DLSE: Deconvolution Least Squares Estimator) is consistent in terms of an extended ℓ2 norm. Using this observation, we devise a method for semi-supervised learning, i.e., when we have access to a small sample of matched pairs (X_k, Y_k). Several applications with synthetic and real datasets are considered to illustrate the theory.
Research Seminar in Statistics
Linear regression with unmatched data: a deconvolution perspective
HG G 19.1
Wed 26.10.2022
16:00-17:00
Muriel Pérez
Centrum Wiskunde & Informatica (CWI) Amsterdam
Abstract
We study worst-case-growth-rate-optimal (GROW) E-statistics for hypothesis testing between two dominated group models. If the underlying group G acts freely on the observation space, there exists a maximally invariant statistic of the data. We show that among all E-statistics, invariant or not, the likelihood ratio of the maximally invariant statistic is GROW and that an anytime-valid test can be based on it. By virtue of a representation theorem of Wijsman, the GROW E-statistic is equivalent to a Bayes factor with a right Haar prior on G. Such Bayes factors are known to have good frequentist and Bayesian properties. We show that reductions through sufficiency and invariance can be made in tandem without affecting optimality. A crucial assumption on the group G is its amenability, a well-known group-theoretical condition, which holds, for instance, in general scale-location families. Our results also apply to finite-dimensional linear regression.
Young Data Science Researcher Seminar Zurich
E-statistics, group invariance and anytime-valid testing
Zoom Call
Thu 27.10.2022
16:15-17:15
Samory K. Kpotufe
Columbia University
Abstract
In bandits with distribution shifts, one aims to automatically adapt to unknown changes in the reward distribution, and to restart exploration when necessary. While this problem has received attention for many years, no adaptive procedure was known until a recent breakthrough of Auer et al. (2018, 2019), which guarantees an optimal regret of (LT)^{1/2}, for T rounds and L stationary phases. However, while this rate is tight in the worst case, we show that significantly faster rates are possible, adaptively, if few changes in distribution are actually severe, e.g., involve no change in best arm. This is arrived at via a new notion of 'significant change', which recovers previous notions of change, and applies in both stochastic and adversarial settings (generally studied separately). If time permits, I'll discuss the more general case of contextual bandits, i.e., where rewards depend on contexts, and highlight key challenges that arise. This is based on ongoing work with Joe Suk.
ETH-FDS seminar
Tracking Most Significant Changes in Bandits
HG F 3
Thu 03.11.2022
16:15-17:15
Holger Rauhut
RWTH Aachen
Abstract
Deep neural networks are usually trained by minimizing a non-convex loss functional via (stochastic) gradient descent methods. Unfortunately, the convergence properties are not well understood. Moreover, a puzzling empirical observation is that learning neural networks with a number of parameters exceeding the number of training examples often leads to zero loss, i.e., the network exactly interpolates the data. Nevertheless, it generalizes very well to unseen data, which is in stark contrast to intuition from classical statistics which would predict a scenario of overfitting. A current working hypothesis is that the chosen optimization algorithm has a significant influence on the selection of the learned network. In fact, in this overparameterized context there are many global minimizers, so that the optimization method induces an implicit bias on the computed solution. It seems that gradient descent methods and their stochastic variants favor networks of low complexity (in a suitable sense to be understood), and, hence, appear to be very well suited for large classes of real data. Initial attempts at understanding the implicit bias phenomenon consider the simplified setting of linear networks, i.e., (deep) factorizations of matrices. This has revealed a surprising relation to the field of low rank matrix recovery (a variant of compressive sensing) in the sense that gradient descent favors low rank matrices in certain situations. Moreover, restricting further to diagonal matrices, or equivalently factorizing the entries of a vector to be recovered, leads to connections to compressive sensing and l1-minimization. After giving a general introduction to these topics, the talk will concentrate on results by the speaker on the convergence of gradient flows and gradient descent for learning linear neural networks and on the implicit bias towards low rank and sparse solutions.
ETH-FDS seminar
The implicit bias of gradient descent for learning linear neural networks
HG F 3
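The diagonal case mentioned in the abstract is easy to simulate: parameterize the unknown vector as x = u⊙u − v⊙v and run plain gradient descent from a small initialization. The sketch below is a minimal demonstration of this setting (step size, initialization scale, and step count are illustrative choices, not tuned values from the talk):

```python
import numpy as np

def factorized_gd(A, b, steps=20000, lr=0.01, init=1e-3):
    """Gradient descent on the overparameterized least-squares objective
    ||A(u*u - v*v) - b||^2 with small initialization.  For this 'diagonal
    linear network' parameterization, gradient descent is implicitly biased
    toward sparse / small-l1 solutions x = u*u - v*v."""
    n = A.shape[1]
    u = np.full(n, init)
    v = np.full(n, init)
    for _ in range(steps):
        grad_x = 2 * A.T @ (A @ (u * u - v * v) - b)  # gradient wrt x
        u -= lr * grad_x * 2 * u                      # chain rule: dx/du = 2u
        v += lr * grad_x * 2 * v                      # chain rule: dx/dv = -2v
    return u * u - v * v
```

With far fewer measurements than unknowns the system Ax = b has many solutions; the multiplicative dynamics induced by the factorization tend to grow only the coordinates the data insists on, which is the sparsity bias the talk analyzes.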
Thu 10.11.2022
16:30-17:30
Asaf Weinstein
Hebrew University of Jerusalem
Abstract
Suppose you observe Y_i = mu_i + e_i, where the e_i are i.i.d. from some fixed and known zero-mean distribution, and the mu_i are fixed and unknown parameters. In this "canonical" setting, a simultaneous inference statistical problem, as we define it here, is such that no preference is given to any of the mu_i's before seeing the data. For example: estimating all mu_i's under sum-of-squares loss; testing H_{0i}: mu_i = 0 simultaneously for i = 1, ..., n while controlling FDR; estimating mu_{i*} where i* = argmax_i Y_i under squared loss; or even testing the global null H_0 = ∩_{i=1}^n H_{0i}. What is the optimal solution to a simultaneous inference problem? In a Bayesian setup, i.e., when the mu_i are assumed random, the answer is conceptually straightforward. In the frequentist setup considered here, the answer is far less obvious, and various approaches exist for defining notions of frequentist optimality and for designing procedures that pursue them. In this work we define the optimal solution to a simultaneous inference problem to be the procedure that, for the true mu_i's, has the best performance among all procedures that are oblivious to the labels i = 1, ..., n. This is a natural and arguably the weakest condition one could possibly impose. For such procedures we observe that the problem can be cast as a Bayesian problem with respect to a particular prior, which immediately reveals an explicit form for the optimal solution. The argument actually holds more generally for any permutation-invariant model, e.g. when the e_i above are exchangeable, not independent, noise terms, which is sometimes a much more realistic assumption. Finally, we discuss the relation to Robbins's empirical Bayes approach, and explain why nonparametric empirical Bayes procedures should, at least when the e_i's are independent, asymptotically attain the optimal performance uniformly in the parameter value.
Young Data Science Researcher Seminar Zurich
On permutation invariant problems in simultaneous statistical inference
Zoom Call
Thu 17.11.2022
16:00-17:00
Qiuqi Wang
University of Waterloo
Abstract
In the recent Basel Accords, the Expected Shortfall (ES) replaces the Value-at-Risk (VaR) as the standard risk measure for market risk in the banking sector, making it the most important risk measure in financial regulation. One of the most challenging tasks in risk modeling practice is to backtest ES forecasts provided by financial institutions. Ideally, backtesting should be done based only on daily realized portfolio losses without imposing specific models. Recently, the notion of e-values has gained attention as potential alternatives to p-values as measures of uncertainty, significance and evidence. We use e-values and e-processes to construct a model-free backtesting procedure for ES using a concept of universal e-statistics, which can be naturally generalized to many other risk measures and statistical quantities.
Young Data Science Researcher Seminar Zurich
E-backtesting risk measures
Zoom Call
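A toy version of the e-process idea, written for VaR rather than ES to keep it short (the talk's contribution is precisely the harder ES case): if a level-alpha VaR forecast is valid, the exceedance indicator divided by alpha is an e-value, and betting a fraction of wealth on it yields a nonnegative supermartingale that can be monitored at any time. The betting fraction and function name here are illustrative assumptions:

```python
import numpy as np

def var_eprocess(losses, var_forecasts, alpha=0.05, lam=0.5):
    """Toy e-process for backtesting VaR forecasts.  Under a valid forecast,
    P(loss_t > VaR_t) <= alpha, so e_t = 1{loss_t > VaR_t}/alpha has mean
    at most 1 under the null, and the wealth process below is a nonnegative
    supermartingale.  Rejecting when wealth exceeds 1/level (e.g. 20 for
    level 0.05) gives an anytime-valid test."""
    wealth = [1.0]
    for loss, var in zip(losses, var_forecasts):
        e = (loss > var) / alpha                      # e-value for this day
        wealth.append(wealth[-1] * (1 - lam + lam * e))  # bet fraction lam on it
    return np.array(wealth)
```

Against a forecast that systematically understates risk, the wealth grows exponentially and crosses the rejection threshold quickly; against a valid forecast it tends to shrink, and by Ville's inequality it rarely ever crosses the threshold.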
Fri 18.11.2022
15:15-16:15
Eva Ceulemans
KU Leuven
Abstract
Intensive longitudinal studies (e.g., experience sampling studies) have demonstrated that detecting changes in statistical features across time is crucial to better capture and understand psychological phenomena. For example, it has been uncovered that emotional episodes are characterized by changes in both means and correlations. In psychopathology research, recent evidence revealed that changes in means, variance, autocorrelation and correlation of experience sampling data can serve as early warning signs of an upcoming relapse into depression. In this talk, I will discuss flexible statistical tools for retrospectively and prospectively capturing such changes. First, I will present the KCP-RS framework, a retrospective change point detection framework that can be tailored to capture changes in not only the means but in any statistic that is relevant to the researcher. Second, I will turn to the prospective change detection problem, where I will argue that statistical process control procedures, originally developed for monitoring industrial processes, are promising tools but need tweaking to the problem at hand.
ZüKoSt Zürcher Kolloquium über Statistik
KCP-RS and statistical process control: Flexible tools to flag changes in time series
HG G 19.1
Zoom Call
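As background on the prospective side of the talk, here is a textbook statistical-process-control tool of the kind the speaker proposes to adapt: a two-sided CUSUM monitor for mean shifts in a stream. The parameter values k and h are conventional illustrative choices (in units of the in-control standard deviation), not recommendations from the talk:

```python
import numpy as np

def cusum_mean_shift(x, target_mean, target_sd, k=0.5, h=5.0):
    """Two-sided CUSUM chart for prospectively detecting a mean shift.
    Standardizes the stream against the in-control mean/sd, accumulates
    upward and downward drift beyond the allowance k, and raises an alarm
    when either cumulative sum exceeds the decision limit h.
    Returns the index of the first alarm, or None if no alarm occurs."""
    z = (np.asarray(x, dtype=float) - target_mean) / target_sd
    s_hi = s_lo = 0.0
    for i, zi in enumerate(z):
        s_hi = max(0.0, s_hi + zi - k)   # accumulates evidence of upward shift
        s_lo = max(0.0, s_lo - zi - k)   # accumulates evidence of downward shift
        if s_hi > h or s_lo > h:
            return i
    return None
```

On a stream that is on-target for 50 steps and then jumps by two standard deviations, the chart stays silent through the stable phase and alarms a few observations after the shift; monitoring several such statistics at once (mean, variance, autocorrelation) is the kind of extension the talk discusses for experience sampling data.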
Fri 25.11.2022
15:15-16:15
Mats Stensrud
EPFL Lausanne
Abstract
Investigators often express interest in effects that quantify the mechanism by which a treatment (exposure) affects an outcome. In this presentation, I will discuss how to formulate and choose effects that quantify mechanisms, beyond conventional average causal effects. I will consider the perspective of a decision maker, such as a patient, doctor or drug developer. I will emphasize that a careful articulation of a practically useful research question should map to decision making either at this point in time or in the future. A common feature of effects that are practically useful is that they correspond to possibly hypothetical but well-defined interventions in identifiable (sub)populations. To illustrate my points, I will consider examples that were recently used to motivate consideration of mechanistic effects, e.g. in clinical trials. In all of these examples, I will suggest different causal effects that correspond to explicit research questions of practical interest. These proposed effects also require less stringent identification assumptions.
ZüKoSt Zürcher Kolloquium über Statistik
Bridging data and decisions: How strings of numbers can honestly guide future policies
HG G 19.1
Thu 01.12.2022
16:00-17:00
Niklas Pfister
University of Copenhagen
Abstract
Causal models can provide good predictions even under distributional shifts. This observation has led to the development of various methods that use causal learning to improve the generalization performance of predictive models. In this talk, we consider this type of approach for instrumental variable (IV) models. IV allows us to identify a causal function between covariates X and a response Y, even in the presence of unobserved confounding. However, in many practical prediction settings the causal function is not fully identifiable. We consider two approaches for dealing with this under-identified setting: (1) by adding a sparsity constraint and (2) by introducing the invariant most predictive (IMP) model, which deals with the under-identifiability by selecting the most predictive model among all feasible IV solutions. Furthermore, we analyze to which types of distributional shifts these models generalize.
Research Seminar in Statistics
Distribution Generalization and Identifiability in IV Models
HG G 19.1
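For readers less familiar with IV: in the classical identified case, the causal coefficient can be estimated by textbook two-stage least squares, which is the baseline that the talk's under-identified setting generalizes. A minimal numpy sketch (function name is mine):

```python
import numpy as np

def two_stage_least_squares(Z, X, y):
    """Textbook 2SLS: regress X on the instruments Z (first stage), then
    regress y on the fitted values of X (second stage).  Valid when the
    instruments affect y only through X, even under unobserved confounding."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    X_hat = Z1 @ np.linalg.lstsq(Z1, X, rcond=None)[0]    # first stage
    X1 = np.column_stack([np.ones(len(X_hat)), X_hat])
    return np.linalg.lstsq(X1, y, rcond=None)[0]          # [intercept, coefficient]
```

On simulated data with a hidden confounder, ordinary least squares of y on X is biased, while 2SLS using a valid instrument recovers the causal coefficient.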
Fri 02.12.2022
15:15-16:15
Gaudenz Koeppel
Chief Analytics Officer at Axpo Trading & Sales
Abstract
In this talk, Gaudenz Koeppel, Chief Analytics Officer at Axpo Trading & Sales, will talk us through their journey of building machine learning models for power trading applications and taking them into 24/7 operation. Gaudenz will expand on some of the learnings, the importance of explainers as well as how and what aspects of such models must be monitored and how this monitoring information creates new insights. This will be a very practical, hands-on talk.
ZüKoSt Zürcher Kolloquium über Statistik
Machine Learning Models in Energy Markets
HG G 19.1
Thu 08.12.2022
16:00-17:00
Yuling Yan
Princeton University
Abstract
Many high-dimensional problems involve reconstruction of a low-rank matrix from highly incomplete and noisy observations. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained low-rank estimates, and how to construct valid yet short confidence intervals for the unknown low-rank matrix. In this talk, I will discuss how to perform inference and uncertainty quantification for two widely encountered low-rank models: (1) noisy matrix completion, and (2) heteroskedastic PCA with missing data. For both problems, we identify statistically efficient estimators that admit non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals for, say, the unseen entries of the low-rank matrix of interest. Our inferential procedures do not rely on sample splitting, thus avoiding unnecessary loss of data efficiency. All this is accomplished by a powerful leave-one-out analysis framework that originated from probability and random matrix theory. This is based on joint work with Yuxin Chen, Jianqing Fan and Cong Ma.
Young Data Science Researcher Seminar Zurich
Inference and Uncertainty Quantification for Low-Rank Models
Zoom Call
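The starting point for such inference is typically a simple spectral estimate of the low-rank matrix. Here is a hedged sketch of that baseline (function name and choices are mine for illustration; the talk's procedures add the non-asymptotic distributional characterizations and corrections that make valid confidence intervals possible):

```python
import numpy as np

def spectral_completion(Y_obs, mask, rank):
    """Spectral baseline for noisy matrix completion: rescale the observed
    entries by the inverse (estimated) sampling rate so the result is an
    unbiased proxy for the full matrix, then take its best rank-r
    approximation via a truncated SVD."""
    p_hat = mask.mean()                          # estimated sampling rate
    Y_scaled = np.where(mask, Y_obs, 0.0) / p_hat
    U, s, Vt = np.linalg.svd(Y_scaled, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-r reconstruction
```

Even with half-or-more of the entries missing, the rank truncation suppresses most of the noise introduced by the missing-data mask, which is what makes the subsequent entrywise inference tractable.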
Thu 15.12.2022
16:00-17:00
Yixin Wang
University of Michigan
Abstract
Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data like images and texts. Ideally, such a representation should efficiently capture non-spurious features of the data. It should also be disentangled so that we can interpret what feature each of its dimensions captures. However, these desiderata are often intuitively defined and challenging to quantify or enforce. In this talk, we take on a causal perspective of representation learning. We show how desiderata of representation learning can be formalized using counterfactual notions, enabling metrics and algorithms that target efficient, non-spurious, and disentangled representations of data. We discuss the theoretical underpinnings of the algorithm and illustrate its empirical performance in both supervised and unsupervised representation learning. This is joint work with Michael Jordan: https://arxiv.org/abs/2109.03795
Young Data Science Researcher Seminar Zurich
Representation Learning: A Causal Perspective
Zoom Call
Fri 16.12.2022
15:15-16:15
Weijie Su
Wharton, University of Pennsylvania
Abstract
In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of the layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training. This is based on joint work with Cong Fang, Hangfeng He, and Qi Long (arXiv:2101.12699). In the second part, we study how real-world deep neural networks process data in the interior layers. Our finding is a simple and quantitative law that governs how deep neural networks separate data according to class membership throughout all layers for classification. This law shows that each layer improves data separation at a constant geometric rate, and its emergence is observed in an authoritative collection of network architectures and datasets during training. This law offers practical guidelines for designing architectures, improving model robustness and out-of-sample performance, as well as interpreting the predictions. This is based on joint work with Hangfeng He (arXiv:2210.17020).
Research Seminar in Statistics
Some Geometric Patterns of Real-World Deep Neural Networks
HG G 19.1