Seminar overview
Autumn Semester 2022
Date & Time | Speaker | Title | Location |
---|---|---|---|
Thu 08.09.2022 16:15-17:15 |
Aaditya Ramdas Carnegie Mellon University |
Abstract
Conformal prediction is a popular, modern technique for providing valid predictive inference for arbitrary machine learning models. Its validity relies on the assumptions of exchangeability of the data, and symmetry of the given model fitting algorithm as a function of the data. However, exchangeability is often violated when predictive models are deployed in practice. For example, if the data distribution drifts over time, then the data points are no longer exchangeable; moreover, in such settings, we might want to use an algorithm that treats recent observations as more relevant, which would violate the assumption that data points are treated symmetrically.
This paper proposes a new methodology to deal with both aspects: we use weighted quantiles to introduce robustness against distribution drift, and design a new technique to allow for algorithms that do not treat data points symmetrically. Our algorithms are provably robust, with substantially less loss of coverage when exchangeability is violated due to distribution drift or other challenging features of real data, while also achieving the same algorithm and coverage guarantees as existing conformal prediction methods if the data points are in fact exchangeable. Finally, we demonstrate the practical utility of these new tools with simulations and real-data experiments.
This is joint work with Rina Barber, Emmanuel Candes and Ryan Tibshirani. A preprint is at https://arxiv.org/abs/2202.13415.
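The core computational step, a weighted quantile of nonconformity scores, can be sketched in a few lines. The geometric decay scheme and function names below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def weighted_quantile(scores, weights, q):
    """Smallest score whose cumulative (normalized) weight reaches q."""
    idx = np.argsort(scores)
    s = np.asarray(scores, float)[idx]
    w = np.asarray(weights, float)[idx]
    cw = np.cumsum(w) / np.sum(w)
    return s[min(np.searchsorted(cw, q), len(s) - 1)]

def drift_robust_interval(y_hat, residuals, alpha=0.1, decay=0.95):
    """Toy drift-robust conformal interval: geometrically decaying weights
    make recent residuals count more (an illustrative weighting choice)."""
    n = len(residuals)
    weights = decay ** np.arange(n - 1, -1, -1)  # newest residual gets weight 1
    r = weighted_quantile(np.abs(residuals), weights, 1 - alpha)
    return y_hat - r, y_hat + r
```

With uniform weights this reduces to the usual split-conformal quantile; with decay < 1, recent residuals dominate, so the interval tracks a drifting distribution.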
Bio: Aaditya Ramdas (PhD, 2015) is an assistant professor at Carnegie Mellon University, in the Departments of Statistics and Machine Learning. He was a postdoc at UC Berkeley (2015–2018) and obtained his PhD at CMU (2010–2015), receiving the Umesh K. Gavaskar Memorial Thesis Award. His undergraduate degree was in Computer Science from IIT Bombay (2005-09). Aaditya was an inaugural inductee of the COPSS Leadership Academy, and a recipient of the 2021 Bernoulli New Researcher Award. His work is supported by an NSF CAREER Award, an Adobe Faculty Research Award (2020), an ARL Grant on Safe Reinforcement Learning, the Block Center Grant for election auditing, a Google Research Scholar award (2022) for structured uncertainty quantification, amongst others.
Aaditya's main theoretical and methodological research interests include selective and simultaneous inference, game-theoretic statistics and safe anytime-valid inference, and distribution-free uncertainty quantification for black-box ML. His areas of applied interest include privacy, neuroscience, genetics and auditing (elections, real-estate, financial), and his group's work has received multiple best paper awards.
ETH-FDS seminar: Conformal prediction beyond exchangeability (= quantifying uncertainty for black-box ML without distributional assumptions) |
OAS J 10 ETH AI Center, OAS, Binzmühlestrasse 13, 8050 Zürich |
Fri 09.09.2022 14:15-15:15 |
Bin Yu UC Berkeley |
Abstract
Occam's razor is a general principle in science: pursue the simplest explanation or model when the empirical evidence is the same for the explanations or models under consideration. To quantify simplicity, a complexity measure is necessary, and many such measures have been used in the literature, including uniform stability. Both complexity and stability are central to interpretable machine learning.
In this talk, we first give an overview of interpretable machine learning and then delve into our recent work on decision trees, which are especially useful interpretable methods in high-stakes applications such as medicine and public policy. In particular, we show that decision trees are sub-optimal for additive regression models. To improve upon decision trees, we introduce a new method called Fast Interpretable Greedy-Tree Sums (FIGS) that fits additive trees while controlling the total number of splits. The state-of-the-art performance of FIGS will be illustrated through case studies for clinical decision rules.
Research Seminar in Statistics: Complexity, simplicity, and decision trees |
HG G 19.1 |
Fri 30.09.2022 15:15-16:15 |
Alexander Henzi ETH, Seminar for Statistics |
Abstract
Statistical predictions should provide a quantification of forecast uncertainty. Ideally, this uncertainty quantification takes the form of a probability distribution for the outcome of interest conditional on the available information. Isotonic distributional regression (IDR) is a nonparametric method that allows one to derive probabilistic forecasts from a training data set of point predictions and observations, solely under the assumption of stochastic monotonicity. IDR does not require parameter tuning, and it has interesting properties when analyzed under the paradigm of maximizing sharpness subject to calibration. The method can serve as a natural benchmark for postprocessing forecasts both from statistical models and external sources, which is illustrated through applications in weather forecasting and medicine.
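The core of IDR under stochastic monotonicity can be sketched with pool-adjacent-violators (PAVA): for each threshold t, the estimated CDF F_x(t) is the antitonic least-squares fit of the indicators 1{y <= t} in x. This is a toy version for illustration, not the efficient algorithm used in practice:

```python
import numpy as np

def pava(z):
    """Pool-adjacent-violators: least-squares nondecreasing fit to z."""
    vals, wts = [], []
    for v in map(float, z):
        vals.append(v); wts.append(1.0)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2:] = [(wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w]
            wts[-2:] = [w]
    return np.repeat(vals, np.asarray(wts, dtype=int))

def idr_cdf(x, y, thresholds):
    """Toy IDR: predictive CDFs at the training points. For each threshold t,
    F_x(t) is the antitonic (nonincreasing in x) fit of 1{y <= t}, reflecting
    stochastic monotonicity of Y in x."""
    order = np.argsort(x)
    ys = np.asarray(y)[order]
    rows = []
    for t in thresholds:
        z = (ys <= t).astype(float)
        rows.append(pava(z[::-1])[::-1])  # antitonic fit = reversed isotonic fit
    return np.array(rows)                 # rows: thresholds; cols: sorted x
```

Each row is nonincreasing in x by construction, and no tuning parameter appears anywhere, which is the point of the method.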
Research Seminar in Statistics: Isotonic distributional regression |
HG G 19.1 |
Fri 07.10.2022 17:15-18:15 |
Daniel A. Spielman Yale University |
Abstract
In randomized experiments, we randomly assign the treatment that each experimental subject receives. Randomization can help us accurately estimate the difference in treatment effects with high probability. It also helps ensure that the groups of subjects receiving each treatment are similar. If we have already measured characteristics of our subjects that we think could influence their response to treatment, then we can increase the precision of our estimates of treatment effects by balancing those characteristics between the groups. We show how to use the recently developed Gram-Schmidt Walk algorithm of Bansal, Dadush, Garg, and Lovett to efficiently assign treatments to subjects in a way that balances known characteristics without sacrificing the benefits of randomization. These allow us to obtain more accurate estimates of treatment effects to the extent that the measured characteristics are predictive of treatment effects, while also bounding the worst-case behavior when they are not. This is joint work with Chris Harshaw, Fredrik Sävje, and Peng Zhang.
ETH-FDS Stiefel Lectures: Balancing covariates in randomized experiments |
HG F 30 |
Fri 21.10.2022 15:15-16:15 |
Mona Azadkia ETH Zürich |
Abstract
Consider the regression problem where the response Y ∈ ℝ and the covariate X ∈ ℝ^d for d ≥ 1 are *unmatched*. Under this scenario, we do not have access to pairs of observations from the distribution of (X, Y); instead, we have separate datasets {Y_i}_{i=1}^n and {X_j}_{j=1}^m, possibly collected from different sources. We study this problem assuming that the regression function is linear and the noise distribution is known or can be estimated. We introduce an estimator of the regression vector based on deconvolution and demonstrate its consistency and asymptotic normality under an identifiability assumption. In the general case, we show that our estimator (DLSE: Deconvolution Least Squares Estimator) is consistent in terms of an extended ℓ2 norm. Using this observation, we devise a method for semi-supervised learning, i.e., when we have access to a small sample of matched pairs (X_k, Y_k). Several applications with synthetic and real datasets are considered to illustrate the theory.
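A one-dimensional moment-matching toy example illustrates why a known noise scale makes the slope identifiable (up to sign) without matched pairs. This is only a sketch, not the deconvolution-based DLSE, and all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta, intercept, sigma = 3.0, 1.0, 0.5           # invented ground truth

x = rng.normal(2.0, 1.5, n)                      # unmatched covariate sample
x2 = rng.normal(2.0, 1.5, n)                     # independent draw generating Y
y = intercept + beta * x2 + rng.normal(0.0, sigma, n)  # unmatched responses

# With sigma known, second moments pin down |beta|:
#   Var(Y) = beta^2 Var(X) + sigma^2,   E[Y] = intercept + beta E[X].
b_abs = np.sqrt(max((y.var() - sigma ** 2) / x.var(), 0.0))
a_hat = y.mean() - b_abs * x.mean()              # assumes beta > 0: the sign
                                                 # is not identified from these moments
```

The sign ambiguity in `b_abs` is exactly the kind of issue an identifiability assumption must rule out.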
Research Seminar in Statistics: Linear regression with unmatched data: a deconvolution perspective |
HG G 19.1 |
Wed 26.10.2022 16:00-17:00 |
Muriel Pérez Centrum Wiskunde & Informatica (CWI) Amsterdam |
Abstract
We study worst-case-growth-rate-optimal (GROW) E-statistics for hypothesis testing between two dominated group models. If the underlying group G acts freely on the observation space, there exists a maximally invariant statistic of the data. We show that among all E-statistics, invariant or not, the likelihood ratio of the maximally invariant statistic is GROW and that an anytime-valid test can be based on it. By virtue of a representation theorem of Wijsman, the GROW E-statistic is equivalent to a Bayes factor with a right Haar prior on G. Such Bayes factors are known to have good frequentist and Bayesian properties. We show that reductions through sufficiency and invariance can be made in tandem without affecting optimality. A crucial assumption on the group G is its amenability, a well-known group-theoretical condition, which holds, for instance, in general scale-location families. Our results also apply to finite-dimensional linear regression.
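A minimal likelihood-ratio e-process (a plain Gaussian instance with no group structure, so not the maximally invariant construction of the talk) illustrates how an e-statistic yields an anytime-valid test:

```python
import numpy as np

def e_process(xs, mu1=1.0):
    """Running likelihood-ratio e-process for H0: X_i ~ N(0,1) vs H1: N(mu1,1).
    By Ville's inequality, P_H0(sup_t E_t >= 1/alpha) <= alpha, so rejecting
    the first time E_t >= 1/alpha is an anytime-valid level-alpha test."""
    xs = np.asarray(xs, float)
    log_lr = mu1 * xs - mu1 ** 2 / 2   # log of the N(mu1,1)/N(0,1) density ratio
    return np.exp(np.cumsum(log_lr))

rng = np.random.default_rng(1)
under_alt = e_process(rng.normal(1.0, 1.0, 200))   # data from the alternative
```

Under the null the process drifts toward zero, so the threshold 1/alpha is rarely crossed; under the alternative it grows exponentially, and one may stop and reject at any time the threshold is exceeded.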
Young Data Science Researcher Seminar Zurich: E-statistics, group invariance and anytime-valid testing |
Zoom Call |
Thu 27.10.2022 16:15-17:15 |
Samory K. Kpotufe Columbia University |
Abstract
In bandits with distribution shifts, one aims to automatically adapt to unknown changes in the reward distribution, and to restart exploration when necessary. While this problem has received attention for many years, no adaptive procedure was known until a recent breakthrough of Auer et al. (2018, 2019), which guarantees an optimal regret of order (LT)^{1/2}, for T rounds and L stationary phases.
However, while this rate is tight in the worst case, we show that significantly faster rates are possible, adaptively, if few changes in distribution are actually severe, e.g., involve no change in best arm. This is arrived at via a new notion of 'significant change', which recovers previous notions of change, and applies in both stochastic and adversarial settings (generally studied separately).
If time permits, I’ll discuss the more general case of contextual bandits, i.e., where rewards depend on contexts, and highlight key challenges that arise.
This is based on ongoing work with Joe Suk.
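The restart idea can be made concrete with a toy sketch: UCB1 plus a naive restart rule that wipes all statistics when an arm's recent behaviour contradicts its history. This heuristic is a crude stand-in for the principled significant-shift detection of the talk, and all parameter choices are illustrative:

```python
import numpy as np

def ucb_with_restarts(reward_fn, n_arms, horizon, window=50, thresh=0.5, seed=0):
    """UCB1 with a naive restart rule: reset all statistics when an arm's
    recent mean drifts far from its historical mean."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms); sums = np.zeros(n_arms)
    recent = [[] for _ in range(n_arms)]
    picks = []
    for t in range(horizon):
        if np.any(counts == 0):
            arm = int(np.argmin(counts))          # pull each arm once first
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
            arm = int(np.argmax(ucb))
        r = reward_fn(arm, t, rng)
        counts[arm] += 1; sums[arm] += r
        recent[arm] = (recent[arm] + [r])[-window:]
        drifted = (counts[arm] >= 2 * window and
                   abs(np.mean(recent[arm]) - sums[arm] / counts[arm]) > thresh)
        if drifted:                               # restart exploration
            counts[:] = 0; sums[:] = 0
            recent = [[] for _ in range(n_arms)]
        picks.append(arm)
    return picks
```

On a stationary instance the restart rule stays silent and the procedure behaves like plain UCB1; after a large shift, the recent/historical discrepancy triggers the wipe.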
ETH-FDS seminar: Tracking Most Significant Changes in Bandits |
HG F 3 |
Thu 03.11.2022 16:15-17:15 |
Holger Rauhut RWTH Aachen |
Abstract
Deep neural networks are usually trained by minimizing a non-convex loss functional via (stochastic) gradient descent methods. Unfortunately, the convergence properties are not well understood. Moreover, a puzzling empirical observation is that learning neural networks with a number of parameters exceeding the number of training examples often leads to zero loss, i.e., the network exactly interpolates the data. Nevertheless, it generalizes very well to unseen data, in stark contrast to intuition from classical statistics, which would predict overfitting.
A current working hypothesis is that the chosen optimization algorithm has a significant influence on the selection of the learned network. In fact, in this overparameterized context there are many global minimizers so that the optimization method induces an implicit bias on the computed solution. It seems that gradient descent methods and their stochastic variants favor networks of low complexity (in a suitable sense to be understood), and, hence, appear to be very well suited for large classes of real data.
Initial attempts at understanding the implicit bias phenomenon consider the simplified setting of linear networks, i.e., (deep) factorizations of matrices. This has revealed a surprising relation to the field of low-rank matrix recovery (a variant of compressive sensing), in the sense that gradient descent favors low-rank matrices in certain situations. Moreover, restricting further to diagonal matrices, or equivalently factorizing the entries of a vector to be recovered, leads to connections to compressive sensing and ℓ1-minimization.
After giving a general introduction to these topics, the talk will concentrate on results by the speaker on the convergence of gradient flows and gradient descent for learning linear neural networks and on the implicit bias towards low rank and sparse solutions.
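The diagonal-factorization connection mentioned above can be seen in a small experiment: gradient descent on w = u⊙u − v⊙v from a tiny initialization finds a much sparser interpolant than the minimum-ℓ2 solution. Dimensions, step size, and iteration count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                                   # underdetermined: n < d
A = rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[:3] = [2.0, -1.0, 1.5]   # sparse ground truth
b = A @ w_star

# Overparameterize w = u*u - v*v ("diagonal linear network") and run plain
# gradient descent from a tiny initialization.
u = np.full(d, 1e-3); v = np.full(d, 1e-3)
lr = 1e-3
for _ in range(50_000):
    g = A.T @ (A @ (u * u - v * v) - b)         # gradient w.r.t. w
    u, v = u - lr * 2 * g * u, v + lr * 2 * g * v
w_gd = u * u - v * v

# Minimum-l2-norm interpolant, for comparison
w_l2 = A.T @ np.linalg.solve(A @ A.T, b)
```

Both solutions interpolate the data, but `w_gd` has a much smaller ℓ1 norm than `w_l2`, echoing the implicit bias toward sparse solutions described in the abstract.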
ETH-FDS seminar: The implicit bias of gradient descent for learning linear neural networks |
HG F 3 |
Thu 10.11.2022 16:30-17:30 |
Asaf Weinstein Hebrew University of Jerusalem |
Abstract
Suppose you observe Y_i = mu_i + e_i, where the e_i are i.i.d. from some fixed and known zero-mean distribution, and the mu_i are fixed and unknown parameters. In this "canonical" setting, a simultaneous statistical inference problem, as we define it here, is one in which no preference is given to any of the mu_i before seeing the data. For example: estimating all mu_i under sum-of-squares loss; testing H_{0i}: mu_i = 0 simultaneously for i = 1, ..., n while controlling the FDR; estimating mu_{i*} where i* = argmax_i Y_i under squared loss; or even testing the global null H_0 = ∩_{i=1}^n H_{0i}.
What is the optimal solution to a simultaneous inference problem? In a Bayesian setup, i.e., when mu_i are assumed random, the answer is conceptually straightforward. In the frequentist setup considered here, the answer is far less obvious, and various approaches exist for defining notions of frequentist optimality and for designing procedures that pursue them. In this work we define the optimal solution to a simultaneous inference problem to be the procedure that, for the true mu_i's, has the best performance among all procedures that are oblivious to the labels i=1,...,n. This is a natural and arguably the weakest condition one could possibly impose. For such procedures we observe that the problem can be cast as a Bayesian problem with respect to a particular prior, which immediately reveals an explicit form for the optimal solution. The argument actually holds more generally for any permutation-invariant model, e.g. when the e_i above are exchangeable, not independent, noise terms, which is sometimes a much more realistic assumption. Finally, we discuss the relation to Robbins's empirical Bayes approach, and explain why nonparametric empirical Bayes procedures should, at least when the e_i's are independent, asymptotically attain the optimal performance uniformly in the parameter value.
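A minimal illustration of the gain available to label-oblivious procedures is the classic James-Stein estimator, which shrinks all coordinates by a common data-driven factor and dominates the coordinate-wise MLE under sum-of-squares loss. This is a textbook example, not the paper's general construction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
mu = rng.normal(0.0, 1.0, n)       # the fixed, unknown means (drawn once here)
y = mu + rng.normal(0.0, 1.0, n)   # observations Y_i = mu_i + e_i, e_i ~ N(0,1)

mle = y                                    # coordinate-wise MLE, label-oblivious
js = (1 - (n - 2) / np.sum(y ** 2)) * y    # James-Stein: also label-oblivious

def sum_sq_loss(est):
    return float(np.mean((est - mu) ** 2))
```

Both rules treat the labels i = 1, ..., n symmetrically, yet the shrinkage rule roughly halves the loss here, hinting at how much a label-oblivious procedure can adapt to the empirical distribution of the means.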
Young Data Science Researcher Seminar Zurich: On permutation invariant problems in simultaneous statistical inference |
Zoom Call |
Thu 17.11.2022 16:00-17:00 |
Qiuqi Wang University of Waterloo |
Abstract
In the recent Basel Accords, the Expected Shortfall (ES) replaces the Value-at-Risk (VaR) as the standard risk measure for market risk in the banking sector, making it the most important risk measure in financial regulation. One of the most challenging tasks in risk modeling practice is to backtest ES forecasts provided by financial institutions. Ideally, backtesting should be done based only on daily realized portfolio losses, without imposing specific models. Recently, the notion of e-values has gained attention as a potential alternative to p-values as a measure of uncertainty, significance and evidence. We use e-values and e-processes to construct a model-free backtesting procedure for ES based on a concept of universal e-statistics, which can be naturally generalized to many other risk measures and statistical quantities.
Young Data Science Researcher Seminar Zurich: E-backtesting risk measures |
Zoom Call |
Fri 18.11.2022 15:15-16:15 |
Eva Ceulemans KU Leuven |
Abstract
Intensive longitudinal studies (e.g., experience sampling studies) have demonstrated that detecting changes in statistical features across time is crucial to better capture and understand psychological phenomena. For example, it has been uncovered that emotional episodes are characterized by changes in both means and correlations. In psychopathology research, recent evidence revealed that changes in means, variance, autocorrelation and correlation of experience sampling data can serve as early warning signs of an upcoming relapse into depression. In this talk, I will discuss flexible statistical tools for retrospectively and prospectively capturing such changes. First, I will present the KCP-RS framework, a retrospective change point detection framework that can be tailored to capture changes in not only the means but in any statistic that is relevant to the researcher. Second, I will turn to the prospective change detection problem, where I will argue that statistical process control procedures, originally developed for monitoring industrial processes, are promising tools but need tweaking to the problem at hand.
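For the prospective problem, a classical statistical-process-control monitor is the one-sided CUSUM. The sketch below flags an upward mean shift in a monitored series; the parameters are illustrative, and a KCP-RS-style monitor would track other running statistics (variance, autocorrelation, correlation) in the same spirit:

```python
import numpy as np

def cusum_alarm(xs, target, k=0.5, h=10.0):
    """One-sided CUSUM on a monitored statistic (here the mean, in sd units):
    return the first index at which the cumulative upward drift beyond
    `target` plus allowance k exceeds threshold h, or None if no alarm."""
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + (x - target - k))
        if s > h:
            return t
    return None

rng = np.random.default_rng(3)
series = np.concatenate([rng.normal(0, 1, 200),   # in-control phase
                         rng.normal(2, 1, 100)])  # mean shifts upward at t = 200
alarm = cusum_alarm(series, target=0.0)
```

The allowance k and threshold h trade off false alarms against detection delay, which is exactly the tuning the talk argues must be adapted for experience sampling data.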
ZüKoSt Zürcher Kolloquium über Statistik: KCP-RS and statistical process control: Flexible tools to flag changes in time series |
HG G 19.1 / Zoom Call |
Fri 25.11.2022 15:15-16:15 |
Mats Stensrud EPFL Lausanne |
Abstract
Investigators often express interest in effects that quantify the mechanism by which a treatment (exposure) affects an outcome. In this presentation, I will discuss how to formulate and choose effects that quantify mechanisms, beyond conventional average causal effects. I will consider the perspective of a decision maker, such as a patient, doctor or drug developer. I will emphasize that a careful articulation of a practically useful research question should either map to decision making at this point in time or in the future. A common feature of effects that are practically useful is that they correspond to possibly hypothetical but well-defined interventions in identifiable (sub)populations. To illustrate my points, I will consider examples that were recently used to motivate consideration of mechanistic effects, e.g. in clinical trials. In all of these examples, I will suggest different causal effects that correspond to explicit research questions of practical interest. These proposed effects also require less stringent identification assumptions.
ZüKoSt Zürcher Kolloquium über Statistik: Bridging data and decisions: How strings of numbers can honestly guide future policies |
HG G 19.1 |
Thu 01.12.2022 16:00-17:00 |
Niklas Pfister University of Copenhagen |
Abstract
Causal models can provide good predictions even under distributional shifts. This observation has led to the development of various methods that use causal learning to improve the generalization performance of predictive models. In this talk, we consider this type of approach for instrumental variable (IV) models. IV allows us to identify a causal function between covariates X and a response Y, even in the presence of unobserved confounding. In many practical prediction settings, however, the causal function is not fully identifiable. We consider two approaches for dealing with this under-identified setting: (1) adding a sparsity constraint, and (2) introducing the invariant most predictive (IMP) model, which deals with the under-identifiability by selecting the most predictive model among all feasible IV solutions. Furthermore, we analyze to which types of distributional shifts these models generalize.
Research Seminar in Statistics: Distribution Generalization and Identifiability in IV Models |
HG G 19.1 |
Fri 02.12.2022 15:15-16:15 |
Gaudenz Koeppel Chief Analytics Officer at Axpo Trading & Sales |
Abstract
In this talk, Gaudenz Koeppel will walk us through Axpo's journey of building machine learning models for power-trading applications and taking them into 24/7 operation. He will expand on some of the lessons learned, on the importance of explainers, and on which aspects of such models must be monitored, how this monitoring is done, and how the monitoring information creates new insights. This will be a very practical, hands-on talk.
ZüKoSt Zürcher Kolloquium über Statistik: Machine Learning Models in Energy Markets |
HG G 19.1 |
Thu 08.12.2022 16:00-17:00 |
Yuling Yan Princeton University |
Abstract
Many high-dimensional problems involve reconstruction of a low-rank matrix from highly incomplete and noisy observations. Despite substantial progress in designing efficient estimation algorithms, it remains largely unclear how to assess the uncertainty of the obtained low-rank estimates, and how to construct valid yet short confidence intervals for the unknown low-rank matrix.
In this talk, I will discuss how to perform inference and uncertainty quantification for two widely encountered low-rank models: (1) noisy matrix completion, and (2) heteroskedastic PCA with missing data. For both problems, we identify statistically efficient estimators that admit non-asymptotic distributional characterizations, which in turn enable optimal construction of confidence intervals for, say, the unseen entries of the low-rank matrix of interest. Our inferential procedures do not rely on sample splitting, thus avoiding unnecessary loss of data efficiency. All this is accomplished by a powerful leave-one-out analysis framework that originated from probability and random matrix theory.
This is based on joint work with Yuxin Chen, Jianqing Fan and Cong Ma.
Young Data Science Researcher Seminar Zurich: Inference and Uncertainty Quantification for Low-Rank Models |
Zoom Call |
Thu 15.12.2022 16:00-17:00 |
Yixin Wang University of Michigan |
Abstract
Representation learning constructs low-dimensional representations that summarize essential features of high-dimensional data such as images and texts. Ideally, such a representation should efficiently capture non-spurious features of the data. It should also be disentangled, so that we can interpret which feature each of its dimensions captures. However, these desiderata are often intuitively defined and challenging to quantify or enforce.
In this talk, we take a causal perspective on representation learning. We show how the desiderata of representation learning can be formalized using counterfactual notions, enabling metrics and algorithms that target efficient, non-spurious, and disentangled representations of data. We discuss the theoretical underpinnings of the algorithm and illustrate its empirical performance in both supervised and unsupervised representation learning.
This is joint work with Michael Jordan: https://arxiv.org/abs/2109.03795
Young Data Science Researcher Seminar Zurich: Representation Learning: A Causal Perspective |
Zoom Call |
Fri 16.12.2022 15:15-16:15 |
Weijie Su Wharton, University of Pennsylvania |
Abstract
In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of the layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training. This is based on joint work with Cong Fang, Hangfeng He, and Qi Long (arXiv:2101.12699).
In the second part, we study how real-world deep neural networks process data in the interior layers. Our finding is a simple and quantitative law that governs how deep neural networks separate data according to class membership throughout all layers for classification. This law shows that each layer improves data separation at a constant geometric rate, and its emergence is observed in an authoritative collection of network architectures and datasets during training. This law offers practical guidelines for designing architectures, improving model robustness and out-of-sample performance, as well as interpreting the predictions. This is based on joint work with Hangfeng He (arXiv:2210.17020).
Research Seminar in Statistics: Some Geometric Patterns of Real-World Deep Neural Networks |
HG G 19.1 |