Seminar overview


Autumn Semester 2023

Date & Time Speaker Title Location
Mon 21.08.2023
16:00-17:00
Cun-Hui Zhang
Rutgers University, USA
Abstract
We provide necessary and sufficient conditions for the chi-squared and normal approximations of Pearson's chi-squared statistics for the test of independence and the goodness-of-fit test, as well as necessary and sufficient conditions for the normal approximation of the likelihood ratio and Hellinger statistics, when the cell probabilities of the multinomial data follow a general pattern and the dimension diverges with the sample size. A cross-sample chi-squared statistic for testing independence applies to two-way contingency tables with diverging dimensions. A degrees-of-freedom adjusted chi-squared approximation applies continuously throughout the high-dimensional regime and matches Pearson's chi-squared statistic in both the mean and variance. Specific examples are provided to demonstrate the asymptotic normality of the three types of test statistics when the classical regularity conditions for the chi-squared and normal approximations are violated. Simulation results demonstrate that the chi-squared and normal approximations are more robust for the likelihood ratio and Hellinger statistics, compared with Pearson's chi-squared statistics. This talk is based on joint work with Chong Wu and Yisha Yao.
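As a point of reference, the classical fixed-dimension Pearson statistic that the talk's approximations refine can be sketched in a few lines of NumPy. This is a plain textbook implementation, not the cross-sample or degrees-of-freedom adjusted statistics from the talk:

```python
import numpy as np

def pearson_chi2_independence(table):
    """Classical Pearson chi-squared statistic for independence in a
    two-way contingency table of counts, with its degrees of freedom."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)   # n_{i.}
    col = table.sum(axis=0, keepdims=True)   # n_{.j}
    expected = row @ col / n                 # n_{i.} n_{.j} / n under independence
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof

# A table with exact independence (statistic 0) and one with strong dependence.
stat0, dof0 = pearson_chi2_independence([[10, 20], [20, 40]])
stat1, dof1 = pearson_chi2_independence([[30, 10], [10, 30]])
```

Under the classical regime the statistic is compared with a chi-squared distribution on `dof` degrees of freedom; the talk concerns exactly when that approximation (or a normal one) remains valid as the table dimensions grow.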
Research Seminar in Statistics
Chi-Squared and Normal Approximations in Large Contingency Tables
HG G 26.5
Fri 22.09.2023
15:15-16:15
Zijian Guo
Rutgers University, USA
Abstract
Instrumental variable methods are among the most commonly used causal inference approaches to deal with unmeasured confounders in observational studies. The presence of invalid instruments is the primary concern for practical applications, and a fast-growing area of research is inference for the causal effect with possibly invalid instruments. This paper illustrates that the existing confidence intervals may undercover when the valid and invalid instruments are hard to separate in a data-dependent way. To address this, we construct uniformly valid confidence intervals that are robust to the mistakes in separating valid and invalid instruments. We propose to search for a range of treatment effect values that lead to sufficiently many valid instruments. We further devise a novel sampling method, which, together with searching, leads to a more precise confidence interval. Our proposed searching and sampling confidence intervals are uniformly valid and achieve the parametric length under the finite-sample majority and plurality rules. We apply our proposal to examine the effect of education on earnings. The proposed method is implemented in the R package RobustIV available from CRAN.
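The "searching" idea can be caricatured in a toy form: treat each instrument's per-instrument effect estimate as a vote, and keep every candidate effect value that a majority of instruments finds plausible. The estimates, standard errors, and grid below are made up for illustration, and this sketch omits the sampling step and the uniform calibration from the paper:

```python
import numpy as np

# Hypothetical per-instrument effect estimates and standard errors:
# three (valid) instruments agree near 0.5, two (invalid) ones do not.
beta_hat = np.array([0.50, 0.52, 0.48, 0.90, 0.10])
se = np.full(5, 0.05)

# Search a grid of candidate effect values; keep those under which a
# majority of instruments looks "valid" (estimate within ~1.96 SEs).
grid = np.linspace(-1.0, 2.0, 3001)
valid = np.abs(beta_hat[None, :] - grid[:, None]) <= 1.96 * se[None, :]
kept = grid[valid.sum(axis=1) > len(beta_hat) / 2]
ci_lo, ci_hi = kept.min(), kept.max()   # a crude "searching" interval
```

The kept values cluster around 0.5, driven by the majority of mutually consistent instruments; the two outlying instruments cannot form a majority of their own, which is the finite-sample majority rule at work.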
Research Seminar in Statistics / ETH-FDS seminar
Joint talk: Robust Causal Inference with Possibly Invalid Instruments: Post-selection Problems and A Solution Using Searching and Sampling
HG G 19.1
Fri 29.09.2023
15:15-16:15
Leonardo Egidi
University of Trieste
Abstract
Nowadays a Bayesian model needs to be reproducible, generative, predictive, robust, computationally scalable, and able to provide sound inferential conclusions. In this wide framework, Bayes factors still represent one of the best-known and most commonly adopted tools for model selection and hypothesis testing; however, they are usually criticized for their intrinsic lack of calibration, and they are rarely used to measure the predictive accuracy of competing models. We propose two distinct approaches relying on Bayes factors from our most recent research. With regard to prediction, we propose a new algorithmic protocol to transform Bayes factors into measures that evaluate the pure and intrinsic predictive capabilities of models in terms of posterior predictive distributions, and we assess some preliminary theoretical properties (joint work with Ioannis Ntzoufras). Then, regarding the analysis of replication studies (Held, 2020), we follow the stream outlined by Pawel and Held (2022) and propose a skeptical mixture prior representing the prior of an investigator who is unconvinced by the original findings. Its novelty lies in incorporating skepticism while controlling for prior-data conflict (Egidi et al., 2022). Consistency properties of the resulting skeptical Bayes factor are provided, together with a thorough analysis of the main features of our proposal (joint work with Guido Consonni).
Short bibliography:
Egidi, L., Pauli, F., & Torelli, N. (2022). Avoiding prior–data conflict in regression models via mixture priors. Canadian Journal of Statistics, 50(2), 491–510.
Held, L. (2020). A new standard for the analysis and design of replication studies. Journal of the Royal Statistical Society Series A: Statistics in Society, 183(2), 431–448.
Pawel, S., & Held, L. (2022). The sceptical Bayes factor for the assessment of replication success. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 879–911.
Research Seminar in Statistics
Prediction, skepticism, and the Bayes Factory
HG G 19.1
Tue 03.10.2023
16:30-18:00
Linjun Zhang
Rutgers University, USA
Mikhail Yurochkin
IBM Research and MIT-IBM Watson AI Lab
Abstract
1) Fair conformal prediction, Linjun Zhang, Rutgers University
Multi-calibration is a powerful and evolving concept originating in the field of algorithmic fairness. For a predictor $f$ that estimates the outcome $y$ given covariates $x$, and for a function class $C$, multi-calibration requires that the predictor $f(x)$ and outcome $y$ are indistinguishable under the class of auditors in $C$. Fairness is captured by incorporating demographic subgroups into the class of functions $C$. Recent work has shown that, by enriching the class $C$ to incorporate appropriate propensity re-weighting functions, multi-calibration also yields target-independent learning, wherein a model trained on a source domain performs well on unseen, future target domains (approximately) captured by the re-weightings. The multi-calibration notion is extended, and the power of an enriched class of mappings is explored. HappyMap, a generalization of multi-calibration, is proposed, which yields a wide range of new applications, including a new fairness notion for uncertainty quantification (conformal prediction), a novel technique for conformal prediction under covariate shift, and a different approach to analyzing missing data, while also yielding a unified understanding of several existing seemingly disparate algorithmic fairness notions and target-independent learning approaches. A single HappyMap meta-algorithm is given that captures all these results, together with a sufficiency condition for its success.

2) Operationalizing Individual Fairness, Mikhail Yurochkin, IBM Research and MIT-IBM Watson AI Lab
Societal applications of ML proved to be challenging due to algorithms replicating or even exacerbating biases in the training data. In response, there is a growing body of research on algorithmic fairness that attempts to address these issues, primarily via group definitions of fairness. In this talk, I will illustrate several shortcomings of group fairness and present an algorithmic fairness pipeline based on individual fairness (IF). IF is often recognized as the more intuitive notion of fairness: we want ML models to treat similar individuals similarly. Despite the benefits, challenges in formalizing the notion of similarity and enforcing equitable treatment prevented the adoption of IF. I will present our work addressing these barriers via algorithms for learning the similarity metric from data and methods for auditing and training fair models utilizing the intriguing connection between individual fairness and adversarial robustness. Finally, I will demonstrate applications of IF with Large Language Models.

Discussant: Razieh Nabi, Emory University
Young Data Science Researcher Seminar Zurich
Joint webinar of the IMS New Researchers Group, Young Data Science Researcher Seminar Zürich, and the YoungStatS Project: Algorithmic Fairness
Zoom Call
Fri 27.10.2023
15:15-16:15
Johanna Ziegel
University of Bern
Abstract
The (asymptotic) t-test is used in many situations. To ensure type-I-error guarantees, the sample size must be specified before data collection, and one cannot peek at the data before all of it has arrived. However, looking at the data during the collection process may happen, or may even be desirable. In such cases a sequential alternative to the t-test is more appropriate, since it controls the type-I error even when checking for significance after each new data point arrives. Three sequential alternatives to the t-test will be presented and compared in a simulation study.
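The motivating problem shows up in a small simulation: under the null, a fixed-n test at the 5% level keeps its guarantee, but "peeking" (testing after every new observation with the same critical value) inflates the type-I error well beyond 5%. For simplicity this sketch uses a z-test with known unit variance as a stand-in for the t-test:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejects_fixed(x, crit=1.96):
    """Fixed-sample z-test of H0: mean = 0, variance known to be 1."""
    return abs(x.mean()) * np.sqrt(len(x)) > crit

def rejects_with_peeking(x, crit=1.96, start=10):
    """Test after every new observation from `start` onwards; reject if
    the z-statistic ever crosses the fixed-sample critical value."""
    n = np.arange(1, len(x) + 1)
    z = np.abs(np.cumsum(x)) / np.sqrt(n)   # |sum| / sqrt(n) = |mean| * sqrt(n)
    return bool((z[start - 1:] > crit).any())

# Simulate under the null: 2000 studies of 200 observations each.
data = rng.standard_normal((2000, 200))
fixed_rate = np.mean([rejects_fixed(x) for x in data])
peek_rate = np.mean([rejects_with_peeking(x) for x in data])
```

The fixed-sample rejection rate stays near 5%, while the peeking rate is several times larger; sequential tests are designed so that the peeking version itself keeps the 5% guarantee.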
ZüKoSt Zürcher Kolloquium über Statistik
Sequential alternatives to the t-test
HG G 19.1
Thu 02.11.2023
16:15-17:15
Quentin Berthet
Google DeepMind
Abstract
Machine learning pipelines often rely on optimization procedures to make discrete decisions (e.g., sorting, picking closest neighbors, or shortest paths). Although these discrete decisions are easily computed in a forward manner, they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform optimizers into operations that are differentiable and never locally constant. Our approach relies on stochastically perturbed optimizers, and can be used readily within existing solvers. Their derivatives can be evaluated efficiently, and smoothness can be tuned via the chosen noise amplitude. We also show how this framework can be connected to a family of losses developed in structured prediction, and give theoretical guarantees for their use in learning tasks. We demonstrate experimentally the performance of our approach on various tasks, including recent applications to protein sequences.
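The core construction can be sketched for the simplest optimizer, an argmax over scores: averaging the one-hot argmax of noise-perturbed scores yields a smoothed map that is never locally constant. The Monte Carlo estimate below is a toy version with made-up scores; the paper treats general optimizers and efficient derivative evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_argmax(theta, sigma=1.0, n_samples=10_000):
    """Monte Carlo estimate of E[one_hot(argmax(theta + sigma * Z))],
    a stochastically smoothed version of the piecewise-constant argmax."""
    z = rng.standard_normal((n_samples, len(theta)))
    winners = np.argmax(theta + sigma * z, axis=1)
    return np.bincount(winners, minlength=len(theta)) / n_samples

theta = np.array([2.0, 1.0, 0.0])
p = perturbed_argmax(theta)   # a probability vector rather than a hard one-hot
```

Unlike the hard argmax, this expectation varies smoothly with `theta`, and the noise amplitude `sigma` controls how much it is smoothed, which is what makes end-to-end gradient training possible.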
ETH-FDS seminar
Joint talk DACO-FDS: Perturbed Optimizers for Machine Learning
HG G 19.1
Thu 16.11.2023
16:15-17:15
Jakob Zech
Universität Heidelberg
Abstract
In this talk, we explore approximation properties and statistical aspects of Neural Ordinary Differential Equations (Neural ODEs). Neural ODEs are a recently established technique in computational statistics and machine learning, that can be used to characterize complex distributions. Specifically, given a fixed set of independent and identically distributed samples from a target distribution, the goal is either to estimate the target density or to generate new samples. We first investigate the regularity properties of the velocity fields used to push forward a reference distribution to the target. This analysis allows us to deduce approximation rates achievable through neural network representations. We then derive a concentration inequality for the maximum likelihood estimator of general ODE-parametrized transport maps. By merging these findings, we are able to determine convergence rates in terms of both the network size and the number of required samples from the target distribution. Our discussion will particularly focus on target distributions within the class of positive $C^k$ densities on the $d$-dimensional unit cube $[0,1]^d$.
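The basic mechanism, integrating a velocity field to push a reference distribution onto a target, can be illustrated with a field whose time-1 flow is known in closed form. The linear field below, which transports N(0,1) to N(mu, sigma^2), is hand-picked for illustration and stands in for the neural-network velocity fields of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.5
a = np.log(sigma)

def velocity(x, t):
    """Velocity field whose exact flow is x(t) = z * exp(a*t) + mu * t,
    so x(0) = z ~ N(0,1) is carried to x(1) ~ N(mu, sigma^2)."""
    return a * (x - mu * t) + mu

# Push 100k reference samples forward with explicit Euler on [0, 1].
x = rng.standard_normal(100_000)
n_steps, dt = 1000, 1.0 / 1000
for k in range(n_steps):
    x = x + dt * velocity(x, k * dt)
```

In a Neural ODE the velocity field is instead parametrized by a network and fitted (e.g. by maximum likelihood) from samples of the target, which is where the regularity, approximation, and concentration results of the talk enter.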
ETH-FDS seminar
Nonparametric Distribution Learning via Neural ODEs
HG G 19.1
Tue 28.11.2023
17:15-18:15
Daniela M. Witten
University of Washington
Abstract
We propose data thinning, a new approach for splitting an observation from a known distributional family with unknown parameter(s) into two or more independent parts that sum to yield the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to a broad class of distributions within the natural exponential family, including the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. Furthermore, we generalize data thinning to enable splitting an observation into two or more parts that can be combined to yield the original observation using an operation other than addition; this enables the application of data thinning far beyond the natural exponential family. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. We will present an application of data thinning to single-cell RNA-sequencing data, in a setting where sample splitting is not applicable. This is joint work with Anna Neufeld (Fred Hutch), Ameer Dharamshi (University of Washington), Lucy Gao (University of British Columbia), and Jacob Bien (University of Southern California).
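For the Poisson case the construction is especially clean: binomial thinning of X ~ Poisson(lam) with rate eps yields independent Poisson(eps*lam) and Poisson((1-eps)*lam) parts that sum to X. A minimal sketch of this one case, not the general recipe for other families:

```python
import numpy as np

rng = np.random.default_rng(0)

def thin_poisson(x, eps, rng):
    """Split X ~ Poisson(lam) into X1 ~ Poisson(eps*lam) and
    X2 ~ Poisson((1-eps)*lam), with X1 + X2 = X and X1 independent of X2."""
    x1 = rng.binomial(x, eps)   # each of the X events kept with prob. eps
    return x1, x - x1

x = rng.poisson(lam=10.0, size=100_000)
x1, x2 = thin_poisson(x, 0.3, rng)
```

The two parts reconstruct the original exactly yet are independent, so one part can be used to fit or select a model and the other to evaluate it, which is the cross-validation use case from the abstract.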
ETH-FDS Stiefel Lectures
Data thinning and its applications
HG F 30
Thu 30.11.2023
15:15-16:15
Xinwei Shen
ETH Zurich
Abstract
Extrapolation is crucial in many statistical and machine learning applications, as it is common to encounter test data outside the training support. However, extrapolation is a considerable challenge for nonlinear models. Conventional models typically struggle in this regard: while tree ensembles provide a constant prediction beyond the support, neural network predictions tend to become uncontrollable. This work aims at providing a nonlinear regression methodology whose reliability does not break down immediately at the boundary of the training support. Our primary contribution is a new method called ‘engression’ which, at its core, is a distributional regression technique for pre-additive noise models, where the noise is added to the covariates before applying a nonlinear transformation. Our experimental results indicate that this model is typically suitable for many real data sets. We show that engression can successfully perform extrapolation under some assumptions such as a strictly monotone function class, whereas traditional regression approaches such as least-squares regression and quantile regression fall short under the same assumptions. We establish the advantages of engression over existing approaches in terms of extrapolation, showing that engression consistently provides a meaningful improvement. Our empirical results, from both simulated and real data, validate these findings, highlighting the effectiveness of the engression method. The software implementations of engression are available in both R and Python.
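The distinction between pre- and post-additive noise is easy to state in code: in a pre-additive model the noise passes through the nonlinearity, so the conditional distribution of Y inherits spread and skew from g, while in the classical post-additive model it does not. A toy comparison with an assumed monotone g, not the fitted engression model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(t):
    """An assumed strictly monotone (on t >= 0) nonlinearity for illustration."""
    return np.maximum(t, 0.0) ** 2

def sample_pre_additive(x, scale, n, rng):
    """Pre-additive noise model (engression's model class): Y = g(X + eps)."""
    return g(x + scale * rng.standard_normal(n))

def sample_post_additive(x, scale, n, rng):
    """Classical post-additive model: Y = g(X) + eps."""
    return g(x) + scale * rng.standard_normal(n)

y_pre = sample_pre_additive(2.0, 0.5, 100_000, rng)
y_post = sample_post_additive(2.0, 0.5, 100_000, rng)
```

At x = 2 both samples center near g(2) = 4, but the pre-additive response is far more dispersed because the noise is amplified by the local slope of g; it is this structure that engression exploits for extrapolation beyond the training support.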
Research Seminar in Statistics
Engression: Extrapolation for Nonlinear Regression?
HG G 43
Thu 14.12.2023
15:15-16:15
Shuheng Zhou
University of California
Abstract
We consider the following data perturbation model, where the covariates incur multiplicative errors. For two random matrices U, X, we denote by U \circ X the Hadamard or Schur product, defined as (U \circ X)_{ij} = U_{ij} X_{ij}. In this paper, we study the subgaussian matrix variate model, where we observe the matrix variate data through a random mask U: \mathcal{X} = U \circ X, with X = B^{1/2} Z A^{1/2}, where Z is a random matrix with independent subgaussian entries, and U is a mask matrix with either zero or positive entries, such that $E[U_{ij}] \in [0,1]$ and all entries are mutually independent. Under the assumption of independence between X and U, we introduce componentwise unbiased estimators for the covariances A and B, and prove concentration of measure bounds guaranteeing that the restricted eigenvalue (RE) conditions hold for the unbiased estimator of B when the columns of the data matrix are sampled at different rates. We further develop multiple regression methods for estimating the inverse of B and show statistical rates of convergence. Our results provide insight into sparse recovery of relationships among entities (samples, locations, items) when features (variables, time points, user ratings) are present in the observed data matrix X with heterogeneous rates. Our proof techniques can certainly be extended to other scenarios. We provide simulation evidence illuminating the theoretical predictions.
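The observation model and the componentwise bias correction it calls for can be sketched numerically. With a Bernoulli(p) mask, E[(U \circ X)^T (U \circ X)] picks up a factor p on the diagonal and p^2 off the diagonal, so dividing by the matching mask moment debiases the Gram matrix. This is an identity-covariance toy with a uniform sampling rate, not the heterogeneous-rate estimators of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, p = 2000, 50, 0.7
X = rng.standard_normal((n, m))          # rows i.i.d. with covariance I
U = rng.binomial(1, p, size=(n, m))      # Bernoulli(p) mask, E[U_ij] = p
X_obs = U * X                            # observed data: Hadamard product U ∘ X

# E[(X_obs^T X_obs)_{jk}] = p^2 (X^T X)_{jk} for j != k, but only a single
# factor p survives on the diagonal (U_ij^2 = U_ij), hence the two scalings.
scale = np.full((m, m), p * p)
np.fill_diagonal(scale, p)
cov_hat = (X_obs.T @ X_obs) / scale / n  # should be close to the identity
```

Despite 30% of the entries being zeroed out, the rescaled Gram matrix recovers the true covariance in expectation; the talk's concentration bounds quantify how tightly, and with column-specific rates the scaling matrix simply uses the product of the relevant rates.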
Research Seminar in Statistics
Concentration of measure bounds for matrix-variate data with missing values
HG G 43
Fri 15.12.2023
15:15-16:15
Sylvain Robert
Google
Abstract
Advertisers are interested in measuring the effectiveness of their online marketing campaigns on various platforms. While user-based experiments are efficient and well understood, they are not always feasible for technical and legal reasons. Geo-based experiments are an attractive and privacy-centric alternative, where experimental units are defined as geographical regions instead of individual users. One issue with this type of experiment, however, is the presence of contamination (or interference) between units, due to the natural movement of people and imprecision in geo-localization. In this work we try to quantify the amount of contamination in our experiments and propose possible solutions to mitigate its adverse effects, both during estimation at the end of the experiment and upstream at the design phase.
ZüKoSt Zürcher Kolloquium über Statistik
Dealing with contamination in geo-experiments
HG G 19.1