Seminar overview

Spring Semester 2019

Date & Time Speaker Title Location
Thu 24.01.2019
16:15-17:00
Amin Karbasi
Yale University, USA
Abstract
The difficulty of searching through a massive amount of data in order to quickly make an informed decision is one of today’s most ubiquitous challenges. Many scientific and engineering models feature inherently discrete decision variables — from phrases in a corpus to objects in an image. The study of how to make near-optimal decisions from a massive pool of possibilities is at the heart of combinatorial optimization. Many of these problems are notoriously hard, and even those that are theoretically tractable may not scale to large instances. However, the problems of practical interest are often much better behaved and possess extra structure that makes them amenable to exact or approximate optimization techniques. Just as convexity has been a celebrated and well-studied condition under which continuous optimization is tractable, submodularity is a condition under which discrete objectives may be optimized. In order to provide the tightest approximation guarantees for submodular optimization problems, we usually need to leave the space of discrete domains and consider their continuous relaxations. To this end, we will explore the notion of submodularity in continuous domains and introduce a broad class of non-convex objective functions. Despite the apparent lack of convexity, we will see that first-order optimization methods can provide strong approximation guarantees. We then show that such continuous relaxations can be used as an interface to provide tight approximation guarantees for maximizing stochastic submodular set functions. I will not assume any particular background on submodularity or optimization and will try to motivate and define all the necessary concepts during the talk.
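For readers unfamiliar with the terminology, the diminishing-returns property referred to above and the multilinear extension (one standard continuous relaxation) can be stated as follows; these are standard definitions, not specific to this talk. A set function $f : 2^V \to \mathbb{R}$ is submodular if
\[ f(A \cup \{e\}) - f(A) \;\ge\; f(B \cup \{e\}) - f(B) \qquad \text{for all } A \subseteq B \subseteq V,\; e \in V \setminus B. \]
Its multilinear extension $F : [0,1]^V \to \mathbb{R}$ is
\[ F(x) \;=\; \mathbb{E}\big[f(S_x)\big] \;=\; \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j), \]
where $S_x$ contains each element $i$ independently with probability $x_i$; first-order methods are run on $F$ and the fractional solution is then rounded back to a set.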
Bio: Amin Karbasi is currently an assistant professor of Electrical Engineering, Computer Science, and Statistics at Yale University. He has been the recipient of the ONR 2019 Young Investigator Award, the AFOSR 2018 Young Investigator Award, the Grainger Award 2017 from the National Academy of Engineering, a Microsoft Azure research award 2017, the DARPA 2016 Young Faculty Award, a Simons-Berkeley fellowship 2016, a Google Faculty Award 2015, and an ETH fellowship 2013. His work has been recognized with a variety of paper awards, including at the Medical Image Computing and Computer Assisted Intervention conference (MICCAI) 2017, the International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, IEEE ComSoc Data Storage 2013, the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2011, ACM SIGMETRICS 2010, and the IEEE International Symposium on Information Theory (ISIT) 2010 (runner-up). His Ph.D. work received the Patrick Denantes Memorial Prize 2013 from the School of Computer and Communication Sciences at EPFL, Switzerland.
ETH-FDS seminar
It Is Good to Relax (Maybe Best)
HG E 22
Fri 15.03.2019
15:15-16:00
Benjamin Stucky
Pharma UZH
Abstract
The phenomena of sleep and cognition involve complex phenotype-genotype associations, i.e., complex relationships between observable traits and the genetic variation that contributes to the expression of those traits. There is a general belief that investigating such relationships requires large sample sizes. However, sleep- and cognition-related phenotype-genotype associations may be strengthened through carefully controlled laboratory studies that amplify a given cognitive phenotype by perturbing the organism through sleep deprivation and pharmacological interventions. Here we will discuss the contributions of some recent modifications to the LASSO method (Tibshirani, 1996) in identifying a gene expression biomarker panel, i.e. a set of readily measurable genetic indicators of cognitive impairment due to sleep deprivation. We illustrate these approaches on the basis of an RNA expression data set introduced by Uyhelji et al. (2018).
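As a point of reference for the method named above, here is a minimal sketch of a plain cross-validated LASSO fit with scikit-learn on simulated expression data; the specific modifications discussed in the talk are not implemented, and all data and dimensions are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical data: n subjects x p gene-expression features;
# y is a continuous measure of cognitive impairment.
rng = np.random.default_rng(0)
n, p = 60, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]   # a few truly informative genes
y = X @ beta + rng.standard_normal(n)

# Cross-validated LASSO shrinks most coefficients to exactly zero,
# leaving a sparse candidate biomarker panel.
model = LassoCV(cv=5).fit(X, y)
panel = np.flatnonzero(model.coef_)       # indices of selected features
print("selected features:", panel)
```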
ZüKoSt Zürcher Kolloquium über Statistik
Biomarkers for Sleep Deprivation-Induced Impairments in Human Cognition
HG G 19.1
Fri 22.03.2019
11:30-12:15
Judith Rousseau
Ceremade, Dauphine Université Paris
Abstract
In the first part of this work I will present some properties of the class of graphs based on exchangeable point processes: these include in particular asymptotic expressions for the numbers of edges, nodes and triangles, together with the degree distributions, identifying four regimes: (i) a dense regime, (ii) a sparse, almost dense regime, (iii) a sparse regime with power-law behavior, and (iv) an almost extremely sparse regime. We also propose a class of models within this framework where one can separately control the local, latent structure and the global sparsity/power-law properties of the graph, and we derive a central limit theorem for the number of nodes and edges in the graph. In the second part I will recall the Bayesian approach to inference in such graphs, both for simple graphs and multigraphs, proposed by Caron and Fox (2017). I will then study the asymptotic behaviour of the posterior distribution under both well- and mis-specified multigraph models.
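As a brief reminder of the construction referred to above (stated generically and from standard accounts; see Caron and Fox, 2017, for the precise setup): each node $i$ carries a sociability weight $w_i$ generated by a Poisson process on $(0,\infty)$, and, conditionally on the weights,
\[ n_{ij} \sim \mathrm{Poisson}(2\,w_i w_j)\ \ (i \neq j) \quad \text{for the multigraph}, \qquad \Pr(i \sim j) = 1 - e^{-2 w_i w_j} \quad \text{for the simple graph}, \]
with the tail behaviour of the Lévy measure generating the weights controlling which of the four sparsity regimes obtains.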
Research Seminar in Statistics
On Bayesian inference in a family of sparse graphs and multigraphs
HG G 19.1
Fri 22.03.2019
15:15-16:00
Mollie Brooks
National Institute of Aquatic Resources, Technical University of Denmark
Abstract
Generalized linear models in R offer a huge diversity of features (e.g. zero-inflation, numerous distributions, and random-effect correlation structures), but these features have not been easy to use in combination because they were spread across separate packages. Ignoring these aspects can give biased estimates and inflate the rates of false positives or false negatives in hypothesis tests. The R package glmmTMB was developed on top of the TMB package, which performs maximum likelihood estimation, to fit this diversity of models within a single robust package. Features of TMB that make this possible include automatic differentiation for calculating gradients, the Laplace approximation for integrating over random effects, and adding and subtracting on the log scale to avoid over- and underflow. In addition to describing glmmTMB, the talk will include ecological examples that address zero-inflation and underdispersion in count data, as well as Tweedie and beta regression for biomass and proportion data, respectively.
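For readers unfamiliar with zero-inflation, the basic zero-inflated Poisson model that such fits build on can be written as follows (a generic statement, not glmmTMB's exact parameterization):
\[ \Pr(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda}, \qquad \Pr(Y = k) = (1 - \pi)\, \frac{e^{-\lambda} \lambda^{k}}{k!} \quad (k \ge 1), \]
where $\pi$ is the probability of a structural zero and $\lambda$ the mean of the count component; in packages such as glmmTMB, both $\pi$ and $\lambda$ can be given their own (possibly random-effect) linear predictors, typically on the logit and log scales respectively.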
ZüKoSt Zürcher Kolloquium über Statistik
Flexible generalized linear models with glmmTMB
HG G 19.1
Wed 27.03.2019
12:30-13:30
Manfred K. Warmuth
UC Santa Cruz and Google Zürich
Abstract
Joint work with Michal Derezinski.
Consider the following basic one-dimensional linear regression problem. You are given a set of n points and are to predict the hidden response value for each point. The goal is to achieve total square loss close to that of the optimal linear predictor without seeing most of the hidden responses. We show that if you sample JUST ONE point from the set and request its response value, then the linear predictor fitting that point-response pair has, in expectation, exactly twice the optimum total square loss. In d dimensions, sampling a subset of just d points and fitting their responses achieves expected total loss d+1 times the optimum. The key trick is to use a joint sampling technique called volume sampling for sampling a diverse subset of points. We show that the least squares solution obtained for the volume-sampled subproblem is an unbiased estimator of the optimal solution based on all n responses. This unbiasedness is a desirable property that is not shared by other common subset selection techniques.
Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, resulting in a number of new matrix expectation equalities and statistical guarantees which are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods also lead to a regularized variant of volume sampling, and we propose the first efficient algorithm for volume sampling which makes this technique a practical tool in the machine learning toolbox. Finally, we provide experimental evidence which confirms our theoretical findings.
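A brute-force numerical sketch of the unbiasedness property described above (toy data; subset probabilities are computed by enumeration rather than by the efficient algorithm from the talk): size-d subsets S are drawn with probability proportional to det(X_S)^2, and the least-squares fit to the sampled responses matches the full-data solution in expectation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Full-data least-squares solution.
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Volume sampling of size-d subsets: P(S) proportional to det(X_S)^2.
subsets = list(itertools.combinations(range(n), d))
weights = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
probs = weights / weights.sum()

# The least-squares fit to each size-d subset is X_S^{-1} y_S; its
# expectation under volume sampling equals the full-data solution.
w_expected = np.zeros(d)
for p, S in zip(probs, subsets):
    if p > 0:
        w_expected += p * np.linalg.solve(X[list(S)], y[list(S)])

print(np.allclose(w_expected, w_full))  # True, up to floating-point error
```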
ETH-FDS seminar
Reverse Iterative Volume Sampling for Linear Regression
HG D 5.2
Fri 29.03.2019
15:15-16:00
Sebastian Sippel
Institute for Atmospheric and Climate Science, ETH
Abstract
Internal atmospheric variability fundamentally constrains short- and medium-term climate predictability and obscures detection of anthropogenic climate change on regional scales. Dynamical adjustment is a traditional climate-science technique for characterizing circulation-induced variability in temperature or precipitation; the residual contains an estimate of the external, “forced”, i.e. circulation-independent, response. Here, we present a novel dynamical adjustment technique that makes use of statistical learning principles within the context of (1) a set of climate model simulations and (2) a real-world application of the method to variability in winter temperatures and snow coverage in Switzerland. The statistical learning methods establish a consistent relationship between internal circulation variability and atmospheric target variables on a daily time scale, with around 80% of the variance explained in European monthly winter temperature and precipitation, and a similar degree for global annual mean temperature and zonal mean precipitation. A real-world application to the Swiss winter temperature series, a long-term homogenized observational record, reveals that a large fraction of winter temperature variability, and to a smaller degree variability in snow coverage, can be explained by internal atmospheric variability. The adjusted residual time series reveals a smooth, increasing trend since around the late 1970s, mostly driven by thermodynamic changes. Overall, statistical learning techniques for dynamical adjustment help to uncover the externally forced response at regional and global scales, thus strengthening process understanding and facilitating detection of climate change.
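A minimal sketch of regression-based dynamical adjustment in the spirit described above; the data, predictor choice (gridded pressure anomalies), and regularization are invented for illustration and do not reproduce the talk's actual methods or validation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Hypothetical data: monthly sea-level-pressure anomalies (circulation proxy)
# as predictors; a monthly temperature anomaly series as the target.
rng = np.random.default_rng(2)
n_months, n_gridpoints = 480, 1000
slp = rng.standard_normal((n_months, n_gridpoints))
forced_trend = np.linspace(0.0, 1.0, n_months)        # slow "forced" warming signal
temp = 0.02 * slp[:, :50].sum(axis=1) + forced_trend + 0.3 * rng.standard_normal(n_months)

# Regress temperature on circulation: the fitted part estimates
# circulation-induced (internal) variability, and the residual estimates
# the circulation-independent ("forced") component plus noise.
fit = RidgeCV(alphas=np.logspace(-2, 4, 20)).fit(slp, temp - temp.mean())
circulation_part = fit.predict(slp)
adjusted = temp - circulation_part                     # dynamically adjusted series
print("correlation of adjusted series with the forced signal:",
      np.corrcoef(adjusted, forced_trend)[0, 1])
```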
ZüKoSt Zürcher Kolloquium über Statistik
Climate change detection: Uncovering circulation-driven and external components in climate observations and models
HG G 19.1
Fri 05.04.2019
12:30-13:30
Bart De Moor
KU Leuven
Abstract
The tsunami of data and the increasing availability of massive computing power create many opportunities for AI in general, and for data science in particular. However, in many instances, the results delivered by machine learning algorithms are difficult to ‘understand’ or ‘explain’. Black boxes as they are, we basically do not know when and how they are successful, and when and why not. The whole field is pervaded, if not swamped, by hundreds of ‘heuristic tricks’, such as tampering with the neuronal activation function (sigmoid, ReLU (rectified linear unit), …), experimenting with the number of neurons and hidden layers, selecting optimization methods (numerical ones, such as back-propagation or Gauss-Newton, or randomized ones, such as genetic algorithms), and specifying user choices (initial guesses, step sizes, convergence tests, early stopping and/or regularization measures, seeds and cross-over/mutation rates, ...), etc. Therefore, machine learning is still more of an art than a science, and many results published in the literature do not even satisfy the minimal scientific requirement of ‘reproducibility’. Admittedly, this situation is not new. Even in the (‘classical’) field of system identification, or statistical modeling of linear time-invariant, finite-dimensional dynamical systems, the origins of which predate those of deep learning by many years, heuristics prevail. Of course, there is a certain systematic methodology to tackle data-based identification problems:
1. Collect data;
2. Select a pre-specified model class that is parametrized by unknown parameters;
3. Choose an appropriate approximation criterion;
4. ‘Solve’ the resulting nonlinear optimization problem by numerical algorithms that output ‘optimal’ parameters;
5. Validate the resulting model on validation/test sets or by assessing the quality of predictions.
6. Re-iterate the whole loop when necessary.
It is in Step 4 that plenty of heuristics are still required. Typically, the optimization problem is nonlinear and provably non-convex. The nonlinearities originate in the assumptions that are made on the data model (e.g. additive misfit on the observations, or unobserved additional inputs), and in the fact that products between unknowns occur (e.g. between unknown model parameters and unobserved inputs as in ARMAX models, or between unknown model matrices and states in subspace identification methods). The nonlinearities can be dealt with depending on the identification framework one deploys: by invoking asymptotic orthogonality and numerical linear algebra tools in subspace identification, by using the machinery of instrumental variables in errors-in-variables approaches, or by implementing iterative nonlinear least squares algorithms as in Prediction-Error-Methods. Despite hundreds of person-years of experience, and thousands of published papers, books and software suites, all of these ‘solutions’ contain plenty of heuristics.

The main message of this talk is that we have to be braver: there is a lot of old and new mathematics out there that we collectively ignore, but that could provide us with a deeper understanding of our identification approaches. Specifically, for the problem of identifying LTI models from observed data, the central observation is that all nonlinearities involved are multivariate polynomial. With least squares as an approximation criterion, we end up with a multivariate polynomial optimization problem. From algebraic geometry, we know that such a problem is fundamentally equivalent to a large-scale eigenvalue problem. We will show how the set of first-order optimality conditions comprises a multi-parameter eigenvalue problem (MEVP), the solution of which is surprisingly little studied in the literature, requiring ‘new’ mathematics: such a MEVP can be solved by recursively building up a quasi-block-Toeplitz block Macaulay matrix, the null space of which is block multi-shift invariant (a property studied in operator theory). By then applying multi-dimensional (exact) realization theory in that null space (new multi-dimensional system theory), we can find the optimal parameters from the eigenvectors and eigenvalues of a specific large-scale matrix. This ‘solution’, as described, satisfies the very scientific definition of the word, because a set of minimal a priori assumptions on data and model leads to a sequence of mathematically verifiable and reconstructable steps that uniquely characterize the optimal solution as an eigenvalue problem. Even if heuristics are still needed to compute the optimal solution because of the sheer size of the problem, we now know that we are looking for a minimizing eigenvalue-eigenvector pair of a matrix constructed from the data and the chosen model class.

Can we learn something about the challenges posed by deep learning from this very special case of linear system identification? The answer is a resounding ‘yes’, and we will provide some first indications of why our mathematical framework as outlined above might also be applicable to neural networks. The take-home message is that more - old and new - mathematics is mandatory to understand and explain deep learning. The endeavor is difficult, challenging, time consuming, requires patience and endurance, and the research involved is definitely high-risk, high-gain.
The journey will take us into the realms of mathematical disciplines such as one- and multi-dimensional system theory, algebraic geometry, operator theory, numerical linear algebra and optimization. For sure, diving deeper into the mathematical depths of neural networks is largely atypical in the current massive wave of heuristic deep learning activities. It will require a creative understanding of mathematics that is more profound than the acquired know-how most practitioners of AI and deep learning currently possess. Yet, there is no alternative if we want to turn the toolbox of deep learning heuristics into real science.
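As a one-variable illustration of the polynomial-system-to-eigenvalue connection invoked above (a textbook fact, not the multiparameter Macaulay construction of the talk): minimizing a univariate polynomial $p(x) = \sum_{k=0}^{m} c_k x^k$ amounts to finding the real roots of $p'(x)$, and the roots of any monic polynomial $q(x) = x^{m} + a_{m-1} x^{m-1} + \dots + a_1 x + a_0$ are exactly the eigenvalues of its companion matrix
\[ C_q = \begin{pmatrix} 0 & 0 & \cdots & 0 & -a_0 \\ 1 & 0 & \cdots & 0 & -a_1 \\ 0 & 1 & \cdots & 0 & -a_2 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & -a_{m-1} \end{pmatrix}, \]
so the minimizer among the stationary points is obtained by evaluating $p$ at the real eigenvalues of $C_{p'}$ (with $p'$ made monic). The multivariate, multiparameter version of this idea is what the Macaulay-matrix construction generalizes.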
ETH-FDS seminar
More Mathematics Mandatory in Data Science
HG G 19.1
Fri 05.04.2019
15:15-16:00
Ulrik Brandes
ETH, Departement Geistes-, Sozial- und Staatswissenschaften
Abstract
If statistics is concerned with the collection, management, analysis, presentation, and interpretation of data, then so is network science. The distinction lies neither in a universal network theory nor in the complex systems of certain application domains, but in (i) an incidence structure relating the units of observation and (ii) a corresponding focus on dependencies among observations. While not mainstream, this view clears away some of the smoke and mirrors obscuring the close relationship between statistics and network science. Following this rather principled line of thought, I will introduce the positional approach to network analysis as a means to capitalize on it.
ZüKoSt Zürcher Kolloquium über Statistik
Network science as the dual of statistics
HG G 19.1
Fri 12.04.2019
15:15-16:00
Paulo Rodrigues
Bank of Portugal
Abstract
Standard tests based on predictive regressions estimated over the full available sample data have tended to find little evidence of predictability in stock returns. Recent approaches based on the analysis of subsamples of the data have been considered, suggesting that predictability, where it occurs, might exist only within so-called “pockets of predictability” rather than across the entire sample. However, these methods are prone to the criticism that the sub-sample dates are endogenously determined, such that the use of standard critical values appropriate for full-sample tests will result in incorrectly sized tests, leading to spurious findings of stock return predictability. To avoid the problem of endogenously determined sample splits, we propose new tests derived from sequences of predictability statistics systematically calculated over sub-samples of the data. Specifically, we will base tests on the maximum of such statistics from sequences of forward and backward recursive, rolling, and double-recursive predictive sub-sample regressions. We develop our approach using the over-identified instrumental variable-based predictability test statistics of Breitung and Demetrescu (2015). This approach is based on partial-sum asymptotics and so, unlike many other popular approaches including, for example, those based on Bonferroni corrections, can be readily adapted to implementation over sequences of sub-samples. We show that the limiting distributions of our proposed tests are robust to both the degree of persistence and the endogeneity of the regressors in the predictive regression, but not to any heteroskedasticity present, even if the sub-sample statistics are based on heteroskedasticity-robust standard errors. We therefore develop fixed-regressor wild bootstrap implementations of the tests, which we demonstrate to be first-order asymptotically valid. Finite-sample behaviour against a variety of temporarily predictable processes is considered. An empirical application to US stock returns illustrates the usefulness of the new predictability testing methods we propose.
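For orientation, a generic statement of the setup (not the authors' exact notation): the underlying predictive regression and a representative sup-type subsample statistic are
\[ r_{t+1} = \alpha + \beta x_t + u_{t+1}, \qquad H_0 : \beta = 0, \]
\[ \mathcal{T}^{\mathrm{fwd}} \;=\; \max_{\tau_0 \le \tau \le 1} \mathcal{S}\big(1, \lfloor \tau T \rfloor\big), \]
where $T$ is the full sample size and $\mathcal{S}(s, e)$ denotes the (instrument-based) test statistic for $H_0$ computed from observations $s, \dots, e$; analogous maxima are taken over backward recursive, rolling, and double-recursive sequences of subsamples.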
Research Seminar in Statistics
Testing for Episodic Predictability in Stock Returns
HG G 26.3
Fri 10.05.2019
15:15-16:00
Jon Wellner
University of Washington
Abstract
Multiplier empirical processes have proved to be one of the key unifying themes in modern empirical process theory, with statistical applications including basic symmetrization methods, bootstrap and resampling theory, and the analysis of empirical risk minimization procedures. At the heart of the theory of these multiplier processes, a collection of multiplier inequalities provides the basic tools which drive the theoretical developments. In this talk I will review some of the basic multiplier empirical processes and explain their importance for a variety of problems in statistics. I will briefly compare several multiplier inequalities, old and new, and then focus on the application of a new multiplier inequality, discussing one particular statistical application concerning convergence rates of least squares estimators (LSE) in regression models with possibly “heavy-tailed” errors. Particular cases involving sparse linear regression, shape restrictions, and finite sampling empirical processes will be mentioned briefly. (This talk is based on the University of Washington Ph.D. work of Qiyang (Roy) Han.)
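For readers new to the terminology, the object referred to above is, in standard notation,
\[ f \;\mapsto\; \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \xi_i \big( f(X_i) - P f \big), \qquad f \in \mathcal{F}, \]
where $X_1, \dots, X_n$ are i.i.d. observations, $\xi_1, \dots, \xi_n$ are independent multiplier weights (e.g. Rademacher, Gaussian, or bootstrap weights), and $\mathcal{F}$ is a function class; multiplier inequalities bound the supremum of this process over $\mathcal{F}$ in terms of the supremum of the ordinary empirical process, which is what drives symmetrization and bootstrap arguments.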
Research Seminar in Statistics
Multiplier Processes in Statistics: a new multiplier inequality and applications
HG G 19.1
Wed 22.05.2019
15:15-16:00
Haakon Bakka
King Abdullah University
Abstract
In this talk, we will first give a short background on (local) differential operators, sparse discretisations, and GMRFs, together with a practical motivation for continuously indexed models. Then we give a brief overview of some PDEs used to model physical processes and turn these into SPDEs, obtaining the corresponding spline penalties, spatial models, and spatio-temporal models. We outline a proof method for existence and uniqueness of solutions. These SPDEs naturally give rise to the class of Matérn models and extend this class in several directions that are interesting for applications. We apply the approach to two general settings and produce non-separable and non-stationary space-time models.
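As background for the model class mentioned above (the standard formulation of Lindgren, Rue, and Lindström, 2011, stated generically): a Gaussian field $x(s)$ on $\mathbb{R}^d$ with Matérn covariance of smoothness $\nu$ and scale parameter $\kappa$ solves the SPDE
\[ (\kappa^2 - \Delta)^{\alpha/2} \big(\tau\, x(s)\big) = \mathcal{W}(s), \qquad \alpha = \nu + d/2, \]
where $\mathcal{W}$ is Gaussian white noise; discretising the operator with a local basis yields a sparse-precision GMRF, and letting $\kappa$ and $\tau$ vary in space or adding temporal derivatives gives non-stationary and non-separable space-time extensions.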
Research Seminar in Statistics
Fundamental ideas of the stochastic partial differential equations approach to defining random effects
HG G 19.1
Fri 24.05.2019
15:15-16:00
Bin Yu
UC Berkeley
Abstract
In this talk, I'd like to discuss the intertwining importance and connections of three principles of data science in the title and the PCS workflow that is built on the three principles. The principles will be demonstrated in the context of collaborative projects in genomics for interpretable data results and testable hypothesis generation. If time allows, I will present proposed PCS inference that includes perturbation intervals and PCS hypothesis testing. The PCS inference uses prediction screening and takes into account both data and model perturbations. Finally, a PCS documentation is proposed based on Rmarkdown, iPython, or Jupyter Notebook, with publicly available, reproducible codes and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
Research Seminar in Statistics
Three principles of data science: predictability, computability, and stability (PCS)
HG G 19.1
Thu 13.06.2019
15:15-16:00
Andreas Buja
University of Pennsylvania
Abstract
In this talk we will rethink what we do when we apply regression to data. We will do so without assuming the correctness of a fitted regression model. We will think of the parameters of the fitted model as statistical functionals, here called "regression functionals," which apply to largely arbitrary (X,Y) distributions. In this view a fitted model is an approximation, not a "data generating process." A natural question is whether such an assumption-lean framework lends itself to a useful statistical theory. Indeed it does: It is possible to (1) define a notion of well-specification for regression functionals that replaces the notion of correct specification of models, (2) create a well-specification diagnostic for regression functionals based on reweighting the data, (3) prove insightful Central Limit Theorems, (4) clear up the misconception that "model bias" generates biased estimates, (5) exhibit standard errors of the plug-in/sandwich type as limiting cases of the pairs- or (X,Y)-bootstrap, and (6) provide theoretical heuristics to indicate that pairs-bootstrap standard errors may generally be more stable than sandwich standard errors.
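A concrete instance of a regression functional in the sense sketched above (the standard OLS example, not specific to the talk):
\[ \beta(P) \;=\; \arg\min_{b} \; \mathbb{E}_P\big[(Y - X^{\top} b)^2\big] \;=\; \mathbb{E}_P[X X^{\top}]^{-1}\, \mathbb{E}_P[X Y], \]
which is well defined for essentially arbitrary joint distributions $P$ of $(X, Y)$, with no requirement that $\mathbb{E}[Y \mid X]$ be linear; the plug-in estimator replaces $P$ by the empirical distribution, and the sandwich and pairs-bootstrap standard errors quantify its sampling variability without assuming the linear model is correct.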
Research Seminar in Statistics
Models as Approximations -- A Model-Free Theory of Parametric Regression
HG G 19.1
Wed 26.06.2019
16:00-17:00
Francis Bach
INRIA and École normale supérieure, PSL Research University, Paris, France
Abstract
We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional, easy problems, we show that for hard problems multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with the sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated with kernel methods, namely the decay of the eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model. (Joint work with Loucas Pillaud-Vivien and Alessandro Rudi.)
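A minimal sketch of the procedure under study (single-sample SGD for least squares with several passes over the data); the toy data, constant step size, and lack of averaging are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 20
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.5 * rng.standard_normal(n)

def sgd_least_squares(X, y, n_passes, step=0.01):
    """Single-sample SGD on the squared loss, cycling over shuffled data."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_passes):
        for i in rng.permutation(n):
            # gradient of 0.5 * (x_i' theta - y_i)^2 with respect to theta
            theta -= step * (X[i] @ theta - y[i]) * X[i]
    return theta

for n_passes in (1, 5, 20):
    theta = sgd_least_squares(X, y, n_passes)
    print(n_passes, "passes, training MSE:", np.mean((X @ theta - y) ** 2))
```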
ETH-FDS seminar
Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes
HG F 3
Tue 09.07.2019
11:00-12:00
Suvrit Sra
MIT
Abstract
Nonconvex optimization is currently enjoying intense interest in machine learning. This interest is largely fueled by deep learning, and also by more complex tasks in related areas. Although an understanding of why neural networks work so well remains elusive, there has been impressive progress in algorithms, software, and systems for nonconvex optimization. But in today's talk, I want to take a step back from algorithmic advances (fast nonconvex SGD, escaping saddle-points, adaptive methods, etc.). Instead, I want to draw your attention to a new set of tools that expand the repertoire of tractable nonconvex models. In particular, I will present a rich subclass of nonconvex problems that can be solved to global optimality by leveraging geometry. The specific concept that I'll discuss is geodesic convexity, which generalizes the usual vector-space notion of convexity to nonlinear spaces (more generally, I will also cover geodesically non-convex problems). My aim is to outline how geometric thinking leads to improved models or insights for fundamental tasks such as large-scale PCA, metric learning, Gaussian mixture models, among others. I will outline both theoretical and practical aspects, including iteration complexity theory, and conclude with some open problems.
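For reference, the central definition mentioned above: a function $f$ on a Riemannian manifold $M$ is geodesically convex if for every geodesic $\gamma : [0,1] \to M$,
\[ f(\gamma(t)) \;\le\; (1 - t)\, f(\gamma(0)) + t\, f(\gamma(1)) \qquad \text{for all } t \in [0,1]. \]
On $M = \mathbb{R}^n$ with straight-line geodesics this reduces to ordinary convexity, while on curved spaces such as the manifold of positive definite matrices, some objectives that are non-convex in the Euclidean sense become geodesically convex and hence amenable to global optimization.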
ETH-FDS seminar
Non-convex optimization through a geometric lens
ML F 34
Thu 11.07.2019
11:00-12:00
Stefanie Jegelka
MIT
Abstract
In recent years, neural networks have achieved impressive empirical results on a wide range of tasks. One fundamental step in understanding their theoretical properties is to understand their expressive power. Classical results study this question for shallow and very wide networks, whereas recent results address deeper networks that are more common in practice. We focus on two types of networks that are popular in practice: (1) the ResNet architecture, and (2) neural networks for graphs. For ResNet, we show how narrow the network can be - if it can be deep - to still be a universal approximator. Our result reveals a distinction from other architectures. For graph neural networks, we study what graphs they can distinguish, and what properties of the network affect this discriminative power.
ETH-FDS seminar
Representational Power of ResNet and Graph Neural Networks
CAB G 51
Fri 19.07.2019
11:00-12:00
Stefanie Jegelka
MIT
Suvrit Sra
MIT
Abstract
This tutorial provides an introduction to a rapidly evolving topic: the theory of negative dependence and its numerous ramifications in machine learning. Indeed, negatively dependent probability measures provide a powerful tool for modeling non-i.i.d. data, and thus can impact all aspects of learning, including supervised, unsupervised, interpretable, interactive, and large-scale setups. The most well-known examples of negatively dependent distributions are perhaps the Determinantal Point Processes (DPPs), which have already found numerous ML applications. But DPPs are just the tip of the iceberg; the class of negatively dependent measures is much broader, and given the vast web of mathematical connections it enjoys, it holds great promise as a tool for machine learning. This tutorial exposes the ML audience to this rich mathematical toolbox, while outlining key theoretical ideas and motivating fundamental applications. Tasks that profit from negative dependence include anomaly detection, information maximization, experimental design, validation of black-box systems, architecture learning, fast MCMC sampling, dataset summarization, and interpretable learning. Time permitting, we will also touch upon recent related breakthroughs on the broader class of so-called "completely log-concave" polynomials.
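For readers unfamiliar with the flagship example mentioned above: a determinantal point process with positive semidefinite kernel matrix $L$ over a ground set assigns to each subset $S$ the probability
\[ \Pr(S) \;\propto\; \det(L_S), \]
where $L_S$ is the submatrix of $L$ indexed by $S$; because the determinant shrinks when the selected items' feature vectors are nearly collinear, similar items are unlikely to co-occur, which is the sense in which DPP samples are negatively dependent and ‘diverse’.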
ETH-FDS seminar
Tutorial by Stefanie Jegelka and Suvrit Sra on Negative Dependence, Stable Polynomials, and All That
CAB G 51
Mon 22.07.2019
11:00-12:00
Stefanie Jegelka
MIT
Suvrit Sra
MIT
Abstract
This tutorial provides an introduction to a rapidly evolving topic: the theory of negative dependence and its numerous ramifications in machine learning. Indeed, negatively dependent probability measures provide a powerful tool for modeling non-i.i.d. data, and thus can impact all aspects of learning, including supervised, unsupervised, interpretable, interactive, and large-scale setups. The most well-known examples of negatively dependent distributions are perhaps the Determinantal Point Processes (DPPs), which have already found numerous ML applications. But DPPs are just the tip of the iceberg; the class of negatively dependent measures is much broader, and given the vast web of mathematical connections it enjoys, it holds great promise as a tool for machine learning. This tutorial exposes the ML audience to this rich mathematical toolbox, while outlining key theoretical ideas and motivating fundamental applications. Tasks that profit from negative dependence include anomaly detection, information maximization, experimental design, validation of black-box systems, architecture learning, fast MCMC sampling, dataset summarization, and interpretable learning. Time permitting, we will also touch upon recent related breakthroughs on the broader class of so-called "completely log-concave" polynomials.
ETH-FDS seminar
Tutorial by Stefanie Jegelka and Suvrit Sra on Negative Dependence, Stable Polynomials, and All That
CAB G 51