Seminar overview


Spring Semester 2017

Date & Time Speaker Title Location
Wed 22.02.2017
16:15-17:00
Stef van Buuren
Utrecht University
Gerko Vink
Utrecht University
Abstract
Nearly all data analytic procedures in R are designed for complete data and fail if the data contain NAs. Most procedures simply ignore any incomplete rows in the data, or use ad-hoc fixes like replacing NA with the "best value". However, such fixes may introduce serious biases in the ensuing statistical analysis. Multiple imputation is a principled solution to this problem and is implemented in the R package MICE. In this talk we will give a compact overview of MICE capabilities for R experts, followed by a discussion. (A minimal usage sketch follows this entry.)
ZüKoSt Zürcher Kolloquium über Statistik
A quick tour with the mice package for imputing missing data
HG G 19.1
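As a companion to the mice abstract above, here is a minimal sketch of the kind of workflow the package supports; the nhanes example data and the chl ~ age + bmi model come from the package documentation and are illustrative choices, not taken from the talk.

```r
# Multiple imputation with mice: impute, analyse each completed data set, pool.
library(mice)

imp <- mice(nhanes, m = 5, method = "pmm", seed = 1)  # 5 imputations by predictive mean matching
fit <- with(imp, lm(chl ~ age + bmi))                 # fit the analysis model on each completed set
summary(pool(fit))                                    # combine the estimates with Rubin's rules
```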
Tue 28.02.2017
11:15-12:00
Po-Ling Loh
University of Wisconsin-Madison
Abstract
We consider the problem of influence maximization in fixed networks, for both stochastic and adversarial contagion models. In the stochastic setting, nodes are infected in waves according to linear threshold or independent cascade models. We establish upper and lower bounds for the influence of a subset of nodes in the network, where the influence is defined as the expected number of infected nodes at the conclusion of the epidemic. We quantify the gap between our upper and lower bounds in the case of the linear threshold model and illustrate the gains of our upper bounds for independent cascade models in relation to existing results. Importantly, our lower bounds are monotonic and submodular, implying that a greedy algorithm for influence maximization is guaranteed to produce a maximizer within a factor of 1-1/e of the optimum. In the adversarial setting, an adversary is allowed to specify the edges through which contagion may spread, and the player chooses sets of nodes to infect in successive rounds. We establish upper and lower bounds on the pseudo-regret for possibly stochastic strategies of the adversary and player. This is joint work with Justin Khim and Varun Jog. (A generic greedy-selection sketch follows this entry.)
Research Seminar in Statistics
Influence maximization in stochastic and adversarial settings
HG G 19.1
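The greedy guarantee mentioned in the abstract above can be illustrated with generic Monte Carlo greedy seed selection under an independent cascade model. The graph, edge probabilities and simulation budget below are made up for illustration; this is the textbook greedy scheme, not the bounds developed in the talk.

```r
# Greedy seed selection for influence maximization under an independent
# cascade model, with the spread estimated by Monte Carlo simulation.
set.seed(1)
n_nodes <- 30
p_edge  <- matrix(rbinom(n_nodes^2, 1, 0.1) * 0.2, n_nodes, n_nodes)  # p_edge[i, j] = P(i infects j)
diag(p_edge) <- 0

spread <- function(seeds, nsim = 200) {
  mean(replicate(nsim, {
    active <- rep(FALSE, n_nodes); active[seeds] <- TRUE
    frontier <- seeds
    while (length(frontier) > 0) {
      newly <- integer(0)
      for (i in frontier) {
        hits <- which(runif(n_nodes) < p_edge[i, ] & !active)  # one infection attempt per out-edge
        active[hits] <- TRUE
        newly <- c(newly, hits)
      }
      frontier <- unique(newly)
    }
    sum(active)                                                # infected nodes at the end of the epidemic
  }))
}

greedy_seeds <- function(k) {
  chosen <- integer(0)
  for (s in seq_len(k)) {
    cand   <- setdiff(seq_len(n_nodes), chosen)
    gains  <- sapply(cand, function(v) spread(c(chosen, v)))
    chosen <- c(chosen, cand[which.max(gains)])                # add the node with largest estimated spread
  }
  chosen
}

greedy_seeds(3)   # a 3-node seed set chosen by the (1 - 1/e)-style greedy rule
```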
Thu 02.03.2017
16:15-17:00
Ben Marwick
University of Washington, Seattle
Abstract
"Long considered an axiom of science, the reproducibility of scientific research has recently come under scrutiny after some highly-publicized failures to reproduce results. This has often been linked to the failure of the current model of journal publishing to provide enough details for reviewers to adequately assess the correctness of papers submitted for publication. One early proposal for ameliorating this situation is to bundle the different files that make up a research result into a publicly-available 'compendium'. At the time it was originally proposed, creating a compendium was a complex process. In this talk I show how modern software tools and services have substantially lightened the burden of making compendia. I describe current approaches to making these compendia to accompany journal articles. Several recent projects of varying sizes are briefly presented to show how my colleagues and I are using R and related tools (e.g. version control, continuous integration, containers, repositories) to make compendia for our publications. I explain how these approaches, which we believe to be widely applicable to many types of research work, subvert the constraints of the typical journal article, and improve the efficiency and reproducibility of our research."

More information: https://www.math.ethz.ch/sfs/news-and-events/seminar-applied-statistics.html
ZüKoSt Zürcher Kolloquium über Statistik
Reproducible Research Compendia via R packages
HG G 19.1
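Following the abstract above, here is a rough sketch of how a research compendium can be started as an R package using generic usethis helpers; the path and file names are placeholders, and the speaker's own tooling (the rrtools package associated with this line of work) offers a more complete, opinionated workflow than this sketch.

```r
# Skeleton of a research compendium organised as an R package
# (run the use_*() helpers from inside the newly created project).
library(usethis)

create_package("~/mycompendium")  # DESCRIPTION, NAMESPACE, R/ -- the compendium shell
use_git()                         # version control from the start
use_readme_rmd()                  # a README that is itself rebuilt from code
use_vignette("analysis")          # the analysis as a re-runnable document
use_testthat()                    # tests for the supporting functions
```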
Thu 06.04.2017
16:15-17:00
Sebastian Engelke
EPFL Lausanne
Abstract
Max-stable processes are suitable models for extreme events that exhibit spatial dependencies. The dependence measure is usually a function of Euclidean distance between two locations. In this talk, we explore two models for extreme events on an underlying graphical structure. Dependence is more complex in this case as it can no longer be explained by classical geostatistical tools. The first model concentrates on river discharges on a network in the upper Danube catchment, where flooding regularly causes huge damage. Inspired by the work of Ver Hoef and Peterson (2010) for non-extreme data, we introduce a max-stable process on the river network that allows flexible modeling of flood events and that enables risk assessment even at locations without a gauging station. The second approach studies conditional independence structures for threshold exceedances, which result in a factorization of the likelihoods of extreme events. This allows for the construction of parsimonious dependence models that respect the underlying graph. (The defining max-stability property is recalled after this entry.)
ZüKoSt Zürcher Kolloquium über Statistik
Models for extremes on graphs
HG G 19.1
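As background for the abstract above (standard definitions, not results from the talk): a process $Z$ with unit Fréchet margins is max-stable if componentwise maxima of independent copies $Z_1,\dots,Z_n$ have the same law,
\[
n^{-1}\max_{i=1,\dots,n} Z_i(\cdot) \;\stackrel{d}{=}\; Z(\cdot),
\]
and pairwise dependence between two locations $s_1,s_2$ (for example two gauging stations) is often summarized by the extremal coefficient $\theta(s_1,s_2)\in[1,2]$ defined through
\[
\Pr\{Z(s_1)\le z,\ Z(s_2)\le z\} \;=\; \exp\{-\theta(s_1,s_2)/z\},
\]
with $\theta=1$ for complete dependence and $\theta=2$ for independence. The models in the talk replace Euclidean distance in such dependence functions by structures adapted to the river network or graph.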
Fri 07.04.2017
15:15-16:00
Tommaso Proietti
University of Rome, Tor Vergata
Abstract
A recent strand of the time series literature has considered the problem of estimating high-dimensional autocovariance matrices for the purpose of out-of-sample prediction. For an integrated time series, the Beveridge-Nelson trend is defined as the current value of the series plus the sum of all forecastable future changes. For the optimal linear projection of all future changes onto the space spanned by the past of the series, we need to solve a high-dimensional Toeplitz system involving $n$ autocovariances, where $n$ is the sample size. The paper proposes a non-parametric estimator of the trend that relies on banding, or tapering, the sample partial autocorrelations via a regularized Durbin-Levinson algorithm. We derive the properties of the estimator and compare it with alternative parametric estimators based on the direct and indirect finite-order autoregressive predictors. (A toy version of the tapered recursion follows this entry.)
Research Seminar in Statistics
Optimal linear prediction of stochastic trends
HG G 19.1
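To make the "banding, or tapering, the sample partial autocorrelations" step above concrete, here is a toy R sketch of a Durbin-Levinson recursion in which the sample partial autocorrelations are simply set to zero beyond a banding lag. The simulated ARMA series, the hard-banding rule and the tuning constants are illustrative stand-ins for the regularization actually studied in the paper.

```r
# Toy regularized Durbin-Levinson: band the sample partial autocorrelations
# at lag p_band and rebuild the long linear predictor from them.
set.seed(1)
x <- arima.sim(list(ar = 0.7, ma = 0.3), n = 500)   # illustrative stationary series

h      <- 125                                        # length of the linear predictor
g      <- acf(x, lag.max = h, type = "covariance", plot = FALSE)$acf[, 1, 1]  # g[m + 1] = gamma(m)
p_band <- 20                                         # banding parameter (tuning constant)

phi <- numeric(h)                                    # predictor coefficients
v   <- g[1]                                          # prediction error variance
for (k in 1:h) {
  num    <- g[k + 1] - if (k > 1) sum(phi[1:(k - 1)] * g[k:2]) else 0
  pacf_k <- num / v                                  # sample partial autocorrelation at lag k
  if (k > p_band) pacf_k <- 0                        # regularization: kill high-lag PACF
  if (k > 1) phi[1:(k - 1)] <- phi[1:(k - 1)] - pacf_k * phi[(k - 1):1]
  phi[k] <- pacf_k
  v <- v * (1 - pacf_k^2)
}

x_hat <- sum(phi * rev(tail(x, h)))                  # one-step-ahead linear prediction
```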
Mon 10.04.2017
15:15-16:00
Shahar Mendelson
Australian National University, Canberra, and Technion, Haifa
Abstract
We study the geometry of the natural function-class extension of a random projection of a subset of $R^d$: for a class of functions $F$ defined on the probability space $(\Omega,\mu)$ and an iid sample $X_1,\dots,X_N$ with each $X_i$ distributed according to $\mu$, the corresponding coordinate projection of $F$ is the set $\{ (f(X_1),\dots,f(X_N)) : f \in F\} \subset R^N$. We explain how structural information on such random sets can be derived and then used to address various questions in high-dimensional statistics (e.g. regression problems), high-dimensional probability (e.g. the extremal singular values of certain random matrices) and high-dimensional geometry (e.g. Dvoretzky-type theorems). Our focus is on results that are (almost) universally true, with minimal assumptions on the class $F$; these results are established using the recently introduced small-ball method. (The small-ball condition is recalled after this entry.)
Research Seminar in Statistics
The small-ball method and the structure of random coordinate projections
HG G 19.2
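For readers unfamiliar with the term, the small-ball condition referred to above is usually stated as follows (a standard formulation, independent of the talk): there exist constants $\kappa>0$ and $\delta>0$ such that
\[
\Pr\bigl(|f(X)| \ge \kappa\, \|f\|_{L_2(\mu)}\bigr) \ge \delta \qquad \text{for every } f \in F .
\]
It is a weak, moment-free way of preventing functions in $F$ from putting too much mass near zero, and it is what drives lower bounds on the size of the coordinate projection $\{(f(X_1),\dots,f(X_N)) : f \in F\}$.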
Mon 10.04.2017
16:30-17:15
Mladen Kolar
The University of Chicago
Abstract
In this talk, I will present two recent ideas that can help solve large-scale optimization problems. In the first part, I will present a method for solving $\ell_1$-penalized linear and logistic regression problems where data are distributed across many machines. In such a scenario it is computationally expensive to communicate information between machines. Our proposed method requires a small number of rounds of communication to achieve the optimal error bound. Within each round, every machine only communicates a local gradient to the central machine, and the central machine solves an $\ell_1$-penalized shifted linear or logistic regression. In the second part, I will discuss the use of sketching as a way to solve linear and logistic regression problems with a large sample size and many dimensions. This work is aimed at solving large-scale optimization problems on a single machine, while the extension to a distributed setting is work in progress. (A generic distributed sketch follows this entry.)
Research Seminar in Statistics
Some Recent Advances in Scalable Optimization
HG G 19.2
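As a point of contrast with the one-shot, shifted-regression scheme described above, the following sketch implements the most naive communication pattern for distributed $\ell_1$-penalized least squares: in every round each machine ships only its local gradient, and the center applies a soft-thresholding step. All data, dimensions and tuning constants are simulated or made up; this is generic distributed proximal gradient descent, not the method from the talk (which needs far fewer rounds of communication).

```r
# Naive distributed l1-penalized least squares: per round, each machine sends
# its local gradient; the center averages them and takes a proximal step.
set.seed(1)
p <- 50; n_per <- 200; M <- 5                        # dimension, per-machine sample size, machines
beta_true <- c(rep(1, 5), rep(0, p - 5))
shards <- lapply(1:M, function(m) {
  X <- matrix(rnorm(n_per * p), n_per, p)
  list(X = X, y = drop(X %*% beta_true) + rnorm(n_per))
})

soft       <- function(z, t) sign(z) * pmax(abs(z) - t, 0)
local_grad <- function(d, b) drop(crossprod(d$X, d$X %*% b - d$y)) / nrow(d$X)

beta <- rep(0, p); step <- 0.1; lambda <- 0.05
for (iter in 1:200) {
  g    <- Reduce(`+`, lapply(shards, local_grad, b = beta)) / M  # one p-vector sent per machine
  beta <- soft(beta - step * g, step * lambda)                   # central proximal (soft-threshold) update
}
round(head(beta, 8), 2)                                          # compare with beta_true
```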
Thu 27.04.2017
16:15-17:00
Marjolein Fokkema
Department of Methods and Statistics, Leiden University, NL
Abstract
Most statistical prediction methods provide a trade-off between accuracy and interpretability. For example, single classification trees may be easy to interpret, but likely provide lower predictive accuracy than many other methods. Random forests, on the other hand, may provide much better accuracy, but are more difficult to interpret, sometimes even termed black boxes. Prediction rule ensembles (PREs) aim to strike a balance between accuracy and interpretability. They generally consist of only a small set of prediction rules, which in turn can be depicted as very simple decision trees that are easy to interpret and apply. Friedman and Popescu (2008) proposed an algorithm for deriving PREs, which derives a large initial ensemble of prediction rules from the nodes of CART trees and selects a sparse final ensemble by regularized regression of the outcome variable on the prediction rules. The R package 'pre' takes a similar approach to deriving PREs and offers several additional advantages. For example, it employs an unbiased tree induction algorithm, allows for a random-forest type approach to deriving prediction rules, and allows for plotting of the final ensemble. In this talk, I will introduce PRE methodology and the package 'pre', illustrate with examples based on psychological research data, and discuss some future directions. (A minimal usage sketch follows this entry.)
ZüKoSt Zürcher Kolloquium über Statistik
Prediction rule ensembles, or a Japanese gardening approach to random forests
HG G 19.1
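A minimal usage sketch of the pre package follows; the airquality data and model formula mirror the package's own documentation rather than the psychological data discussed in the talk.

```r
# Fit a prediction rule ensemble, then inspect the selected rules and importances.
library(pre)

aq  <- airquality[complete.cases(airquality), ]
ens <- pre(Ozone ~ ., data = aq)   # derive rules from trees, select a sparse ensemble by penalized regression
print(ens)                         # the final set of rules and their coefficients
importance(ens)                    # input variable importances
```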
Thu 11.05.2017
16:15-17:00
Alexandre Pintore
Winton Capital Management
Abstract
In this presentation I will give an introduction on the work we do at Winton, and in particular describe some of the research challenges we face across the investment process, from data collection and analysis, to the forecasting of asset returns.
ZüKoSt Zürcher Kolloquium über Statistik
An introduction to Winton and research in financial markets
HG G 19.2
Fri 12.05.2017
15:15-16:00
Walter Distaso
Imperial College
Abstract
The analysis of jump spillovers across assets and markets is fundamental for risk management and portfolio diversification. This paper develops statistical tools for testing conditional independence among the jump components of the quadratic variation, which are measured as the sum of squared jump sizes over a day. To avoid sequential bias distortion, we do not pretest for the presence of jumps. We proceed in two steps. First, we derive the limiting distribution of the infeasible statistic, based on the unobservable jump component. Second, we provide sufficient conditions for the asymptotic equivalence of the feasible statistic based on realized jumps. When the null is true, and both assets have jumps, the statistic weakly converges to a Gaussian random variable. When instead at least one asset has no jumps, the statistic approaches zero in probability. We then establish the validity of m-out-of-n ("moon") bootstrap critical values. If the null is true and both assets have jumps, both statistics have the same limiting distribution. In the absence of jumps in at least one asset, the bootstrap-based statistic converges to zero at a slower rate. Under the alternative, the bootstrap statistic diverges at a slower rate. Altogether, this means that the use of bootstrap critical values ensures a consistent test with asymptotic size equal to or smaller than alpha. We finally provide an empirical illustration using transactions data on futures and ETFs. (The relevant quadratic-variation notation is recalled after this entry.)
Research Seminar in Statistics
Testing for jump spillovers without testing for jumps
HG G 19.1
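To fix notation for the abstract above (standard quadratic-variation notation; the paper's exact estimators are not reproduced here): writing $X_\ell$ for the log-price of asset $\ell$, its quadratic variation over day $t$ splits into a continuous part and a jump part,
\[
QV^{(\ell)}_t \;=\; \int_{t-1}^{t} \sigma_{\ell,s}^{2}\,ds \;+\; \sum_{t-1 < s \le t} \bigl(\Delta X_{\ell,s}\bigr)^{2},
\qquad
J^{(\ell)}_t \;:=\; \sum_{t-1 < s \le t} \bigl(\Delta X_{\ell,s}\bigr)^{2},
\]
and the null hypothesis concerns conditional independence between the daily jump components $J^{(1)}_t$ and $J^{(2)}_t$; the feasible statistic replaces these unobservable quantities by realized-jump estimates built from high-frequency returns.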
Thu 18.05.2017
16:15-17:00
Philip O'Neill
University of Nottingham
Abstract
In 1967, an outbreak of smallpox occurred in the Nigerian town of Abakaliki. The details were recorded in a World Health Organisation report, and the resulting data set has reappeared numerous times in the literature on infectious disease modelling. Surprisingly, in virtually all cases most of the available data are ignored. Moreover, the one previous analysis which does consider the full data set uses approximation methods to fit a stochastic transmission model. We present a new analysis which avoids such approximations, using data-augmented Markov chain Monte Carlo methods.
ZüKoSt Zürcher Kolloquium über Statistik
Modelling and Bayesian inference for the Abakaliki smallpox data
HG G 19.1
Tue 06.06.2017
11:00-12:00
Yuansi Chen
University of California, Berkeley
Abstract
Vision in humans and in non-human primates is mediated by a constellation of hierarchically organized visual areas. Visual cortex area V4 lies after V1 and V2 on the ventral pathway and, because of its highly nonlinear response properties, is a challenging area to model. We demonstrate that area V4 of the primate visual cortex can be accurately modeled via transfer learning of convolutional neural networks (CNNs). We also find that several different neural network architectures lead to similar predictive performance. This fact, combined with the high dimensionality of the models, makes model interpretation challenging. Hence, we introduce stability-based principles to interpret these models and explain V4 neurons' pattern selectivity. (A sketch of the linear readout step follows this entry.)
Research Seminar in Statistics
CNNs meet real neurons: transfer learning and pattern selectivity in V4
HG G 26.3
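The transfer-learning step above amounts, in its simplest form, to fitting a regularized linear readout from fixed CNN features to a neuron's responses. The sketch below simulates both the features and the responses (a real pipeline would extract features from a pretrained network for each stimulus) and uses ridge regression via glmnet as a generic readout; none of this is the authors' actual architecture or fitting procedure.

```r
# Regularized linear readout: predict a neuron's response from (precomputed) CNN features.
library(glmnet)

set.seed(1)
n_stim <- 500; n_feat <- 1000
feats  <- matrix(rnorm(n_stim * n_feat), n_stim, n_feat)  # stand-in for CNN layer activations
w      <- rnorm(n_feat) * rbinom(n_feat, 1, 0.02)         # a sparse "true" readout, for simulation only
resp   <- drop(feats %*% w) + rnorm(n_stim)               # simulated firing rates

fit <- cv.glmnet(feats, resp, alpha = 0)                   # ridge readout, lambda chosen by CV
cor(drop(predict(fit, feats, s = "lambda.min")), resp)     # crude in-sample check of the fit
```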
Tue 13.06.2017
11:15-12:00
Caroline Uhler
Massachusetts Institute of Technology, IDSS
Abstract
A recent breakthrough in genomics makes it possible to perform perturbation experiments at a very large scale. In order to learn gene regulatory networks from the resulting data, efficient and reliable causal inference algorithms are needed that can make use of both observational and interventional data. I will present an algorithm of this type and prove that it is consistent under the faithfulness assumption. This algorithm is based on a greedy permutation search and is a hybrid approach that uses conditional independence relations in a score-based method. Hence, this algorithm is non-parametric, which makes it useful for analyzing inherently non-Gaussian gene expression data. We will end by analyzing its performance on simulated data, protein signaling data, and single-cell gene expression data.
Research Seminar in Statistics
Permutation-based causal inference algorithms with interventions
HG G 19.2
Fri 16.06.2017
15:15-16:00
Kim Hendrickx
Hasselt University
Abstract
In statistics one often has to find a method for analyzing data which are only indirectly given. One such situation is "current status data", which only tell us whether a certain event has already taken place or has not yet happened: one observes the "current status" of the matter. We consider a simple linear regression model where the dependent variable is not observed due to current status censoring and where no assumptions are made on the distribution of the unobserved random error terms. For this model, the theoretical performance of the maximum likelihood estimator (MLE), maximizing the likelihood of the data over all possible distribution functions and all possible regression parameters, is still an open problem. We construct $\sqrt{n}$-consistent and asymptotically normal estimates for the finite-dimensional regression parameter in the current status linear regression model, which do not require any smoothing device and are based on maximum likelihood estimates (MLEs) of the infinite-dimensional parameter. We also construct estimates, again only based on these MLEs, which are arbitrarily close to efficient estimates if the generalized Fisher information is finite. (The model is written out after this entry.)
Research Seminar in Statistics
Current status linear regression
HG G 19.1
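To spell out the model in the abstract above (a standard formulation; the talk's precise assumptions may differ in detail): the response follows a linear model but is never observed directly; instead one sees an inspection time and the current status,
\[
Y_i = \beta_0^{\top} X_i + \varepsilon_i, \qquad \text{observed:}\ (T_i, X_i, \Delta_i)\ \text{with}\ \Delta_i = \mathbf{1}\{Y_i \le T_i\},
\]
so that, writing $F$ for the unknown error distribution and assuming the error is independent of $(T_i, X_i)$, $\Pr(\Delta_i = 1 \mid T_i, X_i) = F(T_i - \beta_0^{\top} X_i)$, and the MLE maximizes the resulting likelihood jointly over the finite-dimensional $\beta$ and the infinite-dimensional $F$.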
Thu 29.06.2017
15:15-16:00
Victor Chernozhukov
MIT
Abstract
We revisit the classic semiparametric problem of inference on a low-dimensional parameter $\theta_0$ in the presence of high-dimensional nuisance parameters $\eta_0$. We depart from the classical setting by allowing $\eta_0$ to be so high-dimensional that the traditional assumptions, such as Donsker properties, that limit complexity of the parameter space for this object break down. To estimate $\eta_0$, we consider the use of statistical or machine learning (ML) methods, which are particularly well-suited to estimation in modern, very high-dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating $\eta_0$ cause a heavy bias in estimators of $\theta_0$ that are obtained by naively plugging ML estimators of $\eta_0$ into estimating equations for $\theta_0$. This bias results in the naive estimator failing to be $N^{-1/2}$-consistent, where $N$ is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest $\theta_0$ can be removed by using two simple, yet critical, ingredients: (1) using Neyman-orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate $\theta_0$, and (2) making use of cross-fitting, which provides an efficient form of data-splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in an $N^{-1/2}$-neighborhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements. The generic statistical theory of DML is elementary and simultaneously relies on only weak theoretical requirements, which admit the use of a broad array of modern ML methods for estimating the nuisance parameters, such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by applying it to provide theoretical properties of DML applied to learn the main regression parameter in a partially linear regression model, DML applied to learn the coefficient on an endogenous variable in a partially linear instrumental variables model, DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness, and DML applied to learn the local average treatment effect in an instrumental variables setting. In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples. (A cross-fitting sketch follows this entry.)
Research Seminar in Statistics
Double/debiased machine learning for treatment and structural parameters
HG G 19.1
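As an illustration of the two ingredients above, here is a compact cross-fitted DML sketch for the partially linear model $Y = D\theta_0 + g_0(X) + U$, using random forests for both nuisance fits and the standard partialling-out (residual-on-residual) score; the simulated data-generating process, the number of folds and the forest settings are arbitrary choices for the example.

```r
# Cross-fitted DML for the partially linear model, with random-forest nuisance fits.
library(randomForest)

set.seed(1)
n <- 1000; p <- 10; theta0 <- 0.5
X <- matrix(rnorm(n * p), n, p)
D <- sin(X[, 1]) + 0.5 * X[, 2] + rnorm(n)            # treatment depends nonlinearly on X
Y <- theta0 * D + cos(X[, 1]) + X[, 3]^2 / 2 + rnorm(n)

folds <- sample(rep(1:2, length.out = n))             # 2-fold cross-fitting
res_y <- res_d <- numeric(n)
for (k in 1:2) {
  tr <- folds != k; te <- folds == k
  m_hat <- randomForest(X[tr, ], D[tr])               # estimate E[D | X] on the other fold
  g_hat <- randomForest(X[tr, ], Y[tr])               # estimate E[Y | X] on the other fold
  res_d[te] <- D[te] - predict(m_hat, X[te, ])        # out-of-fold treatment residuals
  res_y[te] <- Y[te] - predict(g_hat, X[te, ])        # out-of-fold outcome residuals
}
theta_hat <- sum(res_d * res_y) / sum(res_d^2)        # Neyman-orthogonal, residual-on-residual estimate
se_hat <- sqrt(mean((res_y - res_d * theta_hat)^2 * res_d^2) / n) / mean(res_d^2)
c(estimate = theta_hat, se = se_hat)
```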