Seminar overview
Spring Semester 2017
Date & Time | Speaker | Title | Location |
---|---|---|---|
Wed 22.02.2017 16:15-17:00 |
Stef van Buuren (Utrecht University) and Gerko Vink (Utrecht University) |
Abstract
Nearly all data analytic procedures in R are designed for complete data and fail if the data contain NA's. Most procedures simply ignore any incomplete rows in the data, or use ad-hoc procedures like replacing NA with the "best value". However, such procedures for fixing NA's may introduce serious biases in the ensuing statistical analysis. Multiple imputation is a principled solution for this problem and is implemented in the R package MICE. In this talk we will give a compact overview of MICE capabilities for R experts, followed by a discussion.
ZüKoSt Zürcher Kolloquium über Statistik: A quick tour with the mice package for imputing missing data |
HG G 19.1 |
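The chained-equations idea behind MICE (talk above) can be sketched in a few lines. This is a toy numpy re-implementation for illustration, not the mice package itself: each incomplete column is regressed on the others, and its missing entries are redrawn from the predictive distribution over repeated sweeps.

```python
import numpy as np

def mice_sketch(X, n_iter=10, rng=None):
    """Minimal chained-equations imputation: each column containing NA's is
    regressed on the other columns, and its missing entries are redrawn
    from the fitted predictive distribution on every sweep."""
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # initialize with column-mean fill
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std()
            # draw imputations: prediction plus residual noise
            X[miss[:, j], j] = A[miss[:, j]] @ beta + \
                rng.normal(0.0, sigma, miss[:, j].sum())
    return X
```

In contrast to the ad-hoc single "best value" fill criticized in the abstract, the noise term preserves between-imputation variability; running the sweep several times with different seeds yields the multiple imputations that MICE pools.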
Tue 28.02.2017 11:15-12:00 |
Po-Ling Loh (University of Wisconsin-Madison) |
Abstract
We consider the problem of influence maximization in fixed networks, for both stochastic and adversarial contagion models. In the stochastic setting, nodes are infected in waves according to linear threshold or independent cascade models. We establish upper and lower bounds for the influence of a subset of nodes in the network, where the influence is defined as the expected number of infected nodes at the conclusion of the epidemic. We quantify the gap between our upper and lower bounds in the case of the linear threshold model and illustrate the gains of our upper bounds for independent cascade models in relation to existing results. Importantly, our lower bounds are monotonic and submodular, implying that a greedy algorithm for influence maximization is guaranteed to produce a maximizer within a 1-1/e factor of the truth. In the adversarial setting, an adversary is allowed to specify the edges through which contagion may spread, and the player chooses sets of nodes to infect in successive rounds. We establish upper and lower bounds on the pseudo-regret for possibly stochastic strategies of the adversary and player. This is joint work with Justin Khim and Varun Jog.
Research Seminar in Statistics: Influence maximization in stochastic and adversarial settings |
HG G 19.1 |
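The 1-1/e guarantee quoted in the abstract above is the classical greedy result for monotone submodular set functions; a minimal sketch of the greedy selection, with a hand-made coverage objective on an invented toy graph standing in for the expected influence:

```python
def greedy_max(ground, f, k):
    """Greedy maximization of a monotone submodular set function f:
    repeatedly add the element with the largest marginal gain.
    Guaranteed within a (1 - 1/e) factor of the optimal size-k set."""
    S = set()
    for _ in range(k):
        best = max((x for x in ground if x not in S),
                   key=lambda x: f(S | {x}) - f(S))
        S.add(best)
    return S

# toy submodular objective: coverage of closed neighborhoods in a small graph
neighbors = {0: {0, 1, 2}, 1: {1, 3}, 2: {2, 4}, 3: {3, 4, 5}, 4: {4}, 5: {5, 0}}
cover = lambda S: len(set().union(*(neighbors[v] for v in S))) if S else 0
```

Here `greedy_max(range(6), cover, 2)` picks nodes 0 and 3, whose neighborhoods together cover all six nodes. In the influence setting of the talk, `f` would be the (lower bound on the) expected number of infected nodes rather than plain coverage.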
Thu 02.03.2017 16:15-17:00 |
Ben Marwick (University of Washington, Seattle) |
Abstract
Long considered an axiom of science, the reproducibility of scientific research has recently come under scrutiny after some highly-publicized failures to reproduce results. This has often been linked to the failure of the current model of journal publishing to provide enough details for reviewers to adequately assess the correctness of papers submitted for publication. One early proposal for ameliorating this situation is to bundle the different files that make up a research result into a publicly-available 'compendium'. At the time it was originally proposed, creating a compendium was a complex process. In this talk I show how modern software tools and services have substantially lightened the burden of making compendia. I describe current approaches to making these compendia to accompany journal articles. Several recent projects of varying sizes are briefly presented to show how my colleagues and I are using R and related tools (e.g. version control, continuous integration, containers, repositories) to make compendia for our publications. I explain how these approaches, which we believe to be widely applicable to many types of research work, subvert the constraints of the typical journal article, and improve the efficiency and reproducibility of our research.
ZüKoSt Zürcher Kolloquium über Statistik: Reproducible Research Compendia via R packages (more information: https://www.math.ethz.ch/sfs/news-and-events/seminar-applied-statistics.html) |
HG G 19.1 |
Thu 06.04.2017 16:15-17:00 |
Sebastian Engelke (EPFL Lausanne) |
Abstract
Max-stable processes are suitable models for extreme events that exhibit spatial dependencies. The dependence measure is usually a function of Euclidean distance between two locations. In this talk, we explore two models for extreme events on an underlying graphical structure. Dependence is more complex in this case as it can no longer be explained by classical geostatistical tools.
The first model concentrates on river discharges on a network in the upper Danube catchment, where flooding regularly causes huge damage. Inspired by the work by Ver Hoef and Peterson (2010) for non-extreme data, we introduce a max-stable process on the river network that allows flexible modeling of flood events and that enables risk assessment even at locations without a gauging station.
The second approach studies conditional independence structures for threshold exceedances, which result in a factorization of the likelihoods of extreme events. This allows for the construction of parsimonious dependence models that respect the underlying graph.
ZüKoSt Zürcher Kolloquium über Statistik: Models for extremes on graphs |
HG G 19.1 |
Fri 07.04.2017 15:15-16:00 |
Tommaso Proietti (University of Rome, Tor Vergata) |
Abstract
A recent strand of the time series literature has considered the problem of estimating high-dimensional autocovariance matrices for the purpose of out-of-sample prediction. For an integrated time series, the Beveridge-Nelson trend is defined as the current value of the series plus the sum of all forecastable future changes. For the optimal linear projection of all future changes into the space spanned by the past of the series, we need to solve a high-dimensional Toeplitz system involving $n$ autocovariances, where $n$ is the sample size. The paper proposes a non-parametric estimator of the trend that relies on banding, or tapering, the sample partial autocorrelations, by a regularized Durbin-Levinson algorithm. We derive the properties of the estimator and compare it with alternative parametric estimators based on the direct and indirect finite order autoregressive predictors.
Research Seminar in Statistics: Optimal linear prediction of stochastic trends |
HG G 19.1 |
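The Durbin-Levinson recursion at the heart of the abstract above solves the Toeplitz prediction system in $O(n^2)$ operations via the partial autocorrelations, which makes tapering them a natural regularization point. A plain numpy sketch, with the tapering rule left as a user-supplied function (the paper's actual banding/tapering scheme is not reproduced here):

```python
import numpy as np

def durbin_levinson(gamma, taper=None):
    """Solve the Toeplitz system for one-step-ahead prediction coefficients
    from autocovariances gamma[0..n]. Each partial autocorrelation can
    optionally be shrunk (tapered) before entering the recursion.
    Returns the prediction coefficients and the innovation variance."""
    n = len(gamma) - 1
    phi = np.zeros(n)
    v = gamma[0]
    for k in range(n):
        # partial autocorrelation of order k+1
        num = gamma[k + 1] - phi[:k] @ gamma[1:k + 1][::-1]
        pac = num / v
        if taper is not None:
            pac *= taper(k + 1)   # regularization, e.g. banding or tapering
        # Levinson update of the lower-order coefficients
        phi[:k] = phi[:k] - pac * phi[:k][::-1]
        phi[k] = pac
        v *= (1.0 - pac ** 2)
    return phi, v
```

For an AR(1) series with autocovariances $\gamma_h = \rho^h \gamma_0$, the recursion returns $\phi_1 = \rho$ and all higher coefficients zero, as it should; a hard-banding taper would simply be `taper = lambda k: 1.0 if k <= bandwidth else 0.0`.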
Mon 10.04.2017 15:15-16:00 |
Shahar Mendelson (Australian National University, Canberra, and Technion, Haifa) |
Abstract
We study the geometry of the natural function class extension of a random projection of a subset of $R^d$: for a class of functions $F$ defined on the probability space $(\Omega,\mu)$ and an iid sample $X_1,\ldots,X_N$ with each of the $X_i$'s distributed according to $\mu$, the corresponding coordinate projection of $F$ is the set $\{(f(X_1),\ldots,f(X_N)) : f \in F\} \subset R^N$.
We explain how structural information on such random sets can be derived and then used to address various questions in high dimensional statistics (e.g. regression problems), high dimensional probability (e.g., the extremal singular values of certain random matrices) and high dimensional geometry (e.g., Dvoretzky type theorems). Our focus is on results that are (almost) universally true, with minimal assumptions on the class $F$; these results are established using the recently introduced small-ball method.
Research Seminar in Statistics: The small-ball method and the structure of random coordinate projections |
HG G 19.2 |
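To make the definition in the abstract above concrete (a toy illustration only; the function class and the sampling distribution here are invented): with $F$ a small family of linear functionals on $[-1,1]$, the coordinate projection is just the matrix of function values on the sample, i.e. one point of $R^N$ per function in $F$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
X = rng.uniform(-1.0, 1.0, N)        # iid sample X_1, ..., X_N from mu

# toy class F: linear functionals t -> a*t over a grid of slopes a
F = [lambda t, a=a: a * t for a in np.linspace(-1.0, 1.0, 9)]

# coordinate projection of F: {(f(X_1), ..., f(X_N)) : f in F}, a subset of R^N
proj = np.array([[f(x) for x in X] for f in F])
```

The structural questions in the talk concern what this random subset of $R^N$ looks like for much richer classes $F$, where each row is a random vector whose geometry the small-ball method controls.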
Mon 10.04.2017 16:30-17:15 |
Mladen Kolar (The University of Chicago) |
Abstract
In this talk, I will present two recent ideas that can help solve large scale optimization problems. In the first part, I will present a method for solving $\ell_1$-penalized linear and logistic regression problems where data are distributed across many machines. In such a scenario it is computationally expensive to communicate information between machines. Our proposed method requires a small number of rounds of communication to achieve the optimal error bound. Within each round, every machine only communicates a local gradient to the central machine, and the central machine solves a shifted $\ell_1$-penalized linear or logistic regression. In the second part, I will discuss the use of sketching as a way to solve linear and logistic regression problems with a large sample size and many dimensions. This work is aimed at solving large scale optimization procedures on a single machine, while the extension to a distributed setting is work in progress.
Research Seminar in Statistics: Some Recent Advances in Scalable Optimization |
HG G 19.2 |
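The gradient-only communication pattern described in the first part of the abstract above can be sketched generically. This is a toy proximal-gradient round in numpy, not the talk's actual method: the shifted subproblem solved at the central machine is replaced by a plain gradient step plus soft-thresholding, but the message per machine per round is still just one local gradient.

```python
import numpy as np

def distributed_round(machines, w, lr=0.1, lam=0.01):
    """One communication round for distributed ell-1 penalized least squares:
    each machine sends only its local gradient; the center averages the
    gradients, takes a step, and soft-thresholds for the ell-1 penalty."""
    grads = [X.T @ (X @ w - y) / len(y) for X, y in machines]  # local messages
    g = np.mean(grads, axis=0)                                 # aggregation
    w = w - lr * g
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox step

# toy run: a sparse truth, data split across 4 machines
rng = np.random.default_rng(0)
beta = np.array([2.0, 0.0, -1.0, 0.0])
machines = []
for _ in range(4):
    X = rng.normal(size=(50, 4))
    machines.append((X, X @ beta + 0.1 * rng.normal(size=50)))

w = np.zeros(4)
for _ in range(200):
    w = distributed_round(machines, w)
```

The point of the talk's method is precisely to avoid needing many such rounds: by solving a richer (shifted) local problem at the center, a constant number of rounds suffices for the optimal statistical error.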
Thu 27.04.2017 16:15-17:00 |
Marjolein Fokkema (Department of Methods and Statistics, Leiden University, NL) |
Abstract
Most statistical prediction methods provide a trade-off between accuracy and interpretability. For example, single classification trees may be easy to interpret, but likely provide lower predictive accuracy than many other methods. Random forests, on the other hand, may provide much better accuracy, but are more difficult to interpret, sometimes even termed black boxes. Prediction rule ensembles (PREs) aim to strike a balance between accuracy and interpretability. They generally consist of only a small set of prediction rules, which in turn can be depicted as very simple decision trees, which are easy to interpret and apply.
Friedman and Popescu (2008) proposed an algorithm for deriving PREs, which derives a large initial ensemble of prediction rules from the nodes of CART trees and selects a sparse final ensemble by regularized regression of the outcome variable on the prediction rules. The R package ‘pre’ takes a similar approach to deriving PREs and offers several additional advantages. For example, it employs an unbiased tree induction algorithm, allows for a random-forest type approach to deriving prediction rules, and allows for plotting of the final ensemble. In this talk, I will introduce PRE methodology and package 'pre', illustrate with examples based on psychological research data, and discuss some future directions.
ZüKoSt Zürcher Kolloquium über Statistik: Prediction rule ensembles, or a Japanese gardening approach to random forests |
HG G 19.1 |
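The two-stage recipe described in the abstract above (generate candidate rules from tree nodes, then select a sparse subset by regularized regression) can be sketched without the 'pre' or CART machinery. In this toy numpy version the rules are written by hand rather than read off fitted trees, and the selection stage is a bare-bones coordinate-descent lasso:

```python
import numpy as np

# hypothetical rules, each the kind of condition a tree node would yield
rules = [
    lambda x: x[0] > 0.5,                       # a root split
    lambda x: (x[0] > 0.5) and (x[1] <= 0.2),   # a depth-2 node
    lambda x: x[1] > 0.8,                       # another root split
]

def rule_features(X, rules):
    """Encode each rule as a 0/1 column: the 'rule ensemble' design matrix."""
    return np.array([[float(r(x)) for r in rules] for x in X])

def lasso_cd(Z, y, lam, n_iter=200):
    """Plain coordinate-descent lasso: regularized regression of the outcome
    on the rule indicators selects a sparse final ensemble."""
    n, p = Z.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid = y - Z @ beta + Z[:, j] * beta[j]
            rho = Z[:, j] @ resid / n
            zj = Z[:, j] @ Z[:, j] / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (zj + 1e-12)
    return beta
```

The nonzero coefficients of `beta` identify the surviving rules; each can then be read back as a one- or two-split decision tree, which is what makes the final ensemble interpretable.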
Thu 11.05.2017 16:15-17:00 |
Alexandre Pintore (Winton Capital Management) |
Abstract
In this presentation I will give an introduction to the work we do at Winton, and in particular describe some of the research challenges we face across the investment process, from data collection and analysis to the forecasting of asset returns.
ZüKoSt Zürcher Kolloquium über Statistik: An introduction to Winton and research in financial markets |
HG G 19.2 |
Fri 12.05.2017 15:15-16:00 |
Walter Distaso (Imperial College) |
Abstract
The analysis of jump spillovers across assets and markets is fundamental for risk management and portfolio diversification. This paper develops statistical tools for testing conditional independence among the jump components of the quadratic variation, which are measured as the sum of squared jump sizes over a day. To avoid sequential bias distortion, we do not pretest for the presence of jumps. We proceed in two steps. First, we derive the limiting distribution of the infeasible statistic, based on the unobservable jump component. Second, we provide sufficient conditions for the asymptotic equivalence of the feasible statistic based on realized jumps. When the null is true and both assets have jumps, the statistic weakly converges to a Gaussian random variable. When instead at least one asset has no jumps, the statistic approaches zero in probability. We then establish the validity of m-out-of-n ("moon") bootstrap critical values. If the null is true and both assets have jumps, both statistics have the same limiting distribution. In the absence of jumps in at least one asset, the bootstrap-based statistic converges to zero at a slower rate. Under the alternative, the bootstrap statistic diverges at a slower rate. Altogether, this means that the use of bootstrap critical values ensures a consistent test with asymptotic size equal to or smaller than alpha. We finally provide an empirical illustration using transactions data on futures and ETFs.
Research Seminar in Statistics: Testing for jump spillovers without testing for jumps |
HG G 19.1 |
Thu 18.05.2017 16:15-17:00 |
Philip O'Neill (University of Nottingham) |
Abstract
In 1967, an outbreak of smallpox occurred in the Nigerian town of Abakaliki. The details were recorded in a World Health Organisation report, and the resulting data set has reappeared numerous times in the literature on infectious disease modelling. Surprisingly, in virtually all cases most of the available data are ignored. Moreover, the one previous analysis which does consider the full data set uses approximation methods to fit a stochastic transmission model. We present a new analysis which avoids such approximations, using data-augmented Markov chain Monte Carlo methods.
ZüKoSt Zürcher Kolloquium über Statistik: Modelling and Bayesian inference for the Abakaliki smallpox data |
HG G 19.1 |
Tue 06.06.2017 11:00-12:00 |
Yuansi Chen (University of California, Berkeley) |
Abstract
Vision in humans and in non-human primates is mediated by a constellation of hierarchically organized visual areas. Visual cortex area V4, which has highly nonlinear response properties, is a challenging area to model; it lies downstream of V1 and V2 on the ventral pathway. We demonstrate that area V4 of the primate visual cortex can be accurately modeled via transfer learning of convolutional neural networks (CNNs). We also find that several different neural network architectures lead to similar predictive performance. This fact, combined with the high dimensionality of the models, makes model interpretation challenging. Hence, we introduce stability-based principles to interpret these models and explain V4 neurons' pattern selectivity.
Research Seminar in Statistics: CNNs meet real neurons: transfer learning and pattern selectivity in V4 |
HG G 26.3 |
Tue 13.06.2017 11:15-12:00 |
Caroline Uhler (Massachusetts Institute of Technology, IDSS) |
Abstract
A recent breakthrough in genomics makes it possible to perform perturbation experiments at a very large scale. In order to learn gene regulatory networks from the resulting data, efficient and reliable causal inference algorithms are needed that can make use of both observational and interventional data. I will present an algorithm of this type and prove that it is consistent under the faithfulness assumption. This algorithm is based on a greedy permutation search and is a hybrid approach that uses conditional independence relations in a score-based method. Hence, the algorithm is non-parametric, which makes it useful for analyzing inherently non-Gaussian gene expression data. We will end by analyzing its performance on simulated data, protein signaling data, and single-cell gene expression data.
Research Seminar in Statistics: Permutation-based causal inference algorithms with interventions |
HG G 19.2 |
Fri 16.06.2017 15:15-16:00 |
Kim Hendrickx (Hasselt University) |
Abstract
In statistics one often has to find a method for analyzing data which are only indirectly given. One such situation is when one has "current status data", which only give the information that a certain event has taken place or, on the other hand, still has not happened. So one observes the "current status" of the matter. We consider a simple linear regression model where the dependent variable is not observed due to current status censoring and where no assumptions are made on the distribution of the unobserved random error terms. For this model, the theoretical performance of the maximum likelihood estimator (MLE), maximizing the likelihood of the data over all possible distribution functions and all possible regression parameters, is still an open problem.
We construct $\sqrt{n}$-consistent and asymptotically normal estimates for the finite dimensional regression parameter in the current status linear regression model, which do not require any smoothing device and are based on maximum likelihood estimates (MLEs) of the infinite dimensional parameter. We also construct estimates, again only based on these MLEs, which are arbitrarily close to efficient estimates, if the generalized Fisher information is finite.
Research Seminar in Statistics: Current status linear regression |
HG G 19.1 |
Thu 29.06.2017 15:15-16:00 |
Victor Chernozhukov (MIT) |
Abstract
We revisit the classic semiparametric problem of inference on a low dimensional parameter $\theta_0$ in the presence of high-dimensional nuisance parameters $\eta_0$. We depart from the classical setting by allowing for $\eta_0$ to be so high-dimensional that the traditional assumptions, such as Donsker properties, that limit complexity of the parameter space for this object break down. To estimate $\eta_0$, we consider the use of statistical or machine learning (ML) methods which are particularly well-suited to estimation in modern, very high-dimensional cases. ML methods perform well by employing regularization to reduce variance and trading off regularization bias with overfitting in practice. However, both regularization bias and overfitting in estimating $\eta_0$ cause a heavy bias in estimators of $\theta_0$ that are obtained by naively plugging ML estimators of $\eta_0$ into estimating equations for $\theta_0$. This bias results in the naive estimator failing to be $N^{-1/2}$-consistent, where $N$ is the sample size. We show that the impact of regularization bias and overfitting on estimation of the parameter of interest $\theta_0$ can be removed by using two simple, yet critical, ingredients: (1) using Neyman-orthogonal moments/scores that have reduced sensitivity with respect to nuisance parameters to estimate $\theta_0$, and (2) making use of cross-fitting, which provides an efficient form of data-splitting. We call the resulting set of methods double or debiased ML (DML). We verify that DML delivers point estimators that concentrate in a $N^{-1/2}$-neighborhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements.
The generic statistical theory of DML is elementary and simultaneously relies on only weak theoretical requirements which will admit the use of a broad array of modern ML methods for estimating the nuisance parameters such as random forests, lasso, ridge, deep neural nets, boosted trees, and various hybrids and ensembles of these methods. We illustrate the general theory by applying it to provide theoretical properties of DML applied to learn the main regression parameter in a partially linear regression model, DML applied to learn the coefficient on an endogenous variable in a partially linear instrumental variables model, DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness, and DML applied to learn the local average treatment effect in an instrumental variables setting. In addition to these theoretical applications, we also illustrate the use of DML in three empirical examples.
Research Seminar in Statistics: Double/debiased machine learning for treatment and structural parameters |
HG G 19.1 |
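For the partially linear model mentioned in the abstract above, $Y = \theta_0 D + g(X) + U$, the two ingredients (Neyman-orthogonal score plus cross-fitting) reduce to a residual-on-residual regression across folds. A toy numpy sketch, where a simple k-nearest-neighbour regression stands in for the random forests, lasso, or neural nets named in the talk:

```python
import numpy as np

def knn_fit_predict(X_tr, y_tr, X_te, k=10):
    """Toy nuisance learner: 1-d k-nearest-neighbour regression,
    a stand-in for any of the ML methods named in the abstract."""
    d = np.abs(X_te[:, None] - X_tr[None, :])
    idx = np.argsort(d, axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

def dml_plr(Y, D, X, n_folds=2, seed=0):
    """Double ML for the partially linear model Y = theta*D + g(X) + U:
    cross-fit the nuisances E[Y|X] and E[D|X] on held-out folds, then
    regress the Y-residual on the D-residual (Neyman-orthogonal score)."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    num = den = 0.0
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        v = D[te] - knn_fit_predict(X[tr], D[tr], X[te])   # D residual
        u = Y[te] - knn_fit_predict(X[tr], Y[tr], X[te])   # Y residual
        num += v @ u
        den += v @ v
    return num / den
```

Because the nuisances are fitted on data disjoint from the fold where the score is evaluated, overfitting of the ML learner does not feed back into the estimate of $\theta_0$, which is the point of the cross-fitting ingredient.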