Master's theses



Student Title Advisor(s) Date
Lydia Braunack-Mayer 
Interference Between Common Respiratory Pathogens Professor Peter Bühlmann 
Professor Sebastian Bonhoeffer 
Professor Roger Kouyos 
Rhinovirus, Influenza and Respiratory Syncytial virus and other common bacteria and
viruses pose a serious burden to both individual and public health. These pathogens are
simultaneously present in a population and, yet, epidemiological studies of the complex
factors that cause these illnesses tend to focus on a single pathogen. The aim of this
thesis was to understand the shared determinants of infection by common bacteria and
respiratory viruses, focusing on interference between pathogens. Statistical inference was
applied to explore the incidence of infection by 16 common pathogens in multiplex PCR
tests conducted at the Universitätsspital Basel between June 2010 and September 2015.
With Fisher's exact tests for independence, cross-wavelet analyses and an SIR model
with cross-immunity, patterns in the pathogens detected were found to be consistent
with the hypothesis that, for a number of common respiratory viruses, infection by one
pathogen interferes with infection by a second.
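As an illustration of the independence tests mentioned above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 co-detection table for two pathogens. The function name and the counts are hypothetical and are not taken from the thesis; only the Python standard library is used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows = pathogen A detected / not detected,
# columns = pathogen B detected / not detected.
p_value = fisher_exact_two_sided(3, 47, 40, 110)
```

A small p-value together with fewer co-detections than expected under independence would be consistent with interference, although the test alone cannot establish a mechanism.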
Francesco Ortelli 
Statistics meets optimization: random projections and nearest neighbor search Prof. Dr. Sara van de Geer 
Benjamin Stucky 
In the era of big data the number of situations where one has to work with high-dimensional
data sets is growing. As a consequence the application of some statistical techniques to
such problems is slowed down considerably from a computational point of view by the
high dimensionality of the data: this phenomenon is called the curse of dimensionality.
Moreover sometimes it can even be costly to store the data itself. We present some
variants of the Johnson-Lindenstrauss Lemma, a data-oblivious dimensionality reduction
technique, and show how it can be applied to the (approximate) nearest neighbor search
problem in order to break the curse of dimensionality. When presenting the variants
of the Johnson-Lindenstrauss Lemma the focus will lie on the time required by their
computational application. For the application to the nearest neighbor search problem
we will see that the Johnson-Lindenstrauss Lemma represents the bottleneck in terms of
time required. Finally we will complete the work by performing some simulations, aimed
at understanding how to implement the theory in the best possible way.
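The basic idea of a data-oblivious random projection can be sketched as follows. This is a minimal illustration using a dense Gaussian projection matrix, which is only one of several possible variants and not necessarily one studied in the thesis; the function names, dimensions and seeds are assumptions chosen for the example.

```python
import math
import random


def gaussian_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries are N(0, 1/k), so squared distances are preserved in expectation.
    # The matrix is drawn without looking at the data (data-oblivious).
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in matrix] for p in points]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical data: 20 points in 400 dimensions, projected down to 200.
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(400)] for _ in range(20)]
projected = gaussian_projection(data, k=200)

# Ratio of projected to original distance for every pair of points.
ratios = [
    dist(projected[i], projected[j]) / dist(data[i], data[j])
    for i in range(len(data))
    for j in range(i + 1, len(data))
]
```

The ratios should concentrate around 1. A nearest neighbor search can then be run on the projected points, with the matrix multiplication as the dominant cost of this dense variant.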
Samuel Schaffhauser 
Detection of Hyperreflective Foci in Optical Coherence Tomography Prof. Dr. Nicolai Meinshausen 
Dr. Clarisa Sánchez
Diabetic macular edema (DME) is a retinal disorder characterised by accumulation
of cystoidal fluid in the retina. The current treatment consists of repeated anti-vascular
endothelial growth factor (anti-VEGF) injections. Recent studies indicate
that the presence and number of hyperreflective foci (HRF) could be a prognostic
biomarker for treatment response in DME. Since the detection of HRF is laborious,
manual foci quantifications seem infeasible. Therefore, an automated detection of
HRF in optical coherence tomography (OCT) images is designed to assist ophthalmologists
in their endeavour.
191 fovea-centred B-scans from 76 patients with DME were obtained from a
clinical database and serve as the training set. A further data set with 88 B-scans
from 39 patients forms the test set and contains annotations from two independent
observers. HRF were only annotated in the layers ranging from the inner
plexiform layer (IPL) to the outer nuclear layer (ONL) as manual detection is challenging
in the remaining layers. A supervised fully convolutional neural network
(CNN) trained on patches classifies the central pixel into hyperreflective foci or
background. The CNN consists of 7 convolutional layers and 2 max-pooling layers.
After being provided with enough training samples to fit its parameters, the system is capable
of detecting HRF in OCT B-scans. The derived results were compared to manual
annotations made by two human graders for the 3 mm region surrounding the
fovea in the central B-scan.
The classifier's free-response receiver operating characteristic (FROC) curve for
the independent test set lies above the operating point of the two independent graders,
taking one grader as truth and the other as classifier. Comparing the classifier to a
random forest with PCA components reveals that the performance is remarkable.
An image analysis algorithm for the automatic detection and quantification of
HRF in OCT B-scans was developed. The experiments show promising results for
using convolutional neural networks to obtain automated detection and foci-based
biomarkers for follow-up medical studies.
Lennart von Thiessen 
Linear Regression Based on Imputed Data Sets and a Further Look on missForest Prof. Dr. Peter Bühlmann 
Dr. Daniel Stekhoven 
Johannes Göbel 
Analysis of Financial Data with Nonlinear Time Series Approaches Dr. Lukas Meier  May-2017
In this thesis we analyse three sets of financial time series with R. We first review parametric
linear time series processes and show that they are not sufficient for analysing financial data. In
chapter 3 we introduce parametric nonlinear time series models, namely the ARCH processes, which
were introduced by Engle in 1982, and their extension, the GARCH processes, introduced independently
by Taylor and Bollerslev in 1986.
Parametric time series analysis gives good results only if the chosen model is the true data-generating
model, and choosing the wrong model introduces bias. In chapter 4 we therefore present the additive
nonlinear model as an example of nonparametric nonlinear time series models.
The R code we used in each chapter for the simulations and the analysis of the data can be found in
the Appendix. The following R packages were used: quantmod, FinTS, rugarch and mgcv.
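To make the GARCH idea concrete, here is a minimal simulation sketch of a GARCH(1,1) process with Gaussian innovations. The thesis itself used R (for example the rugarch package); this Python version, its function name and its parameter values are illustrative assumptions only.

```python
import math
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n steps of a GARCH(1,1) process with standard normal innovations.

    sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}
    eps_t    = sqrt(sigma2_t) * z_t,  z_t ~ N(0, 1)
    """
    rng = random.Random(seed)
    # Start at the unconditional variance, which exists because alpha + beta < 1.
    sigma2 = omega / (1.0 - alpha - beta)
    returns, variances = [], []
    for _ in range(n):
        eps = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        returns.append(eps)
        variances.append(sigma2)
        # Today's shock and variance determine tomorrow's conditional variance.
        sigma2 = omega + alpha * eps ** 2 + beta * sigma2
    return returns, variances


returns, variances = simulate_garch11(5000)
```

Setting beta to zero reduces the recursion to an ARCH(1) process. Large shocks raise the next conditional variance, which produces the volatility clustering that linear models cannot capture.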
Konrad Knuesel 
An ROC curve based comparison of multivariate classification methods Dr. Markus Kalisch  Apr-2017
The goal of this study is to compare classification methods that can be used to develop a diagnostic test based on multivariate data. The methods were evaluated based on their ROC curves. Following an introduction to the univariate ROC curve, estimators for the one-dimensional case were compared under a variety of simulation settings. These univariate estimators are: empirical, binormal, "log-concave," and kernel-smoothed. To evaluate the methods, data was simulated from known distributions under both small and large sample size settings. Comparing the accuracy of the estimators (defined by how well they approximate the true ROC curve), the binormal method performed best with small sample sizes while the log-concave and kernel-smoothed methods performed best with large sample sizes. The focus of the study then turns to the multivariate case. The following classification methods were compared in a simulation study: simple average, distribution-free, LDA, QDA, logistic regression, and SVM. In the two-dimensional case, every method performed similarly except for the simple average and SVM, which were considerably worse. In the six-dimensional simulation that followed, QDA and SVM were generally the best performing methods although in certain cases, LDA and logistic regression had somewhat better results. Finally, the classification methods were applied to a medical data set. In this case, LDA and logistic regression were found to have the best cross-validated performance.
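The empirical estimator mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical scores and function names, not the code used in the study.

```python
def empirical_roc(scores_neg, scores_pos):
    """Return the (FPR, TPR) points of the empirical ROC curve.

    A subject is classified as positive when its score is >= the threshold.
    """
    thresholds = sorted(set(scores_neg) | set(scores_pos), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test scores for non-diseased and diseased subjects.
roc_points = empirical_roc([0.1, 0.4], [0.35, 0.8])
area = auc(roc_points)
```

In this example three of the four (diseased, non-diseased) pairs are ordered correctly, so the area under the empirical curve is 0.75.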
Emmanuel Profumo 
Finding the number of clusters via standardization of validity plots using parametric bootstrap Dr. Martin Mächler  Mar-2017
In this thesis, we present and study a method to estimate the number of clusters
in a data set. The calibration method consists in comparing cluster validity index
values to the ones obtained under a reference distribution, yielding the so-called gap
statistic. Then, we present a reference model for the absence of clusters for mixed-type
data, which can be seen as a generalisation of models for continuous data. We give R
functions to implement this method and null models, and run a simulation on mixed-type
simulated data to test the performance of the calibration method depending on
parameters such as separability between clusters. We also propose a slightly modified version
of the gap statistic, and test it on our simulated data.
Nina Aerni 
Evaluation of Feature Selection Methods for Classification of Autism based on ABIDE II Prof. Dr. M. Maathuis 
Pegah Kassraian Fard
Prof. Dr. N. Wenderoth
This thesis evaluates several feature selection methods in combination with the
Support Vector Machine (SVM) classifier to distinguish between autistic and typically
developed subjects. The Autism Brain Imaging Data Exchange II (ABIDE II)
database is used for this thesis. This database includes 1044 resting state functional
and structural MRI scans. First, the MRI scans were preprocessed using the Statistical
Parametric Mapping (SPM 12). In high dimensional data sets, such as the data
at hand, the number of predictors p is a lot bigger than the number of observations
N. We reduce the feature space with feature selection methods to avoid overfitting.
We achieved an accuracy of about 64% with the univariate filter selection methods
t-test, chi-squared and difference in mean on the feature set including functional and
structural features and the covariates. For the multivariate selection method Principal
Feature Analysis (PFA) we achieved lower accuracy for this high dimensional data
set. In comparison to Kassraian Fard et al. (2016), where only functional MRI data is
used for the classification, this thesis also considers the structural MRI scans for the
analysis. We did not see the expected increase of the accuracy results by the addition
of structural features to the functional features. In fact, the addition of the covariates
sex, age and IQ score increased the accuracy to a greater extent.
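A univariate filter of the kind described above can be sketched as follows: each feature is scored separately by a two-sample t-statistic and only the highest-scoring features are kept. The Welch form of the statistic, the function names and the toy data are assumptions for illustration and are not taken from the thesis.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t-statistic for the samples x and y."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


def top_k_features(group_a, group_b, k):
    """Indices of the k features with the largest absolute t-statistic.

    group_a and group_b are lists of feature vectors, one per subject.
    """
    n_features = len(group_a[0])
    scores = []
    for j in range(n_features):
        t = welch_t([row[j] for row in group_a], [row[j] for row in group_b])
        scores.append((abs(t), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


# Hypothetical data: three subjects per group, three features each.
group_a = [[1.0, 5.0, 0.20], [1.2, 4.0, 0.10], [0.9, 6.0, 0.30]]
group_b = [[3.0, 5.5, 0.20], [3.1, 4.5, 0.25], [2.9, 5.0, 0.15]]
selected = top_k_features(group_a, group_b, k=1)
```

Only the first feature separates the two groups here, so it is the one selected. To avoid an optimistic accuracy estimate, such a filter should be fitted on the training folds only.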
Emiliano Díaz 
Online deforestation detection Seminar for Statistics Spring 2017 Prof. Dr. Marloes Maathuis  Mar-2017

Deforestation detection using satellite images can make an important contribution to forest management. Current approaches can be broadly divided into those that compare two images taken at similar periods of the year and those that monitor changes by using multiple images taken during the growing season. The CMFDA algorithm described in Zhu et al. (2012) builds on the latter category by implementing a year-long, continuous, time-series based approach to monitoring images. This algorithm was developed for 30 m resolution, 16-day frequency reflectance data from the Landsat satellite. In this work we adapt the algorithm to 1 km, 16-day frequency reflectance data from the MODIS sensor aboard the Terra satellite. The CMFDA algorithm is composed of two submodels which are fitted on a pixel-by-pixel basis. The first estimates the amount of surface reflectance as a function of the day of the year. The second estimates the occurrence of a deforestation event by comparing the last few predicted and real reflectance values. For this comparison, the reflectance observations for six different bands are first combined into a forest index. Real and predicted values of the forest index are then compared and high absolute differences for consecutive observation dates are flagged as deforestation events. Our adapted algorithm also uses the two-model framework. However, since the MODIS 13A2 dataset used includes reflectance data for different spectral bands than those included in the Landsat dataset, we cannot construct the forest index. Instead we propose two contrasting approaches: a multivariate approach and an index approach similar to that of CMFDA. In the first, prediction errors (from the first model) for selected bands are compared against band-specific thresholds to produce one deforestation flag per band. The multiple deforestation flags are then combined using an OR rule to produce a general deforestation flag.
In the second approach, as with the CMFDA algorithm, the reflectance observations for selected bands are combined into an index. We chose to use the local Mahalanobis distance of prediction errors for the selected bands as our index. This index measures how atypical a given multivariate prediction error is, thereby helping us to detect when an intervention in the data-generating mechanism has occurred, i.e. a deforestation event. We found that, in general, the multivariate approach obtained slightly better performance, although the index approach, based on the Mahalanobis distance, was better at detecting deforestation early. Our training approach differed from that used in Zhu et al. (2012) in that the lower resolution of the reflectance data and the pseudo ground-truth deforestation data used allowed us to select a much larger and more diverse area including nine sites with different types of forest and deforestation, and training and prediction windows spanning 2003-2010. In Zhu et al. (2012) reflectance and deforestation information from only one site and only the 2001-2003 period is used. This approach allowed us to draw conclusions about how the methodology generalizes across space (specifically pixels) and across the day of the year. In the CMFDA and our adapted CMFDA methodology a single (possibly multivariate) threshold is applied to the prediction errors irrespective of the location or the time of the year. By comparing the results when thresholds were optimized on a site-by-site basis to those when a single threshold was optimized for all nine sites, we found that optimal thresholds do not translate across sites; rather, they display a local behavior. This is a direct consequence of the local behavior of the prediction error distributions.
This led us to try to homogenize the error distributions across space and time by applying transformations based on different observations and assumptions about the prediction error distributions and their dependence on time and space. However, our efforts in this sense did not improve performance, leading us to recommend the implementation of the multivariate approach without transforming prediction errors.
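The two flagging approaches can be sketched for a single pixel and observation date as follows. This is a simplified illustration: the function names, thresholds and error values are hypothetical, only two bands are used, and the covariance matrix is assumed to be given rather than estimated locally as in the thesis.

```python
import math


def band_flags(errors, thresholds):
    """One flag per band: True when the absolute prediction error exceeds its threshold."""
    return [abs(e) > t for e, t in zip(errors, thresholds)]


def or_rule(errors, thresholds):
    """General deforestation flag: raised if any band-specific flag is raised."""
    return any(band_flags(errors, thresholds))


def mahalanobis_2d(error, cov):
    """Mahalanobis distance from zero of a two-band error vector, given a 2x2 covariance."""
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    x, y = error
    # Quadratic form error^T cov^{-1} error, using the closed-form 2x2 inverse.
    return math.sqrt((s22 * x * x - 2.0 * s12 * x * y + s11 * y * y) / det)


# Hypothetical prediction errors for two bands.
errors = (0.02, -0.15)
flag = or_rule(errors, thresholds=(0.05, 0.10))
index = mahalanobis_2d(errors, cov=((0.01, 0.0), (0.0, 0.01)))
```

In the multivariate approach each band has its own threshold, whereas the index approach needs a single threshold on the Mahalanobis distance.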
Lukas Schiesser 
Causal Inference in the Presence of Hidden Variables using Invariant Prediction Prof. Dr. Nicolai Meinshausen  Mar-2017
This thesis extends causal inference using invariant prediction based on the ideas introduced by Peters, Bühlmann, and Meinshausen (2015) to settings allowing the presence of hidden variables. Invariant causal prediction exploits that given different experimental environments resulting e.g. from interventions on variables, the predictions from a causal model will be invariant. Hence the causal model has to be among those models fulfilling such an invariance property, or is accepted with high probability in the context of a statistical test for a corresponding hypothesis. A rather general linear model with hidden variables is introduced and an invariant causal prediction framework for such models is established. Testing the invariance assumption is then reformulated to a quadratically constrained quadratic program which in general is non-convex and therefore does not necessarily have an exact solution. Thus, the optimization problem is relaxed to the semi-definite programming (SDP) framework and its solution can then be approximated or sometimes even obtained exactly in polynomial time. One main focus of the thesis lies on describing different approaches to apply SDP relaxations to solve the non-convex optimization problem. This provides specific methods to obtain confidence statements for the causal relationships in such models, namely for the set of causal predictors and for their causal coefficients. These are applied to simulated and real world data and numerical experiments are conducted to study the empirical properties of the developed approaches.
David Zhao 
Scattering Convolution Networks and PCA Networks for Image Processing Prof. Dr. Nicolai Meinshausen  Feb-2017
The convolutional neural network's defining principle of parameter sharing over shifting receptive fields makes it well-suited for image processing tasks, as this structure enforces both sparsity and invariance to translations and deformations. However, neural networks are not theoretically well understood, and their standard training method involves an NP-hard non-convex optimization. In this thesis, we explore two alternative models for image processing: the scattering convolution network (SCNet) of Bruna and Mallat (2013) and the principal component analysis network (PCANet) of Chan et al. (2015). Both models use sets of transformations that are fully predetermined, while maintaining the benefits of a convolutional structure. SCNet is built from layers of wavelet transforms, and PCANet is built from layers of PCA-extracted filters. SCNet and PCANet can be thought of as elaborate pre-processing steps that transform images into more expressive feature vectors. To obtain class predictions, we run a classification algorithm on these features. Four types of classifiers are considered in this thesis: generative PCA classifiers, linear and RBF kernel SVM, multiclass logistic regression with lasso, and random forest for classification trees. Experiments on the MNIST dataset show that 2-layer SCNet and 2-layer PCANet consistently outperform a comparable convolutional neural network with 2 hidden layers. We also test variations on the MNIST dataset and on the PCANet filters.
Gabriel Espadas 
Parameter estimation and uncertainty description for state-space models Dr. Markus Kalisch  Feb-2017

The present work seeks to address the statistical problem of non-linear regression, also known as calibration, for state-space models. In the classical literature, for example in G. A. F. Seber (1988), Douglas M. Bates (1988) or Gallant (1987), this problem has been studied almost entirely from either a Frequentist or a Bayesian perspective. Here, we present the theory that supports the basic models from both frameworks and carefully expose the probabilistic background needed for the use of transformations and the introduction of autocorrelation in the stochastic models. Furthermore, we demonstrate in detail the application of the methods to a real-world case study.
Keywords: State-space models, Non-linear regression, Parameter estimation, Frequentist estimation, Bayesian estimation, MCMC, Metropolis-Hastings algorithm, Gibbs sampler, Autocorrelated errors, Heteroscedastic errors.
Christoph Buck 
Optimizing complementary surveys for mapping the spatial distribution of Mercury in soils near Visp, Canton of Valais, Switzerland Dr. Andreas Papritz 
Dr. Lukas Meier 
For a case of mercury pollution near Visp, Canton of Valais, a geostatistical analysis was carried out for
a sub-area of the entire study area. The aim of the analysis was to predict which parcels
have a mercury content above a certain threshold. An analysis was made for two separate
soil layers, along with a joint 3D analysis of both soil layers together. It was found that the joint
3D analysis produces a better prediction in the sub-area.
To make more accurate predictions, more samples must be taken. The aim of the additional
sampling design is to reduce false negative decisions. Based on the ideas of
Heuvelink et al. (2010) and Marchant et al. (2013), an optimisation algorithm was
successfully implemented. It proposes an optimised sampling design which reduces false
negative and false positive decisions. The user can set the parameters of the loss function
for making false negative and false positive decisions. Based on these parameters, the optimisation
algorithm computes the expected loss of a design and searches for an optimised
sampling design. The implementation was made with conditional simulations followed by
an iterative process, which included kriging predictions, computation of the expected loss and
spatial simulated annealing.


Manuel Schürch 
High-Dimensional Random Projection Ensemble Methods for Classification Prof. Dr. Peter Bühlmann  Nov-2016
In this thesis, we investigate random projection ensemble methods for multiclass classification based on the combination of arbitrary base classifiers operating on appropriately chosen low-dimensional random projections of the feature space. These methods are particularly intended for high-dimensional data sets where the dimension of the variables is comparable to or even greater than the number of available training data samples. We extend a recent proposal of Cannings and Samworth (2015) in two directions. First, we generalize their idea for binary classification to multiple classes. Second, we present alternative approaches to their weighted majority voting for the aggregation of the individual predictions in the ensemble to a final assignment. For this newly developed methodology, we provide implementations and an empirical comparison to state-of-the-art methods on synthetic as well as real-world high-dimensional data sets. Its competitive prediction performance underpins the promising direction of aggregating randomized low-dimensional projections. Moreover, we examine analogous ideas for regression and semi-supervised classification.
Fan Wu 
On Optimal Surface Estimation under Local Stationarity Dr. Rita Ghosh
Dr. Markus Kalisch 
Given a spatial dataset, consider a nonparametric regression model where the aim is to estimate the regression surface. By further assuming local stationarity of the error term, estimation of the variance of the Priestley-Chao kernel estimator can be done without the estimation of the various nuisance parameters. All the proofs about uniform convergence of terms are already addressed in Ghosh (2015). In this thesis, we use the proved properties and propose a semiparametric algorithm for optimal bandwidth selection. The findings are then applied to a dataset of the Swiss National Forest Inventory (
Polina Minkina 
A new hybrid approach to learning Bayesian networks from observational data Dr. Markus Kalisch 
Dr. Jack Kuipers 
This work presents a new hybrid approach to learning Bayesian networks from observational data. The method is based on the PC-algorithm combined with a Bayesian style MCMC search. There are several versions of the algorithm presented in this work. The base version of the algorithm limits the search space with a PC-skeleton and performs either a stochastic MAP search or sampling from the posterior distribution on the reduced search space. While this version yields relatively good results, in some cases the PC algorithm eliminates a large part of the true positive edges from the search space. To overcome this issue we also suggest an algorithm for iterative expansion of the search space, which helps to increase the number of true positives and as a result leads to much better estimates both in terms of skeleton and equivalence class.
We run simulation studies and compare the performance of our approach to other algorithms for structure learning, such as the PC-algorithm, greedy equivalence search (GES) and max-min hill climbing (MMHC). The advantages of our algorithm are more pronounced in a dense setting. In a sparse setting the algorithm performs similarly to GES, but better than PC.
We provide assessments of the computational complexity of the new approach, which grows polynomially with the size of the network and exponentially with the size of the maximal neighborhood, which is the main limitation of the method. For the PC-algorithm, the lower bound on the computational complexity also grows exponentially with the size of the maximal neighborhood, hence we conclude that if the PC algorithm is feasible for some network, our approach should be feasible too.
Mun Lin Lynette Tay 
Statistical analysis of multi-model climate projections with a Bayesian hierarchical model Prof. em. Dr. Hans-Rudolf Künsch 
Prof. Dr. Peter L. Bühlmann 
This thesis applies a Bayesian hierarchical model as developed by Buser et al. (2009), Buser et al. (2010) and Kerkhoff et al. (2015) to heterogeneous multi-model ensembles of global climate models (GCM) and regional climate models (RCM). The Bayesian hierarchical framework is applied to data from the European arm of the project CORDEX and probabilistic projections of future climate are derived from the climate models.
This thesis is also a continuation of the CH2011 initiative which aims to provide scientifically-grounded information on a changing climate in Switzerland to aid decision-making and planning with regard to climate change strategies. It does so by assessing climate change in the course of the 21st century in Switzerland with a focus on projections of temperature and precipitation. Suitable priors for temperature and precipitation data are suggested and probabilistic projections for different regions in Switzerland, different seasons and different emission scenarios are illustrated and explained. Furthermore, a variant on the Bayesian model proposed by Kerkhoff et al. (2015) which weights data from RCMs more equally to their GCMs is introduced and the two models are compared against each other.
Ravi Mishra 
Gated Recurrent Neural Network Language Models Prof. Dr. Nicolai Meinshausen  Aug-2016
"Long term dependencies are difficult to learn with gradient descent in standard Recurrent
Neural Networks due to vanishing and exploding gradient problems. Long Short-Term
Memory and other gated networks combined with gradient clipping strategies have been
successful at addressing these issues. This work provides details on standard RNN and
gated RNN architectures. The focus lies on forward and backward pass using backpropagation
through time. We train an implementation of a character level neural network
language model on fine food review data. The goal is to model a probability distribution
over the next character in a sequence when presented with the sequence of previous
characters. The results of our experiments indicate that for large datasets and increasing
sequence length gated architectures have better performance than traditional RNNs. This
is in line with previous research."
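Gradient clipping, mentioned above as a remedy for exploding gradients, can be sketched as a rescaling of the gradient vector whenever its norm exceeds a limit. Clipping by the global norm is only one common strategy and is assumed here for illustration; the function name and values are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        # Small gradients are left untouched.
        return list(grads)
    scale = max_norm / norm
    # The direction is preserved; only the length is reduced.
    return [g * scale for g in grads]


# Hypothetical gradient with norm 5, clipped to norm 1.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping addresses exploding gradients only. Vanishing gradients are handled by the gating mechanisms of architectures such as Long Short-Term Memory.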
Janine Burren 
Outlier detection in temperature data by penalized least squares methods Prof. Dr. Nicolai Meinshausen  Aug-2016
Chernozhukov et al. (2015) proposed a new regularization technique called lava. In contrast to conventional methods like lasso or ridge regression, this method is able to discover signals, which are neither sparse nor dense. It was shown that this method outperforms the conventional methods in simulations. The application on the temperature anomaly data for January in this thesis confirmed this.
The focus of this thesis lies on the comparison of the lasso method, the elastic net method and ridge regression with the lava method in theory and application and can be split into five main parts. Firstly, all considered regularization methods are described for a multiple linear regression setting and are brought into relation in the orthonormal design case. Secondly, for the application on the temperature data the lava method and a corresponding cross-validation approach had to be implemented in R. Thirdly, the given temperature anomaly data (1940-2015) is analyzed and ordinary least squares models are fitted on temperature data, which result from a climate model, to assess how well temperature anomaly values can be predicted by the four nearest values. Fourthly, regularized linear regression models are fitted on the climate model data and predictions are made for an observed temperature anomaly data set. For this, a model fitting procedure was determined, which is able to deal with the NA-structure in the observed temperature anomaly data and which has a reasonable computational time. The residuals produced by prediction are analyzed with respect to their spatial, temporal and probabilistic distributions. In addition, the functioning of the regularization methods on the temperature anomaly data is studied for some examples to compare the methods and to understand the distributions of the residuals. In the last part of the thesis, these residuals are used to detect outliers in the temperature anomaly data. An outlier detection procedure is proposed, which takes into account the prediction error of the fitted linear models and the NA-structure in the observational data set. Furthermore, an artificial outlier study is conducted to assess the outlier detection power of the four considered regularization methods.
Elias Bolzern 
Stochastic Actor Oriented Models: An Approach Towards Consistency and Multi Network Analysis Prof. Dr. Marloes Henriette Maathuis  Jul-2016
Stochastic actor oriented models make it possible to describe longitudinal social networks, i.e., social networks observed at various time points. This model can be fitted either by a method of moments approach or a maximum likelihood approach.
In this thesis we discuss two topics. Firstly, up to now, there exists no proof for the consistency of the method of moments estimator. We discuss an approach that could lead to a consistency proof.
Secondly, the existing theory allows us to examine only a single social network. We want to examine the common behaviour that underlies several longitudinal networks. This allows us to gain deeper insights into the general behaviour of such networks. We propose to detect the commonalities by considering maximin-effects, which can be estimated by a magging type estimator. We will call our new estimator the multi group estimator. Simulations show that the multi group estimator performs well, especially for a large number of observed time points. Furthermore, the estimator has nice properties in terms of computational efficiency.
Solt Kovács 
Changepoint detection for high-dimensional covariance matrix estimation Peter Bühlmann  May-2016
In this thesis we pursue the goal of high-dimensional covariance matrix estimation for data with abrupt structural changes. We try to detect these changes and estimate the covariance matrices in the resulting segments. Our approaches closely follow a recent proposal of Leonardi and Bühlmann (2016) for changepoint detection in the case of high-dimensional linear regression. We propose two estimation approaches that build directly on their regression estimator and a third procedure which is analogous to their regression estimator, but modified to match the likelihood arising in the case of covariance matrices. We mainly focus on the implementation, testing and comparison of these proposals. Moreover, we provide complementary material on the relevant literature of covariance matrix estimation and changepoint detection in similar settings, tuning parameter selection, models for simulations and error measures to evaluate performance. We also illustrate the developed methodology on a real-life example of stock returns.
José Luis Hablützel Aceijas 
Causal Structure Learning and Causal Inference Dr. Markus Kalisch  Apr-2016
This thesis presents the theory and main ideas behind some of today's most popular methods for causal structure learning as well as the ICP algorithm, a new algorithm based on a method recently developed at ETH Zurich. Then, we measure and compare the performance of these algorithms in two different ways. In our first measure we consider the probability that each of the considered methods finds exactly all the parents of a randomly chosen target variable. In our second measure we consider the reliability of each method in not reporting as a parent a node that is not one. Here, we focus on linear Structural Equation Models (SEM) and restrict ourselves to the situation where no hidden confounders are present. We start by reproducing and extending the results given in Peters, Bühlmann, and Meinshausen (2015) and after that, we change the generation process of the data in several ways in order to conduct further comparisons.
Pascal Kaiser 
Learning City Structures from Online Maps Markus Kalisch 
Martin Jaggi
Thomas Hofmann
Huge amounts of remote sensing data are nowadays publicly available, with applications in a wide range of areas including the automated generation of maps, change detection in biodiversity, monitoring climate change and disaster relief. On the other hand, deep learning with multi-layer neural networks, which is capable of learning complex patterns from huge datasets, has advanced greatly over the last few years.

This work presents a method that uses publicly available remote sensing data to generate large and diverse new ground truth datasets, which can be used to train neural networks for the pixel-wise semantic segmentation of aerial images.

First, new ground truth datasets for three different cities were generated, consisting of very-high resolution (VHR) aerial images with ground sampling distance on the order of centimeters and corresponding pixel-wise object labels. Both VHR aerial images and object labels are publicly available and were downloaded from online map services over the internet. Second, the three newly generated ground truth datasets were used to learn the semantic segmentation of aerial images by using fully convolutional networks (FCNs), which have been introduced recently for accurate pixel-dense semantic segmentation tasks. Third, two modifications of the base FCN architecture were found that yielded performance improvements. Fourth, an FCN model was trained on huge and diverse ground truth data of the three cities simultaneously and achieved good semantic segmentations of aerial images of a geographic region that had not been used for training.

This work shows that publicly available remote sensing data can be used to generate new ground truth datasets that can be used to effectively train neural networks for the semantic segmentation of aerial images. Moreover, the method presented here allows the generation of huge and, in particular, diverse ground truth datasets that enable neural networks to generalize their predictions to geographic regions that have not been used for training.
Sriharsha Challapalli 
Understanding the intricacies of the PC algorithm and optimising causal structure discovery Markus Kalisch
The PC algorithm is one of the most notable algorithms in causal structure discovery. Over the years various suggestions have been made to optimize the algorithm further. But there is still scope to probe the intricacies of the algorithm deeper. The current study aims to examine the role of various factors like the number of variables, density in the true graph, use of conditional independence graph and sequence of carrying out conditional independence tests. The outcomes of the study contribute to the optimization of not just the PC algorithm but also causal structure discovery algorithms based on conditional independence tests in general.
The study suggests that skeleton-stable is the best of the studied algorithms for the discovery of the skeleton. The order-independent option is not the best for causal structure discovery and the BC variant is recommended. The study validates that the sequence of orders of the PC algorithm is integral to causal structure discovery. The study recommends avoiding the use of the conditional independence graph for very low values of p and very low densities. The algorithms based on conditional independence tests used in the study should be preferred to those based on greedy equivalence search, except for extremely low values of p or extremely high density.
Sonja Meier 
Causal analysis of proximal and distal factors surrounding the HIV epidemic in Malawi Marloes Maathuis 
Olivia Keiser 
The HIV epidemic in Malawi is a major cause of mortality and has a highly adverse impact on Malawi’s health system as well as on its economy. The aim of this thesis is therefore to identify causal associations between proximal and distal factors that may drive the HIV epidemic. The Malawi Demographic and Health Survey 2010 provides a wide variety of behavioral, socio-economic and structural variables as well as information on the HIV status of more than 12’000 participants. To find and display causal pathways, graphical models such as directed acyclic graphs are used. Amongst the numerous causal structure learning methods, the RFCI algorithm and the GES algorithm are found to be suitable for the considered dataset. To include the sample weights from the survey, some modifications need to be made. The “weighted” versions of the two algorithms are repeatedly run on random subsets of all observations to obtain robust estimates. Finally, a summary graph is created in which only edges with a certain frequency are displayed. This analysis is carried out for three different sets of variables. Since the HIV prevalence amongst women is significantly higher than amongst men in Malawi, a stratification by gender provides further insight. The proposed method is able to detect various connections between proximal and distal variables while taking the provided sample weights into account. A group of variables robustly connected with the HIV status was found. However, the proposed method has difficulties determining causal directions, as these are not robust under resampling.
Yannick Suter 
Implementation of different algorithms for biomarker detection and classification in breath analysis using mass spectrometry Marloes Maathuis 
Renato Zenobi 
We implement different algorithms for biomarker detection and classification for breath analysis studies using ambient ionization mass spectrometry. We test them on two studies done recently in the Zenobi research group at ETH Zürich on chronic obstructive pulmonary disease (COPD) and cystic fibrosis (CF). The studies investigate differences in molecules present in breath due to lung diseases.

The data sets contain many highly correlated variables, due to isotope patterns and biological pathways. We show that this is useful for the interpretation of the results, but has little effect on either biomarker detection or classification.

For biomarker detection, we use the Mann-Whitney U test, as well as subsampling with either the Mann-Whitney U test or the elastic net regression as selection method. For classification, we use prefiltering with the Mann-Whitney U test, followed by modern high-dimensional classification methods.

The best performing methods for both biomarker detection and classification are different for the two studies. Due to time drift effects, no significant molecules were found at an FDR control level of q = 0.05 for the COPD study with the Mann-Whitney U test. For the CF study, 127 molecules were found at an FDR control level of q = 0.05.

For classification, the best performing method for the COPD study was partial least squares regression followed by linear discriminant analysis (PLS-LDA), with an area under the ROC curve (AUC) value of 0.90. A second study on COPD is used as a validation set, which gives an AUC value of 0.71 for PLS-LDA.
Concerning the CF study, the best performing classification method was principal component analysis followed by linear discriminant analysis (PCA-LDA) with an AUC value of 0.73.

We show in simulations that hierarchical testing approaches given by Mandozzi (2015) do not work well in our setting.
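
The abstract above reports discoveries at an FDR control level of q = 0.05 but does not say which FDR procedure was used. As a minimal sketch, assuming the common Benjamini-Hochberg step-up procedure (an assumption, not confirmed by the source), the selection step applied to a list of per-molecule p-values might look as follows. The function name and the example p-values are purely illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at FDR level q.

    Assumption: BH is used here for illustration only; the thesis
    abstract does not name its FDR procedure.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value
        # lies below its threshold rank * q / m.
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis up to and including that rank.
    return sorted(order[:k_max])


# Made-up p-values standing in for per-molecule Mann-Whitney U tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(example_p, q=0.05)
```

Because the rule is step-up, a p-value above its own threshold can still be rejected when a larger p-value further down the ordering falls below its threshold.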
Zhiying Cui 
Quantifying Subject Level Uncertainty Through Probabilistic Prediction for Autism Classification Based on fMRI Data Marloes Maathuis 
Pegah Kassraian Fard
This thesis aims to quantify the subject-level uncertainty of the classification between subjects with and without autism spectrum disorder using a type of brain image data, namely resting state functional magnetic resonance imaging data. The subject-level uncertainty measure for this study is based on probabilistic predictions, and the quality of the former is shown to be entirely dependent on the quality of the latter. A selected subset of the data from the Autism Brain Imaging Data Exchange is used for classification, and the quality of the label and probability predictions of nine conventional classifiers combined with simple threshold feature selection is evaluated through cross validation and by various evaluation metrics. The best achieved accuracy is 77%, by logistic regression with L1 regularization. The best probability predictions are produced by logistic regression with L1 and L2 regularization for two of the three probability evaluation metrics, and by both random forest and extremely randomized trees for the third evaluation metric. Considering both label and probability predictions, the best classifiers for this data set are logistic regression with L1 and L2 regularization and adaptive boosting. To further improve the probability predictions, two probability calibration methods are applied to each of the above-mentioned best classifiers, and in the majority of the twelve examined cases, probability calibration yields some improvement. Similar classification tasks are also performed on one other autism data set and two additional data sets to examine the performance in different settings.
Jakob A. Dambon 
Multiple Comparisons with the Best Methods and their Implementations in R Dr. Lukas Meier  Feb-2016
The simultaneous evaluation of multiple factors is required in many scientific experiments. Multiple comparisons account for the multiplicity and are a useful tool for simultaneous inference on those factors. There are several methods for multiple comparisons, in particular multiple comparisons with the best (MCB), which is the main focus of this thesis. Here, we try to find the best treatment in comparison to the others.
The main purpose of this thesis was to implement Edwards-Hsu’s MCB method in R, as it is not part of the R package multcomp. The main achievements of this thesis are step-by-step derivations of the confidence intervals of Edwards-Hsu’s MCB method in the balanced and unbalanced one-way ANOVA model as well as a successful implementation in R.
Maurus Thurneysen 
Performance Analysis of a Next Generation Sequencing Instrument Markus Kalisch 
Harald Quintel
The complexity of processes and data output in molecular diagnostics is growing rapidly. In December 2015 QIAGEN AG entered the market with the first complete workflow in Next Generation Sequencing designed to deliver all the steps from Sample to Insight to the customer. This GeneReader NGS System features built-in sample preparation, sequencing of the genetic code as well as analyses of the gene sequences and produces actionable insights for customers working in diagnostic fields.
The quality and reliability of such a workflow are crucial factors in assuring high performance standards. The statistical analysis of critical steps within the workflow provides a powerful means for achieving this goal. So far, this approach has not been exploited to its fullest in this context. Therefore, the aim of this master's thesis in statistics is to analyze the performance of the newly developed GeneReader instrument, which carries out the sequencing substep of the workflow, with statistical learning techniques. Quality Control data from instrument production and data from test campaigns in the field are analyzed by an unsupervised learning approach and then combined into supervised learning problems to predict the performance quality of a GeneReader instrument from its Quality Control data.
It was found that the GeneReader instruments are calibrated well and that their contribution to the variability of the workflow is relatively small. However, the power of this approach was limited due to the small number of true replicates available. Nonetheless, this investigation demonstrates the currently largely untapped potential of systematically applying statistical analysis to assess and guarantee high quality and stability in QIAGEN’s development and production processes.
Sven Buchmann 
High-Dimensional Inference: Presenting the major inference methods, introducing the Unbalanced Multi Sample Splitting Method and comparing all in an Empirical Study Martin Mächler  Feb-2016
Performing statistical inference in the high-dimensional setting is challenging and has become an important task in statistics over the last decades. In my thesis I first give a selective overview of the high-dimensional inference methods that have been developed to assign p-values and confidence intervals in linear models, including a graphical survey of every presented inference method. The overview is split into two parts: methods for detecting single predictor variables and methods for detecting groups of predictor variables.
Secondly, I introduce a new inference method in the high-dimensional setting, called Unbalanced Multi Sample Splitting, which is a modification of the Multi Sample Splitting Method of Meinshausen, Meier, and Bühlmann (2009). Furthermore, I prove its family-wise error control. Finally, I perform an Empirical Study using the R package simsalapar, which consists of three parts: designing the simulation study, actually performing the simulation and analyzing the various results.
Jürgen Zell 
Analyzing growth and mortality of Picea Abies for a growth simulator in Switzerland Martin Mächler  Feb-2016
The thesis is about modeling growth and mortality of Picea abies. The data are complex and stem from experimental forest management trials all over Switzerland. In the first part, growth is modeled; 65% of the total variation can be explained by many different explanatory variables. The second part is about mortality and contains a logistic regression model, which is compared to a survival analysis approach.
Marc Stefani 
Lasso Chain Ladder Constrained Optimization for Claims Reserving Lukas Meier 
Jürg Schelldorfer

The Chain Ladder method is by far the most popular method for predicting non-life claims reserves in the insurance industry. Its simplicity induces two limitations: First, we do not have a robust estimation of old development factors, because only a few observations are available. Second, the Chain Ladder method is not able to deal with diagonal effects (i.e. claims inflation), which are often present in claims reserving data. Although many research papers present extensions to the classical Chain Ladder method, none has addressed the issue of using constrained optimization with Lasso-type estimators. Lasso-type estimators are primarily attractive for high-dimensional statistics but are still useful in low-dimensional problems, either to obtain a smaller set of estimated parameters that exhibit the strongest effects or to obtain a robust estimator which reduces the variability of the estimated model parameters.
Since the Chain Ladder model can be understood as a regression problem, it was possible to develop Lasso-type estimators for three different models: a regression version of the Chain Ladder Time Series Model, an extension which allows modeling diagonal effects and an Overdispersed Poisson Model which also considers diagonal effects. To solve the optimization problems, we build a regression framework to transform the claims reserving data into appropriate data matrices. The application to real data sets shows that Lasso-type estimators predict plausible claims reserves. For simulated data sets we often achieve better prediction accuracy with Lasso-type estimators than with the Chain Ladder method, especially in situations where the Chain Ladder model assumptions are not fulfilled. However, the solution of Lasso-type estimators is sensitive to the choice of the optimal tuning parameter and the model selection criterion. Finally, we estimate the prediction accuracy of Lasso-type Chain Ladder estimators via model-based bootstrap. The implementation of the Lasso-type estimators is done in R.
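
For readers unfamiliar with the classical Chain Ladder method that the Lasso-type estimators are compared against, the following is a minimal sketch of the standard volume-weighted development factors and the resulting reserves. It is not the thesis's implementation (which is in R and concerns the Lasso-type extensions). The triangle values and function names are made up for illustration.

```python
def development_factors(triangle):
    """Volume-weighted Chain Ladder development factors.

    triangle: list of rows (one per accident year) of cumulative
    claims; later accident years have fewer observed entries.
    """
    n_dev = len(triangle[0])
    factors = []
    for j in range(n_dev - 1):
        # Use only accident years observed at both period j and j + 1.
        rows = [r for r in triangle if len(r) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def complete_triangle(triangle):
    """Fill in the unobserved lower-right part of the triangle."""
    factors = development_factors(triangle)
    n_dev = len(triangle[0])
    completed = []
    for row in triangle:
        full = list(row)
        # Roll the latest observed value forward with the factors.
        for j in range(len(row) - 1, n_dev - 1):
            full.append(full[-1] * factors[j])
        completed.append(full)
    return completed


# Illustrative cumulative run-off triangle (hypothetical numbers).
triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 170.0],
    [120.0],
]
factors = development_factors(triangle)
completed = complete_triangle(triangle)
# Reserve = predicted ultimate claim minus latest observed claim.
reserves = [full[-1] - row[-1] for full, row in zip(completed, triangle)]
```

The last development factor here rests on a single accident year, which illustrates the first limitation named in the abstract: late development factors are estimated from very few observations.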
Benjamin Jakob 
Censored Regression Models Lukas Meier  Jan-2016
Empirically bounded distributions are investigated, and regression models are fitted to such dependent variables using several independent variables. Different models (censored as well as uncensored) are used and implemented in the programming language R: the Logit model, the Beta distribution model, the Tree model, the Random Forest, a Censored Gamma model and two slight variations of it.
The conclusion is that the Censored Gamma model and its extensions proposed by Sigrist and Stahel (2011) often, though not always, perform well in comparison to the other models and might therefore be an attractive option for banks and insurers to investigate further.


Student Title Advisor(s) Date
Jakob Olbrich 
Screening Rules for Convex Problems Bernd Gärtner
Peter Bühlmann 
Martin Jaggi
This thesis gives a general approach to deriving screening rules for convex optimization problems. The approach proceeds in three steps. In the first step, the Karush-Kuhn-Tucker conditions are used to derive necessary conditions that allow the problem size to be reduced. These conditions depend on the optimal solution itself. The second step is to gather information on the optimal solution from a known approximation. In the third and final step, this information is used to obtain conditions that do not depend on the optimal solution, which are then called screening rules. This thesis studies in particular the unit simplex, the unit box and polytopes as domains. The resulting screening rules can be applied to various problems, such as Support Vector Machines (SVM), the Minimum Enclosing Ball (MEB), LASSO problems and logistic regression. The resulting screening rules are compared to existing rules for those problems.
Nicolas Bennett 
Analysis of High Content RNA Interference Screens at Single Cell Level Peter Bühlmann 
Anna Drewek 
Infectious diseases are among the leading causes of death worldwide and the evolution of antimicrobial resistance poses a troubling development in cases where our only effective line of defense is based on distribution of antibiotic agents. One possible way out of this problematic situation comes by the alternative approach of host directed therapeutics, which in turn warrants the meticulous study of the human infectome. Therefore, large-scale studies such as genome-wide siRNA knockdown experiments as performed by the InfectX/TargetInfectX consortia are of great importance.

The richness of datasets resulting from image-based high throughput RNAi screens permits a broad range of possible analysis approaches to be employed. The present study investigates cellular phenotypes as induced by gene knock-down, with a focus on the effect of pathogen infection, by applying generalized linear models (GLMs) to single cell measurements. In order to simplify the handling of such datasets, an R package is presented that fetches queried data from a centralized data store and produces data structures capable of efficiently representing the logic of an assay plate. Convenience functions to preprocess, manipulate and normalize the resulting objects are provided, as is a caching system that helps to significantly speed up common operations.

GLM analysis of phenotypic response from knockdown and infection was attempted, but did not yield satisfactory results, most probably due to issues with data normalization. In order to facilitate the simultaneous study of measurements originating from multiple assay plates, several normalization schemes were explored, including Z- and B-scoring, as well as modeling technical artifacts with multivariate adaptive regression splines (MARS). While some improvements of data quality were observed, experimental sources of error could not be sufficiently controlled for meaningful GLM regression.
Marco Eigenmann 
A Score-Based Method for Inferring Structural Equation Models with Additive Noise Peter Bühlmann  Aug-2015
We implement and analyse a new score-based algorithm for inferring linear structural equation models with a mixture of Gaussian and non-Gaussian distributed additive noise. After introducing some well-known algorithms, providing theory, pseudocode, main advantages and disadvantages as well as some examples, we extensively cover the technical part which underpins the ideas behind our new algorithm. Finally, we present our algorithm in great detail, describing its R implementation and showing its performance compared to the algorithms introduced in the previous chapters.
Patrick Welti 
Analysis of the Empirical Spectral Distribution of a Class of Large Dimensional Random Matrices with the Aid of the Stieltjes Transform Sara van de Geer 
Alan Muro Jiminez
Paweł Morzywołek 
Non-parametric Methods for Estimation of Hawkes Process for High-frequency Financial Data Peter Bühlmann 
Vladimir Filimonov
Didier Sornette
Due to its ability to represent clustered data well, the popularity of the self-exciting Hawkes model has grown steadily in recent years. Originally applied to earthquake prediction, it has also been used to anticipate flash crashes in finance, epidemic-type behaviour in social media such as Twitter and YouTube, and outbursts of criminality in big cities.
The aim of this work is to conduct a comprehensive comparison study of the existing non-parametric techniques for estimation of the Hawkes model, which provide insights into the data without making any a priori assumptions on the correlation structure of the observables. To the best of my knowledge such work has not been done so far. The first method considered is the EM algorithm, widely used in non-parametric statistics and adjusted here to the case of a Hawkes process. The second procedure is based on estimating a conditional expectation of the Hawkes model’s counting process and then solving a Wiener-Hopf type integral equation to obtain the kernel function of the model. The last estimation technique represents the Hawkes model as an integer-valued autoregressive model and subsequently applies tools from the theory of time series to obtain the parameters of the model.
The methods were tested on synthetic data generated from the Hawkes model with different kernels and different parameters. I investigated how the sample size and the overlapping of point clusters influence the performance of the different estimation methods. In the analysis, I did not restrict myself to the most commonly used exponential and power law kernels, but also considered less typical step and cut-off kernels. After the comparison on synthetic data, I proceeded with the empirical data analysis. For this purpose I tested the estimation methods on high-frequency data of price changes of E-mini S&P 500 and Brent Crude futures contracts.
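
To make the object being estimated concrete, the sketch below evaluates the conditional intensity of a Hawkes process with an exponential kernel, one of the kernel families named in the abstract. The parametrization phi(u) = alpha * exp(-beta * u) and all parameter values are assumptions chosen for illustration; the thesis may use a different convention.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a Hawkes process at time t:

        lambda(t) = mu + sum over past events s < t of
                    alpha * exp(-beta * (t - s))

    mu is the background rate; alpha and beta parametrize the
    exponential kernel (an assumed parametrization).
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - s)) for s in event_times if s < t
    )
    return mu + excitation


def branching_ratio(alpha, beta):
    """Integral of the exponential kernel over (0, infinity), i.e. the
    expected number of events directly triggered by a single event."""
    return alpha / beta


# Hypothetical event times and parameters.
events = [0.0, 0.4, 0.5]
lam = hawkes_intensity(1.0, events, mu=0.5, alpha=0.8, beta=1.2)
n = branching_ratio(alpha=0.8, beta=1.2)
```

A branching ratio below 1 keeps the process stationary. Values closer to 1 produce longer and more strongly overlapping event clusters, which is the regime in which the abstract examines the estimators' performance.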
Philip Berntsen 
Particle filter adapted to jump-diffusion model of bubbles and crashes with non-local crash-hazard rate estimation Markus Kalisch 
Didier Sornette
Yannick Malevergne
Crashes in the financial sector probably represent the most striking events among all possible extreme phenomena. The impact of crises has become more severe and their arrivals more frequent. The most recent financial crises shed fresh light on the importance of identifying and understanding financial bubbles and crashes.
The model developed by Malevergne and Sornette (2014) aims at describing the dynamics of the underlying occurrences and probability of crashes. A bubble in this work is synonymous with prices growing at a higher rate than what can be expected as normal growth over the same time period. A non-local estimation of the crash hazard rate takes into account unsustainable price growth and increases as the spread between a proxy for the fundamental value and the market price becomes greater. The historical evaluation of the jump risk is unique and expands the understanding of crash probability dynamics assumed embedded in financial log-return data.
The present work is mainly concerned with developing fast sequential Monte Carlo methods, using C++. The algorithms are developed for learning about unobserved shocks from discretely realized prices for the model introduced by Malevergne and Sornette (2014). In particular, we show how the best performing filter, the auxiliary particle filter, is derived for the model at hand. All code is available in the appendix for reproducibility and research extensions.
In addition, we show how the filter can be used for calibration of the model. The estimation of the parameters, however, is shown to be difficult.
Jakub Smajek 
Causal inference beyond adjustment Markus Kalisch  Jul-2015
Covariate adjustment is one of the most popular and widely used techniques to estimate causal effects. The method is easy to use, has a well-understood theory and can be combined with other statistical techniques for efficient estimation of a given causal effect. The problem is that the covariate adjustment method is not complete, in the sense that it may fail to identify a causal effect even if the effect is identifiable by some other method. The first goal of the thesis is to demonstrate this problem and present some alternative techniques, such as the instrumental variables technique and a new identification method, that can be useful in the estimation of causal effects (chapter 2). The next goal and the main theme of the thesis is to answer the question: "How restrictive is it if we restrict causal inference to adjustment methods?". The third chapter tries to answer this question from a theoretical perspective for single nodes X and Y. It presents important results from other authors and generalizes some of them for two types of graphs: acyclic directed mixed graphs (ADMGs or latent projections) and maximal ancestral graphs (MAGs). The chapter shows that the possibility of identifying a causal effect by covariate adjustment cannot be lost through the conversion from a DAG to the corresponding latent projection, and provides a criterion that characterizes when a given causal effect is identifiable at all (by any method) but not by covariate adjustment in an ADMG G. It also shows that the possibility of estimating a causal effect can be lost purely due to the conversion from a latent projection to a corresponding MAG and provides a criterion that specifies when this happens. Moreover, the third chapter provides a necessary, sufficient and constructive criterion to form an adjustment set in a given MAG M, if X and Y are single variables. Finally, partially based on the theoretical results derived earlier in the thesis, the question is addressed in a simulation study in chapter 4.
The chapter describes implementation issues, methodology and several different experiments. The experiments concentrate on a comparison of the complete identification algorithm and the covariate adjustment method in terms of the proportions of identifiable causal effects. The comparison on uniformly sampled ADMGs shows a big advantage of the former method. It turns out, however, that the difference is mainly caused by some simple cases that can be easily identified. Handling these cases separately leads to a simple but very effective improvement of the covariate adjustment method that can significantly increase the proportion of identifiable causal effects. Finally, an experiment is performed that shows how much is lost in the conversion from an ADMG to a MAG. The problem is especially visible if we restrict the analysis to graphs that contain a causal path from X to Y.
Lukas Tuggener 
Analysis of Cross-Over Trials Markus Kalisch  Jul-2015
The goal of this thesis is to give the reader an introduction to cross-over trials. The first chapter explains the most basic cross-over design. Using this design as an illustration, it presents the theory necessary to analyse cross-over trials. It shows how this basic design is weak in many situations and introduces designs which are more versatile. Three computer simulations help to build an intuitive understanding of cross-over designs.
The most important insight from this thesis is that a good design choice is always a multifactorial trade-off between subject recruiting, study duration and design complexity. If available, it takes information about the expected carry-over behaviour and the structure of the between- and within-subjects variability into account.
Maria Elisabetta Ghisu 
A comparative study of Sparse PCA with extensions to Sparse CCA Marloes Maathuis  Jul-2015
In this thesis we compare different approaches to sparse principal component analysis (sparse PCA) and then extend our investigation to sparse canonical correlation analysis (sparse CCA).

First, we study sparse PCA methods, where regularization techniques are included in classical PCA to obtain sparse loadings. We compare different formulations by analyzing theoretical foundations and algorithms. Moreover, we carry out simulation studies to evaluate the performance in a wide variety of scenarios. The optimal choice of the method depends on the objective and on the specific parameter combination. Our results suggest that the SPC approach (Witten et al., 2009) usually outperforms the other techniques in recovering the true structure of the loadings, although the angle between the true and estimated vectors is generally high.

Subsequently, we examine the closely related problem of sparse CCA, where sparsity is imposed on the canonical correlation vectors. After a theoretical study of the methods, we run simulations to assess their quality. When the covariance matrices of the two sets of variables are not nearly diagonal, CAPIT (Chen et al., 2013) shows higher accuracy; otherwise, the performances are similar.

Finally, we consider applications of both sparse PCA and sparse CCA to real data sets, obtaining satisfactory results in most of the situations.
Xiao Ye Zhan 
Modelling Operational Loss Event Frequencies Marloes Maathuis 
Michael Amrein
In this paper we study the application of count data modelling approaches to monthly counts of operational risk events recorded over 13 years at UBS. Assuming that the underlying distribution of the counts is Poisson, we consider nonparametric and parametric regressions and a time series model. A mean-matching variance stabilizing transformation (VST) is used to facilitate the nonparametric Poisson regression and reduce the problem to a homoscedastic Gaussian regression. The Poisson GLM regression and the generalized linear autoregressive moving average (GLARMA) model are applied to investigate the relationship between the number of operational losses observed and exogenous variables, as well as the dependence structure in the data. Our analysis shows significant connections between the loss count data and the financial and economic drivers. Notable serial correlations are also found in the data, with special attention paid to the Poisson distribution assumption and the over-dispersion issue. Simulation experiments are also provided to examine numerical properties of the estimators.
Marcos Felipe Monteiro Freire Ribeiro 
Learning with Dictionaries Nicolai Meinshausen  Jul-2015
The method of dictionary learning was introduced by Olshausen and Field (1997) as a model for images based on the primary visual cortex. It has been successfully used for representing sensory data like images and audio, also providing an explanation for many observed properties in the response of cortical simple cells. In this thesis, we will show that the method can also be derived from an information theoretical point of view. The approach is similar to Bell and Sejnowski (1995) but substitutes the framework of neural networks by a probabilistic one. We also discuss how the learned representations can be used for classification and apply the theoretical results to two real world problems. In the first problem, we analyse GPS data in order to characterize driving styles. In the second, we analyse fundus images of the eye in order to diagnose diabetic retinopathy.
Oxana Storozhenko  
Maximin effects with tree ensembles Nicolai Meinshausen  Jul-2015
Non-parametric models, such as regression trees, are often used as a primary estimation method in prediction problems. Fitting the trees requires virtually no assumptions about the data, the learning algorithm requires almost no tuning and non-linear relationships in the data are handled well. The flexibility of trees has been exploited in ensemble learning, where the members of an ensemble are the trees fit to different samples of the training data. One of the most popular off-the-shelf prediction algorithms is random forest (Breiman (2001)), which constructs an ensemble of randomised trees trained on bootstrap samples of the data and averages over the predictions made by each tree. We propose to extend this algorithm to prediction problems with inhomogeneous data. In particular, the estimators in the ensemble can be trained on different groups of the training data, as opposed to perturbing the dataset with bootstrap sampling. If the data has outliers, contaminations, or time-varying or temporary effects that are present locally, dividing the dataset into groups in a sequential manner yields more diverse estimators. Another adjustment in the context of inhomogeneous data is finding a vector of weights for the estimators in the ensemble, such that the future predictions are optimal whatever group the new data point comes from. Bühlmann and Meinshausen (2014) proposed to minimise the L2-norm of the convex combination of the fitted values of the estimators, and to use the resulting weights in order to maximise the minimum explained variance in every group. This scheme is called maximin aggregation and we show how it works for inhomogeneous data.
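The weighting step described above (minimising the L2-norm of a convex combination of fitted values) has a closed form when there are only two estimators. The following is a minimal sketch of that special case, not the thesis's code; the function names and the toy fitted values are hypothetical, and with more than two estimators the same problem becomes a quadratic program over the simplex.

```python
def maximin_weight(fit_a, fit_b):
    """Weight w in [0, 1] minimising ||w*fit_a + (1 - w)*fit_b||_2."""
    diff = [a - b for a, b in zip(fit_a, fit_b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical fits: every weight gives the same combination
    # Unconstrained minimiser of the quadratic in w, then clipped to [0, 1].
    w = -sum(b * d for b, d in zip(fit_b, diff)) / denom
    return min(1.0, max(0.0, w))


def combine(fit_a, fit_b, w):
    """Convex combination of two vectors of fitted values."""
    return [w * a + (1.0 - w) * b for a, b in zip(fit_a, fit_b)]


# Two hypothetical estimators, each trained on a different group.
fit_group_1 = [2.0, 1.0, 0.0]
fit_group_2 = [0.0, 1.0, 2.0]
w = maximin_weight(fit_group_1, fit_group_2)
aggregated = combine(fit_group_1, fit_group_2, w)
print(w, aggregated)
```

In this toy example the two groups disagree on the first and last fitted values and agree on the middle one, so the aggregate keeps only the common part of the two fits.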

Teja Turk 
Comparison of Confidence and Prediction Interval Approaches in Nonlinear Mixed-Effects Models Lukas Meier  Jun-2015
In this study we aim to assess the performance of various approaches for confidence and prediction intervals in single level nonlinear mixed-effects models. The evaluation is based on simulated samples of coverage rates for 13 nonlinear functions.

The bootstrap confidence intervals are constructed from the parametrically, nonparametrically and case resampled datasets. In addition, the confidence intervals from the intervals function and the Wald confidence intervals are included in the comparison. The performance of all the methods is assessed for all three types of parameters: the fixed effects, the variance-covariance components and the within-group standard deviation. Finally, the Wald confidence intervals are improved by empirically adjusting the degrees of freedom of the t-statistic. In general, the simulation speaks in favour of the non-bootstrap approaches.

The prediction interval methods are based on the Wald test and derived separately for observed and unobserved groups. The derivation of the variance of the prediction error is based on various linear approximations of the prediction error. In pairwise comparisons with their bootstrap variants no apparent differences are detected. When their performance is compared with the prediction intervals based on the bootstrap prediction error distribution, the latter exhibit coverage rates closer to the nominal values.
Caroline Matthis 
Classifying Autistic Versus Typically Developing Subjects Based on Resting State fMRI Data Marloes Maathuis 
Nicole Wenderoth
Pegah Kassraian Fard 
In this thesis we investigate several classifiers to discriminate between autistic and typically developing subjects based on resting state fMRI data. We use data from the Autism Brain Imaging Data Exchange (ABIDE) database which consists of fMRI scans of 1112 subjects. First, we implement the Leave-One-Out (LOO) classifier designed by Anderson et al. [2] which attains an accuracy of 60 %. Next, we run various conventional classifiers on the data and compare their predictive performance to the LOO classifier. Most of the examined classifiers perform at least as well as the LOO classifier; a flexible formulation of discriminant analysis reaches an accuracy of 76 %. In a last step we attempt to attach a subject-specific uncertainty to the classification. Based on work by Fraley and Raftery [18] the posterior distributions of the flexible formulation of discriminant analysis are used to model these uncertainties. In a short simulation study we illustrate the informative value of the estimated uncertainties, given that the distributional assumptions are valid. Then, this uncertainty model is evaluated on the data, yielding satisfactory results.
Julia Brandenberg 
Statistical Analysis of Global Phytoplankton Biogeography in Mechanistic Models and Observations Nicolai Meinshausen  Apr-2015
After five months of intense work, I am proud to submit my Master’s thesis. I would like to thank my advisor Dr. Meike Vogt for her constant support, her reliability and motivation, and congratulate her on her baby, which was one of the highlights during this period. Besides many fruitful on-topic discussions, I enjoyed the off-topic horse-related chats with her. Special thanks to my advisor Prof. Dr. Nicolai Meinshausen, whose support was competent, patient and committed. During several meetings I was able to deepen my statistical understanding, and his versatile approaches to problem solving motivated me to try different techniques. I would like to thank Prof. Dr. Nicolas Gruber for his advice and for having me in his group the past months. Dr. Thomas Froelicher supported me in the interpretation of my results and was my contact during Meike's absence. Dr. Charlotte Laufkoetter and Dr. Chantal Swan contributed to this work by providing me with data and information concerning it. Last but not least, I would like to thank all my colleagues from the environmental physics group for their advice and contributions and especially for making this time such a pleasure to think back to.
At this point I would like to mention my parents, Barbara and Andreas Brandenberg, and thank them for their unconditional support over the last years. Their love and faith in me contributed greatly to all my achievements and made me the person I am today. Thank you!
Sonja Gassner 
Fitting and Learning of Bow-free Acyclic Path Diagram Models Marloes Maathuis 
Preetam Nandy Christopher Nowzohour
We consider the problem of learning causal structures from observational data, when the data are generated from a linear structural equation model. Under the assumption that the path diagram of the model is acyclic and the error variables are uncorrelated, one can apply a search and score technique to learn the underlying structure. However, the assumption of uncorrelated errors is often too restrictive. In this thesis we consider a more general subclass of linear structural equation models for structure learning, where correlation of the errors is allowed unless the corresponding random variables are in a direct causal relation. These models are called bow-free acyclic path diagram (BAP) models. BAP models are almost everywhere identifiable, which is in general not ensured for linear structural equation models with arbitrary correlation patterns. First, we consider two methods for estimating the parameters in BAP models. One results from the proof of the identifiability of BAP models and is implemented in this thesis. The other one is an iterative partial maximization algorithm for maximum likelihood estimation, for which an implementation was already available. Next, we use these two fitting methods in a greedy search algorithm for structure learning, which repeatedly fits and scores BAP models and chooses the model with the highest score. Finally, we evaluate the performance of these methods in a small simulation study.
Carolina Maestri 
Two approaches of causal inference for time series data Marloes Maathuis  Mar-2015
In this Master's thesis two approaches of causal inference for time series data are studied. The first one addresses non-linear deterministic systems, while the second one is designed for linear stochastic systems. For both methods the theoretical foundations are presented and the algorithms are analysed and described in detail. Applications to real data are also shown and various simulations are run to investigate the performances of the algorithms in different situations.
Kari Kolbeinsson 
Model Selection for Outcome Predictions of Professional Football Matches Markus Kalisch  Mar-2015
The subject of this thesis is to model and predict the outcome of professional football matches played in the premier leagues around the globe. For this purpose a number of statistical learning methods are employed and models fit to publicly available data. After gathering the simple data from the relevant websites, numerous variables are constructed to further capture the relative strength of each team. The second chapter of the thesis is dedicated to explaining the dataset constructed from these variables and their relationship with the response variables. The statistical learning commences in the third chapter by fitting classification models to a training subset of the data. For these models the response variable is categorical, taking on three values: a win for either team or a draw. The models considered are linear and quadratic discriminant analysis, k-nearest neighbours, random forest, boosted classification trees and support vector machines. For each model, the fit to the training set is analysed using an estimate of the misclassification rate and calibration plots. The fourth chapter explores the use of regression models for this task. The response variable now is either the goals scored by each team or the goal difference. Models fit to the goal difference of each team are then combined for one unified prediction of the goal difference. The models tried for this task are generalized linear models, random forest and boosted regression trees. Prediction accuracies of the best performing models in these two chapters are the subject of the fifth and final results chapter. The goal count estimates of the regression models are translated into the same categorical results as were modelled by the classification models for comparison between all methods. The best performing model was found to be the boosted classification trees with a prediction accuracy of 50.5%.
Lin Zhu 
Confidence Curves in Medical Research Leonhard Held  
Markus Kalisch 
This thesis briefly reviews the development of confidence distributions. It introduces the modern definitions of a confidence distribution, confidence density and confidence curve, along with point estimators based on a confidence distribution. Then different methods for constructing confidence curves are given for cases without nuisance parameters and cases with nuisance parameters, respectively. The pivotal approach and the deviance-based approach are applied to cases both with and without nuisance parameters. The half-correction approach is applied to discrete data. The simulation or bootstrap approach is applied to cases with nuisance parameters. We take the exponential distribution, binomial distribution, Weibull distribution, gamma distribution and the comparison of two binomials as examples to study the differences between the approaches.
Anita Kaufmann 
Crime Linkage Jacob De Zoete
Marloes Maathuis 
Crime Linkage studies settings where similarities among several crimes suggest execution by the same offender. Due to their linkage, evidence from an individual case becomes relevant for the entire group of crimes. After giving a short introduction to Bayesian Networks, we demonstrate how they can be used to model crime linkage settings. In a next step, a review of two research papers concerning this topic is provided. For a better understanding we outline the most important parts in detail. The papers in focus only present examples for a small number of crimes, since the complexity increases exponentially with the number of crimes considered. We aim at avoiding the fast increase in complexity by proposing simplifying adaptations of the Bayesian Network. Furthermore, we restrict the number of different offenders to m < n, where n is the number of crimes considered, since it is not very probable to have as many offenders as crimes in a crime linkage setting. The consequence is a reduction of the number of offender configurations, which should simplify the computation of settings with a larger number of crimes. We propose two possibilities to find a reasonable value for m. The problem we encounter is that our adapted function for n crimes with at most m different offenders is not efficient and hence cannot be used for larger numbers of crimes. Nonetheless, comparing the two different approaches for small numbers of crimes we get very similar results. Consequently, the second approach is, at least for small numbers of crimes, faster and thus better suited for determining the number m of different offenders which have to be taken into consideration. In order to maintain its relevance also for larger numbers of crimes, we furthermore propose a possible extension of the second approach.
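The effect of capping the number of offenders can be made concrete with a small count. The sketch below rests on an assumption not stated in the abstract: that an offender configuration corresponds to a partition of the n crimes into groups committed by the same offender, so that configurations with exactly k offenders are counted by Stirling numbers of the second kind. The function names are hypothetical.

```python
def stirling2(n, k):
    """Number of ways to partition n labelled items into exactly k non-empty groups."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # Item i either joins one of j existing groups or starts a new one.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


def offender_configurations(n_crimes, max_offenders):
    """Partitions of the crimes into at most max_offenders groups (assumed model)."""
    return sum(stirling2(n_crimes, k) for k in range(1, max_offenders + 1))


for n in (5, 10):
    unrestricted = offender_configurations(n, n)
    restricted = offender_configurations(n, 3)
    print(f"n={n}: {unrestricted} configurations, {restricted} with at most 3 offenders")
```

Under this assumption, 10 crimes admit 115975 configurations without a cap but only 9842 with at most three offenders, which illustrates why restricting m can reduce the number of configurations substantially.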
Sheng Chen 
Random Projection in clustering classification and regression Markus Kalisch  Feb-2015
This thesis studies the performance of Random Projection - one of the relatively new dimensionality reduction techniques - when applied to clustering, classification and regression, by reproducing or testing the results in three papers by Boutsidis and Zouzias (2010), Paul and Boutsidis (2013) and Kaban (2014), each from one of the three domains. Firstly, a review is given of the Johnson-Lindenstrauss lemma and its extensions, which form the theoretical foundation of Random Projection. Besides the early subgaussian and sparse matrices, new random matrices based on the Fourier transform are developed for faster computation. Secondly, the experiment in Random Projection-based K-means (Boutsidis and Zouzias, 2010) is reproduced. The result shows that when the cardinality of the embedded space is large, the RP-based K-means is comparable to the K-means with original data in terms of misclassification rate. Comparisons between RP, PCA and LS show that PCA outperforms RP in terms of misclassification rate, but RP needs only 19% of the time needed by PCA. Thirdly, for classification, part of the experiment in RP-based Support Vector Machine (Paul and Boutsidis, 2013) is run. The calculation shows that the misclassification rate of the RP-based SVM is not significantly larger than that of the SVM in the original space. However, the margin γ is significantly smaller. In the area of regression, Kaban (2014) proposed an upper bound on the excess risk of the OLS estimator in the embedded space, and proved that Random Projection applies to a larger group of matrices, whose entries have mean 0, unit variance, symmetric distribution and finite fourth moment. The last part of the thesis runs experiments to examine the necessity of these assumptions on the random matrices and finds that each of them could be loosened without breaking the bounds.
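The distance-preservation property behind the Johnson-Lindenstrauss lemma can be checked numerically. This is a minimal sketch rather than a reproduction of any experiment in the thesis: it assumes a dense Gaussian projection matrix with entries of variance 1/k, and the dimensions, seed and function names are hypothetical.

```python
import math
import random


def gaussian_projection(dim_in, dim_out, rng):
    """Random dim_out x dim_in matrix with N(0, 1/dim_out) entries."""
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, x):
    """Matrix-vector product, i.e. the embedding of x."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def distance(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(1)
d, k = 500, 200  # original and embedded dimensions (illustrative)
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]
matrix = gaussian_projection(d, k, rng)
embedded = [project(matrix, p) for p in points]

# Ratio of embedded to original distance for every pair of points.
ratios = [distance(embedded[i], embedded[j]) / distance(points[i], points[j])
          for i in range(len(points)) for j in range(i + 1, len(points))]
print(min(ratios), max(ratios))
```

The ratios concentrate around 1, and their spread shrinks as the embedded dimension k grows, which is the behaviour the lemma quantifies.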
Ioan Gabriel Bucur 
Structural Intervention Distance for Maximal Ancestral Graphs Markus Kalisch  Jan-2015
In the process of causal inference, we are interested in accurately learning the causal structure of a data generating process from observational data, so as to correctly predict the effect of interventions on variables. In order to assess how accurate the output of an estimation method is, we would like to be able to compare causal structures in terms of their causal inference statements. Peters and Bühlmann have proposed the Structural Intervention Distance as a premetric between DAGs that provides a partial solution to the issue. However, the causal DAG may not be able to predict certain intervention effects in the presence of confounders. In this paper, we attempt to emulate the results of Peters and Bühlmann in a more realistic setting, where we observe only part of all relevant variables. We propose a new premetric, the Structural Intervention Distance for Maximal Ancestral Graphs (SIDM). A MAG is a causal structure which, unlike the DAG, is closed under marginalisation and can incorporate uncertainty about the presence of latent confounders. The SIDM allows us, under the assumption of no selection bias, to compare and contrast two MAGs based on their capacity for causal inference. The SIDM is consistent with the SID in its approach and provides valuable additional information to other metrics.


Lukas Weber 
Model selection techniques for detection of differential gene splicing Mark Robinson 
Peter Bühlmann 
Alternative splicing during the messenger RNA (mRNA) transcription stage of gene expression can generate vast sets of possible mRNA isoforms from individual genes. These mRNA isoforms can create functionally distinct proteins during subsequent protein translation, explaining the enormous diversity of proteins in organisms such as humans. Differential splicing experiments aim to use microarray or RNA sequencing (RNA-seq) technologies to detect genes exhibiting differences in splicing patterns between groups of biological samples, for example comparing diseased versus healthy samples, or treated versus untreated. In this thesis, we have tested whether model selection techniques can be used to improve the performance of existing statistical methods to detect differential gene splicing in RNA-seq data sets. The new methods were successful, and have been implemented as an R package available on GitHub.
Lucas Enz 
The Lasso and Modifications to Control the False Discovery Rate Sara van de Geer 
Benjamin Stucky 
Nowadays, much attention is paid to high-dimensional data sets where the number of predictors p is much larger than the number of observations n. One example is detecting which genes are responsible for a specific biological function of our body. Because measuring microarray data is very costly, we normally have at most a few hundred observations, but thousands of possible genes which could control the function we want to study. Because we have many more predictor variables than observations, we cannot compute a unique solution. Tibshirani (1996) introduced a method called the Lasso, which deals precisely with this problem and sets some variables exactly to zero. In other words, the Lasso can ban some predictors from our model. Nevertheless, the Lasso sometimes picks a lot of predictor variables which are in truth not responsible for the observed process. As a consequence, the false discovery rate (FDR), defined as the expected proportion of irrelevant predictor variables among all selected variables, is not even controlled in some models. In this paper we will focus on a new procedure which controls the FDR better, but does not ban too many predictor variables which are actually relevant for the process, i.e. we do not lose too much power. This paper is mainly based on the works of Bogdan, van den Berg, Su and Candès (2013) (and an updated version) about the procedure they introduced, called SLOPE. We analyze the improvement of SLOPE in high-dimensional examples for the linear model with Gaussian and orthogonal design matrices. In the end, we adapt the idea of SLOPE to the group Lasso, which is very useful if we can group the predictor variables and select or ban a whole group of regression variables. We present an extension of the group Lasso named SIPE and test its performance in sparse scenarios via a simulation study.
Hannes Toggenburger 
Joint Modelling of Repeated Measurement and Time-to-Event Data, with Applications to Data from the IeDEA-SA Marloes Maathuis  
Matthias  Egger 
Klea Panavidou 
After the start of ART, the low CD4 count in an HIV-positive patient typically recovers to a regular level. By measuring the CD4 count repeatedly, a patient's individual CD4 trajectory is known at a discrete set of times. Various approaches have been proposed to model CD4 counts in order to obtain continuous trajectories. If ART is not working any more, the CD4 count will start its decay anew. Such a treatment failure, or in particular the time of its occurrence, is modelled by survival models. In this work, the repeated measurement outcomes of the CD4 count are modelled with a nonlinear mixed-effects (NLME) model with three random effects. The time-to-event data is modelled with a log-normal accelerated failure time (AFT) model. These two models are merged into a random-effects-dependent joint model. Broadly speaking, this means that the random effects of the NLME model are used as continuous predictors in the AFT model. Different approaches to estimating the involved parameters via the maximum likelihood method, and their pitfalls, are discussed. The final model is applied to real data from the International epidemiologic Databases to Evaluate AIDS in sub-Saharan Africa (IeDEA-SA).
Andreas Puccio 
A review of two model-based spike sorting methods Marloes Maathuis  Aug-2014
In modern neuroscience, extracellular recordings play an important role in the analysis of neuron activity. Whereas earlier experiments were based on single electrodes, modern settings consist of a large number of channels that record data from multiple cells simultaneously. In such settings, every electrode will record action potentials from all nearby neurons, visible as spikes whose shape depends on various factors. The problem of spike sorting, in a nutshell, is to detect the occurrence of such spikes in multi-electrode voltage recordings and to classify them, i.e., to identify the corresponding neurons. A widely used approach is a so-called clustering method consisting of a thresholding step to detect the occurrence of a spike, a feature-reduction step (e.g. PCA) and a classification ("sorting") step based on these features. However, this method has several disadvantages, an important one being the inability to handle overlapping spikes. After an introduction into the problem of spike sorting and the data encountered in such settings, we review two different modern spike sorting frameworks, one being binary pursuit (Pillow, Shlens, Chichilnisky, and Simoncelli, 2013) and the other one relying on a method called continuous basis pursuit (Ekanadham, Tranchina, and Simoncelli, 2014). These frameworks use a statistical model for the recorded voltage trace and do not rely on a clustering procedure for the spike train estimation. We present an implementation of binary pursuit in MATLAB, conduct a performance assessment of this algorithm using simulated data and identify advantages and disadvantages of model-based spike sorting algorithms.
Laura Casalena 
Statistical inference for the inverse covariance matrix in high-dimensional settings Sara van de Geer 
Jana Jankova 
The focus of this work is the problem of estimating the inverse covariance matrix Θ∗ in a high-dimensional setting. High-dimensionality is reflected by allowing p to grow as a function of n, but for our results to hold we require p = o(exp(n)). We will propose four different estimation methods for Θ∗ and study their asymptotic properties under appropriate distributional assumptions as well as model assumptions on the concentration matrix Θ∗. In particular, whenever it is possible, we will give rates of convergence in various matrix norms and state results which prove asymptotic normality of each individual element Θ∗_ij. Consequently, we will construct asymptotic confidence intervals for Θ∗_ij. Finally, we will illustrate the theoretical results through numerical simulations.
Fabio Ghielmetti  
Causal Effect Estimation of Structural Pricing Changes in the Airline Industry Peter Bühlmann 
Karl Isler
Pricing changes in the airline industry occur on a daily basis, but their revenue effects are difficult to measure. This problem, namely inferring the causal effect of a pricing change on the revenue, can be modeled by a structural equation model (SEM) and a causal graph. A recently published paper (Ernest and Bühlmann (2014)) showed that causal effects within SEMs can be inferred directly from an additive model, even if the true underlying relationships are not additive. After introducing the subject of Airline Revenue Management and the mathematical tools to infer causal effects, this recent result is applied to actual airline data. Following the identification of the corresponding causal graph, multiple additive models are fitted: with several levels of data aggregation and a comparison of different subsets, the sensitivity of the causal effect estimation is tested. Finally, the results are discussed and interpreted.
Shu Li 
Causal Reasoning in Time Series Analysis through Additive Regression Peter Bühlmann 
Jan Ernest 
Causal inference has evolved from its fractionized early days towards a more unified and formal framework with diverse applications ranging from brain mapping to the modeling of gene regulatory pathways. In a time series setting causal reasoning revolves predominantly around Granger causality, disregarding recent advances in structural equation or graphical modelling. We use the former to explore the potential of intervention-based causal inference from observational time series data. Drawing its inspiration from a recent result by Ernest and Bühlmann (2014), we propose a novel approach for inferring causal effects in AR(p) models: Addtime, short for additive regression in time series analysis. Our method is theoretically sound, even for nonlinear or non-additive AR(p) models and computationally efficient, requiring on average 0.5s per intervention and enabling potentially high-dimensional applications. Empirically, Addtime is able to recover the true effect in simulated and real data. Within the scope of (nonlinear) time series the effect of interventions is largely unexplored. Our approach can be regarded as a safe benchmark for univariate time series and generalizes to the multivariate case without further constraints.
Anja Franceschetti 
Alternatives to Generalized Linear Models in Non-Life Pricing Lukas Meier 
Christoph Buser 
Christina Heinze 
Random Projections in High-dimensional and Large-scale Linear Regression Nicolai Meinshausen  Jul-2014
We study the use of Johnson-Lindenstrauss random projections in different regression settings. First, we examine the high-dimensional case, where the number of variables p greatly exceeds the number of observations n. Specifically, we consider so-called compressed least-squares regression (CLSR). CLSR reduces the dimensionality of the data by a random projection before applying ordinary least squares regression on this compressed data set. We perform an empirical comparison of predictive performance between CLSR and other widely used methods for high-dimensional least squares estimation, such as ridge regression, principal component regression and the Lasso. Our results suggest that an aggregation scheme which averages the predictions of CLSR over a number of independent random projections can greatly improve predictive accuracy. This extension of CLSR performs similarly to the competing methods on a variety of real data sets. Subsequently, we experiment with two variable importance measures where one exploits the fact that omitting variables in the original high-dimensional data set does not necessarily have to change the projection dimension. This allows for the estimated regression coefficients to be directly compared in the compressed space. The second statistic is based on the change in mean squared prediction error. For both importance measures we explore whether the importance of clusters of highly correlated variables can be identified correctly. We find that the procedures work reasonably well for synthetic data sets with large signal-to-noise ratios (SNRs) and no inter-cluster correlations. However, the randomness in the projection matrix makes detection difficult for data sets with low SNRs. Also, different correlation structures between clusters pose significant challenges. Lastly, we look at the large-scale setting where both p and n are very large, and possibly p > n. We develop a distributed algorithm, LOCO, for large-scale ridge regression.
Specifically, LOCO randomly assigns variables to different processing units. The dependencies between variables are preserved using random projections of those variables that were assigned to the respective remaining workers. Importantly, the communication costs of LOCO are very low. In the fixed design setting, we show that the difference between the estimates returned by LOCO and the exact ridge regression solution is bounded. Experimentally, LOCO obtains significant speedups as well as good predictive accuracy. Notably LOCO is able to solve a regression problem with 5 billion non-zeros, distributed across 128 workers, in 25 seconds.
Sabrina Dorn 
Local Polynomial Matching and Considerations with Respect to Bandwidth Choice Sara van de Geer  Jul-2014
This master's thesis considers local polynomial matching, which is a popular method in econometrics for estimating counterfactual outcomes and average treatment effects. We discuss identification of counterfactual expectations under conditional independence, give an overview of selected properties of the local polynomial matching estimator, and apply these to calculate the mean squared error for the corresponding two-step estimator for approximating polynomials of general order. Finally, this enables us to derive and implement a feasible mean squared error criterion that can be minimized numerically, and to provide some evidence of its reasonable performance within an empirical application to the NSW and PSID data.
Olivier Bachem 
Coresets for the DP-Means Clustering Problem Andreas Krause 
Markus Kalisch  
Valentina Lapteva 
Different Stability Selection Models for Structure Learning Nicolai Meinshausen  Jul-2014
Recent developments in analytics, high performance computing, machine learning, and databases have made it possible to collect and process web-scale datasets. Not only does the number of samples increase dramatically, but so does the number of features observed and evaluated. Big data analysis, in turn, requires unique experts who need to fully understand all the attributes of the data and the connections between them, which can be costly if at all possible. All of this makes the problem of automated structure discovery particularly acute. The task of structure learning attracts a lot of attention, with many new algorithms being proposed in recent years. However, all of them highly depend on the choice of a regularization parameter. To deal with this problem, the Stability Selection technique (Meinshausen and Bühlmann, 2010) was proposed. The original formulation of the Stability Selection approach limits the maximum number of false positive variables selected. In this thesis we explore the problem of learning the structure in an undirected Gaussian graphical model.
We extensively explore the properties of Stability Selection when applied in combination with different structure estimators, such as Graphical LASSO (Friedman, Hastie and Tibshirani, 2008), CLIME (Cai, Liu and Luo, 2011) and TIGER (Liu and Wang, 2012). We also propose and explore, for the first time, a variety of different models that are based on the Stability Selection approach, but rely on different types of assumptions or incorporate different types of constraints. For example, we show how to incorporate prior knowledge about the sparsity pattern, or topological constraints such as connectivity or the maximum number of edges adjacent to every node. We also explore assumptions based on the properties of an estimator, such as homogeneous type I and type II discrepancies, or the underlying logistic model as a function from an estimator output and the output of the method. We show that in some cases, either when the prior assumptions hold or when the graphical model structure is dense, the proposed models can serve as a better regularizer for Stability Selection than the original formulation.
Gian Andrea Thanei 
Dimension reduction techniques in regression Nicolai Meinshausen  Jul-2014
Maximilien Vila 
Statistical Validation of Stochastic Loss Reserving Models Submission Lukas Meier 
Jürg Schelldorfer
Claims reserving in non-life insurance is the task of predicting claims reserves for the outstanding loss liabilities. There are many methods and models to set the predicted claims reserves. However, in order to quantify the total prediction uncertainty of the full run-off risk (long term view) or the one-year risk (short term view), a corresponding stochastic model is needed. In practice, one usually compares the results of several stochastic models in order to determine the appropriate claims reserves and their uncertainties. From a statistical point of view, all these stochastic models require a thorough consideration of the data as well as checking whether the model assumptions are fulfilled. In this thesis we investigate these issues by focusing on four different models: the distribution-free Chain Ladder model, the Cumulative Log Normal model, a Bornhuetter-Ferguson model and generalized linear models. We present known statistical tools and some newly developed data plots and model checking graphics to support the decision for the appropriate stochastic model. Different numerical examples are used to illustrate the procedure of model checking. Public triangles and AXA triangles were considered and the conclusions coincide. Therefore, and for reasons of confidentiality, we present only the results for the publicly available data.
Colin Stoneking 
Bayesian inference of Gaussian mixture models with noninformative priors Peter Bühlmann  May-2014
This thesis deals with Bayesian inference of a mixture of Gaussian distributions. A novel formulation of the mixture model is introduced, which includes the prior constraint that each Gaussian component is always assigned a minimal number of data points. This enables noninformative improper priors such as the Jeffreys prior to be used for the component parameters. We demonstrate difficulties involved in specifying a prior for the standard Gaussian mixture model, and show how our new model can be used to overcome these. MCMC methods are given for efficient sampling from the posterior of this model.
Alexandra Ioana Negrut 
Traffic safety in Switzerland Hans R. Künsch  May-2014
More than 50000 car accidents occurred on Swiss roads in 2012. With new data at hand, the Traffic Engineering department of ETH Zurich was interested in finding out which factors determine the severity of a car accident. Moreover, they were interested in what determines the cause and type of a car crash. To answer this first set of questions, parametric and non-parametric methods were used and then compared in terms of misclassification errors and variable ranking. The results confirmed that in order to predict the accident's severity level, one also needs information about the events that didn't happen. In the second part of the thesis, the severe crash frequency was investigated on five of Switzerland's motorways. It was shown that the higher the average daily volume (DTV), the higher the number of severe accidents.
Yannick Trant 
Stock Portfolio Selection with Random Forests Peter Bühlmann 
Thorsten Hens 
Applications of machine learning algorithms to stock selection usually focus on technical parameters and limited sets of fundamental company ratios. In this study the complete balance sheet, income statement and cash flow statement information of US companies from 1989-2013 is used as model input. The amount and inhomogeneous distribution of missing values is a key characteristic and difficulty in working with this data. I present a structured way to prepare this challenging dataset for statistical learning methods. The fundamental data is complemented by a wide range of technical indicators. In this work the predictive power of random forests with respect to stock return prediction is explored on a calibration period from 1989-2006 using this huge data set. My results show that a small but significant predictive power with respect to ranked returns can be attained for an ‘extreme’ random forest parametrization. The calibrated random forest parametrization raises interesting questions with respect to the nature of the data set. Based on the random forest predictions, simple investment strategies are formulated. They exhibit significant outperformance in an out-of-sample backtest for the period from 2006-2013. The risk-adjusted performance measures are on a level with the latest stock selection criteria in the finance literature. Throughout my work I illustrate the challenging peculiarities of working with equity data and propose solutions originating both from finance and mathematics.
Annette Aigner 
Statistical Analysis of Lower Limb Performance Assessments in Patients with Spinal Cord Injuries Marloes Maathuis 
Armin Curt 
Lorenzo Tanadini
Based on longitudinal data from spinal cord injured patients participating in the European Multicenter Study about Spinal Cord Injury, the focus of this thesis lies on the assessments of lower limb performance. Initially, the performance measures' abilities to capture change in a patient's walking ability are measured and their relationships with each other assessed. Based on these results, two measures are identified to subsequently explore the possibility of modelling a patient's recovery in these two outcome measures. Finally, the potential of predicting the extent to which patients will regain their walking ability is examined. Methods were chosen such that the results may best help answer the respective research questions: non-parametric two-sample testing, canonical correlation analysis, principal component analysis, latent class factor analysis, as well as linear mixed effects models and random forests. The findings show that the scores, currently used on an equal footing for assessing lower limb performance, only apply to certain patients. Therefore, there are subgroups of scores associated with specific patient groups. Out of the six walking tests (6MWT, 10MWT, TUG, SCIM3a, SCIM3b, WISCI), 6MWT and SCIM3b exhibit the desired characteristic of responsiveness and turn out to be slightly better, and above all most consistent, with respect to the assessment of the interdependency of all scores. Regarding the potential of modelling recovery, i.e. the development over time, the effect of time on 6MWT exhibits a log-like trend. On the other hand, the recovery measured with SCIM3b develops differently, and time alone may even have a negative influence. The results for the prediction of these two outcomes, six months after injury, showed that such an endeavor is very difficult and will therefore have low accuracy if applied to new patients.
Claude Renaux 
Confidence Intervals Adjusted for High-Dimensional Selective Inference Peter Bühlmann  Apr-2014
There is a growing demand for determining statistical uncertainty which is a largely unexplored field for high-dimensional data. The main focus of this thesis lies on confidence intervals adjusted for selective inference in the high-dimensional case. Selective inference denotes the selection of some co-variables and construction of the corresponding confidence intervals based on the same data. This results in a bias, namely the selection effect. One can correct for the selection effect by adjusting the marginal confidence level. We select some co-variables and apply this adjustment to Bayesian confidence intervals based on Ridge regression and frequentist confidence intervals based on de-sparsifying the Lasso. Furthermore, we summarize the theory of selective inference and of the methods used to construct confidence intervals. The methods are demonstrated on a real data set, and large simulations on synthetic and semi-synthetic data sets are carried out. Two of the three methods proposed to construct Bayesian confidence intervals based on Ridge regression perform well only in some set-ups. Furthermore, our simulations show that the False Coverage-statement Rate (FCR) criterion is controlled and the power takes high values for the confidence intervals based on de-sparsifying the Lasso. Moreover, the implementation of the de-sparsified Lasso can be changed for the purpose of selective inference which results in computations finishing in 1% to 6.5% of the time with only slight changes in the results. The results are useful for settings where selective inference is appropriate and high-dimensional data is present.
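Adjusting the marginal confidence level can be illustrated with a short sketch. It assumes the standard FCR adjustment (if R of m parameters are selected, each selected parameter gets an interval at marginal level 1 - R*q/m) and simple normal-based intervals; the abstract does not spell out the exact rule, and the Bayesian Ridge and de-sparsified Lasso intervals of the thesis are not reproduced here. The function name `fcr_adjusted_intervals` and the toy numbers are hypothetical.

```python
from statistics import NormalDist

def fcr_adjusted_intervals(estimates, std_errors, selected, q=0.05):
    """Normal-based intervals for the selected indices at level 1 - R*q/m."""
    m = len(estimates)   # number of parameters considered
    r = len(selected)    # number of parameters selected
    if r == 0:
        return {}
    # Two-sided interval at level 1 - r*q/m uses the 1 - r*q/(2m) normal quantile.
    z = NormalDist().inv_cdf(1 - r * q / (2 * m))
    return {
        j: (estimates[j] - z * std_errors[j], estimates[j] + z * std_errors[j])
        for j in selected
    }

# Toy example: 4 parameters, 2 of them selected, target FCR q = 0.1.
intervals = fcr_adjusted_intervals(
    estimates=[2.5, 0.1, -3.0, 0.3],
    std_errors=[1.0, 1.0, 1.0, 1.0],
    selected=[0, 2],
    q=0.1,
)
```

With 2 of 4 parameters selected and q = 0.1, the marginal level becomes 0.95 instead of 0.90, so the intervals are wider than unadjusted ones.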
Christoph Dätwyler 
Causality in Time Series, a Time Series Version of the FCI Algorithm and its Application to Data from Molecular Biology Marloes Maathuis  Apr-2014
Among many other concepts, Granger causality has become popular for inferring causal relations in time series. In the first part of this work we give a short introduction to this topic, whereby we see that Granger causality can be formulated in terms of conditional orthogonality or conditional independence and can be closely linked to path diagrams, which provide a convenient way of visualising causal relationships among the factors/variables of interest. A concept called m-separation then provides us with a graphical criterion to infer conditional orthogonality relations in path diagrams, and we conclude the first part with a precise statement linking m-separation and Granger causality. The second part then deals with the FCI algorithm, which has been designed to infer causal relations among systems of variables, where possibly not all of them have been observed. Furthermore, we present an adaptation of the original FCI algorithm to the framework of time series data. In the last part of this thesis we apply the time series version of the FCI algorithm to a dataset from molecular biology, with the goal of inferring causal relations among the factors of interest and thereby gaining a better understanding of how the transcription process of genes works.
Thomas Schulz 
A Clustering Approach to the Lasso in the Context of the HAR model Peter Bühlmann 
Francesco Audrino 
We discuss a covariate clustering approach to the Lasso and compare it to the standard Lasso in the context of the HAR model. We analyze the difference in forecasting error between these models on historical volatility data and find that the error tends to be slightly larger for the clustering approach. Subsequently, we employ the same data to compare the stability of the chosen coefficients for the considered models and we observe that the clustering approach achieves better results than the standard Lasso. Finally, we conduct a data simulation analysis to study stability issues in a synthetic HAR setting and conclude again that the coefficients selected by the clustering approach appear to be more stable.
Huan Liu 
Incorporating Prior Knowledge in CPDAGs Marloes Maathuis  Mar-2014
A causal model can be represented as a graph, with each node representing a variable and each edge representing a causal relationship. A completed partially directed acyclic graph (CPDAG) is such a causal model with no hidden variables, in which each undirected edge can be oriented either way. Causal prior knowledge is expressed as the existence or absence of a directed path from one variable to another. This paper provides an algorithm to incorporate a set of causal prior knowledge into a CPDAG. It uses the chordal properties of a CPDAG to separate the undirected graph into connected subgraphs, and then uses Meek’s rules and theorems to incorporate all the prior knowledge. This paper also proves the correctness of the incorporation for both positive and negative prior knowledge. Furthermore, a simulation is carried out to test and compare the performance of the algorithm.
Lana Colakovic 
Classification using Random Ferns Nicolai Meinshausen  Mar-2014
Random Ferns are a supervised learning algorithm for classification introduced recently by Özuysal, Fua, Calonder, and Lepetit (2010) as a simpler and faster alternative to Random Forests (Breiman (2001)), with specific application in image recognition. In contrast to trees, ferns have a non-hierarchical structure and the aggregation is performed by multiplication rather than averaging. Also, they rely on completely random selection of features as well as split points. The aim of this master's thesis is to investigate general properties of Random Ferns and compare them to Random Forests. We want to see if, and under which circumstances, Random Ferns are comparable in performance to Random Forests. We implemented the Random Ferns algorithm in R and used simulated as well as real data sets to investigate Random Ferns' properties in more detail.
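The structure described above (random features and split points, a flat set of binary tests per fern, aggregation by multiplication) can be sketched in a few lines. This is an illustrative Python sketch, not the R implementation from the thesis; the class name `RandomFerns`, the Laplace smoothing, the uniform class prior and the toy data are assumptions made for the example.

```python
import numpy as np

class RandomFerns:
    """Minimal random ferns classifier (illustrative sketch)."""

    def __init__(self, n_ferns=30, depth=4, seed=0):
        self.n_ferns, self.depth = n_ferns, depth
        self.rng = np.random.default_rng(seed)

    def _leaf_index(self, X):
        # Each fern applies `depth` binary tests; the bits form a leaf number.
        bits = X[:, self.features] > self.thresholds      # (n, n_ferns, depth)
        weights = 1 << np.arange(self.depth)
        return (bits * weights).sum(axis=-1)               # (n, n_ferns)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        self.classes_, y_idx = np.unique(y, return_inverse=True)
        n, p = X.shape
        # Completely random features and split points within the observed range.
        self.features = self.rng.integers(0, p, size=(self.n_ferns, self.depth))
        lo, hi = X.min(axis=0), X.max(axis=0)
        self.thresholds = self.rng.uniform(lo[self.features], hi[self.features])
        leaves = self._leaf_index(X)
        # Leaf counts per fern and class, starting at 1 (Laplace smoothing).
        counts = np.ones((self.n_ferns, len(self.classes_), 2 ** self.depth))
        for f in range(self.n_ferns):
            np.add.at(counts[f], (y_idx, leaves[:, f]), 1)
        self.log_prob = np.log(counts / counts.sum(axis=2, keepdims=True))
        return self

    def predict(self, X):
        leaves = self._leaf_index(np.asarray(X, dtype=float))
        ferns = np.arange(self.n_ferns)
        # scores[i, f, c] = log P(leaf of sample i in fern f | class c)
        scores = self.log_prob[ferns[None, :], :, leaves]
        # Multiplying probabilities across ferns = summing their logs.
        return self.classes_[scores.sum(axis=1).argmax(axis=1)]

# Toy data (assumed): two well-separated Gaussian classes in two dimensions.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
y_train = np.repeat([0, 1], 200)
X_test = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_test = np.repeat([0, 1], 100)

model = RandomFerns().fit(X_train, y_train)
accuracy = float(np.mean(model.predict(X_test) == y_test))
```

Summing log-probabilities across ferns is the numerically stable form of the multiplicative aggregation; a forest would instead average the votes or class probabilities of its trees.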
Christoph Kovacs 
Semi-supervised Label Propagation Models for Relational Classification in Dyadic Networks: Theory, Application and Extensions Marloes Maathuis 
Lukas Meier 
If a dataset not only comprises instance features but also exhibits a relational structure between its elements, it can be represented as a network with nodes defined by instances and links defined by relations. Data analysis can be performed on such a structure under the statistical relational learning (SRL) paradigm. Two of its basic cornerstones, collective classification and collective inference, can be carried out by semi-supervised label propagation (SSLP) algorithms, which allow for label information to be propagated and updated through the network to arrive at class affiliation predictions for unlabeled nodes. For this purpose, harmonic functions have been applied on Gaussian random fields and adapted accordingly, leading to the weighted-vote Relational Neighbor classifier with Relaxation Labeling (wvRNRL). Extending this approach to support social features, extractable from the network’s topology, results in the Social Context Relational Neighbor (SCRN) classifier. Moreover, MultiRankWalk (MRW), a classifier which uses ideas from random walk with restart, is presented and discussed. These different semi-supervised classification models are being applied on nine dyadic networks and their prediction performances are being evaluated for various accuracy measures using the repeated Network Cross-Validation (rNCV) scheme. Ideas to relax certain model restrictions and to expand their applicability are outlined, together with a suggested measure of unlabeled node importance (MIUN statistic). In order to provide an adequate visualization of the obtained results, a new means of holistic visualization, the Circo-Clustogram, is proposed. A discussion of the advantages and disadvantages of semi-supervised label propagation and its applicability concludes this thesis.
Ambra Toletti 
Tree-based variational methods for parameter estimation, inference and denoising on Markov Random Fields Sara van de Geer  Feb-2014
The attention paid by statisticians and computer scientists to variational methods has increased considerably in the last few decades. While it has become (computationally) cheap to store huge amounts of multivariate data describing complex systems (e.g. in natural sciences, sociology, etc.), processing this information to obtain parameter estimates for the underlying statistical models, or to perform inference or denoising, is still infeasible in general. In fact, classical (exact) methods (e.g. computing Maximum Likelihood estimates via Iterative Proportional Fitting) need a huge amount of time to solve these problems if the complexity/size of the underlying model is sufficiently large. Markov Random Fields, which are widely used because of their convenient representations as both graphs and exponential families, are not immune to this problem. In this case it is possible to convert both inference and parameter estimation into constrained optimization problems connected with the exponential representation. Unfortunately this transformation does not provide any improvement in feasibility, because it is often impossible to write the objective function in an explicit way and even the number of constraints is prohibitive. One can obtain a computationally cheaper (approximate) solution by appropriately relaxing the constraints and by approximating the objective function. In this work the relaxation was made by considering all combinations of locally consistent marginal distributions, and the objective function was approximated with a convex combination of Bethe entropy approximations based on the spanning trees of the underlying graph. Wainwright (2006) proved that parameter estimates obtained with this method are asymptotically normal but do not converge toward the true parameter. However, if these estimates are used for purposes such as inference or denoising, their performance is comparable to that of exact methods.
In this work some empirical evidence confirming these properties for an Ising model on a grid graph was produced, and general definitions and results about graphical models and variational methods were summarized.
Tobia Fasciati 
Semi Supervised Learning Markus Kalisch  Feb-2014
The potential advantages of Semi Supervised Learning (SSL) compared to more traditional learning methods like Supervised and Unsupervised Learning have attracted many researchers in the recent past. The goal is to learn a classifier from data containing both labeled and unlabeled observations by exploiting their geometrical position. The aim of this Master Thesis is to give an overview of SSL and to study two different methods, Transductive Support Vector Machine and Anchor Graph Regularization. Finally, both approaches are tested on selected datasets.
David Bürge 
Causal Additive Models with Tree Structure: Structure Search and Causal Effects Peter Bühlmann 
Jonas Peters 
Drawing conclusions about causal relations from data is a central goal in numerous scientific fields. In this thesis we study a special case of a restricted structural equation model (SEM). In addition to the common assumptions of acyclicity and no hidden confounders, we assume additive Gaussian noise, non-linear functions and a causal structure represented by a directed acyclic graph (DAG) with tree structure. Given data from such a causal additive model with tree structure (CAMtree) we estimate the underlying tree structure and give characterisations of the causal effects from variables on others. This restricted model leads to several simplifications. Identifiability of the structure is guaranteed by a result from Peters et al. (2013). We present a method that efficiently finds a maximum likelihood estimator for the causal structure among all trees. As our method is based on local properties of the distribution, it extends without constraints to high-dimensional settings. Furthermore, we investigate how to characterise causal effects from one variable on others. The maximum mean discrepancy is used to quantify changes in the distribution of the effect variable when the potential cause is varied. Based on our estimate for the structure, we present a procedure which, given only observational data, predicts the strongest causal effects. All methods are implemented in R and we give experimental results for synthetic data and one set of real high-dimensional data.
Emilija Perkovic 
The FCI+ Algorithm Markus Kalisch  Feb-2014
The primary focus of this thesis was to understand and implement the FCI+ algorithm as described in “Learning Sparse Causal Models is not NP-hard” by Claassen, Mooij, and Heskes (2013a). In order to understand how this algorithm works, a short introduction to causality is given and some methods of dealing with causal data are examined. Firstly, we introduce the reader to the terminology and graphical representation of causal systems. Then we focus on examining methods for dealing with data from causal systems when there are no hidden variables (PC), as opposed to those when hidden variables are present (FCI, FCI+). Special attention is given to the theory behind the FCI+ algorithm. In the end a comparison between FCI and FCI+ is made, based on accuracy and computational time, and conclusions are drawn.
Xi Xia 
Comparison of Different Confidence Interval Method for Linear Mixed Effect Models Martin Maechler  Feb-2014
Our study is a simulation analysis of different confidence interval methods for fixed-effect parameters in linear mixed-effects models. Two functions, the lmer function in the lme4 package and the lme function in the nlme package, are used to fit the linear mixed-effects models. Six different confidence interval methods from the packages lme4, nlme, lmerTest and boot are studied and compared. We conclude that the lmer and lme functions give similar results in fitting the LME models, but the bias grows as the number of fixed effects increases. For the confidence interval methods, a general finding is that most of the intervals are too small. Among all methods, the lmerTest method performs best. It has the lowest confidence interval MP among all methods and its coverage rate is closest to the nominal rate (α). The drawback of lmerTest is that it sometimes returns an error or intervals which do not make sense (e.g. with an infinite boundary), and it runs significantly slower than lme4-Wald and nlme-intervals. Lme4-Wald and nlme-intervals are both very stable and fast, but the intervals are nearly always too small. The profile method is not better than lmerTest, and bootstrap-type methods perform worst. We also found that poor performance of confidence intervals might sometimes indicate overfitting in the model design.


Vineet Mohan 
Grouped Regression in High Dimensional Statistics Sara van de Geer  Oct-2013
This work is devoted to clustered estimation in a sparse linear model where parameters are highly correlated and far outnumber the observations. Three variants of the group lasso technique from the literature are examined. They are found to be equivalent to a weighted lasso after some dimension reduction. The priors they impose on the parameters are used to suggest which class of problems they work best with. Based on this analysis, a new estimator that bases dimension reduction on principal component analysis is proposed. Empirical experiments follow to confirm the theoretical results.
Vasily Tolkachev 
Parameter Estimation for Diffusion Process Hans Rudolf Künsch  Sep-2013
This work considers the estimating functions approach to calibrating parameters in stochastic differential equations based on discretely sampled observations. Since the likelihood function is not known in closed form in the discrete case, we have to rely on an approximation to the score function, the estimating function, and then take its root as the estimator. It turns out that roots of estimating functions enjoy a number of remarkable asymptotic properties. First, the regularity assumptions required for the main results to hold are outlined. Then we consider a first main result: when the conditional moments of the process are known in closed form, the roots of an estimating function are asymptotically normal. Secondly, a more general theorem, which uses sample moments instead of the conditional ones in the estimating function, is discussed. Under a suitable choice of the approximating scheme the roots are still asymptotically normal, but with bias and larger variance. Finally, estimation of both drift and diffusion coefficients is considered for Geometric Brownian Motion and the Ornstein-Uhlenbeck process, with data generated by Monte Carlo simulation. Important issues for various values of the parameters are emphasized, as well as advantages and difficulties of using estimating functions.
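The case where conditional moments are known in closed form can be illustrated on the Ornstein-Uhlenbeck process. The sketch below is an assumption-laden example, not the thesis's code: it uses the mean-zero parameterisation dX = -theta*X dt + sigma dW, for which E[X_{t+d} | X_t] = X_t*exp(-theta*d) and Var[X_{t+d} | X_t] = sigma^2*(1 - exp(-2*theta*d))/(2*theta), and it solves the two martingale estimating equations sum X_{i-1}*(X_i - a*X_{i-1}) = 0 and sum [(X_i - a*X_{i-1})^2 - v] = 0 in closed form. The function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_ou(theta, sigma, delta, n, seed=0):
    """Exact simulation of dX = -theta*X dt + sigma dW on a grid of width delta."""
    rng = np.random.default_rng(seed)
    a = np.exp(-theta * delta)                        # conditional mean factor
    sd = sigma * np.sqrt((1 - a ** 2) / (2 * theta))  # conditional std deviation
    x = np.empty(n + 1)
    x[0] = rng.normal(0.0, sigma / np.sqrt(2 * theta))  # stationary start
    for i in range(n):
        x[i + 1] = a * x[i] + sd * rng.normal()
    return x

def estimate_ou(x, delta):
    """Roots of the martingale estimating functions for (theta, sigma)."""
    x0, x1 = x[:-1], x[1:]
    # Root of sum x0 * (x1 - a * x0) = 0 with a = exp(-theta * delta).
    a_hat = np.sum(x0 * x1) / np.sum(x0 ** 2)
    theta_hat = -np.log(a_hat) / delta
    # Match the mean squared residual to the closed-form conditional variance.
    resid_var = np.mean((x1 - a_hat * x0) ** 2)
    sigma_hat = np.sqrt(2 * theta_hat * resid_var / (1 - a_hat ** 2))
    return theta_hat, sigma_hat

path = simulate_ou(theta=1.5, sigma=0.8, delta=0.1, n=20000)
theta_hat, sigma_hat = estimate_ou(path, delta=0.1)
```

When the conditional moments are not available in closed form, they would have to be replaced by approximations, which is the setting in which the abstract reports bias and larger variance.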
Sarah Grimm 
Supervised and semi-supervised classification of skin cancer Sara van de Geer 
Markus Kalisch 
Chris Snijders
As skin cancer rates continue to grow, dermatologists will be overwhelmed by the number of patients seeking skin cancer diagnosis. This problem is being addressed in the Netherlands, where research in collaboration with a hospital has been developing logistic regression models that may help train nurses to diagnose skin cancer, and that are accessible via a mobile application. The present work investigated whether the logistic regression models could be improved or outperformed. A small simulation study explored the future potential for improving the models by incorporating information from patients who would use the application and who have not received a diagnosis. Logistic regression proved to be a competitive model. A smaller set of predictors with which models performed practically as well was identified. Although incorporating information from undiagnosed cases did not improve performance, it also did not degrade it, and it is worth continuing to investigate the value of undiagnosed cases for model performance.
Lennart Schiffmann 
Measuring the MFD of Zurich: Identifying and Evaluating Strategies for an Efficient Placement of Detectors Marloes Maathuis 
Markus Kalisch 
In recent years the macroscopic fundamental diagram (MFD) was established in the traffic research community. It can describe the overall traffic state in homogeneously congested areas in cities. To facilitate a real world implementation of an MFD-based traffic control system, we are developing strategies for placement of fixed monitoring resources (e.g. loop detectors). These strategies to place detectors efficiently are based on univariate and bivariate distributions of street properties such as road length, number of lanes and occurrence of traffic lights. We find that the use of bivariate distributions including the length of streets can yield good results. Our research is based on a microsimulation of the city of Zurich implemented in VISSIM.
Reto Christoffel-Totzke 
Time Series Analysis Applied to Power Market Data Peter Bühlmann  Aug-2013
The subject of this thesis is the daily closing prices of the futures contract for the Base-13. The goal is to elaborate on their characteristics and to understand which factors determine their trend. Using appropriate methods and procedures, the most important of numerous variables are selected and five different models are developed to describe the Base-13. These models are also used to compute precise one-step-ahead forecasts of future closing prices. A short introduction equips the reader with the necessary basic knowledge about how power markets function, so that the results of the analyses and their interpretations can be understood. The descriptive time series analysis of the closing prices in the following section shows that the volatility changes heavily over time, which makes developing the models challenging. Furthermore, in the same section, the Random Walk Hypothesis of independent and random price changes could not be confirmed for the Base-13 contract. The next section focuses on GLM models. Based on GLM, a model has been developed which includes the most important indicators for the closing prices: coal API2-13, EUA-13, Gas TTF-13, CLDS and the USD/EUR exchange rate. The resulting GLM forecasting model performed very accurately, with a precision in trend of 81%. A strong linear correlation appears between the Base-13 and coal, EUA, gas and the exchange rates, which have the major quantifiable impact, as shown in a graphical analysis of these effects. Thereafter, the impact analysis is deepened. It produces some interesting insights into how the closing prices react to changes in the volatility of the input variables.
All variables of the final GLM model are highly significant in the GAM as well and show identical features regarding their impact on the Base-13. The forecasting model with GAM reaches an accuracy in trend of 78%. The research documented in the next section confirms four important variables of the final model by applying MARS: coal, EUA, CLDS and the USD/EUR exchange rate. According to the graphical analysis, the effect of these most important variables is likewise almost linear. The forecasting model with MARS reaches an accuracy in trend of 78%. Furthermore, another forecasting model has been developed with NNET, which captures non-linear effects to an acceptable extent. The corresponding effect plot illustrates this non-linearity clearly; it is especially strong for gas, the exchange rates, coal and CLSS. The forecasting model with NNET achieves an accuracy in trend of 74%. The following section illustrates that the results with PPR confirm, to a considerable extent, the outcome obtained with GLM for the final model. The forecasting model based on PPR shows a precision in trend of 75%. Various theoretical findings on what drives the closing prices of the Base-13, as well as findings based on applied experience, have been confirmed with empirical data. The straightforward linear model has proven very accurate as well as comprehensible thanks to its mathematical form. Furthermore, it has been demonstrated that complex non-linear models bear no advantage, owing to the strong correlation between the most important variables and the Base-13. It can therefore be concluded that the goals set for this thesis have been achieved by providing substantial insight into theoretical and applied aspects of statistical models for forecasting futures closing prices.
Andrea Remo Riva 
Convex optimization for variable selection in high-dimensional statistics Sara van de Geer  Jul-2013
Which genes favor or oppose the development of potentially fatal diseases such as prostate cancer, Crohn’s or Huntington’s disease? The world around us is increasingly confronted with situations in which large amounts of collected data must be interpreted in order to formulate specific hypotheses about the causes of phenomena of particular interest. Modern statistics therefore seeks to develop new tools that can deal effectively with this kind of problem. This Master's thesis first reviews the basic ideas behind the LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani in 1996, and the basics of convex optimization. The study then focuses on finding optimal solutions by regularizing the empirical risk with appropriate nonsmooth norms. Proximal methods handle these optimization problems effectively and in a variety of ways, and they are of considerable interest from a computational point of view because the proposed algorithms have good convergence rates. Later we explore the possibility of introducing structured sparsity in the solutions in order to greatly improve the quality of the regression coefficients. For this purpose we introduce new variational norms that constrain auxiliary vectors with positive components to lie in a set of our choice, which in turn induces the desired structure. Finally, some applications in the fields of image processing and medical research illustrate concretely how high-dimensional statistics is called upon today to help people.
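As a concrete instance of a proximal method for a nonsmooth regularizer, the sketch below applies proximal gradient descent (ISTA) to the LASSO objective 0.5*||y - X*beta||^2 + lam*||beta||_1, whose proximal operator is soft-thresholding. It is a minimal illustration and not taken from the thesis; the objective scaling, the fixed step size 1/L with L the squared spectral norm of X, and the function names are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)           # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# With an orthonormal design the LASSO solution is soft-thresholding of y itself.
y = np.array([3.0, -0.5, 1.2])
beta_hat = ista_lasso(np.eye(3), y, lam=1.0)
```

Structured-sparsity norms fit the same loop: only the proximal operator changes, for example to block-wise shrinkage for a group penalty.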
Nilkanth Kumar  
An Empirical Analysis of the Mobility Behaviour in Switzerland using Robust Methods  Werner Stahel 
Massimo Filippini
In this thesis, the demand for personalized mobility by Swiss households has been studied using vehicle stock parameters, geographic and socio-economic characteristics. For this purpose, disaggregate household-level data from the latest Swiss travel micro-census for the period 2010-2011 has been used. In addition to the OLS approach, robust methods using MM-estimators have been incorporated to obtain improved model fits and estimation results. A few related demand questions, like comparing car usage of single and multiple-car households, are also explored.
The estimated coefficients mostly have the expected signs. The demand for personal mobility is found to vary widely across different locations and households. The non-availability of good public transport in an area is found to be associated with a significantly higher demand for car utilization. Rich households appear to have a higher travel demand in general. Efficient cars are found to be driven more compared to those with poor energy ratings. In multi-car households, a vehicle usage disparity of as much as 21% is observed based on the efficiency label. From a policy maker's point of view, further research into specific areas is advised to assess the feasibility of diverse policy instruments that account for the differences in people's vehicle utilization behaviour.
Nicolas Meng 
Optimal Portfolios - The Benefits of Advanced Techniques in Risk Management and Portfolio Optimization Sara van de Geer 
Markus Kalisch 
This Master's thesis deals with the most important challenges facing practitioners in portfolio and risk management. It embeds a variety of risk and optimization methodologies into a common framework and performs an empirical backtest on a typical sector rotation strategy in the US market. The objective of this study is to evaluate the impact of wrong assumptions in risk modeling and portfolio optimization, as a recent survey showed that practitioners are still using simplified approaches based on wrong assumptions, despite empirical evidence that contradicts these assumptions. First, we apply different risk forecast models to the empirical data. Apart from an unconditional model that is still prominent in practice, a constant conditional correlation (CCC) model and a dynamic conditional correlation (DCC) model are implemented, and the forecasting performance is evaluated on the risk measures of volatility, VaR, and CVaR. There is clear empirical evidence that the unconditional model performs poorly and leads to severe underforecasting and clustering of losses during the financial crisis of 2008. The more complex DCC model provided the most accurate forecasts, followed by the CCC model. This demonstrates that wrong model assumptions lead to unacceptable results in practice. Based on forecasts from all risk models, two optimization approaches are tested. An adapted version of the traditional mean-variance optimization is employed. Additionally, a relatively new method of diversification optimization is implemented and compared against return maximization, subject to a CVaR constraint. Using this comparison, we examine the effect of estimation error on the expected returns and risk parameters. 
As a diversification approach is invariant to the estimates of expected returns, we assume that it should provide more stability to an optimized portfolio. We were able to confirm the concerns about estimation error and found that return maximization does not lead to optimal portfolios out-of-sample. In contrast, the empirical results of the diversification-CVaR strategy are promising. Maximum diversification of independent risk factors leads to better performance in terms of both realized risk and returns. In light of these findings, we question the practice of using the traditional method of return maximization, as the cost of ignoring estimation error in the optimization seems to be significant. Finally, we conclude that the standard approach still followed by a majority of practitioners does not deliver satisfactory results due to wrong assumptions about the statistical properties of the financial markets. We conclude that conditional risk estimates and the problem of estimation error are important aspects that cannot be neglected solely for the sake of simplicity.
Cong Dat Huynh 
Semi-supervised learning methods for problems having positive and unlabeled examples Sara van de Geer 
Markus Kalisch 
Thomas Beer, Swisscom
A company can use upselling methods to upgrade the products its customers have bought from it. Besides increasing profit, the customers' stronger dependence on the company through the upgraded products can help to reduce the churn rate. This is especially crucial in the telecommunication sector, in which volatility is high and customer loyalty is low. The easiest way to upsell is to offer the customers the upgraded products. However, not all products should be offered to all customers, because too much marketing information will annoy them. In this paper we introduce and compare several methods that can support the decision of whether or not to offer a product to a customer. The effectiveness of the methods is validated through a simulation study based on real-world datasets. The results of the study indicate that several methods have great potential.
Ruben Dezeure 
P-values for high-dimensional statistics Peter Bühlmann  Mar-2013
In this work, recently published methods for hypothesis testing in high-dimensional statistics are studied. The methods are compared by testing for variable importance in linear models for a variety of test setups, including real datasets. For multiple testing correction a procedure is used that is closely related to the Westfall-Young procedure, which has been shown to have asymptotically optimal power. The estimation performance of the regression coefficients is also looked at to provide a different level of comparison. Finally, we also test for a logistic regression model to investigate if testing in generalized linear models is reliable with state-of-the-art methods.
Harald Bernhard 
Parameter estimation in state space models Hans Rudolf Künsch  Mar-2013
We considered the effectiveness of a particle approximation procedure to the score function via filtered moments of artificially time-varying parameters in general state space models. To investigate this issue we considered a simple two state hidden Markov model where exact reference values are available. For this model we conducted simulation studies to estimate several diagnostic statistics about the score approximation procedure. The results were then used to perform maximum likelihood estimation in the same model, using the noisy score approximation in combination with a stochastic approximation procedure.
Mark Hannay 
Robust Testing and Robust Model Selection Werner Stahel 
Manuel Koller
As the title suggests, this thesis belongs to the domain of robust statistics. There are three main chapters: testing in linear models, testing in generalized linear models, and model selection. We start with the linear model, where we describe classical estimation and classical tests. After describing the classical methods, we introduce robust estimation, namely the SMDM estimator. With our robust estimates we present robust tests, including a new robust score test. To improve the speed of our new robust score test, we develop methods to estimate the scale parameter $\sigma$ from the reduced model. The most prominent robust tests for the composite hypothesis are the robust Wald test and the $\tau$-test. Both of these tests are computationally expensive, as they require fitting the full model. We develop new robust tests that only require fitting the reduced model. In generalized linear models (GLMs), we once again describe the classical estimation and the classical tests. By using robust scores, we introduce robust estimation. In GLMs, two prominent robust tests already exist, the quasi deviance test and the robust saddlepoint test. However, they are computationally expensive. So we introduce the robust Wald test and the robust score test, which are both computationally cheaper. Here we compare the quasi deviance test with the robust Wald test and the robust score test, while simultaneously comparing them to the classical saddlepoint test. In the chapter on model selection, we introduce an important method, the classical Mallows' $C_{p}$ criterion. By using the classical Mallows' $C_{p}$ criterion in an example, we discuss the importance of using robust methods for model selection. So we develop our own robust Mallows' $C_{p}$ criterion, which works well in the example. We compare the classical and the robust Mallows' $C_{p}$ criteria with each other in a simulation study. Another approach to model selection, based on testing, is also discussed. 
I have tried to make this thesis as self-contained and as comprehensive as possible, while keeping to the essentials. Chapters 2 and 4 should be accessible to readers with a good foundation in linear regression, while chapter 3 should be accessible to readers with a good foundation in generalized linear regression.
Benjamin Stucky 
Second-Level Significance Testing Sara van de Geer  Feb-2013
The emergence of all the modern-day information gathering technologies, amongst all their benefits, gave rise to some new problems and challenges. Nowadays we need to be able to handle huge data sets. We will often face the problem that some information of interest is very rarely contained in our data. At the same time this information is very hard to distinguish from every other observation. This thesis will focus on how to detect the presence of such sparse information with the aid of a method called Higher Criticism. This is a hypothesis test to determine whether we have a very small fraction of non-null hypotheses amongst many null hypotheses or if this fraction is indeed zero. For the definition of this test we need a collection of different significance tests, hence the name Second-Level Significance Testing. Higher Criticism was suggested by Tukey in 1976 and then developed by Donoho and Jin [15] in 2004. This thesis is mainly based on their work as well as the work of Cai et al. [9]. The main focus lies on the detection of sparse signals, but some cases where the signals are dense are also discussed. Higher Criticism works very well for the adaptive detection of sparse and faint signals amongst background noise. Adaptive means that Higher Criticism is able to work without knowing the sparseness and the faintness of the detection problem. The case where the data is Gaussian distributed is the basis for developing the Higher Criticism test statistic. In this setting Higher Criticism is optimal. Optimality means that asymptotically Higher Criticism is able to detect all theoretically detectable signals. The detectable signals are described by the detection boundary. We also encounter the problem of correlated observations. There we can modify Higher Criticism and still get good results, following the work of Hall and Jin [19]. 
The notion of the detection boundary and Higher Criticism can even be generalized to a wide range of different settings, following Cai and Wu [12]. Higher Criticism thus solves one challenge that new technologies have posed. We discuss other important problems connected to the detection of sparse signals according to Cai et al. [10], such as the estimation of the fraction of sparse signals and discovering which observations are signals of interest.
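To make the test statistic concrete, the following is a minimal sketch of the Higher Criticism statistic in the commonly used form attributed to Donoho and Jin: the maximum, over the smallest fraction alpha0 of the sorted p-values p_(i), of sqrt(n)·(i/n − p_(i)) / sqrt(p_(i)·(1 − p_(i))). The function name `higher_criticism`, the default `alpha0=0.5`, and the clipping of p-values away from 0 and 1 are assumptions of this example and are not taken from the thesis.

```python
import numpy as np


def higher_criticism(pvalues, alpha0=0.5):
    """Higher Criticism statistic computed from a collection of p-values.

    Hypothetical helper for illustration only.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # Numerical guard (an assumption of this sketch): avoid division by zero
    # when a p-value is exactly 0 or 1.
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    i = np.arange(1, n + 1)
    # Standardized gap between the empirical and the uniform distribution
    # function, evaluated at each sorted p-value.
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1.0 - p))
    # Maximize only over the smallest alpha0 fraction of the p-values.
    k = max(1, int(np.floor(alpha0 * n)))
    return float(hc[:k].max())
```

Large values of the statistic indicate that there are more small p-values than a collection of true null hypotheses would produce; how large is "large" depends on the calibration, which this sketch does not address.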


Giacomo Dalla Chiara 
Factor approach to forecasting with high-dimensional data, an application to financial returns Peter Bühlmann  Oct-2012
This study considers forecasting a time series of financial returns in a linear regression setting using a number of macroeconomic predictors (N) which can exceed the number of time series observations (T). Usually, regression estimation techniques either consider only a handful of predictors or assume that the vector of parameters is sparse. Several recent papers advocate the use of a factor approach to deal with such high-dimensional data without discarding any of the predictors. Assuming an approximate factor structure on the data, it is possible to summarize the large set of time series using a limited number of indexes, which can be consistently estimated using principal components. First, we review the recent theoretical developments in the construction and estimation of a forecasting procedure which uses the large-dimensional approximate factor model. The aim is to help bridge these studies with the empirical research, which reports mixed performance results for the factor model implemented on real-world data. In the second part we discuss four implementation techniques of the factor model, namely (i) screening, (ii) estimation window size selection, (iii) factor selection, and (iv) variable selection in a factor-augmented regression. We argue that these four methodologies, which have often been considered separately in the empirical literature, are paramount for the factor model to achieve a better forecasting performance than lower dimensional models. In the last part of this study, factor-augmented models, with and without the above-mentioned methodologies, are implemented using the Stock and Watson (2006) dataset of macroeconomic and financial predictors to forecast the time series of monthly returns of the Standard & Poor's 500 index. The empirical results show that screening and estimation window size selection are needed in the factor model to outperform lower dimensional benchmarks. 
The main contribution of this work is to provide general guidelines for applying the large dimensional factor model to real-world data. All the practical methodologies discussed in the paper are coded in the R programming language, and are contained in Appendix E.
Raphael Gervais 
Predicting the Effect of Joint Interventions from Observational Data Marloes Maathuis  Sep-2012
It is commonly believed that causal knowledge discovery is not possible from observational data and requires the use of experiments. Indeed, it is impossible to learn causal information from observational data when one is not willing to make any assumptions. However, under some fairly general assumptions, IDA (Intervention calculus when the DAG is Absent) is a methodology that can deduce information on causal effects from observational data. The present work extends the IDA methodology in two ways. Firstly, in the case of single outside interventions on a system, two new algorithms are presented: IDA Path and IDA Semi-Local. These algorithms compare favourably in simulation studies in terms of both statistical properties and computational efficiency. Secondly, the IDA methodology is extended to cases where one seeks information about the causal effect of joint outside interventions on a system. Here, two algorithms are introduced, IDA IPW Joint and IDA Path Joint, that show encouraging results in simulation studies. These new algorithms for joint interventions may easily be extended to the creation of IDA-type algorithms for arbitrarily many outside interventions on a system.
Laura Buzdugan 
High-dimensional statistical inference Peter Bühlmann 
Markus Kalisch 
The present work seeks to address the issue of error control in high-dimensional settings. This task has proven challenging due to (1) the difficulty of deriving the asymptotic normal distribution of the estimators, and (2) the high degree of multicollinearity commonly exhibited by the predictor variables. These two issues were addressed by combining the method of Bühlmann (2012) for constructing p-values based on Ridge estimation with an additional bias correction term, and the proposal of Meinshausen (2008) for a hierarchical testing procedure that controls the FWER (Family Wise Error Rate) at all levels. This led to an extension of the construction of p-values to cases in which the response variable is multivariate. The new method was tested on an SNP phenotype association dataset, which also allowed for investigation of different approaches to bias correction.
Michel Philipp 
Cost Efficiency of Managed Care Programs in Health Care Insurance Werner Stahel  Aug-2012
Managed care (MC) plans in health care systems promise an improved quality of medical service at significantly lower expenses. Therefore, politicians and health insurers have a strong incentive to estimate the cost efficiency of such alternative insurance plans from historical data on health care expenditure (HCE). However, estimating cost effects between basic and alternative insurance plans in an observational study is particularly challenging. Differences between the baseline characteristics in the different insurance collectives result in selection biases. This occurs notably when insurance companies offer discounts to MC plan policyholders, effectively creating an economic incentive that appeals mainly to young and healthy people. This thesis first discusses the statistical challenges that arise when estimating the cost efficiency of MC plans. To draw causal conclusions from estimates based on observed health care data, the MC plan assignment must be independent of the HCE within subgroups of relevant confounders. Unfortunately, not every potential confounder can be observed by the insurance companies and therefore we conclude that it is not practicable to estimate causal effects from the available health care data. However, insurance companies are similarly interested in monitoring the HCE between different insurance plans. Therefore, we analyse data from a large Swiss insurance company using Tobit regression to estimate differences in (left-censored) HCE between basic and MC insurance plans, particularly within regions and pharmaceutical cost groups. Further, we attempt to improve the models using a propensity score, the probability of choosing MC insurance, and calculate the confidence bands of the resulting differences in HCE between insurance plans from 100 bootstrap replications. To avoid additional bias we excluded covariates that are potentially affected by the MC plan. The estimates that we obtain with our models vary significantly between regions. 
However, in total we obtain lower HCE compared to basic insurance of (with 95% confidence limits). Since it is unknown if the requirements for causal inference are met, our conclusion is that one cannot entirely exclude remaining selection bias from these estimates.
Rainer Ott 
A Wavelet Packet Transform based Stock Index Prediction Algorithm Hans Rudolf Künsch  
Kilian Vollenweider
Evangelos Kotsalis
In this Master's thesis we develop prediction algorithms which optimize a performance measure over a specified set of wavelet packet trees and smoothing parameters. The performance of the algorithms is evaluated for the daily DAX prices from 18th December 2003 to 30th December 2011. Using a quantitative return quality measure with an algorithm based on a delayed version of the discrete wavelet packet transform (DWPT), we were able to outperform the exponentially weighted moving average trend follower. For the same algorithm, 3 out of 25 wavelet packet trees were observed to be favourable. Furthermore, the DWPT was found to consistently outperform the discrete wavelet transform if the Haar basis is used.
Sylvain Robert 
Sequential Monte Carlo methods for a dynamical model of stock prices Hans Rudolf Künsch  
Didier Sornette
Stock markets often exhibit behaviours that are far from equilibrium, such as bubbles and crashes. The model developed in Yukalov et al. (2009) aims at describing the dynamics of stock prices, and notably the way they deviate from their fundamental value. The present work aimed to estimate the parameters of the model and to filter the underlying mispricing process. Various Sequential Monte Carlo methods were applied to the problem at hand. In particular, a fully adapted Particle Filter was derived and showed the best performance. While the filtering was handled well by the different methods, the estimation of the parameters was much more difficult. Nevertheless, it was possible to identify the market type, which qualitatively describes the dynamics of a stock. The methods were first tested on simulated data before being applied to the Dow Jones Industrial Average. The latter application led to very interesting results. Indeed, the estimated model provided insight into the underlying dynamics, and the filtering of the mispricing process shed new light on some important financial events of the last 40 years.
Radu Petru Tanase 
Learning Causal Structures from Gaussian Structural Equation Models Jonas Peters 
Peter Bühlmann 
Traditional algorithms in causal inference assume the Markov and faithfulness conditions and recover the causal structure up to the Markov equivalence class. Recent advances have shown that by using structural equation models it is possible to go even further and in some cases identify the underlying causal DAG from the joint distribution. We focus on an identifiability result for linear Gaussian SEMs with equal noise variances and propose an algorithm that estimates the causal DAG from such models. We evaluate the performance of the algorithm in a simulation study and compare it to the performance of two other existing methods: the PC Algorithm and Greedy Equivalence Search.
Matteo Tanadini 
Regression with Relationship Matrices using partial Mantel tests Werner Stahel   Jul-2012
Relationship matrices and the statistical methods used to analyse them are of growing importance in science because of the increasing number of systems that are represented by networks. Relationship matrices are often used in fields such as the social sciences, biology or economics. In the context of multiple linear regression with relationship matrices, partial Mantel tests represent the standard statistical framework for inference. Several approaches of this kind can be found in the literature. In order to evaluate the performance of these methods, a sensible way to simulate datasets is indispensable. Unfortunately, studies conducted so far comparing the performance of partial Mantel tests rely on inadequate simulated datasets and are therefore questionable. The goals of this Master's thesis were to compare the performance (measured as level and power) of widely used partial Mantel tests using state-of-the-art simulation techniques and to describe new implementations with improved performance. In a first phase, we focused on improving the quality of models used to simulate datasets for multiple linear regression with relationship matrices. We were able to propose two convenient procedures for simulating predictors (i.e. relationship matrices). We could also show a more appropriate way to simulate the error term for linear regression with relationship matrices. In a second phase, we described three modifications for partial Mantel tests that are supposed to improve performance. The implementation of these improvements in R code will be the subject of future research. Finally, we compared the performance of three partial Mantel tests using datasets simulated according to our improved technique. The results agree with previous studies and confirm that the method proposed by Freedman & Lane has the best overall performance.
Markus Harlacher 
Cointegration Based Statistical Arbitrage Sara van de Geer 
Markus Kalisch 
This thesis analyses a cointegration based statistical arbitrage model. Starting with a brief overview of the topic, a simulation study is carried out that is intended to shed light on the mode of action of such a model and to highlight some potential flaws of the method. The study continues with a back-test on the US equity market for the period from 1996 to 2011. The results of all the different model versions that were tested look quite promising. "Traditional" mean-variance based performance measurements attest very good results to the employed cointegration based statistical arbitrage model. The advanced dependence analysis with respect to the returns of the S&P 500 index and the returns obtained from the back-test shows a very favourable structure and indicates that such a model can provide returns that are only very weakly related to the returns of the S&P 500 index.
Yongsheng Wang 
Numerical approximations and Goodness-of-fit of Copulas Martin Mächler 
Werner Stahel 
The author first gives an introduction to copulas and derives the Rosenblatt transform of elliptical copulas. To circumvent numerical challenges in estimating the density of the Gumbel copula, several approaches are presented. The author finds an algorithm to choose appropriate methods under various conditions. It is obtained by first determining the bit precision when using the benchmark method dsSib.Rmpfr and then conducting a simulation study for comparisons. This is followed by a review of goodness-of-fit methods for copulas, including tests based on empirical copulas, the Rosenblatt transform, the Kendall transform and the Hering-Hofert transform. The author conducts a large simulation experiment to investigate the effect of the dimension on the level and power of goodness-of-fit tests for various combinations of null hypothesis copulas and alternative copulas. The results are interpreted via graphs of confidence intervals and power ratios. Also, the relationships among the computational time, dimension, sample size and number of bootstraps are explored. Last, the dependence structure of the Dow Jones 30 is investigated using graphical goodness-of-fit tests under various types of Student-t, Gumbel and Clayton copula families. The Student-t copula with unstructured correlation matrix and optimized degrees of freedom estimated by maximum likelihood gives the best solution.
Amanda Strong 
A review of anomaly detection with focus on changepoint detection Sara van de Geer 
Markus Kalisch 
Anomaly detection has the goal of identifying data that is, in some sense, not "normal." The definition of what is anomalous and what is normal is heavily dependent on the application. The unifying factor across applications is that, in general, anomalies occur only rarely. This means that we do not have much information available for modeling the anomaly generating distribution directly. We will describe several ways of approaching anomaly detection and discuss some of the properties of these approaches. Changepoint detection can be considered a subtopic in anomaly detection. Here the problem setting is more specific. We have a sequence of observations and we would like to detect whether their generating distribution has remained stable or has undergone some abrupt change. The goals of a changepoint analysis may include both detecting that a change has occurred and estimating the time of the change. We will discuss some of the classic approaches to changepoint detection. As very large datasets become more common, so do the instances in which it is difficult or impossible for humans to heuristically monitor for anomalous observations or events. The development and improvement of anomaly detection methods is therefore of ever-increasing importance.
Peter Fabsic 
Comparing the accuracy of ROC curve estimation methods Peter Bühlmann  Jul-2012
The aim of this study is to compare the accuracy of commonly used ROC curve estimation methods. The following ROC curve estimators were compared: empirical, parametric, binormal, "log-concave" together with its smoothed version (as introduced in Rufibach (2011)), and the estimator based on kernel smoothing. Two simulations were carried out, each assessing the performance of the estimators in a range of scenarios. In each scenario we simulated data from known distributions and computed the true and the estimated ROC curves. Using various measures we assessed how close the estimates were to the true curve. In the first simulation, a large sample size was used to compute the estimated ROC curves. A substantially smaller sample size was used in the other simulation. The "log-concave estimator" was found to perform the best when a large sample was available. On the other hand, the estimator based on kernel smoothing outperformed all other competitors in the simulation with the small sample size.
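For orientation, the simplest of the estimators listed in this abstract, the empirical ROC curve, can be sketched as follows. This is an illustrative example rather than code from the thesis: the function names `empirical_roc` and `auc`, the convention that larger scores indicate cases, and the trapezoidal area computation are assumptions of this sketch.

```python
import numpy as np


def empirical_roc(controls, cases):
    """Empirical ROC curve, assuming larger scores indicate cases.

    Returns false and true positive rates, starting at (0, 0) and
    ending at (1, 1). Hypothetical helper for illustration only.
    """
    controls = np.asarray(controls, dtype=float)
    cases = np.asarray(cases, dtype=float)
    # Every observed score is a candidate threshold, from largest to smallest.
    thresholds = np.unique(np.concatenate([controls, cases]))[::-1]
    fpr = [0.0]
    tpr = [0.0]
    for c in thresholds:
        fpr.append(np.mean(controls >= c))   # share of controls classified positive
        tpr.append(np.mean(cases >= c))      # share of cases classified positive
    return np.array(fpr), np.array(tpr)


def auc(fpr, tpr):
    """Area under a piecewise linear ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

With perfectly separated samples the area under this curve equals 1, and with identical samples it equals 0.5, which matches the usual interpretation of the AUC.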
Edgar Alan Muro Jimenez 
About Statistical Learning Theory and Online Convex Optimization Sara van de Geer 
This work is divided into two parts. In the first part, we present a relationship between an empirical process and the minimax regret of a game from Prediction with Expert Advice (PWEA). We use this expression to show how the lower bound of a PWEA minimax regret can give us some information about the form of the expert class being used, in particular, whether it is a VC class or not. In the second part, we analyse from a theoretical point of view the similarities in the performance of algorithms from Statistical Learning, PWEA, and Online Convex Optimization (OCO). We present results for the three methods showing that the rate of decay of the prediction error depends on the curvature of the loss function over the space of the predictor's choices. In addition, we provide theorems for Statistical Learning and OCO showing that similar lower bounds for their regrets can be obtained assuming that the minimizer of the expected loss is not unique. This provides more evidence of the resemblance between the performance of algorithms from Statistical Learning and OCO. Finally, we show that any PWEA game can be seen as a special case of an OCO game. Even though this represents an advantage for finding upper bounds for PWEA, we present an example where the upper bounds for the regret originally created for OCO are not better than those found for PWEA.
Elena Fattorini 
Estimating the direction of the causal effects for observational data Marloes Maathuis  Jul-2012
In many scientific studies, causal relationships are of crucial importance. Unfortunately, it is not possible to calculate causal effects from observational data alone without making some assumptions. In this thesis, the observational data are assumed to be generated from an unknown directed acyclic graph (DAG). Under such a model, bounds on causal effects can be computed with the approach of Maathuis, Kalisch, and Bühlmann (2009). The idea behind this approach is as follows. First, one tries to estimate the DAG that generated the data and then one computes the causal effects for the obtained DAG. However, under our assumptions we can generally only identify an equivalence class of DAGs that are compatible with the data. Due to the existence of these different possible generating DAGs, the causal effect of a variable X on a variable Y cannot always be identified uniquely. However, one can identify the causal effect for each DAG in the equivalence class, and collect all these effects in a multiset. These multisets can be summarized using summary measures. For example, in the paper of Maathuis et al. (2009) the minimum absolute value is used as a summary measure. That gives a lower bound on the size of the causal effect. In this thesis, we focus on the problem of how to derive the sign of the causal effects. Clearly, the minimum absolute value is not appropriate for this purpose. Eight new summary measures are proposed and simulation studies are performed to find the summary measure that best detects the largest positive causal effects among a set of given variables. The summary measures are compared using averaged ROC curves. The maximum and the mean turn out to be the best summary measures. In the estimated graphs, some edges are oriented in the wrong direction. A large positive causal effect can be estimated as zero due to a wrongly directed arrow. 
Therefore, in order to detect all the largest positive causal effects, one should also investigate the effects which are estimated as zero.
David Schönholzer 
Geostatistische Kartierung der Waldbodenversauerung im Kanton Zürich Andreas Papritz
Hans Rudolf Künsch  
Against the background of heightened awareness of the increasing acidification of forest soils in Switzerland and in the canton of Zurich, this thesis maps, for the first time with nearly complete coverage, the degree of acidification of forest soils in the canton of Zurich based on the estimated pH value of the topsoil. To this end, a number of national and cantonal datasets on soil acidification, climate, vegetation, topography and geology are processed and used for the statistical estimation of soil acidification. To address several difficulties in the statistical estimation of environmental measurement variables, a combination of statistical methods is employed, in particular from geostatistics and robust statistics.
Myriam Riek 
Towards Consistency of the PC-Algorithm for Categorical Data in High-Dimensional Settings Marloes Maathuis  Jun-2012
The PC-algorithm is an algorithm used to learn about or estimate the causal structure among a causally sufficient set V of random variables from data. Under the assumption of faithfulness, the PC-algorithm yields an estimate of the graph representing the Markov equivalence class of causal structures over V that are compatible with the probability distribution defined over V. Consistency of an estimator is a crucial property. It has been proven to hold for the PC-algorithm applied to multivariate normal data in high-dimensional settings where the number of variables is increasing with sample size, under some conditions ([10]). In this master thesis, an attempt was made to prove consistency of the PC-algorithm applied to categorical data in low- and high-dimensional settings.
Stephan Hemri 
Calibrating multi-model runoff predictions for a head catchment using Bayesian model averaging Hans Rudolf Künsch  
Felix Fundel
One approach to quantifying uncertainty in hydrological rainfall-runoff modeling is to use meteorological ensemble prediction systems as input for a hydrological model. Such ensemble forecasts consist of a possibly large number of deterministic forecasts. Uncertainty is given by their spread. As such ensemble forecasts are often under-dispersed and biased and do not account for other sources of uncertainty, such as the hydrological model formulation, statistical post-processing needs to be applied to achieve sharp and calibrated predictions. In this thesis, runoff forecasts from summer 2007 to the end of 2009 for the river Alp in Switzerland are post-processed by applying Bayesian model averaging (BMA). A total of 68 ensemble members coming from one deterministic forecast and two ensemble forecasts are used as input for BMA. These forecasts cover different lead-times from 1h to 240h. First, BMA based on univariate normal and inverse gamma distributions is performed under the assumption of independence between lead-times. Then, the independence assumption is relaxed in order to simultaneously estimate multivariate runoff forecasts over the entire range of lead-times. This approach is based on a BMA version that uses multivariate normal distributions. Since river discharge follows a highly skewed distribution, a Box-Cox transformation is applied in order to achieve approximate normality. Back-transformation combined with data quality leads in some cases to excessively high predicted probabilities of extremely high runoffs. Using the inverse gamma distribution instead does not remove this problem either. Nevertheless, both the univariate and the multivariate BMA approaches are able to generate well-calibrated forecasts that are considerably sharper than the climatology.
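The Box-Cox step mentioned in the abstract can be sketched in Python as follows. This is only an illustration of the transformation and its inverse; the function names and the parameter value `lam = 0.2` are assumptions, not taken from the thesis.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam


def inv_box_cox(z, lam):
    """Back-transformation to the original scale (needs lam * z + 1 > 0)."""
    return math.exp(z) if lam == 0 else (lam * z + 1) ** (1 / lam)


# Hypothetical runoff value and transformation parameter.
runoff = 5.0
lam = 0.2
z = box_cox(runoff, lam)
back = inv_box_cox(z, lam)
```

Because the back-transformation is convex for `lam < 1`, small changes in the upper tail on the transformed scale become large changes on the runoff scale, which is consistent with the difficulty the abstract reports for extremely high runoffs.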
Linda Staub 
On the Statistical Analysis of Support Vector Machines Sara van de Geer 
We analyze Support Vector Machines from a theoretical and computational point of view by explaining every building block of this algorithm separately, where we mainly restrict ourselves to binary classification. We start with loss functions and risks and then make a digression to the theory of kernel functions and their Reproducing Kernel Hilbert Spaces. We are then ready to perform the statistical analysis, where we assume in a first part the data to be independent and identically distributed. This analysis aims to investigate under which conditions on the regularization sequence the method is consistent and, more interestingly, to find the optimal learning rate and a way of nearly reaching it. We thereby explain the results given in [21] and add the missing proofs. Next, we briefly discuss the computational aspects of support vector machines, where we show that numerically the problem is reduced to solving a finite dimensional convex program. Subsequently, we explain how to use support vector machines in practice by applying the R function svm() from the package e1071 to independent and identically distributed data. We then slightly violate this assumption and generate data of a GARCH process which naturally carries a dependence structure and observe that the algorithm still produces good results for this kind of data. We finally find the theoretical explanation for this by performing a statistical analysis of support vector machines for weakly dependent data following the work of [22].
Christian Haas 
Analysis of market efficiency: Post earnings drift in Swiss stock prices   Peter Bühlmann  Mar-2012
The study of stock market behaviour and market efficiency is a very active topic within probability theory and statistics. Market models and their implications have recently attracted attention beyond the mathematical and economic communities. In this thesis, we take a look at some market models and studies of market efficiency. To this end, we establish the theory behind efficiency and regression. In chapter 7 we then study the post earnings drift for Swiss stocks. We find a significant intraday drift in the direction of the first reaction after the earnings release. We then look at a strategy that uses our result and try to answer whether or not we have found market inefficiency.
Ana Teresa Yanes Musetti 
Clustering methods for financial time series Martin Mächler 
Werner Stahel 
The purpose of this thesis is to study a set of companies from the S&P 100 and determine whether share closing prices that move together correspond to companies belonging to the same economic sector. To verify this, different clustering methods were applied to a dissimilarity matrix corresponding to the degree of dependence between the companies. Since financial data do not exhibit a multivariate normal distribution, non-parametric dependence measures were needed. For this, the theory of Hoeffding’s D, Kendall’s τ and Spearman’s ρ was reviewed. Then, in order to choose the best clustering solution, a set of validation statistics was applied. To compare in advance the performance of the different clustering methods and the validation statistics under different degrees of cluster overlap, two simulation studies were carried out. The first simulation was based on correlation matrices computed from covariance matrix samples from a Wishart distribution and the second one on models with Gaussian mixture distributions. This study showed that the transformation of the data, either from dependence measures or from distances to (dis)similarities, has an impact on the performance of the clustering methods. Additionally, some of the validation statistics performed poorly in the simulation studies in extreme scenarios where the clusters are very well separated. Finally, when the companies belonging to the S&P 100 were clustered, the PAM method applied to the dissimilarity matrix estimated with Hoeffding’s D gave the best solution compared to the clustering methods AGNES, DIANA and DSC, which agreed with the results from the simulation studies and the reviewed theory.
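As a rough illustration of turning a rank-based dependence measure into a dissimilarity, the following Python sketch computes Kendall's τ (the tau-a variant, without tie correction) and maps it to a dissimilarity. The mapping `1 - tau` and the function names are assumptions; the abstract notes that the choice of transformation affects the clustering results but does not state which one was used.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def dissimilarity(tau):
    """One possible (assumed) mapping: strong positive dependence -> 0."""
    return 1.0 - tau


# Hypothetical closing-price series for two companies.
a = [10.0, 10.5, 11.0, 10.8]
b = [20.0, 21.0, 22.5, 22.0]
d_ab = dissimilarity(kendall_tau(a, b))
```

A matrix of such pairwise dissimilarities is the kind of input that methods such as PAM or AGNES operate on.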
Lukas Patrick Abegg 
Analysis of market risk models Werner Stahel 
Evangelos Kotsalis
Lukas Wehinger
In this thesis, risk models are evaluated in a joint project with the swissQuant Group AG and a major Swiss bank. Risk models with high complexity, e.g. based on GARCH models and different distribution assumptions, as well as simpler models, e.g. based on EWMA models and normal distributions, are assessed and compared for weekly data. The out-of-sample results are assessed graphically and the evaluation is performed with statistical tests applied to large-scale data. At the 95% confidence level, the quality of the Value-at-Risk estimates under the simple and complex models is assessed to be similar. If the Expected Shortfall and the Value-at-Risk at higher confidence levels are considered, however, the sophisticated methods improve the risk estimates. In addition, a risk model based on copulas, GARCH models and non-parametric distribution estimates is developed and found to outperform the risk models provided.
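A simple model of the "EWMA plus normal distribution" kind can be sketched in Python as below. The decay factor `lam = 0.94`, the initialization with the first squared return, the zero-mean assumption and the function names are all illustrative assumptions; the models actually evaluated in the thesis are not specified in the abstract.

```python
from statistics import NormalDist


def ewma_variance(returns, lam=0.94):
    """Exponentially weighted moving average of squared returns."""
    var = returns[0] ** 2  # assumed initialization
    for r in returns[1:]:
        var = lam * var + (1 - lam) * r ** 2
    return var


def normal_var(returns, level=0.95, lam=0.94):
    """Value-at-Risk (as a positive loss) under a zero-mean normal model."""
    sigma = ewma_variance(returns, lam) ** 0.5
    return NormalDist().inv_cdf(level) * sigma


# Hypothetical weekly returns.
weekly_returns = [0.01, -0.02, 0.015, -0.03, 0.005]
var_95 = normal_var(weekly_returns, level=0.95)
var_99 = normal_var(weekly_returns, level=0.99)
```

Under a normal model the higher-level estimate is just a rescaling of the 95% one, so any tail behaviour beyond the normal shape is missed; this may be one reason why more flexible distribution assumptions matter more at higher confidence levels.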
Martina Albers 
Boundary Estimation of Densities with Bounded Support Geurt Jongbloed
Marloes Maathuis 
When estimating a density supported on a bounded or semi-infinite interval by kernel density estimation, problems may arise at the boundary. In the past, many variations of the 'standard' kernel density estimator have been developed to achieve boundary corrections. Smooth estimates of the distribution and density functions have recently been derived for current status censored data. This topic is closely related to kernel density estimation and the mentioned boundary problem can also appear in this context. In this Master's thesis some boundary corrections were combined with the smooth distribution estimates for current status censored data. Simulations to analyze the performance of these new constructions were carried out using R.
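The boundary problem and one classical correction can be sketched in Python. The reflection method shown here is only one of many boundary corrections and is used as an assumed example; the abstract does not say which corrections the thesis combined with the current status estimators. All function names and the sample are hypothetical.

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, data, h):
    """Standard kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


def kde_reflected(x, data, h):
    """Reflection-corrected estimate for a density supported on [0, inf).

    Kernel mass that leaks below 0 is folded back by adding the estimate
    built from the data mirrored at the boundary.
    """
    if x < 0:
        return 0.0
    return kde(x, data, h) + kde(-x, data, h)


# Hypothetical positive-valued sample concentrated near the boundary.
sample = [0.05, 0.1, 0.2, 0.4, 0.7, 1.1, 1.8]
uncorrected_at_0 = kde(0.0, sample, 0.5)
corrected_at_0 = kde_reflected(0.0, sample, 0.5)
```

At the boundary point itself the corrected value is exactly twice the uncorrected one, reflecting the fact that the standard estimator places roughly half of the nearby kernel mass outside the support.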
Alexandros Gekenidis 
Learning Causal Models from Binary Interventional Data Peter Bühlmann  Mar-2012
The goal of this thesis is to provide and test a method for causal inference from binary data. To this end, we first introduce the mathematical tools for describing causal relationships between random variables, such as directed acyclic graphs (DAGs for short), in which the random variables are represented by vertices whereas edges stand for causal influences. A DAG can, however, only be identified up to Markov equivalence, which roughly means that one can estimate its skeleton, but not the direction of most of the edges. This can be improved by performing interventions, i.e. by forcing a certain value upon one or several random variables and observing the change in the values of the other variables to obtain additional data. The resulting Markov equivalence classes form a finer partitioning of the space of DAGs than the non-interventional ones, thus improving the estimation possibilities. Based upon this theory, we adapt the existing Greedy Interventional Equivalence Search algorithm (GIES, [1]) to the case of binary random variables and test it on simulated data.
Eszter Ilona Lohn 
Estimating the clinical score of coma patients - a comparison of model selection methods Werner Stahel 
Markus Kalisch 
The aim of this thesis is to explore the possibility of estimating coma patients' clinical awareness score from objective clinical measurements, in order to replace the rather subjective examination by doctors, which is expensive and time-consuming. Variable selection and model fitting methods are compared by cross-validation. The basic analysis is extended towards block subset analysis, alternative cross-validation schemes and an analysis of the dynamics of the clinical score. As only a small sample is available, overfitting is a serious concern throughout the analysis, as seen in the difference between in-sample and cross-validated model fits. In general we observe that low-variance (higher-bias) methods perform better at this sample size. In the end it is concluded that, based on this sample, the clinical measurements contain little information about the clinical awareness score.
Lisa Borsi 
Estimating the causal effect of switching to second-line antiretroviral HIV treatment using G-computation Marloes Maathuis 
Markus Kalisch 
Thomas Gsponer
Understanding causal effects between exposure and outcome is of great interest in many fields. In this work, the causal effect of switching to second-line antiretroviral treatment on death is estimated for a study population of HIV-infected patients experiencing immunological failure in Southern Africa (Zambia and Malawi). CD4 cell count is considered as a time-varying confounder of treatment switching and death, while it is itself affected by previous treatment. Given the impossibility of conducting a randomised experiment, we address the problem of time-varying confounding by G-computation. Under certain conditions, G-computation yields consistent estimates of the causal effect by simulating what would happen to the study population if treatment were set to a certain regime by intervention. In our analysis we compare the intervention “always switch to second-line treatment” to the intervention “always remain on first-line treatment”. We find the resulting risk ratio to be 0.24 (95% CI 0.14-0.33), indicating that the risk of dying is smaller in the population that switched to second-line treatment than in the population that stayed on first-line treatment. Thus, we conclude that there is a beneficial causal effect of switching to second-line treatment among HIV patients experiencing immunological failure.
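The core of G-computation can be sketched in Python for a deliberately simplified setting: a single time point, one binary confounder and nonparametric stratum frequencies. The thesis deals with time-varying confounding, which this sketch does not reproduce; the function name, the record layout `(l, a, y)` and the toy data are all hypothetical.

```python
def g_computation_risk(records, treatment):
    """Standardized risk of the outcome if everyone received `treatment`.

    records: list of (l, a, y) with binary confounder l, treatment a and
    outcome y. Computes sum over l of P(Y=1 | A=treatment, L=l) * P(L=l).
    Assumes every confounder stratum contains both treatment groups.
    """
    n = len(records)
    risk = 0.0
    for l in (0, 1):
        stratum = [r for r in records if r[0] == l]
        if not stratum:
            continue
        treated = [r for r in stratum if r[1] == treatment]
        p_l = len(stratum) / n
        p_y = sum(r[2] for r in treated) / len(treated)
        risk += p_y * p_l
    return risk


# Toy data (l, a, y); purely illustrative, not from the study.
records = [
    (0, 1, 0), (0, 1, 0), (0, 0, 0), (0, 0, 1),
    (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 1),
]
risk_switch = g_computation_risk(records, treatment=1)
risk_stay = g_computation_risk(records, treatment=0)
risk_ratio = risk_switch / risk_stay
```

A risk ratio below one in this sketch has the same reading as in the abstract: the simulated population under "always switch" has a lower risk than under "always remain".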
Gardar Sveinbjoernsson 
Practical aspects of causal inference from observational data Peter Bühlmann  Mar-2012
In this thesis we study methods to infer causal relationships from observational data. Under some assumptions, causal effects can be estimated using Pearl’s intervention calculus, provided that the data are supplemented with a known causal influence diagram. We study the IDA algorithm, which estimates the equivalence class of this diagram and uses the intervention calculus to get a lower bound on the size of the causal effects. Since discovering structure can be difficult, especially in high-dimensional settings, we combine the IDA algorithm with stability selection, a subsampling method, to select the most stable causal effects. Hoping for an improvement, we verify our results on a data set where the true causal effects are known from experiments. We also investigate the robustness of our method with a simulation study in which we look at violations of the assumptions.
Simon Kunz 
Simulated Maximum Likelihood Estimation of the Parameters of Stochastic Differential Equations Lukas Meier 
Werner Stahel 
Marcel Freisem 
Estimating rating transition probabilities and their dependence on macroeconomic conditions for a bank loan portfolio Peter Bühlmann  Feb-2012
Tulasi Agnihotram 
Statistical Analysis of Target SNPs and their Association with Phenotypes Peter Bühlmann 
Markus Kalisch 
Genomics is influencing not only the field of medicine, but also distantly related fields such as the behavioural sciences and economics. The primary goal of this thesis is to investigate the relation between the genome, represented by SNPs, and the behavioural characteristics (such as risk aversion) of an individual, using supervised learning techniques. The human genome has 23 chromosomes, which contain information on millions of SNPs. Applying supervised learning techniques to millions of SNPs is difficult and may not be efficient. To simplify the analysis we select Target SNPs, which can represent all the surrounding SNPs. Target SNPs can be found by linkage disequilibrium with our modified version of Carlson's algorithm. When random forest (a supervised learning technique) was applied with genotype data at the Target SNPs as predictors and the categorized phenotype data as response vector, the error rates obtained for each phenotype were not informative. Using a heuristic approach, we select Best SNPs from all SNPs on the chromosomes according to their rank correlation with the phenotype. With the test data at the Best SNPs as predictors and the categorized test data of the phenotype as response vector, the error rate of random forest did not suggest a relationship between genotype and phenotype. Furthermore, we apply this procedure to random SNPs, compare the results with those for the Target SNPs and Best SNPs, and provide directions for future work.


Sung Won Kim 
Study on Empirical Process, based on Empirical Process Theory and Applications in Nonparametric Statistics Sara van de Geer  Dec-2011
Any estimator is a function of the empirical measure, while what we want to estimate is a function of the theoretical measure. To justify our estimator, we therefore want to see that the estimator, a function of the empirical measure, converges to the parameter, a function of the theoretical measure, as the sample size grows. In general, however, the function to be measured is unknown, and one wants to see simultaneous convergence over the class of all possible functions. Thus, we present uniform laws of large numbers, which show that the empirical measure on a class of functions converges to the theoretical measure on that class. To show this, one requires an entropy condition on the class, which ensures the proper size of the class of functions to be estimated, and the condition of a finite envelope, where the envelope is the supremum of the class of functions. Furthermore, we address the uniform central limit theorem, which describes how well the empirical measure converges to the theoretical measure. If one can show the equicontinuity of the empirical process, indexed by the class of functions, and if this indexing class is totally bounded, then the class of functions is P-Donsker; equivalently, the process satisfies the uniform central limit theorem. That is, the empirical process converges to a Gaussian distribution. Equicontinuity, derived for showing the P-Donsker property, opens the way to deducing the rate of convergence, in our case, of least squares estimators. Therefore, as an application, we derive the rates of convergence of the least squares estimators for different classes of functions. We also consider the rates of convergence of the least squares estimators when a penalty is imposed on the complexity of the class of models. Even if one does not know the optimal model in the class, a proper choice of penalty allows one to attain the optimal rate of convergence, as if the optimal model were known.
As applications of the uniform law of large numbers (ULLN) and the uniform central limit theorem (UCLT), convergence and normality of M-estimators are introduced as well. There, one can see how empirical process theory is applied on the way to proving those properties. Furthermore, in order to see whether a class satisfies the ULLN or UCLT, it is convenient to use the Vapnik-Chervonenkis (VC) index. Vapnik-Chervonenkis classes, whose VC index is finite, satisfy both the ULLN and UCLT under an envelope condition, and it would play a role in empirical process
Andre Meichtry 
Back pain and depression across 11 years: Analysis of Swiss Household Panel data Werner Stahel 
Thomas Läubli 
Design and objective: In this longitudinal retrospective cohort study, we analysed back pain and depression data across 11 years in the general population of Switzerland. The main objective was to investigate the association between back pain and depression. Methods: We used data from the Swiss Household Panel. 7799 individuals (aged 13-93, mean 42.9 years, 56.2% women) were interviewed between 1999 and 2009. Observed depression and back pain were described across 11 years. Missingness was assumed to be independent of unobserved data. We estimated marginal structural models using inverse-probability-of-exposure-and-censoring weights to assess the (causal) association between back pain history and depression. Correlated data were analysed by fitting marginal and transition models with generalised estimating equations yielding robust sandwich variance estimates. Results: Cross-sectional analysis adjusting for other time-fixed covariates showed that back pain was associated with a 42% increase in the odds of depression over time. The association of continuous past back pain up to time t-1 with depression at time t was 0.65 on a linear logistic scale (95% CI: 0.48-0.82), corresponding to a 92% (62-127%) increase in the odds of depression. Assuming a causal model accounting for confounding of back pain by past depression, a marginal structural model (inverse-probability-of-exposure-and-censoring weighted model) regressing depression on past back pain showed an association of 0.63 (0.44-0.81) on a linear logistic scale, corresponding to an 87% (55-126%) increase in the odds of depression. Expressing exposure history by cumulative back pain up to time t-1, the marginal structural model estimated a causal effect on depression at time t that increased with age at baseline and decreased for individuals with depression at baseline.
Conclusion: Marginal structural models are well suited for the analysis of observational longitudinal data with time-dependent potential causes of depression; however, marginal structural models do not address all issues of causal inference. Back pain history is one of many possible causes of depression. Future work must collect more socio-economic and health-related covariates, investigate possible non-ignorable missingness and investigate other functions of back pain history.
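The step from a coefficient on the linear logistic (log-odds) scale to a percentage increase in the odds is the transformation 100 * (exp(beta) - 1). A minimal Python sketch, using the coefficient 0.65 and its interval 0.48-0.82 quoted in the abstract; the function name is hypothetical.

```python
import math


def pct_increase_in_odds(beta):
    """Percentage change in the odds implied by a log-odds coefficient."""
    return (math.exp(beta) - 1) * 100


# Coefficient and 95% CI limits as quoted in the abstract.
estimate = round(pct_increase_in_odds(0.65))
lower = round(pct_increase_in_odds(0.48))
upper = round(pct_increase_in_odds(0.82))
```

These values reproduce the reported 92% (62-127%). Applying the same formula to rounded coefficients can differ from published percentages by a point when the latter were computed from unrounded estimates.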
Jongkil Kim 
Heavy Tails and Self Similarity of Wind Turbulence Data (corrected version July 2012) Hans Rudolf Künsch  Nov-2011
In this thesis, we perform a statistical analysis in order to characterize wind turbulence. We estimate the pdf of the increments of wind velocities, which have heavy tails. We also estimate the autocovariances and the autocorrelations of the increments of wind velocities, revealing their second-order properties for the purpose of showing self-similarity. Parsimonious properties of wind turbulence are discussed in terms of the estimated parameters. Under reasonable assumptions, relations between the lag of the wind increments and the estimated parameters are suggested, and the results are interpreted. In addition, the dependency between the wind increments and mean velocities is discussed. Non-parametric tests are performed to check whether a dependency exists between the increments of wind velocities and block mean velocities. The dependency of two consecutive increments on the block mean velocities is also examined. Key words: Wind turbulence, Generalized Hyperbolic distribution, Normal Inverse Gaussian distribution, Self-similarity
Evgenia Ageeva 
Bayesian Inference for Multivariate t Copulas Modeling Financial Market Risk Martin Mächler 
Peter Bühlmann 
The main objective of this thesis is to develop a Markov chain Monte Carlo (MCMC) method under the Bayesian inference framework for estimating meta-t copula functions for modeling financial market risks. The complete posterior distribution of the copula parameters resulting from Bayesian MCMC allows further analysis such as calculating the risk measures that incorporate the parameter uncertainty. The simulation study of the fictitious and real equity portfolio returns shows that the parameter uncertainty tends to increase the risk measures, such as the Value-at-Risk and the Expected Shortfall of the profit-and-loss distribution.
Emmanuel Payebto Zoua 
Subsampling estimates of the Lasso distribution. Peter Bühlmann   Sep-2011
We investigate possibilities offered by subsampling to estimate the distribution of the Lasso estimator and to construct confidence intervals and hypothesis tests. Despite being inferior to the bootstrap in terms of higher-order accuracy in situations where the latter is consistent, subsampling offers the advantage of working under very weak assumptions. Thus, building upon Knight and Fu (2000), we first study the asymptotics of the Lasso estimator in a low-dimensional setting and prove that, under an orthogonal design assumption, the finite-sample component distributions converge to a limit in a mode allowing for consistency of subsampling confidence intervals. We give hints that this result holds in greater generality. In a high-dimensional setting, we study the adaptive Lasso under the assumption of partial orthogonality introduced by Huang, Ma and Zhang (2008) and use the partial oracle result in distribution to argue that subsampling should provide valid confidence intervals for nonzero parameters. Simulation studies confirm the validity of subsampling to construct confidence intervals and tests for null hypotheses and to control the FWER through subsampled p-values in a low-dimensional setting. In the high-dimensional setting, confidence intervals for nonzero coefficients are slightly anticonservative and false positive rates are shown to be conservative.
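The general subsampling recipe for a confidence interval can be sketched in Python. As a stand-in for the Lasso, this sketch uses the sample mean with convergence rate sqrt(n); the function name, subsample size `b`, number of subsamples and the data are all assumptions made for illustration.

```python
import random
from statistics import mean


def subsampling_ci(data, estimator, b, n_sub=1000, level=0.95, seed=0):
    """Subsampling confidence interval assuming a sqrt(n) convergence rate.

    The law of sqrt(n) * (theta_hat_n - theta) is approximated by the
    empirical law of sqrt(b) * (theta_hat_b - theta_hat_n) over subsamples
    of size b drawn without replacement.
    """
    rng = random.Random(seed)
    n = len(data)
    theta_hat = estimator(data)
    roots = sorted(
        (b ** 0.5) * (estimator(rng.sample(data, b)) - theta_hat)
        for _ in range(n_sub)
    )
    alpha = 1 - level
    q_lo = roots[int((alpha / 2) * n_sub)]
    q_hi = roots[int((1 - alpha / 2) * n_sub) - 1]
    return theta_hat - q_hi / n ** 0.5, theta_hat - q_lo / n ** 0.5


# Hypothetical data set of 200 observations.
data = [((i * 37) % 101) / 10 for i in range(200)]
lo, hi = subsampling_ci(data, mean, b=30)
```

The subsample size `b` must grow with `n` while `b / n` tends to zero; choosing it in practice is a separate question that this sketch does not address.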
Hesam Montazeri 
Nonparametric Density and Mode Estimation for Bounded Data Rita Ghosh
Werner Stahel 
This thesis investigates the performance of various estimators in density estimation and mode estimation for bounded data. It is shown that many nonparametric estimators have boundary bias when the true probability density function has compact support. Because the boundary region might be a large percentage of the whole support, the boundary bias problem could be very serious in many complex real-world applications. The widely accepted method for boundary bias correction in regression and density estimation is Automatic Boundary Correction [1]. This method is based on local polynomial fitting and needs no explicit correction for boundary effects. In the first part of this thesis, we consider applications of this method and Parzen's method to density estimation for some bounded univariate and bivariate data examples. It is shown that the local polynomial based method has no significant boundary bias in the considered examples. In addition, we give a new formula for the asymptotic bias of the density estimate based on local polynomial fitting which includes the bin width parameter. In the second part of this thesis, we consider mode estimation and examine several methods for bounded data. We show that many nonparametric mode estimation methods have boundary bias if the true global mode is located in the boundary region. Among the considered methods, mode estimation based on local polynomials shows superior performance and does not seem to have a considerable boundary bias problem.
Xiaobei Zhou 
Prediction Models for Serious Outcome and Death in Patients with Non-specific Complaints Presenting to the Emergency Department Werner Stahel 
Markus Kalisch 
This paper is based on the Basel Non-specific Complaints (BANC) study by Nemec, Koller, Nickel, Maile, Winterhalder, Karrer, Laifer, and Bingisser [2010]. Non-specific complaints (NSCs) are very common in emergency departments (EDs). However, emergency physicians rarely have experience in treating patients with NSCs. My research mainly focuses on the outcome variables serious condition (o ser) and death in ED patients with NSCs. My primary goal is to find a set of methods (classifiers) which classify o ser and death with high accuracy. Moreover, we try to find a series of risk factors (explanatory variables) which are highly correlated with the outcome variables. We do not find a classifier that clearly outperforms all others in all aspects. Random-Forest, Logistic-Regression and Adaboost turn out to be favorable according to different criteria. We find that dealing with missing values using imputation increases classification performance. Finally, we discuss SMOTE as an interesting but not fully satisfactory method for dealing with highly unbalanced data.
Marc Lickes 
Portfolio optimization if parameters are estimated Hans Rudolf Künsch  Aug-2011
In the following we discuss the effect of parameter estimation in the context of mean variance portfolio optimizations. We compare the efficient frontier under a certainty equivalent approach and Bayesian predictive posterior distribution. We will show that the sample estimators lead to a risk underestimation and we will provide corrected estimators. In addition we will relax the assumption of identical returns and introduce dynamic linear models for time varying mean and covariance matrices. This study will conclude by analysing the performance of those estimators on a simulated multivariate normal data set and on a sample set of returns drawn from either the Dow Jones 30 or S&P500.
David Lamparter 
Stability Selection for Error Control in High-Dimensional Regression Peter Bühlmann  Aug-2011
In the recent past, the development of statistical methods for high-dimensional problems has greatly advanced, leading to methods for model selection such as the lasso. However, the question of error control in high-dimensional settings has proven to be difficult. Recently, an approach called stability selection has been proposed to tackle the problem. It combines a method for model selection with subsampling to deliver a form of error control. In this thesis, some variants of stability selection are introduced. It was tested whether error control would actually hold up. Furthermore, some conditions were isolated under which using these variants might have beneficial effects.
Marco Läubli 
Particle Markov Chain Monte Carlo for Partially Observed Markov Jump Processes Hans Rudolf Künsch  Aug-2011
The goal of the thesis was to investigate, understand and implement the so-called particle Markov chain Monte Carlo (PMCMC) algorithms introduced by Andrieu, Doucet, and Holenstein (2010) and to compare them to classical MCMC algorithms. The PMCMC algorithms are introduced in the framework of state space models. Their key idea is to use sequential Monte Carlo (SMC) algorithms to construct efficient high-dimensional proposals for MCMC algorithms. The performance of the algorithms is examined on a simple birth-death process in discrete time as well as on the stochastic Oregonator, an idealized model of the Belousov-Zhabotinskii non-linear chemical oscillator. In summary, the PMCMC algorithms produce satisfactory results even when using only standard components, and they require comparatively little problem-specific design effort from the user. On the other hand, the computational effort is tremendous compared to classical methods, which is a serious drawback.
Christian Sbardella 
High dimensional regression and survival models Peter Bühlmann 
Patric Müller
In high-dimensional regression we have too many parameters relative to the number of observations, which can lead to overfitting. One method to address this problem is to use the Lasso (Least Absolute Shrinkage and Selection Operator) to estimate the regression coefficients. This estimator has become very popular because, among other properties, it performs variable selection, in the sense that some estimated coefficients are exactly zero. We study the Lasso estimator, proving its consistency and deriving an oracle inequality in the case of squared error loss. In this thesis we also discuss survival analysis: this branch of statistics studies the failure times of an individual (or of a group of individuals) to conclude, for example, whether a new treatment is effective, or whether a certain group of individuals has a higher survival probability than another. We mainly focus on the Cox Proportional Hazard model and the Weibull Proportional Hazard model. A natural question is: "Can we use the theory of the Lasso estimator in survival analysis?" We try to answer this question in the last chapter of this thesis (Chapter 5).
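Why the Lasso sets some coefficients exactly to zero is easiest to see in the special case of an orthonormal design, where the Lasso solution is the soft-thresholded least squares estimate. The Python sketch below assumes this special case and the objective 0.5 * ||y - Xb||^2 + lam * ||b||_1; the function names and numbers are hypothetical, and general designs require an iterative solver instead.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefficients, lam):
    """Lasso solution when the design matrix satisfies X^T X = I."""
    return [soft_threshold(b, lam) for b in ols_coefficients]


# Hypothetical least squares estimates and penalty level.
ols = [3.0, -0.5, 0.2, -2.5]
lasso = lasso_orthonormal(ols, lam=1.0)
```

Here the two small coefficients are set to zero while the two large ones are shrunk by the penalty level, which is the variable selection behaviour described in the abstract.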
Alexandra Federer 
Estimating networks using mutual information Marloes Maathuis 
Markus Kalisch 
Identifying the relations between the variables of a dataset and visualizing these relationships in an independence network is important in many applications. We use the concepts of entropy and mutual information to estimate the dependency between two random variables. An advantage of this method over a correlation test is that mutual information also measures non-linear dependency. To estimate the correlation graph of a dataset, we construct a statistical test for zero mutual information. Using ROC curves, we compare the performance of this method with that of the well-known method of estimating the correlation graph by thresholding the mutual information.
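For discrete variables, mutual information can be estimated by plugging empirical frequencies into its definition. The Python sketch below shows this plug-in estimate (in nats); the function name and the toy samples are hypothetical, and the statistical test constructed in the thesis is not reproduced.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Y = X^2: uncorrelated with X, yet fully determined by it.
xs = [-1, 0, 1]
ys = [x * x for x in xs]
mi_nonlinear = mutual_information(xs, ys)

# Empirically independent pair: every combination occurs equally often.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair has zero sample correlation but positive mutual information, illustrating the advantage over a correlation test mentioned in the abstract.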
Oliver Burkhard 
The Effect of Managed Care Models on Health Care Expenditure Marloes Maathuis 
Markus Kalisch 
In this thesis we want to estimate the cost reduction effects of managed care plans that were introduced in Swiss health insurance in 1996. Those plans limit the free choice of health care provider and come with reduced premiums. The data come from one insurer and cover the years 1997-2000. The challenge we face comes from the unobserved health of the insured. It can have an influence both on the choice of managed care plan and on the costs caused. We tackle the problem by generating an estimate of an auxiliary variable "latent health" using Tobit regression, which allows us to estimate the causal effect of managed care plans on costs using a Two Part model. We then look at different possibilities to improve the results. We find that the total effect of managed care consists of a part that can be explained through the auxiliary variable and a part that cannot, indicating true cost reduction effects of the managed care models.
Niels Hagenbuch 
A Comparison of Four Methods to Analyse a Non-Linear Mixed-Effects Model Using Simulated Pharmacokinetic Data Martin Mächler 
Werner Stahel 
Our study characterizes the behaviour of four different methods to estimate a non-linear mixed-effects model in R. Three methods used a closed-form analytical solution of a system of ordinary differential equations (ODEs); the fourth method used the system of ODEs directly. The three methods were nlme() from the package of the same name, nlmer() from package lme4a and nlmer() from package lme4. For the ODEs, we used nlmeODE() along with nlme(). The two methods using nlme() do not differ much in their estimates. Non-convergences occurred. lme4a and lme4 provide fast and reliable (in terms of convergence) nlmer() routines, which have shortcomings as well: the standard errors of the fixed-effects parameters are over- or underestimated, inconsistently across the parameters; the estimation of the standard deviations of the random effects does not always profit from an increase in observations. The results across three simulations reveal unpredictable patterns of the estimators of lme4a and lme4 with respect to coverage ratios, bias and standard error as functions of the number of observations. A limitation of this study is its limited number of simulation runs (250).
Stephanie Werren 
Pseudo-Likelihood Methods for the Analysis of Interval Censored Data Marloes Maathuis  Mar-2011
We study the work of Sen and Banerjee (2007), focusing on their method based on a pseudo-likelihood-ratio statistic to obtain point-wise confidence intervals for null hypotheses on the distribution function of the survival time in a mixed-case interval censoring model. Mixed-case interval censored data arise naturally in clinical trials and a variety of other applied fields. The setting of such a model is one where n independent individuals are under study and each individual is observed a random number of times at possibly different observation time-points. At each observation time it is recorded whether an event has happened or not, and one is interested in estimating the distribution function of the time to such an event, also called failure. However, the time to failure cannot be observed directly, but is subject to interval censoring. That is, one only learns whether or not failure occurred between two successive observation time-points. We extend the results of Sen and Banerjee (2007) to mixed-case interval censored data with competing risks. These are data where the failure is caused by one of R risks, where R ∈ N is fixed. We define a naive pseudo-likelihood estimator for the distribution function of the event that the system failed from risk r for each r = 1, 2, ..., R, analogous to Jewell, Van der Laan, and Henneman (2003). We prove consistency and derive the asymptotic limit distribution of the naive estimators, and present a method to construct point-wise confidence intervals for these sub-distribution functions based on the pseudo-likelihood-ratio statistic introduced by Sen and Banerjee (2007).
Karin Peter 
Marginal Structural Models and Causal Inference Marloes Maathuis  Feb-2011
We analyze data from an observational treatment study of HIV patients in Africa, collected by the Institute for Social and Preventive Medicine (ISPM) in Bern. In particular, we focus on patients who received first-line treatment and experienced immunologic failure, where immunologic failure might be an indication that the current treatment is no longer effective. Some of these patients were switched to a second-line treatment, according to the decision of their doctor (i.e. non-randomized). Based on these data, we are interested in estimating the causal effect of the switch to second-line treatment on survival. The data contain information on the treatment regime and the CD4 counts of the patients, both of which are time-dependent. A main challenge in the analysis is the CD4 count, which indicates how well the immune system is working. The CD4 count may influence future treatment and survival, making it a confounder that one should control for. On the other hand, the CD4 count is likely to be influenced by past treatment, making it an intermediate variable that one should not control for. We address this problem by using marginal structural models. Conceptually, this method weights each data point by its inverse probability of treatment weight (IPTW), creating data from an unconfounded pseudo-population. Our results indicate that switching to second-line treatment is beneficial, and slightly more so than an analysis with classical methods would imply.
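The weighting idea in this abstract can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the thesis's analysis: it assumes a single time point, one discrete confounder and a binary treatment, whereas the study deals with time-dependent treatment and CD4 counts. The function name `iptw_effect` and the record layout are hypothetical.

```python
from collections import defaultdict


def iptw_effect(records):
    """Estimate a treatment effect by inverse probability of treatment weighting.

    records: list of (confounder, treated, outcome) tuples, treated in {0, 1}.
    Assumption for this sketch: one time point and one discrete confounder.
    """
    n_by_level = defaultdict(int)
    n_treated_by_level = defaultdict(int)
    for level, treated, _ in records:
        n_by_level[level] += 1
        n_treated_by_level[level] += treated

    weighted_sum = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for level, treated, outcome in records:
        # Empirical probability of being treated, given the confounder level.
        p_treated = n_treated_by_level[level] / n_by_level[level]
        # Probability of the treatment this unit actually received.
        p_received = p_treated if treated == 1 else 1.0 - p_treated
        weight = 1.0 / p_received
        weighted_sum[treated] += weight * outcome
        weight_total[treated] += weight

    # Difference of weighted outcome means in the pseudo-population.
    return weighted_sum[1] / weight_total[1] - weighted_sum[0] / weight_total[0]
```

In a toy data set where sicker units are treated more often, the unweighted difference in means mixes the treatment effect with the confounder, while the weighted difference recovers the effect built into the data.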
Reto Bürgin 
Pain after an intensive care unit stay Werner Stahel 
Marianne Müller
The present study examines pain occurring within twelve months after an intensive care unit (ICU) stay by focussing on three aspects: i) Which variables relate to pain after an ICU stay? ii) What is the longitudinal association between ICU-related variables and pain? iii) Do former ICU patients suffer more severe pain than comparable people who have not been in an ICU recently? The first two aspects are examined with statistical analyses of data from 149 former ICU patients. These data contain three repeated pain measurements per patient (immediately after, six months after and twelve months after the ICU stay), while the explanatory variables provided are physiological, emotional and sociodemographic and were measured before, during and after the ICU stay. The third aspect is examined using additional data from a control group of 153 subjects. Concerning the first aspect, stepwise regression model selection identified gender, pain before the ICU stay, four ICU-related variables, agitation and other illnesses as useful explanatory variables for pain after an ICU stay. Moreover, anxiety before the ICU stay and the length of stay in the ICU showed significant associations too. The second aspect, the longitudinal association, was examined with a repeated measurement regression model. This model showed a significant association between ICU-related variables and pain, both six and twelve months after the ICU stay (p-values: 0.005 and 0.025). Whilst the significance of these associations tends to decrease with the time elapsed since the ICU stay, the effect of variables that are not directly ICU-related, particularly that of pain before the ICU stay, tends to increase. The third aspect was again analysed with a repeated measurement regression model. This model demonstrated that ICU patients tend to suffer more severe pain than the subjects of the control group. However, this difference decreases as time passes from the initial ICU stay.
As a result, twelve months after the ICU stay, the difference is no longer significant (p-value: 0.3). Finally, the identification of explanatory variables for pain turned out to be the principal challenge of this study. As the discovered explanatory variables are indicators that leave room for interpretation, both an extended discussion of the study results, also with experts from the medical sciences, and their comparison with similar studies were essential.
Weilian Shi 
Distribution of Realized Volatility of Long Financial Time Series Werner Stahel 
Dr. Michel Dacorogna
Insurance companies face a difficult situation as the regulators ask for the same level of solvency during a crisis [Zumbach et al., 2000]. This master thesis focuses on the log returns and volatilities of very long financial time series. We investigate the distributions and tail behaviour of both log returns and volatilities, where the Hill estimator is used to estimate the tail index of the volatility distribution. Taking the definition that a crisis occurs when GDP drops for two consecutive quarters, the financial crisis has been identified as the biggest crisis since the Second World War. A linear regression model is fitted to analyze the connection between realized volatilities and the GDP log return before and after 1947, respectively. The negative correlation between them suggests that volatility tends to increase when the economy is experiencing a recession.
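For readers unfamiliar with the Hill estimator mentioned in this abstract, the following Python sketch shows its standard textbook form. It is not taken from the thesis; the function name `hill_estimator` and the choice of the number of upper order statistics `k` are assumptions made for illustration.

```python
import math


def hill_estimator(data, k):
    """Hill estimate of the tail index from the k largest positive observations."""
    xs = sorted((x for x in data if x > 0), reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < number of positive observations")
    threshold = xs[k]  # the (k+1)-th largest observation
    # Average log-excess of the k largest values over the threshold.
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess
```

In practice the estimate is sensitive to `k`, so it is usually inspected over a range of values rather than read off at a single one.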


Student Title Advisor(s) Date
Alain Helfenstein 
Forecasting OD-Path Booking Data for Airline Revenue Management using a Random Forest Approach  Peter Bühlmann  Aug-2010
A main issue in an airline's revenue management is to calculate an accurate forecast of the future demand for bookings. Poor estimates of demand lead to inadequate inventory controls and sub-optimal revenue performance. Within this thesis we describe the structure of the booking data within the airline industry that need to be forecast and discuss the current Bayesian forecasting model implemented by Swiss Revenue Management. We then implement new forecasting models using different random forest (regression) approaches and discuss the accuracy of the predicted demand for all models. As a further result we illustrate how an implementation of a regression using the random forest algorithm can fail.
Fabio Valeri 
Sample Size Calculation for Malaria Vaccine Trials with Attributable Morbidity as Outcome  Marloes Maathuis  Aug-2010
Malaria is a major public health issue. Great efforts have been put into research to develop a vaccine against malaria. Problems arise in estimating vaccine efficacy. Standard methods such as the cutoff method and logistic regression may yield biased efficacy estimates. An alternative approach that avoids bias is to apply a Bayesian latent class model to estimate attributable risk. One problem with this probabilistic approach is that it is not clear how big a trial would need to be in order to have power comparable to that of the cutoff method. To assess the size of a trial using this approach, a hypothetical parasite density of a population has been constructed based on a latent class model and some other constraints. Samples have been drawn from these true values, measurement errors simulated and vaccine efficacy estimated. This has been done for three different types of vaccine mechanism. For the vaccines we considered, to achieve a power of 80% the probabilistic method needs 3 to 12 times more individuals than the cutoff method. Whereas the probabilistic method has no biased efficacy estimates, for two vaccine types the standard methods have large or very large bias. If the vaccine type is not well defined, standard methods to estimate vaccine efficacy could produce strongly biased estimates, which can result in a rejection of the vaccine. The probabilistic approach would avoid bias, but because a larger trial is needed for the same power, the costs will be higher.
Doriano Regazzi 
The Lasso for Linear Models with within Group Structure  Sara van de Geer   Aug-2010
In a high-dimensional regression model, we consider the problem of estimating a grouped parameter vector. We assume there is within-group structure, in the sense that the ordering of the variables within groups expresses their relevance. In this setting, we study two group lasso methods: the structured group lasso and the weighted group lasso. Our work consists of the implementation of these two methods in R. First, we prove the convergence of their algorithms. Then, we run simulations and compare the two estimators in various situations.
Anna Drewek 
A Linear Non-Gaussian Acyclic Model for Causal Discovery Marloes Maathuis  Jul-2010
The discovery of causal relationships between variables is important in many applications. Shimizu et al. proposed a method to discover the causal structure from observational data in linear non-Gaussian acyclic models, abbreviated as LiNGAM (see Shimizu et al. 2006). We analyze their approach and empirically test the strictness of the non-Gaussianity requirement by approximating the Gaussian distribution with the t-distribution. Moreover, we compare the performance of the LiNGAM algorithm to that of the PC algorithm (Spirtes et al. 2000). Finally, a combination of both algorithms is discussed (Hoyer et al. 2008) that enables the detection of causal structure in linear acyclic models with arbitrary distributions.
Rita Achermann 
Effect of proton pump inhibitors on clopidogrel therapy Werner Stahel  Mar-2010
In the present study, the interaction between clopidogrel and proton pump inhibitors (PPI) is investigated. A PPI might reduce the antiplatelet function of clopidogrel and increase the risk of a second myocardial infarction. Patients with both drugs prescribed have a higher risk for such an event, but whether this is due to individual risk factors or a reduced effect of clopidogrel is an open question. The present study aims to assess the effect due to an interaction between the two drugs using health insurance data. Methods to adjust for confounders in observational data were applied, and new graph theory developments in combination with probability theory were evaluated. The study population consisted of 4 623 patients with prescribed clopidogrel, a hospital stay of at most 30 days before the first administration of clopidogrel, and health insurance coverage with Helsana. Hospitalization due to a cardiac event and death were used as the clinical endpoints to assess whether proton pump inhibitor prescription was associated with a higher risk of rehospitalization. A graph was constructed based on prior knowledge to derive theoretically whether the effect was identifiable. Causal inference rules applied to this knowledge-based graph showed that the effect is identifiable when observational data are used. Graphs estimated from data did not disprove these findings. The effect of PPI on clopidogrel was calculated from the interventional distribution defined by the graph. A standard statistical technique, Cox proportional hazards regression, was also applied, once with covariates to adjust for confounding and once with a propensity score. An instrumental variable approach was not feasible, since no instrument was found. Patients with concomitant use of clopidogrel and proton pump inhibitors had a higher risk of rehospitalization due to a cardiac event by a factor of 1.33 (95% CI: 1.10, 1.61) compared to patients with no prescription for PPI.
Important for the analysis was that some patients had PPI administered together with clopidogrel but had no prescription before. Treatment guidelines recommend PPI to prevent stomach bleeding, a side effect caused by clopidogrel. It is assumed that these patients had no higher individual risk of a recurrent myocardial infarction than patients with no PPI prescription. Hence, these patients can be compared to patients with no PPI prescription before and during the study phase to estimate the effect. Comparison of the baseline characteristics for 23 drug groups, as well as age and gender, revealed only minor differences. Results calculated from the interventional distribution defined by the graph were similar to those from Cox regression. Finally, the propensity score used as a stratifier in a Cox proportional hazards regression yielded similar results as well. As alternative treatments to PPI are available, patients should not take these two drugs together.
Armin Zehetbauer 
A Statistical Interest Rate Prediction Model  Werner Stahel  Mar-2010


Student Title Advisor(s) Date
Nicoletta Andri 
Using Causal Inference for Identifying Coresets of the ICF Marloes Maathuis  Sep-2009
The World Health Organisation (WHO) has a strong interest in reducing the ICF-catalogue to a smaller set of items for different reasons such as time management and complexity. In this context, we analyse two data sets of the WHO concerning rheumatism/arthritis and chronic widespread pain consisting of variables from the ICF-catalogue. For this variable selection process we use the approach of Maathuis, Kalisch and Bühlmann which uses graph estimation techniques in combination with a causal method called back door adjustment. We show under which conditions this approach can be applied also to dichotomized data sets and how interactions between the variables can be handled. Significance of the estimates is assessed using permutation tests and a method called stability selection presented by Meinshausen and Bühlmann. Finally, the causal results are discussed and compared to associational results.
Simon Figura 
Response of Swiss groundwaters to climatic forcing and climate change: A preliminary analysis of the available historical instrumental records Werner A. Stahel
Rolf Kipfer
David M. Livingstone
Research on groundwater quality over long-term periods has scarcely been done in the past. In this thesis groundwater temperature is used as an indicator of groundwater quality. Temperature measurements of 8 river-recharged and 6 rain-fed groundwaters were analysed. Some data sets also contained records of water level, spring discharge, pumping amount and oxygen concentration. The length of the records ranged from 20 to 52 years. Plots, trend tests and changepoint tests were used to describe the temperature developments. Correlations and regression models were established to analyse the impact of climatic forcing in the form of air temperature and the impact of groundwater quantity variables on groundwater temperature. The behaviour of oxygen concentrations was also briefly analysed. Most of the river-recharged groundwaters showed an increase in temperature of 1-1.5 °C in the last 30 years. More than half of this warming took place in the period 1987-1993. Results indicate that this warming was due to climatic forcing. The temperature of the rain-fed groundwaters showed little to no increase. Some properties of the air temperature development can be recognized in the temperature of these groundwaters, but a possible response of rain-fed groundwaters to climatic forcing is outweighed by other factors. Measurements of oxygen concentrations were available at 4 sites. Decreasing concentrations at 3 measurement sites are likely caused by higher microbiological activity and lower oxygen solubility as a result of higher temperatures. This theory is contradicted by the increasing oxygen concentration at the fourth measurement site.
Lukas Rosinus 
Fehlende Werte EM-Algorithmus und Lasso in hochdimensionaler linearer Regression Peter Bühlmann  Aug-2009
Several estimators for high-dimensional linear regression problems with missing values are proposed and studied. Using the EM algorithm, the observed negative log-likelihood together with a Lasso penalty on the regression parameters β is minimized. The Lasso penalty yields sparse estimates of the regression coefficients. The methods are examined in simulation studies on various multivariate normal models. The MissRegr method turns out to give the best results. With the EM algorithm, the inverse covariance matrix K = Σ^{-1} is estimated optimally in the likelihood sense. With the Lasso penalty, the regression parameters are then also estimated well, even when a large proportion of the data is missing.
Philipp Stirnemann 
Unmatched Count Technik: Zum Zusammenhang zwischen Anonymität und statistischer Effizienz Werner A. Stahel  
B. Jann
Rudolf Dünki 
Robuste Variogrammschätzung und robustes Kriging  Hans Rudolf Künsch   Aug-2009
The thesis describes the development of robust algorithms for the analysis of geostatistical data. Three algorithms were implemented in R, and each of these allows for simultaneous estimation of the regression parameters and the covariance parameters. All three algorithms returned consistent results. Two of them are implemented as a package of R functions. The treatment of the nugget effect makes the essential difference between the two algorithms: the first algorithm treats the nugget as a part of the estimation of the covariance parameters. The other algorithm treats the nugget as a part of the regression problem. This has advantages in the analysis of polluted data. Sets of 50 simulations with different degrees of added pollution were analysed. The resulting parameter estimates agreed with the true values within the statistically tolerable range. The exception was the set containing the most polluted data. The estimation of the range parameter was somewhat problematic when performed with small Huber constants, i.e. the resulting range displayed an upward bias. In contrast, the nugget estimate was improved by choosing a small Huber constant. The algorithm treating the nugget effect as a part of the regression problem returned more stable results in the case of a high degree of pollution. A Huber constant of 1.333 ... 1.666 appeared appropriate in these cases. An increase in stability was also visible in the behaviour of the influence functions. The algorithms were applied to data on the contamination of soils with Cu in the surroundings of a metal smelter in Dornach. It could be shown that the estimated parameters allowed for kriging estimates that are comparable with earlier analyses. Despite this, it was not possible to obtain unambiguous parameter estimates. The reason lies in the existence of a very flat and extended optimum region. This allows models with comparable goodness-of-fit characteristics to be fitted for clearly distinct parameter sets.
Thomas André Rauber 
Parameter risk in reinsurance Peter Bühlmann  Jul-2009
In this thesis we consider the parameter uncertainty that arises in different pricing areas in reinsurance. By parameter risk we mean the risk of not estimating the parameter properly. We mainly look at parameter risk in the severity distribution. We distinguish three different ways of characterising uncertainty. We first replace the parameter that has to be estimated by a random variable and derive some analytical results. Then we look into maximum likelihood estimators and use the result that they are asymptotically normally distributed. For some examples these asymptotic results are not accurate enough. For these cases we characterise the uncertainty by using the bootstrap. Finally we specify where uncertainty arises in Experience, Exposure and Credibility Rating in practice. We will see an example of Credibility Rating which blends Experience and Exposure Rating by minimizing the parameter risk.
Alessia Fenaroli 
Propagating Quantitative Traits in Gene Networks Marloes Maathuis  Feb-2009
Gene networks have been created to extend the knowledge of gene functions in a specific organism. Such networks describe connections between genes involved in the same biological process. McGary, Lee and Marcotte have related a gene network of baker's yeast, called YeastNet, to a dataset of morphological trait variation, the SCMD, and have defined a method that assigns scores to each gene of the network in order to predict its activity. The researchers tested the predictability of YeastNet with ROC curves and the respective AUC values, computed by leave-one-out cross-validation, and obtained a median value of 0.615. Our contribution to this study includes: the definition of other score methods that take into account the quantitative data given by the SCMD dataset, in contrast to the dichotomization of these data applied by McGary et al.; some new rules to predict the activity of each gene based on its score, which are more complicated than the simple idea adopted by McGary et al. of comparing the scores with a cutoff, but more efficient; and a different procedure, 10-fold cross-validation, for the network predictability analysis. Thanks to these changes we have improved the YeastNet prediction quality by 5 percentage points: the median AUC value is now 0.665.
Simon Lüthy 
Merkmalswichtigkeit im Random Forest Peter Bühlmann  Feb-2009
In bioinformatics and related fields such as statistical genetics and genetic epidemiology, predicting categorical response variables (such as a patient's disease status or the properties of a molecule) on the one hand, and reliably identifying the relevant features on the other, is an important task. In genetic research, typical data sets contain hundreds or even thousands of genes or features, yet often comparatively few observations are available on which to base the predictions and identifications. The random forest algorithm solves this problem very well. In this thesis we first explain how a decision tree is grown, with the help of which entire prediction forests, so-called random forests, are generated. We briefly describe the procedure for generating such a forest and define the permutation accuracy importance as a measure of feature importance. In a second step we point out the problems that arise when the permutation accuracy importance is applied to data sets with strongly correlated variables or with variables that differ in their number of categories. We present the solution proposed by Strobl et al. (2007), who advocate a different algorithm for generating the forest. We introduce two further measures of feature importance, show their behaviour on various data models by means of simulations, and compare them with the permutation accuracy importance. In our opinion, the permutation accuracy importance in the random forest remains a strong and credible tool for variable selection.
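The permutation accuracy importance named in this abstract can be sketched in a few lines of Python. The sketch is a model-agnostic simplification and not the thesis's procedure: in a random forest the measure is normally computed per tree on out-of-bag observations, whereas here a single fitted `predict` function and one evaluation set are assumed. All names are hypothetical.

```python
import random


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when one feature column at a time is shuffled.

    predict: function mapping a feature list to a predicted label.
    rows: list of feature lists; labels: list of true labels.
    """
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(predict(row) == label for row, label in zip(data, labels))
        return hits / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link with the response.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature that the model ignores gets an importance of exactly zero, because shuffling it cannot change any prediction; a feature the model relies on shows a positive drop in accuracy.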
Patric Müller 
Image restoration: Blind deconvolution for noised Gaussian blur  Sara van de Geer  Feb-2009
Blind deconvolution is an inverse problem with one or more unknown parameters. Nowadays, one of the more common practical applications of deconvolution is in image analysis, where it is used to restore blurred images. To recover the original image, however, we first have to estimate the unknown parameters with which the image was blurred. In recent years, this topic has attracted significant attention, resulting in numerous studies. This thesis studies blind deconvolution from a theoretical and a practical point of view. We provide the necessary tools, which we use to improve the quality of blurred and noisy pictures. Our results give rise to algorithms for estimating the aforementioned unknowns. The applicability of the explored techniques is then demonstrated by means of several practical examples. The thesis concludes with a brief qualitative analysis of the limits of deconvolution with regard to image restoration. To this end we show that the process is ill-conditioned. Thus, it might be at best inefficient, and at worst impossible, to retrieve the original picture from a blurred one.


Student Title Advisor(s) Date
Diego Colombo 
Goodness of fit test for isotonic regression Marloes Maathuis  Jul-2008
We study the work of Durot and Tocquet (2001), who proposed a new test of the hypothesis H0: "f = f0" versus the composite alternative Hn: "f ≠ f0", under the assumption that the true regression function f is monotone decreasing on [0, 1]. The test statistic is based on the L1-distance between the isotonic estimator f̂n of f and the given function f0, since a centered and normalized version of this distance is asymptotically standard normally distributed under the null hypothesis H0, provided that the given function f0 satisfies some regularity conditions. The main purpose of establishing this asymptotic normality is to study the asymptotic power of the test under the alternative Hn: "f = f0 + cn Δn". The idea is to study the minimal rate of convergence for cn such that the test has a prescribed asymptotic power. Durot and Tocquet show that this minimal rate is n^{-5/12} if Δn does not depend on n and n^{-3/8} if it does. Our contribution is a more detailed explanation of the models and of the main results, and the insertion of some extra steps in the proofs. To check these theoretical results in simulations, as Durot and Tocquet did, we write new R code. Namely, we perform a simulation study to compare the power of this test with that of the likelihood ratio test for the case where f0 is linear, and we also compare these simulation results to the ones obtained by Durot and Tocquet. Moreover, we propose extra simulations for the power of another test not treated by Durot and Tocquet, and we will see that it is always more powerful than the one they studied. Finally, we conduct a new simulation study for the case where the given monotone function f0 is quadratic.
Alain Weber 
Probabilistic predictions of the future seasonal precipitation and temperature in the Alps Hans Rudolf Künsch  Jul-2008
This work presents probabilistic predictions of the future (2071-2100) seasonal precipitation and temperature in the Alps. The predictions combine climate forecasts from different numerical simulations in a Bayesian ensemble approach. It is well known that these climate simulations have systematic errors, which should be taken into account. Unfortunately, the simulations are driven by boundary conditions that are very different from those of the last century. This is a problem because there exist no comparable data from the past to estimate the bias of a climate model under similar boundary conditions. It becomes necessary to rely on assumptions, which can hardly be proven right or wrong. Recently, Christoph Buser showed that predictions of seasonal temperature in the Alps differ for two reasonable assumptions. In this work we compare predictions of precipitation for the same two assumptions. In addition, one of the corresponding Bayesian models is extended to predict the bivariate distribution of precipitation and temperature.
Patricia Hinder 
Additive Isotonic Regression Sara van de Geer  Mar-2008
In this master thesis we study the isotonic regression model for one or more covariates. We first give an introduction to the one-dimensional regression problem, where the estimator is calculated using the pool-adjacent-violators algorithm (PAVA). We then extend the regression problem to multiple covariates and assume an additive model. The functions are estimated with a classic backfitting estimator. We compare the backfitting estimator with an oracle estimator and discuss how the estimates can be smoothed by applying a kernel smoother to the isotonized data. The monotonicity of the kernel smoother is guaranteed by using a log-concave kernel. We also study another approach to the additive isotonic regression problem that is based on boosting. The functions are expanded into a sum of basis functions and a component-wise boosting algorithm is applied.
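As a concrete illustration of the pool-adjacent-violators algorithm mentioned in this abstract, here is a minimal Python sketch for a non-decreasing least-squares fit in one dimension. It follows the textbook algorithm rather than any code from the thesis; the function name `pava` and the optional `weights` argument are assumptions.

```python
def pava(y, weights=None):
    """Non-decreasing least-squares fit to y by pooling adjacent violators."""
    w = [1.0] * len(y) if weights is None else list(weights)
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for value, weight in zip(y, w):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            pooled = (mean1 * weight1 + mean2 * weight2) / total
            blocks.append([pooled, total, count1 + count2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into its mean and returns `[1.0, 2.5, 2.5, 4.0]`. A non-increasing fit can be obtained by negating the input and the output.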
Manuel Koller 
Robust Statistics: Tests for Robust Linear Regression Werner Stahel  Mar-2008
Analyzing data using statistical methods means breaking reality down to a mathematical framework, a model. Often this model is based on strong assumptions, for example normally distributed data. Classical statistics provides methods that fit the chosen model perfectly. But in reality the model assumptions usually hold only approximately. Anomalies and untrue assumptions might render the statistical analysis useless. Robust statistics aims for methods that are based on weaker assumptions and thus allow small deviations from the classical model. However, robust statistics is not restricted to the use of robust estimation methods alone. It also extends to methods used to draw inference. In the past, there has not been much research focused on robust tests. In this thesis we study the quality of inference performed by two state-of-the-art robust regression procedures. We then propose a design-adapted scale estimator and use it as part of a new robust regression estimator, the MMD-estimator. This new estimator improves the quality of robust tests considerably. A simulation study is performed to compare the performance of the mentioned regression procedures in combination with various covariance matrix estimators. We found large differences between the tested methods. Some methods were able to approximately reach the desired level in the corresponding tests for most tested scenarios, while others produced estimates that were only useful in specific large-sample settings. We infer that the covariance matrix estimator needs to be carefully selected for every new scenario.
Philipp Rütimann 
Variablenselektion in hochdimensionalen linearen Modellen mittels Schrumpfvarianten des PC-Algorithmus Peter Bühlmann  Mar-2008
This thesis deals with variable selection in high-dimensional linear models. We adopt the approach of Professor Peter Bühlmann and Markus Kalisch, which is based on the PC algorithm. The approach is modified so that the correlations are computed with various shrinkage estimators instead of the maximum likelihood estimator. These new variants of the PC algorithm are compared with the standard variant by means of ROC plots and other graphical comparison methods. The thesis also deals with dimension reduction, which is used to reduce the dimension of the high-dimensional linear models. It turns out that this clearly improves the variants of the algorithm. This gave rise to the idea of using dimension reduction for the robust PC algorithm as well. However, this does not yield the same positive results as for the shrinkage variants.
Bruno Gagliano 
Asymptotic theory for discretely observed stochastic volatility models Sara van de Geer  Feb-2008
This thesis investigates the estimation of parameters for discretely observed stochastic volatility models. The main concern is to give a general methodology for estimating the unknown parameters from a discrete set of observations of the stock price. Two estimation methods, the minimum contrast and estimating functions, are introduced and it is shown that, under certain assumptions, the estimators obtained are consistent and asymptotically normal. Finally, a series of simulations is performed to confirm the results and an application to real-world stock data is made.
Sandra König 
Analyse von Skisprungdaten Sara van de Geer  Feb-2008
This thesis investigates which factors yield significantly more distance in ski jumping. As the simplest model, a linear regression is fitted. As expected, it shows that wind, in-run speed and weight influence the length of a jump. For a more detailed analysis, generalizations of the linear model are used. In particular, the mixed effects model shows that there are jumper-specific effects (such as the feel for flight); in addition, isotonic regression is considered, as well as the possibility of checking the isotonicity of a function by means of multiscale testing. Since the wind in particular seems to affect the outcome of competitions time and again, its influence is examined in more detail using measurements at additional locations. It turns out that the wind plays an important role especially at the take-off table. A further open question was whether podium finishes at the Junior World Championships are an indicator of later success. Since there are as many examples for as against this hypothesis, no intuitive answer was available, unlike in the preceding investigations. The nature of the data makes testing difficult, so a regression analysis is again carried out. A question that is mathematically difficult to assess is when judges, who score the jump subjectively, are biased. The mixed effects model provides one possible description of this very complex situation.
Francesco Croci 
The World of Volatility Estimators Sara van de Geer  Jan-2008
This thesis investigates the estimation of the volatility of an asset return process. The main aim is to give a general overview of how to estimate volatility non-parametrically and efficiently. I first introduce the basic notions of stochastic theory and a special, unusual limit theorem that is used throughout the thesis. I then discuss several volatility estimators, from the simplest and worst one, the so-called realized volatility (RV) estimator, to the best estimator so far, the so-called multi-scale realized volatility (MSRV) estimator, which converges to the true volatility at the rate $n^{-1/4}$. Finally, the last section considers microstructure as an arbitrary contamination of the underlying latent securities price through a Markov kernel Q. The main result there is that, subject to smoothness conditions, the two-scales realized volatility (TSRV) estimator is robust to the form of the contamination Q.
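To make the RV and TSRV estimators named in this abstract concrete, here is a minimal Python sketch. It is not taken from the thesis: it assumes the commonly used two-scales form, in which the average of the realized variances on `k` interleaved subgrids is corrected by a multiple of the full-grid realized variance. The function names and the parameter `k` are illustrative choices.

```python
def realized_variance(log_prices):
    """Sum of squared log-price increments (the RV estimator)."""
    return sum((b - a) ** 2 for a, b in zip(log_prices, log_prices[1:]))


def two_scales_rv(log_prices, k):
    """TSRV sketch: average of k subsampled RVs minus a bias correction.

    Assumed form (not quoted from the thesis):
    TSRV = RV_avg - (n_bar / n) * RV_all, with n_bar = (n - k + 1) / k.
    """
    n = len(log_prices) - 1  # number of returns on the full grid
    if k < 2 or k > n:
        raise ValueError("k must satisfy 2 <= k <= number of returns")
    # RV on each of the k interleaved subgrids (every k-th observation).
    subgrid_rvs = [realized_variance(log_prices[start::k]) for start in range(k)]
    rv_avg = sum(subgrid_rvs) / k
    n_bar = (n - k + 1) / k  # average number of returns per subgrid
    return rv_avg - (n_bar / n) * realized_variance(log_prices)
```

The subtraction removes the part of the subsampled average that, under additive noise, grows with the number of observations; the choice of `k` trades off this noise bias against discretization error.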


Student Title Advisor(s) Date
Sonja Angehrn 
Random Forest Klassifikator zur Erkennung von Alarmsignalen in Sicherheitssystemen Peter Bühlmann  Aug-2007
This diploma thesis examines three classifiers, logistic regression, CART and random forest, for their suitability in a detection algorithm that must decide whether various sound signals belong to the class Alarm or Normal. It turns out that, of the three classifiers, the random forest algorithm is best suited to this problem. This classifier is then compared with an existing HMM algorithm in various scenarios. Several features are available for implementing the classifiers. For the random forest and HMM algorithms, the thesis examines which selection of these features yields the smallest possible error rate.
Sarah Gerster 
Learning Graphs from Data: A Comparison of Different Algorithms with Application to Tissue Microarray Experiments Peter Bühlmann  Aug-2007
A new algorithm (logilasso) to learn network structures from data has been introduced in “Penalized Likelihood and Bayesian Methods for Sparse Contingency Tables with an Application to Full-Length cDNA Libraries” (Dahinden, Parmigiani, Emerick and Buehlmann, 2007). The main idea is to study the interactions between the variables by performing model selection in log-linear models. In this master thesis, a few other graphical model fitting algorithms are compared to the logilasso. The chosen algorithms are the PC, the Max-Min Hill-Climbing (MMHC) and the Greedy Equivalence Search (GES) algorithms. They are all based on different approaches to fitting a graphical model. The methods are presented and the algorithms are described. Their performance, in the sense of their ability to reconstruct a graph, is tested on simulated data. The algorithms are also applied to renal cell carcinoma data to illustrate a typical domain of application for such algorithms.
Lorenza Menghetti 
Density estimation, deconvolution and the stochastic volatility model Sara van de Geer  Aug-2007
The stochastic volatility model contains the stochastic volatility process, observed at discrete time instants with vanishing gaps, whose density is to be estimated. The volatility density is estimated from the logarithm of the squared process with the deconvolving kernel density estimator. Since the error density is supersmooth, convergence is very slow. This thesis studies the theoretical and empirical behaviour of the bias and the variance of the estimator. The empirical study suggests choosing a bandwidth smaller than the theoretical bandwidth and confirms the slow rate of convergence.
Giovanni Morosoli 
Optimale Anpassung einer Portfolioschadenhöhenverteilung an ein individuelles Risiko Peter Bühlmann  Aug-2007
The introduction explains the aim of this diploma thesis and presents the available claims data. In essence, our task is to compute an optimal weight for the individual and the portfolio claim size distributions. Chapter 2 analyses the problem of so-called data fitting: given a sample of claim sizes, one tries to find a suitable distribution that could have generated the given claims. In particular, we examine two insurance-specific methods: the extension method (Erweiterungsverfahren), which is a generalization of the maximum likelihood method, and the join operation, which can be interpreted as a first "rough" fit of a portfolio claim size distribution to an individual risk. Chapter 3 uses the chi-squared test to determine a fit of a portfolio distribution to an individual risk. This fit, however, depends strongly on the chosen significance level; Chapter 4 therefore analyses the problem of choosing a suitable significance level by means of a kind of "credibility approach". The last chapter discusses the results obtained and gives some pointers for possible future developments.
Nicolò Valenti 
Regression under shape restriction and the option price model Sara van de Geer  Aug-2007
Many types of problems are concerned with identifying a meaningful structure in real-world situations. A structure involving orderings and inequalities is often useful since it is easy to interpret, understand, and explain. In many settings, economic theory only restricts the direction of the relationship between variables, not the particular functional form of their relationship. Let c(X) denote the call price as a function of the strike price X. By the no-arbitrage principle, c is a convex, decreasing function of X, i.e. it satisfies certain shape constraints. It can be argued that economic theory places virtually no other restrictions on c, and that the estimation of the state-price density should be carried out using only these shape restrictions (and some bounds on the first and second derivatives). Further smoothness assumptions or parametric assumptions may not be justified and carry the risk of misspecifying the state-price density. Our work consists of studying estimation under such shape restrictions. We first consider monotone regression function estimation, the so-called isotonic regression problem. Second, we analyse the problem of convex regression estimation. Then we build a nonparametric estimator of the call pricing function that is decreasing and convex for small samples.
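As an illustration of the isotonic regression problem mentioned in this abstract, the following Python sketch computes a least-squares monotone fit with the pool-adjacent-violators algorithm (PAVA). It is a generic textbook procedure, not code from the thesis; the function name is an illustrative choice, and equal weights on an ordered design are assumed.

```python
def isotonic_regression(y):
    """Least-squares non-decreasing fit to the sequence y via PAVA."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)  # block mean for every member
    return fitted
```

A non-increasing fit, as would be needed for a call price that decreases in the strike, can be obtained by applying the same function to the negated values and negating the result. Convexity is a separate constraint that this sketch does not enforce.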
Enrico Berkes 
Statistical Analysis of ChIP on Chip Experiments Peter Bühlmann  Jul-2007
With the end of the Human Genome Project, the challenge of the emerging discipline of modern biology is to determine the role of the newly characterised genes in man and model organisms. This new sequence data represents, for the first time, a realistic opportunity to link the function (and dysfunction) of specific tissues and cell types to the activity of the genes expressed within them, and so to identify genes and gene products that could act as therapeutic targets. The underlying strategy in the identification and functional characterisation of target genes will rely heavily on the ability to perform high-throughput analysis of gene expression, at both the tissue and cellular level. Gene expression is regulated by proteins, specific to every gene, that bind to the target gene and promote or repress its transcription. In recent years two methods have been refined in order to study the gene regulation mechanism: microarray and ChIP-on-chip experiments. However, the large quantity of data and the uncertainties due to noise in these experiments make the interpretation of the results difficult and laborious. For this reason many statistical methods have been developed to extract the most relevant information from these data. Our work consists of modifying Motif Regressor, an existing method for analyzing data from microarray experiments, and using this new algorithm to search for the transcription factor DNA-binding motifs of HIF1-alpha, a protein involved in gene regulation under hypoxia. The results show that our algorithm is fast and effective, does not require many biological experiments, and gives important suggestions on where future biological research could be directed.
Jürg Schelldorfer 
Multivariate Analyse linearer Mischungen mit bekannten potenziellen Quellenprofilen Werner Stahel  Jul-2007
The concentration of certain pollutants in the air can be approximated mathematically by a linear mixture of contributions from different sources. In this context, the task of multivariate statistics is to estimate, with suitable methods, the number of sources present, their emission profiles and their activities (as a function of time). In this thesis we present methods that use knowledge about possible, given source profiles to improve the data analysis in a linear mixing model.
Miriam Blattmann-Singh 
Nonparametric volatility estimation with a functional gradient descent algorithm for univariate financial time series Peter Bühlmann  Mar-2007
Claudia Soldini 
Variablenselektion in hochdimensionalen Regressionsmodellen bei nicht-homogenen Daten: Die Nutzung von Bacillus subtilis zur Synthese von Ribaflavin Peter Bühlmann  Mar-2007
This thesis is part of an interdisciplinary research project. Its aim is to identify important mechanisms involved in the production of a vitamin by a bacterium. It relies on data from a gene expression study conducted by a pharmaceutical company. Since a large number of genes is involved, regression methods are applied that are suited to high-dimensional problems and can select variables. The genes are treated as predictors and the amount of vitamin produced as the response variable. The experiments were carried out under different conditions, so the data set is non-homogeneous. The amount of vitamin produced varies with the bacterial strains examined and with the time at which the measurements were taken. It is therefore of interest to identify the most important genes or groups of genes responsible for such differences. To this end, statistical tests are carried out both on individual genes and on groups of genes. The groups are formed by a cluster analysis that uses correlation as the similarity measure.
Nicolas Städler  
Statistische Modellentwicklung für nichtinvasive Blutzuckermessung mittels Sensoren Werner Stahel  Mar-2007
Impedance signals for the non-invasive measurement of blood glucose concentration are affected by a multitude of other influencing factors (temperature, sweat, blood perfusion, etc.). To quantify the influence of such perturbing parameters on the impedance signals, they are measured with sensors. The aim of this thesis is to use linear regression and various variable selection methods to develop models, as generally valid as possible, that predict the glucose concentration as a function of the impedance signals and other influencing factors. The first part of the thesis applies the classical selection methods forward stepwise, backward stepwise and "all subsets". Since the explanatory variables show enormous measurement inaccuracies, they are smoothed in a next step. In the course of the work it becomes apparent that common selection criteria such as AIC and Cp lead to extremely overfitted models. In a decisive step, a criterion better adapted to the special structure of the data is proposed as an alternative to AIC and Cp. With the new criterion, both an adaptive lasso and a forward stepwise selection are carried out. Both methods lead to very similar and reasonable models with an R² of 0.73. Particular attention is paid to the adaptive lasso. The analysis shows that data-dependent weighting in the adaptive lasso brings considerable progress over the ordinary lasso. Since the functional form of the model is unknown a priori, an analysis with Multivariate Adaptive Regression Splines (MARS) is also carried out. This method, however, proves unsuitable.


Student Title Advisor(s) Date
Massimo Merlini 
Identifikation relevanter Mechanismen der Vitaminproduktion Peter Bühlmann   Sep-2006
Systems biology is a young interdisciplinary science whose aim is to understand biological organisms in their entirety. This thesis presents a research project that investigates the production of a particular vitamin by a microorganism. The goal is to identify the essential mechanisms involved in the fermentation process in order to optimize production.
Michael Amrein 
Parameterschätzung in zeitstetigen Markovprozessen Hans Rudolf Künsch  Aug-2006
This thesis deals with parameter estimation in a particular class of continuous-time, homogeneous Markov chains that is especially suited to modelling certain chemical reactions or systems from population dynamics. The data are assumed to be available as a time series, that is, the values of the process are known at discrete times. The transition probabilities between each pair of observations are approximated using Poisson distributions. The quality of this approximation is improved by introducing additional time points (and latent data) between the actual observation times. To determine the maximum likelihood estimator approximately, the EM algorithm is combined with Monte Carlo and Markov chain Monte Carlo methods, respectively. This ultimately results in two algorithms, which are tested on various examples, in particular on artificial data sets.
Andrea Cantieni 
Effiziente Approximation der a posteriori Verteilung für komplexe Simulationsmodelle in Umweltwissenschaften Hans Rudolf Künsch  Aug-2006
Elma Rashedan 
Models for Emission Factors of Passenger Cars linking them to Driving Cycle Characteristics Werner Stahel  Jul-2006
Carmen Casanova  
Vorhersage von Partikelgrössen-Verteilung anhand Bildananlyse-Daten Werner Stahel  Mar-2006
Andreas Elsener  
Statistical Analysis of Quantum Chemical Data; Using Generalized XML/CML Archives Peter Bühlmann  Mar-2006
Simone Elmer  
Sparse Logit-Boosting in hochdimensinalen Räumen Peter Bühlmann  Mar-2006
The aim of my diploma thesis is to develop the classification method Sparse-LogitBoost and to implement it in R. The method is then to be applied to simulated and real data, and its prediction accuracy compared with that of other classification methods.


Student Title Advisor(s) Date
Roman Grischott 
Robuste Geostatistik im Markovmodellen am Beispiel eines Schwermetalldatensatzes Hans R. Künsch  Sep-2005
Michael Hornung  
Klassifikation hochdimensionaler Daten unter Anwendung von Box-Cox Transformationen Peter Bühlmann  Aug-2005
The regression methods lasso, relaxed lasso and boosting are used to predict and classify both simulated and real high-dimensional data. The data considered consist not only of the explanatory variables but also of their Box-Cox transformations, which is intended to increase prediction accuracy. Since the response variable in the real data sets is discrete, we focus mainly on the misclassification error. It turns out that, for individual data sets, using the Box-Cox transformations can indeed improve predictive power, but often a deterioration has to be accepted as well. The second part of the thesis considers the correlation among the model variables selected by the three regression methods and attempts to reduce it. Two different approaches are pursued. First, a lasso-like method that uses additional weights in the penalty term reduces the correlation, in some cases considerably. In a second step, new explanatory variables are constructed from the given variables by averaging groups of strongly correlated variables. These are then used for further classifications. This method, too, reduces the correlation of the variables, in some cases strongly. However, no general statements can be made, and both ideas usually lead to an increase in the misclassification error.
Stefan Oberhänsli  
Robustheit bei Multiplem Testen: Differentielle Expression bei Genen Peter Bühlmann  Aug-2005
Why multiple testing? Since data of all kinds are no longer collected by hand but with computer support, the volume of data has grown strongly. This created a need for methods that can handle such extensive data sets while making as few errors as possible. Data sets typically comprise hundreds of factors. This makes it possible to test quite different (possibly already suspected) relationships. Such extensive data sets also allow an exploratory approach, i.e. one looks at the data essentially without prior knowledge and sees which relationships suggest themselves. This approach is statistically delicate, however, since a clever choice of test procedures or prior "data cleaning" can be used to "prove" almost arbitrary relationships. The limiting factor in scientific studies is very often the fixed budget. Nevertheless, one wants to obtain as much information as possible from the data collected. It is much cheaper to ask a test subject one more question than to recruit another test subject. From a financial point of view, it is therefore better to measure more variables in an experiment than to repeat the whole experiment more often. There are then fewer observations but more factors whose relationships have to be analysed. In such cases it is unavoidable to carry out very many tests simultaneously (multiple tests). The analysis and interpretation of multiple tests involve considerable difficulties that do not arise in single testing. How these difficulties can be overcome is described in the course of the thesis.
Rahel Liesch 
Statistical Genetics for the Budset in Norway Spruce Peter Bühlmann  Mar-2005
Genetic variation is needed for plants to respond and adapt to environmental challenges. Understanding the genetic variation of adaptive traits and the forces that shaped it is one of the main goals of evolutionary biology. This is a difficult task, as most adaptive traits are quantitative traits, i.e. traits that are controlled by many loci interacting with the environment. The aim of this thesis was (i) to analyze the genetic variation of the timing of budset of Norway spruce (Picea abies L) within and among 15 populations covering the natural range of the species and (ii) to relate the variation among populations in the timing of budset to the variation observed at both neutral and candidate genes. The former was done through a classical ANOVA after choosing an adequate model. The latter was achieved by estimating Wright's fixation indices (a measure of among-population differentiation), and calculating confidence intervals for them, for budset on the one hand and for neutral or candidate genes on the other. Confidence intervals for Wright's fixation index for a quantitative trait, such as timing of budset, have been and can be estimated in many different ways. In some studies the delta method has been used, whereas in others nonparametric bootstrapping was favored. In almost all studies, the choice of method was neither justified nor discussed, nor, when the bootstrap was retained, was the choice of a particular bootstrap strategy or type justified. We therefore simulated several datasets and applied various methods to find the most appropriate one. We concluded that either a semiparametric or a parametric bootstrap gave the best results in the case of the spruce dataset. If a nonparametric bootstrap is used, sampling over populations and families would definitely be the most adequate way of obtaining a confidence interval.
Finally, Wright's fixation index for budset was significantly larger than the differentiation at both candidate and neutral loci, suggesting strong local adaptation.


Student Title Advisor(s) Date
Lukas Meier 
Extemwertanalyse von Starkniederschlägen Hans R. Künsch  Mar-2004
Climate changes are of great interest because they can have a significant impact on humans and the environment. In this thesis we use methods from extreme value theory to study the temporal evolution of heavy precipitation at 104 measuring stations in Switzerland. We model the station-wise exceedances of sufficiently high thresholds with a two-dimensional Poisson point process and non-stationary models for the location and scale parameters. For many stations we find strong evidence of a positive trend. To combine the individual trend estimates, we use an analogue of a hierarchical model. The spatial analysis of the results, however, shows anomalies that make it difficult to combine the measuring stations. We therefore investigate alternative approaches, mainly to model seasonal particularities better. This reveals large seasonal differences in the spatial dependence of the trend estimates across stations, which could be investigated in more detail.
Andreas Greutert 
Methoden zur Schätzung der Clusteranzahl Peter Bühlmann  Mar-2004
New methods of cluster analysis are continually being developed in connection with microarray experiments. Three such methods are presented in the technical report by Fridlyand and Dudoit. Fridlyand and Dudoit pursue two goals. First, they want to estimate the number of clusters using the resampling method clest. Second, the accuracy of the clustering is to be improved. To improve accuracy, they propose two bagging methods for clustering algorithms. We focus on the clest algorithm. Some theory is needed to understand and apply the algorithm. Chapter 2 begins with a brief introduction to cluster analysis. Further methods for estimating the number of clusters are presented in Chapter 3. Chapters 4, 5 and 6 introduce clest and its parameters. The aim of this diploma thesis is to understand the clest algorithm and, if possible, to improve it. For this it was necessary to implement clest in R (see Appendix B). We try to achieve the main goal of improving clest by varying its parameters. A further task is to construct a measure of the certainty of the estimated number of clusters (see Chapter 7). In addition, existing estimation methods are compared with clest.
Käthi Schneider 
Mischungsmodelle für evozierte Potenziale in Nervenzellen Hans R. Künsch  Mar-2004
This thesis is based on 18 data sets of neurobiological data on evoked potentials. Each data set contains amplitude and noise values, where the amplitude values represent the evoked potentials. Since the data are neurobiological, Chapter 2 first explains some biological terms and processes. These play a role in the data collection, which is also discussed. Besides a first overview of the data, the quantal hypothesis is addressed, since it plays an essential role in the analysis of the data. Objective: Mixture densities are fitted to the amplitude values of each data set. Various models are considered, and it is checked which model is best suited for this purpose. The first focus is on mixture models that assume dependent data. It must therefore first be checked whether there are any dependencies between evoked potentials at all. If there are, we examine how they can be modelled and whether these models yield better estimates of the mixture densities. The second focus is on the quantal hypothesis. The question is whether or not evoked potentials can be modelled as a superposition of a randomly released number of quanta.
Jeannine Britschgi 
Analyse einer Brustkrebsstudie Hans R. Künsch  Feb-2004
The aim of this diploma thesis is to carry out a survival analysis for a group of breast cancer patients whose tumour was surgically removed. We are less interested in the time until a patient dies of breast cancer than in the time until a recurrence (reappearance of the tumour). We want to construct a good prognostic model for the patients that predicts the time to relapse. This model will be a function. We want to find out which of the many explanatory variables are needed to characterize this function well. The question arises whether the information on the lymph nodes, which were also surgically removed and examined for metastases of the tumour, is necessary, or whether good prognostic models can be found without it.


Student Title Advisor(s) Date
Corinne Dahinden 
Schätzung des Vorhersagefehlers und Anwendungen auf Genexpressionsdaten Peter Bühlmann  Nov-2003
Chapter 2, Microarray Predictors, presents various methods that we later use to classify microarrays. Chapter 3 introduces estimates of the prediction error. Chapter 4, Estimation of Confidence Intervals, discusses estimates of the standard deviations of the estimators introduced in Chapter 3. In Chapters 5-7, the various estimates of the errors and the confidence intervals are compared with one another by means of simulations. These findings are applied in Chapter 8, Comparison of Microarray Predictors With and Without Clinical Variables. Chapter 9, Prevalidation, introduces the technique of the same name and applies it to various microarray predictors in order to determine the relevance of the clinical variables. For this diploma thesis I ran a great many simulations and programmed some of the error estimators used myself in the statistical software R. The code of the most important programs can be found under /u/dahinden/Diplomarbeit/RCode.
Christof Birrer 
Konstruktion von Vorschlagsdichten für Markovketten Monte Carlo mit Sprüngen zwischen Räumen unterschiedlicher Dimension Hans R. Künsch  Sep-2003
The aim of this diploma thesis was to construct proposal densities for Markov chain Monte Carlo, working mainly in the AR model. The thesis builds on the paper by Brooks, Giudici and Roberts (2003). It was to examine the suggestion made in the discussion contribution by H.R. Künsch, which recommends a more carefully chosen jump function than the obvious one used in the paper. For this purpose, simulations were also to be carried out with the statistical software R. A second part was to investigate whether jump functions more skilful than the obvious ones can also be found for ARCH models and Gaussian graphical models. The Kullback-Leibler distance was to be used for this.
Christoph Buser 
Differentialgleichungen mit zufälligen zeitvariierenden Parametern Hans R. Künsch  Mar-2003
Biological processes are described by differential equations. The assumption that the parameters are constant in time simplifies the solution and is often made in practice. The resulting systematic errors are accepted as long as they are not too large.
In our example we have three quantities: the biomass (bacteria), the substrate (food) and the oxygen, all given as concentrations. Only the oxygen concentration can be measured. We reconstruct the other quantities from these measurements, using a smoother that takes both future and past data into account.
We give up the assumption of constant parameters and model them as time-varying stochastic processes, more precisely as mean-reverting Ornstein-Uhlenbeck processes. This makes the model more flexible. The approach is Bayesian: we do not look for the best parameters but construct the conditional distribution of the parameters given the oxygen measurements. This is not possible in closed form. We use the Metropolis-Hastings algorithm to generate a Markov chain that asymptotically has the desired distribution. To avoid two-dimensional proposal densities we work with the Gibbs sampler, which in each step selects one of the two parameters to be proposed anew.
In the first simulation, the Metropolis-Hastings algorithm uses conditional Ornstein-Uhlenbeck processes as proposals for the new parameter values. The data do not enter the proposal density. We divide the time interval $[0,T]$ into random intervals of equal average length and change the parameter only on one such interval. This is necessary to obtain reasonable acceptance probabilities.
In the second simulation we use the squared deviations of the oxygen data to construct a proposal on an interval. The additional information reduces the variance of the proposal density, at the price of a higher computational cost.
During the procedure we are confronted with a problem. As long as substrate is available, the growth parameter dominates the death parameter. This masking effect increases the uncertainty in determining the death parameter in the first time period, and the uncertainty carries over to the main processes. In both simulations, the distributions of all processes are in most cases determined well to very well. Problems of the filter, which uses only past measurements, are remedied by the smoother. The smoother brings more data into the procedure and is preferable to the filter.
The algorithm is computationally intensive. On the one hand, a long burn-in phase is required to reach the stationary distribution. On the other hand, we reduce the dependence within the Markov chain by not using every element. A large number of steps in the algorithm is therefore necessary.
There are variants of the proposal density. One variant dispenses with the Gibbs sampler and works in two dimensions; this may better reflect the interplay of the two parameters and compensate for the masking effect. Another algorithm tries to extract more information from the oxygen deviations by taking their sign into account.
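To make two ingredients of the abstract above concrete, here is a minimal Python sketch. It is not the thesis's algorithm: it only simulates a mean-reverting Ornstein-Uhlenbeck path with the exact Gaussian transition, and runs a generic random-walk Metropolis-Hastings sampler with burn-in and thinning on a toy target. The function names, parameter values and the standard-normal-shaped toy target are assumptions chosen for illustration; the conditional Ornstein-Uhlenbeck proposals on random intervals and the Gibbs step are not reproduced.

```python
import math
import random


def simulate_ou(x0, mean, rate, sigma, dt, n_steps, rng):
    """Simulate dX = rate * (mean - X) dt + sigma dW on a regular time grid.

    Uses the exact Gaussian transition, so there is no discretization error.
    """
    decay = math.exp(-rate * dt)
    step_sd = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * rate))
    path = [x0]
    for _ in range(n_steps):
        path.append(mean + (path[-1] - mean) * decay + step_sd * rng.gauss(0.0, 1.0))
    return path


def metropolis_hastings(log_target, x0, proposal_sd, n_keep, burn_in, thin, rng):
    """Random-walk Metropolis-Hastings for a scalar, with burn-in and thinning."""
    x, log_p = x0, log_target(x0)
    samples = []
    for i in range(burn_in + n_keep * thin):
        candidate = x + rng.gauss(0.0, proposal_sd)
        log_p_candidate = log_target(candidate)
        # Symmetric proposal: accept with probability min(1, p(candidate) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_candidate - log_p)):
            x, log_p = candidate, log_p_candidate
        # Discard the burn-in phase, then keep only every `thin`-th state.
        if i >= burn_in and (i - burn_in) % thin == thin - 1:
            samples.append(x)
    return samples


# Illustrative values only (assumptions, not taken from the thesis).
rng = random.Random(0)
path = simulate_ou(x0=2.0, mean=0.5, rate=1.0, sigma=0.2, dt=0.1, n_steps=100, rng=rng)
draws = metropolis_hastings(lambda x: -0.5 * (x - 1.0) ** 2, x0=0.0,
                            proposal_sd=1.0, n_keep=2000, burn_in=500, thin=5, rng=rng)
```

The burn-in and thinning arguments mirror the two cost drivers named in the abstract: the chain needs time to reach its stationary distribution, and keeping only every few states reduces dependence between the retained draws, so many more steps are run than samples are kept.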
Eric André Graf 
Vorhersage des Luftqualitätsindexes (Forecasting the air quality index) Hans R. Künsch Mar-2003
This thesis is about the development of a model for forecasting an air quality index (Luftqualitätsindex, LQI). The LQI describes the state of the air in words and is published hourly on the Internet. It indicates the effect of the current air quality on health.
Measurements of air pollutants (ozone O3, nitrogen oxides NOx, nitric oxide NO and particulate matter PM10) produce numbers that describe the concentration of each substance in the outdoor air. The LQI is computed from these concentrations and indicates the influence of the pollutants on physical well-being. The statement made by the LQI is highly generalized, but it corresponds to current knowledge about the short-term effects of the pollutants on the human organism.
Each pollutant is then assigned an index level from 1 to 6 according to its concentration.
© 2017 Eidgenössische Technische Hochschule Zürich