Master's theses
2016
Student  Title  Advisor(s)  Date 

Manuel Schürch 
High-Dimensional Random Projection Ensemble Methods for Classification  Prof. Dr. Peter Bühlmann  Nov-2016
Abstract
In this thesis, we investigate random projection ensemble methods for multiclass classification based on the combination of arbitrary base classifiers operating on appropriately chosen low-dimensional random projections of the feature space. These methods are particularly intended for high-dimensional data sets where the dimension of the variables is comparable to or even greater than the number of available training data samples. We extend a recent proposal of Cannings and Samworth (2015) in two directions. First, we generalize their idea for binary classification to multiple classes. Second, we present alternative approaches to their weighted majority voting for the aggregation of the individual predictions in the ensemble to a final assignment. For this newly developed methodology, we provide implementations and an empirical comparison to state-of-the-art methods on synthetic as well as real-world high-dimensional data sets. Its competitive prediction performance underpins the promising direction of aggregating randomized low-dimensional projections. Moreover, we examine analogous ideas for regression and semi-supervised classification.
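The general idea of aggregating base classifiers fitted on low-dimensional random projections can be sketched as follows. This is only an illustration under simplifying assumptions that are not taken from the thesis: Gaussian projection matrices, a nearest-centroid base classifier, and plain (unweighted) majority voting. All function names are hypothetical.

```python
import random

def project(x, A):
    # A holds d projection directions, each a list of length p.
    return [sum(a_j * x_j for a_j, x_j in zip(a, x)) for a in A]

def nearest_centroid(z, centroids):
    # centroids maps each class label to its mean vector in the projected space.
    return min(centroids,
               key=lambda c: sum((zi - ci) ** 2 for zi, ci in zip(z, centroids[c])))

def rp_ensemble_predict(X_train, y_train, X_test, d=5, n_proj=25, seed=0):
    """Majority vote over nearest-centroid classifiers on random projections."""
    rng = random.Random(seed)
    p = len(X_train[0])
    labels = sorted(set(y_train))
    votes = [{c: 0 for c in labels} for _ in X_test]
    for _ in range(n_proj):
        # Assumption: i.i.d. standard Gaussian entries for the projection.
        A = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(d)]
        Z = [project(x, A) for x in X_train]
        centroids = {}
        for c in labels:
            rows = [z for z, y in zip(Z, y_train) if y == c]
            centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        for i, x in enumerate(X_test):
            votes[i][nearest_centroid(project(x, A), centroids)] += 1
    # Ties are broken in favour of the smallest label.
    return [max(labels, key=lambda c: v[c]) for v in votes]
```

Any base classifier could replace the nearest-centroid rule, and the vote could be replaced by another aggregation rule, which is the kind of variation the abstract describes.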


Fan Wu 
On Optimal Surface Estimation under Local Stationarity 
Dr. Rita Ghosh
Dr. Markus Kalisch 
Oct-2016
Abstract
Given a spatial dataset, consider a nonparametric regression model where the aim is to estimate the regression surface. Under the further assumption that the error term is locally stationary, the variance of the Priestley-Chao kernel estimator can be estimated without estimating the various nuisance parameters. The required uniform convergence results are proved in Ghosh (2015). In this thesis, we use these properties and propose a semiparametric algorithm for optimal bandwidth selection. The findings are then applied to a dataset of the Swiss National Forest Inventory (http://www.lfi.ch).
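For readers unfamiliar with the Priestley-Chao estimator, a minimal one-dimensional sketch is given below. The thesis deals with regression surfaces and bandwidth selection, neither of which is reproduced here; the Gaussian kernel, the fixed bandwidth `h` and the function names are assumptions made for illustration.

```python
import math

def gaussian_kernel(u):
    # Standard normal density, used here as the smoothing kernel (an assumption).
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def priestley_chao(x, xs, ys, h, kernel=gaussian_kernel):
    """Priestley-Chao estimate of the regression function at x.

    xs must be sorted in increasing order. The first design point only
    supplies a spacing and does not contribute a response value.
    """
    total = 0.0
    for i in range(1, len(xs)):
        total += (xs[i] - xs[i - 1]) * kernel((x - xs[i]) / h) * ys[i]
    return total / h
```

Each response is weighted by the spacing of the design points and by the kernel evaluated at the scaled distance to `x`, so the estimate is a local weighted sum whose smoothness is governed by `h`.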


Polina Minkina 
A new hybrid approach to learning Bayesian networks from observational data 
Dr. Markus Kalisch
Dr. Jack Kuipers 
Sep-2016
Abstract
This work presents a new hybrid approach to learning Bayesian networks from observational data. The method is based on the PC algorithm combined with a Bayesian-style MCMC search. Several versions of the algorithm are presented. The base version limits the search space to a PC skeleton and performs either a stochastic MAP search or sampling from the posterior distribution on the reduced search space. While this version yields relatively good results, in some cases the PC algorithm eliminates a large part of the true positive edges from the search space. To overcome this issue, we also suggest an algorithm for iterative expansion of the search space, which helps to increase the number of true positives and as a result leads to much better estimates both in terms of skeleton and equivalence class.
We run simulation studies and compare the performance of our approach to other algorithms for structure learning, such as the PC algorithm, greedy equivalence search (GES) and max-min hill climbing (MMHC). The advantages of our algorithm are more pronounced in a dense setting. In a sparse setting, the algorithm performs similarly to GES, but better than PC. We assess the computational complexity of the new approach, which grows polynomially with the size of the network and exponentially with the size of the maximal neighborhood; this is the main limitation of the method. For the PC algorithm, the lower bound on the computational complexity also grows exponentially with the size of the maximal neighborhood, hence we conclude that if the PC algorithm is feasible for some network, our approach should be feasible too.

Mun Lin Lynette Tay 
Statistical analysis of multi-model climate projections with a Bayesian hierarchical model
Prof. em. Dr. Hans-Rudolf Künsch
Prof. Dr. Peter L. Bühlmann
Aug-2016
Abstract
This thesis applies a Bayesian hierarchical model as developed by Buser et al. (2009), Buser et al. (2010) and Kerkhoff et al. (2015) to heterogeneous multi-model ensembles of global climate models (GCM) and regional climate models (RCM). The Bayesian hierarchical framework is applied to data from the European arm of the project CORDEX and probabilistic projections of future climate are derived from the climate models.
This thesis is also a continuation of the CH2011 initiative which aims to provide scientifically grounded information on a changing climate in Switzerland to aid decision-making and planning with regard to climate change strategies. It does so by assessing climate change in the course of the 21st century in Switzerland with a focus on projections of temperature and precipitation. Suitable priors for temperature and precipitation data are suggested and probabilistic projections for different regions in Switzerland, different seasons and different emission scenarios are illustrated and explained. Furthermore, a variant on the Bayesian model proposed by Kerkhoff et al. (2015) which weights data from RCMs more equally to their GCMs is introduced and the two models are compared against each other.

Ravi Mishra 
Gated Recurrent Neural Network Language Models  Prof. Dr. Nicolai Meinshausen  Aug-2016
Abstract
Long-term dependencies are difficult to learn with gradient descent in standard Recurrent Neural Networks due to vanishing and exploding gradient problems. Long Short-Term Memory and other gated networks combined with gradient clipping strategies have been successful at addressing these issues. This work provides details on standard RNN and gated RNN architectures. The focus lies on the forward and backward pass using backpropagation through time. We train an implementation of a character-level neural network language model on fine food review data. The goal is to model a probability distribution over the next character in a sequence when presented with the sequence of previous characters. The results of our experiments indicate that for large datasets and increasing sequence length gated architectures have better performance than traditional RNNs. This is in line with previous research.
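The abstract mentions gradient clipping without naming a specific strategy. One common variant, rescaling all gradients when their global norm exceeds a threshold, can be sketched as below. Whether the thesis uses this variant is an assumption, and the function name is hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    grads is a list of lists of floats, one inner list per parameter group.
    Gradients that are already small enough are returned unchanged (as copies).
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return [list(vec) for vec in grads]
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]
```

Rescaling keeps the direction of the update and only limits its length, which addresses exploding gradients; vanishing gradients are what the gated architectures are meant to handle.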

Janine Burren 
Outlier detection in temperature data by penalized least squares methods  Prof. Dr. Nicolai Meinshausen  Aug-2016
Abstract
Chernozhukov et al. (2015) proposed a new regularization technique called lava. In contrast to conventional methods like lasso or ridge regression, this method is able to discover signals that are neither sparse nor dense. It was shown that this method outperforms the conventional methods in simulations. The application to the temperature anomaly data for January in this thesis confirmed this.
The focus of this thesis lies on the comparison of the lasso method, the elastic net method and ridge regression with the lava method in theory and application, and it can be split into five main parts. Firstly, all considered regularization methods are described for a multiple linear regression setting and are related to each other in the orthonormal design case. Secondly, for the application to the temperature data, the lava method and a corresponding cross-validation approach had to be implemented in R. Thirdly, the given temperature anomaly data (1940-2015) is analyzed and ordinary least squares models are fitted to temperature data that result from a climate model, to assess how well temperature anomaly values can be predicted by the four nearest values. Fourthly, regularized linear regression models are fitted to the climate model data and predictions are made for an observed temperature anomaly data set. For this, a model fitting procedure was determined that is able to deal with the NA structure in the observed temperature anomaly data and has a reasonable computational time. The residuals produced by prediction are analyzed with respect to their spatial, temporal and probabilistic distributions. In addition, the behavior of the regularization methods on the temperature anomaly data is studied for some examples to compare the methods and to understand the distributions of the residuals. In the last part of the thesis, these residuals are used to detect outliers in the temperature anomaly data. An outlier detection procedure is proposed that takes into account the prediction error of the fitted linear models and the NA structure in the observational data set. Furthermore, an artificial outlier study is conducted to assess the outlier detection power of the four considered regularization methods.

Elias Bolzern 
Stochastic Actor Oriented Models: An Approach Towards Consistency and Multi Network Analysis  Prof. Dr. Marloes Henriette Maathuis  Jul-2016
Abstract
Stochastic actor-oriented models make it possible to describe longitudinal social networks, i.e., social networks observed at various time points. This model can be fitted either by a method of moments approach or a maximum likelihood approach.
In this thesis we discuss two topics. Firstly, up to now, there exists no proof of the consistency of the method of moments estimator. We discuss an approach that could lead to a consistency proof. Secondly, the existing theory allows us to examine only a single social network. We want to examine the common behaviour that underlies several longitudinal networks. This allows us to gain deeper insights into the general behaviour of such networks. We propose to detect the commonalities by considering maximin effects, which can be estimated by a magging-type estimator. We will call our new estimator the multi group estimator. Simulations show that the multi group estimator performs well, especially for a large number of observed time points. Furthermore, the estimator has nice properties in terms of computational efficiency.

Solt Kovács 
Changepoint detection for high-dimensional covariance matrix estimation  Peter Bühlmann  May-2016
Abstract
In this thesis we pursue the goal of high-dimensional covariance matrix estimation for data with abrupt structural changes. We try to detect these changes and estimate the covariance matrices in the resulting segments. Our approaches closely follow a recent proposal of Leonardi and Bühlmann (2016) for changepoint detection in the case of high-dimensional linear regression. We propose two estimation approaches that directly build on their regression estimator and a third procedure which is analogous to their regression estimator, but modified to match the likelihood arising in the case of covariance matrices. We mainly focus on the implementation, testing and comparison of these proposals. Moreover, we provide complementary material on the relevant literature on covariance matrix estimation and changepoint detection in similar settings, tuning parameter selection, models for simulations and error measures to evaluate performance. We also illustrate the developed methodology on a real-life example of stock returns.


José Luis Hablützel Aceijas 
Causal Structure Learning and Causal Inference  Dr. Markus Kalisch  Apr-2016
Abstract
This thesis presents the theory and main ideas behind some of today's most popular methods for causal structure learning as well as the ICP algorithm, a new algorithm based on a method recently developed at ETH Zurich. We then measure and compare the performance of these algorithms in two different ways. The first measure is the probability that a method finds exactly the set of parents of a randomly chosen target variable. The second measure is the reliability of a method in not reporting a node as a parent when it is not one. We focus on linear Structural Equation Models (SEM) and restrict ourselves to the situation where no hidden confounders are present. We start by reproducing and extending the results given in Peters, Bühlmann, and Meinshausen (2015); after that, we change the data-generating process in several ways in order to conduct further comparisons.


Pascal Kaiser 
Learning City Structures from Online Maps 
Markus Kalisch
Martin Jaggi
Thomas Hofmann
Mar-2016
Abstract
Huge amounts of remote sensing data are nowadays publicly available, with applications in a wide range of areas including the automated generation of maps, change detection in biodiversity, monitoring climate change and disaster relief. At the same time, deep learning with multi-layer neural networks, which is capable of learning complex patterns from huge datasets, has advanced greatly over the last few years. This work presents a method that uses publicly available remote sensing data to generate large and diverse new ground truth datasets, which can be used to train neural networks for the pixel-wise semantic segmentation of aerial images. First, new ground truth datasets for three different cities were generated, consisting of very-high-resolution (VHR) aerial images with ground sampling distance on the order of centimeters and corresponding pixel-wise object labels. Both VHR aerial images and object labels are publicly available and were downloaded from online map services over the internet. Second, the three newly generated ground truth datasets were used to learn the semantic segmentation of aerial images by using fully convolutional networks (FCNs), which have been introduced recently for accurate pixel-dense semantic segmentation tasks. Third, two modifications of the base FCN architecture were found that yielded performance improvements. Fourth, an FCN model was trained on huge and diverse ground truth data of the three cities simultaneously and achieved good semantic segmentations of aerial images of a geographic region that had not been used for training. This work shows that publicly available remote sensing data can be used to generate new ground truth datasets that effectively train neural networks for the semantic segmentation of aerial images. Moreover, the method presented here makes it possible to generate huge and in particular diverse ground truth datasets that enable neural networks to generalize their predictions to geographic regions that have not been used for training.

Sriharsha Challapalli 
Understanding the intricacies of the PC algorithm and optimising causal structure discovery 
Markus Kalisch
http://stat.ethz.ch/~kalischm/ 
Mar-2016
Abstract
The PC algorithm is one of the most notable algorithms in causal structure discovery. Over the years various suggestions have been made to optimize the algorithm further, but there is still scope to probe its intricacies more deeply. The current study examines the role of various factors such as the number of variables, the density of the true graph, the use of the conditional independence graph and the sequence in which conditional independence tests are carried out. The outcomes of the study contribute to the optimization of not just the PC algorithm but also causal structure discovery algorithms based on conditional independence tests in general.
The study suggests that skeleton-stable is the best of the studied algorithms for the discovery of the skeleton. The order-independent option is not the best for causal structure discovery and the BC variant is recommended. The study validates that the sequence of orders of the PC algorithm is integral to causal structure discovery. The study recommends avoiding the use of the conditional independence graph for very low values of p and very low densities. Algorithms based on conditional independence tests used in the study should be preferred to those based on greedy equivalence search except for extremely low values of p or extremely high density.

Sonja Meier 
Causal analysis of proximal and distal factors surrounding the HIV epidemic in Malawi 
Marloes Maathuis
Olivia Keiser 
Mar-2016
Abstract
The HIV epidemic in Malawi is a major cause of mortality and has a highly adverse impact on Malawi’s health system as well as on its economy. It is therefore the aim of this thesis to identify causal associations between proximal and distal factors that may drive the HIV epidemic. The Malawi Demographic and Health Survey 2010 provides a wide variety of behavioral, socioeconomic and structural variables as well as information on the HIV status of more than 12’000 participants. To find and display causal pathways, graphical models such as directed acyclic graphs are used. Amongst the numerous different causal structure learning methods, the RFCI algorithm and the GES algorithm are found to be suitable for the considered dataset. To include the sample weights from the survey, some modifications need to be made. The “weighted” versions of the two algorithms are repeatedly run on random subsets of all observations to obtain robust estimates. Finally, a summary graph is created, where only edges with a certain frequency are displayed. This analysis is carried out for three different sets of variables. Since the HIV prevalence amongst women is significantly higher than amongst men in Malawi, a stratification by gender provides further insight. The proposed method is able to detect various connections between proximal and distal variables while taking the provided sample weights into account. A group of variables robustly connected with the HIV status was found. However, the proposed method has difficulties determining causal directions, as these are not robust under resampling.


Yannick Suter 
Implementation of different algorithms for biomarker detection and classification in breath analysis using mass spectrometry 
Marloes Maathuis
Renato Zenobi 
Mar-2016
Abstract
We implement different algorithms for biomarker detection and classification for breath analysis studies using ambient ionization mass spectrometry. We test them on two studies done recently in the Zenobi research group at ETH Zürich on chronic obstructive pulmonary disease (COPD) and cystic fibrosis (CF). The studies investigate differences in molecules present in breath due to lung diseases.
The data sets contain many highly correlated variables, due to isotope patterns and biological pathways. We show that this is useful for the interpretation of the results, but has little effect on either biomarker detection or classification. For biomarker detection, we use the Mann-Whitney U test, as well as subsampling with either the Mann-Whitney U test or elastic net regression as the selection method. For classification, we use pre-filtering with the Mann-Whitney U test, followed by modern high-dimensional classification methods. The best performing methods for both biomarker detection and classification are different for the two studies. Due to time drift effects, no significant molecules were found at an FDR control level of q = 0.05 for the COPD study with the Mann-Whitney U test. For the CF study, 127 molecules were found at an FDR control level of q = 0.05. For classification, the best performing method for the COPD study was partial least squares regression followed by linear discriminant analysis (PLS-LDA), with an area under the ROC curve (AUC) value of 0.90. A second study on COPD is used as a validation set, which gives an AUC value of 0.71 for PLS-LDA. Concerning the CF study, the best performing classification method was principal component analysis followed by linear discriminant analysis (PCA-LDA) with an AUC value of 0.73. We show in simulations that hierarchical testing approaches given by Mandozzi (2015) do not work well in our setting.
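The abstract reports results at an FDR control level of q = 0.05 but does not say which procedure was used. A minimal sketch of the Benjamini-Hochberg step-up procedure, one standard way to control the FDR over a set of per-molecule p-values, is shown below; its use here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg procedure.

    The p-values are sorted, the largest rank k with p_(k) <= q * k / m is found,
    and the k hypotheses with the smallest p-values are rejected.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])
```

In a setting like the one described, `pvalues` would hold one Mann-Whitney U test p-value per measured molecule, and the returned indices would be the candidate biomarkers.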

Zhiying Cui 
Quantifying Subject Level Uncertainty Through Probabilistic Prediction for Autism Classification Based on fMRI Data 
Marloes Maathuis
Pegah Kassraian Fard 
Feb-2016
Abstract
This thesis aims to quantify the subject-level uncertainties of the classification between subjects with and without autism spectrum disorder using a type of brain image data, namely resting-state functional magnetic resonance imaging data. The subject-level uncertainty measure considered in this study is based on the probabilistic predictions, and the quality of the former is shown to be entirely dependent on the quality of the latter. A selected subset of the data from the Autism Brain Imaging Data Exchange is used for classification, and the quality of the label and probability predictions of nine conventional classifiers combined with simple threshold feature selection is evaluated through cross-validation and by various evaluation metrics. The best achieved accuracy is 77%, by logistic regression with L1 regularization. The best probability predictions are produced by logistic regression with L1 and L2 regularization for two of the three probability evaluation metrics, and by both random forest and extremely randomized trees for the third evaluation metric. Considering both label and probability predictions, the best classifiers for this data set are logistic regression with L1 and L2 regularization and adaptive boosting. To further improve the probability predictions, two probability calibration methods are applied to each of the above-mentioned best classifiers, and in the majority of the twelve examined cases, the probability calibration yields some improvement. Similar classification tasks are also performed on one other autism data set and two additional data sets to examine the performance in different settings.

Jakob A. Dambon 
Multiple Comparisons with the Best Methods and their Implementations in R  Dr. Lukas Meier  Feb-2016
Abstract
The simultaneous evaluation of multiple factors is required in many scientific experiments. Multiple comparisons account for the multiplicity and are a useful tool for giving simultaneous inference of those factors. There are several methods for multiple comparisons, in particular the multiple comparisons with the best (MCB), which is our main focus for this thesis. Here, we are trying to find the best treatment in comparison to the others.
The main purpose of this thesis was to implement Edwards-Hsu’s MCB method in R, as it is not part of the R package multcomp. The main achievements of this thesis are step-by-step derivations of the confidence intervals of Edwards-Hsu’s MCB method in the balanced and unbalanced one-way ANOVA model as well as a successful implementation in R.

Maurus Thurneysen 
Performance Analysis of a Next Generation Sequencing Instrument 
Markus Kalisch
Harald Quintel 
Feb-2016
Abstract
The complexity of processes and data output in molecular diagnostics is growing rapidly. In December 2015 QIAGEN AG entered the market with the first complete workflow in Next Generation Sequencing designed to deliver all the steps from Sample to Insight to the customer. This GeneReader NGS System features built-in sample preparation, sequencing of the genetic code as well as analyses of the gene sequences and produces actionable insights for customers working in diagnostic fields.
The quality and reliability of such a workflow are crucial factors in assuring high performance standards. The statistical analysis of critical steps within the workflow provides a powerful means for achieving this goal. So far, this approach has not been exploited to its fullest in this context. Therefore, the aim of this master's thesis in statistics is to analyze the performance of the newly developed GeneReader instrument, which carries out the sequencing substep of the workflow, with statistical learning techniques. Quality Control data from instrument production and data from test campaigns in the field are analyzed by an unsupervised learning approach and then combined into supervised learning problems to predict the performance quality of a GeneReader instrument from its Quality Control data. It was found that the GeneReader instruments are calibrated well and that their contribution to the variability of the workflow is relatively small. However, the power of this approach was limited due to the small number of true replicates available. Nonetheless, this investigation demonstrates the currently largely untapped potential of systematically applying statistical analysis to assess and guarantee high quality and stability in QIAGEN’s development and production processes.

Sven Buchmann 
High-Dimensional Inference: Presenting the major inference methods, introducing the Unbalanced Multi Sample Splitting Method and comparing all in an Empirical Study  Martin Mächler  Feb-2016
Abstract
Performing statistical inference in the high-dimensional setting is challenging and has become an important task in Statistics over the last decades. In my thesis I first give a selective overview of the high-dimensional inference methods, which have been developed to assign p-values and confidence intervals in linear models, including a graphical survey of every presented inference method. The overview is split into two parts: methods for detecting single predictor variables and methods for detecting groups of predictor variables.
Secondly, I introduce a new inference method in the high-dimensional setting, called Unbalanced Multi Sample Splitting, which is a modification of the Multi Sample Splitting Method of Meinshausen, Meier, and Bühlmann (2009). Furthermore, I prove its familywise error control. Finally, I perform an Empirical Study using the R package simsalapar, which consists of three parts: designing the simulation study, actually performing the simulation and analyzing the various results.

Jürgen Zell 
Analyzing growth and mortality of Picea Abies for a growth simulator in Switzerland  Martin Mächler  Feb-2016
Abstract
The thesis is about modeling growth and mortality of Picea Abies. The data are complex and stem from experimental forest management trials all over Switzerland. In the first part growth was modeled. 65% of the total variation can be explained by many different explanatory variables. The second part is about mortality and contains a logistic regression model, which is compared to a Survival analysis approach.


Marc Stefani 
Lasso Chain Ladder Constrained Optimization for Claims Reserving 
Lukas Meier
Jürg Schelldorfer 
Feb-2016
Abstract
The Chain Ladder method is by far the most popular method for predicting non-life claims reserves in the insurance industry. Its simplicity induces two limitations: First, the estimation of old development factors is not robust, because only few observations are available. Second, the Chain Ladder method is not able to deal with diagonal effects (i.e. claims inflation), which are often present in claims reserving data. Although many research papers present extensions to the classical Chain Ladder method, none has addressed the issue of using constrained optimization with Lasso-type estimators. Lasso-type estimators are primarily attractive for high-dimensional statistics and still useful in low-dimensional problems, either to obtain a smaller set of estimated parameters that exhibits the strongest effects or to obtain a robust estimator which reduces the variability of the estimated model parameters. Since the Chain Ladder model can be understood as a regression problem, it was possible to develop Lasso-type estimators for three different models: a regression version of the Chain Ladder Time Series Model, an extension which allows modeling diagonal effects and an Overdispersed Poisson Model which also considers diagonal effects. To solve the optimization problems, we build a regression framework to transform the claims reserving data into appropriate data matrices. The application to real data sets shows that Lasso-type estimators predict plausible claims reserves. For simulated data sets we often achieve a better prediction accuracy with Lasso-type estimators compared to the Chain Ladder method, especially in situations where Chain Ladder model assumptions are not fulfilled. However, the solution of Lasso-type estimators is sensitive to the choice of the optimal tuning parameter and the model selection criterion. Finally, we estimate the prediction accuracy of Lasso-type Chain Ladder estimators via model-based bootstrap.
The implementation of the Lasso-type estimators is done in R.
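As background for the baseline that the thesis extends, the classical Chain Ladder computation can be sketched as follows. This shows only the standard volume-weighted development factors, not the Lasso-type estimators of the thesis. Python is used for the sketch although the thesis implementation is in R, and the data layout (a list of rows of cumulative claims, oldest accident year first) and function names are assumptions.

```python
def development_factors(triangle):
    """Volume-weighted Chain Ladder factors from a cumulative run-off triangle.

    triangle[i] holds the observed cumulative claims of accident year i;
    the first row is the longest, each later row is one entry shorter.
    """
    n = len(triangle[0])
    factors = []
    for j in range(n - 1):
        # Only rows where development period j + 1 is already observed contribute.
        rows = [r for r in triangle if len(r) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors

def complete_triangle(triangle):
    """Fill the unobserved lower-right part by applying the factors successively."""
    f = development_factors(triangle)
    full = []
    for row in triangle:
        row = list(row)
        for j in range(len(row) - 1, len(f)):
            row.append(row[-1] * f[j])
        full.append(row)
    return full

def reserves(triangle):
    """Predicted ultimate minus latest observed cumulative claim, per accident year."""
    return [full[-1] - row[-1] for row, full in zip(triangle, complete_triangle(triangle))]
```

The factors for late development periods rest on very few rows (the last one on a single row), which illustrates the first limitation named in the abstract.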

Benjamin Jakob 
Censored Regression Models  Lukas Meier  Jan-2016
Abstract
Empirically bounded distributions are investigated, and regression is applied to these dependent variables with several independent variables. Different models (censored as well as uncensored) are used and implemented in the programming language R, such as the Logit model, the Beta distribution model, the Tree model, the Random Forest, a Censored Gamma model and two slight variations of it.
The conclusion is that the Censored Gamma model and its extensions proposed by Sigrist and Stahel (2011) perform well, though not always, in comparison to the other models and might therefore be an attractive option for banks and insurers to investigate further.
2015
Student  Title  Advisor(s)  Date 

Jakob Olbrich 
Screening Rules for Convex Problems 
Bernd Gärtner
Peter Bühlmann
Martin Jaggi
Sep-2015
Abstract
This thesis gives a general approach to deriving screening rules for convex optimization problems. The approach consists of three steps. In the first step, the Karush-Kuhn-Tucker conditions are used to derive necessary conditions that allow the problem size to be reduced. These conditions depend on the optimal solution itself. The second step is to gather information on the optimal solution from a known approximation. In the third and final step, this information is used to obtain conditions that do not depend on the optimal solution, which are then called screening rules. This thesis studies in particular the unit simplex, the unit box and polytopes as domains. The resulting screening rules can be applied to various problems, such as Support Vector Machines (SVM), the Minimum Enclosing Ball (MEB), LASSO problems and logistic regression, and they are compared to existing rules for those problems.


Nicolas Bennett 
Analysis of High Content RNA Interference Screens at Single Cell Level 
Peter Bühlmann
Anna Drewek 
Aug-2015
Abstract
Infectious diseases are among the leading causes of death worldwide and the evolution of antimicrobial resistance poses a troubling development in cases where our only effective line of defense is based on distribution of antibiotic agents. One possible way out of this problematic situation comes by the alternative approach of host-directed therapeutics, which in turn warrants the meticulous study of the human infectome. Therefore, large-scale studies such as genome-wide siRNA knockdown experiments as performed by the InfectX/TargetInfectX consortia are of great importance.
The richness of datasets resulting from image-based high-throughput RNAi screens permits a broad range of possible analysis approaches to be employed. The present study investigates cellular phenotypes as induced by gene knockdown, with a focus on the effect of pathogen infection, by applying generalized linear models (GLMs) to single cell measurements. In order to simplify handling of such datasets, an R package is presented that fetches queried data from a centralized data store and produces data structures capable of efficiently representing the logic of an assay plate. Convenience functions to preprocess, manipulate and normalize the resulting objects are provided, as is a caching system that helps to significantly speed up common operations. GLM analysis of phenotypic response from knockdown and infection was attempted, but did not yield satisfactory results, most probably due to issues with data normalization. In order to facilitate the simultaneous study of measurements originating from multiple assay plates, several normalization schemes were explored, including Z- and B-scoring, as well as modeling technical artifacts with multivariate adaptive regression splines (MARS). While some improvements of data quality were observed, experimental sources of error could not be sufficiently controlled for meaningful GLM regression.

Marco Eigenmann 
A Score-Based Method for Inferring Structural Equation Models with Additive Noise  Peter Bühlmann  Aug2015 
Abstract
We implement and analyse a new score-based algorithm for inferring linear structural equation models with a mixture of Gaussian and non-Gaussian additive noise. After introducing some well-known algorithms, together with their theory, pseudocode, main advantages and disadvantages and some examples, we cover in detail the technical results that underpin our new algorithm. Finally, we present the algorithm in detail, describe its R implementation and compare its performance with that of the algorithms introduced in the earlier chapters.


Patrick Welti 
Analysis of the Empirical Spectral Distribution of a Class of Large Dimensional Random Matrices with the Aid of the Stieltjes Transform 
Sara van de Geer
Alan Muro Jiminez 
Aug2015 
Abstract
tba


Paweł Morzywołek 
Nonparametric Methods for Estimation of Hawkes Process for High-frequency Financial Data 
Peter Bühlmann
Vladimir Filimonov Didier Sornette 
Aug2015 
Abstract
Owing to its ability to represent clustered data well, the self-excited Hawkes model has steadily grown in popularity in recent years. Originally applied to earthquake prediction, it has also been used to anticipate flash crashes in finance, epidemic-type behaviour in social media such as Twitter and YouTube, and outbursts of crime in big cities.
The aim of this work is to conduct a comprehensive comparison of the existing nonparametric techniques for estimating the Hawkes model, which provide insights into the data without making any a priori assumptions on the correlation structure of the observables. To the best of my knowledge, such work has not been done so far. The first method considered is the EM algorithm, widely used in nonparametric statistics and here adjusted to the case of a Hawkes process. The second procedure is based on estimating a conditional expectation of the Hawkes model's counting process and then solving a Wiener-Hopf type integral equation to obtain the kernel function of the model. The last estimation technique represents the Hawkes model as an integer-valued autoregressive model and then applies tools from time series theory to obtain the parameters of the model. The methods were tested on synthetic data generated from the Hawkes model with different kernels and different parameters. I investigated how the sample size and the overlap of point clusters influence the performance of the different estimation methods. In this analysis I did not restrict myself to the most commonly used exponential and power-law kernels, but also considered the less typical step and cutoff kernels. After the comparison on synthetic data, I proceeded with an empirical data analysis, testing the estimation methods on high-frequency data of price changes of E-mini S&P 500 and Brent Crude futures contracts. 
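To make the synthetic-data step described above concrete, here is a minimal sketch of how event times of a Hawkes process with an exponential kernel could be simulated with Ogata's thinning algorithm. It is an illustration only, not the thesis code. The parameterization (baseline `mu`, jump size `alpha`, decay rate `beta`, with `alpha / beta < 1` for stationarity) and the function name are assumptions made for this example.

```python
import math
import random


def simulate_hawkes_exp(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on [0, horizon] by Ogata's thinning algorithm.

    Assumed intensity: lambda(t) = mu + sum over past events t_i of
    alpha * exp(-beta * (t - t_i)).
    """
    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity just after time t
    while True:
        # Between events the intensity only decays, so its current value is an upper bound.
        lam_bar = mu + excitation
        wait = rng.expovariate(lam_bar)
        if t + wait > horizon:
            break
        t += wait
        excitation *= math.exp(-beta * wait)  # decay up to the candidate time
        lam_t = mu + excitation
        # Accept the candidate with probability lam_t / lam_bar.
        if rng.random() * lam_bar <= lam_t:
            events.append(t)
            excitation += alpha  # each accepted event raises the intensity
    return events


events = simulate_hawkes_exp(mu=1.0, alpha=0.5, beta=1.0, horizon=100.0)
```

With these illustrative parameters the long-run event rate is mu / (1 - alpha / beta) = 2 per unit time, so roughly 200 events are expected on the horizon used here.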

Philip Berntsen 
Particle filter adapted to jump-diffusion model of bubbles and crashes with non-local crash-hazard rate estimation 
Markus Kalisch
Didier Sornette Yannick Malevergne 
Jul2015 
Abstract
Crashes in the financial sector probably represent the most striking events among all possible extreme phenomena. The impact of crises has become more severe and their arrivals more frequent. The most recent financial crises shed fresh light on the importance of identifying and understanding financial bubbles and crashes.
The model developed by Malevergne and Sornette (2014) aims at describing the dynamics of the underlying occurrences and probability of crashes. A bubble in this work is synonymous with prices growing at a higher rate than can be expected as normal growth over the same time period. A non-local estimation of the crash hazard rate takes unsustainable price growth into account and increases as the spread between a proxy for the fundamental value and the market price becomes greater. The historical evaluation of the jump risk is unique and expands the understanding of the crash probability dynamics assumed to be embedded in financial log-return data. The present work is mainly concerned with developing fast sequential Monte Carlo methods in C++. The algorithms are developed for learning about unobserved shocks from discretely realized prices for the model introduced by Malevergne and Sornette (2014). In particular, we show how the best-performing filter, the auxiliary particle filter, is derived for the model at hand. All code is available in the appendix for reproducibility and research extensions. In addition, we show how the filter can be used to calibrate the model. The estimation of the parameters, however, is shown to be difficult. 
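For readers unfamiliar with sequential Monte Carlo, the sketch below shows the simpler bootstrap particle filter on a toy model. It is neither the auxiliary particle filter derived in the thesis nor the Malevergne and Sornette model, and it is written in Python rather than C++. The toy model (a Gaussian random-walk state observed with Gaussian noise), the function name and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500,
                              sigma_state=0.5, sigma_obs=1.0, seed=0):
    """Return filtered state means for a toy random-walk-plus-noise model.

    Assumed toy model: x_t = x_{t-1} + N(0, sigma_state^2), y_t = x_t + N(0, sigma_obs^2).
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    filtered_means = []
    for y in observations:
        # 1. Propagate every particle through the state equation.
        particles = [x + rng.gauss(0.0, sigma_state) for x in particles]
        # 2. Weight particles by the Gaussian observation likelihood.
        weights = [math.exp(-0.5 * ((y - x) / sigma_obs) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:  # guard against numerical underflow of all weights
            weights = [1.0] * n_particles
            total = float(n_particles)
        weights = [w / total for w in weights]
        # 3. Record the weighted mean as the filtered estimate.
        filtered_means.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Multinomial resampling to avoid weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return filtered_means


# Toy usage: constant observations should pull the estimate towards their value.
estimates = bootstrap_particle_filter([5.0] * 50)
```

An auxiliary particle filter differs from this sketch by using the next observation when choosing which particles to propagate.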

Jakub Smajek 
Causal inference beyond adjustment  Markus Kalisch  Jul2015 
Abstract
Covariate adjustment is one of the most popular and widely used techniques for estimating causal effects. The method is easy to use, has a well-understood theory and can be combined with other statistical techniques for efficient estimation of a given causal effect. The problem is that covariate adjustment is not complete, in the sense that it may fail to identify a causal effect even if the effect is identifiable by some other method. The first goal of the thesis is to demonstrate this problem and present some alternative techniques, such as instrumental variables and a new identification method, that can be useful in the estimation of causal effects (chapter 2). The next goal, and the main theme of the thesis, is to answer the question: "How restrictive is it if we restrict causal inference to adjustment methods?". The third chapter addresses this question from a theoretical perspective for single nodes X and Y. It presents important results from other authors and generalizes some of them to two types of graphs: acyclic directed mixed graphs (ADMGs, or latent projections) and maximal ancestral graphs (MAGs). The chapter shows that the possibility of identifying a causal effect by covariate adjustment cannot be lost by converting a DAG to the corresponding latent projection, and provides a criterion that characterizes when a given causal effect is identifiable at all (by any method) but not by covariate adjustment in an ADMG G. It also shows that the possibility of estimating a causal effect can be lost purely through the conversion from a latent projection to a corresponding MAG, and provides a criterion that specifies when this happens. Moreover, the third chapter provides a necessary, sufficient and constructive criterion for forming an adjustment set in a given MAG M if X and Y are single variables. Finally, partially based on the theoretical results derived earlier in the thesis, the question is addressed in a simulation study in chapter 4. 
The chapter describes implementation issues, methodology and several different experiments. The experiments concentrate on comparing the complete identification algorithm and the covariate adjustment method in terms of the proportion of identifiable causal effects. The comparison on uniformly sampled ADMGs shows a big advantage for the former method. It turns out, however, that the difference is mainly caused by some simple cases that can be easily identified. This observation leads to a simple but very effective improvement of the covariate adjustment method that can significantly increase the proportion of identifiable causal effects. Finally, an experiment is performed that shows how much is lost in the conversion from an ADMG to a MAG. The problem is especially visible if we restrict the analysis to graphs that contain a causal path from X to Y.


Lukas Tuggener 
Analysis of Cross-Over Trials  Markus Kalisch  Jul2015 
Abstract
The goal of this thesis is to give the reader an introduction to crossover trials. The first chapter explains the most basic crossover design.
Using this design as an illustration, it presents the theory needed to analyse crossover trials. It shows how this basic design is weak in many situations and introduces designs that are more versatile. Three computer simulations help build an intuitive understanding of crossover designs. The most important insight from this thesis is that a good design choice is always a multifactorial trade-off between subject recruiting, study duration and design complexity. If available, it takes into account information about the expected carry-over behaviour and the structure of the between- and within-subject variability. 

Maria Elisabetta Ghisu 
A comparative study of Sparse PCA with extensions to Sparse CCA  Marloes Maathuis  Jul2015 
Abstract
In this thesis we compare different approaches to sparse principal component analysis (sparse PCA) and then extend our investigation to sparse canonical correlation analysis (sparse CCA).
First, we study sparse PCA methods, where regularization techniques are included in classical PCA to obtain sparse loadings. We compare different formulations by analyzing their theoretical foundations and algorithms. Moreover, we carry out simulation studies to evaluate the performance in a wide variety of scenarios. The optimal choice of method depends on the objective and on the specific combination of parameters. Our results suggest that the SPC approach (Witten et al., 2009) usually outperforms the other techniques in recovering the true structure of the loadings, although the angle between the true and the estimated vector is generally large. Subsequently, we examine the closely related problem of sparse CCA, where sparsity is imposed on the canonical correlation vectors. After a theoretical study of the methods, we run simulations to assess their quality. When the covariance matrices of the two sets of variables are not nearly diagonal, CAPIT (Chen et al., 2013) shows higher accuracy; otherwise, the performances are similar. Finally, we consider applications of both sparse PCA and sparse CCA to real data sets, obtaining satisfactory results in most situations. 

Xiao Ye Zhan 
Modelling Operational Loss Event Frequencies 
Marloes Maathuis
Michael Amrein 
Jul2015 
Abstract
In this paper we study the application of count data modelling approaches to monthly counts of operational risk events recorded over 13 years at UBS. Assuming that the underlying distribution of the counts is Poisson, we consider nonparametric and parametric regressions and a time series model. A mean-matching variance stabilizing transformation (VST) is used to facilitate the nonparametric Poisson regression and reduce the problem to a homoscedastic Gaussian regression. The Poisson GLM regression and the generalized linear autoregressive moving average (GLARMA) model are applied to investigate the relationship between the number of operational losses observed and exogenous variables, as well as the dependence structure in the data. Our analysis shows significant connections between the loss count data and the financial and economic drivers. Notable serial correlations are also found in the data, with special attention paid to the Poisson distribution assumption and the overdispersion issue. Simulation experiments are also provided to examine numerical properties of the estimators.
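The effect of a variance stabilizing transformation on Poisson counts can be checked numerically. The sketch below assumes the root transform 2 * sqrt(X + 1/4), which is a common mean-matching choice for the Poisson case; the abstract does not state which variant the thesis uses, and the binning step that usually accompanies the transform in nonparametric regression is omitted. The sampler and all names are written for this example only.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (moderate lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def root_transform(count, c=0.25):
    """Assumed mean-matching root transform: 2 * sqrt(count + c) with c = 1/4."""
    return 2.0 * math.sqrt(count + c)


rng = random.Random(0)
lam = 20.0
counts = [sample_poisson(lam, rng) for _ in range(20000)]
transformed = [root_transform(x) for x in counts]

# The raw counts have variance close to lam; the transformed values have variance close to 1
# and mean close to 2 * sqrt(lam), which is what makes a Gaussian regression applicable.
raw_variance = variance(counts)
vst_variance = variance(transformed)
vst_mean = mean(transformed)
```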


Marcos Felipe Monteiro Freire Ribeiro 
Learning with Dictionaries  Nicolai Meinshausen  Jul2015 
Abstract
The method of dictionary learning was introduced by Olshausen and Field (1997) as a model for images based on the primary visual cortex. It has been successfully used for representing sensory data like images and audio, also providing an explanation for many observed properties in the response of cortical simple cells. In this thesis, we will show that the method can also be derived from an information theoretical point of view. The approach is similar to Bell and Sejnowski (1995) but substitutes the framework of neural networks by a probabilistic one. We also discuss how the learned representations can be used for classification and apply the theoretical results to two real world problems. In the first problem, we analyse GPS data in order to characterize driving styles. In the second, we analyse fundus images of the eye in order to diagnose diabetic retinopathy.


Oxana Storozhenko 
Maximin effects with tree ensembles  Nicolai Meinshausen  Jul2015 
Abstract
Nonparametric models, such as regression trees, are often used as a primary estimation method in prediction problems. Fitting the trees requires virtually no assumptions about the data, the learning algorithm requires almost no tuning, and nonlinear relationships in the data are handled well. The flexibility of trees has been exploited in ensemble learning, where the members of an ensemble are trees fit to different samples of the training data. One of the most popular off-the-shelf prediction algorithms is random forest (Breiman (2001)), which constructs an ensemble of randomised trees trained on bootstrap samples of the data and averages the predictions made by each tree. We propose to extend this algorithm to prediction problems with inhomogeneous data. In particular, the estimators in the ensemble can be trained on different groups of the training data, as opposed to perturbing the dataset with bootstrap sampling. If the data has outliers, contaminations, or time-varying or temporary effects that are present locally, dividing the dataset into groups in a sequential manner yields more diverse estimators. Another adjustment in the context of inhomogeneous data is finding a vector of weights for the estimators in the ensemble such that future predictions are optimal whatever group the new data point comes from. Bühlmann and Meinshausen (2014) proposed to minimise the L2-norm of the convex combination of the fitted values of the estimators and to use the resulting weights in order to maximise the minimum explained variance across groups. This scheme is called maximin aggregation, and we show how it works for inhomogeneous data.
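The weight computation described above, minimising the L2-norm of a convex combination of fitted-value vectors, is a small quadratic program over the probability simplex. The following sketch solves it with a Frank-Wolfe iteration and exact line search. The solver choice, the function name `maximin_weights` and the toy inputs are assumptions for illustration; the thesis may use a different optimizer.

```python
def maximin_weights(fitted, n_iter=200):
    """Find convex weights w minimising || sum_g w[g] * fitted[g] ||_2.

    `fitted` is a list of equally long lists, one vector of fitted values per group estimator.
    """
    n_groups, n_obs = len(fitted), len(fitted[0])
    w = [1.0 / n_groups] * n_groups
    combo = [sum(w[g] * fitted[g][i] for g in range(n_groups)) for i in range(n_obs)]
    for _ in range(n_iter):
        # Gradient of the squared norm with respect to w[g] is proportional to <fitted[g], combo>.
        grads = [sum(f * c for f, c in zip(fitted[g], combo)) for g in range(n_groups)]
        s = min(range(n_groups), key=grads.__getitem__)  # best simplex vertex
        direction = [f - c for f, c in zip(fitted[s], combo)]
        dd = sum(x * x for x in direction)
        if dd == 0.0:
            break
        # Exact line search along the segment towards the chosen vertex, clipped to [0, 1].
        gamma = -sum(c * x for c, x in zip(combo, direction)) / dd
        gamma = max(0.0, min(1.0, gamma))
        if gamma == 0.0:
            break
        w = [(1.0 - gamma) * wg for wg in w]
        w[s] += gamma
        combo = [c + gamma * x for c, x in zip(combo, direction)]
    return w


# Toy example with two group estimators and two observations.
weights = maximin_weights([[3.0, 0.0], [0.0, 1.0]])
```

In the toy example the objective is (3w)^2 + (1 - w)^2 for the first weight w, which is minimised at w = 0.1.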


Teja Turk 
Comparison of Confidence and Prediction Interval Approaches in Nonlinear Mixed-Effects Models  Lukas Meier  Jun2015 
Abstract
In this study we aim to assess the performance of various approaches to confidence and prediction intervals in single-level nonlinear mixed-effects models. The evaluation is based on simulated samples of coverage rates for 13 nonlinear functions.
The bootstrap confidence intervals are constructed from parametrically, nonparametrically and case-resampled datasets. In addition, the confidence intervals from the intervals function and the Wald confidence intervals are included in the comparison. The performance of all methods is evaluated for all three types of parameters: the fixed effects, the variance-covariance components and the within-group standard deviation. Finally, the Wald confidence intervals are improved by empirically adjusting the degrees of freedom of the t-statistic. In general, the simulation speaks in favour of the non-bootstrap approaches. The prediction interval methods are based on the Wald test and derived separately for observed and unobserved groups. The derivation of the variance of the prediction error is based on various linear approximations of the prediction error. In pairwise comparisons with their bootstrap variants, no apparent differences are detected. When their performance is compared with that of prediction intervals based on the bootstrap prediction error distribution, the latter exhibit coverage rates closer to the nominal values. 

Caroline Matthis 
Classifying Autistic Versus Typically Developing Subjects Based on Resting State fMRI Data 
Marloes Maathuis
Nicole Wenderoth Pegah Kassraian Fard 
Apr2015 
Abstract
In this thesis we investigate several classifiers to discriminate between autistic and typically developing subjects based on resting state fMRI data. We use data from the Autism Brain Imaging Data Exchange (ABIDE) database, which consists of fMRI scans of 1112 subjects. First, we implement the Leave-One-Out (LOO) classifier designed by Anderson et al. [2], which attains an accuracy of 60 %. Next, we run various conventional classifiers on the data and compare their predictive performance to the LOO classifier. Most of the examined classifiers perform at least as well as the LOO classifier; a flexible formulation of discriminant analysis reaches an accuracy of 76 %. In a last step we attempt to attach a subject-specific uncertainty to the classification. Based on work by Fraley and Raftery [18], the posterior distributions of the flexible formulation of discriminant analysis are used to model these uncertainties. In a short simulation study we illustrate the informative value of the estimated uncertainties, given that the distributional assumptions are valid. Then, this uncertainty model is evaluated on the data, yielding satisfactory results.


Julia Brandenberg 
Statistical Analysis of Global Phytoplankton Biogeography in Mechanistic Models and Observations  Nicolai Meinshausen  Apr2015 
Abstract
After five months of intense work, I am proud to submit my Master's thesis. I would like to thank my advisor Dr. Meike Vogt for her constant support, her reliability and motivation, and congratulate her on her baby, which was one of the highlights of this period. Besides many fruitful on-topic discussions, I enjoyed the off-topic horse-related chats with her. Special thanks go to my advisor Prof. Dr. Nicolai Meinshausen, whose support was competent, patient and committed. During several meetings I was able to deepen my statistical understanding, and his versatile approaches to problem solving motivated me to try different techniques. I would like to thank Prof. Dr. Nicolas Gruber for his advice and for having me in his group over the past months. Dr. Thomas Froelicher supported me in the interpretation of my results and was my contact during Meike's absence. Dr. Charlotte Laufkoetter and Dr. Chantal Swan contributed to this work by providing me with data and information concerning it. Last but not least, I would like to thank all my colleagues from the environmental physics group for their advice and contributions, and especially for making this a time I will look back on with such pleasure.
At this point I would like to mention my parents, Barbara and Andreas Brandenberg, and thank them for their unconditional support over the last years. Their love and faith in me contributed greatly to all my achievements and made me the person I am today. Thank you! 

Sonja Gassner 
Fitting and Learning of Bow-free Acyclic Path Diagram Models 
Marloes Maathuis
Preetam Nandy Christopher Nowzohour 
Mar2015 
Abstract
We consider the problem of learning causal structures from observational data when the data are generated from a linear structural equation model. Under the assumption that the path diagram of the model is acyclic and the error variables are uncorrelated, one can apply a search and score technique to learn the underlying structure. However, the assumption of uncorrelated errors is often too restrictive. In this thesis we consider a more general subclass of linear structural equation models for structure learning, where correlation of the errors is allowed unless the corresponding random variables are in a direct causal relation. These models are called bow-free acyclic path diagram (BAP) models. BAP models are almost everywhere identifiable, which is in general not ensured for linear structural equation models with arbitrary correlation patterns. First, we consider two methods for estimating the parameters in BAP models. One results from the proof of the identifiability of BAP models and is implemented in this thesis. The other one is an iterative partial maximization algorithm for maximum likelihood estimation, for which an implementation was already available. Next, we use these two fitting methods in a greedy search algorithm for structure learning, which repeatedly fits and scores BAP models and chooses the model with the highest score. Finally, we evaluate the performance of these methods in a small simulation study.


Carolina Maestri 
Two approaches of causal inference for time series data  Marloes Maathuis  Mar2015 
Abstract
In this Master's thesis two approaches of causal inference for time series data are studied. The first one addresses nonlinear deterministic systems, while the second one is designed for linear stochastic systems. For both methods the theoretical foundations are presented and the algorithms are analysed and described in detail. Applications to real data are also shown and various simulations are run to investigate the performances of the algorithms in different situations.


Kari Kolbeinsson 
Model Selection for Outcome Predictions of Professional Football Matches  Markus Kalisch  Mar2015 
Abstract
The subject of this thesis is to model and predict the outcome of professional football matches played in the premier leagues around the globe. For this purpose a number of statistical learning methods are employed and models are fit to publicly available data. After gathering the simple data from the relevant websites, numerous variables are constructed to further capture the relative strength of each team. The second chapter of the thesis is dedicated to explaining the dataset constructed from these variables and their relationship with the response variables. The statistical learning commences in the third chapter by fitting classification models to a training subset of the data. For these models the response variable is categorical, taking on three values: a win for either team or a draw. The models considered are linear and quadratic discriminant analysis, k-nearest neighbours, random forest, boosted classification trees and support vector machines. For each model, the fit to the training set is analysed using an estimate of the misclassification rate and calibration plots. The fourth chapter explores the use of regression models for this task. The response variable is now either the goals scored by each team or the goal difference. Models fit to the goal difference of each team are then combined into one unified prediction of the goal difference. The models tried for this task are generalized linear models, random forest and boosted regression trees. The prediction accuracies of the best-performing models in these two chapters are the subject of the fifth and final results chapter. The goal count estimates of the regression models are translated into the same categorical results as were modelled by the classification models, to allow comparison between all methods. The best-performing model was found to be the boosted classification trees, with a prediction accuracy of 50.5%.


Lin Zhu 
Confidence Curves in Medical Research 
Leonhard Held
Markus Kalisch 
Mar2015 
Abstract
This thesis briefly reviews the development of confidence distributions. It introduces the modern definitions of a confidence distribution, confidence density and confidence curve, along with point estimators based on a confidence distribution. Then different methods for constructing confidence curves are given for cases without and with nuisance parameters, respectively. The pivotal approach and the deviance-based approach are applied to cases both with and without nuisance parameters. The half-correction approach is applied to discrete data. The simulation or bootstrap approach is applied to cases with nuisance parameters. We take the exponential distribution, the binomial distribution, the Weibull distribution, the gamma distribution and the comparison of two binomials as examples to study the differences between the approaches.
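As a small worked example of the pivotal construction, consider the mean of a normal sample with known standard deviation. The confidence distribution is C(theta) = Phi(sqrt(n) (theta - xbar) / sigma), and the confidence curve is cc(theta) = |1 - 2 C(theta)|. The sketch below is an illustration under these textbook assumptions; the function names and numbers are hypothetical and not taken from the thesis.

```python
from statistics import NormalDist


def confidence_curve(theta, sample_mean, sigma, n):
    """Confidence curve |1 - 2 C(theta)| for a normal mean with known sigma."""
    standard_error = sigma / n ** 0.5
    c = NormalDist().cdf((theta - sample_mean) / standard_error)
    return abs(1.0 - 2.0 * c)


def confidence_interval(level, sample_mean, sigma, n):
    """Two-sided interval obtained by cutting the confidence curve at `level`."""
    standard_error = sigma / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


# Hypothetical data summary: mean 10 from n = 25 observations with sigma = 2.
lower, upper = confidence_interval(0.95, sample_mean=10.0, sigma=2.0, n=25)
curve_at_estimate = confidence_curve(10.0, sample_mean=10.0, sigma=2.0, n=25)
curve_at_lower = confidence_curve(lower, sample_mean=10.0, sigma=2.0, n=25)
```

The curve is zero at the point estimate and reaches the chosen level exactly at the endpoints of the corresponding interval, so a single plot of cc(theta) displays confidence intervals of all levels at once.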


Anita Kaufmann 
Crime Linkage 
Jacob De Zoete
Marloes Maathuis 
Mar2015 
Abstract
Crime linkage studies settings where similarities among several crimes suggest execution by the same offender. Due to their linkage, evidence from an individual case becomes relevant for the entire group of crimes. After giving a short introduction to Bayesian networks, we demonstrate how they can be used to model crime linkage settings. In a next step, a review of two research papers concerning this topic is provided. For a better understanding we outline the most important parts in detail. The papers in focus only present examples for a small number of crimes, since the complexity increases exponentially with the number of crimes considered. We aim to avoid this fast increase in complexity by proposing simplifying adaptations of the Bayesian network. Furthermore, we restrict the number of different offenders to m < n, where n is the number of crimes considered, since it is not very probable to have as many offenders as crimes in a crime linkage setting. The consequence is a reduction in the number of offender configurations, which should simplify the computation of settings with a larger number of crimes. We propose two possibilities for finding a reasonable value for m. The problem we encounter is that our adapted function for n crimes with at most m different offenders is not efficient and hence cannot be used for larger numbers of crimes. Nonetheless, comparing the two approaches for small numbers of crimes, we get very similar results. Consequently, the second approach is, at least for small numbers of crimes, faster and thus better suited for determining the number m of different offenders that have to be taken into consideration. In order to maintain its relevance for larger numbers of crimes as well, we furthermore propose a possible extension of the second approach.
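The growth in the number of offender configurations, and the reduction obtained by capping the number of offenders at m, can be illustrated with a short count. The sketch assumes that an offender configuration is a partition of the n crimes into groups committed by the same (unlabelled) offender, so that the count is a sum of Stirling numbers of the second kind. This modelling assumption and the function names are made for this example; the thesis may define configurations differently.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n items into exactly k non-empty groups."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Item n either joins one of the k existing groups or forms a new group.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def count_configurations(n_crimes, max_offenders):
    """Partitions of n_crimes into at most max_offenders groups (assumed configuration count)."""
    upper = min(n_crimes, max_offenders)
    return sum(stirling2(n_crimes, k) for k in range(1, upper + 1))


unrestricted = count_configurations(10, 10)  # every crime may have its own offender
restricted = count_configurations(10, 3)     # at most three different offenders
```

Under this assumption, ten crimes admit 115975 configurations without a cap but only 9842 when at most three offenders are allowed.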


Sheng Chen 
Random Projection in clustering, classification and regression  Markus Kalisch  Feb2015 
Abstract
This thesis studies the performance of Random Projection, one of the relatively new dimensionality reduction techniques, when applied to clustering, classification and regression, by reproducing or testing the results of three papers, Boutsidis and Zouzias (2010), Paul and Boutsidis (2013) and Kaban (2014), one from each of the three domains. Firstly, a review is given of the Johnson-Lindenstrauss lemma and its extensions, which form the theoretical foundation of Random Projection. Besides the early subgaussian and sparse matrices, new random matrices based on the Fourier transform are developed for faster computation. Secondly, the experiment in Random Projection-based K-means (Boutsidis and Zouzias, 2010) is reproduced. The result shows that when the dimension of the embedded space is large, RP-based K-means is comparable to K-means on the original data in terms of misclassification rate. Comparisons between RP, PCA and LS show that PCA outperforms RP in terms of misclassification rate, but RP needs only 19% of the time needed by PCA. Thirdly, for classification, part of the experiment in RP-based Support Vector Machine (Paul and Boutsidis, 2013) is run. The calculation shows that the misclassification rate of the RP-based SVM is not significantly larger than that of the SVM in the original space. However, the margin γ is significantly smaller. In the area of regression, Kaban (2014) proposed an upper bound on the excess risk of the OLS estimator in the embedded space and proved that Random Projection applies to a larger group of matrices, whose entries have mean 0, unit variance, a symmetric distribution and a finite fourth moment. The last part of the thesis runs experiments to examine the necessity of these assumptions on the random matrices and finds that each of them can be loosened without breaking the bounds.
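The distance-preservation property behind the Johnson-Lindenstrauss lemma can be checked empirically with a dense Gaussian projection matrix. The sketch below is a self-contained illustration, not a reproduction of any experiment in the thesis; the dimensions (1000 reduced to 200), the number of points and all names are arbitrary choices made for this example.

```python
import math
import random


def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries, so squared lengths are preserved in expectation."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional point to k dimensions."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


rng = random.Random(0)
d, k, n_points = 1000, 200, 5
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_points)]
matrix = random_projection_matrix(k, d, rng)
projected = [project(matrix, p) for p in points]

# Ratio of projected to original distance for every pair of points; values near 1
# indicate that the pairwise geometry survives the reduction from d to k dimensions.
ratios = [
    math.dist(projected[i], projected[j]) / math.dist(points[i], points[j])
    for i in range(n_points)
    for j in range(i + 1, n_points)
]
```

Sparse or Fourier-based matrices, as mentioned in the abstract, replace the dense Gaussian matrix to make the projection cheaper to compute.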


Ioan Gabriel Bucur 
Structural Intervention Distance for Maximal Ancestral Graphs  Markus Kalisch  Jan2015 
Abstract
In the process of causal inference, we are interested in accurately learning the causal structure of a data generating process from observational data, so as to correctly predict the effect of interventions on variables. In order to assess how accurate the output of an estimation method is, we would like to be able to compare causal structures in terms of their causal inference statements. Peters and Bühlmann have proposed the Structural Intervention Distance as a premetric between DAGs that provides a partial solution to the issue. However, the causal DAG may not be able to predict certain intervention effects in the presence of confounders. In this paper, we attempt to emulate the results of Peters and Bühlmann in a more realistic setting, where we observe only part of all relevant variables. We propose a new premetric, the Structural Intervention Distance for Maximal Ancestral Graphs (SIDM). A MAG is a causal structure which, unlike the DAG, is closed under marginalisation and can incorporate uncertainty about the presence of latent confounders. The SIDM allows us, under the assumption of no selection bias, to compare and contrast two MAGs based on their capacity for causal inference. The SIDM is consistent with the SID in its approach and provides valuable additional information to other metrics.

2014
Student  Title  Advisor(s)  Date 

Lukas Weber 
Model selection techniques for detection of differential gene splicing 
Mark Robinson
Peter Bühlmann 
Sep2014 
Abstract
Alternative splicing during the messenger RNA (mRNA) transcription stage of gene expression can generate vast sets of possible mRNA isoforms from individual genes. These mRNA isoforms can create functionally distinct proteins during subsequent protein translation, explaining the enormous diversity of proteins in organisms such as humans. Differential splicing experiments aim to use microarray or RNA sequencing (RNA-seq) technologies to detect genes exhibiting differences in splicing patterns between groups of biological samples, for example comparing diseased versus healthy samples, or treated versus untreated. In this thesis, we have tested whether model selection techniques can be used to improve the performance of existing statistical methods to detect differential gene splicing in RNA-seq data sets. The new methods were successful, and have been implemented as an R package available on GitHub.


Lucas Enz 
The Lasso and Modifications to Control the False Discovery Rate 
Sara van de Geer
Benjamin Stucky 
Aug2014 
Abstract
Nowadays, much attention is devoted to high-dimensional data sets where the number of predictors $p$ is much larger than the number of observations $n$. One example is detecting which genes are responsible for a specific biological function of our body. Because microarray measurements are very costly, we normally have at most a few hundred observations but thousands of candidate genes that could control the process we want to study. Because we have many more predictor variables than observations, we cannot compute a unique solution. Tibshirani (1996) introduced a method called the Lasso, which deals precisely with this problem and sets some coefficients exactly to zero. In other words, the Lasso can exclude some predictors from the model. Nevertheless, the Lasso sometimes selects many predictor variables that are in truth not responsible for the observed process. As a consequence, the false discovery rate (FDR), defined as the expected proportion of irrelevant predictor variables among all selected variables, is not controlled in some models. In this paper we focus on a new procedure that controls the FDR better but does not exclude too many predictor variables that are actually relevant for the process, i.e. we do not lose too much power. This paper is mainly based on the work of Candès and co-authors (2013, and an updated version) on the procedure they introduced, called SLOPE. We analyze the improvement of SLOPE in high-dimensional examples for the linear model with Gaussian and orthogonal design matrices. Finally, we adapt the idea of SLOPE to the group Lasso, which is very useful if the predictor variables can be grouped and a whole group of regression variables selected or excluded together. We present an extension of the group Lasso named SIPE and test its performance in sparse scenarios via a simulation study.
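SLOPE replaces the single Lasso penalty level by a decreasing sequence of levels applied to the coefficients sorted by magnitude. The sketch below evaluates this sorted L1 penalty with the Benjamini-Hochberg-type sequence lambda_i = Phi^{-1}(1 - i q / (2p)) that is commonly associated with SLOPE. It only evaluates the penalty and does not fit a model; the function names and the target level q = 0.1 are assumptions for this illustration, and the exact sequence used in the thesis is not stated in the abstract.

```python
from statistics import NormalDist


def bh_lambdas(p, q):
    """Decreasing penalty levels lambda_i = Phi^{-1}(1 - i * q / (2 * p)), i = 1..p."""
    normal = NormalDist()
    return [normal.inv_cdf(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)]


def sorted_l1_penalty(coefficients, lambdas):
    """Sum of lambda_i times the i-th largest absolute coefficient."""
    magnitudes = sorted((abs(b) for b in coefficients), reverse=True)
    return sum(lam * m for lam, m in zip(lambdas, magnitudes))


lambdas = bh_lambdas(p=5, q=0.1)
# The largest coefficient is charged the largest penalty level.
penalty = sorted_l1_penalty([0.0, -3.0, 0.0, 0.0, 0.0], lambdas)
# With a constant sequence the sorted L1 penalty reduces to the ordinary Lasso penalty.
lasso_penalty = sorted_l1_penalty([1.0, -2.0, 3.0], [0.5, 0.5, 0.5])
```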


Hannes Toggenburger 
Joint Modelling of Repeated Measurement and Time-to-Event Data, with Applications to Data from the IeDEA-SA 
Marloes Maathuis
Matthias Egger
Klea Panavidou 
Aug2014 
Abstract
After the start of antiretroviral therapy (ART), the low CD4 count in an HIV-positive patient typically recovers to a regular level. By measuring the CD4 count repeatedly, a patient's individual CD4 trajectory is known at a discrete set of times. Different approaches have been taken to model CD4 counts in order to obtain continuous trajectories. If ART stops working, the CD4 count starts to decay again. Such a treatment failure, or in particular the time of its occurrence, is modelled by survival models. In this work, the repeated measurement outcomes of the CD4 count are modelled with a nonlinear mixed-effects (NLME) model with three random effects. The time-to-event data is modelled with a log-normal accelerated failure time (AFT) model. These two models are merged into a random-effects-dependent joint model. Broadly speaking, this means that the random effects of the NLME model are used as continuous predictors in the AFT model. Different approaches to estimating the involved parameters via the maximum likelihood method, and their pitfalls, are discussed. The final model is applied to real data from the International epidemiologic Databases to Evaluate AIDS in Southern Africa (IeDEA-SA).


Andreas Puccio 
A review of two model-based spike sorting methods  Marloes Maathuis  Aug2014 
Abstract
In modern neuroscience, extracellular recordings play an important role in the analysis of neuron activity. Whereas earlier experiments were based on single electrodes, modern settings consist of a large number of channels that record data from multiple cells simultaneously. In such settings, every electrode will record action potentials from all nearby neurons, visible as spikes whose shape depends on various factors. The problem of spike sorting, in a nutshell, is to detect the occurrence of such spikes in multi-electrode voltage recordings and to classify them, i.e., to identify the corresponding neurons. A widely used approach is a so-called clustering method consisting of a thresholding step to detect the occurrence of a spike, a feature-reduction step (e.g. PCA) and a classification ("sorting") step based on these features. However, this method has several disadvantages, an important one being the inability to handle overlapping spikes. After an introduction to the problem of spike sorting and the data encountered in such settings, we review two different modern spike sorting frameworks, one being binary pursuit (Pillow, Shlens, Chichilnisky, and Simoncelli, 2013) and the other one relying on a method called continuous basis pursuit (Ekanadham, Tranchina, and Simoncelli, 2014). These frameworks use a statistical model for the recorded voltage trace and do not rely on a clustering procedure for the spike train estimation. We present an implementation of binary pursuit in MATLAB, conduct a performance assessment of this algorithm using simulated data and identify advantages and disadvantages of model-based spike sorting algorithms.


Laura Casalena 
Statistical inference for the inverse covariance matrix in high-dimensional settings 
Sara van de Geer
Jana Jankova 
Aug2014 
Abstract
The focus of this work is the problem of estimating the inverse covariance matrix $\Theta^*$ in a high-dimensional setting. High-dimensionality is reflected by allowing $p$ to grow as a function of $n$, but for our results to hold we require $p = o(\exp(n))$. We propose four different estimation methods for $\Theta^*$ and study their asymptotic properties under appropriate distributional assumptions as well as model assumptions on the concentration matrix $\Theta^*$. In particular, whenever possible, we give rates of convergence in various matrix norms and state results which prove asymptotic normality of each individual element $\Theta^*_{ij}$. Consequently, we construct asymptotic confidence intervals for $\Theta^*_{ij}$. Finally, we illustrate the theoretical results through numerical simulations.


Fabio Ghielmetti 
Causal Effect Estimation of Structural Pricing Changes in the Airline Industry 
Peter Bühlmann
Karl Isler 
Aug2014 
Abstract
Pricing changes in the airline industry occur on a daily basis, but their revenue effects are difficult to measure. This problem, namely inferring the causal effect of a pricing change on revenue, can be modeled by a structural equation model (SEM) and a causal graph. A recently published paper (Ernest and Bühlmann, 2014) showed that causal effects within SEMs can be inferred directly from an additive model, even if the true underlying relationships are not additive. After introducing the subject of airline revenue management and the mathematical tools to infer causal effects, this recent result is applied to actual airline data. Following the identification of the corresponding causal graph, multiple additive models are fitted: with several levels of data aggregation and a comparison of different subsets, the sensitivity of the causal effect estimation is tested. Finally, the results are discussed and interpreted.


Shu Li 
Causal Reasoning in Time Series Analysis through Additive Regression 
Peter Bühlmann
Jan Ernest 
Aug2014 
Abstract
Causal inference has evolved from its fragmented early days towards a more unified and formal framework with diverse applications ranging from brain mapping to the modeling of gene regulatory pathways. In a time series setting, causal reasoning revolves predominantly around Granger causality, disregarding recent advances in structural equation or graphical modelling. We use structural equation modelling to explore the potential of intervention-based causal inference from observational time series data. Drawing inspiration from a recent result by Ernest and Bühlmann (2014), we propose a novel approach for inferring causal effects in AR(p) models: Addtime, short for additive regression in time series analysis. Our method is theoretically sound, even for nonlinear or non-additive AR(p) models, and computationally efficient, requiring on average 0.5s per intervention and enabling potentially high-dimensional applications. Empirically, Addtime is able to recover the true effect in simulated and real data. Within the scope of (nonlinear) time series, the effect of interventions is largely unexplored. Our approach can be regarded as a safe benchmark for univariate time series and generalizes to the multivariate case without further constraints.


Anja Franceschetti 
Alternatives to Generalized Linear Models in Non-Life Pricing 
Lukas Meier
Christoph Buser 
Jul2014 
Abstract


Christina Heinze 
Random Projections in High-dimensional and Large-scale Linear Regression  Nicolai Meinshausen  Jul2014 
Abstract
We study the use of Johnson-Lindenstrauss random projections in different regression settings. First, we examine the high-dimensional case, where the number of variables p greatly exceeds the number of observations n. Specifically, we consider so-called compressed least-squares regression (CLSR). CLSR reduces the dimensionality of the data by a random projection before applying ordinary least squares regression to this compressed data set. We perform an empirical comparison of predictive performance between CLSR and other widely used methods for high-dimensional least squares estimation, such as ridge regression, principal component regression and the Lasso. Our results suggest that an aggregation scheme which averages the predictions of CLSR over a number of independent random projections can greatly improve predictive accuracy. This extension of CLSR performs similarly to the competing methods on a variety of real data sets. Subsequently, we experiment with two variable importance measures. The first exploits the fact that omitting variables in the original high-dimensional data set does not necessarily have to change the projection dimension, which allows the estimated regression coefficients to be compared directly in the compressed space. The second is based on the change in mean squared prediction error. For both importance measures we explore whether the importance of clusters of highly correlated variables can be identified correctly. We find that the procedures work reasonably well for synthetic data sets with large signal-to-noise ratios (SNRs) and no inter-cluster correlations. However, the randomness in the projection matrix makes detection difficult for data sets with low SNRs. Also, different correlation structures between clusters pose significant challenges. Lastly, we look at the large-scale setting where both p and n are very large, and possibly p > n. We develop a distributed algorithm, LOCO, for large-scale ridge regression. Specifically, LOCO randomly assigns variables to different processing units. The dependencies between variables are preserved using random projections of those variables that were assigned to the respective remaining workers. Importantly, the communication costs of LOCO are very low. In the fixed design setting, we show that the difference between the estimates returned by LOCO and the exact ridge regression solution is bounded. Experimentally, LOCO obtains significant speedups as well as good predictive accuracy. Notably, LOCO is able to solve a regression problem with 5 billion nonzeros, distributed across 128 workers, in 25 seconds.
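The dimension reduction step behind compressed least squares can be sketched as follows. This is a minimal standard-library Python illustration that assumes a dense Gaussian projection matrix with i.i.d. N(0, 1/k) entries, one common Johnson-Lindenstrauss construction; the thesis may use a different one. The dimensions, seed and function names are hypothetical.

```python
import math
import random


def gaussian_projection_matrix(p, k, rng):
    """k x p matrix with i.i.d. N(0, 1/k) entries (assumed construction)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(p)] for _ in range(k)]


def project(matrix, x):
    """Map a p-dimensional vector to k dimensions."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def squared_norm(v):
    return sum(c * c for c in v)


rng = random.Random(0)
p, k = 500, 200  # original and compressed dimension (toy values)
phi = gaussian_projection_matrix(p, k, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(p)]
z = project(phi, x)

# The squared length is preserved in expectation, so this ratio is near 1.
ratio = squared_norm(z) / squared_norm(x)
print(len(z), round(ratio, 3))
```

In CLSR, ordinary least squares would then be fitted to the projected covariates; averaging predictions over several independent draws of the projection matrix corresponds to the aggregation scheme mentioned in the abstract.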


Sabrina Dorn 
Local Polynomial Matching and Considerations with Respect to Bandwidth Choice  Sara van de Geer  Jul2014 
Abstract
This master's thesis considers local polynomial matching, which is a popular method in econometrics for estimating counterfactual outcomes and average treatment effects. We discuss identification of counterfactual expectations under conditional independence, give an overview of selected properties of the local polynomial matching estimator, and apply these to calculate the mean squared error of the corresponding two-step estimator for approximating polynomials of general order. Finally, this enables us to derive and implement a feasible mean squared error criterion that can be minimized numerically, and to provide some evidence of its reasonable performance in an empirical application to the NSW and PSID data.


Olivier Bachem 
Coresets for the DP-Means Clustering Problem 
Andreas Krause
Markus Kalisch 
Jul2014 
Abstract


Valentina Lapteva 
Different Stability Selection Models for Structure Learning  Nicolai Meinshausen  Jul2014 
Abstract
Recent developments in analytics, high performance computing, machine learning, and databases have made it possible to collect and process web-scale datasets. Not only does the number of samples increase dramatically, but so does the number of features observed and evaluated. Big data analysis, in turn, requires rare experts who fully understand all the attributes of the data and the connections between them, which can be costly, if possible at all. All this makes the problem of automated structure discovery particularly acute. The task of structure learning attracts a lot of attention, with many new algorithms being proposed in recent years. However, all of them depend heavily on the choice of a regularization parameter. To deal with this problem, the Stability Selection technique (Meinshausen and Bühlmann, 2010) was proposed. The original formulation of Stability Selection bounds the expected number of falsely selected variables. In this thesis we explore the problem of learning the structure of an undirected Gaussian graphical model. We extensively explore the properties of Stability Selection when applied in combination with different structure estimators, such as the Graphical Lasso (Friedman, Hastie, and Tibshirani, 2008), CLIME (Cai, Liu, and Luo, 2011) and TIGER (Liu and Wang, 2012). We also propose and explore, for the first time, a variety of different models that are based on the Stability Selection approach but rely on different types of assumptions or incorporate different types of constraints. For example, we show how to incorporate prior knowledge about the sparsity pattern and topological constraints, such as connectivity or the maximum number of edges adjacent to every node. We also explore assumptions based on the properties of an estimator, such as homogeneous type I and type II discrepancies, or an underlying logistic model relating the estimator output to the output of the method. We show that in some cases, either when the prior assumptions hold or when the graphical model structure is dense, the proposed models can serve as a better regularizer for Stability Selection than the original formulation.
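The basic Stability Selection loop can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the thesis's graphical-model setting: it selects variables in a regression rather than edges in a graph, and the base selector (`top_q_selector`, which keeps the q columns most correlated with the response) is a hypothetical stand-in for estimators such as the Graphical Lasso. All names and toy settings are assumptions.

```python
import random


def correlation(xs, ys):
    """Sample Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


def top_q_selector(X, y, q):
    """Stand-in base selector: the q columns most correlated with y."""
    p = len(X[0])
    scores = [abs(correlation([row[j] for row in X], y)) for j in range(p)]
    return set(sorted(range(p), key=lambda j: -scores[j])[:q])


def stability_selection(X, y, q, n_subsamples, threshold, rng):
    """Selection frequencies over random half-samples, then thresholding."""
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # subsample without replacement
        chosen = top_q_selector([X[i] for i in idx], [y[i] for i in idx], q)
        for j in chosen:
            counts[j] += 1
    freqs = [c / n_subsamples for c in counts]
    stable = {j for j, f in enumerate(freqs) if f >= threshold}
    return stable, freqs


rng = random.Random(1)
n, p = 100, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] + rng.gauss(0, 1) for row in X]  # only column 0 matters
stable, freqs = stability_selection(X, y, q=2, n_subsamples=50,
                                    threshold=0.8, rng=rng)
print(stable, freqs)
```

The final decision depends on the selection frequencies and the threshold rather than on a single regularization parameter, which is the property the abstract builds on.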


Gian Andrea Thanei 
Dimension reduction techniques in regression  Nicolai Meinshausen  Jul2014 
Abstract


Maximilien Vila 
Statistical Validation of Stochastic Loss Reserving Models Submission 
Lukas Meier
Jürg Schelldorfer 
Jul2014 
Abstract
Claims reserving in non-life insurance is the task of predicting claims reserves for the outstanding loss liabilities. There are many methods and models to set the predicted claims reserves. However, in order to quantify the total prediction uncertainty of the full run-off risk (long term view) or the one-year risk (short term view), a corresponding stochastic model is needed. In practice, one usually compares the results of several stochastic models in order to determine the appropriate claims reserves and their uncertainties. From a statistical point of view, all these stochastic models require a thorough consideration of the data as well as a check of whether the model assumptions are fulfilled. In this thesis we investigate these issues by focusing on four different models: the distribution-free Chain Ladder model, the Cumulative Log Normal model, a Bornhuetter-Ferguson model and generalized linear models. We present known statistical tools and some newly developed data plots and model checking graphics to support the decision for the appropriate stochastic model. Different numerical examples are used to illustrate the procedure of model checking. Public triangles and AXA triangles were considered, and the conclusions coincide. For this reason, and for confidentiality, we present only the results for the publicly available data.
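For readers unfamiliar with the Chain Ladder method, the point prediction it produces can be sketched as below. This Python example uses a small made-up cumulative run-off triangle and the usual volume-weighted development factors; it covers only the reserve estimate, not the prediction uncertainty or model checks that the thesis is concerned with. Function names and numbers are hypothetical.

```python
def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative triangle."""
    n = len(triangle)
    factors = []
    for j in range(n - 1):
        # Only accident years observed at both development periods j and j + 1.
        rows = [row for row in triangle if len(row) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    return factors


def chain_ladder_reserves(triangle):
    """Predicted ultimate minus latest observed claim, per accident year."""
    factors = development_factors(triangle)
    reserves = []
    for row in triangle:
        ultimate = row[-1]
        for f in factors[len(row) - 1:]:
            ultimate *= f
        reserves.append(ultimate - row[-1])
    return reserves


# Rows are accident years, columns are development years (cumulative claims).
triangle = [
    [100.0, 150.0, 165.0],
    [120.0, 180.0],
    [140.0],
]
factors = development_factors(triangle)
reserves = chain_ladder_reserves(triangle)
print(factors, reserves)
```

Here the factors are 330/220 = 1.5 and 165/150 = 1.1, so the youngest accident year is projected to 140 × 1.5 × 1.1 = 231 and carries a reserve of 91.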


Colin Stoneking 
Bayesian inference of Gaussian mixture models with noninformative priors  Peter Bühlmann  May2014 
Abstract
This thesis deals with Bayesian inference of a mixture of Gaussian distributions. A novel formulation of the mixture model is introduced, which includes the prior constraint that each Gaussian component is always assigned a minimal number of data points. This enables noninformative improper priors such as the Jeffreys prior to be used for the component parameters. We demonstrate difficulties involved in specifying a prior for the standard Gaussian mixture model, and show how our new model can be used to overcome these. MCMC methods are given for efficient sampling from the posterior of this model.


Alexandra Ioana Negrut 
Traffic safety in Switzerland  Hans R. Künsch  May2014 
Abstract
More than 50,000 car accidents occurred on Swiss roads in 2012. With new data at hand, the Traffic Engineering department of ETH Zurich was interested in finding out which factors determine the severity of a car accident. Moreover, they were interested to know what determines a certain cause and type of car crash. In order to answer this first set of questions, parametric and nonparametric methods were used and then compared in terms of misclassification errors and variable ranking. The results confirmed that in order to predict the accident's severity level, one also needs information about the events that didn't happen. In the second part of the thesis, the severe crash frequency was investigated on five of Switzerland's motorways. It was shown that the higher the average daily traffic volume (DTV), the higher the number of severe accidents.


Yannick Trant 
Stock Portfolio Selection with Random Forests 
Peter Bühlmann
Thorsten Hens 
May2014 
Abstract
Applications of machine learning algorithms to stock selection usually focus on technical parameters and limited sets of fundamental company ratios. In this study the complete balance sheet, income statement and cash flow statement information of US companies from 1989 to 2013 is used as model input. The amount and inhomogeneous distribution of missing values is a key characteristic and difficulty in working with this data. I present a structured way to prepare this challenging dataset for statistical learning methods. The fundamental data is complemented by a wide range of technical indicators. In this work the predictive power of random forests with respect to stock return prediction is explored on a calibration period from 1989 to 2006 using this huge data set. My results show that a small but significant predictive power with respect to ranked returns can be attained for an ‘extreme’ random forest parametrization. The calibrated random forest parametrization raises interesting questions with respect to the nature of the data set. Based on the random forest predictions, simple investment strategies are formulated. They exhibit significant outperformance in an out-of-sample back test for the period from 2006 to 2013. The risk-adjusted performance measures are on a level with the latest stock selection criteria in the finance literature. Throughout my work I illustrate the challenging peculiarities of working with equity data and propose solutions originating both from finance and mathematics.


Annette Aigner 
Statistical Analysis of Lower Limb Performance Assessments in Patients with Spinal Cord Injuries 
Marloes Maathuis
Armin Curt
Lorenzo Tanadini 
May2014 
Abstract
Based on longitudinal data from spinal cord injured patients participating in the European Multicenter Study about Spinal Cord Injury, the focus of this thesis lies on the assessments of lower limb performance. Initially, the performance measures' abilities to capture change in a patient's walking ability are measured and their relationships with each other assessed. Based on these results, two measures are identified to subsequently explore the possibility of modelling a patient's recovery in these two outcome measures. Finally, the potential of predicting the extent to which patients will regain their walking ability is examined. The methods were chosen such that the results best help answer the respective research questions: nonparametric two-sample testing, canonical correlation analysis, principal component analysis, latent class factor analysis, as well as linear mixed effects models and random forests. The findings show that the scores, currently used on an equal footing for assessing lower limb performance, only apply to certain patients. Therefore, there are subgroups of scores associated with specific patient groups. Out of the six walking tests (6MWT, 10MWT, TUG, SCIM3a, SCIM3b, WISCI), 6MWT and SCIM3b exhibit the desired characteristic of responsiveness and turn out slightly better, and above all most consistent, with respect to the assessment of the interdependency of all scores. Regarding the potential of modelling recovery, i.e. the development over time, the effect of time on 6MWT exhibits a log-like trend. On the other hand, the recovery measured with SCIM3b has a different development, for which time alone may even have a negative influence. The results for the prediction of these two outcomes, six months after injury, showed that such an endeavor is very difficult and will therefore have low accuracy if applied to new patients.


Claude Renaux 
Confidence Intervals Adjusted for High Dimensional Selective Inference  Peter Bühlmann  Apr2014 
Abstract
There is a growing demand for determining statistical uncertainty, which is a largely unexplored field for high-dimensional data. The main focus of this thesis lies on confidence intervals adjusted for selective inference in the high-dimensional case. Selective inference denotes the selection of some covariables and construction of the corresponding confidence intervals based on the same data. This results in a bias, namely the selection effect. One can correct for the selection effect by adjusting the marginal confidence level. We select some covariables and apply this adjustment to Bayesian confidence intervals based on Ridge regression and frequentist confidence intervals based on de-sparsifying the Lasso. Furthermore, we summarize the theory of selective inference and of the methods used to construct confidence intervals. The methods are demonstrated on a real data set, and large simulations on synthetic and semi-synthetic data sets are carried out. Two of the three methods proposed to construct Bayesian confidence intervals based on Ridge regression perform well only in some setups. Furthermore, our simulations show that the False Coverage-statement Rate (FCR) criterion is controlled and the power takes high values for the confidence intervals based on de-sparsifying the Lasso. Moreover, the implementation of the de-sparsified Lasso can be changed for the purpose of selective inference, which results in computations finishing in 1% to 6.5% of the time with only slight changes in the results. The results are useful for settings where selective inference is appropriate and high-dimensional data is present.
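One widely used way of adjusting the marginal confidence level is the Benjamini-Yekutieli recipe: if R of m parameters are selected and the target FCR is q, each selected parameter receives a marginal interval at level 1 − R·q/m. The Python sketch below assumes this recipe and a normal approximation; whether the thesis uses exactly this adjustment is an assumption, and the numbers and function names are hypothetical.

```python
from statistics import NormalDist


def fcr_adjusted_level(m, n_selected, q):
    """Marginal confidence level 1 - R*q/m for the R selected parameters."""
    return 1.0 - n_selected * q / m


def normal_interval(estimate, std_error, level):
    """Two-sided interval from a normal approximation."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return estimate - z * std_error, estimate + z * std_error


m, n_selected, q = 100, 5, 0.05  # toy setting: 5 of 100 covariables selected
level = fcr_adjusted_level(m, n_selected, q)

# Same estimate and standard error, with and without the adjustment.
unadjusted = normal_interval(2.0, 0.5, 1.0 - q)
adjusted = normal_interval(2.0, 0.5, level)
print(level, unadjusted, adjusted)
```

The adjusted interval is wider than the unadjusted one, which is the price paid for constructing intervals only for covariables chosen from the same data.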


Christoph Dätwyler 
Causality in Time Series, a Time Series Version of the FCI Algorithm and its Application to Data from Molecular Biology  Marloes Maathuis  Apr2014 
Abstract
Among many other concepts, Granger causality has become popular to infer causal relations in time series. In the first part of this work we give a short introduction to this topic, whereby we see that Granger causality can be formulated in terms of conditional orthogonality or conditional independence and can be closely linked to path diagrams, which provide a convenient way of visualising causal relationships among the factors/variables of interest. A concept called m-separation then provides us with a graphical criterion to infer conditional orthogonality relations in path diagrams, and we conclude the first part with a precise statement linking m-separation and Granger causality. The second part then deals with the FCI algorithm, which has been designed to infer causal relations among systems of variables, where possibly not all of them have been observed. Furthermore, we present an adaptation of the original FCI algorithm to the framework of time series data. In the last part of this thesis we apply the time series version of the FCI algorithm to a dataset from molecular biology, with the goal of inferring causal relations among the factors of interest and thereby getting a better understanding of how the transcription process of genes works.


Thomas Schulz 
A Clustering Approach to the Lasso in the Context of the HAR model 
Peter Bühlmann
Francesco Audrino 
Apr2014 
Abstract
We discuss a covariate clustering approach to the Lasso and compare it to the standard Lasso in the context of the HAR model. We analyze the difference in forecasting error between these models on historical volatility data and find that the error tends to be slightly larger for the clustering approach. Subsequently, we employ the same data to compare the stability of the chosen coefficients for the considered models and we observe that the clustering approach achieves better results than the standard Lasso. Finally, we conduct a data simulation analysis to study stability issues in a synthetic HAR setting and conclude again that the coefficients selected by the clustering approach appear to be more stable.
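For context, the standard HAR (heterogeneous autoregressive) specification regresses next-day realized volatility on the current daily value and on its averages over the last 5 and 22 trading days. The Python sketch below builds these three regressors from a volatility series; it assumes this standard specification and uses a made-up series, and it does not reproduce the clustering or Lasso steps of the thesis.

```python
def har_rows(rv):
    """Pairs of (daily, weekly, monthly) regressors and next-day targets."""
    rows = []
    for t in range(21, len(rv) - 1):
        daily = rv[t]
        weekly = sum(rv[t - 4:t + 1]) / 5      # average of the last 5 days
        monthly = sum(rv[t - 21:t + 1]) / 22   # average of the last 22 days
        rows.append(((daily, weekly, monthly), rv[t + 1]))
    return rows


# Toy "realized volatility" series 1, 2, ..., 30 to make the averages easy to check.
rv = [float(v) for v in range(1, 31)]
rows = har_rows(rv)
print(rows[0])
```

A Lasso-based variant would typically start from the 22 individual lags instead of the three fixed averages and let the penalty, or a clustering of the lags, determine the aggregation.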


Huan Liu 
Incorporating Prior Knowledge in CPDAGs  Marloes Maathuis  Mar2014 
Abstract
A causal model can be represented as a graph, with each node representing a variable and each edge representing a causal relationship. A completed partially directed acyclic graph (CPDAG) represents such causal models with no hidden variables, where either orientation of each undirected edge is possible. A piece of causal prior knowledge is given as the existence or absence of a directed path from one variable to another. This paper provides an algorithm to incorporate a set of such prior knowledge into a CPDAG. It uses the chordal properties of a CPDAG to separate the undirected part of the graph into connected subgraphs, and then applies Meek’s rules and related theorems to incorporate all the prior knowledge. This paper also proves the correctness of the incorporation for both positive and negative prior knowledge. Furthermore, a simulation is carried out to test and compare the performance of the algorithm.


Lana Colakovic 
Classification using Random Ferns  Nicolai Meinshausen  Mar2014 
Abstract
Random Ferns are a supervised learning algorithm for classification introduced recently by Özuysal, Fua, Calonder, and Lepetit (2010) as a simpler and faster alternative to Random Forests (Breiman, 2001), with specific application in image recognition. In contrast to trees, ferns have a non-hierarchical structure, and the aggregation is performed by multiplication rather than averaging. Also, they rely on completely random selection of features as well as split points. The aim of this master's thesis is to investigate general properties of Random Ferns and compare them to Random Forests. We want to see if, and under which circumstances, Random Ferns are comparable in performance to Random Forests. We implemented the Random Ferns algorithm in R and used simulated as well as real data sets to investigate Random Ferns' properties in more detail.
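The structure described above (random features and split points, a flat list of binary tests per fern, aggregation by multiplication) can be sketched as follows. This is a simplified Python illustration with add-one smoothing and products taken in log space, not the thesis's R implementation; the class name, defaults and toy data are hypothetical.

```python
import math
import random


class RandomFerns:
    def __init__(self, n_ferns=20, depth=3, seed=0):
        self.n_ferns, self.depth = n_ferns, depth
        self.rng = random.Random(seed)

    def _leaf(self, tests, x):
        """Index in [0, 2**depth) built from the binary test outcomes."""
        index = 0
        for feature, threshold in tests:
            index = 2 * index + (1 if x[feature] > threshold else 0)
        return index

    def fit(self, X, y):
        p = len(X[0])
        lows = [min(row[j] for row in X) for j in range(p)]
        highs = [max(row[j] for row in X) for j in range(p)]
        self.classes = sorted(set(y))
        self.log_prior = {c: math.log(y.count(c) / len(y)) for c in self.classes}
        self.ferns = []
        for _ in range(self.n_ferns):
            # Completely random features and split points, no optimisation.
            tests = []
            for _ in range(self.depth):
                j = self.rng.randrange(p)
                tests.append((j, self.rng.uniform(lows[j], highs[j])))
            # Class-conditional leaf counts with add-one smoothing.
            counts = {c: [1] * (2 ** self.depth) for c in self.classes}
            for row, label in zip(X, y):
                counts[label][self._leaf(tests, row)] += 1
            log_probs = {c: [math.log(v / sum(counts[c])) for v in counts[c]]
                         for c in self.classes}
            self.ferns.append((tests, log_probs))
        return self

    def predict(self, x):
        # Multiplying fern probabilities equals summing their logarithms.
        scores = dict(self.log_prior)
        for tests, log_probs in self.ferns:
            leaf = self._leaf(tests, x)
            for c in self.classes:
                scores[c] += log_probs[c][leaf]
        return max(self.classes, key=lambda c: scores[c])


# Two well-separated Gaussian clusters as toy training data.
rng = random.Random(42)
X = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(100)]
X += [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(100)]
y = [0] * 100 + [1] * 100
model = RandomFerns().fit(X, y)
accuracy = sum(model.predict(r) == c for r, c in zip(X, y)) / len(y)
print(accuracy)
```

Each fern is effectively a small lookup table, which is why training and prediction are cheap compared with growing trees.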


Christoph Kovacs 
Semi-supervised Label Propagation Models for Relational Classification in Dyadic Networks: Theory, Application and Extensions 
Marloes Maathuis
Lukas Meier 
Feb2014 
Abstract
If a dataset not only comprises instance features but also exhibits a relational structure between its elements, it can be represented as a network with nodes defined by instances and links defined by relations. Data analysis can be performed on such a structure under the statistical relational learning (SRL) paradigm. Two of its basic cornerstones, collective classification and collective inference, can be carried out by semi-supervised label propagation (SSLP) algorithms, which allow for label information to be propagated and updated through the network to arrive at class affiliation predictions for unlabeled nodes. For this purpose, harmonic functions have been applied on Gaussian random fields and adapted accordingly, leading to the weighted-vote Relational Neighbor classifier with Relaxation Labeling (wvRN-RL). Extending this approach to support social features, extractable from the network’s topology, results in the Social Context Relational Neighbor (SCRN) classifier. Moreover, MultiRankWalk (MRW), a classifier which uses ideas from random walk with restart, is presented and discussed. These different semi-supervised classification models are applied to nine dyadic networks and their prediction performances are evaluated for various accuracy measures using the repeated Network Cross-Validation (rNCV) scheme. Ideas to relax certain model restrictions and to expand their applicability are outlined, together with a suggested measure of unlabeled node importance (MIUN statistic). In order to provide an adequate visualization of the obtained results, a new means of holistic visualization, the CircoClustogram, is proposed. A discussion of the advantages and disadvantages of semi-supervised label propagation and its applicability concludes this thesis.
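The propagation idea can be illustrated with a harmonic-function style update, in which every unlabeled node repeatedly takes the weighted average of its neighbours' scores while labeled nodes stay fixed. The Python sketch below is a simplified stand-in and not the wvRN-RL, SCRN or MRW classifiers themselves; the graph, weights and function name are hypothetical.

```python
def propagate_labels(neighbors, known, n_iter=200):
    """Iterate weighted neighbour averages; labelled nodes stay clamped."""
    scores = {node: known.get(node, 0.5) for node in neighbors}
    for _ in range(n_iter):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in known:
                updated[node] = known[node]
            else:
                total = sum(nbrs.values())
                updated[node] = sum(w * scores[v] for v, w in nbrs.items()) / total
        scores = updated
    return scores


# Path graph a - b - c - d with unit link weights.
neighbors = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
known = {"a": 1.0, "d": 0.0}  # score = probability of the positive class
scores = propagate_labels(neighbors, known)
print(scores)
```

On this path graph the scores converge to 2/3 for node b and 1/3 for node c, so each unlabeled node leans towards the class of its nearer labeled neighbour.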


Ambra Toletti 
Tree-based variational methods for parameter estimation, inference and denoising on Markov Random Fields  Sara van de Geer  Feb2014 
Abstract
The attention of statisticians and computer scientists to variational methods has increased considerably in the last few decades. While it has become (computationally) cheap to store huge amounts of multivariate data describing complex systems (e.g. in natural sciences, sociology, etc.), processing this information to obtain parameter estimates for the underlying statistical models, or to perform inference or denoising, is still infeasible in general. In fact, classical (exact) methods (e.g. computing Maximum Likelihood estimates via Iterative Proportional Fitting) need a huge amount of time for solving these problems if the complexity or size of the underlying model is sufficiently large. Markov Random Fields, which are widely used because of their convenient representations as both graphs and exponential families, are not immune to this problem. In this case it is possible to convert both inference and parameter estimation into constrained optimization problems connected with the exponential representation. Unfortunately, this transformation does not provide any improvement in feasibility, because it is often impossible to write the objective function explicitly and even the number of constraints is prohibitive. One can obtain a computationally cheaper (approximate) solution by appropriately relaxing the constraints and by approximating the objective function. In this work the relaxation was made by considering all combinations of locally consistent marginal distributions, and the objective function was approximated with a convex combination of Bethe entropy approximations based on the spanning trees of the underlying graph. Wainwright (2006) proved that parameter estimates obtained with this method are asymptotically normal but do not converge to the true parameter. However, if these estimates are used for purposes such as inference or denoising, their performance is comparable with that of exact methods. In this work, some empirical evidence confirming these properties for an Ising model on a grid graph was produced, and general definitions and results about graphical models and variational methods were summarized.


Tobia Fasciati 
Semi Supervised Learning  Markus Kalisch  Feb2014 
Abstract
The potential advantages of Semi Supervised Learning (SSL) compared to more traditional learning methods like Supervised and Unsupervised Learning have attracted many researchers in the recent past. The goal is to learn a classifier from data having both labeled and unlabeled observations by exploiting their geometrical position. The aim of this Master Thesis is to give an overview of SSL and study two different methods, Transductive Support Vector Machine and Anchor Graph Regularization. Finally, both approaches are tested on selected datasets.


David Bürge 
Causal Additive Models with Tree Structure: Structure Search and Causal Effects 
Peter Bühlmann
Jonas Peters 
Feb2014 
Abstract
Drawing conclusions about causal relations from data is a central goal in numerous scientific fields. In this thesis we study a special case of a restricted structural equation model (SEM). In addition to the common assumptions of acyclicity and no hidden confounders, we assume additive Gaussian noise, nonlinear functions and a causal structure represented by a directed acyclic graph (DAG) with tree structure. Given data from such a causal additive model with tree structure (CAMtree), we estimate the underlying tree structure and give characterisations of the causal effects from variables on others. This restricted model leads to several simplifications. Identifiability of the structure is guaranteed by a result from Peters et al. (2013). We present a method that efficiently finds a maximum likelihood estimator for the causal structure among all trees. As our method is based on local properties of the distribution, it extends without constraints to high-dimensional settings. Furthermore, we investigate how to characterise causal effects from one variable on others. The maximum mean discrepancy is used to quantify changes in the distribution of the effect variable when the potential cause is varied. Based on our estimate for the structure, we present a procedure which, given only observational data, predicts the strongest causal effects. All methods are implemented in R and we give experimental results for synthetic data and one set of real high-dimensional data.
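The maximum mean discrepancy (MMD) mentioned above compares two samples through kernel averages. The following Python sketch computes the biased (V-statistic) estimate of the squared MMD for one-dimensional samples with a Gaussian kernel; the kernel, bandwidth and sample values are assumptions for illustration and are not taken from the thesis, whose implementation is in R.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD for 1-D samples."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Effect variable sampled under a baseline and under a shifted "intervention".
baseline = [0.0, 0.5, 1.0, 1.5, 2.0]
same = mmd_squared(baseline, baseline)
shifted = mmd_squared(baseline, [v + 3.0 for v in baseline])
print(same, shifted)
```

An estimate near zero indicates that varying the potential cause barely changes the distribution of the effect variable, whereas larger values point to a stronger effect.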


Emilija Perkovic 
The FCI+ Algorithm  Markus Kalisch  Feb2014 
Abstract
The primary focus of this thesis was to understand and implement the FCI+ algorithm as described in “Learning Sparse Causal Models is not NP-hard” by Claassen, Mooij, and Heskes (2013a). To explain how this algorithm works, we give a short introduction to causality and examine some methods for dealing with causal data. First, we introduce the reader to the terminology and graphical representation of causal systems. Then we examine methods for dealing with data from causal systems when there are no hidden variables (PC), as opposed to those for when hidden variables are present (FCI, FCI+). Special attention is given to the theory behind the FCI+ algorithm. Finally, FCI and FCI+ are compared in terms of accuracy and computational time, and conclusions are drawn.


Xi Xia 
Comparison of Different Confidence Interval Method for Linear Mixed Effect Models  Martin Maechler  Feb2014 
Abstract
Our study is a simulation analysis of different confidence interval methods for the fixed-effect parameters in linear mixed-effects models. Two functions, the lmer function in the lme4 package and the lme function in the nlme package, are used to fit the linear mixed-effects models. Six different confidence interval methods from the packages lme4, nlme, lmerTest and boot are studied and compared. We conclude that the lmer and lme functions give similar results in fitting the LME models, but the bias grows as the number of fixed effects increases. For the confidence interval methods, a general finding is that most of the intervals are too small. Among all methods, the lmerTest method performs best: it has the lowest confidence interval MP among all methods, and its coverage rate is closest to the nominal rate (α). The drawback of lmerTest is that it sometimes returns an error or intervals which do not make sense (e.g. with an infinite boundary), and it runs significantly slower than lme4Wald and nlmeintervals. Lme4Wald and nlmeintervals are both very stable and fast, but their intervals are nearly always too small. The profile method is not better than lmerTest, and bootstrap-type methods perform worst. We also found that poor performance of confidence intervals may sometimes indicate overfitting in the model design.

2013
Student  Title  Advisor(s)  Date 

Vineet Mohan 
Grouped Regression in High Dimensional Statistics  Sara van de Geer  Oct2013 
Abstract
This work is devoted to clustered estimation in a sparse linear model where parameters are highly correlated and far outnumber the observations. Three variants of the group lasso technique from the literature are examined. They are found to be equivalent to a weighted lasso after some dimension reduction. The priors they impose on the parameters are used to suggest which class of problems they work best with. Based on this analysis, a new estimator which bases dimension reduction on principal component analysis is proposed. Empirical experiments follow to confirm the theoretical results.


Vasily Tolkachev 
Parameter Estimation for Diffusion Process  Hans Rudolf Künsch  Sep2013 
Abstract
This work considers the estimating functions approach to calibrating parameters in stochastic differential equations based on discretely sampled observations. Since the likelihood function is not known in closed form in the discrete case, we have to rely on an approximation to the score function, the estimating function, and then take its root as the estimator. It turns out that roots of estimating functions enjoy a number of remarkable asymptotic properties. First, the regularity assumptions required for the main results to hold are outlined. Then we consider one major result: when conditional moments of the process are known in closed form, the roots of an estimating function are asymptotically normal. Secondly, a more general theorem, which uses sample moments instead of the conditional ones in the estimating function, is discussed. Under a suitable choice of the approximating scheme the roots are still asymptotically normal, but with bias and larger variance. Finally, estimation of both drift and diffusion coefficients is considered for Geometric Brownian Motion and the Ornstein-Uhlenbeck process, generated from a Monte Carlo simulation. Important issues for various values of parameters are emphasized, as well as advantages and difficulties of using estimating functions.


Sarah Grimm 
Supervised and semi-supervised classification of skin cancer 
Sara van de Geer
Markus Kalisch
Chris Snijders 
Aug2013 
Abstract
As skin cancer rates continue to grow, dermatologists will be overwhelmed by the number of patients seeking skin cancer diagnosis. This problem is being addressed in the Netherlands, where research with a hospital has been developing logistic regression models that may help train nurses to diagnose skin cancer, and that are accessible via a mobile application. The present work investigated whether the logistic regression models could be improved or outperformed. A small simulation study explored the future potential for improving the models by incorporating information from patients who would use the application and who have not received a diagnosis. Logistic regression proved to be a competitive model. A smaller set of predictors with which the models performed practically as well was identified. Although incorporating information from undiagnosed cases did not improve performance, it also did not degrade it, and it is worth continuing to investigate the value of undiagnosed cases for model performance.


Lennart Schiffmann 
Measuring the MFD of Zurich: Identifying and Evaluating Strategies for an Efficient Placement of Detectors 
Marloes Maathuis
Markus Kalisch 
Aug2013 
Abstract
In recent years the macroscopic fundamental diagram (MFD) was established in the traffic research community. It can describe the overall traffic state in homogeneously congested areas in cities. To facilitate a real-world implementation of an MFD-based traffic control system, we are developing strategies for placement of fixed monitoring resources (e.g. loop detectors). These strategies to place detectors efficiently are based on univariate and bivariate distributions of street properties such as road length, number of lanes and occurrence of traffic lights. We find that the use of bivariate distributions including the length of streets can yield good results. Our research is based on a microsimulation of the city of Zurich implemented in VISSIM.


Reto ChristoffelTotzke 
Time Series Analysis Applied to Power Market Data  Peter Bühlmann  Aug2013 
Abstract
This thesis studies the daily closing prices of the futures contract for the Base13. The goal is to describe their characteristics and to understand which factors determine their trend. By means of appropriate methods and procedures, the most important of numerous variables are selected and five different models are developed to describe the Base13. These models are also used to compute precise one-step-ahead forecasts of future closing prices. A short introduction equips the reader with the necessary basic knowledge about the functional principle of power markets so that the results of the analyses and their interpretations can be understood. The descriptive time series analysis of the closing prices in the following section demonstrates that the volatility changes heavily over a specific period of time, which complicates the development of the models. Furthermore, in the same section, the Random Walk Hypothesis of independent and random price changes could not be confirmed for the Base13 contract. The next section focusses on GLM models. Based on GLM, a model has been developed which includes the most important indicators for the closing prices: coal API213, EUA13, Gas TTF13, CLDS and the USD/EUR exchange rate. The resulting GLM forecasting model performed very accurately, with a precision in trend of 81%. A strong linear correlation appeared between Base13 and coal, EUA, gas and the exchange rates, which have the major quantifiable impact, as shown in a graphical analysis of these effects. Thereafter, the impact analysis was intensified, producing some interesting insights into the reaction of the closing prices to the changing volatility of the input variables. 
All variables of the final GLM model are highly significant in the GAM as well and show identical features relating to their impact on the Base13. The forecasting model with GAM reaches an accuracy in trend of 78%. The research documented in the next section has been able to confirm four important variables of the final model by applying MARS: coal, EUA, CLDS and the USD/EUR exchange rate. According to the graphical analysis, the effect of those most important variables is likewise almost linear. The forecasting model with MARS reaches an accuracy in trend of 78%. Furthermore, another forecasting model has been developed with NNET, which captures nonlinear effects to an acceptable extent. The corresponding effect plot illustrates this nonlinearity clearly; it is especially high for gas, the exchange rates, coal and CLSS. The forecasting model with NNET demonstrates an accuracy in trend of 74%. The following section illustrates that the results with PPR largely confirm the outcome obtained with GLM for the final model. The forecasting model based on PPR shows a precision in trend of 75%. Various theoretical findings relating to the impact on the closing prices of Base13, as well as findings based on applied experience, have been confirmed with empirical data. The straightforward linear model has proven very accurate as well as comprehensible thanks to its mathematical form. Furthermore, it has been demonstrated that complex nonlinear models bear no advantage due to the strong correlation between the most important variables and the Base13. It can therefore be concluded that the goals set for this thesis have been achieved by providing substantial insight into theoretical and applied aspects of statistical models for forecasting futures closing prices.


Andrea Remo Riva 
Convex optimization for variable selection in high-dimensional statistics  Sara van de Geer  Jul2013 
Abstract
Which genes favor or oppose the formation of potentially fatal diseases such as prostate cancer, Crohn’s or Huntington’s disease? The world around us is increasingly confronted with situations in which large amounts of collected data must be interpreted in order to formulate specific hypotheses about the causes of phenomena of particular interest. Modern statistics therefore seeks to develop new tools that can deal effectively with this kind of problem. This Master's thesis first reviews the basic ideas related to the LASSO (Least Absolute Shrinkage and Selection Operator), introduced by Tibshirani in 1996, and the basics of convex optimization. The study then focuses on finding optimal solutions by regularizing the empirical risk with appropriate nonsmooth norms. Proximal methods handle these optimization problems effectively and in a variety of ways, and they are of considerable computational interest because the proposed algorithms have good convergence rates. Next, we explore the possibility of introducing structured sparsity in the solutions in order to greatly improve the quality of the regression coefficients. For this purpose we introduce new variational norms that constrain auxiliary vectors with positive components to belong to a set of our choice, which determines the desired structure in an inductive way. Finally, some applications in the fields of image processing and medical research illustrate concretely how multidimensional statistics is called upon today to help humankind.


Nilkanth Kumar 
An Empirical Analysis of the Mobility Behaviour in Switzerland using Robust Methods 
Werner Stahel
Massimo Filippini 
May2013 
Abstract
In this thesis, the demand for personalized mobility by Swiss households has been studied using vehicle stock parameters and geographic and socioeconomic characteristics. For this purpose, disaggregate household-level data from the latest Swiss travel microcensus for the period 2010-2011 has been used. In addition to the OLS approach, robust methods using MM-estimators have been incorporated to obtain improved model fits and estimation results. A few related demand questions, such as comparing car usage of single- and multiple-car households, are also explored.
The estimated coefficients mostly have the expected signs. The demand for personal mobility is found to vary widely across different locations and households. The non-availability of good public transport in an area is associated with a significantly higher demand for car utilization. Rich households appear to have a higher travel demand in general. Efficient cars are found to be driven more than those with poor energy ratings. In multi-car households, a vehicle usage disparity of as much as 21% is observed based on the efficiency label. From a policy maker's point of view, further research into specific areas is advised to assess the feasibility of diverse policy instruments that account for the differences in people's vehicle utilization behaviour. 

Nicolas Meng 
Optimal Portfolios  The Benefits of Advanced Techniques in Risk Management and Portfolio Optimization 
Sara van de Geer
Markus Kalisch 
May2013 
Abstract
This Master's thesis deals with the most important challenges facing practitioners in portfolio and risk management. It embeds a variety of risk and optimization methodologies into a common framework and performs an empirical backtest on a typical sector rotation strategy in the US market. The objective of this study is to evaluate the impact of wrong assumptions in risk modeling and portfolio optimization, as a recent survey showed that practitioners are still using simplified approaches based on wrong assumptions, despite empirical evidence that contradicts their assumptions. First, we apply different risk forecast models to the empirical data. Apart from an unconditional model that is still prominently used in practice, a constant conditional correlation (CCC) and a dynamic conditional correlation (DCC) model are implemented, and the forecasting performance is evaluated on the risk measures of volatility, VaR, and CVaR. There is clear empirical evidence that the unconditional model performs poorly and leads to severe underforecasting and clustering of losses during the financial crisis of 2008. The more complex DCC model provided the most accurate forecasts, followed by the CCC model. This demonstrates that wrong model assumptions lead to unacceptable results in practice. Based on forecasts from all risk models, two optimization approaches are tested. An adapted version of the traditional mean-variance optimization is employed. Additionally, a relatively new method of diversification optimization is implemented and compared against return maximization, subject to a CVaR constraint. Using this comparison, we examine the effect of estimation error on the expected returns and risk parameters. 
As a diversification approach is invariant to the estimates of expected returns, we assume that it should provide more stability to an optimized portfolio. We were able to confirm the concerns about estimation error and found that return maximization does not lead to optimal portfolios out-of-sample. In contrast, the empirical results of the diversification-CVaR strategy are promising. Maximum diversification of independent risk factors leads to better performance in terms of both realized risk and returns. In light of these findings, we question the practice of using the traditional method of return maximization, as the cost of ignoring estimation error in the optimization seems to be significant. Finally, we conclude that the standard approach still followed by a majority of practitioners does not deliver satisfactory results due to wrong assumptions about the statistical properties of the financial markets. We conclude that conditional risk estimates and the problem field of estimation errors are important aspects that cannot be neglected solely for the sake of simplicity.


Cong Dat Huynh 
Semi-supervised learning methods for problems having positive and unlabeled examples 
Sara van de Geer
Markus Kalisch
Thomas Beer, Swisscom 
Apr2013 
Abstract
A company can use upselling methods to upgrade the products its customers have bought from it. Besides increasing profit, the customers' stronger dependency on the company through the upgraded products can help to reduce the churn rate. This is especially crucial in the telecommunication sector, in which volatility is high and customer loyalty is low. The easiest way to upsell is to offer the customers the upgraded products. However, offering all products to all customers is not advisable, because too much marketing information will annoy the customers. In this paper we introduce and compare several methods that can support the decision of whether or not to offer a product to a customer. The effectiveness of the methods is validated through a simulation study based on real-world datasets. The results of the study indicate that several methods have great potential.


Ruben Dezeure 
P-values for high-dimensional statistics  Peter Bühlmann  Mar2013 
Abstract
In this work, recently published methods for hypothesis testing in high-dimensional statistics are studied. The methods are compared by testing for variable importance in linear models for a variety of test setups, including real datasets. For multiple testing correction a procedure is used that is closely related to the Westfall-Young procedure, which has been shown to have asymptotically optimal power. The estimation performance of the regression coefficients is also looked at to provide a different level of comparison. Finally, we also test for a logistic regression model to investigate if testing in generalized linear models is reliable with state-of-the-art methods.


Harald Bernhard 
Parameter estimation in state space models  Hans Rudolf Künsch  Mar2013 
Abstract
We considered the effectiveness of a particle approximation procedure to the score function via filtered moments of artificially time-varying parameters in general state space models. To investigate this issue we considered a simple two-state hidden Markov model where exact reference values are available. For this model we conducted simulation studies to estimate several diagnostic statistics about the score approximation procedure. The results were then used to perform maximum likelihood estimation in the same model, using the noisy score approximation in combination with a stochastic approximation procedure.


Mark Hannay 
Robust Testing and Robust Model Selection 
Werner Stahel
Manuel Koller 
Mar2013 
Abstract
As the title suggests, this thesis belongs to the domain of robust statistics. There are three main chapters: testing in linear models, testing in generalized linear models and model selection. We start in the linear model, where we describe classical estimation and classical tests. After describing the classical methods, we introduce robust estimation, namely the SMDM estimator. With our robust estimates we present robust tests, including a new robust score test. To improve the speed of our new robust score test, we develop methods to estimate the scale parameter $\sigma$ from the reduced model. The most prominent robust tests for the composite hypothesis are the robust Wald test and the $\tau$-test. Both of these tests are computationally expensive because they require fitting the full model. We develop new robust tests that only require fitting the reduced model. In generalized linear models (GLMs), we once again describe the classical estimation and the classical tests. By using robust scores, we introduce robust estimation. In GLMs, two prominent robust tests already exist, the quasi deviance test and the robust saddlepoint test. However, they are computationally expensive. So we introduce the robust Wald test and the robust score test, which are both computationally cheaper. Here we compare the quasi deviance test with the robust Wald test and the robust score test, while simultaneously comparing them to the classical saddlepoint test. In the chapter on model selection, we introduce an important method, the classical Mallows' $C_{p}$ criterion. By using the classical Mallows' $C_{p}$ criterion in an example, we discuss the importance of using robust methods for model selection. So we develop our own robust Mallows' $C_{p}$ criterion, which works well in the example. We compare the classical and the robust Mallows' $C_{p}$ criteria with each other in a simulation study. Another approach to model selection, based on testing, is also discussed. 
I have tried to make this thesis as self-contained and as comprehensive as possible, while keeping to the essentials. Chapters 2 and 4 should be accessible to readers with a good foundation in linear regression, while Chapter 3 should be accessible to readers with a good foundation in generalized linear regression.


Benjamin Stucky 
Second-Level Significance Testing  Sara van de Geer  Feb2013 
Abstract
The emergence of modern-day information gathering technologies, amongst all their benefits, gave rise to some new problems and challenges. Nowadays we need to be able to handle huge data sets. We will often face the problem that some information of interest is very rarely contained in our data. At the same time this information is very hard to distinguish from every other observation. This thesis focuses on how to detect the presence of such sparse information with the aid of a method called Higher Criticism. This is a hypothesis test to determine whether we have a very small fraction of non-null hypotheses amongst many null hypotheses or if this fraction is indeed zero. For the definition of this test we need a collection of different significance tests, hence the name Second-Level Significance Testing. Higher Criticism was suggested by Tukey in 1976 and then developed by Donoho and Jin [15] in 2004. This thesis is mainly based on their work as well as the work of Cai et al. [9]. The main focus lies on the detection of sparse signals, but some cases where the signals are dense are also discussed. Higher Criticism works very well for the adaptive detection of sparse and faint signals amongst background noise. Adaptive means that Higher Criticism is able to work without knowing the sparseness and the faintness of the detection problem. The case where the data is Gaussian distributed is the basis for developing the Higher Criticism test statistic. In this setting Higher Criticism is optimal. Optimality means that asymptotically Higher Criticism is able to detect all theoretically detectable signals. The detectable signals are described by the detection boundary. We also encounter the problem of correlated observations. There we can modify Higher Criticism and still get good results, following the work of Hall and Jin [19]. 
The notion of the detection boundary and Higher Criticism can even be generalized to a wide range of different settings due to Cai and Wu [12]. Higher Criticism thus solves one challenge that new technologies have posed us. We discuss other important problems connected to the detection of sparse signals according to Cai et al. [10], such as the estimation of the fraction of sparse signals and discovering which observations are signals of interest.

2012
Student  Title  Advisor(s)  Date 

Giacomo Dalla Chiara 
Factor approach to forecasting with high-dimensional data, an application to financial returns  Peter Bühlmann  Oct2012 
Abstract
This study considers forecasting a time series of financial returns in a linear regression setting using a number of macroeconomic predictors (N) which can exceed the number of time series observations (T). Usually, regression estimation techniques either consider only a handful of predictors or assume that the vector of parameters is sparse. Several recent papers advocate the use of a factor approach to deal with such high-dimensional data without discarding any of the predictors. Assuming an approximate factor structure on the data, it is possible to summarize the large set of time series using a limited number of indexes, which can be consistently estimated using principal components. First, we review the recent theoretical developments in the construction and estimation of a forecasting procedure which uses the large-dimensional approximate factor model. The aim is to help bridge these studies with the empirical research, which presents mixed performance results for the factor model implemented on real-world data. In a second part we discuss four implementation techniques of the factor model, namely, (i) screening, (ii) estimation window size selection, (iii) factor selection, (iv) variable selection in a factor-augmented regression. We argue that these four methodologies, which have often been considered separately in the empirical literature, are paramount for the factor model to achieve a better forecasting performance than lower-dimensional models. In the last part of this study, factor-augmented models, with and without the above-mentioned methodologies, are implemented using the Stock and Watson (2006) dataset of macroeconomic and financial predictors to forecast the time series of monthly returns of the Standard and Poor 500 index. Indeed, the empirical results show that screening and estimation window size selection are needed, in the factor model, to outperform lower-dimensional benchmarks. 
The main contribution of this work is to provide general guidelines for applying the large-dimensional factor model to real-world data. All the practical methodologies discussed in the paper are coded in the R programming language, and are contained in Appendix E.


Raphael Gervais 
Predicting the Effect of Joint Interventions from Observational Data  Marloes Maathuis  Sep2012 
Abstract
It is commonly believed that causal knowledge discovery is not possible from observational data and requires the use of experiments. In fact, it is indeed impossible to learn causal information from observational data when one is not willing to make any assumptions. However, under some fairly general assumptions, IDA (Intervention calculus when the DAG is Absent) is a methodology that can deduce information on causal effects from observational data. The present work extends the IDA methodology in two ways. Firstly, in the case of single outside interventions on a system, two new algorithms are presented: IDA Path and IDA SemiLocal. These algorithms compare favourably in simulation studies in terms of both statistical properties and computational efficiency. Secondly, the IDA methodology is extended to cases where one seeks information about the causal effect of joint outside interventions on a system. Here, two algorithms are introduced, IDA IPW Joint and IDA Path Joint, that show encouraging results in simulation studies. These new algorithms for joint interventions may easily be extended to the creation of IDA-type algorithms for arbitrarily many outside interventions on a system.


Laura Buzdugan 
High-dimensional statistical inference 
Peter Bühlmann
Markus Kalisch 
Aug2012 
Abstract
The present work seeks to address the issue of error control in high-dimensional settings. This task has proven challenging due to (1) the difficulty of deriving the asymptotic normal distribution of the estimators, and (2) the high degree of multicollinearity commonly exhibited by the predictor variables. These two issues were addressed by combining Bühlmann (2012)’s method of constructing p-values based on Ridge estimation with an additional bias correction term, and Meinshausen (2008)’s proposal of a hierarchical testing procedure that controls the FWER (Family Wise Error Rate) at all levels. This led to the extension of the construction of p-values to cases in which the response variable is multivariate. The new method was tested on an SNP phenotype association dataset, which also allowed for investigation of different approaches to bias correction.


Michel Philipp 
Cost Efficiency of Managed Care Programs in Health Care Insurance  Werner Stahel  Aug2012 
Abstract
Managed care (MC) plans in health care systems promise an improved quality of medical service at significantly lower expenses. Therefore, politicians and health insurers have a strong incentive to estimate the cost efficiency of such alternative insurance plans from historical data on health care expenditure (HCE). However, estimating cost effects between basic and alternative insurance plans in an observational study is particularly challenging. Differences between the baseline characteristics in the different insurance collectives result in selection biases. This occurs notably when insurance companies offer discounts to MC plan policyholders, effectively creating an economic incentive that appeals mainly to young and healthy people. This thesis first discusses the statistical challenges that arise when estimating the cost efficiency of MC plans. To draw causal conclusions from estimates based on observed health care data, the MC plan assignment must be independent of the HCE within subgroups of relevant confounders. Unfortunately, not every potential confounder can be observed by the insurance companies, and therefore we conclude that it is not practicable to estimate causal effects from the available health care data. However, insurance companies are similarly interested in monitoring the HCE between different insurance plans. Therefore, we analyse data from a large Swiss insurance company using Tobit regression to estimate differences in (left-censored) HCE between basic and MC insurance plans, particularly within regions and pharmaceutical cost groups. Further, we attempt to improve the models using a propensity score, the probability of choosing MC insurance, and calculate the confidence bands of the resulting differences in HCE between insurance plans from 100 bootstrap replications. To avoid additional bias we excluded covariates that are potentially affected by the MC plan. The estimates that we obtain with our models vary significantly between regions. 
However, in total we obtain lower HCE compared to basic insurance of (with 95% confidence limits). Since it is unknown whether the requirements for causal inference are met, our conclusion is that one cannot entirely rule out remaining selection bias in these estimates.


Rainer Ott 
A Wavelet Packet Transform based Stock Index Prediction Algorithm 
Hans Rudolf Künsch
Kilian Vollenweider
Evangelos Kotsalis 
Aug2012 
Abstract
In this Master's thesis we develop prediction algorithms which optimize a performance measure over a specified set of wavelet packet trees and smoothing parameters. The performance of the algorithms is evaluated for the daily DAX prices from 18th December 2003 to 30th December 2011. Using a quantitative return quality measure with an algorithm based on a delayed version of the discrete wavelet packet transform (DWPT), we were able to outperform the exponentially weighted moving average trend follower. For the same algorithm, 3 out of 25 wavelet packet trees were observed to be favourable. Furthermore, the DWPT was found to consistently outperform the discrete wavelet transform if the Haar basis is used.


Sylvain Robert 
Sequential Monte Carlo methods for a dynamical model of stock prices 
Hans Rudolf Künsch
Didier Sornette 
Aug2012 
Abstract
Stock markets often exhibit behaviours that are far from equilibrium, such as bubbles and crashes. The model developed in Yukalov et al. (2009) aims at describing the dynamics of stock prices, and notably the way they deviate from their fundamental value. The present work is concerned with estimating the parameters of the model and with filtering the underlying mispricing process. Various Sequential Monte Carlo methods were applied to the problem at hand. In particular, a fully adapted Particle Filter was derived and showed the best performance. While the filtering was well handled by the different methods, the estimation of the parameters was much more difficult. Nevertheless, it was possible to identify the market type, which qualitatively describes the dynamics of a stock. The methods were first tested on simulated data before being applied to the Dow Jones Industrial Average. The latter application led to very interesting results. Indeed, the estimated model provided insight into the underlying dynamics, and the filtering of the mispricing process shed new light on some important financial events of the last 40 years.


Radu Petru Tanase 
Learning Causal Structures from Gaussian Structural Equation Models 
Jonas Peters
Peter Bühlmann 
Aug2012 
Abstract
Traditional algorithms in causal inference assume the Markov and faithfulness conditions and recover the causal structure up to the Markov equivalence class. Recent advances have shown that by using structural equation models it is possible to go even further and in some cases identify the underlying causal DAG from the joint distribution. We focus on an identifiability result for linear Gaussian SEMs with equal noise variances and propose an algorithm that estimates the causal DAG from such models. We evaluate the performance of the algorithm in a simulation study and compare it to the performance of two other existing methods: the PC Algorithm and Greedy Equivalence Search.


Matteo Tanadini 
Regression with Relationship Matrices using partial Mantel tests  Werner Stahel  Jul2012 
Abstract
Relationship matrices and the statistical methods used to analyse them are of growing importance in science because of the increasing number of systems that are represented by networks. Relationship matrices are often used in fields such as the social sciences, biology or economics. In the context of multiple linear regression with relationship matrices, partial Mantel tests represent the standard statistical framework for inference. Several approaches of this kind can be found in the literature. In order to evaluate the performance of these methods, a sensible way to simulate datasets is indispensable. Unfortunately, studies conducted so far comparing the performance of partial Mantel tests rely on inadequate simulated datasets and are therefore questionable. The goals of this Master's thesis were to compare the performance (measured as level and power) of widely used partial Mantel tests using state-of-the-art simulation techniques and to describe new implementations with improved performance. In a first phase, we focused on improving the quality of the models used to simulate datasets for multiple linear regression with relationship matrices. We were able to propose two convenient procedures for simulating predictors (i.e. relationship matrices). We could also show a more appropriate way to simulate the error term for linear regression with relationship matrices. In a second phase, we described three modifications for partial Mantel tests that are expected to improve performance. The implementation of these improvements in R code will be the subject of future research. Finally, we compared the performance of three partial Mantel tests using datasets simulated according to our improved technique. The results agree with previous studies and confirm that the method proposed by Freedman & Lane has the best overall performance.


Markus Harlacher 
Cointegration Based Statistical Arbitrage 
Sara van de Geer
Markus Kalisch 
Jul2012 
Abstract
This thesis analyses a cointegration-based statistical arbitrage model. Starting with a brief overview of the topic, a simulation study is carried out that is intended to shed light on the mode of action of such a model and to highlight some potential flaws of the method. The study continues with a backtest on the US equity market for the period from 1996 to 2011. The results of all the different model versions that were tested look quite promising. "Traditional" mean-variance based performance measurements attest very good results to the employed cointegration-based statistical arbitrage model. The advanced dependence analysis with respect to the returns of the S&P 500 index and the returns obtained from the backtest shows a very favourable structure and indicates that such a model can provide returns that are only very weakly related to the returns of the S&P 500 index.


Yongsheng Wang 
Numerical approximations and Goodness-of-fit of Copulas 
Martin Mächler
Werner Stahel 
Jul2012 
Abstract
The author first gives an introduction to copulas and derives the Rosenblatt transform of elliptical copulas. To circumvent numerical challenges in estimating the density of the Gumbel copula, several approaches are presented. The author finds an algorithm to choose appropriate methods under various conditions. It is obtained by first determining the bit precision when using the benchmark method dsSib.Rmpfr and then conducting a simulation study for comparisons. This is followed by a review of goodness-of-fit methods for copulas, including tests based on empirical copulas, the Rosenblatt transform, the Kendall transform and the Hering-Hofert transform. The author conducts a large simulation experiment to investigate the effect of the dimension on the level and power of goodness-of-fit tests for various combinations of null hypothesis copulas and alternative copulas. The results are interpreted via graphs of confidence intervals and power ratios. Also, the relationships among the computational time, dimension, sample size and number of bootstraps are explored. Last, the dependence structure of the Dow Jones 30 is investigated using a graphical goodness-of-fit test under various types of Student-t, Gumbel and Clayton copula families. The Student-t copula with unstructured correlation matrix and optimized degrees of freedom estimated by maximum likelihood gives the best solution.


Amanda Strong 
A review of anomaly detection with focus on changepoint detection 
Sara van de Geer
Markus Kalisch 
Jul2012 
Abstract
Anomaly detection has the goal of identifying data that is, in some sense, not "normal." The definition of what is anomalous and what is normal is heavily dependent on the application. The unifying factor across applications is that, in general, anomalies occur only rarely. This means that we do not have much information available for modeling the anomaly generating distribution directly. We will describe several ways of approaching anomaly detection and discuss some of the properties of these approaches. Changepoint detection can be considered a subtopic in anomaly detection. Here the problem setting is more specific. We have a sequence of observations and we would like to detect whether their generating distribution has remained stable or has undergone some abrupt change. The goals of a changepoint analysis may include both detecting that a change has occurred and estimating the time of the change. We will discuss some of the classic approaches to changepoint detection. As very large datasets become more common, so do the instances in which it is difficult or impossible for humans to heuristically monitor for anomalous observations or events. The development and improvement of anomaly detection methods is therefore of ever-increasing importance.


Peter Fabsic 
Comparing the accuracy of ROC curve estimation methods  Peter Bühlmann  Jul2012 
Abstract
The aim of this study is to compare the accuracy of commonly used ROC curve estimation methods. The following ROC curve estimators were compared: empirical, parametric, binormal, "log-concave" together with its smoothed version (as introduced in Rufibach (2011)), and the estimator based on kernel smoothing. Two simulations were carried out, each assessing the performance of the estimators in a range of scenarios. In each scenario we simulated data from known distributions and computed the true and the estimated ROC curves. Using various measures we assessed how close the estimates were to the true curve. In the first simulation, a large sample size was used to compute the estimated ROC curves. A substantially smaller sample size was used in the other simulation. The "log-concave estimator" was found to perform the best when a large sample was available. On the other hand, the estimator based on kernel smoothing outperformed all other competitors in the simulation with the small sample size.
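As a rough illustration of the simplest estimator in this comparison, the empirical ROC curve can be sketched in a few lines of Python. This is a minimal sketch, not the thesis code: the function names `empirical_roc` and `auc` and the example scores are assumptions made for illustration, and higher scores are assumed to indicate cases.

```python
def empirical_roc(controls, cases):
    """Empirical ROC curve as a list of (false positive rate, true positive rate).

    A subject is called positive when its score is at or above the threshold;
    the threshold sweeps over every observed score, from largest to smallest.
    """
    thresholds = sorted(set(controls) | set(cases), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(x >= t for x in controls) / len(controls)
        tpr = sum(y >= t for y in cases) / len(cases)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores, for illustration only.
controls = [1.0, 2.0, 3.0]
cases = [2.5, 3.5, 4.5]
roc = empirical_roc(controls, cases)
print(roc)
print(auc(roc))
```

The smoothed estimators named in the abstract (binormal, log-concave, kernel) replace the two empirical distribution functions used here with smooth estimates; they are not shown.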


Edgar Alan Muro Jimenez 
About Statistical Learning Theory and Online Convex Optimization 
Sara van de Geer

Jul2012 
Abstract
This work is divided into two parts: in the first block, we present a relationship between an empirical process and the minimax regret of a game from Prediction with Expert Advice (PWEA). We use this expression to show how the lower bound of a PWEA minimax regret can give us some information about the form of the expert class being used, in particular, whether it is a VC class or not. In the second block, we analyse from a theoretical point of view the similarities in the performance of algorithms from Statistical Learning, PWEA, and Online Convex Optimization (OCO). We present results for the three methods showing that the rate of decay of the prediction error depends on the curvature of the loss function over the space of the predictor's choices. In addition, we provide theorems for Statistical Learning and OCO showing that similar lower bounds for their regrets can be obtained assuming that the minimizer of the expected loss is not unique. This provides more evidence of the resemblance between the performance of algorithms from Statistical Learning and OCO. Finally, we show that any PWEA game can be seen as a special case of an OCO game. Even though this represents an advantage for finding upper bounds for PWEA, we present an example where the upper bounds for the regret originally created for OCO are not better than those found for PWEA.


Elena Fattorini 
Estimating the direction of the causal effects for observational data  Marloes Maathuis  Jul2012 
Abstract
In many scientific studies, causal relationships are of crucial importance. Unfortunately, it is not possible to calculate causal effects from observational data alone without making some assumptions. In this thesis, the observational data are assumed to be generated from an unknown directed acyclic graph (DAG). Under such a model, bounds on causal effects can be computed with the approach of Maathuis, Kalisch, and Bühlmann (2009). The idea behind this approach is as follows. First, one tries to estimate the DAG that generated the data, and then one computes the causal effects for the obtained DAG. However, under our assumptions we can generally only identify an equivalence class of DAGs that are compatible with the data. Due to the existence of these different possible generating DAGs, the causal effect of a variable X on a variable Y cannot always be identified uniquely. However, one can identify the causal effect for each DAG in the equivalence class and collect all these effects in a multiset. These multisets can be summarized using summary measures. For example, in the paper of Maathuis et al. (2009) the minimum absolute value is used as a summary measure. That gives a lower bound on the size of the causal effect. In this thesis, we focus on the problem of how to derive the sign of the causal effects. Clearly, the minimum absolute value is not appropriate for this purpose. Eight new summary measures are proposed, and simulation studies are performed to find the summary measure that best detects the largest positive causal effects among a set of given variables. The summary measures are compared using averaged ROC curves. The maximum and the mean turn out to be the best summary measures. In the estimated graphs, some edges are directed in the wrong direction. A large positive causal effect can be estimated as zero due to a wrongly directed arrow. 
Therefore, in order to detect all the largest positive causal effects, one should also investigate the effects which are estimated as zero.
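The difference between summary measures can be made concrete with a small Python sketch. The multisets below are invented for illustration (one possible effect per DAG in a hypothetical equivalence class), and only the three measures named in the abstract are shown; the remaining measures proposed in the thesis are not reproduced here.

```python
from statistics import mean


def summarize(effects):
    """Summary measures of a multiset of possible causal effects of X on Y."""
    return {
        "min_abs": min(abs(e) for e in effects),  # lower bound on the effect size
        "max": max(effects),
        "mean": mean(effects),
    }


# Hypothetical multisets: one estimated effect per DAG in the equivalence class.
multisets = {
    "X1": [0.0, 0.9, 0.9],     # estimated as zero in one DAG, large and positive otherwise
    "X2": [-0.5, -0.4, -0.6],  # consistently negative
    "X3": [0.3, 0.3, 0.3],     # consistently small and positive
}
summaries = {name: summarize(effects) for name, effects in multisets.items()}


def rank(measure):
    """Variables ordered from largest to smallest value of the given measure."""
    return sorted(multisets, key=lambda name: summaries[name][measure], reverse=True)


for measure in ("min_abs", "max", "mean"):
    print(measure, rank(measure))
```

In this toy example the minimum absolute value ranks the negative effect of X2 first because it ignores the sign, whereas the maximum and the mean both put the mostly positive X1 on top.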


David Schönholzer 
Geostatistische Kartierung der Waldbodenversauerung im Kanton Zürich (Geostatistical Mapping of Forest Soil Acidification in the Canton of Zurich) 
Andreas Papritz
Hans Rudolf Künsch 
Jul2012 
Abstract
Against the background of growing awareness of the increasing acidification of forest soils in Switzerland and in the Canton of Zurich, this thesis maps, for the first time with nearly complete spatial coverage, the degree of acidification of the forest soils in the Canton of Zurich, based on the estimated pH value in the topsoil. To this end, a range of national and cantonal data sets on soil acidification, climate, vegetation, topography and geology are processed and used for the statistical estimation of soil acidification. To address some of the difficulties in statistically estimating environmental variables, a combination of statistical methods is used, in particular geostatistics and robust statistics.


Myriam Riek 
Towards Consistency of the PC-Algorithm for Categorical Data in High-Dimensional Settings  Marloes Maathuis  Jun2012 
Abstract
The PC-algorithm is an algorithm used to learn about or estimate the causal structure among a causally sufficient set V of random variables from data. Under the assumption of faithfulness, the PC-algorithm yields an estimate of the graph representing the Markov equivalence class of causal structures over V that are compatible with the probability distribution defined over V. Consistency of an estimator is a crucial property. It has been proven to hold for the PC-algorithm applied to multivariate normal data in high-dimensional settings where the number of variables is increasing with sample size, under some conditions ([10]). In this Master's thesis, an attempt was made to prove consistency of the PC-algorithm applied to categorical data in low- and high-dimensional settings.


Stephan Hemri 
Calibrating multi-model runoff predictions for a head catchment using Bayesian model averaging 
Hans Rudolf Künsch
Felix Fundel 
May2012 
Abstract
One approach to quantifying uncertainty in hydrological rainfall-runoff modeling is to use meteorological ensemble prediction systems as input for a hydrological model. Such ensemble forecasts consist of a possibly large number of deterministic forecasts. Uncertainty is given by their spread. As such ensemble forecasts are often underdispersed and biased, and do not account for other sources of uncertainty such as the hydrological model formulations, statistical postprocessing needs to be applied to achieve sharp and calibrated predictions. In this thesis, postprocessing of runoff forecasts from summer 2007 to the end of 2009 for the river Alp in Switzerland is done by applying Bayesian model averaging (BMA). A total of 68 ensemble members coming from a deterministic and two ensemble forecasts are used as input for BMA. These forecasts cover different lead times from 1h to 240h. First, BMA based on univariate normal and inverse gamma distributions is performed under the assumption of independence between lead times. Then, the independence assumption is relaxed in order to simultaneously estimate multivariate runoff forecasts over the entire range of lead times. This approach is based on a BMA version that uses multivariate normal distributions. Since river discharge follows a highly skewed distribution, a Box-Cox transformation is applied in order to achieve approximate normality. Back-transformation combined with data quality leads in some cases to predicted probabilities of extremely high runoffs that are too high. Using the inverse gamma distribution instead does not remove this problem either. Nevertheless, both the univariate and the multivariate BMA approaches are able to generate well-calibrated forecasts that are considerably sharper than the climatology.
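The Box-Cox step and its back-transformation can be sketched as follows. This is an illustrative Python sketch under assumed values: the transformation parameter `lam = 0.2`, the runoff value and the interval width are invented and do not come from the thesis.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def inv_box_cox(z, lam):
    """Back-transformation to the original scale (requires lam * z + 1 > 0)."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


# Hypothetical numbers: a runoff of 10 units and a symmetric interval of
# +/- 2 on the transformed scale, with an assumed parameter lam = 0.2.
lam = 0.2
center = box_cox(10.0, lam)
lower = inv_box_cox(center - 2.0, lam)
upper = inv_box_cox(center + 2.0, lam)
print(lower, inv_box_cox(center, lam), upper)
```

An interval that is symmetric on the transformed scale becomes strongly right-skewed after back-transformation, which is one way to see how the back-transformed forecasts can place considerable probability on very high runoffs.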


Linda Staub 
On the Statistical Analysis of Support Vector Machines 
Sara van de Geer

Mar2012 
Abstract
We analyze Support Vector Machines from a theoretical and computational point of view by explaining every building block of this algorithm separately, where we mainly restrict ourselves to binary classification. We start with loss functions and risks and then make a digression to the theory of kernel functions and their Reproducing Kernel Hilbert Spaces. We are then ready to perform the statistical analysis, where we assume in a first part that the data are independent and identically distributed. This analysis aims to investigate under which conditions on the regularization sequence the method is consistent and, more interestingly, to find the optimal learning rate and a way of nearly reaching it. We thereby explain the results given in [21] and add the missing proofs. Next, we briefly discuss the computational aspects of support vector machines, where we show that numerically the problem is reduced to solving a finite-dimensional convex program. Subsequently, we explain how to use support vector machines in practice by applying the R function svm() from the package e1071 to independent and identically distributed data. We then slightly violate this assumption and generate data from a GARCH process, which naturally carries a dependence structure, and observe that the algorithm still produces good results for this kind of data. We finally find the theoretical explanation for this by performing a statistical analysis of support vector machines for weakly dependent data following the work of [22].


Christian Haas 
Analysis of market efficiency: Post earnings drift in Swiss stock prices  Peter Bühlmann  Mar2012 
Abstract
The study of stock market behaviour and market efficiency is a very active topic within probability theory and statistics. Market models and their implications have recently attracted attention beyond the mathematical and economic communities. In this thesis, we take a look at some market models and studies of market efficiency. To this end, we establish the theory behind efficiency and regression. In chapter 7 we then study the post-earnings drift for Swiss stocks. We find a significant intraday drift in the direction of the first reaction after the earnings release. We then look at a strategy based on this result and try to answer whether we have found market inefficiency or not.


Ana Teresa Yanes Musetti 
Clustering methods for financial time series 
Martin Mächler
Werner Stahel 
Mar2012 
Abstract
The purpose of this thesis is to study a set of companies from the S&P 100 and determine whether share closing prices that move together correspond to companies belonging to the same economic sector. To verify this, different clustering methods were applied to a dissimilarity matrix corresponding to the degree of dependence between the companies. Since financial data do not follow a multivariate normal distribution, nonparametric dependence measures were needed. For this, the theory of Hoeffding's D, Kendall's τ and Spearman's ρ was reviewed. Then, in order to choose the best clustering solution, a set of validation statistics was applied. To compare in advance the performance of the different clustering methods and the validation statistics under different circumstances regarding the overlapping level of the clusters, two simulation studies were carried out. The first simulation was based on correlation matrices computed from covariance matrix samples from a Wishart distribution, and the second on models with Gaussian mixture distributions. This study showed that the transformation of the data, either from dependence measures or distances to (dis)similarities, has an impact on the performance of the clustering methods. Additionally, in the simulation studies some of the validation statistics performed poorly in extreme scenarios where the clusters are very well separated. Finally, when the companies belonging to the S&P 100 were clustered, the method PAM applied to the corresponding dissimilarity matrix estimated with Hoeffding's D gave the best solution compared to the clustering methods AGNES, DIANA and DSC, which agreed with the results from the simulation studies and the reviewed theory.


Lukas Patrick Abegg 
Analysis of market risk models 
Werner Stahel
Evangelos Kotsalis Lukas Wehinger 
Mar2012 
Abstract
In this thesis, risk models are evaluated in a joint project with the swissQuant Group AG and a major Swiss bank. Risk models with high complexity, e.g. based on GARCH models and different distribution assumptions, as well as simpler models, e.g. based on EWMA models and normal distributions, are assessed and compared for weekly data. The out-of-sample results are assessed graphically and the evaluation is performed with statistical tests applied to large-scale data. At the 95% confidence level, the quality of the Value-at-Risk estimates under the simple and complex models is assessed to be similar. If the Expected Shortfall and the Value-at-Risk at higher confidence levels are considered, however, the sophisticated methods improve the risk estimates. A risk model based on copulas, GARCH models and nonparametric distribution estimates is additionally developed and found to outperform the risk models provided.
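A simple model of the kind described, EWMA volatility combined with a normal distribution, might look like the Python sketch below. It is an assumption-laden illustration, not the model evaluated in the thesis: the decay factor `lam = 0.94`, the zero-mean assumption, the initialisation with the first squared return, and the return series are all hypothetical choices.

```python
from statistics import NormalDist


def ewma_variance(returns, lam=0.94):
    """Exponentially weighted moving average of squared returns.

    The recursion is var_t = lam * var_{t-1} + (1 - lam) * r_t**2,
    started from the first squared return (an assumed initialisation).
    """
    var = returns[0] ** 2
    for r in returns[1:]:
        var = lam * var + (1 - lam) * r ** 2
    return var


def normal_value_at_risk(returns, level=0.95, lam=0.94):
    """One-step-ahead Value-at-Risk as a positive loss, assuming zero-mean normal returns."""
    sigma = ewma_variance(returns, lam) ** 0.5
    return -NormalDist().inv_cdf(1 - level) * sigma


# Hypothetical weekly returns, for illustration only.
weekly_returns = [0.012, -0.020, 0.005, -0.031, 0.018, 0.002, -0.009]
print(normal_value_at_risk(weekly_returns, level=0.95))
print(normal_value_at_risk(weekly_returns, level=0.99))
```

Under the normal assumption the 99% figure is a fixed multiple of the 95% figure, so any advantage of heavier-tailed or nonparametric models can only show up at higher confidence levels or in the Expected Shortfall, which is consistent with the comparison described in the abstract.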


Martina Albers 
Boundary Estimation of Densities with Bounded Support 
Geurt Jongbloed
Marloes Maathuis 
Mar2012 
Abstract
When estimating a density supported on a bounded or semi-infinite interval by kernel density estimation, problems may arise at the boundary. In the past, many variations of the 'standard' kernel density estimator have been developed to achieve boundary corrections. Smooth estimates of the distribution and density functions have recently been derived for current status censored data. This topic is closely related to kernel density estimation, and the mentioned boundary problem can also appear in this context. In this Master's thesis some boundary corrections were combined with the smooth distribution estimates for current status censored data. Simulations to analyze the performance of these new constructions were carried out using R.


Alexandros Gekenidis 
Learning Causal Models from Binary Interventional Data  Peter Bühlmann  Mar2012 
Abstract
The goal of this thesis is to provide and test a method for causal inference from binary data. To this end, we first introduce the mathematical tools for describing causal relationships between random variables, such as directed acyclic graphs (DAGs for short), in which the random variables are represented by vertices whereas edges stand for causal influences. A DAG can, however, only be identified up to Markov equivalence, which roughly means that one can estimate its skeleton, but not the direction of most of the edges. This can be improved by performing interventions, i.e. by forcing a certain value upon one or several random variables and observing the change in the values of the other variables to obtain additional data. The resulting Markov equivalence classes form a finer partitioning of the space of DAGs than the non-interventional ones, thus improving the estimation possibilities. Based upon this theory we adapt the existing Greedy Interventional Equivalence Search algorithm (GIES, [1]) to the case of binary random variables and test it on simulated data.


Eszter Ilona Lohn 
Estimating the clinical score of coma patients - a comparison of model selection methods 
Werner Stahel
Markus Kalisch 
Mar2012 
Abstract
The aim of this thesis is to explore the possibility of estimating coma patients' clinical awareness score from objective clinical measurements, in order to replace the rather subjective doctors' examination, which is expensive and time-consuming. Variable selection and model fitting methods are compared by cross-validation. The basic analysis is extended towards block subset analysis, alternative cross-validation schemes and an analysis of the dynamics of the clinical score. As only a small sample is available, overfitting is a serious concern throughout the analysis, which is seen in the difference between in-sample and cross-validated model fits. In general we observe that low-variance (higher-bias) methods perform better on this sample size. In the end it is concluded that, based on this sample, the clinical measurements contain little information about the clinical awareness score.


Lisa Borsi 
Estimating the causal effect of switching to second-line antiretroviral HIV treatment using G-computation 
Marloes Maathuis
Markus Kalisch Thomas Gsponer 
Mar2012 
Abstract
Understanding causal effects between exposure and outcome is of great interest in many fields. In this work, the causal effect of switching to second-line antiretroviral treatment on death is estimated for a study population including HIV-infected patients experiencing immunological failure in Southern Africa (Zambia and Malawi). CD4 cell count is considered as a time-varying confounder of treatment switching and death, while it is itself affected by previous treatment. Given the impossibility of conducting a randomised experiment, we address the problem of time-varying confounding by G-computation. Under certain conditions, G-computation yields consistent estimates of the causal effect by simulating what would happen to the study population if treatment were set to a certain regime by intervention. In our analysis we compare the intervention “always switch to second-line treatment” to the intervention “always remain on first-line treatment”. We find the resulting risk ratio to be 0.24 (95% CI 0.14-0.33), indicating that the risk of dying is smaller in the population that switched to second-line treatment than in the population that stayed on first-line treatment. Thus, we conclude that there is a beneficial causal effect of switching to second-line treatment among HIV patients experiencing immunological failure.


Gardar Sveinbjoernsson 
Practical aspects of causal inference from observational data  Peter Bühlmann  Mar2012 
Abstract
In this thesis we study methods to infer causal relationships from observational data. Under some assumptions, causal effects can be estimated using Pearl's intervention calculus, provided that the data is supplemented with a known causal influence diagram. We study the IDA algorithm, which estimates the equivalence class of this diagram and uses the intervention calculus to get a lower bound on the size of the causal effects. Since it can be a difficult task to discover structure, especially in high-dimensional settings, we combine the IDA algorithm with stability selection, a subsampling method, to select the most stable causal effects. In the hope of improvement, we verify our results on a data set where the true causal effects are known from experiments. We also investigate the robustness of our method with a simulation study where we look at violations of assumptions.


Simon Kunz 
Simulated Maximum Likelihood Estimation of the Parameters of Stochastic Differential Equations 
Lukas Meier
Werner Stahel 
Mar2012 
Abstract


Marcel Freisem 
Estimating rating transition probabilities and their dependence on macroeconomic conditions for a bank loan portfolio  Peter Bühlmann  Feb2012 
Abstract


Tulasi Agnihotram 
Statistical Analysis of Target SNPs and their Association with Phenotypes 
Peter Bühlmann
Markus Kalisch 
Feb2012 
Abstract
Genomics is not only influencing the field of medicine, but also distantly related fields such as behavioural sciences and economics. The primary goal of this thesis is to investigate the relation between the genome, represented by SNPs, and the behavioural characteristics (such as risk aversion) of an individual, using supervised learning techniques. The human genome has 23 chromosomes, which contain information on millions of SNPs. Applying supervised learning techniques to millions of SNPs is difficult and may not be efficient. To simplify the analysis we select Target SNPs, which can represent all the surrounding SNPs. Target SNPs can be found by linkage disequilibrium with our modified Carlson's algorithm. When random forest (a supervised learning technique) was applied with genotype data at Target SNPs as predictors and the categorized phenotype data as response vector, the error rates obtained for each phenotype were not informative. Using a heuristic approach, we select Best SNPs from all SNPs on the chromosomes according to their rank correlation with the phenotype. With the test data at Best SNPs as predictors and the categorized test data of the phenotype as response vector, the error rate of random forest did not suggest a relationship between genotype and phenotype. Furthermore, we apply this procedure to random SNPs, compare the results with those for Target SNPs and Best SNPs, and provide directions for future work.

2011
Student  Title  Advisor(s)  Date 

Sung Won Kim 
Study on Empirical Process, based on Empirical Process Theory and Applications in Nonparametric Statistics  Sara van de Geer  Dec2011 
Abstract
Any estimator is a function of the empirical measure, while what we want to estimate is a function of the theoretical measure. To justify an estimator, we therefore want to show that the estimator, a function of the empirical measure, converges to the parameter, a function of the theoretical measure, as the sample size grows. In general, however, the function to be measured is unknown, and one wants simultaneous convergence over the class of all possible functions. Thus, we present uniform laws of large numbers to show that the empirical measure converges to the theoretical measure uniformly over a class of functions. This requires an entropy condition on the class, which controls the size of the class of functions to be estimated, and a finite-envelope condition, where the envelope is the supremum of the class of functions. Furthermore, we address the uniform central limit theorem, which describes how well the empirical measure converges to the theoretical measure. If one can show the equicontinuity of the empirical process indexed by the class of functions, and if this indexing class is totally bounded, then the class of functions is P-Donsker; equivalently, the process satisfies the uniform central limit theorem. That is, the empirical process converges to a Gaussian distribution. Equicontinuity, derived to establish the P-Donsker property, opens the way to deducing rates of convergence, in our case for least squares estimators. Therefore, as an application, we derive the rates of convergence of the least squares estimators for different classes of functions. We also consider the rates of convergence of the least squares estimators when a penalty is imposed for the complexity of the class of models. Even if the optimal model in the class is unknown, a proper choice of penalty allows one to attain the optimal rate of convergence, as if the optimal model were known. 
As applications of the uniform law of large numbers (ULLN) and the uniform central limit theorem (UCLT), convergence and normality of M-estimators are introduced as well. There, one can see how empirical process theory is applied in proving those properties. Furthermore, to check whether a class satisfies the ULLN or the UCLT, it is convenient to use the Vapnik-Chervonenkis (VC) index. Vapnik-Chervonenkis classes, whose VC index is finite, satisfy both the ULLN and the UCLT under an envelope condition, and thus play a role in empirical process theory.


Andre Meichtry 
Back pain and depression across 11 years: Analysis of Swiss Household Panel data 
Werner Stahel
Thomas Läubli 
Dec2011 
Abstract
Design and objective: In this longitudinal retrospective cohort study, we analysed back pain and depression data across 11 years in the general population of Switzerland. The main objective was to investigate the association between back pain and depression. Methods: We used data from the Swiss Household Panel. 7799 individuals (aged 13 to 93, mean 42.9 years, 56.2% women) were interviewed between 1999 and 2009. Observed depression and back pain were described across 11 years. Missingness was assumed to be independent of unobserved data. We estimated marginal structural models using inverse-probability-of-exposure-and-censoring weights to assess the (causal) association between back pain history and depression. Correlated data were analysed by fitting marginal and transition models with generalised estimating equations yielding robust sandwich variance estimates. Results: Cross-sectional analysis adjusting for other time-fixed covariates showed that back pain was associated with a 42% increase in the odds of depression over time. The association of continuous past back pain up to time t−1 with depression at time t was 0.65 on a linear logistic scale (95% CI: 0.48 to 0.82), corresponding to a 92% (62% to 127%) increase in the odds of depression. Assuming a causal model accounting for confounding of back pain by past depression, a marginal structural model (inverse-probability-of-exposure-and-censoring weighted model) regressing depression on past back pain showed an association of 0.63 (0.44 to 0.81) on a linear logistic scale, corresponding to an 87% (55% to 126%) increase in the odds of depression. Expressing exposure history by cumulative back pain up to time t−1, the marginal structural model estimated a causal effect on depression at time t that increased with age at baseline and decreased for individuals with depression at baseline. 
Conclusion: Marginal structural models are well suited for the analysis of observational longitudinal data with time-dependent potential causes of depression; however, they do not address all issues of causal inference. Back pain history is one of many possible causes of depression. Future work must collect more socio-economic and health-related covariates, investigate possible non-ignorable missingness and investigate other functions of back pain history.
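
The percentage changes in odds quoted in the Results follow from exponentiating the coefficients on the logistic scale. The following Python lines are a minimal sketch of that arithmetic only; the function name pct_increase_in_odds is illustrative and not taken from the thesis, and last-digit differences from the abstract can arise because the published coefficients are rounded.

```python
import math

def pct_increase_in_odds(log_odds_coef):
    """Convert a coefficient on the logistic (log-odds) scale to a percent change in odds."""
    return (math.exp(log_odds_coef) - 1.0) * 100.0

# Point estimate and 95% CI limits for continuous past back pain, as quoted in the abstract.
for coef in (0.65, 0.48, 0.82):
    print(f"{coef:.2f} on the logistic scale -> {pct_increase_in_odds(coef):.0f}% increase in odds")
```

An odds ratio is exp(coefficient), so a coefficient of 0 corresponds to no change in the odds.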


Jongkil Kim 
Heavy Tails and Self Similarity of Wind Turbulence Data (corrected version July 2012)  Hans Rudolf Künsch  Nov2011 
Abstract
In this thesis, we perform a statistical analysis to characterize wind turbulence. We estimate the pdf of the increments of wind velocities, which have heavy tails. We also estimate the autocovariances and autocorrelations of the increments of wind velocities, revealing their second-order properties in order to show self-similarity. Parsimonious properties of wind turbulence are discussed in terms of the estimated parameters. Under reasonable assumptions, relations between the lag of the wind increments and the estimated parameters are suggested, and the results are interpreted. In addition, the dependency between the wind increments and the mean velocities is discussed. Nonparametric tests are performed to assess whether dependency exists between the increments of wind velocities and the block mean velocities. The dependency of two consecutive increments on the block mean velocities is also investigated. Key words: Wind turbulence, Generalized Hyperbolic distribution, Normal Inverse Gaussian distribution, Self-similarity


Evgenia Ageeva 
Bayesian Inference for Multivariate t Copulas Modeling Financial Market Risk 
Martin Mächler
Peter Bühlmann 
Sep2011 
Abstract
The main objective of this thesis is to develop a Markov chain Monte Carlo (MCMC) method under the Bayesian inference framework for estimating meta-t copula functions for modeling financial market risks. The complete posterior distribution of the copula parameters resulting from Bayesian MCMC allows further analysis such as calculating risk measures that incorporate the parameter uncertainty. The simulation study of fictitious and real equity portfolio returns shows that the parameter uncertainty tends to increase the risk measures, such as the Value-at-Risk and the Expected Shortfall of the profit-and-loss distribution.
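
As a rough illustration of the two risk measures named above, the following Python sketch computes an empirical Value-at-Risk and Expected Shortfall from a sample of profit-and-loss values. It is not the thesis's procedure: the function name var_and_es, the quantile convention and the toy sample are assumptions made here. In a Bayesian setting one could apply the same calculation to simulated returns that already incorporate posterior draws of the copula parameters.

```python
import math

def var_and_es(pnl, level=0.99):
    """Empirical Value-at-Risk and Expected Shortfall of a profit-and-loss sample.

    Losses are defined as -pnl. VaR is taken as the empirical `level` quantile of
    the losses (one of several common conventions) and ES as the mean of the
    losses at or beyond that quantile.
    """
    losses = sorted(-x for x in pnl)
    n = len(losses)
    # Index of the order statistic at the chosen level; the small tolerance
    # guards against floating-point error in level * n.
    idx = max(0, math.ceil(level * n - 1e-9) - 1)
    var = losses[idx]
    tail = losses[idx:]
    return var, sum(tail) / len(tail)

# Toy sample (hypothetical numbers): losses of 1, 2, ..., 100 currency units.
pnl_sample = [float(-k) for k in range(1, 101)]
var95, es95 = var_and_es(pnl_sample, level=0.95)
print(var95, es95)
```

Expected Shortfall is never smaller than Value-at-Risk at the same level, because it averages the losses in the tail beyond it.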


Emmanuel Payebto Zoua 
Subsampling estimates of the Lasso distribution.  Peter Bühlmann  Sep2011 
Abstract
We investigate the possibilities offered by subsampling to estimate the distribution of the Lasso estimator and to construct confidence intervals and hypothesis tests. Despite being inferior to the bootstrap in terms of higher-order accuracy in situations where the latter is consistent, subsampling has the advantage of working under very weak assumptions. Thus, building upon Knight and Fu (2000), we first study the asymptotics of the Lasso estimator in a low-dimensional setting and prove that, under an orthogonal design assumption, the finite-sample component distributions converge to a limit in a mode allowing for consistency of subsampling confidence intervals. We give hints that this result holds in greater generality. In a high-dimensional setting, we study the adaptive Lasso under the assumption of partial orthogonality introduced by Huang, Ma and Zhang (2008) and use the partial oracle result in distribution to argue that subsampling should provide valid confidence intervals for nonzero parameters. Simulation studies confirm the validity of subsampling for constructing confidence intervals, testing null hypotheses and controlling the FWER through subsampled p-values in a low-dimensional setting. In the high-dimensional setting, confidence intervals for nonzero coefficients are slightly anticonservative and false positive rates are shown to be conservative.


Hesam Montazeri 
Nonparametric Density and Mode Estimation for Bounded Data 
Rita Ghosh
Werner Stahel 
Aug2011 
Abstract
This thesis investigates the performance of various estimators in density estimation and mode estimation for bounded data. It is shown that many nonparametric estimators have boundary bias when the true probability density function has compact support. Because the boundary region might be a large percentage of the whole support, the boundary bias problem could be very serious in many complex, real-world applications. The widely accepted method for boundary bias correction in regression and density estimation is Automatic Boundary Correction [1]. This method is based on local polynomial fitting and needs no explicit correction for boundary effects. In the first part of this thesis, we consider applications of this method and Parzen's method to density estimation for some bounded univariate and bivariate data examples. It is shown that the local polynomial based method has no significant boundary bias in the considered examples. In addition, we give a new formula for the asymptotic bias of the density estimate based on local polynomial fitting which includes the bin width parameter. In the second part of this thesis, we consider mode estimation and examine several methods for bounded data. We show that many nonparametric mode estimation methods have boundary bias if the true global mode is located in the boundary region. Among the considered methods, mode estimation based on local polynomials shows superior performance and does not seem to have a considerable boundary bias problem.


Xiaobei Zhou 
Prediction Models for Serious Outcome and Death in Patients with Nonspecific Complaints Presenting to the Emergency Department 
Werner Stahel
Markus Kalisch 
Aug2011 
Abstract
This paper is based on the Basel Non-specific Complaints (BANC) study by Nemec, Koller, Nickel, Maile, Winterhalder, Karrer, Laifer, and Bingisser [2010]. Non-specific complaints (NSCs) are very common in emergency departments (EDs). However, emergency physicians rarely have experience in treating patients with NSCs. My research mainly focuses on the outcome variables serious condition (o ser) and death in ED patients with NSCs. My primary goal is to find a set of methods (classifiers) which classify o ser and death with high accuracy. Moreover, we try to find a series of risk factors (explanatory variables) which are highly correlated with the outcome variables. We do not find a classifier that clearly outperforms all others in all aspects. RandomForest, LogisticRegression and Adaboost turn out to be favorable according to different criteria. We find that dealing with missing values using imputation increases classification performance. Finally, we discuss SMOTE as an interesting but not fully satisfactory method for dealing with highly unbalanced data.


Marc Lickes 
Portfolio optimization if parameters are estimated  Hans Rudolf Künsch  Aug2011 
Abstract
In the following we discuss the effect of parameter estimation in the context of mean-variance portfolio optimization. We compare the efficient frontier under a certainty-equivalent approach and under the Bayesian posterior predictive distribution. We show that the sample estimators lead to an underestimation of risk, and we provide corrected estimators. In addition, we relax the assumption of identically distributed returns and introduce dynamic linear models for time-varying means and covariance matrices. The study concludes by analysing the performance of those estimators on a simulated multivariate normal data set and on a sample set of returns drawn from either the Dow Jones 30 or the S&P500.


David Lamparter 
Stability Selection for Error Control in High-Dimensional Regression  Peter Bühlmann  Aug2011 
Abstract
In the recent past, the development of statistical methods for high-dimensional problems has greatly advanced, leading to methods for model selection such as the lasso. However, the question of error control in high-dimensional settings has proven to be difficult. Recently, an approach called stability selection has been proposed to tackle the problem. It combines a method for model selection with subsampling to deliver a form of error control. In this thesis, some variants of stability selection are introduced. We tested whether error control actually holds for them. Furthermore, some conditions were isolated under which using these variants might have beneficial effects.


Marco Läubli 
Particle Markov Chain Monte Carlo for Partially Observed Markov Jump Processes  Hans Rudolf Künsch  Aug2011 
Abstract
The goal of the thesis was to investigate, understand and implement the so-called particle Markov chain Monte Carlo (PMCMC) algorithms introduced by Andrieu, Doucet, and Holenstein (2010) and to compare them to classical MCMC algorithms. The PMCMC algorithms are introduced in the framework of state space models. Their key idea is to use sequential Monte Carlo (SMC) algorithms to construct efficient high-dimensional proposals for MCMC algorithms. The performance of the algorithms is examined on a simple birth-death process in discrete time as well as on the stochastic Oregonator, an idealized model of the Belousov-Zhabotinskii nonlinear chemical oscillator. In summary, the PMCMC algorithms produce satisfactory results even when using only standard components, and they require comparably little problem-specific design effort from the user. On the other hand, the computational effort compared to classical methods is tremendous and a serious drawback.


Christian Sbardella 
High dimensional regression and survival models 
Peter Bühlmann
Patric Müller 
Aug2011 
Abstract
In high-dimensional regression we have too many parameters relative to the number of observations, which can lead to overfitting. One way to address this problem is to use the Lasso (Least Absolute Shrinkage and Selection Operator) to estimate the regression coefficients. This estimator has become very popular because, among other properties, it performs variable selection, in the sense that some estimated coefficients are exactly zero. We study the Lasso estimator, proving its consistency and deriving an oracle inequality in the case of squared error loss. In this thesis we also discuss survival analysis: this branch of statistics studies the failure times of an individual (or of a group of individuals) to conclude, for example, whether a new treatment is effective or whether a certain group of individuals has a higher survival probability than another. We mainly focus on the Cox proportional hazards model and the Weibull proportional hazards model. A natural question is: "Can we use the theory of the Lasso estimator in survival analysis?" We try to answer this question in the last chapter of this thesis (Chapter 5).
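
Why the Lasso sets some coefficients exactly to zero can be seen in the special case of an orthonormal design, where the Lasso solution is the soft-thresholded least squares estimate. The Python sketch below illustrates only this special case and is not taken from the thesis; it assumes the objective (1/(2n))·||y − Xβ||² + λ·||β||₁ with XᵀX/n equal to the identity, and the coefficient values are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z towards zero by lam, and set it to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least squares coefficients under an orthonormal design.
ols_coefficients = [2.5, -0.3, 0.1, -1.2]
penalty = 0.5

# Under the stated assumptions, the Lasso estimate is obtained coordinate by coordinate.
lasso_coefficients = [soft_threshold(b, penalty) for b in ols_coefficients]
print(lasso_coefficients)
```

Coefficients whose least squares estimate is at most λ in absolute value are removed from the model; the others are shrunk by λ. For general designs no such closed form exists and the estimate is computed numerically.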


Alexandra Federer 
Estimating networks using mutual information 
Marloes Maathuis
Markus Kalisch 
Jul2011 
Abstract
Identifying the relations between the variables of a dataset and visualizing these relationships in an independence network is important in many applications. We use the concepts of entropy and mutual information to estimate the dependency between two random variables. An advantage of this method over a correlation test is that mutual information also measures nonlinear dependency. To estimate the correlation graph of a dataset, we construct a statistical test for zero mutual information. Using ROC curves, we compare the performance of this method with that of the well-known method of estimating the correlation graph by defining a threshold for the mutual information.
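
The point that mutual information also captures nonlinear dependency can be illustrated with a small discrete example. The Python sketch below uses a plug-in (relative frequency) estimate, which is an assumption made here and not necessarily the estimator used in the thesis; the function names are illustrative. With X uniform on {-1, 0, 1} and Y = X², the Pearson correlation is zero although Y is a function of X.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of the mutual information I(X;Y), in nats, from two discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # Sum over observed cells of p(x,y) * log( p(x,y) / (p(x) p(y)) ).
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-1, 0, 1] * 100
ys = [x * x for x in xs]  # purely nonlinear dependency

print(pearson_correlation(xs, ys))  # zero correlation
print(mutual_information(xs, ys))   # clearly positive
```

For two independent samples the plug-in estimate is close to, but in finite samples typically slightly above, zero, which is why a formal test for zero mutual information is needed.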


Oliver Burkhard 
The Effect of Managed Care Models on Health Care Expenditure 
Marloes Maathuis
Markus Kalisch 
Jul2011 
Abstract
In this thesis we want to estimate the cost reduction effects of the managed care plans that were introduced in Swiss health insurance in 1996. Those plans limit the free choice of health care provider and come with reduced premiums. The data come from one insurer and cover the years 1997 to 2000. The challenge we face comes from the unobserved health of the insured, which can influence both the choice of a managed care plan and the costs incurred. We tackle the problem by generating an estimate of an auxiliary variable "latent health" using Tobit regression, which allows us to estimate the causal effect of managed care plans on costs using a two-part model. We then look at different possibilities to improve the results. We find that the total effect of managed care consists of a part that can be explained through the auxiliary variable and a part that cannot, indicating true cost reduction effects of the managed care models.


Niels Hagenbuch 
A Comparison of Four Methods to Analyse a Non-Linear Mixed-Effects Model Using Simulated Pharmacokinetic Data 
Martin Mächler
Werner Stahel 
Jun2011 
Abstract
Our study characterizes the behaviour of four different methods to estimate a nonlinear mixed-effects model in R. Three methods used a closed-form analytical solution of a system of ordinary differential equations (ODEs); the fourth method used the system of ODEs directly. The three methods were nlme() from the package of the same name, nlmer() from package lme4a and nlmer() from package lme4. For the ODEs, we used nlmeODE() along with nlme(). The two methods using nlme() do not differ much in their estimates, and non-convergences occurred. lme4a and lme4 provide fast and reliable (in terms of convergence) routines nlmer(), which have shortcomings as well: the standard errors of the fixed-effects parameters are over- or underestimated, inconsistently across the parameters, and the estimation of the standard deviations of the random effects does not always profit from an increase in observations. The results across three simulations reveal unpredictable patterns of the estimators of lme4a and lme4 with respect to coverage ratios, bias and standard error as functions of the number of observations. A limitation of this study is its limited number of simulation runs (250).


Stephanie Werren 
Pseudo-Likelihood Methods for the Analysis of Interval Censored Data  Marloes Maathuis  Mar2011 
Abstract
We study the work of Sen and Banerjee (2007), focusing on their method based on a pseudo-likelihood-ratio statistic to obtain pointwise confidence intervals for null hypotheses on the distribution function of the survival time in a mixed-case interval censoring model. Mixed-case interval censored data arise naturally in clinical trials and a variety of other applied fields. The setting of such a model is one where n independent individuals are under study and each individual is observed a random number of times at possibly different observation time points. At each observation time it is recorded whether an event has happened or not, and one is interested in estimating the distribution function of the time to such an event, also called failure. However, the time to failure cannot be observed directly, but is subject to interval censoring. That is, one only learns whether or not failure occurred between two successive observation time points. We extend the results of Sen and Banerjee (2007) to mixed-case interval censored data with competing risks. These are data where the failure is caused by one of R risks, where R ∈ N is fixed. We define a naive pseudo-likelihood estimator for the distribution function of the event that the system failed from risk r for each r = 1, 2, ..., R, analogous to Jewell, Van der Laan, and Henneman (2003). We prove consistency and derive the asymptotic limit distribution of the naive estimators, and we present a method to construct pointwise confidence intervals for these subdistribution functions based on the pseudo-likelihood-ratio statistic introduced by Sen and Banerjee (2007).


Karin Peter 
Marginal Structural Models and Causal Inference  Marloes Maathuis  Feb2011 
Abstract
We analyze data of an observational treatment study of HIV patients in Africa, collected by the Institute for Social and Preventive Medicine (ISPM) in Bern. In particular, we focus on patients who received first-line treatment and experienced immunologic failure, where immunologic failure might be an indication that the current treatment is no longer effective. Some of these patients were switched to a second-line treatment, according to the decision of their doctor (i.e. non-randomized). Based on these data, we are interested in estimating the causal effect of the switch to second-line treatment on survival. The data contain information on the treatment regime and the CD4 counts of the patients, where both of these are time dependent. A main challenge in the analysis is the CD4 count, which indicates how well the immune system is working. The CD4 count may influence future treatment and survival, making it a confounder that one should control for. On the other hand, the CD4 count is likely to be influenced by past treatment, making it an intermediate variable that one should not control for. We address this problem by using marginal structural models. Conceptually, this method weights each data point by its inverse probability of treatment weight (IPTW), creating data of an unconfounded pseudo-population. Our results indicate that switching to second-line treatment is beneficial, and slightly more so than an analysis with classical methods would imply.
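
The weighting idea can be sketched for the much simpler case of a single treatment decision and one discrete confounder. The Python code below is a simplified illustration under these assumptions, not the time-dependent analysis of the thesis; the function names and the toy data are hypothetical, and the toy outcome is constructed so that the true treatment effect equals 2.

```python
from collections import defaultdict

def naive_difference(records):
    """Difference of mean outcomes between treated and untreated, ignoring the confounder."""
    treated = [y for _, a, y in records if a == 1]
    untreated = [y for _, a, y in records if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

def iptw_difference(records):
    """Inverse-probability-of-treatment weighted difference of mean outcomes.

    records is a list of (confounder, treatment in {0, 1}, outcome) tuples.
    The treatment probability given the confounder is estimated by the
    observed proportion of treated units in each confounder stratum.
    """
    n_stratum = defaultdict(int)
    n_treated = defaultdict(int)
    for l, a, _ in records:
        n_stratum[l] += 1
        n_treated[l] += a
    weighted_sum = {0: 0.0, 1: 0.0}
    weight_total = {0: 0.0, 1: 0.0}
    for l, a, y in records:
        p_treated = n_treated[l] / n_stratum[l]
        weight = 1.0 / (p_treated if a == 1 else 1.0 - p_treated)
        weighted_sum[a] += weight * y
        weight_total[a] += weight
    return weighted_sum[1] / weight_total[1] - weighted_sum[0] / weight_total[0]

# Toy data: units with l = 1 are treated more often and also have higher outcomes.
records = []
for l, a, count in [(0, 0, 80), (0, 1, 20), (1, 0, 20), (1, 1, 80)]:
    records += [(l, a, 1.0 + 2.0 * a + 3.0 * l)] * count

print(naive_difference(records))  # confounded comparison
print(iptw_difference(records))   # comparison in the weighted pseudo-population
```

In this toy example the unweighted comparison overstates the effect, while the weighted pseudo-population, in which treatment no longer depends on the confounder, recovers the constructed value of 2.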


Reto Bürgin 
Pain after an intensive care unit stay 
Werner Stahel
Marianne Müller 
Feb2011 
Abstract
The present study examines pain occurring within twelve months after an intensive care unit (ICU) stay by focussing on three aspects: i) Which variables relate to pain after an ICU stay? ii) What is the longitudinal association of ICU-related variables and pain? And iii) do former ICU patients suffer more severe pain than comparable people who haven't been in an ICU recently? The first two aspects are examined with statistical analyses of data of 149 former ICU patients. These data contain three repeated pain measurements per patient (immediately after as well as six and twelve months after the ICU stay), and the provided explanatory variables are physiological, emotional and sociodemographic and were measured before, during and after the ICU stay. The third aspect is examined by using additional data of a control group of 153 subjects. Concerning the first aspect, stepwise regression model selection identified gender, pain before the ICU stay, four ICU-related variables, agitation and other illnesses as useful explanatory variables for pain after an ICU stay. Moreover, anxiety before the ICU stay and the length of stay in the ICU showed significant associations too. The second aspect, the longitudinal association, was examined with a repeated measurement regression model. This model showed a significant association between ICU-related variables and pain, both six and twelve months after the ICU stay (p-values: 0.005 and 0.025). Whilst the significance of these associations tends to decrease with the time that has elapsed since the ICU stay, the effect of variables which are not directly ICU-related, particularly that of pain before the ICU stay, tends to increase. The third aspect was again analysed with a repeated measurement regression model. This model demonstrated that ICU patients tend to suffer more severe pain than the subjects of the control group. However, this difference decreases as time passes from the initial ICU stay. As a result, twelve months after the ICU stay, the difference is no longer significant (p-value: 0.3). 
Finally, the identification of explanatory variables for pain turned out to be the principal challenge of this study. As the discovered explanatory variables are indicators which leave room for interpretation, both an extended discussion of the study results, also with experts from the medical sciences, and their comparison with similar studies were essential.


Weilian Shi 
Distribution of Realized Volatility of Long Financial Time Series 
Werner Stahel
Dr. Michel Dacorogna 
Feb2011 
Abstract
Insurance companies face a difficult situation as the regulators ask for the same level of solvency during a crisis [Zumbach et al., 2000]. This master thesis focuses on the log returns and volatilities of very long financial time series. We investigate the distributions and tail behaviors of both log returns and volatilities, where the Hill estimator is used to estimate the tail index of the volatility distribution. Taking the definition that a crisis occurs when the GDP drops for two consecutive quarters, the financial crisis has been identified as the biggest crisis after the Second World War. A linear regression model is fitted to analyze the connection between realized volatilities and the GDP log return before and after 1947, respectively. The negative correlation between them suggests that volatility tends to increase when the economy is experiencing a recession.
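
The Hill estimator mentioned above estimates the tail index from the largest observations of a positive sample. The following Python sketch shows one standard form of it; the function name hill_tail_index, the simulated Pareto data and the choice of k are assumptions made for illustration and are not taken from the thesis, where the choice of k for real volatility data would need separate care.

```python
import math
import random

def hill_tail_index(sample, k):
    """Hill estimate of the tail index alpha based on the k largest observations.

    sample must contain positive values and k must satisfy 0 < k < len(sample).
    """
    xs = sorted(sample, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    threshold = xs[k]  # the (k+1)-th largest observation
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess

# Simulated check on Pareto data with known tail index 3 (illustrative only).
random.seed(0)
pareto_sample = [random.paretovariate(3.0) for _ in range(20000)]
alpha_hat = hill_tail_index(pareto_sample, k=2000)
print(alpha_hat)
```

A smaller tail index means heavier tails. In practice the estimate is usually inspected over a range of k, since it trades bias against variance.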

2010
Student  Title  Advisor(s)  Date 

Alain Helfenstein 
Forecasting OD-Path Booking Data for Airline Revenue Management using a Random Forest Approach  Peter Bühlmann  Aug2010 
Abstract
A main issue of an airline's revenue management is to calculate an accurate forecast of the future demand for bookings. Poor estimates of demand lead to inadequate inventory controls and suboptimal revenue performance. In this thesis we describe the structure of the booking data in the airline industry that needs to be forecasted and discuss the current Bayesian forecasting model implemented by Swiss Revenue Management. We then implement new forecasting models using different random forest (regression) approaches and discuss the accuracy of the predicted demand of all models. As a further result, we illustrate how an implementation of a regression using the random forest algorithm can fail.


Fabio Valeri 
Sample Size Calculation for Malaria Vaccine Trials with Attributable Morbidity as Outcome  Marloes Maathuis  Aug2010 
Abstract
Malaria is a major public health issue. Great efforts have been put into research to develop a vaccine against malaria. Problems arise in estimating vaccine efficacy. Standard methods such as the cut-off method and logistic regression may give biased efficacy estimates. An alternative approach which avoids bias is to apply a Bayesian latent class model to estimate attributable risk. One problem with this probabilistic approach is that it is not clear how big a trial would need to be in order to have power comparable to that of the cut-off method. To assess the size of a trial using this approach, a hypothetical parasite density of a population has been constructed based on a latent class model and some other constraints. Samples have been drawn from these true values, measurement errors simulated and vaccine efficacy estimated. This has been done for three different types of vaccine mechanism. For the vaccines we considered, to reach a power of 80% the probabilistic method needs 3 to 12 times as many individuals as the cut-off method. Whereas the probabilistic method yields unbiased efficacy estimates, the standard methods show large or very large bias for two vaccine types. If the vaccine type is not well defined, standard methods to estimate vaccine efficacy could produce strongly biased estimates, which can result in a rejection of the vaccine. The probabilistic approach would avoid bias, but because of the larger trial size needed for the same power the costs will be higher.


Doriano Regazzi 
The Lasso for Linear Models with within Group Structure  Sara van de Geer  Aug2010 
Abstract
In a high-dimensional regression model, we consider the problem of estimating a grouped parameter vector. We assume there is within-group structure, in the sense that the ordering of the variables within groups expresses their relevance. In this setting, we study two group lasso methods: the structured group lasso and the weighted group lasso. Our work consists in the implementation of these two methods in R. First, we prove the convergence of their algorithms. Then, we run simulations and compare the two estimators in various situations.


Anna Drewek 
A Linear Non-Gaussian Acyclic Model for Causal Discovery  Marloes Maathuis  Jul2010 
Abstract
The discovery of causal relationships between variables is important in many applications. Shimizu et al. proposed a method to discover the causal structure from observational data in linear non-Gaussian acyclic models, abbreviated LiNGAM (see Shimizu et al. 2006). We analyze their approach and empirically test the strictness of the non-Gaussianity assumption by approximating the Gaussian distribution with the t-distribution. Moreover, we compare the performance of the LiNGAM algorithm to that of the PC algorithm (Spirtes et al. 2000). Finally, a combination of both algorithms is discussed (Hoyer et al. 2008) that enables the detection of causal structure in linear acyclic models with arbitrary distributions.


Rita Achermann 
Effect of proton pump inhibitors on clopidogrel therapy  Werner Stahel  Mar2010 
Abstract
In the present study, the interaction between clopidogrel and proton pump inhibitors (PPI) is investigated. A PPI might reduce the antiplatelet function of clopidogrel and increase the risk of a second myocardial infarction. Patients with both drugs prescribed have a higher risk for such an event, but whether this is due to individual risk factors or a reduced effect of clopidogrel is an open question. The present study aims to assess the effect due to an interaction between the two drugs using health insurance data. Methods to adjust for confounders in observational data were applied, and new graph theory developments in combination with probability theory were evaluated. The study population consisted of 4623 patients with prescribed clopidogrel, a hospital stay at most 30 days before the first administration of clopidogrel, and health insurance coverage with Helsana. Hospitalization due to a cardiac event and death were used as the clinical endpoints to assess whether proton pump inhibitor prescription was associated with a higher risk of rehospitalization. A graph was constructed based on prior knowledge to derive theoretically whether the effect was identifiable. Causal inference rules applied to this knowledge-based graph showed that the effect is identifiable when observational data are used. Graphs estimated from data did not disprove these findings. The effect of PPI on clopidogrel was calculated from the interventional distribution defined by the graph. A standard statistical technique, Cox proportional hazards regression, was also applied, once with covariates to adjust for confounding and once with a propensity score. An instrumental variable approach was not feasible, since no instrument was found. Patients with concomitant use of clopidogrel and proton pump inhibitors had a higher risk of rehospitalization due to a cardiac event, by a factor of 1.33 (95% CI: 1.10, 1.61), compared to patients with no prescription for PPI. 
Important for the analysis was that some patients had a PPI administered together with clopidogrel but had no PPI prescription before. Treatment guidelines recommend PPI to prevent stomach bleeding, a side effect caused by clopidogrel. It is assumed that these patients had no higher individual risk for a recurrent myocardial infarction than patients with no PPI prescription. Hence, these patients can be compared to patients with no PPI prescription before and during the study phase to estimate the effect. Comparison of the baseline characteristics for 23 drug groups, as well as age and gender, revealed only minor differences. Results calculated from the interventional distribution defined by the graph were similar to those of the Cox regression. Finally, the propensity score used as a stratifier in a Cox proportional hazards regression yielded similar results as well. As alternative treatments for PPI are available, patients should not take these two drugs together.


Armin Zehetbauer 
A Statistical Interest Rate Prediction Model  Werner Stahel  Mar2010 
Abstract

2009
Student  Title  Advisor(s)  Date 

Nicoletta Andri 
Using Causal Inference for Identifying Coresets of the ICF  Marloes Maathuis  Sep2009 
Abstract
The World Health Organisation (WHO) has a strong interest in reducing the ICF catalogue to a smaller set of items for reasons such as time management and complexity. In this context, we analyse two WHO data sets concerning rheumatism/arthritis and chronic widespread pain, consisting of variables from the ICF catalogue. For this variable selection process we use the approach of Maathuis, Kalisch and Bühlmann, which uses graph estimation techniques in combination with a causal method called back-door adjustment. We show under which conditions this approach can also be applied to dichotomized data sets and how interactions between the variables can be handled. Significance of the estimates is assessed using permutation tests and a method called stability selection presented by Meinshausen and Bühlmann. Finally, the causal results are discussed and compared to associational results.


Simon Figura 
Response of Swiss groundwaters to climatic forcing and climate change: A preliminary analysis of the available historical instrumental records 
Werner A. Stahel
Rolf Kipfer
David M. Livingstone 
Sep2009 
Abstract
Research on groundwater quality over long-term periods has scarcely been done in the past. In this thesis groundwater temperature is used as an indicator for groundwater quality. Temperature measurements of 8 river-recharged and 6 rain-fed groundwaters were analysed. Some data sets also contained records of water level, spring discharge, pumping amount and oxygen concentration. The length of the records ranged from 20 to 52 years. Plots as well as trend and change-point tests were used to describe the temperature developments. Correlations and regression models were established to analyse the impact of climatic forcing in the form of air temperature and the impact of groundwater quantity variables on groundwater temperature. The behaviour of oxygen concentrations was also briefly analysed. Most of the river-recharged groundwaters showed an increase in temperature of 1 to 1.5 °C in the last 30 years. More than half of this warming took place in the period 1987 to 1993. Results indicate that this warming was due to climatic forcing. The temperature of the rain-fed groundwaters showed small to no increase. Some properties of the air temperature development can be recognized in the temperature of these groundwaters, but a possible response of rain-fed groundwaters to climatic forcing is outweighed by other factors. Measurements of oxygen concentrations were available at 4 sites. Decreasing concentrations at 3 measurement sites are likely caused by higher microbiological activity and lower oxygen solubility as a result of higher temperatures. This theory is contradicted by the increasing oxygen concentration at the fourth measurement site.


Lukas Rosinus 
Fehlende Werte EMAlgorithmus und Lasso in hochdimensionaler linearer Regression  Peter Bühlmann  Aug2009 
Abstract
Several estimators for high-dimensional linear regression problems with missing values are proposed and studied. Using the EM algorithm, the observed negative log-likelihood together with a Lasso penalty on the regression parameters $\beta$ is minimized. The Lasso penalty yields sparse estimates of the regression coefficients. In simulation studies, the methods are examined on various multivariate normal models. The MissRegr method turns out to give the best results. With the EM algorithm, the inverse covariance matrix $K = \Sigma^{-1}$ is estimated optimally in the likelihood sense. With the Lasso penalty, the regression parameters are then also estimated well, even with a high proportion of missing data.


Philipp Stirnemann 
Unmatched Count Technik: Zum Zusammenhang zwischen Anonymität und statistischer Effizienz 
Werner A. Stahel
B. Jann 
Aug2009 
Abstract


Rudolf Dünki 
Robuste Variogrammschätzung und robustes Kriging  Hans Rudolf Künsch  Aug2009 
Abstract
The thesis describes the development of robust algorithms for the analysis of geostatistical data. Three algorithms were implemented in R, and each of them allows for simultaneous estimation of the regression parameters and the covariance parameters. All three algorithms returned consistent results. Two of them are implemented as a package of R functions. The treatment of the nugget effect makes the essential difference between the two algorithms: the first algorithm treats the nugget as a part of the estimation of the covariance parameters. The other algorithm treats the nugget as a part of the regression problem. This has advantages in the analysis of polluted data. Sets of 50 simulations with different degrees of added pollution were analysed. The resulting parameter estimates agreed with the true values within the statistically tolerable range. The exception was the set containing the most polluted data. The estimation of the range parameter was somewhat problematic when performed with small Huber constants, i.e. the resulting range displayed an upward bias. In contrast to this, the nugget estimate was improved when choosing a small Huber constant. The algorithm treating the nugget effect as a part of the regression problem returned more stable results in the case of a high degree of pollution. A Huber constant of 1.333 ... 1.666 appeared appropriate in these cases. An increase in stability was also visible in the behaviour of influence functions. The algorithms were applied to data on contamination of soils with Cu in the surroundings of a metal smelter in Dornach. It could be shown that the estimated parameters allowed for kriging estimates which are comparable with earlier analyses. Despite this, it was not possible to obtain unambiguous parameter estimates. The reason lies in the existence of a very flat and extended optimum region. This allows for fitting models with comparable goodness-of-fit characteristics for clearly distinct parameter sets.


Thomas André Rauber 
Parameter risk in reinsurance  Peter Bühlmann  Jul2009 
Abstract
In this thesis we consider the parameter uncertainty that arises in different pricing areas in reinsurance. By parameter risk we mean the risk of not estimating the parameter properly. We mainly look at parameter risk in the severity distribution. We distinguish three ways of characterising uncertainty. We first replace the parameter that has to be estimated by a random variable and derive some analytical results. Then we look into maximum likelihood estimators and use the result that they are asymptotically normally distributed. For some examples these asymptotic results are not accurate enough. For these cases we classify the uncertainty using the bootstrap. Finally we specify where uncertainty arises in Experience, Exposure and Credibility Rating in practice. We see an example of Credibility Rating which blends Experience and Exposure Rating by minimizing the parameter risk.


Alessia Fenaroli 
Propagating Quantitative Traits in Gene Networks  Marloes Maathuis  Feb2009 
Abstract
Gene networks have been created to extend the knowledge of gene functions in a specific organism. Such networks describe connections between genes involved in the same biological process. McGary, Lee and Marcotte have related a gene network of baker's yeast, called YeastNet, to a data set of morphological trait variation, the SCMD, and have defined a method which assigns a score to each gene of the network in order to predict its activity. The researchers tested the predictability of YeastNet with ROC curves and the respective AUC values, computed by leave-one-out cross-validation, and obtained a median value of 0.615. Our contribution to this study includes: the definition of other scoring methods that take into account the quantitative data given by the SCMD data set, in contrast to the dichotomization of these data applied by McGary et al.; some new rules to predict the activity of each gene based on its score, which are more complicated than the simple comparison of the scores with a cutoff adopted by McGary et al. but more efficient; and a different procedure, 10-fold cross-validation, for the analysis of the network predictability. Thanks to these changes we have improved the YeastNet prediction quality by 5 percentage points, to a median AUC of 0.665.


Simon Lüthy 
Merkmalswichtigkeit im Random Forest  Peter Bühlmann  Feb2009 
Abstract
In bioinformatics and related fields such as statistical genetics and genetic epidemiology, predicting categorical response variables (such as a patient's disease status or the properties of a molecule) and reliably identifying the relevant features are both important tasks. Typical data sets in genetics contain hundreds or even thousands of genes or features, while comparatively few observations are available on which to base the predictions and identifications. The Random Forest algorithm handles this problem very well. In this thesis, we first explain how a decision tree is grown, since such trees are used to generate entire prediction forests, the so-called Random Forests. We briefly describe how such a forest is generated and define the permutation accuracy importance as a measure of feature importance. In a second step, we point out the problems that arise when the permutation accuracy importance is applied to data with strongly correlated variables or with variables that differ in their number of categories. We present the solution proposed by Strobl et al. (2007), who advocate a different algorithm for generating the forest. We introduce two further measures of feature importance, show their behaviour on different data models by means of simulations and compare them with the permutation accuracy importance. In our opinion, the permutation accuracy importance in the Random Forest remains a strong and credible tool for variable selection.
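The permutation accuracy importance mentioned in this abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: it assumes a generic `predict` function and a fixed evaluation set instead of the per-tree out-of-bag samples used inside a Random Forest, and the names `accuracy`, `permutation_importance`, `X`, `y` and `predict` are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Average drop in accuracy after randomly permuting one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        # Rebuild every row with the permuted value in the chosen column.
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats


# Toy data (assumption): the label depends on feature 0 only.
_rng = random.Random(1)
X = [[_rng.random(), _rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]


def predict(row):
    """Stand-in for a fitted classifier that uses feature 0 only."""
    return int(row[0] > 0.5)
```

In this toy setting, permuting the irrelevant feature 1 leaves the accuracy unchanged, while permuting feature 0 roughly halves it. The bias under correlated predictors that the abstract discusses is not reproduced by this sketch.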


Patric Müller 
Image restoration Blind deconvolution for noised Gaussian blur  Sara van de Geer  Feb2009 
Abstract
Blind deconvolution is an inverse problem with one or more unknown parameters. Nowadays, one of the more common practical applications of deconvolution is in image analysis, where it is used to determine how to restore blurred images. To recover the original image, however, we first have to estimate the unknown parameters with which the image was blurred. In recent years, this topic has attracted significant attention, resulting in numerous studies. This thesis studies blind deconvolution from both a theoretical and a practical point of view. We provide the necessary tools that we use to improve the quality of blurred and noisy pictures. Our results give rise to algorithms that compute estimates of the aforementioned unknowns. The applicability of the explored techniques is then demonstrated by means of several practical examples. The thesis concludes with a brief qualitative analysis of the limits of deconvolution with regard to image restoration. To this end we show that the process is ill-conditioned. Thus, it might be at best inefficient, and at worst impossible, to retrieve the original picture from a blurred one.

2008
Student  Title  Advisor(s)  Date 

Diego Colombo 
Goodness of fit test for isotonic regression  Marloes Maathuis  Jul2008 
Abstract
We study the work of Durot and Tocquet (2001), who proposed a new test of the hypothesis $H_0: f = f_0$ versus the composite alternative $H_n: f \neq f_0$, under the assumption that the true regression function $f$ is monotone decreasing on $[0, 1]$. The test statistic is based on the $L_1$-distance between the isotonic estimator $\hat{f}_n$ of $f$ and the given function $f_0$, since a centered and normalized version of this distance is asymptotically standard normal under the null hypothesis $H_0$, provided that the given function $f_0$ satisfies some regularity conditions. The main reason for studying this asymptotic normality is to determine the asymptotic power of the test under the alternative $H_n: f = f_0 + c_n \varepsilon_n$. The idea is to study the minimal rate of convergence for $c_n$ such that the test has a prescribed asymptotic power. Durot and Tocquet show that this minimal rate is $n^{-5/12}$ if $\varepsilon_n$ does not depend on $n$ and $n^{-3/8}$ if it does. Our contribution is a more detailed explanation of the models and of the main results, and the insertion of some additional steps in the proofs. To check these theoretical results in simulations, as Durot and Tocquet did, we write new R code. Namely, we perform a simulation study to compare the power of this test with that of the likelihood ratio test for the case where $f_0$ is linear, and we also compare these simulation results to the ones obtained by Durot and Tocquet. Moreover, we propose extra simulations for the power of another test not treated by Durot and Tocquet, and we will see that it is always more powerful than the one they studied. Finally, we conduct a new simulation study in the case where the given monotone function $f_0$ is quadratic.


Alain Weber 
Probabilistic predictions of the future seasonal precipitation and temperature in the Alps  Hans Rudolf Künsch  Jul2008 
Abstract
This work presents probabilistic predictions of the future (2071 to 2100) seasonal precipitation and temperature in the Alps. The predictions combine climate forecasts from different numerical simulations in a Bayesian ensemble approach. It is well known that these climate simulations have systematic errors, which should be taken into account. Unfortunately, the simulations are driven by boundary conditions which are very different from those of the last century. This is a problem because there exist no comparable data from the past to estimate the bias of a climate model under similar boundary conditions. It becomes necessary to rely on assumptions which can hardly be proven right or wrong. Recently, Christoph Buser showed that predictions of seasonal temperature in the Alps differ for two reasonable assumptions. In this work we compare predictions of precipitation for the same two assumptions. In addition, one of the corresponding Bayesian models is extended to predict the bivariate distribution of precipitation and temperature.


Patricia Hinder 
Additive Isotonic Regression  Sara van de Geer  Mar2008 
Abstract
In this master's thesis we study the isotonic regression model for one or more covariates. We first give an introduction to the one-dimensional regression problem, whose solution is calculated using the pool-adjacent-violators algorithm (PAVA). We then extend the regression problem to multiple covariates and assume an additive model. The functions are estimated with a classic backfitting estimator. We compare the backfitting estimator with an oracle estimator and discuss how the estimates can be smoothed by applying a kernel smoother to the isotonized data. The monotonicity of the kernel smoother is guaranteed by using a log-concave kernel. We also study another approach to the additive isotonic regression problem that is based on boosting. The functions are expanded into a sum of basis functions and a componentwise boosting algorithm is applied.
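The pool-adjacent-violators algorithm named in this abstract can be sketched in a few lines of Python. This is a minimal illustration of the standard one-dimensional algorithm for a non-decreasing least-squares fit, not the implementation used in the thesis; the function name `pava` and the optional weights `w` are assumptions.

```python
def pava(y, w=None):
    """Non-decreasing least-squares fit to y via pool-adjacent-violators."""
    n = len(y)
    if w is None:
        w = [1.0] * n
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the block means back to one fitted value per observation.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

For example, `pava([1, 3, 2, 4])` pools the violating pair 3, 2 into their mean. A backfitting estimator for the additive model would, roughly, apply such a fit cyclically to the partial residuals of each covariate.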


Manuel Koller 
Robust Statistics: Tests for Robust Linear Regression  Werner Stahel  Mar2008 
Abstract
Analyzing data using statistical methods means breaking reality down to a mathematical framework, a model. Often this model is based on strong assumptions, for example normally distributed data. Classical statistics provides methods that fit the chosen model perfectly. But in reality the model assumptions usually hold only approximately. Anomalies and untrue assumptions might render the statistical analysis useless. Robust statistics aims for methods that are based on weaker assumptions and thus allow small deviations from the classical model. However, robust statistics is not restricted to the use of robust estimation methods alone. It also extends to methods used to draw inference. In the past, there has not been much research focused on robust tests. In this thesis we study the quality of inference performed by two state-of-the-art robust regression procedures. We then propose a design-adapted scale estimator and use it as part of a new robust regression estimator, the MMD-estimator. This new estimator improves the quality of robust tests considerably. A simulation study is performed to compare the performance of the mentioned regression procedures in combination with various covariance matrix estimators. We found large differences between the tested methods. Some methods were able to approximately reach the desired level in the corresponding tests for most tested scenarios, while others produced estimates that were only useful in specific large-sample settings. We infer that the covariance matrix estimator needs to be carefully selected for every new scenario.


Philipp Rütimann 
Variablenselektion in hochdimensionalen linearen Modellen mittels Schrumpfvarianten des PCAlgorithmus  Peter Bühlmann  Mar2008 
Abstract
This thesis deals with variable selection in high-dimensional linear models. It adopts the approach of Professor Peter Bühlmann and Markus Kalisch based on the PC algorithm. The approach is modified so that the correlations are computed with various shrinkage estimators instead of the maximum likelihood estimator. These new variants of the PC algorithm are compared with the standard variant using ROC plots and further graphical comparison methods. The thesis also addresses dimension reduction, which is used to reduce the dimension of the high-dimensional linear models. It turns out that this clearly improves the variants of the algorithm. This led to the idea of using dimension reduction for the robust PC algorithm as well. However, this does not yield the same positive results as for the shrinkage variants.


Bruno Gagliano 
Asymptotic theory for discretely observed stochastic volatility models  Sara van de Geer  Feb2008 
Abstract
This thesis investigates the estimation of parameters for discretely observed stochastic volatility models. The main concern is to give a general methodology for estimating the unknown parameters from a discrete set of observations of the stock price. Two estimation methods, minimum contrast and estimating functions, are introduced, and it is shown that, under certain assumptions, the estimators obtained are consistent and asymptotically normal. Finally, a series of simulations is performed to confirm the results, and the methods are applied to real-world stock data.


Sandra König 
Analyse von Skisprungdaten  Sara van de Geer  Feb2008 
Abstract
This thesis investigates which factors significantly increase jump distance in ski jumping. As the simplest model, a linear regression is fitted. As expected, it shows that wind, in-run speed and weight influence the distance of a jump. For a more detailed analysis, generalizations of the linear model are used. In particular, the mixed effects model shows that there are jumper-specific effects (such as the feel for flight); in addition, isotonic regression is considered, as well as the possibility of checking the isotonicity of a function by means of multiscale testing. Since the wind in particular repeatedly seems to influence the outcome of competitions, its influence is examined in more detail using measurements at additional locations. It turns out that the wind plays an important role especially at the take-off table. A further open question was whether podium finishes at the Junior World Championships are an indicator of later success. Since there are as many examples for this hypothesis as against it, no intuitive answer was available, unlike in the preceding investigations. The nature of the data makes testing difficult, so a regression analysis is again carried out. A question that is mathematically difficult to assess is when the judges, who score the jump subjectively, are biased. The mixed effects model provides one possible description of this very complex situation.


Francesco Croci 
The World of Volatility Estimators  Sara van de Geer  Jan2008 
Abstract
This thesis investigates the estimation of the volatility of an asset return process. The main concern is to give a general overview of how to estimate volatility nonparametrically and efficiently. First, I introduce the basic notions of stochastic theory and a special and unusual limit theorem that I use throughout the thesis. Then, I deal with several volatility estimators, from the simplest and worst one, the so-called realized volatility (RV) estimator, to the best estimator so far, the so-called multiscale realized volatility (MSRV) estimator, which converges to the true volatility at the rate $n^{-1/4}$. Finally, in the last section, I consider microstructure as an arbitrary contamination of the underlying latent securities price through a Markov kernel $Q$. The main result there is that, subject to smoothness conditions, the two-scales realized volatility (TSRV) is robust to the form of the contamination $Q$.
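To make the difference between the realized volatility and the two-scales estimator concrete, here is a minimal Python sketch. It follows the commonly used subsample-and-average construction with bias correction; the thesis may use a different normalization, and the function names `realized_variance` and `tsrv` as well as the parameter `K` (number of subgrids) are assumptions.

```python
def realized_variance(log_prices, step=1, offset=0):
    """Sum of squared log-returns on the subgrid offset, offset+step, ..."""
    grid = log_prices[offset::step]
    return sum((b - a) ** 2 for a, b in zip(grid, grid[1:]))


def tsrv(log_prices, K):
    """Two-scales estimator: averaged sparse RV minus a noise-bias correction."""
    n = len(log_prices) - 1  # number of returns at the highest frequency
    rv_all = realized_variance(log_prices)
    # Average the realized variances over the K shifted sparse subgrids.
    rv_avg = sum(realized_variance(log_prices, K, k) for k in range(K)) / K
    n_bar = (n - K + 1) / K  # average number of returns per subgrid
    # rv_all is dominated by microstructure noise and is used to remove
    # the noise-induced bias of rv_avg.
    return rv_avg - (n_bar / n) * rv_all
```

The plain estimator `realized_variance(log_prices)` uses all observations and is inflated by microstructure noise at high sampling frequencies, which is why the abstract ranks it lowest. The two-scales version combines a slow and a fast time scale to cancel that bias.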

2007
Student  Title  Advisor(s)  Date 

Sonja Angehrn 
Random Forest Klassifikator zur Erkennung von Alarmsignalen in Sicherheitssystemen  Peter Bühlmann  Aug2007 
Abstract
In this diploma thesis, the three classifiers logistic regression, CART and Random Forest are examined for their suitability in a detection algorithm that has to decide whether various sound signals belong to the class Alarm or Normal. It turns out that, of the three classifiers, the Random Forest algorithm is best suited to this problem. This classifier is then compared with an existing HMM algorithm in various scenarios. Several features are available for implementing the classifiers. For the Random Forest and the HMM algorithm, the thesis examines which selection of these features yields the smallest possible error rate.


Sarah Gerster 
Learning Graphs from Data: A Comparison of Different Algorithms with Application to Tissue Microarray Experiments  Peter Bühlmann  Aug2007 
Abstract
A new algorithm (logilasso) to learn network structures from data has been introduced in “Penalized Likelihood and Bayesian Methods for Sparse Contingency Tables with an Application to Full-Length cDNA Libraries” (Dahinden, Parmigiani, Emerick and Buehlmann, 2007). The main idea is to study the interactions between the variables by performing model selection in log-linear models. In this master's thesis, a few other graphical model fitting algorithms are compared to the logilasso. The chosen algorithms are the PC, the Max-Min Hill-Climbing (MMHC) and the Greedy Equivalence Search (GES) algorithms. They are all based on different approaches to fitting a graphical model. These methods are presented and the algorithms are described. Their performance, in the sense of their ability to reconstruct a graph, is tested on simulated data. The algorithms are also applied to Renal Cell Carcinoma data to illustrate a typical domain of application for such algorithms.


Lorenza Menghetti 
Density estimation, deconvolution and the stochastic volatility model  Sara van de Geer  Aug2007 
Abstract
The stochastic volatility model contains a stochastic volatility process whose density is to be estimated from observations at discrete time instants with vanishing gaps. The volatility density is estimated from the logarithm of the squared process with the deconvolving kernel density estimator. Since the error density is supersmooth, the convergence is very slow. This thesis studies the theoretical and empirical behaviour of the bias and the variance of the estimator. The empirical study suggests choosing a bandwidth smaller than the theoretical bandwidth and confirms the slow rate of convergence.


Giovanni Morosoli 
Optimale Anpassung einer Portfolioschadenhöhenverteilung an ein individuelles Risiko  Peter Bühlmann  Aug2007 
Abstract
The introduction explains the aim of this diploma thesis and presents the available claims data. In essence, our task is to compute an optimal weight for the individual and the portfolio claim size distributions. Chapter 2 analyses the problem of so-called data fitting; in other words, given a sample of claim sizes, one tries to find a suitable distribution that could have generated the given claims. In particular, we examine two insurance-specific methods: the extension method (Erweiterungsverfahren), which is a generalization of the maximum likelihood method, and the join operation, which can be interpreted as a first "rough" adjustment of a portfolio claim size distribution to an individual risk. The third chapter uses the chi-squared test to determine an adjustment of a portfolio distribution to an individual risk. This adjustment, however, depends strongly on the chosen significance level; in Chapter 4 we therefore analyse the problem of choosing a suitable significance level using a kind of "credibility approach". In the last chapter we discuss the results obtained and give some pointers for possible future developments.


Nicolò Valenti 
Regression under shape restriction and the option price model  Sara van de Geer  Aug2007 
Abstract
Many types of problems are concerned with identifying a meaningful structure in real-world situations. A structure involving orderings and inequalities is often useful since it is easy to interpret, understand, and explain. In many settings, economic theory only restricts the direction of the relationship between variables, not the particular functional form of their relationship. Let c(X) denote the call price as a function of the strike price X. By the no-arbitrage principle, c is a convex, decreasing function of X, i.e. it satisfies certain shape constraints. It can be argued that economic theory places virtually no other restrictions on c, and that the estimation of the state-price density should be carried out using only these shape restrictions (and some bounds on the first and second derivatives). Further smoothness assumptions or parametric assumptions may not be justified and carry the risk of misspecifying the state-price density. Our work consists of studying estimation under such shape restrictions. We first consider monotone regression function estimation, the so-called isotonic regression problem. Second, we analyse the problem of convex regression estimation. Then we build a nonparametric estimator of the call price that is decreasing and convex for small samples.


Enrico Berkes 
Statistical Analysis of ChIP on Chip Experiments  Peter Bühlmann  Jul2007 
Abstract
With the end of the Human Genome Project, the challenge of the emerging discipline of modern biology is to determine the role of the newly characterised genes in man and model organisms. This new sequence data represents, for the first time, a realistic opportunity to link the function (and dysfunction) of specific tissues and cell types to the activity of the genes expressed within them, and so to identify genes and gene products that could act as therapeutic targets. The underlying strategy in the identification and functional characterisation of target genes will rely heavily on the ability to perform high-throughput analysis of gene expression, at both the tissue and cellular level. Gene expression is regulated by proteins, specific to every gene, that bind to the target gene and promote or repress its transcription. In recent years two methods have been refined in order to study the gene regulation mechanism: microarray and ChIP-on-chip experiments. However, the large quantity of data and the uncertainties due to noise in these experiments make the interpretation of the results difficult and laborious. For this reason many statistical methods have been developed to extract the most relevant information from these data. Our work consists of modifying Motif Regressor, an existing method for analyzing data from microarray experiments, and using this new algorithm to search for the transcription factor DNA-binding motifs of HIF-1alpha, a protein involved in gene regulation under hypoxia. The results show that our algorithm is fast and effective, does not require many biological experiments, and gives important suggestions as to where future biological research could be directed.


Jürg Schelldorfer 
Multivariate Analyse linearer Mischungen mit bekannten potenziellen Quellenprofilen  Werner Stahel  Jul2007 
Abstract
The concentration of certain pollutants in the air can be approximated mathematically by a linear mixture of contributions from different sources. In this context, the task of multivariate statistics is to estimate, with suitable methods, the number of sources present, their emission profiles and their activities (as a function of time). In this thesis we present methods for using knowledge about possible prespecified source profiles to improve the data analysis in a linear mixing model.


Miriam BlattmannSingh 
Nonparametric volatility estimation with a functional gradient descent algorithm for univariate financial time series  Peter Bühlmann  Mar2007 
Abstract


Claudia Soldini 
Variablenselektion in hochdimensionalen Regressionsmodellen bei nichthomogenen Daten: Die Nutzung von Bacillus subtilis zur Synthese von Riboflavin  Peter Bühlmann  Mar2007 
Abstract
This thesis is part of an interdisciplinary research project. Its aim is to identify important mechanisms involved in the production of a vitamin by a bacterium. The analysis relies on data from a gene expression study of a pharmaceutical company. Since a large number of genes is involved, regression methods are applied that are suitable for high-dimensional problems and can select variables. The genes are treated as predictors and the amount of vitamin produced as the response variable. The experiments were carried out under different conditions, so the data set is non-homogeneous. The amount of vitamin produced varies with the bacterial strains examined and with the time at which the measurements were taken. It is therefore of interest to identify the most important genes or groups of genes responsible for such differences. For this purpose, statistical tests are carried out both on individual genes and on groups of genes. The groups are formed by a cluster analysis with correlation as the similarity measure.


Nicolas Städler 
Statistische Modellentwicklung für nichtinvasive Blutzuckermessung mittels Sensoren  Werner Stahel  Mar2007 
Abstract
Impedance signals for the non-invasive measurement of blood glucose concentration are affected by a multitude of other influencing factors (temperature, sweat, blood circulation, etc.). To quantify the influence of such perturbing parameters on the impedance signals, they are measured with sensors. The aim of this thesis is to develop, by means of linear regression and various variable selection methods, models that are as generally valid as possible and that predict the glucose concentration as a function of the impedance signals and other influencing factors. In the first part of the thesis, the classical selection methods forward stepwise, backward stepwise and "all subsets" are used. Since the explanatory variables exhibit enormous measurement inaccuracies, they are smoothed in a next step. In the course of the work it becomes apparent that ordinary selection criteria such as AIC and Cp lead to extremely overfitted models. In a decisive step, a criterion better adapted to the special structure of the data is proposed as an alternative to AIC and Cp. With the new criterion, both an adaptive Lasso and a forward stepwise selection are carried out. Both methods lead to very similar and reasonable models with an R² of 0.73. Special attention is paid to the adaptive Lasso. The analysis shows that data-dependent weighting in the adaptive Lasso brings considerable progress over the ordinary Lasso. Since the functional form of the model is unknown a priori, an analysis with "Multivariate Adaptive Regression Splines (MARS)" is also used. This method, however, proves unsuitable.

2006
Student  Title  Advisor(s)  Date 

Massimo Merlini 
Identifikation relevanter Mechanismen der Vitaminproduktion  Peter Bühlmann  Sep2006 
Abstract
Systems biology is a young interdisciplinary science that aims to understand biological organisms in their entirety. This thesis presents a research project that studies the production of a particular vitamin by a microorganism. The goal is to identify the essential mechanisms involved in the fermentation process in order to optimize production.


Michael Amrein 
Parameterschätzung in zeitstetigen Markovprozessen  Hans Rudolf Künsch  Aug2006 
Abstract
This thesis deals with parameter estimation in a particular class of continuous-time, homogeneous Markov chains that is especially suited to modelling certain chemical reactions or systems from population dynamics. The data are assumed to be available as a time series, that is, the values of the process are known at discrete times. The transition probabilities between each pair of observations are approximated using Poisson distributions. The quality of this approximation is improved by introducing additional time points (and latent data) between the actual observation times. To determine the maximum likelihood estimator approximately, the EM algorithm is combined with Monte Carlo or Markov chain Monte Carlo methods. This finally results in two algorithms, which are tested on various examples, in particular on artificial data sets.


Andrea Cantieni 
Effiziente Approximation der a posteriori Verteilung für komplexe Simulationsmodelle in Umweltwissenschaften  Hans Rudolf Künsch  Aug2006 
Abstract


Elma Rashedan 
Models for Emission Factors of Passenger Cars linking them to Driving Cycle Characteristics  Werner Stahel  Jul2006 
Abstract


Carmen Casanova 
Vorhersage von PartikelgrössenVerteilung anhand BildananlyseDaten  Werner Stahel  Mar2006 
Abstract


Andreas Elsener 
Statistical Analysis of Quantum Chemical Data; Using Generalized XML/CML Archives  Peter Bühlmann  Mar2006 
Abstract


Simone Elmer 
Sparse LogitBoosting in hochdimensinalen Räumen  Peter Bühlmann  Mar2006 
Abstract
The goal of my diploma thesis is to develop the classification method SparseLogitBoost and to implement it in R. The method is then applied to simulated and real data, and its prediction accuracy is compared with that of other classification methods.

2005
Student  Title  Advisor(s)  Date 

Roman Grischott 
Robuste Geostatistik im Markovmodellen am Beispiel eines Schwermetalldatensatzes  Hans R. Künsch  Sep2005 
Abstract


Michael Hornung 
Klassifikation hochdimensionaler Daten unter Anwendung von BoxCox Transformationen  Peter Bühlmann  Aug2005 
Abstract
The regression methods lasso, relaxed lasso and boosting are used to predict and classify both simulated and real high-dimensional data. The data considered consist not only of the explanatory variables but also of their Box-Cox transformations, which is intended to increase prediction accuracy. Since the response variable in the real data sets is discrete, we focus mainly on the misclassification error. It turns out that for individual data sets the Box-Cox transformations can indeed improve predictive power, but that deteriorations often have to be accepted as well. The second part of this thesis examines the correlation among the model variables selected by the three regression methods and attempts to reduce it. Two different approaches are pursued. First, a lasso-like method that uses additional weights in the penalty term reduces the correlation, in some cases considerably. In a second step, new explanatory variables are constructed from the given variables by averaging groups of strongly correlated variables. These are then used for further classifications. This method also reduces the correlation of the variables, in some cases strongly. However, no general conclusions can be drawn, and both ideas usually lead to an increase in the misclassification error.
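As an illustration of the feature augmentation described in this abstract, the following minimal sketch appends Box-Cox transforms of each explanatory variable to the original features. The function names `box_cox` and `augment` and the chosen exponents are assumptions made for this example; the thesis does not state which exponents or implementation were used.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of a strictly positive value x with exponent lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        # The limit of (x**lam - 1) / lam as lam -> 0 is log(x).
        return math.log(x)
    return (x ** lam - 1.0) / lam


def augment(row, lambdas=(0.0, 0.5, 2.0)):
    """Return the original features followed by their Box-Cox transforms.

    The exponents in `lambdas` are illustrative placeholders.
    """
    out = list(row)
    for lam in lambdas:
        out.extend(box_cox(x, lam) for x in row)
    return out


if __name__ == "__main__":
    print(augment([1.0, 4.0]))
```

A regression method such as the lasso would then be fitted to the augmented rows instead of the raw ones, so that variable selection can choose between a variable and its transforms.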


Stefan Oberhänsli 
Robustheit bei Multiplem Testen: Differentielle Expression bei Genen  Peter Bühlmann  Aug2005 
Abstract
Why multiple testing? Since data of all kinds are no longer collected by hand but with computer support, the volume of data has grown sharply. This created a need for methods that can handle such extensive data sets while making as few errors as possible. Data sets typically comprise hundreds of factors. This makes it possible to test quite different (possibly already suspected) relationships. Such extensive data sets also allow an exploratory approach, i.e. one looks at the data essentially without prior knowledge and sees which relationships suggest themselves. This approach is statistically delicate, however, because a clever choice of test procedures or prior "data cleaning" can be used to "prove" almost arbitrary relationships. The limiting factor in scientific studies is very often the fixed budget. Nevertheless, one would like to extract as much information as possible from the collected data. It is much cheaper to ask a test subject one more question than to recruit another test subject. From a financial point of view, it is therefore better to measure more variables in an experiment than to repeat the whole experiment more often. There are then fewer observations but more factors whose relationships need to be analyzed. In such cases it is unavoidable to carry out a very large number of tests simultaneously (multiple tests). The analysis and interpretation of multiple tests raise considerable difficulties that do not exist in single testing. How these difficulties can be overcome is described in the course of the thesis.
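To make the difficulty concrete, the sketch below computes how quickly the chance of at least one false rejection grows with the number of independent tests, and applies two standard corrections (Bonferroni and Benjamini-Hochberg). These procedures are chosen here only as well-known examples; the abstract does not say which methods the thesis studies, and the p-values are made up for illustration.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """P(at least one false rejection) for m independent true nulls at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i if p_i <= alpha / m (controls the familywise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value lies below its threshold
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    p = [0.001, 0.012, 0.025, 0.041, 0.2]  # illustrative p-values
    print(familywise_error_rate(100))
    print(bonferroni_reject(p))
    print(benjamini_hochberg_reject(p))
```

With 100 independent tests at level 0.05, a false rejection somewhere is almost certain, which is why uncorrected single-test thresholds cannot be reused in the multiple-testing setting.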


Rahel Liesch 
Statistical Genetics for the Budset in Norway Spruce  Peter Bühlmann  Mar2005 
Abstract
Genetic variation is needed for plants to respond and adapt to environmental challenges. Understanding the genetic variation of adaptive traits and the forces that shaped it is one of the main goals of evolutionary biology. This is a difficult task, as most adaptive traits are quantitative traits, i.e. traits that are controlled by many loci interacting with the environment. The aim of this thesis was (i) to analyze the genetic variation of the timing of budset of Norway spruce (Picea abies L.) within and among 15 populations covering the natural range of the species and (ii) to relate the variation among populations in the timing of budset to the variation observed at both neutral and candidate genes. The former was done through a classical ANOVA after choosing an adequate model. The latter was achieved by estimating Wright's fixation indices (a measure of among-population differentiation), and calculating confidence intervals for them, for budset on the one hand and for neutral or candidate genes on the other. Confidence intervals for Wright's fixation index for a quantitative trait, such as the timing of budset, have been and can be estimated in many different ways. In some studies the delta method has been used, whereas in others nonparametric bootstrapping was favored. In almost all studies, the choice of method was neither justified nor discussed, nor, when the bootstrap was retained, was the choice of a particular bootstrap strategy or type justified. We therefore simulated several datasets and applied a range of methods to find the most appropriate one. We concluded that either a semiparametric or a parametric bootstrap gave the best results in the case of the spruce dataset. When using a nonparametric bootstrap, sampling over populations and families would definitely be the most adequate way of obtaining a confidence interval.
Finally, Wright's fixation index for budset was significantly larger than the differentiation at both candidate and neutral loci, suggesting strong local adaptation.
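For readers unfamiliar with the nonparametric bootstrap mentioned above, here is a minimal sketch of a percentile bootstrap confidence interval. It resamples a flat list of observations and uses the sample mean as a placeholder statistic; the thesis itself deals with Wright's fixation index and with resampling over populations and families, neither of which is reproduced here. The function name and the data are hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(data, stat=statistics.mean, n_boot=2000,
                            level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample n observations with replacement and recompute the statistic.
    reps = sorted(
        stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return reps[lo_idx], reps[hi_idx]


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.1]  # made-up values
    print(percentile_bootstrap_ci(sample))
```

A parametric or semiparametric bootstrap would replace the resampling line with draws from a fitted model, which is the design choice the abstract reports as working best for the spruce data.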

2004
Student  Title  Advisor(s)  Date 

Lukas Meier 
Extemwertanalyse von Starkniederschlägen  Hans R. Künsch  Mar2004 
Abstract
Climate change is of great interest because it can have a significant impact on humans and the environment. In this thesis we use methods of extreme value theory to investigate the temporal evolution of heavy precipitation at 104 measuring stations in Switzerland. We model the station-wise exceedances of sufficiently high thresholds with a two-dimensional Poisson point process and nonstationary models for the location and scale parameters. For many stations we thus find strong evidence of a positive trend. To combine the individual trend estimates, we use an analogue of a hierarchical model. The spatial analysis of the results, however, reveals anomalies that make combining the measuring stations difficult. We therefore investigate alternative approaches, mainly to model seasonal characteristics better. This reveals large seasonal differences in the spatial dependence of the trend estimates of the different stations, which could be investigated in more detail.


Andreas Greutert 
Methoden zur Schätzung der Clusteranzahl  Peter Bühlmann  Mar2004 
Abstract
In connection with microarray experiments, new methods of cluster analysis are continually being developed. Three such methods are presented in the technical report by Fridlyand and Dudoit. Fridlyand and Dudoit pursue two goals. First, they want to estimate the number of clusters with the resampling method clest. Second, the accuracy of the clustering is to be improved. To improve accuracy, they propose two bagging methods for clustering algorithms. We focus on the clest algorithm. Some theory is needed to understand and apply the algorithm. Chapter 2 begins with a short introduction to cluster analysis. Further methods for estimating the number of clusters are presented in Chapter 3. Chapters 4, 5 and 6 introduce clest and its parameters. The goal of this diploma thesis is to understand the clest algorithm and, if possible, to improve it. To this end it was necessary to implement the clest algorithm in R (see Appendix B). We aim to improve clest by varying its parameters. A further task is to construct a measure of the certainty of the estimated number of clusters (see Chapter 7). In addition, existing estimation methods are compared with clest.


Käthi Schneider 
Mischungsmodelle für evozierte Potenziale in Nervenzellen  Hans R. Künsch  Mar2004 
Abstract
This thesis is based on 18 data sets of neurobiological data on evoked potentials. Each data set contains amplitude and noise values, where the amplitude values represent the evoked potentials. Because the data are neurobiological, Chapter 2 first explains some biological terms and processes. These play a role in the collection of the data, which is also discussed. In addition to a first overview of the data, the quantal hypothesis is introduced, since it plays an essential role in the analysis of the data. Objective: Mixture densities are fitted to the individual amplitude values of the data sets. For this, various models are considered and it is examined which model is best suited. The first focus is on mixture models that assume dependent data. It must therefore first be checked whether dependencies between evoked potentials exist at all. If so, it is investigated how they can be modeled and whether these models yield better estimates of the mixture densities. The second focus is on the quantal hypothesis. The question is whether evoked potentials can be modeled as a superposition of a randomly released number of quanta.


Jeannine Britschgi 
Analyse einer Brustkrebsstudie  Hans R. Künsch  Feb2004 
Abstract
The goal of this diploma thesis is to carry out a survival analysis for a group of breast cancer patients whose tumor was surgically removed. We are less interested in the time until a patient dies of breast cancer than in the time until a recurrence (reappearance of the tumor). We would like to construct a good prognostic model for the patients that predicts the time to tumor relapse. This model will be a function. We want to find out which of the many explanatory variables are needed to characterize this function well. The question arises whether the information on the lymph nodes, which were also surgically removed and examined for metastases of the tumor, is necessary, or whether good prognostic models can be found without it.

2003
Student  Title  Advisor(s)  Date 

Corinne Dahinden 
Schätzung des Vorhersagefehlers und Anwendungen auf Genexpressionsdaten  Peter Bühlmann  Nov2003 
Abstract
Chapter 2, Microarray Predictors, presents various methods that we later use to classify microarrays. Chapter 3 introduces estimates of the prediction error. Chapter 4, Estimation of Confidence Intervals, discusses estimates of the standard deviations of the estimators introduced in Chapter 3. In Chapters 5 to 7, the various estimates of the errors and the confidence intervals are compared with one another by means of simulations. These findings are applied in Chapter 8, Comparison of Microarray Predictors with and without Clinical Variables. Chapter 9, Prevalidation, introduces the technique of the same name and applies it to various microarray predictors to determine the relevance of the clinical variables. For this diploma thesis I ran a large number of simulations and programmed some of the error estimators used in the statistical software R myself. The code of the most important programs can be found under /u/dahinden/Diplomarbeit/RCode.


Christof Birrer 
Konstruktion von Vorschlagsdichten für Markovketten Monte Carlo mit Sprüngen zwischen Räumen unterschiedlicher Dimension  Hans R. Künsch  Sep2003 
Abstract
The aim of this diploma thesis was to construct proposal densities for Markov chain Monte Carlo, working mainly in the AR model. The thesis builds on the paper by Brooks, Giudici, and Roberts (2003). The task was to investigate the proposal made in the discussion contribution by H.R. Künsch, which recommends a more carefully chosen jump function than the obvious one used in the paper. For this purpose, simulations were also to be carried out with the statistical software R. In a second part, it was to be investigated whether jump functions better than the obvious ones can also be found for ARCH models and Gaussian graphical models, using the Kullback-Leibler distance.


Christoph Buser 
Differentialgleichungen mit zufälligen zeitvariierenden Parametern  Hans R. Künsch  Mar2003 
Abstract
Biological processes are described by differential equations. The assumption that the parameters are time-invariant simplifies solving them and is often made in practice. The resulting systematic errors are accepted as long as they are not too large. In our example we have three quantities: the biomass (bacteria), the substrate (food) and the oxygen. All are concentrations. Only the oxygen concentration can be measured. We reconstruct the other quantities from these measurements. To do so we work with a smoother, which takes into account data from both the future and the past. We drop the assumption of constant parameters and model them with time-varying stochastic processes, more precisely with the mean-reverting Ornstein-Uhlenbeck process. This makes the model more flexible. The approach is Bayesian. We do not look for the best parameters but construct the conditional distribution of the parameters given the oxygen measurements. This is not possible in closed form. We use the Metropolis-Hastings algorithm to generate a Markov chain that asymptotically has the desired distribution. To avoid two-dimensional proposal densities, we work with the Gibbs sampler, which in each step selects one of the two parameters for which a new value is proposed. In the first simulation, we use conditional Ornstein-Uhlenbeck processes in the Metropolis-Hastings algorithm as proposals for the new parameter values. The data are not incorporated into the proposal density. We divide the time interval $[0,T]$ into random intervals of equal average length and change the parameter only on one such interval. This is necessary to obtain reasonable acceptance probabilities. In the second simulation, we use the squared deviations of the oxygen data to construct a proposal on an interval.
The additional information reduces the variance of the proposal density. The computational cost increases. During the procedure we face a problem. As long as substrate is present, the growth parameter dominates the death parameter. This masking effect increases the uncertainty in determining the death parameter in the first time period. The uncertainty carries over to the main processes. In both simulations, the distributions of all processes can mostly be determined well to very well. Problems of the filter, which uses only past measurements, are remedied by the smoother. The smoother brings more data into the procedure and is preferable to the filter. The algorithm is computationally intensive. On the one hand, a long burn-in phase is required to reach the stationary distribution. On the other hand, we reduce the dependencies in the Markov chain by not using every element. A large number of steps in the algorithm is therefore necessary. There are variants of the proposal density. One variant dispenses with the Gibbs sampler and works two-dimensionally. This may better capture the interplay of the two parameters and compensate for the masking effect. Another algorithm tries to extract more information from the oxygen deviations by taking their signs into account.
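The roles of the burn-in phase and of thinning (not using every element of the chain) can be seen in a generic random-walk Metropolis-Hastings sampler. The sketch below targets a one-dimensional standard normal distribution purely for illustration; it is not the sampler from the thesis, which uses Ornstein-Uhlenbeck-based proposals inside a Gibbs scheme. All names and default values are assumptions.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_keep, step=1.0,
                        burn_in=1000, thin=10, seed=0):
    """Random-walk Metropolis-Hastings with burn-in and thinning."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    kept = []
    for i in range(burn_in + n_keep * thin):
        # Symmetric Gaussian proposal, so the Hastings ratio reduces to
        # the ratio of target densities.
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        # Discard the burn-in phase, then keep every `thin`-th state to
        # reduce the dependence between retained samples.
        if i >= burn_in and (i - burn_in) % thin == 0:
            kept.append(x)
    return kept


if __name__ == "__main__":
    # Log-density of a standard normal, up to an additive constant.
    draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_keep=2000)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)
```

Keeping 2000 draws here costs 21000 iterations, which mirrors the abstract's point that burn-in and thinning together make a large number of steps necessary.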


Eric André Graf 
Vorhersage des Luftqualitätsindexes  Hans R. Künsch  Mar2003 
Abstract
This thesis is concerned with developing a model for forecasting an air quality index (LQI). The LQI describes the state of the air in words. It is published hourly on the Internet (www.inluft.ch). The LQI indicates the effect of the current air quality on health. Measurements of air pollutants (ozone O3, nitrogen oxides NOx, nitrogen monoxide NO and particulate matter PM10) produce numbers. These provide information on the concentration of the individual substances in the outdoor air. The LQI is calculated from these concentration values and provides information on the influence of the pollutants on physical well-being. The statement made by the LQI is highly generalized, but it corresponds to current knowledge about the short-term effects of the pollutants on the human organism. For each pollutant, index levels from 1 to 6 are assigned according to its concentration.
