Description
Multivariate failure time data arise when individuals under study are naturally clustered. This type of data requires multivariate extensions of existing statistical methodology. Due to their nonparametric approach and interpretable results, tree-based methods have become some of the most flexible and popular analytic tools for modeling complex data structures. This dissertation presents new methodology for random forests and variable importance measures. Breiman's [7] original random forest (RF) method has been shown to be unreliable when potential predictor variables differ in their number of categories [38]. We introduce a new RF algorithm that reduces the bias in variable importance ranking for correlated survival data. The multivariate exponential tree algorithm of Fan and Su [15] is used to build trees, due to its superior prediction accuracy and computational efficiency. Simulation studies for assessing various variable importance methods are presented. We compare the prediction accuracy of the proposed method with that of the traditional Cox proportional hazards frailty model. We apply our method to the VA Dental Longitudinal Study to assess tooth loss.
To introduce even more randomization into the RF procedure, we introduce a second RF method for correlated survival data. It consists of completely randomizing both the attribute and the cut point choice when splitting a tree node. To ensure the quality of each split, two different split evaluation criteria are used: the likelihood ratio and the score test criteria, based on the semi-parametric and exponential frailty models, respectively. We show that the proposed method is computationally inexpensive, yet accurate in uncovering the true variable importance rankings. Computational efficiency becomes particularly challenging when the proposed approach is applied to RF. RF methods have recently been recognized as highly effective and ideally suited for parallelization.
We present a parallel formulation of the proposed RF method to address large datasets. We apply the new tree-based method and its variable importance measures to correlated survival data from a dental school database consisting of 373,202 observations, and present the results of our analysis.
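
The fully randomized node split described above, in which both the attribute and the cut point are chosen at random, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the dissertation's implementation: the function names `random_split` and `partition` and the data layout are hypothetical, and the likelihood ratio and score test criteria used to evaluate split quality are deliberately omitted.

```python
import random


def random_split(rows, n_features, rng):
    """Pick a splitting attribute and a cut point uniformly at random.

    rows: list of feature vectors (lists of floats) in the current node.
    Returns (feature_index, cut_point), or None if no attribute varies.
    Hypothetical helper: split-quality evaluation is not included.
    """
    candidates = list(range(n_features))
    rng.shuffle(candidates)  # try attributes in random order
    for j in candidates:
        values = [r[j] for r in rows]
        lo, hi = min(values), max(values)
        if lo < hi:  # attribute must vary within the node
            cut = rng.uniform(lo, hi)
            if cut >= hi:  # guard against rounding up to the maximum
                cut = lo
            return j, cut
    return None


def partition(rows, feature, cut):
    """Send rows with value <= cut to the left child, the rest to the right."""
    left = [r for r in rows if r[feature] <= cut]
    right = [r for r in rows if r[feature] > cut]
    return left, right


rng = random.Random(0)
node = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
feature, cut = random_split(node, 2, rng)
left, right = partition(node, feature, cut)
```

Because the second attribute is constant in this toy node, only the first attribute can be selected, and the guard on the cut point ensures that both child nodes are non-empty.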