Description
Clustered outcomes with longitudinally collected covariates arise frequently in applied research and complicate statistical analysis because of their complex dependence structure. Because they are nonparametric and yield interpretable results, tree-based methods have become some of the most flexible and popular tools for modeling complex data structures. This dissertation proposes new statistical methods for determining true visual field progression in patients with high-risk ocular hypertension or early glaucoma. Traditional tree models such as classification and regression trees (CART) cannot be readily applied to data from ophthalmologic studies, because data from the two eyes of the same patient are usually correlated. A simplifying approach to this difficulty is to randomly select one eye per patient. Analyses based on a single observation per cluster are convenient in that a standard tree method can be employed. The major disadvantage of this approach is a loss of information: half of all the information collected goes unused in the analysis. Tree-based methods that can incorporate correlated data are much needed, because using data from both eyes can provide more power in statistical hypothesis testing and improve prediction accuracy in a classification problem. In Chapter 1, we introduce tree-based methods and discuss the motivation behind this thesis, including the challenges in managing glaucoma and in detecting glaucomatous visual field progression based on standard automated perimetry. In Chapter 2, we propose a classification tree method for correlated binary data by modifying the splitting function of CART. By applying a robust Wald test from generalized estimating equations (GEE), data from both eyes can be used while adjusting for the correlation between the two eyes of the same patient.
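To make the idea of a correlation-adjusted split criterion concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes an identity link and an independence working correlation, so the GEE estimates in each child node are simply the event proportions, and the sandwich variance is built from per-patient sums of residuals. The function name `robust_wald_split` and its arguments are hypothetical; the link function and working correlation actually used in Chapter 2 are not stated in this description.

```python
from collections import defaultdict


def robust_wald_split(y, goes_left, patient_id):
    """Cluster-robust Wald statistic for one candidate split (illustrative).

    y          -- 0/1 outcome for each eye
    goes_left  -- True if the eye falls in the left child, False otherwise
    patient_id -- cluster label; both eyes of a patient share the same label

    Assumption: identity link with independence working correlation, so the
    statistic tests the difference in event proportions between the children.
    """
    n_left = sum(1 for g in goes_left if g)
    n_right = len(y) - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # not a real split

    p_left = sum(v for v, g in zip(y, goes_left) if g) / n_left
    p_right = sum(v for v, g in zip(y, goes_left) if not g) / n_right

    # Sum residuals within each patient, separately for each child node.
    # Summing before squaring is what lets within-patient correlation
    # enter the variance estimate.
    resid = defaultdict(lambda: [0.0, 0.0])
    for v, g, pid in zip(y, goes_left, patient_id):
        if g:
            resid[pid][0] += v - p_left
        else:
            resid[pid][1] += v - p_right

    # Sandwich variance of (p_right - p_left).
    var = sum((r_sum / n_right - l_sum / n_left) ** 2
              for l_sum, r_sum in resid.values())
    diff = p_right - p_left
    if var == 0.0:
        # Both children are pure; the split is perfect if they differ.
        return float("inf") if diff != 0.0 else 0.0
    return diff * diff / var
```

In a tree-growing loop, a statistic of this kind would be evaluated for every candidate split of a node and the split with the largest value retained. When every patient contributes a single eye, the variance reduces to the usual unpooled variance of a difference in proportions.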
Simulations were conducted under a variety of model configurations to investigate the performance of the split criteria. The proposed approach was also applied to data from the perimetry and psychophysics in glaucoma (PPIG) study to look for baseline prognostic indicators of glaucomatous visual field progression. Both traditional CART and the proposed method based on the robust Wald statistic were applied to the PPIG study data. Results on a test sample (a portion of the data not used in the model building process) show improved accuracy when data from both eyes are used. In terms of finding important predictors of glaucoma progression, the proposed method identified many test point locations that have been discussed in the ophthalmology literature as indicators of progression risk. In addition, some new test point locations were uncovered that appear to be associated with increased risk of glaucomatous progression. In Chapter 3, we propose an extension of the existing random forests (RF) classification method based on the new tree method from Chapter 2. We then apply the new RF method to the PPIG study data, incorporating both baseline and longitudinal covariates. To account for the correlated nature of sequential data (i.e., data collected over time), we adapt the pointwise linear regression (PLR) method that is popular in the ophthalmology literature by performing generalized least squares (GLS) regression on the longitudinal visual field data from each test point location of each eye of each patient. The slopes and their associated p-values indicate the magnitude and statistical significance of the change in the visual field sequence and are used in the RF construction. Applying RF to the PPIG data with data from both eyes, together with features from the longitudinal data, provides improved accuracy for predicting glaucoma progression.
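As an illustration of the per-location GLS slope described above, here is a minimal sketch under stated assumptions. It assumes equally spaced visits and first-order autoregressive (AR(1)) errors with a known correlation `rho`, and computes the GLS fit through the Prais-Winsten transformation followed by ordinary least squares. The error structure used in the dissertation is not given in this description, and the function name `gls_ar1_slope` is hypothetical. The sketch returns the slope and its standard error; a p-value would follow from a t reference distribution with n - 2 degrees of freedom.

```python
import math


def gls_ar1_slope(times, values, rho):
    """GLS slope and standard error for one test point location (illustrative).

    times  -- visit times, assumed equally spaced
    values -- visual field sensitivities at those visits
    rho    -- assumed known AR(1) correlation between consecutive visits
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three visits")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")

    # Prais-Winsten transformation: each row holds the transformed
    # intercept column, time column, and response.
    w = math.sqrt(1.0 - rho * rho)
    rows = [(w, w * times[0], w * values[0])]
    for k in range(1, n):
        rows.append((1.0 - rho,
                     times[k] - rho * times[k - 1],
                     values[k] - rho * values[k - 1]))

    # Ordinary least squares on the transformed data (2x2 normal equations).
    s00 = sum(a * a for a, _, _ in rows)
    s01 = sum(a * b for a, b, _ in rows)
    s11 = sum(b * b for _, b, _ in rows)
    t0 = sum(a * v for a, _, v in rows)
    t1 = sum(b * v for _, b, v in rows)
    det = s00 * s11 - s01 * s01
    intercept = (s11 * t0 - s01 * t1) / det
    slope = (s00 * t1 - s01 * t0) / det

    # Residual variance and the standard error of the slope.
    rss = sum((v - intercept * a - slope * b) ** 2 for a, b, v in rows)
    se = math.sqrt(rss / (n - 2) * s00 / det)
    return slope, se
```

Setting `rho` to 0 recovers the ordinary pointwise linear regression slope, which makes the sketch easy to compare against the standard PLR approach.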
In Chapter 4, we further extend the RF method to deal with visual field data that are known to be noisy. We propose a two-step splitting strategy. First, a measurement error (random effects) model is fit to the longitudinal visual field data from all test points of all eyes of all patients. This model is designed to remove the measurement errors in the visual field data. Second, the true slopes estimated from the random effects model, together with all the clinical and socioeconomic baseline covariates, are considered as potential splitting variables for the node. We apply the method to the same longitudinal data set from the PPIG study to demonstrate the improved fit and prediction accuracy. Applying the proposed two-step splitting approach within RF is computationally demanding. RF methods have been recognized as highly effective and ideally suited for parallelization. The following chapter presents a parallel formulation of the proposed method, with RF incorporating joint models for a correlated binary outcome and longitudinal covariates. We also provide an analysis of the longitudinal PPIG data to demonstrate excellent speedups and scalability. The last chapter summarizes this work, notes its novel contributions to the field, and gives suggestions for future work.