Description
Tree-based methods, such as classification and regression trees, are used to predict a response from the values of input variables. They can also aid variable selection during model building by identifying "important" variables, those that play a strong role in accurate prediction. Ensemble methods, including the random forest method, use a collection of trees and were developed to improve prediction performance. This study had two aims: to use linear and nonlinear combinations of input variables in the random forest method to improve prediction performance relative to a random forest grown using only single input variables, and to identify important variables and variable types among the single and combination input variables. While seeking to reduce prediction error, the study also sought to develop guidelines for this type of approach. Across three different datasets with categorical response variables, including combinations of input variables in a random forest could improve prediction compared to a random forest grown using only single input variables. Additionally, important variables could be clearly identified. The linear and/or nonlinear combination variables were very often among the most important variables in a given random forest. The grouping of variable types that performed best as input variables for one dataset did not necessarily perform best for a different dataset. It was clear, however, that the most important linear and nonlinear combinations of variables were not necessarily combinations of the most important single variables. This implies that single variables that appear relatively unimportant should not be excluded when forming combinations of variables. The number and type of combination variables to include in a random forest need to be customized for a given dataset based on its characteristics.
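To make the idea of combination input variables concrete, the following is a minimal sketch of one way to augment a set of single input variables before growing a random forest. It is an illustration only, not the study's procedure: the function name `add_pairwise_combinations`, the choice of a pairwise sum as the linear combination and a pairwise product as the nonlinear combination, and the toy data are all assumptions.

```python
from itertools import combinations


def add_pairwise_combinations(rows, names):
    """Append one linear and one nonlinear combination for every pair of inputs.

    Assumption for illustration: the linear combination is a plain sum and the
    nonlinear combination is a product. Other choices are equally possible.
    """
    pairs = list(combinations(range(len(names)), 2))

    # Name the new columns so they can be traced back to their source variables.
    new_names = list(names)
    for i, j in pairs:
        new_names.append(f"{names[i]}+{names[j]}")  # linear combination
        new_names.append(f"{names[i]}*{names[j]}")  # nonlinear combination

    # Keep every single variable and append the combination values.
    new_rows = []
    for row in rows:
        extended = list(row)
        for i, j in pairs:
            extended.append(row[i] + row[j])
            extended.append(row[i] * row[j])
        new_rows.append(extended)
    return new_rows, new_names


# Hypothetical toy data: two observations of three numeric input variables.
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
names = ["x1", "x2", "x3"]
aug_rows, aug_names = add_pairwise_combinations(rows, names)
```

Because every single variable is kept and every pair is combined, a variable that looks unimportant on its own still contributes to the combinations. The augmented rows could then be passed to any random forest implementation that reports variable importance, so that single and combination variables are ranked together.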