Propensity score methods have shown to reduce selection bias in observational studies. However, the consistency of the propensity score (PS) estimators strongly depends on a correct specification of the PS model. Logistic regression (LR) and recently machine learning tools are commonly used to estimate the propensity scores. We introduce a stacked generalization ensemble learning approach to improve propensity score estimation by fitting a meta learner on the predictions of a suitable set of diverse base learners. We perform a comprehensive Monte Carlo simulation study, implementing eight scenarios that mimic characteristics of typical data sets in educational studies. The treatment effect is estimated using the PS in Inverse Probability of Treatment Weighting (IPTW) with ATE weights. Performance of the models was evaluated by PS prediction accuracy, percent absolute bias, mean squared error and standard errors of treatment effect estimates, weight distribution and achieved covariate balance. Our proposed ensembles, especially using LR and GBM as meta learners trained on a set of 13 base learner predictions, led to superior reduction of bias compared to all underlying base learners. We examine modifications of the underlying base learner set and support recent literature that both, superior PS prediction accuracy and superior balance do not necessarily lead to more precise treatment effect estimates. Our findings suggest that stacked ensembles will allow educational researchers to obtain more precise treatment effect estimates in propensity score studies. We apply our best models to assess the average treatment effect of a Supplemental Instruction (SI) program in an introductory psychology (PSY 101) course at San Diego State University. We show that our methods balance the data after weighting and then confirm results in the recent literature that SI has a significantly positive impact on student success in the PSY101 course.