This three-article dissertation aims to address three methodological challenges to ensuring comparability in educational research: scale linking, test equating, and propensity score (PS) weighting. Within the item response theory (IRT) framework, the first study aims to improve test scale comparability by evaluating the effect of six missing data handling approaches on scale linking accuracy when missing responses occur in common items. The second study aims to develop a new equating method that accounts for rater errors in rater-mediated assessments. Specifically, the performance of an IRT observed-score equating method that incorporates a hierarchical rater model is compared with that of a traditional IRT observed-score equating method under various conditions. The third study examines the performance of six covariate balance diagnostics when PS weighting is applied to multilevel data. Specifically, a set of simulated conditions is used to examine how well within-cluster and pooled versions of the absolute standardized bias, variance ratio, and percent bias reduction identify a correctly specified PS model. In addition, the association between these balance statistics and bias in the estimated treatment effect is explored. By advancing methods for addressing comparability issues, the dissertation aims to enhance the validity and quality of educational research.