Master of Science Capstone Project
An end-to-end machine learning system analyzing 545,316 hospital admissions and 150M+ data points to predict 30-day ICU readmissions, enabling hospitals to target high-risk patients for early intervention and optimize resource allocation.
Hospital readmissions cost approximately $26 billion annually in the United States. About 20% of Medicare beneficiaries are readmitted within 30 days, and the average US hospital readmission rate across all conditions is 14.67%. Despite targeted interventions, predicting which ICU patients will return remains a significant challenge.
Leveraging the MIMIC-IV database with rigorous preprocessing, feature engineering, and model validation
MIMIC-IV: De-identified health data from Beth Israel Deaconess Medical Center (2008-2019)
57 engineered features spanning multiple clinical domains
Systematic comparison of interpretable and complex algorithms
Rigorous temporal validation preventing data leakage
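A minimal sketch of what a temporal split can look like: records are assigned to train, validation, or test purely by date, so the model never trains on admissions that occur after the ones it is evaluated on. The field name `discharge_date`, the cutoff dates, and the toy records are assumptions for illustration. MIMIC-IV shifts dates for de-identification, so the project's actual split variable may differ.

```python
from datetime import date

def temporal_split(admissions, train_end, valid_end):
    """Assign each record to train/valid/test by date (no shuffling)."""
    train, valid, test = [], [], []
    for row in admissions:
        d = row["discharge_date"]  # hypothetical field name
        if d <= train_end:
            train.append(row)
        elif d <= valid_end:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test

# Hypothetical toy records, not project data.
admissions = [
    {"id": 1, "discharge_date": date(2012, 3, 1)},
    {"id": 2, "discharge_date": date(2016, 7, 15)},
    {"id": 3, "discharge_date": date(2018, 1, 9)},
    {"id": 4, "discharge_date": date(2014, 11, 30)},
]

train, valid, test = temporal_split(
    admissions,
    train_end=date(2015, 12, 31),  # assumed cutoff
    valid_end=date(2017, 12, 31),  # assumed cutoff
)
```

Any feature that aggregates over a patient's history would also need to be computed using only records dated before the admission being scored, otherwise the split alone does not prevent leakage.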
XGBoost emerged as the top performer, with the best discrimination of the three models (AUC 0.683) and excellent calibration (ECE 0.022)
| Model | AUC | Sensitivity | PPV |
|---|---|---|---|
| Logistic Regression | 0.655 | 64.2% | 28.5% |
| Random Forest | 0.660 | 65.1% | 29.2% |
| XGBoost | 0.683 | 68.8% | 29.8% |
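For readers who want to see how the three columns are defined, the sketch below computes AUC, sensitivity, and PPV from labels and predicted probabilities using only the standard library. The toy labels, scores, and the 0.5 threshold are assumptions; the operating threshold behind the table is not stated here.

```python
def roc_auc(y_true, y_score):
    """AUC as the chance a random positive outranks a random negative.

    Ties count as half a win. Pairwise comparison is O(n_pos * n_neg),
    which is fine for a sketch but slow on large cohorts.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def sensitivity_ppv(y_true, y_score, threshold):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    return tp / (tp + fn), tp / (tp + fp)

# Hypothetical toy data, not project results.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.3, 0.35, 0.2, 0.1]

auc = roc_auc(y_true, y_score)
sens, ppv = sensitivity_ppv(y_true, y_score, threshold=0.5)
```

Sensitivity and PPV depend on the chosen threshold while AUC does not, which is why the table's sensitivity and PPV values only make sense for a fixed operating point.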
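Key Finding: The 29.8% PPV is roughly a 50% relative improvement over the 20% baseline readmission rate (29.8 / 20 ≈ 1.49), so interventions aimed at flagged patients reach true readmissions about 1.5 times as often as untargeted outreach, enabling more efficient resource allocation.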
Well-Calibrated Model: ECE = 0.022, indicating trustworthy probability estimates
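Expected Calibration Error (ECE) is the weighted average gap between the mean predicted probability and the observed event rate within probability bins. The sketch below assumes 10 equal-width bins; the binning scheme used to obtain 0.022 is not stated, so treat this as one common formulation and not necessarily the project's exact calculation.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed rate - mean predicted prob| over bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in b) / len(b)
        predicted = sum(p for _, p in b) / len(b)
        ece += (len(b) / n) * abs(observed - predicted)
    return ece

# Hypothetical toy cases: one perfectly calibrated, one badly calibrated.
ece_good = expected_calibration_error([0, 0, 0, 1], [0.25, 0.25, 0.25, 0.25])
ece_bad = expected_calibration_error([1, 1], [0.5, 0.5])
```

An ECE near zero means a predicted 30% risk corresponds to roughly 30% of such patients actually being readmitted, which is what makes probability thresholds usable for intervention tiers.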
The model enables a tiered intervention strategy: High-risk patients (>40% probability) receive intensive transitional care management with home visits ($800-1000/patient), moderate-risk patients (20-40%) receive standard TCM with phone follow-up ($400-600/patient), and low-risk patients (<20%) receive educational materials and portal access ($100-200/patient).
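The tiering rule above can be sketched as a small lookup. The thresholds and cost ranges come from the text; the function names, the dictionary layout, and the handling of the exact 20% and 40% boundaries (both placed in the moderate tier) are assumptions.

```python
# Cost ranges in USD per patient, as listed in the text.
TIERS = {
    "high": {"intervention": "intensive TCM with home visits", "cost_usd": (800, 1000)},
    "moderate": {"intervention": "standard TCM with phone follow-up", "cost_usd": (400, 600)},
    "low": {"intervention": "educational materials and portal access", "cost_usd": (100, 200)},
}

def assign_tier(prob):
    """Map a predicted readmission probability to an intervention tier."""
    if prob > 0.40:
        return "high"
    if prob >= 0.20:  # assumption: boundary values fall in the moderate tier
        return "moderate"
    return "low"

def program_cost_range(probs):
    """Sum the per-patient cost ranges for a list of predicted probabilities."""
    low_total = high_total = 0
    for p in probs:
        lo, hi = TIERS[assign_tier(p)]["cost_usd"]
        low_total += lo
        high_total += hi
    return low_total, high_total

# Hypothetical predicted probabilities for three discharged patients.
cost_low, cost_high = program_cost_range([0.55, 0.30, 0.05])
```

Because the tiers are driven directly by predicted probabilities, this scheme relies on the model being well calibrated, not just well ranked.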
Key Insight: A model doesn't need to be perfect to be valuable. A 0.683 AUC translates to substantial clinical and financial impact when applied at scale. The difference between 20% and 30% PPV represents millions in annual savings.
Proactive examination of model performance across demographic groups ensures equitable healthcare delivery.
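One simple form such an examination can take is to compute the same metrics separately for each demographic group and compare them. The sketch below reports sensitivity and PPV per group; the group labels and toy data are hypothetical, and the project's actual fairness metrics are not specified here.

```python
from collections import defaultdict

def metrics_by_group(y_true, y_pred, groups):
    """Return per-group sensitivity and PPV from binary labels and predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not; "p"/"n" = predicted class.
        key = ("t" if y == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    report = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        flagged = c["tp"] + c["fp"]
        report[g] = {
            "sensitivity": c["tp"] / positives if positives else None,
            "ppv": c["tp"] / flagged if flagged else None,
        }
    return report

# Hypothetical toy data with two groups, "A" and "B".
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = metrics_by_group(y_true, y_pred, groups)
```

A large gap in sensitivity between groups would mean high-risk patients in one group are missed more often, which is the kind of disparity ongoing fairness monitoring is meant to surface.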
Conduct a prospective validation study, pursue external validation if multi-center data become available, and address identified limitations.
Pilot deployment with 20-30% of the discharge population, develop tiered intervention protocols, and implement fairness monitoring.
Retrain the model quarterly, maintain a continuous quality improvement dashboard, and integrate NLP features from clinical notes.
XGBoost achieved 0.683 AUC on temporally held-out test data with minimal degradation from validation, indicating excellent generalization.
117% ROI after accounting for intervention and model costs, with a Number Needed to Screen of 13 patients to prevent one readmission.
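For reference, the two quantities are conventionally defined as ROI = (savings − costs) / costs and Number Needed to Screen = patients screened / readmissions prevented. The sketch below applies those definitions to deliberately round, hypothetical inputs. They are not the project's figures and are not intended to reproduce the 117% or 13 reported above.

```python
def roi(savings, costs):
    """Return on investment as a fraction: net gain divided by total cost."""
    return (savings - costs) / costs

def number_needed_to_screen(n_screened, n_prevented):
    """Patients screened per readmission prevented."""
    return n_screened / n_prevented

# Hypothetical inputs for illustration only.
example_roi = roi(savings=800_000, costs=500_000)
example_nns = number_needed_to_screen(n_screened=1_000, n_prevented=50)
```

With these toy inputs the ROI is 60% and the Number Needed to Screen is 20; the costs term would include both intervention spending and the cost of building and running the model.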
Administrative EHR data can identify high-risk patients with clinically meaningful accuracy, enabling efficient resource allocation at scale.
View the complete code, methodology, and detailed findings on GitHub