This project develops a supervised learning model to predict the ISNCSCI motor score at 12 months post-injury from data collected in the first 30 days post-injury, using an EMSCI-like simulated dataset. Limitations are discussed at the end.
## [1] "Packages loaded."
## [1] "Registered 12 workers for parallel processing."
The simulated data is a cohort based on descriptive statistics of EMSCI subjects reported by Bourguignon et al., 2022 (BMC Med). To model realistic heterogeneity, we create several recovery phenotypes with high variance and plateauing behavior that reflect known SCI pathophysiology and recovery patterns (a hedged sketch of one such trajectory follows the simulation summary below).
## [1] "EMSCI-like cohort data simulation complete."
## [1] "Total patients: 1500"
## [1] "Total observations: 5516"
## [1] "Mean assessments per patient: 3.68"
Cohort characteristics, overall and by baseline AIS grade:

| Characteristic | Overall, N = 1,500¹ | AIS A, N = 617¹ | AIS B, N = 200¹ | AIS C, N = 284¹ | AIS D, N = 399¹ |
|---|---|---|---|---|---|
| Age at injury | 46.1 (18.4) | 45.8 (18.6) | 47.0 (18.6) | 46.2 (18.9) | 46.2 (17.6) |
| Sex | | | | | |
| Female | 365 (24%) | 137 (22%) | 62 (31%) | 67 (24%) | 99 (25%) |
| Male | 1,135 (76%) | 480 (78%) | 138 (69%) | 217 (76%) | 300 (75%) |
| Cause of injury | | | | | |
| Non-Traumatic | 118 (7.9%) | 43 (7.0%) | 25 (13%) | 12 (4.2%) | 38 (9.5%) |
| Traumatic | 1,382 (92%) | 574 (93%) | 175 (88%) | 272 (96%) | 361 (90%) |
| Injury level | | | | | |
| Cervical | 798 (53%) | 357 (58%) | 107 (54%) | 149 (52%) | 185 (46%) |
| Lumbar | 172 (11%) | 54 (8.8%) | 18 (9.0%) | 35 (12%) | 65 (16%) |
| Thoracic | 530 (35%) | 206 (33%) | 75 (38%) | 100 (35%) | 149 (37%) |

¹ Frequencies mimic descriptive statistics reported by Bourguignon et al., 2022 (BMC Med) regarding EMSCI patient data.
We pass all features to the stacked ensemble, relying on the inherent feature selection capabilities of the base learners.
## [1] "Modeling data created with static and trajectory features."
We define a grouped 5-fold cross-validation using 3 repeats and a preprocessing recipe to clean the predictors.
# Use 5-fold grouped CV by center, repeated 3 times for robust variance estimation
set.seed(123)
cv_splits <- group_vfold_cv(
modeling_data,
group = center_id,
v = 5, # 10-fold may also work if the partition is balanced
repeats = 3
)
# Create preprocessing recipe
emsci_recipe <- recipe(motor_score_12m ~ ., data = modeling_data) %>%
update_role(patient_id, new_role = "ID") %>%
update_role(patient_id_global, new_role = "ID") %>%
update_role(center_id, new_role = "ID") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%
step_normalize(all_numeric_predictors(), -all_of(c("patient_id", "patient_id_global", "center_id"))) %>%
step_integer(all_logical_predictors()) %>%
step_zv(all_predictors())
print("Cross-validation and preprocessing recipe defined.")
## [1] "Cross-validation and preprocessing recipe defined."
We define a set of diverse base-learner models and computationally efficient hyperparameter tuning grids for each.
# Define model specifications
# 1. ElasticNet (linear regularized)
set.seed(123)
elnet_spec <- linear_reg(
penalty = tune(),
mixture = tune()
) %>%
set_engine("glmnet") %>%
set_mode("regression")
# 2. Random Forest
set.seed(123)
rf_spec <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 2000 # Reduced for resource efficiency; increase if possible
) %>%
set_engine("ranger", num.threads = 1) %>%
set_mode("regression")
# 3. XGBoost
# Early stopping on a validation split reduces unnecessary computation
set.seed(123)
xgb_spec <- boost_tree(
trees = 500, # Reduced for tuning efficiency
tree_depth = tune(),
learn_rate = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune()
) %>%
set_engine(
"xgboost",
nthread = 1,
validation = 0.2, # Early stopping uses validation split
early_stop = 50 # Stop after 50 rounds without improvement
) %>%
set_mode("regression")
# 4. SVM (Radial Basis Function kernel)
set.seed(123)
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
# 5. MARS (Multivariate Adaptive Regression Splines)
set.seed(123)
mars_spec <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_engine("earth") %>%
set_mode("regression")
print("Base-learner specifications set.")
## [1] "Base-learner specifications set."
# Random grids for high-dimensional or sensitive hyperparameters (ElasticNet, XGBoost, SVM)
# Regular grids for models with fewer, more robust hyperparameters (RF, MARS)
# ElasticNet: 15-point random grid
set.seed(123)
elnet_grid <- grid_random(
penalty(range = c(-5, 1)),
mixture(range = c(0, 1)),
size = 15 # reduced for computation efficiency, increase if possible for all models
)
# Random Forest: Systematic grid (15 points)
set.seed(123)
rf_grid <- grid_regular(
mtry(range = c(5, 20)),
min_n(range = c(5, 15)),
levels = c(mtry = 5, min_n = 3)
)
# XGBoost: 15-point random grid
set.seed(123)
xgb_grid <- grid_random(
tree_depth(range = c(3, 10)),
min_n(range = c(2, 20)),
learn_rate(range = c(-3, -0.5)), # log10 scale: learning rates from 0.001 to ~0.32
loss_reduction(range = c(-5, 1)),
sample_prop(range = c(0.5, 1.0)),
size = 15
)
# SVM: 15-point random grid
set.seed(123)
svm_grid <- grid_random(
cost(range = c(-5, 5)),
rbf_sigma(range = c(-5, 0)),
size = 15
)
# MARS: Systematic grid (10 points: 5 num_terms x 2 prod_degree values)
set.seed(123)
mars_grid <- grid_regular(
num_terms(range = c(5, 25)),
prod_degree(range = c(1L, 2L)),
levels = c(num_terms = 5, prod_degree = 2)
)
print("Hyperparameter grids finalized.")
## [1] "Hyperparameter grids finalized."
We use a workflow set to tune all five models across our grouped CV splits and save the out-of-fold predictions to be used as inputs for the final ensemble (a hedged sketch of this step follows the timing output below).
## Base learners tuning: 526.64 sec elapsed
## [1] "Base learners tuned."
We create the stacked ensemble by adding the tuned base learners as candidates and blending their out-of-fold predictions using a non-negative LASSO meta-learner to find optimal, interpretable weights. Then, we fit the complete stacked model to the full training set.
# Create and fit the stacked ensemble
tic("Stacking")
# Initialize stack
set.seed(123)
emsci_stack <- stacks() %>%
add_candidates(model_set_results) # Add all tuned models at once
print(paste("Stack initialized with", ncol(emsci_stack), "candidate models"))
## [1] "Stack initialized with 70 candidate models"
# Blend the stack (fit meta-learner with non-negative LASSO)
set.seed(123)
emsci_stack_blended <- emsci_stack %>%
blend_predictions(
penalty = 10^seq(-4, -0.5, length.out = 20),
mixture = 1, # LASSO
non_negative = TRUE, # enforce interpretability
metric = metric_set(rmse) # RMSE prioritizes predictive accuracy
)
print("Stack blended successfully")
## [1] "Stack blended successfully"
# Fit the final stacked model on the full training data
set.seed(123)
emsci_stack_fitted <- emsci_stack_blended %>%
fit_members()
toc()
## Stacking: 58.93 sec elapsed
print("Final stacked ensemble fitted.")
## [1] "Final stacked ensemble fitted."
We perform the second stage of the evaluation on an independent set of 5-fold grouped CV splits: the base-learner out-of-fold predictions serve as inputs, the meta-learner reuses the optimal non-negative LASSO penalty found during blending, and the resampling yields the final out-of-sample performance metrics.
# Obtain out-of-fold predictions and performance metrics
tic("Model evaluation")
# Create the L1 dataset (base learner predictions + IDs)
l1_data <- emsci_stack %>%
mutate(.row = row_number()) %>%
left_join(
modeling_data %>%
mutate(.row = row_number()) %>%
select(.row, patient_id_global, center_id),
by = ".row"
)
print(paste("L1 dataset created:", nrow(l1_data), "patients"))
## [1] "L1 dataset created: 1500 patients"
# Define NEW, INDEPENDENT CV folds for meta-learner evaluation
set.seed(456) # Use a different seed for independence
l1_folds <- group_vfold_cv(l1_data, group = center_id, v = 5)
# Get optimal penalty from Stage 1
best_penalty <- emsci_stack_blended$penalty$penalty
print(paste("Using meta-learner penalty from Stage 1:", best_penalty))
## [1] "Using meta-learner penalty from Stage 1: 0.206913808111479"
# Define meta-learner spec
# Use !! to inject the numeric value now; avoids a silent scoping error inside fit_resamples
meta_spec <- linear_reg(
penalty = !!best_penalty,
mixture = 1 # LASSO
) %>%
set_engine("glmnet", lower.limits = 0) # Non-negative
# Define meta-learner recipe
meta_recipe <- recipe(motor_score_12m ~ ., data = l1_data) %>%
update_role(center_id, new_role = "ID") %>%
update_role(patient_id_global, new_role = "ID") %>%
update_role(.row, new_role = "ID")
# Create workflow
eval_metrics <- metric_set(rmse, rsq, mae)
meta_wf <- workflow() %>%
add_recipe(meta_recipe) %>%
add_model(meta_spec)
# Evaluate with CV
set.seed(456)
tic("Meta-learner CV")
meta_cv_results <- fit_resamples(
meta_wf,
resamples = l1_folds,
metrics = eval_metrics, # Pass the metrics
control = control_resamples(save_pred = TRUE)
)
toc()
## Meta-learner CV: 23.01 sec elapsed
print("Meta-learner CV complete")
## [1] "Meta-learner CV complete"
# Extract CV predictions
cv_predictions <- tidyr::unnest(meta_cv_results, .predictions)
# Create final plotting data
plot_data_cv <- cv_predictions %>%
left_join(
modeling_data %>%
mutate(.row = row_number()) %>%
select(.row, patient_id_global, baseline_ais),
by = ".row"
) %>%
select(patient_id_global, motor_score_12m, .pred, baseline_ais)
# Calculate final CV metrics
final_cv_metrics <- collect_metrics(meta_cv_results)
print(paste("Out-of-fold predictions extracted for",
nrow(plot_data_cv), "patients"))
## [1] "Out-of-fold predictions extracted for 1500 patients"
print("Performance metrics and predictions ready.")
## [1] "Performance metrics and predictions ready."
We extract the non-negative LASSO coefficients from the fitted stack, group them by base-learner type, and visualize the total weight and number of members each model contributes to the final ensemble.
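The coefficient-extraction and plotting code runs out of view; this is a hedged sketch of how the stacking weights could be grouped by base-learner family. It assumes the candidate names follow the workflow-set ids used above (e.g. "recipe_mars") and that collect_parameters() returns a coef column for the blended stack.

# Hedged sketch: stacking coefficients grouped by base-learner family
library(dplyr)
library(ggplot2)

autoplot(emsci_stack_blended, type = "weights")   # built-in view of the non-zero weights

candidate_ids <- c("recipe_elnet", "recipe_rf", "recipe_xgb", "recipe_svm", "recipe_mars")
stack_coefs <- purrr::map_dfr(
  candidate_ids,
  ~ collect_parameters(emsci_stack_blended, .x) %>% mutate(model_type = .x)
)

stack_coefs %>%
  filter(coef > 0) %>%                             # members retained by the non-negative LASSO
  group_by(model_type) %>%
  summarise(n_members = n(), total_weight = sum(coef), .groups = "drop") %>%
  ggplot(aes(reorder(model_type, total_weight), total_weight)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Total stacking weight")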
We gather performance metrics for the final stacked ensemble and the best-performing base learners.
Cross-validated performance metrics: stacked ensemble (retained 14 candidates) vs. best base learners.

| Model¹ | MAE | RMSE | R-squared |
|---|---|---|---|
| Best ElasticNet | 10.94 [10.49, 11.39] | 14.32 [13.85, 14.79] | 0.776 [0.742, 0.811] |
| Best Random Forest | 11.53 [11.01, 12.05] | 14.82 [14.25, 15.39] | 0.761 [0.729, 0.793] |
| Best XGBoost | 11.29 [10.81, 11.78] | 14.82 [14.17, 15.46] | 0.759 [0.727, 0.791] |
| Best SVM | 10.73 [10.24, 11.21] | 14.41 [13.91, 14.91] | 0.775 [0.740, 0.809] |
| Best MARS | 11.09 [10.64, 11.55] | 14.47 [13.98, 14.96] | 0.771 [0.735, 0.807] |
| Stacked ensemble | 11.28 [10.52, 12.05] | 14.75 [13.76, 15.75] | 0.739 [0.704, 0.774] |

¹ Base-learner metrics are optimistically biased (taken from stage 1 tuning). Stacked-ensemble metrics come from the independent stage 2 CV and are essentially unbiased (see Limitations for the minor residual bias from reusing the stage 1 penalty).
We assess model calibration by comparing predicted and observed scores from the out-of-fold predictions.
We stratify the data by predicted score range to analyze calibration-in-the-large (a sketch of this computation follows the table).
Calibration stratified by predicted score range:

| Score range | N | Mean predicted | Mean observed | Calibration error¹ |
|---|---|---|---|---|
| 0-25 | 815 | 10.0 | 9.6 | -0.40 |
| 26-50 | 285 | 39.3 | 39.6 | 0.32 |
| 51-75 | 342 | 61.5 | 62.3 | 0.84 |
| 76-100 | 58 | 81.8 | 81.5 | -0.29 |

¹ Calibration error = observed - predicted; positive values indicate underprediction.
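A minimal sketch of the stratified calibration summary and the overall calibration slope/intercept, using plot_data_cv from the evaluation step; the bin edges are assumptions matching the table above.

# Sketch of calibration-in-the-large by predicted score range
library(dplyr)
plot_data_cv %>%
  mutate(score_range = cut(.pred, breaks = c(-Inf, 25, 50, 75, Inf),
                           labels = c("0-25", "26-50", "51-75", "76-100"))) %>%
  group_by(score_range) %>%
  summarise(
    n                 = n(),
    mean_predicted    = mean(.pred),
    mean_observed     = mean(motor_score_12m),
    calibration_error = mean(motor_score_12m - .pred),   # observed - predicted
    .groups = "drop"
  )

# Weak calibration: intercept and slope of observed ~ predicted
coef(lm(motor_score_12m ~ .pred, data = plot_data_cv))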
We check the model error behavior by generating standard diagnostic plots from the out-of-fold predictions.
We assess the agreement between observed and predicted scores, visualizing the mean bias and 95% limits of agreement to check for any systematic error trends across the range of scores.
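A hedged sketch of the Bland-Altman style agreement analysis described above, using the out-of-fold predictions in plot_data_cv.

# Bland-Altman sketch: mean bias and 95% limits of agreement
library(dplyr)
library(ggplot2)
ba <- plot_data_cv %>%
  mutate(avg  = (motor_score_12m + .pred) / 2,
         diff = motor_score_12m - .pred)
bias <- mean(ba$diff)
loa  <- bias + c(-1.96, 1.96) * sd(ba$diff)     # 95% limits of agreement
ggplot(ba, aes(avg, diff)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = bias, linetype = "dashed") +
  geom_hline(yintercept = loa, linetype = "dotted") +
  labs(x = "Mean of observed and predicted motor score", y = "Observed - predicted")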
We visualize the model performance stratified by the baseline AIS grade to compare each group’s error against the overall model performance.
We evaluate the model’s clinical utility by defining a Minimal Clinically Important Difference (MCID) and plotting a histogram to find the percentage of predictions within this acceptable error.
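A minimal sketch of the MCID analysis; the MCID of ±4 points follows the value quoted in the results summary below.

# Share of out-of-fold predictions within the MCID, plus error histogram
library(dplyr)
library(ggplot2)
mcid <- 4
errors <- plot_data_cv %>% mutate(error = .pred - motor_score_12m)
mean(abs(errors$error) <= mcid) * 100            # % of predictions within the MCID
ggplot(errors, aes(error)) +
  geom_histogram(binwidth = 2) +
  geom_vline(xintercept = c(-mcid, mcid), linetype = "dashed") +
  labs(x = "Prediction error (predicted - observed)", y = "Count")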
We test the model’s classification utility by setting a recovery threshold and generating a Precision-Recall curve from the model’s predictions.
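A hedged sketch of the precision-recall analysis; the 50-point recovery threshold is an assumption for illustration, not necessarily the threshold actually used.

# Precision-recall curve for a binarized recovery outcome
library(dplyr)
library(yardstick)
recovery_threshold <- 50                          # assumed threshold
pr_data <- plot_data_cv %>%
  mutate(recovered = factor(motor_score_12m >= recovery_threshold,
                            levels = c(TRUE, FALSE)))
pr_curve(pr_data, truth = recovered, .pred) %>%   # predicted score acts as the ranking score
  autoplot()
pr_auc(pr_data, truth = recovered, .pred)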
We conduct a permutation-based feature importance analysis, plotting the top 20 features ranked by the mean drop in performance (RMSE) when their values are shuffled.
## [1] "DALEX explainer created successfully"
## Permutation importance: 1608.05 sec elapsed
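The explainer and importance code ran out of view; this hedged sketch shows one way the DALEX explainer and permutation importance reported above could be set up. The prediction wrapper, B = 10 permutations, and the label are assumptions.

# Hedged sketch of the DALEX explainer + permutation importance (RMSE loss)
library(DALEX)
pred_fun <- function(model, newdata) {
  predict(model, new_data = newdata)$.pred        # stacks prediction as a plain numeric vector
}
explainer <- DALEX::explain(
  model = emsci_stack_fitted,
  data  = modeling_data %>% select(-motor_score_12m),
  y     = modeling_data$motor_score_12m,
  predict_function = pred_fun,
  label = "Stacked ensemble"
)
set.seed(123)
perm_importance <- model_parts(
  explainer,
  loss_function = loss_root_mean_square,          # importance = increase in RMSE after permutation
  B = 10                                          # number of permutations (assumption)
)
plot(perm_importance, max_vars = 20)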
We calculate SHapley Additive exPlanations (SHAP) values for the full dataset to explain the stacked model’s predictions and create a global importance bar chart, ranking the top 20 features by their mean absolute SHAP value.
## SHAP calculation (full sample): 4522.95 sec elapsed
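The SHAP code ran out of view; this hedged sketch shows how fastshap could compute SHAP values for the stacked model and the mean-absolute-SHAP ranking. The nsim value and the prediction wrapper are assumptions; the ID columns are kept in X so the recipe inside the stack can still process new data.

# Hedged sketch of the fastshap computation and global importance ranking
library(dplyr)
library(fastshap)
X <- modeling_data %>% select(-motor_score_12m)
shap_features <- setdiff(names(X), c("patient_id", "patient_id_global", "center_id"))
pred_wrapper <- function(object, newdata) {
  predict(object, new_data = newdata)$.pred
}
set.seed(123)
shap_values <- fastshap::explain(
  emsci_stack_fitted,
  X             = X,
  feature_names = shap_features,
  pred_wrapper  = pred_wrapper,
  nsim          = 50                              # Monte Carlo repetitions per feature (assumption)
)
# Global importance: mean absolute SHAP value per feature, top 20
tibble(feature = colnames(shap_values),
       mean_abs_shap = colMeans(abs(shap_values))) %>%
  arrange(desc(mean_abs_shap)) %>%
  slice_head(n = 20)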
We generate SHAP dependency plots for the top numeric features to visualize their marginal effects and explore potential interactions.
We identify the patients with the best and worst predictions from the cross-validation set and generate permutation-based “break-down” plots for these two individuals to visualize how each feature contributed to their specific final score.
Note: OOF (out-of-fold) is the model's unbiased prediction for this subject, calculated during cross-validation when the model had never seen the subject; the OOF error is used only to identify the patient as "best" or "worst". IN-SAMPLE is the prediction from the final model, which was trained on all the data (including the chosen subjects), and it is the prediction that the plot explains. The plot starts at the average prediction for the dataset (E[f(x)]), and each bar is a feature's SHAP value, showing its push on the prediction until the final prediction (f(x)) is reached.
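The individual-level explanation code ran out of view; this hedged sketch reuses the DALEX explainer from the permutation-importance sketch above and computes permutation-based SHAP contributions with predict_parts(); the patient-selection logic and B = 25 are illustrative.

# Hedged sketch: explanations for the best- and worst-predicted patients
library(dplyr)
oof_errors <- plot_data_cv %>% mutate(abs_error = abs(.pred - motor_score_12m))
best_id  <- oof_errors$patient_id_global[which.min(oof_errors$abs_error)]
worst_id <- oof_errors$patient_id_global[which.max(oof_errors$abs_error)]
for (pid in c(best_id, worst_id)) {
  obs <- modeling_data %>%
    filter(patient_id_global == pid) %>%
    select(-motor_score_12m)
  bd <- DALEX::predict_parts(explainer, new_observation = obs,
                             type = "shap", B = 25)   # permutation-based SHAP contributions
  print(plot(bd))
}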
The model’s performance is limited by its reliance on simulated data, which lacks the noise, heterogeneity, and missingness characteristic of real-world multi-center SCI data.
We implemented a standard two-stage CV approach. While it is computationally efficient and uses repeated grouped CV to obtain stable out-of-fold predictions, it carries an optimization bias. Nested CV would be preferred but computationally prohibitive.
The project uses embedded selection by passing all features to the stacked ensemble, employing the inherent feature selection capabilities of the base learners. Wrapper methods (e.g., Boruta) would be preferred but computationally prohibitive.
The analysis implements a pragmatic and computationally efficient two-stage evaluation in lieu of a full nested CV framework, introducing known trade-offs:

- The hyperparameters for the base learners were selected using the same 5-fold CV that generated the out-of-fold predictions. This is standard for the stacks package but is slightly optimistic.
- The optimal penalty for the meta-learner was selected in stage 1 using out-of-fold predictions from all 1,500 patients, and this single penalty was reused in stage 2, so the final metrics are slightly optimistic. This approach is a practical compromise that trades a minor, acceptable bias for a large gain in computational efficiency.
Future iterations will include deeper investigation using SHAP dependency plots, partial dependence plots, and a quantitative analysis of the ensemble diversity.
## These are not valid conclusions. This is a programming exercise. For completeness, here are the results:
## Overall performance (5-fold CV):
## - RMSE: 14.75 (13.76, 15.75)
## - MAE: 11.28 (10.52, 12.05)
## - R-squared: 0.739 (0.704, 0.774)
## Calibration:
## - Calibration slope: 1.012 (ideal = 1)
## - Calibration intercept: -0.326 (ideal = 0)
## Clinical utility:
## - Predictions within MCID (±4 points): 23.3%
## Bland-Altman analysis:
## - Mean bias: 0.02 points
## - 95% Limits of Agreement: [-29.18, 29.23]
## Meta-Learner composition:
## - Total members selected: 14 (out of 69 candidates)
## - Dominant model type: MARS (48.2% of total weight)
## Subgroup performance:
## - AIS A (n=617): RMSE=14.76, R-squared=0.718
## - AIS B (n=200): RMSE=15.83, R-squared=0.722
## - AIS C (n=284): RMSE=14.75, R-squared=0.753
## - AIS D (n=399): RMSE=14.72, R-squared=0.754
## R version: R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32
## Key ML packages:
## tidymodels 1.4.1
## tune 2.0.1
## stacks 1.1.1
## ranger 0.17.0
## xgboost 1.7.11.1
## earth 5.3.4
## DALEX 2.5.3
## fastshap 0.1.1