This project develops a supervised learning model to predict the ISNCSCI motor score at 12 months post-injury from data collected in the first 30 days post-injury, using an EMSCI-like simulated dataset. Limitations are discussed at the end.
## [1] "Packages loaded."
## [1] "Registered 12 workers for parallel processing."
The simulated data is a cohort based on descriptive statistics of EMSCI subjects reported by Bourguignon et al., 2022 (BMC Med). To model realistic heterogeneity, we create several recovery phenotypes with high variance and plateauing behavior that reflect known SCI pathophysiology and recovery patterns (a hedged sketch of one such trajectory follows the simulation summary below).
## [1] "EMSCI-like cohort data simulation complete."
## [1] "Total patients: 1500"
## [1] "Total observations: 5516"
## [1] "Mean assessments per patient: 3.68"
Cohort characteristics, overall and by baseline AIS grade:

| Characteristic | Overall, N = 1,500¹ | AIS A, N = 617¹ | AIS B, N = 200¹ | AIS C, N = 284¹ | AIS D, N = 399¹ |
|---|---|---|---|---|---|
| Age at injury | 46.1 (18.4) | 45.8 (18.6) | 47.0 (18.6) | 46.2 (18.9) | 46.2 (17.6) |
| Sex | | | | | |
| Female | 365 (24%) | 137 (22%) | 62 (31%) | 67 (24%) | 99 (25%) |
| Male | 1,135 (76%) | 480 (78%) | 138 (69%) | 217 (76%) | 300 (75%) |
| Cause of injury | | | | | |
| Non-Traumatic | 118 (7.9%) | 43 (7.0%) | 25 (13%) | 12 (4.2%) | 38 (9.5%) |
| Traumatic | 1,382 (92%) | 574 (93%) | 175 (88%) | 272 (96%) | 361 (90%) |
| Injury level | | | | | |
| Cervical | 798 (53%) | 357 (58%) | 107 (54%) | 149 (52%) | 185 (46%) |
| Lumbar | 172 (11%) | 54 (8.8%) | 18 (9.0%) | 35 (12%) | 65 (16%) |
| Thoracic | 530 (35%) | 206 (33%) | 75 (38%) | 100 (35%) | 149 (37%) |

¹ Frequencies mimic descriptive statistics reported by Bourguignon et al., 2022 (BMC Med) regarding EMSCI patient data.
We pass all features to the stacked ensemble, relying on the inherent feature selection capabilities of the base learners.
## [1] "Modeling data created with static and trajectory features."
We define a grouped 5-fold cross-validation using 3 repeats and a preprocessing recipe to clean the predictors.
# Use 5-fold grouped CV by center, repeated 3 times for robust variance estimation
set.seed(123)
cv_splits <- group_vfold_cv(
modeling_data,
group = center_id,
v = 5, # 10-fold may also work if the partition is balanced
repeats = 3
)
# Create preprocessing recipe
emsci_recipe <- recipe(motor_score_12m ~ ., data = modeling_data) %>%
update_role(patient_id, new_role = "ID") %>%
update_role(patient_id_global, new_role = "ID") %>%
update_role(center_id, new_role = "ID") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE) %>%
step_normalize(all_numeric_predictors(), -all_of(c("patient_id", "patient_id_global", "center_id"))) %>%
step_integer(all_logical_predictors()) %>%
step_zv(all_predictors())
print("Cross-validation and preprocessing recipe defined.")
## [1] "Cross-validation and preprocessing recipe defined."
We define a set of diverse base-learner models and computationally efficient hyperparameter tuning grids for each.
# Define model specifications
# 1. ElasticNet (linear regularized)
set.seed(123)
elnet_spec <- linear_reg(
penalty = tune(),
mixture = tune()
) %>%
set_engine("glmnet") %>%
set_mode("regression")
# 2. Random Forest
set.seed(123)
rf_spec <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = 2000 # Reduced for resource efficiency; increase if possible
) %>%
set_engine("ranger", num.threads = 1) %>%
set_mode("regression")
# 3. XGBoost
# Early stopping on a validation split reduces unnecessary computation
set.seed(123)
xgb_spec <- boost_tree(
trees = 500, # Reduced for tuning efficiency
tree_depth = tune(),
learn_rate = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune()
) %>%
set_engine(
"xgboost",
nthread = 1,
validation = 0.2, # Early stopping uses validation split
early_stop = 50 # Stop after 50 rounds without improvement
) %>%
set_mode("regression")
# 4. SVM (Radial Basis Function kernel)
set.seed(123)
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
set_engine("kernlab") %>%
set_mode("regression")
# 5. MARS (Multivariate Adaptive Regression Splines)
set.seed(123)
mars_spec <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_engine("earth") %>%
set_mode("regression")
print("Base-learner specifications set.")
## [1] "Base-learner specifications set."
# Random grids for high-dimensional or sensitive hyperparameters (ElasticNet, XGBoost, SVM)
# Regular grids for models with fewer, more robust hyperparameters (RF, MARS)
# ElasticNet: 15-point random grid
set.seed(123)
elnet_grid <- grid_random(
penalty(range = c(-5, 1)),
mixture(range = c(0, 1)),
size = 15 # reduced for computation efficiency, increase if possible for all models
)
# Random Forest: Systematic grid (15 points)
set.seed(123)
rf_grid <- grid_regular(
mtry(range = c(5, 20)),
min_n(range = c(5, 15)),
levels = c(mtry = 5, min_n = 3)
)
# XGBoost: 15-point random grid
set.seed(123)
xgb_grid <- grid_random(
tree_depth(range = c(3, 10)),
min_n(range = c(2, 20)),
learn_rate(range = c(-3, -0.5)), # log10 scale: learning rates from 0.001 to ~0.32
loss_reduction(range = c(-5, 1)),
sample_prop(range = c(0.5, 1.0)),
size = 15
)
# SVM: 15-point random grid
set.seed(123)
svm_grid <- grid_random(
cost(range = c(-5, 5)),
rbf_sigma(range = c(-5, 0)),
size = 15
)
# MARS: Systematic grid (10 points: 5 num_terms x 2 prod_degree values)
set.seed(123)
mars_grid <- grid_regular(
num_terms(range = c(5, 25)),
prod_degree(range = c(1L, 2L)),
levels = c(num_terms = 5, prod_degree = 2)
)
print("Hyperparameter grids finalized.")
## [1] "Hyperparameter grids finalized."
We use a workflow set to tune all five models across our grouped CV splits and save the out-of-fold predictions to be used as inputs for the final ensemble (a hedged sketch of this step follows the timing output below).
## Base learners tuning: 526.64 sec elapsed
## [1] "Base learners tuned."
We create the stacked ensemble by adding the tuned base learners as candidates and blending their out-of-fold predictions using a non-negative LASSO meta-learner to find optimal, interpretable weights. Then, we fit the complete stacked model to the full training set.
# Create and fit the stacked ensemble
tic("Stacking")
# Initialize stack
set.seed(123)
emsci_stack <- stacks() %>%
add_candidates(model_set_results) # Add all tuned models at once
print(paste("Stack initialized with", ncol(emsci_stack), "candidate models"))
## [1] "Stack initialized with 70 candidate models"
# Blend the stack (fit meta-learner with non-negative LASSO)
set.seed(123)
emsci_stack_blended <- emsci_stack %>%
blend_predictions(
penalty = 10^seq(-4, -0.5, length.out = 20),
mixture = 1, # LASSO
non_negative = TRUE, # enforce interpretability
metric = metric_set(rmse) # RMSE prioritizes predictive accuracy
)
print("Stack blended successfully")
## [1] "Stack blended successfully"
# Fit the final stacked model on the full training data
set.seed(123)
emsci_stack_fitted <- emsci_stack_blended %>%
fit_members()
toc()
## Stacking: 58.93 sec elapsed
print("Final stacked ensemble fitted.")
## [1] "Final stacked ensemble fitted."
We perform the second stage of the evaluation on an independent set of 5-fold grouped CV splits: the base-learner out-of-fold predictions serve as inputs, the meta-learner reuses the optimal non-negative LASSO penalty found during blending, and the resampling yields the final out-of-sample performance metrics.
# Obtain out-of-fold predictions and performance metrics
tic("Model evaluation")
# Create the L1 dataset (base learner predictions + IDs)
l1_data <- emsci_stack %>%
mutate(.row = row_number()) %>%
left_join(
modeling_data %>%
mutate(.row = row_number()) %>%
select(.row, patient_id_global, center_id),
by = ".row"
)
print(paste("L1 dataset created:", nrow(l1_data), "patients"))
## [1] "L1 dataset created: 1500 patients"
# Define NEW, INDEPENDENT CV folds for meta-learner evaluation
set.seed(456) # Use a different seed for independence
l1_folds <- group_vfold_cv(l1_data, group = center_id, v = 5)
# Get optimal penalty from Stage 1
best_penalty <- emsci_stack_blended$penalty$penalty
print(paste("Using meta-learner penalty from Stage 1:", best_penalty))
## [1] "Using meta-learner penalty from Stage 1: 0.206913808111479"
# Define meta-learner spec
# Use !! to inject the numeric value now; avoids a silent scoping error inside fit_resamples
meta_spec <- linear_reg(
penalty = !!best_penalty,
mixture = 1 # LASSO
) %>%
set_engine("glmnet", lower.limits = 0) # Non-negative
# Define meta-learner recipe
meta_recipe <- recipe(motor_score_12m ~ ., data = l1_data) %>%
update_role(center_id, new_role = "ID") %>%
update_role(patient_id_global, new_role = "ID") %>%
update_role(.row, new_role = "ID")
# Create workflow
eval_metrics <- metric_set(rmse, rsq, mae)
meta_wf <- workflow() %>%
add_recipe(meta_recipe) %>%
add_model(meta_spec)
# Evaluate with CV
set.seed(456)
tic("Meta-learner CV")
meta_cv_results <- fit_resamples(
meta_wf,
resamples = l1_folds,
metrics = eval_metrics, # Pass the metrics
control = control_resamples(save_pred = TRUE)
)
toc()
## Meta-learner CV: 23.01 sec elapsed
print("Meta-learner CV complete")
## [1] "Meta-learner CV complete"
# Extract CV predictions
cv_predictions <- tidyr::unnest(meta_cv_results, .predictions)
# Create final plotting data
plot_data_cv <- cv_predictions %>%
left_join(
modeling_data %>%
mutate(.row = row_number()) %>%
select(.row, patient_id_global, baseline_ais),
by = ".row"
) %>%
select(patient_id_global, motor_score_12m, .pred, baseline_ais)
# Calculate final CV metrics
final_cv_metrics <- collect_metrics(meta_cv_results)
print(paste("Out-of-fold predictions extracted for",
nrow(plot_data_cv), "patients"))
## [1] "Out-of-fold predictions extracted for 1500 patients"
print("Performance metrics and predictions ready.")
## [1] "Performance metrics and predictions ready."
We extract the non-negative LASSO coefficients from the fitted stack, group them by base-learner type, and visualize the total weight and number of members each model contributes to the final ensemble.
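The coefficient-extraction and plotting code runs out of view; this is a hedged sketch of how the stacking weights could be grouped by base-learner family. It assumes the candidate names follow the workflow-set ids used above (e.g. "recipe_mars") and that collect_parameters() returns a coef column for the blended stack.

# Hedged sketch: stacking coefficients grouped by base-learner family
library(dplyr)
library(ggplot2)

autoplot(emsci_stack_blended, type = "weights")   # built-in view of the non-zero weights

candidate_ids <- c("recipe_elnet", "recipe_rf", "recipe_xgb", "recipe_svm", "recipe_mars")
stack_coefs <- purrr::map_dfr(
  candidate_ids,
  ~ collect_parameters(emsci_stack_blended, .x) %>% mutate(model_type = .x)
)

stack_coefs %>%
  filter(coef > 0) %>%                             # members retained by the non-negative LASSO
  group_by(model_type) %>%
  summarise(n_members = n(), total_weight = sum(coef), .groups = "drop") %>%
  ggplot(aes(reorder(model_type, total_weight), total_weight)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Total stacking weight")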
We gather performance metrics for the final stacked ensemble and the best-performing base learners.
Cross-validated performance metrics: stacked ensemble (retained 14 candidates) vs. best base learners.

| Model¹ | MAE | RMSE | R-squared |
|---|---|---|---|
| Best ElasticNet | 10.94 [10.49, 11.39] | 14.32 [13.85, 14.79] | 0.776 [0.742, 0.811] |
| Best Random Forest | 11.53 [11.01, 12.05] | 14.82 [14.25, 15.39] | 0.761 [0.729, 0.793] |
| Best XGBoost | 11.29 [10.81, 11.78] | 14.82 [14.17, 15.46] | 0.759 [0.727, 0.791] |
| Best SVM | 10.73 [10.24, 11.21] | 14.41 [13.91, 14.91] | 0.775 [0.740, 0.809] |
| Best MARS | 11.09 [10.64, 11.55] | 14.47 [13.98, 14.96] | 0.771 [0.735, 0.807] |
| Stacked ensemble | 11.28 [10.52, 12.05] | 14.75 [13.76, 15.75] | 0.739 [0.704, 0.774] |

¹ Base-learner metrics are optimistically biased (taken from stage 1 tuning). Stacked-ensemble metrics come from the independent stage 2 CV and are essentially unbiased (see Limitations for the minor residual bias from reusing the stage 1 penalty).
We assess model calibration by comparing predicted and observed scores from the out-of-fold predictions.
We stratify the data by predicted score range to analyze calibration-in-the-large (a sketch of this computation follows the table).
Calibration stratified by predicted score range:

| Score range | N | Mean predicted | Mean observed | Calibration error¹ |
|---|---|---|---|---|
| 0-25 | 815 | 10.0 | 9.6 | -0.40 |
| 26-50 | 285 | 39.3 | 39.6 | 0.32 |
| 51-75 | 342 | 61.5 | 62.3 | 0.84 |
| 76-100 | 58 | 81.8 | 81.5 | -0.29 |

¹ Calibration error = observed - predicted; positive values indicate underprediction.
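A minimal sketch of the stratified calibration summary and the overall calibration slope/intercept, using plot_data_cv from the evaluation step; the bin edges are assumptions matching the table above.

# Sketch of calibration-in-the-large by predicted score range
library(dplyr)
plot_data_cv %>%
  mutate(score_range = cut(.pred, breaks = c(-Inf, 25, 50, 75, Inf),
                           labels = c("0-25", "26-50", "51-75", "76-100"))) %>%
  group_by(score_range) %>%
  summarise(
    n                 = n(),
    mean_predicted    = mean(.pred),
    mean_observed     = mean(motor_score_12m),
    calibration_error = mean(motor_score_12m - .pred),   # observed - predicted
    .groups = "drop"
  )

# Weak calibration: intercept and slope of observed ~ predicted
coef(lm(motor_score_12m ~ .pred, data = plot_data_cv))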
We check the model error behavior by generating standard diagnostic plots from the out-of-fold predictions.
We assess the agreement between observed and predicted scores, visualizing the mean bias and 95% limits of agreement to check for any systematic error trends across the range of scores.
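A hedged sketch of the Bland-Altman style agreement analysis described above, using the out-of-fold predictions in plot_data_cv.

# Bland-Altman sketch: mean bias and 95% limits of agreement
library(dplyr)
library(ggplot2)
ba <- plot_data_cv %>%
  mutate(avg  = (motor_score_12m + .pred) / 2,
         diff = motor_score_12m - .pred)
bias <- mean(ba$diff)
loa  <- bias + c(-1.96, 1.96) * sd(ba$diff)     # 95% limits of agreement
ggplot(ba, aes(avg, diff)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = bias, linetype = "dashed") +
  geom_hline(yintercept = loa, linetype = "dotted") +
  labs(x = "Mean of observed and predicted motor score", y = "Observed - predicted")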
We visualize the model performance stratified by the baseline AIS grade to compare each group’s error against the overall model performance.
We evaluate the model’s clinical utility by defining a Minimal Clinically Important Difference (MCID) and plotting a histogram to find the percentage of predictions within this acceptable error.
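A minimal sketch of the MCID analysis; the MCID of ±4 points follows the value quoted in the results summary below.

# Share of out-of-fold predictions within the MCID, plus error histogram
library(dplyr)
library(ggplot2)
mcid <- 4
errors <- plot_data_cv %>% mutate(error = .pred - motor_score_12m)
mean(abs(errors$error) <= mcid) * 100            # % of predictions within the MCID
ggplot(errors, aes(error)) +
  geom_histogram(binwidth = 2) +
  geom_vline(xintercept = c(-mcid, mcid), linetype = "dashed") +
  labs(x = "Prediction error (predicted - observed)", y = "Count")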
We test the model’s classification utility by setting a recovery threshold and generating a Precision-Recall curve from the model’s predictions.
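A hedged sketch of the precision-recall analysis; the 50-point recovery threshold is an assumption for illustration, not necessarily the threshold actually used.

# Precision-recall curve for a binarized recovery outcome
library(dplyr)
library(yardstick)
recovery_threshold <- 50                          # assumed threshold
pr_data <- plot_data_cv %>%
  mutate(recovered = factor(motor_score_12m >= recovery_threshold,
                            levels = c(TRUE, FALSE)))
pr_curve(pr_data, truth = recovered, .pred) %>%   # predicted score acts as the ranking score
  autoplot()
pr_auc(pr_data, truth = recovered, .pred)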
We conduct a permutation-based feature importance analysis, plotting the top 20 features ranked by the mean drop in performance (RMSE) when their values are shuffled.
## [1] "DALEX explainer created successfully"
## Permutation importance: 1608.05 sec elapsed
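The explainer and importance code ran out of view; this hedged sketch shows one way the DALEX explainer and permutation importance reported above could be set up. The prediction wrapper, B = 10 permutations, and the label are assumptions.

# Hedged sketch of the DALEX explainer + permutation importance (RMSE loss)
library(DALEX)
pred_fun <- function(model, newdata) {
  predict(model, new_data = newdata)$.pred        # stacks prediction as a plain numeric vector
}
explainer <- DALEX::explain(
  model = emsci_stack_fitted,
  data  = modeling_data %>% select(-motor_score_12m),
  y     = modeling_data$motor_score_12m,
  predict_function = pred_fun,
  label = "Stacked ensemble"
)
set.seed(123)
perm_importance <- model_parts(
  explainer,
  loss_function = loss_root_mean_square,          # importance = increase in RMSE after permutation
  B = 10                                          # number of permutations (assumption)
)
plot(perm_importance, max_vars = 20)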
We calculate SHapley Additive exPlanations (SHAP) values for the full dataset to explain the stacked model’s predictions and create a global importance bar chart, ranking the top 20 features by their mean absolute SHAP value.
## SHAP calculation (full sample): 4522.95 sec elapsed
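The SHAP code ran out of view; this hedged sketch shows how fastshap could compute SHAP values for the stacked model and the mean-absolute-SHAP ranking. The nsim value and the prediction wrapper are assumptions; the ID columns are kept in X so the recipe inside the stack can still process new data.

# Hedged sketch of the fastshap computation and global importance ranking
library(dplyr)
library(fastshap)
X <- modeling_data %>% select(-motor_score_12m)
shap_features <- setdiff(names(X), c("patient_id", "patient_id_global", "center_id"))
pred_wrapper <- function(object, newdata) {
  predict(object, new_data = newdata)$.pred
}
set.seed(123)
shap_values <- fastshap::explain(
  emsci_stack_fitted,
  X             = X,
  feature_names = shap_features,
  pred_wrapper  = pred_wrapper,
  nsim          = 50                              # Monte Carlo repetitions per feature (assumption)
)
# Global importance: mean absolute SHAP value per feature, top 20
tibble(feature = colnames(shap_values),
       mean_abs_shap = colMeans(abs(shap_values))) %>%
  arrange(desc(mean_abs_shap)) %>%
  slice_head(n = 20)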
We generate SHAP dependency plots for the top numeric features to visualize their marginal effects and explore potential interactions.
We identify the patients with the best and worst predictions from the cross-validation set and generate permutation-based “break-down” plots for these two individuals to visualize how each feature contributed to their specific final score.
Note: OOF (out-of-fold) is the model's unbiased prediction for this subject, calculated during cross-validation when the model had never seen the subject; the OOF error is used only to identify the patient as "best" or "worst". IN-SAMPLE is the prediction from the final model, which was trained on all the data (including the chosen subjects), and it is the prediction that the plot explains. The plot starts at the average prediction for the dataset (E[f(x)]), and each bar is a feature's SHAP value, showing its push on the prediction until the final prediction (f(x)) is reached.
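The individual-level explanation code ran out of view; this hedged sketch reuses the DALEX explainer from the permutation-importance sketch above and computes permutation-based SHAP contributions with predict_parts(); the patient-selection logic and B = 25 are illustrative.

# Hedged sketch: explanations for the best- and worst-predicted patients
library(dplyr)
oof_errors <- plot_data_cv %>% mutate(abs_error = abs(.pred - motor_score_12m))
best_id  <- oof_errors$patient_id_global[which.min(oof_errors$abs_error)]
worst_id <- oof_errors$patient_id_global[which.max(oof_errors$abs_error)]
for (pid in c(best_id, worst_id)) {
  obs <- modeling_data %>%
    filter(patient_id_global == pid) %>%
    select(-motor_score_12m)
  bd <- DALEX::predict_parts(explainer, new_observation = obs,
                             type = "shap", B = 25)   # permutation-based SHAP contributions
  print(plot(bd))
}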
The model’s performance is limited by its reliance on simulated data, which lacks the noise, heterogeneity, and missingness characteristic of real-world multi-center SCI data.
We implemented a standard two-stage CV approach. While it is computationally efficient and uses repeated grouped CV to obtain stable out-of-fold predictions, it carries an optimization bias. Nested CV would be preferred but computationally prohibitive.
The project uses embedded selection by passing all features to the stacked ensemble, employing the inherent feature selection capabilities of the base learners. Wrapper methods (e.g., Boruta) would be preferred but computationally prohibitive.
The analysis implements a pragmatic and computationally efficient two-stage evaluation in lieu of a full nested CV framework, introducing known trade-offs:

- The hyperparameters for the base learners were selected using the same 5-fold CV that generated the out-of-fold predictions. This is standard for the stacks package but is slightly optimistic.
- The optimal penalty for the meta-learner was selected in stage 1 using out-of-fold predictions from all 1,500 patients, and this single penalty was reused in stage 2, so the final metrics are slightly optimistic. This approach is a practical compromise that trades a minor, acceptable bias for a large gain in computational efficiency.
Future iterations will include deeper investigation using SHAP dependency plots, partial dependence plots, and a quantitative analysis of the ensemble diversity.
## These are not valid conclusions. This is a programming exercise. For completeness, here are the results:
## Overall performance (5-fold CV):
## - RMSE: 14.75 (13.76, 15.75)
## - MAE: 11.28 (10.52, 12.05)
## - R-squared: 0.739 (0.704, 0.774)
## Calibration:
## - Calibration slope: 1.012 (ideal = 1)
## - Calibration intercept: -0.326 (ideal = 0)
## Clinical utility:
## - Predictions within MCID (±4 points): 23.3%
## Bland-Altman analysis:
## - Mean bias: 0.02 points
## - 95% Limits of Agreement: [-29.18, 29.23]
## Meta-Learner composition:
## - Total members selected: 14 (out of 69 candidates)
## - Dominant model type: MARS (48.2% of total weight)
## Subgroup performance:
## - AIS A (n=617): RMSE=14.76, R-squared=0.718
## - AIS B (n=200): RMSE=15.83, R-squared=0.722
## - AIS C (n=284): RMSE=14.75, R-squared=0.753
## - AIS D (n=399): RMSE=14.72, R-squared=0.754
## R version: R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32
## Key ML packages:
## tidymodels 1.4.1
## tune 2.0.1
## stacks 1.1.1
## ranger 0.17.0
## xgboost 1.7.11.1
## earth 5.3.4
## DALEX 2.5.3
## fastshap 0.1.1