Choice of Machine Learning Models is critical to detecting subconscious racial bias in surgical infections.

Author(s):
Addison Heffernan; Reetam Ganguli; Isaac Sears; Robert Parker; Daithi Heffernan

Background:

Surgical quality datasets are critical to decision-making tools, including those predicting surgical infection (SI). Machine learning models (MLMs), a branch of artificial intelligence, have gained traction in surgical algorithms. Given their non-human mathematical basis, MLMs could detect otherwise unseen subconscious biases within surgical datasets.

Methods:

Five years of NSQIP data were imported into Python, pre-processed, and split into 80% training and 20% testing sets to generate MLMs. We tested a hierarchy of validity of four MLMs (XGBoost (XGB), K-Nearest Neighbors (KNN), Random Forest (RanFor), and Logistic Regression (LR)) to predict post-operative SI. Receiver operating characteristic area-under-the-curve (AUC) values were generated. To assess built-in racial bias from merely “knowing a patient’s race,” MLMs were tested with vs without vs random allocation of “race.” Finally, to test the adaptability of MLMs to low-data-processing, low-resource environments, we adapted and tested the MLMs on a database from a rural academic hospital in Kenya.
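The pipeline above (80/20 split, four models, AUC comparison) can be sketched as follows. This is a minimal illustration on synthetic data: the feature matrix, labels, and model settings are placeholders, not the study's actual NSQIP variables, and scikit-learn's GradientBoostingClassifier stands in for the xgboost library the study names.

```python
# Illustrative sketch of the Methods pipeline; all data here are synthetic
# stand-ins, not NSQIP variables.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # stand-in for pre-processed features
y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)   # stand-in SI labels

# 80% training / 20% testing split, as in the Methods
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "XGB": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
    "RanFor": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
}

# Fit each model and score it by ROC AUC on the held-out 20%
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(aucs)
```

Ranking the models by these held-out AUCs gives the "hierarchy of validity" the Methods describe.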

Results:

Overall, 3,416,094 NSQIP patients were included (average age 57.4 years; 57% male; 11% non-elective). The SI rate was 3.3% for elective and 8.6% for emergency cases. First, a hierarchy of MLM predictive performance emerged: XGB and RanFor had superior predictability (AUC=0.90) compared with KNN (AUC=0.67), with LR performing worst (p<0.01, XGB vs KNN and vs LR). Further, XGB, RanFor, and KNN performed best when applied to emergency compared with elective cases (p<0.05 for paired groups). Non-White patients had an increased risk of infection (OR=1.19; p<0.05). Concerningly, merely “telling” the XGB or RanFor MLMs the race of the patient significantly affected whether a non-White patient was predicted to develop a post-operative SI [XGB (0.96 vs 0.90; p<0.05) and RanFor (0.97 vs 0.91; p<0.05)]. Manipulating “race” within the dataset did not affect either KNN (0.67 vs 0.65) or LR (0.52 vs 0.51). Next, we translated the Python-based MLMs to a prospectively gathered dataset of critically ill patients from Kenya (442 patients; average age 36 years; 64% male; 26% infection rate). KNN remained the lowest-performing model (76.1% predictability). Here, RanFor proved superior to all other models, with 83% predictability.
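The with-vs-without-vs-random “race” comparison reported above can be sketched as below. Everything here is an assumption for illustration: the data are synthetic, the binary race indicator and the size of its injected association with the label are invented, and the resulting AUC gaps are not the study's numbers.

```python
# Illustrative sketch of the race-allocation comparison: train the same
# classifier with the race column, without it, and with it randomly shuffled.
# All data and effect sizes are synthetic, not NSQIP results.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
race = rng.integers(0, 2, size=n)     # hypothetical binary race indicator
X_base = rng.normal(size=(n, 4))      # stand-in clinical features
# Synthetic labels partly associated with race, mimicking a biased dataset
y = (X_base[:, 0] + 0.8 * race + rng.normal(size=n) > 1).astype(int)

def heldout_auc(features):
    """Fit RanFor on 80% and return ROC AUC on the held-out 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, y, test_size=0.2, random_state=1
    )
    clf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

auc_with_race = heldout_auc(np.column_stack([X_base, race]))
auc_without = heldout_auc(X_base)
auc_shuffled = heldout_auc(np.column_stack([X_base, rng.permutation(race)]))
print(auc_with_race, auc_without, auc_shuffled)
```

A model whose score changes when the true race column is supplied, but not when a shuffled one is, is leaning on race itself, which is the signature of built-in bias the Results describe for XGB and RanFor.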

Conclusions:

We identified a hierarchy of MLMs, with XGBoost proving superior in emergency cases. This is critical, since MLMs and AI are increasingly used in surgical decision algorithms. MLMs can detect, and potentially correct, subconscious bias built into human-generated datasets. MLMs can also serve as tools in low-resource countries.