What does it take to build medical machine learning models that work equally well for all patients? Or, inversely, why do models often work better for some patient groups than others? This is what we set out to answer in our recent Cell Press Patterns perspective.
More specifically, how can we level up performance instead of achieving equal performance by pulling down all groups to the lowest performance level? (See excellent work by Sandra Wachter, Brent Mittelstadt, Chris Russell on this.) We argue that there is a path toward equal performance in most medical applications. It may, however, be long and winding.
Representation in the training data is an obvious and important issue, but it is not the only one. Patient groups can (and often do) still underperform even if they are well-represented. There is another crucial factor: the difficulty of the prediction task may differ between patient groups. We mean this in a purely statistical sense - the relationship between prediction inputs and prediction targets can simply be less deterministic in some groups. Why would difficulty differ between patient groups? Measurements may be more noisy (e.g. abdominal ultrasound in obese patients), or there may be unobserved causes of the outcome that are more important in some groups than others (e.g. hormone fluctuations, comorbidities).
Complicating everything, the prediction targets may be noisy (think disease labels automatically inferred from EHRs) or biased against certain groups (think underdiagnosis), affecting both the model and its evaluation. Same for selection biases. Under label noise/bias and selection bias, the equality of performance metrics on a test set is neither necessary nor sufficient for “performance fairness”. Thus, investigating and addressing these should be the first priority!
To address differing task difficulty, the causes of the differences must be understood and resolved. This may require identifying appropriate additional (or alternative) measurements that resolve residual uncertainty in affected groups. In other words: we may not always need more data from underperforming groups, but rather better data. (Or both.) Importantly, standard algorithmic fairness solutions are strictly limited in what they can achieve: if the statistical relationship between inputs and outputs is simply more noisy in some group, no amount of “fair learning” can fix this!
In the paper (co-authored with Sune Holm, Melanie Ganz-Benjaminsen, Aasa Feragen), we discuss many more concrete medical examples of the different sources of bias, and we propose some tentative solution approaches. We also connect the different sources of performance differences with a group-conditional bias-variance-noise decomposition due to Irene Chen, Fredrik Johansson, and David Sontag, as well as with epistemic and aleatoric uncertainty.