Software as a Medical Device:

Regulating AI in Healthcare via Responsible AI

KDD 2021 Tutorial

Tutorial Presenters

Muhammad Aurangzeb Ahmad, Carly Eckert, Christine Allen, Vikas Kumar, Ankur Teredesai

Affiliations

  1. Department of Computer Science, University of Washington – Bothell
  2. KenSci Inc., Seattle, Washington
  3. Department of Bioinformatics and Medical Education, University of Washington
  4. Department of Epidemiology, University of Washington
  5. Department of Computer Science, Center for Data Science, University of Washington - Tacoma

Corresponding Author: Muhammad Aurangzeb Ahmad maahmad@uw.edu

Short Abstract

Responsible AI is a fundamental requirement for applying AI and machine learning in healthcare, and regulatory bodies such as the FDA and the EU have put forward frameworks to govern it, yet these proposals are often short on practical specifics. This tutorial offers a practical guide to navigating the proposed regulations and walks through the constituent elements of a responsible AI system in healthcare, treating responsible AI/ML as a systems-level problem spanning IT infrastructure, models, services, and post-deployment monitoring, illustrated with real-world use cases, open-source libraries, and publicly available datasets.

Long Abstract

Responsible AI is a fundamental requirement for the application of AI and machine learning (ML) in healthcare. With the increased adoption of AI/ML in this domain, there is a growing recognition of, and demand for, regulation of AI/ML in healthcare to avoid potential harm and unfair bias against vulnerable populations. Regulatory bodies such as the FDA, the European Union (via the GDPR), and China’s New Generation AI Governance Expert Committee have either promulgated or put forward regulatory frameworks for responsible AI. Implementing a responsible AI system that complies with potential regulations thus becomes a daunting task given the complexity of the problem, especially for researchers and practitioners who are starting out or who lack domain expertise in both healthcare and AI. A survey of the regulations proposed by around a hundred governmental bodies and commissions, as well as by leaders in the tech sector such as Google, Facebook, Amazon, and Baidu, reveals that many of these proposals are short on specifics. This has led to charges of ethics washing, where guidelines for ethical or responsible AI are used as cover for not investing in meaningful AI/ML infrastructure and systems. In this tutorial we offer a guide to navigating these complex regulations and explain the practical constituent elements of a responsible AI system in healthcare in light of the proposed regulations. Additionally, we emphasize that the recommendations from regulatory bodies like the FDA or the EU are necessary but not sufficient elements of creating a responsible AI system.

In the tutorial we will elucidate how regulations and guidelines often focus on epistemic concerns to the detriment of practical ones, e.g., requiring fairness without explicating what fairness constitutes in a given use case. We posit that responsible AI/ML in healthcare is a systems-level problem: it comprises the IT infrastructure, the AI/ML components, the healthcare delivery and services tied to the ML models, and the monitoring of outcomes. Regulatory regimes have traditionally expected well-defined and somewhat deterministic behavior from software. AI/ML in general does not fit this mold given the stochastic and probabilistic nature of such systems. Preliminary guidelines released by the FDA in early 2021 for AI/ML-based software as a medical device (henceforth the SaMD document) extend regulatory frameworks by incorporating pre-determined change control plans into software regulation, i.e., defining beforehand the scope of what is likely to change in a model. In this tutorial we will work through a use case covering the technical requirements for such a change control plan and the limitations it may have post-deployment.
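As a concrete illustration, such a plan could be encoded as a machine-readable manifest that a retraining pipeline enforces automatically. The following is a minimal sketch; the schema, field names, and thresholds are our own hypothetical assumptions, not anything prescribed by the FDA:

```python
# Hypothetical encoding of a pre-determined change control plan.
# The schema, field names, and thresholds below are illustrative
# assumptions, not an FDA-prescribed format.
CHANGE_CONTROL_PLAN = {
    "model_id": "los-predictor-v1",
    "allowed_changes": {
        "retraining": {
            "data_window_days": 365,    # retrain only on a rolling one-year window
            "same_feature_set": True,   # no new input features without re-review
            "same_model_family": True,  # e.g., gradient-boosted trees only
        },
    },
    "performance_guardrails": {
        "auroc_min": 0.75,              # candidate must meet or exceed this
        "max_subgroup_auroc_gap": 0.05, # largest AUROC gap across protected groups
    },
    "requires_new_review": [
        "change of intended use or target population",
        "new input feature or data source",
        "change of model family",
    ],
}

def within_predeclared_scope(candidate_metrics: dict,
                             plan: dict = CHANGE_CONTROL_PLAN) -> bool:
    """Gate a retrained candidate model against the pre-declared guardrails."""
    g = plan["performance_guardrails"]
    return (candidate_metrics["auroc"] >= g["auroc_min"]
            and candidate_metrics["subgroup_auroc_gap"] <= g["max_subgroup_auroc_gap"])
```

A candidate that falls outside this pre-declared scope would not be deployed automatically; it would instead be routed for human review and, potentially, a new regulatory submission.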

The FDA’s SaMD document and the EU’s GDPR, among other AI governance documents, speak to the need for implementing good machine learning practices. In this tutorial we spell out what that means in practice for real-world healthcare use cases throughout the machine learning lifecycle: data management, data specification, feature engineering, model evaluation, model specification, model explainability, model fairness, reproducibility, and checks for data leakage and model leakage. We will illustrate these with real-world use cases and example code using open-source, publicly available datasets and libraries such as InterpretML, ErrorAnalysis, fairMLHealth, Fairlearn, and AIF360. As an example, consider feature engineering, which needs to address questions such as whether a transformation is lossy or lossless, what data missingness looks like, and which features are available at runtime versus training time.
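Below is a minimal sketch of two such checks, assuming a pandas DataFrame `train_df` of training data and a set `runtime_features` listing what the scoring service can actually supply at prediction time:

```python
import pandas as pd

def missingness_profile(train_df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per feature, worst first."""
    return train_df.isna().mean().sort_values(ascending=False)

def train_serve_feature_gap(train_df: pd.DataFrame, runtime_features: set) -> set:
    """Features used at training time that will NOT be available at runtime.

    A non-empty result signals training/serving skew: the model would be
    trained on information the deployed system cannot provide.
    """
    return set(train_df.columns) - set(runtime_features)

# Example usage (train_df and the runtime feature set are assumed inputs):
# print(missingness_profile(train_df))
# gap = train_serve_feature_gap(train_df, {"age", "sex", "admission_type"})
# assert not gap, f"Features unavailable at runtime: {gap}"
```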

To further illustrate, consider the problem of predicting length of stay in a hospital. We will start with the pros and cons of treating it as a classification versus a regression problem, and note that the academic literature on this problem mostly ignores how insights derived from such prediction models are used in practice; e.g., for measuring model performance, MAE for regression or precision and recall for classification computed over the whole model is not in itself very useful. Optimizing for length-of-stay prediction in isolation may end up worsening the risk of readmission to the hospital for the population as a whole, and thus be detrimental. Lastly, model performance disparities across racial and ethnic groups may reveal systematic problems that would need to be rectified via mitigation strategies involving clinicians, discharge planners, and the vulnerable populations who may be negatively impacted.
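The following is a minimal sketch on synthetic data of how an aggregate MAE can mask a large per-group error disparity (the group labels and noise structure are assumed purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

def overall_and_group_mae(y_true, y_pred, group) -> pd.Series:
    """MAE per patient group, plus the aggregate MAE that hides the gap."""
    df = pd.DataFrame({"y": y_true, "yhat": y_pred, "group": group})
    per_group = df.groupby("group")[["y", "yhat"]].apply(
        lambda g: mean_absolute_error(g["y"], g["yhat"]))
    per_group["OVERALL"] = mean_absolute_error(y_true, y_pred)
    return per_group

# Synthetic length-of-stay data: the minority group gets noisier predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(1, 15, size=1000).astype(float)           # observed days
group = np.where(rng.random(1000) < 0.2, "minority", "majority")
noise = np.where(group == "minority",
                 rng.normal(0, 4, 1000), rng.normal(0, 1, 1000))
y_pred = np.clip(y_true + noise, 0, None)                       # model output
print(overall_and_group_mae(y_true, y_pred, group))
```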

We note that conceptualizing responsible AI as a process rather than an end product accords well with how AI/ML systems are used in practice. AI governance documents like the FDA’s SaMD strongly recommend taking a stakeholder-centric view of AI/ML problems. Consider a disease prediction model: a clinician may want a model that maximizes overall performance, while a patient who belongs to a minority group may want to know how likely it is that the model will score them incorrectly. Creating equitable models may require balancing multiple competing optimization criteria, and creating fair and unbiased models may require agreeing upon which notion of fairness (statistical parity vs. equalized odds vs. sufficiency) is applicable. The same applies to model transparency: we will discuss a taxonomy of use cases in healthcare AI/ML and the trade-offs needed to balance post-hoc explanations, fairness measurements, and the practical constraints of model deployment.
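As a minimal sketch, two of these competing criteria can be computed on the same predictions with Fairlearn (the data below is random and purely illustrative):

```python
import numpy as np
from fairlearn.metrics import (demographic_parity_difference,
                               equalized_odds_difference)

# Synthetic binary disease predictions for two patient groups.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_pred = rng.integers(0, 2, size=500)
race = rng.choice(["group_a", "group_b"], size=500)

# Statistical parity: do groups receive positive predictions at equal rates?
dp_gap = demographic_parity_difference(y_true, y_pred, sensitive_features=race)

# Equalized odds: are error rates (TPR and FPR) equal across groups?
eo_gap = equalized_odds_difference(y_true, y_pred, sensitive_features=race)

print(f"statistical parity gap: {dp_gap:.3f}")
print(f"equalized odds gap:     {eo_gap:.3f}")
```

For a non-trivial model these criteria generally cannot be satisfied simultaneously when base rates differ across groups, which is why the applicable notion of fairness has to be agreed upon per use case.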

While governance documents note the need to address real-world issues related to model performance, they hardly go into details. In the tutorial we will discuss and illustrate use cases related to model performance in training versus production, real-world usage monitoring, and how monitoring results should be fed back to enhance the models. To summarize, the focus of the tutorial is on responsible AI/ML in healthcare as a systems-level phenomenon that requires compliance at the level of the IT infrastructure, the technical aspects of the AI/ML system, the services and outcomes associated with the intended use case, and the human factors involved.
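As a minimal sketch of one such monitoring check, a two-sample Kolmogorov–Smirnov test can flag when a feature’s production distribution drifts from its training distribution (the feature, populations, and threshold here are assumed for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drift_alert(train_values, prod_values, alpha: float = 0.01) -> bool:
    """True if the production distribution differs detectably from training."""
    _, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha

# Synthetic example: the deployed population skews older than the training one.
rng = np.random.default_rng(2)
train_age = rng.normal(55, 12, size=5000)  # patient age at training time
prod_age = rng.normal(62, 12, size=500)    # patient age post-deployment
if feature_drift_alert(train_age, prod_age):
    print("Drift detected: trigger review and retraining under the change control plan")
```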

Tutorial Outline

Bios

Muhammad Aurangzeb Ahmad

Muhammad Aurangzeb Ahmad is a Research Scientist at KenSci Inc., a machine learning/AI healthcare informatics company focused on risk prediction in healthcare. He is also an Affiliate Associate Professor in the Department of Computer Science at the University of Washington Bothell. He has held academic appointments at the University of Washington, the Center for Cognitive Science at the University of Minnesota, the Minnesota Population Center, and the Indian Institute of Technology Kanpur. He has published more than 50 research papers in top machine learning and data mining conferences such as KDD, AAAI, SDM, and PAKDD.

Carly Eckert

Carly Eckert, MD, MPH, is a preventive medicine physician and epidemiologist. She has worked in industry for four years, collaborating closely with data scientists to develop and implement machine learning solutions in healthcare. Carly is the Chief Clinical Officer at Greenlight Ready, where she works on last-mile implementation for public health practice. She is also a clinical advisor at KenSci and a doctoral student in epidemiology at the University of Washington, where she studies transfer learning in trauma outcomes prediction.

Christine Allen

Christine received her MS in Computer Science and Systems from the University of Washington after starting her data career as an analyst and researcher at the Institute for Health Metrics and Evaluation (IHME). Her work has been published in The Lancet, the Journal of the American Medical Association (JAMA), JAMA Oncology, The Lancet Neurology, and others. Her current focuses are ML interpretability and responsible AI in the clinical domain.

Vikas Kumar

Vikas Kumar is a Data Scientist at KenSci, where he works with a team of data scientists and clinicians to build consumable and trustworthy machine learning solutions for healthcare. His focus is on building explainable models in healthcare and on the application of recommendation systems in clinical settings. Vikas holds a Ph.D. with a major in Computer Science and a minor in Statistics from the University of Minnesota, Twin Cities. He has worked on the modeling and application of recommendation systems in various domains, such as media, location, and healthcare, with a focus on interpreting the balance users seek between known (familiar) and unknown (novel) items to build adaptive recommendations. Prior to his Ph.D., he completed his Bachelor’s degree at the National Institute of Technology, India, and worked as a software engineer at Microsoft India.

Ankur Teredesai

Ankur Teredesai is a Professor in the Department of Computer Science at the University of Washington Tacoma, and the founder and director of the Center for Data Science at the University of Washington. He is also the founder and CTO of KenSci, a vertical machine learning/AI healthcare informatics company focused on risk prediction in healthcare. Professor Teredesai has published more than 70 research papers in top machine learning and data mining conferences such as KDD, AAAI, CIKM, SDM, and PKDD. He is also the information officer of ACM SIGKDD, and presented the tutorial “Deep Explanations in Machine Learning via Interpretable Visual Methods” at PAKDD 2020 (Singapore, June 11-16, 2020).

References:

  1. Adelman, Larry. “Unnatural causes: Is inequality making us sick?.” Preventing Chronic Disease 4, no. 4 (2007).
  2. Barocas, Solon, and Moritz Hardt. “Fairness in machine learning.” NeurIPS Tutorial, 2017.
  3. Bellamy, Rachel KE, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia et al. “AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias.” IBM Journal of Research and Development 63, no. 4/5 (2019): 4-1.
  4. Benjamens, Stan, Pranavsingh Dhunnoo, and Bertalan Meskó. “The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database.” NPJ digital medicine 3, no. 1 (2020): 1-8.
  5. Binns, Reuben. “Fairness in machine learning: Lessons from political philosophy.” arXiv preprint arXiv:1712.03586 (2017).
  6. Chen, Esther H., Frances S. Shofer, Anthony J. Dean, Judd E. Hollander, William G. Baxt, Jennifer L. Robey, Keara L. Sease, and Angela M. Mills. “Gender disparity in analgesic treatment of emergency department patients with acute abdominal pain.” Academic Emergency Medicine 15, no. 5 (2008): 414-418.
  7. Chouldechova, Alexandra, and Aaron Roth. “The frontiers of fairness in machine learning.” arXiv preprint arXiv:1810.08810 (2018).
  8. Corbett-Davies, Sam, and Sharad Goel. “Defining and Designing Fair Algorithms.” Tutorials at EC 2018 and ICML 2018.
  9. Corbett-Davies, Sam, and Sharad Goel. “The measure and mismeasure of fairness: A critical review of fair machine learning.” arXiv preprint arXiv:1808.00023 (2018).
  10. Dafoe, Allan. “AI governance: a research agenda.” Governance of AI Program, Future of Humanity Institute, University of Oxford: Oxford, UK (2018).
  11. Dawes, Robyn M., David Faust, and Paul E. Meehl. “Clinical versus actuarial judgment.” Science 243, no. 4899 (1989): 1668-1674.
  12. Dresser, Rebecca. “Wanted single, white male for medical research.” The Hastings Center Report 22, no. 1 (1992): 24-29.
  13. Evans, Barbara J., and Frank A. Pasquale. “Product Liability Suits for FDA-Regulated AI/ML Software.” The Future of Medical Device Regulation: Innovation and Protection (I. Glenn Cohen, Timo Minssen, W. Nicholson Price II, Christopher Robertson & Carmel Shachar eds., Cambridge University Press, 2021 forthcoming) (2020).
  14. Friedler, Sorelle A., Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. “A comparative study of fairness-enhancing interventions in machine learning.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 329-338. 2019.
  15. Gasser, Urs, and Virgilio AF Almeida. “A layered model for AI governance.” IEEE Internet Computing 21, no. 6 (2017): 58-62.
  16. Ghassemi, Marzyeh, Tristan Naumann, Peter Schulam, Andrew L. Beam, and Rajesh Ranganath. “Opportunities in machine learning for healthcare.” arXiv preprint arXiv:1806.00388 (2018).
  17. Grote, Thomas, and Philipp Berens. “On the ethics of algorithmic decision-making in healthcare.” Journal of Medical Ethics 46, no. 3 (2020): 205-211.
  18. Hajian, Sara, Francesco Bonchi, and Carlos Castillo. “Algorithmic bias: From discrimination discovery to fairness-aware data mining.” KDD Tutorial, 2016.
  19. Harvey, H. Benjamin, and Vrushab Gowda. “How the FDA regulates AI.” Academic radiology 27, no. 1 (2020): 58-61.
  20. Holstein, Kenneth, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudik, and Hanna Wallach. “Improving fairness in machine learning systems: What do industry practitioners need?.” In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-16. 2019.
  21. Hutchinson, Ben, and Margaret Mitchell. “50 Years of Test (Un)fairness: Lessons for Machine Learning.” In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 49-58. 2019.
  22. Kamiran, Faisal, and Toon Calders. “Classifying Without Discriminating.” In Proc. 2nd International Conference on Computer, Control and Communication, 2009.
  23. Kilbertus, Niki, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. “Avoiding discrimination through causal reasoning.” In Advances in Neural Information Processing Systems, pp. 656-666. 2017.
  24. Kiseleva, Anastasiya. “AI as a Medical Device: Is It Enough to Ensure Performance Transparency and Accountability?.” EPLR 4 (2020): 5.
  25. Krieger, Nancy. “Discrimination and health inequities.” International Journal of Health Services 44, no. 4 (2014): 643-710.
  26. Miller, Tim. “Explanation in artificial intelligence: Insights from the social sciences.” arXiv preprint arXiv:1706.07269 (2017).
  27. Nabi, Razieh, and Ilya Shpitser. “Fair inference on outcomes.” In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
  28. Narayanan, Arvind. “21 fairness definitions and their politics.” Tutorial, ACM Conference on Fairness, Accountability, and Transparency (ACM FAT*), 2018.
  29. Perry, Brandon, and Risto Uuk. “AI governance and the policymaking process: key considerations for reducing AI risk.” Big data and cognitive computing 3, no. 2 (2019): 26.
  30. Pesapane, Filippo, Caterina Volonté, Marina Codari, and Francesco Sardanelli. “Artificial intelligence as a medical device in radiology: ethical and regulatory issues in Europe and the United States.” Insights into imaging 9, no. 5 (2018): 745-753.
  31. Rajkomar, Alvin, Michaela Hardt, Michael D. Howell, Greg Corrado, and Marshall H. Chin. “Ensuring fairness in machine learning to advance health equity.” Annals of internal medicine 169, no. 12 (2018): 866-872.
  32. Rajkomar, Alvin, Jeffrey Dean, and Isaac Kohane. “Machine learning in medicine.” New England Journal of Medicine 380, no. 14 (2019): 1347-1358.
  33. Rudin, Cynthia, and Berk Ustun. “Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice.” Interfaces 48, no. 5 (2018): 449-466.
  34. Smedley, Brian D., Adrienne Y. Stith, and Alan R. Nelson, eds. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. Institute of Medicine, Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. Washington, DC: National Academies Press, 2003.
  35. Vayena, Effy, Alessandro Blasimme, and I. Glenn Cohen. “Machine learning in medicine: addressing ethical challenges.” PLoS medicine 15, no. 11 (2018).
  36. Wang, Weiyu, and Keng Siau. “Artificial intelligence: a study on governance, policies, and regulations.” MWAIS 2018 proceedings (2018).
  37. Wu, Wenjun, Tiejun Huang, and Ke Gong. “Ethical principles and governance technology development of AI in China.” Engineering 6, no. 3 (2020): 302-309.