Bringing AI into Clinical Use
- Dr. Nigam Shah is a professor of medicine at Stanford University and chief data scientist for Stanford Health Care, with research focused on bringing AI into clinical use safely, ethically, and cost-effectively 14s.
- Dr. Shah is an inventor on patents, has authored over 300 scientific publications, has co-founded three companies, and was inducted into the American College of Medical Informatics and the American Society for Clinical Investigation 27s.
- The quality of AI and machine learning models in healthcare is heavily dependent on the quality of the data they are trained on, with data being collected from patient timelines 1m33s.
- Patient timelines are visualized as a series of data points collected over time, including ECGs, blood pressure, respiratory rate, cardiac output, medication orders, lab tests, and reports 2m7s.
Data Quality and Patient Timelines
- In a typical healthcare setting, not all data is collected at every point in time, and there is often a lack of longitudinal coverage for individual patients 2m46s.
- The manipulation and processing of patient timeline data have a significant impact on the performance of AI and machine learning models 3m22s.
- With large amounts of patient timeline data, models can be built to support decision-making in healthcare, including whether to treat a patient and how to treat a patient 4m1s.
Classification vs. Prediction vs. Recommendation
- The decision of whether to treat a patient can be broken down into classification or diagnosis tasks, or prediction tasks, such as prognosis 4m18s.
- The terms "prediction" and "classification" are often misused, with classification being the correct term when analyzing an image to determine if it contains a specific object or condition, such as pneumonia or a dog, as the outcome is already present 4m46s.
- In medicine, many things that claim to be predictions are actually classifications, such as sepsis predictors, which are actually figuring out if a patient has sepsis, and this distinction is important as it affects how the information is used 5m13s.
- The distinction between prediction and classification is crucial, as predicting an outcome may lead to attempts to prevent it, while classifying and diagnosing a condition leads to treatment, not prevention 5m30s.
- Recommendation is the hardest task, given the data's limitations and biases, and it has been a 40-year journey in medicine to figure out reliable recommendation 5m58s.
- There are three things that can be done with AI in healthcare: classification, prediction, and recommendation, and it's essential to consider whether these technical exercises are advancing the science of medicine, the practice of medicine, or the delivery of medical care 6m13s.
Advancing Science, Practice, and Delivery of Medical Care
- An example of advancing the science of medicine is the discovery of three subtypes of heart failure with preserved ejection fraction, which would be a classification 6m42s.
- Advancing the practice of medicine would involve developing a test to determine the subtype of heart failure and having a treatment available to target the specific subtype 6m58s.
- Advancing the delivery of medical care would involve implementing the test and treatment over time, resulting in improved patient outcomes, such as longer life, lower costs, and better quality of life 7m28s.
The Green Button Project and On-Demand Data Analysis
- The "Green Button Project" is an example of advancing the practice of medicine, where a simple query system was developed to help clinicians make decisions at the bedside by analyzing similar patient cases 8m1s.
- A bedside consultation service was created to provide written reports with recommendations for patient care, utilizing aggregated data from millions of patients to make better decisions 8m16s.
- Research has shown that medical evidence can be unreliable, with physicians often making decisions without prior published data, highlighting the need for on-demand data analysis 8m47s.
- A project was conducted to analyze data on demand, which was later scaled and shared through a company called Atropos Health, reducing the time it takes to conduct bedside studies from a day or two to under 24 hours 9m36s.
- The use of generative AI has further reduced the time it takes to conduct studies, allowing for on-demand analysis in a matter of minutes 10m9s.
Predictive AI and Cost Savings
- In one example, simple AI was used to predict which patients would become medically costly in the future and to proactively enroll them in management programs, yielding an estimated 10-15% cost savings without sacrificing quality 10m54s.
- AI and machine learning can be used to make various predictions, including operational, biological, and delivery-related predictions, such as predicting no-shows, classifying images, and deciding who to put on an air ambulance 11m53s.
- The AI model provides only a risk estimate; the value comes from the responsive action taken on that estimate, such as early intervention or advance care planning, depending on the specific case 12m28s.
- A three-star logo is used to remember the interplay between the model's risk estimate, the work capacity to follow through, and the action taken, with the goal of achieving net benefit 13m10s.
- About 25 papers by five to six faculty members have studied this interplay, leading to the key insight that the focus should be on what can be achieved given work capacity 13m36s.
- A plot is used to illustrate the relationship between the rank-ordered cases based on the probability of an event happening and the cumulative benefit of taking action, with the goal of determining how far down the list action can be taken before diminishing returns are seen 13m53s.
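The rank-ordered plot described in the last bullet can be sketched in a few lines. This is a minimal illustration, not the speaker's method: the `Case` class, the `benefit_if_event` and `cost_per_action` parameters, and all numbers are hypothetical. It assumes the expected net benefit of acting on a case is the risk times a fixed benefit, minus a fixed cost per action.

```python
from dataclasses import dataclass


@dataclass
class Case:
    patient_id: str
    risk: float  # model's estimated probability of the event


def cumulative_benefit(cases, benefit_if_event, cost_per_action):
    """Rank cases by risk; return the ranked list and the running
    expected net benefit of acting on the top 1, 2, ..., n cases."""
    ranked = sorted(cases, key=lambda c: c.risk, reverse=True)
    totals, running = [], 0.0
    for case in ranked:
        # Expected gain from acting on this case, minus the cost of acting.
        running += case.risk * benefit_if_event - cost_per_action
        totals.append(running)
    return ranked, totals


def best_cutoff(totals, capacity):
    """How far down the list to act: the peak of the curve within capacity."""
    window = totals[:capacity]
    if not window or max(window) <= 0:
        return 0
    return window.index(max(window)) + 1


# Hypothetical risk estimates and benefit/cost units, for illustration only.
cases = [Case("p1", 0.30), Case("p2", 0.90), Case("p3", 0.05),
         Case("p4", 0.60), Case("p5", 0.10)]
ranked, totals = cumulative_benefit(cases, benefit_if_event=10.0, cost_per_action=2.0)
print([c.patient_id for c in ranked])
print(best_cutoff(totals, capacity=5))  # curve peaks after the third case
print(best_cutoff(totals, capacity=2))  # capacity, not the model, is the limit
```

With ample capacity the cutoff sits where the curve peaks; with a capacity of two, the team stops earlier even though a third action would still add benefit, which is the sense in which achievable benefit depends on work capacity.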
Fair, Useful, Reliable Models (FURM)
- The approach is called Fair, Useful, Reliable Models (FURM), a multi-step process that involves usefulness simulations, financial projections, ethical considerations, and prospective evaluation 14m59s.
- The FURM approach is a packaged process developed over 5 to 7 years of work and is used on a routine basis in the campus healthcare system 15m43s.
- The current way of doing things in AI is unsustainable: there are 220 atomic pieces of guidance on how to do good AI, but half of them focus only on building the model, with little emphasis on workflow analysis and implementation 16m22s.
Unsustainable AI Practices and the FURM Approach
- The current organization of medical research is unsustainable. In one example, a model took 10 years and $28 million to be tested and validated at multiple sites, highlighting the need for more efficient processes such as the FURM assessment 16m47s.
- To create fair, useful, and reliable AI in healthcare, a three-step approach can be taken: Discovery (solving for the science), Development (validating the intent), and Dissemination (scaling) 17m43s.
- In the Discovery stage, the process is often too slow and costly, while in the Development stage, the focus should be on achievable benefit and financial sustainability, which may require changes to business models 18m15s.
- The FURM assessment is a tool used to evaluate the development of AI models, with a link to the assessment available at fm.stan.edu 18m49s.
FURM Assessment and Workflow Definition
- The first step in the FURM assessment is to define the workflow, including what actions will be taken and by whom, with an example workflow provided for a classifier that identifies undiagnosed peripheral artery disease 19m8s.
- Clarity on responsible action is key when building AI models, including defining the policy and workflow, and determining the threshold for action 19m42s.
- An ethics assessment is also conducted as part of the FURM assessment, considering factors such as equity, reliability, governance, and autonomy in decision-making 20m6s.
FURM Assessment and Capacity Planning
- The FURM assessment process is used to evaluate the impact of AI projects in healthcare, considering factors such as the number of patients affected, sustainability, and potential ethical problems, with the goal of identifying good projects to pursue 20m46s.
- The assessment process involves analyzing six cases, with the first row of the table showing an example where 1,400 patients are impacted, and the project is deemed sustainable with no ethics problems 20m57s.
- Capacity planning is necessary for responsible AI in healthcare, as launching multiple projects simultaneously can be challenging with limited personnel, and operational engineering work is done to determine the number of concurrent assessments needed to achieve a certain throughput 21m33s.
- Little's law, a basic operations engineering principle, is used to calculate the required team size, indicating that a team that can handle two assessments at the same time is needed to complete at least one assessment per month 22m12s.
- Good governance is essential to ensure that everything needed is done, and a life cycle is established so that the FURM assessment is integrated into the workflow alongside governance, IT support, and standard work 22m35s.
- The governance process involves making decisions, assigning responsibility, and conducting analyses to produce numbers and inform decision-making, with the goal of ensuring that AI projects are fair, useful, and reliable 23m12s.
- The four key components of the process are standard work, IT support, governance, and the FURM assessment, which work together to ensure that AI projects are well-planned and executed 23m34s.
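The Little's law calculation mentioned in the capacity-planning bullets can be written out as a short sketch. Little's law states that the average number of items in progress equals the arrival rate times the average time each item spends in the system. The two-month duration per assessment below is an assumption chosen to reproduce the figure in the summary; the source does not state how long one assessment takes, and the function name is hypothetical.

```python
import math


def concurrent_assessments(throughput_per_month, months_per_assessment):
    """Little's law: items in progress = arrival rate x time in system."""
    return throughput_per_month * months_per_assessment


# Assumption: one assessment takes about two months end to end.
in_progress = concurrent_assessments(throughput_per_month=1.0,
                                     months_per_assessment=2.0)
team_slots = math.ceil(in_progress)  # round up to whole concurrent assessments
print(team_slots)
```

Under that assumption, finishing one assessment per month requires a team able to run two assessments at the same time.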
Language Models and Patient Timelines
- The process was developed to ensure that machine learning projects are fair, useful, and reliable, but the emergence of large language models (LLMs) in 2022 has introduced new challenges and uncertainties 23m49s.
- A language can be viewed as a sequence of tokens from a finite vocabulary, and with this lens, a patient timeline can be seen as a language, consisting of tokens such as ICD codes, CPT codes, and LOINC codes 24m46s.
- There are two ways to build language models: the classical way using natural language, which can be used for chat and summarization, and using patient timelines to forecast what will happen, a unique way of using language models in healthcare 25m17s.
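The idea of a patient timeline as a language can be illustrated with a toy tokenizer. This is only a sketch of the framing, not how any released model encodes data: the event format (ISO date plus a prefixed code), the `<pad>` and `<unk>` special tokens, and the function names are assumptions. The codes themselves (ICD-10 E11.9, LOINC 4548-4, CPT 99213) are used purely as sample vocabulary entries.

```python
def build_vocab(timelines):
    """Assign an integer id to every distinct code seen in the timelines."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for timeline in timelines:
        for _, code in timeline:
            vocab.setdefault(code, len(vocab))
    return vocab


def encode(timeline, vocab):
    """Order events by date and map each code to its token id."""
    ordered = sorted(timeline, key=lambda event: event[0])
    return [vocab.get(code, vocab["<unk>"]) for _, code in ordered]


# One hypothetical patient timeline: (ISO date, coding system / code).
timeline = [
    ("2021-03-01", "CPT/99213"),
    ("2021-01-15", "ICD10/E11.9"),
    ("2021-01-15", "LOINC/4548-4"),
]
vocab = build_vocab([timeline])
tokens = encode(timeline, vocab)
print(tokens)
```

A forecasting model trained on such sequences learns to predict the next tokens in a timeline, in the same way a text model predicts the next word.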
Evaluating Language Models in Healthcare
- A study tested GPT-3.5 and GPT-4 on questions from a bedside service. Agreement with reference answers increased and disagreement decreased from GPT-3.5 to GPT-4, but around 40-50% of the time physicians could not decide whether the answers were correct 26m19s.
- Another study, MedAlign, was conducted to evaluate the alignment of language model outputs with medical needs, and the results showed a 35% error rate in answering medical prompts, even in the best-case setup 27m57s.
- Research is being conducted to train models that can forecast patient outcomes, using a forecasting classifier or predictor, and the results show that the number of positive examples used to train the model affects its performance 28m22s.
- Gradient-boosted models, logistic regression, random forest, and timeline-trained language models were compared on a receiver operating characteristic (ROC) curve, with the timeline-trained language model (dark blue line) consistently showing higher accuracy 28m39s.
- The timeline-trained language model achieved an accuracy of around 78%, outperforming the highest accuracy of classical methods (red, orange, or light blue dots) while using 95% less training data and training eight times faster 29m11s.
CLMBR and MOTOR: Open-Source Language Models
- The models, called CLMBR and MOTOR, have been publicly released and can be found on GitHub 29m50s.
- The focus should be on verifying the benefits of language models and generative AI, as there are many tech companies building models that cost millions of dollars, which academic sites cannot afford 30m7s.
- It is essential to ask hard questions about whether these models actually work as advertised, and there is a need to develop a worldview for generative AI 30m36s.
- The development and dissemination of this worldview are uncertain, and there is a need to focus on verifying benefits, as seen in conflicting articles about whether AI is better than doctors 30m50s.
Building Fair, Useful, and Reliable Models
- To build fair, useful, and reliable models, data engineering and data science work must be done collaboratively, with data engineers and data scientists working side by side 31m56s.
- The data science team should have more data engineers than data scientists, and they should work together to extract, clean, and make decisions about the data, as these decisions will affect the kind of science that can be done 32m7s.
Efficiency in AI Model Development
- The time it takes to develop and refine AI models in healthcare decreases with each replication, as the team's maturity and established platforms and procedures contribute to increased efficiency and reliability 32m45s.
- The first end-to-end development of an AI model in healthcare took around 5-7 years, but subsequent replications took significantly less time, with the goal of reducing the time by 50% with each replication 33m14s.
Verification of EHR Data Accuracy
- Electronic Health Record (EHR) data is inherently noisy and prone to errors, so it's essential to verify conclusions by looking for multiple lines of corroboration, such as independent pieces of information confirming a diagnosis 34m2s.
- To establish the accuracy of EHR data, it's crucial to look beyond a single code or piece of information and instead consider multiple factors, such as lab results, medications, and other relevant data points 34m18s.
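Requiring multiple lines of corroboration can be sketched as a simple rule: accept a diagnosis only when at least two independent kinds of evidence agree. The example below uses type 2 diabetes with a diagnosis code, a lab value, and a medication. The record layout, field names, and the two-source threshold are assumptions for illustration, and this is not a validated phenotype definition.

```python
def supporting_sources(record):
    """Return the independent kinds of evidence that support type 2 diabetes."""
    sources = set()
    # ICD-10 codes beginning with E11 denote type 2 diabetes mellitus.
    if any(code.startswith("E11") for code in record.get("diagnosis_codes", [])):
        sources.add("diagnosis_code")
    # An HbA1c of 6.5% or higher is in the diabetic range.
    if any(value >= 6.5 for value in record.get("hba1c_percent", [])):
        sources.add("lab")
    if any(drug.lower() == "metformin" for drug in record.get("medications", [])):
        sources.add("medication")
    return sources


def is_corroborated(record, min_sources=2):
    """True when enough independent sources agree on the diagnosis."""
    return len(supporting_sources(record)) >= min_sources


# Hypothetical records.
code_only = {"diagnosis_codes": ["E11.9"]}
confirmed = {"diagnosis_codes": ["E11.9"], "hba1c_percent": [7.2],
             "medications": ["Metformin"]}
print(is_corroborated(code_only), is_corroborated(confirmed))
```

A single code on its own, which may be a billing artifact or a rule-out entry, does not pass; the same code backed by a lab result and a prescription does.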
Stanford's Approach to LLMs
- Stanford is not training its own Large Language Model (LLM) for medical analysis, instead opting to use an open-source model with no copyright issues and fine-tuning it for their specific needs 34m37s.
- Training an LLM from scratch is expensive and not considered an efficient approach, especially given the rapid pace of advancements in the field 35m1s.
Promising Applications of Machine Learning in Medicine
- The most promising applications of machine learning in medicine include both clinical and non-clinical areas, such as operational applications like transcription and billing, as well as clinical applications like disease diagnosis and management 35m10s.
- The choice of application and prioritization of AI development in healthcare depends on the specific environment and structural issues of the healthcare system, with different priorities in resource-rich versus resource-poor settings 35m31s.
- In resource-poor settings, AI may be used for clinical care, such as retinal scanning algorithms for diabetic patients, due to the lack of alternative options and the potential for AI to provide better-than-no-care solutions 36m27s.
- The development and implementation of AI in healthcare must consider the local context and prioritize applications that address the most pressing needs and challenges in that specific environment 36m50s.
Real-world Applications and Generative AI in Behavioral Health
- Academic work has been successfully applied to real-life problems in the healthcare industry, such as predicting mortality to improve advance care planning, which has improved the care of over 6,000 patients at Stanford Health Care.
- Generative AI has potential applications in behavioral health, but there are concerns about its reliability, as seen in incidents like Gemini providing harmful advice to a high schooler, so it is uncertain whether this is the best use of the technology today.
Patient Involvement and Stakeholder Collaboration
- Patient involvement in implementing AI models, such as the FEMR model, is ensured through a patient family advisory council, which reviews the workflow and actions taken by the algorithm to ensure patient comfort and consent.
- Multiple stakeholders, including clinicians, patients, administrators, and developers, should be involved in the development and deployment of AI algorithms to ensure their appropriateness and effectiveness.
Machine Learning in Laboratories
- Machine learning has practical applications in laboratories, such as in histology, cytology, or flow cytometry, with examples including deep neural nets that can assist pathologists in reading slides and identifying areas of interest.
- AI is widely used in pathology, ranging from cell sorters that use lasers and calculations to count white blood cells, to more advanced tools that help pathologists read slides and augment their work.
Data Gaps in EHR Systems
- The most critical data gap in current Electronic Health Record (EHR) systems is the sheer number of systems: a health system often runs 500 to 1,000 of them, including an EHR such as Epic or Cerner plus specialized systems for various departments, so medical data ends up scattered across hundreds of systems 41m7s.
- It is a myth that all medical data is in the EHR; the opportunity lies in combining this data and putting it in one place 41m46s.
Addressing Bias in Predictions
- When addressing bias in predictions made by models, it is essential to distinguish between two interpretations of bias: systematic differences in model performance for people belonging to different subgroups, and systematic differences in the actual benefit or reward resulting from the model's output 42m9s.
- The latter type of bias is more concerning, as it can result in unequal benefits for people belonging to different subgroups, and addressing this requires focusing on policies and workflows driven by the model's output rather than just removing model-side differences 42m48s.
- Algorithmic fixes may not always be effective in addressing bias, and it is crucial to consider other factors that can impact the actual benefit or reward, such as policies and workflows 43m16s.
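The two interpretations of bias can be measured separately. The sketch below computes, per subgroup, the model's sensitivity (a model-side measure) and the share of true cases that actually received the follow-up action (a benefit-side measure). The row fields `group`, `outcome`, `flagged`, and `acted`, and the data, are hypothetical.

```python
from collections import defaultdict


def subgroup_rates(rows):
    """Per subgroup: model sensitivity and share of true cases acted upon."""
    counts = defaultdict(lambda: {"pos": 0, "flagged": 0, "acted": 0})
    for row in rows:
        if not row["outcome"]:
            continue  # both rates are defined over patients with the outcome
        c = counts[row["group"]]
        c["pos"] += 1
        c["flagged"] += int(row["flagged"])
        c["acted"] += int(row["flagged"] and row["acted"])
    return {
        group: {
            "sensitivity": c["flagged"] / c["pos"],
            "benefit_rate": c["acted"] / c["pos"],
        }
        for group, c in counts.items()
    }


def make_rows(group, flagged_acted):
    return [{"group": group, "outcome": True, "flagged": f, "acted": a}
            for f, a in flagged_acted]


# Hypothetical data: the model flags both groups equally well, but the
# workflow follows through far less often for group B.
rows = (make_rows("A", [(True, True), (True, True), (True, True), (False, False)])
        + make_rows("B", [(True, True), (True, False), (True, False), (False, False)]))
rates = subgroup_rates(rows)
print(rates)
```

Here sensitivity is identical across groups while the realized benefit differs, a gap that no model-side adjustment would close because it arises in the policy and workflow.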
Traceability, Explainability, and Trust
- Traceability and explainability requirements can slow down the progress of AI model advancements, but it depends on the purpose of these requirements, which can include debugging, mitigating outcomes, or establishing trust 43m50s.
- Different scenarios require different types of interpretability, including engineers' interpretability for debugging, causal interpretability for mitigating outcomes, and transparency for establishing trust 44m25s.
- Providing the wrong type of explanation can be counterproductive, and it is essential to understand the purpose of the request for explainability or interpretability to provide the most effective response 45m10s.
- Establishing trust in AI models can be done through prospective studies, similar to how trust is established in medications, even if the exact mechanism of action is not fully understood 45m20s.
AI Assistance and Medical Errors
- Studies have compared error rates in medical diagnosis and treatment with and without AI assistance, with one recent study by Jonathan Chen finding that doctors sometimes make mistakes when using AI, possibly due to suboptimal use 45m59s.
- The study by Jonathan Chen involved giving case vignettes to physicians with and without access to AI, and found that AI alone performed better than doctors with AI assistance 46m6s.
- More research is needed to understand how AI can be effectively used in practice, rather than just debating whether it should be used 46m56s.
EHRs as Raw Material for AI Training
- Electronic Health Records (EHRs) can be considered one of the raw materials for teaching AI, but should not be the only source, and should be supplemented with information from textbooks and online sources 47m15s.
Medical Imaging AI and Resources
- Medical imaging AI has many practical uses, with the FDA approving around 1,000 image-based models, mostly in radiology and cardiology 48m4s.
- The Stanford Center for Health Education offers resources for learning more about machine learning and medicine, including a program and online content 49m1s.







