Dr. Nigam Shah's Background and Expertise
- Dr. Nigam Shah is a professor of medicine at Stanford University and the chief data scientist for Stanford Health Care; his research focuses on integrating AI into clinical use safely and effectively 14s.
- Dr. Shah has an extensive background: he is an inventor on patents, has authored over 300 scientific publications, and has co-founded three companies 27s.
- He was inducted into the American College of Medical Informatics in 2015 and the American Society for Clinical Investigation in 2016 37s.
- Dr. Shah holds an MBBS from Baroda Medical College in India and a PhD from Penn State University, and completed post-doctoral training at Stanford University 46s.
Data Quality and Patient Timelines in AI/ML Models
- The quality of AI and machine learning models is heavily dependent on the quality of the data they are trained on, with data being collected from patient timelines 1m33s.
- Patient timelines can include various data types, such as ECGs, blood pressure, respiratory rate, and medication orders, which are used to build models 2m7s.
- In a typical healthcare setting, not all data is collected at every point in time, and there is often a lack of longitudinal coverage for individual patients 2m46s.
- The manipulation and processing of patient timeline data have a significant impact on the performance of AI and machine learning models 3m22s.
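To make the idea of a sparse patient timeline concrete, here is a minimal sketch. The `Event` class, the `latest_before` helper, and the event names are illustrative assumptions, not the representation used in the talk. It shows that a query at a given point in time may find no measurement of a given type at all.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One timestamped observation on a patient timeline (hypothetical schema)."""
    time: datetime
    kind: str      # e.g. "blood_pressure_systolic", "medication_order"
    value: object


def latest_before(timeline, kind, when):
    """Return the most recent event of `kind` at or before `when`, or None."""
    candidates = [e for e in timeline if e.kind == kind and e.time <= when]
    return max(candidates, key=lambda e: e.time) if candidates else None


# Illustrative data only: not every data type is recorded at every visit.
timeline = [
    Event(datetime(2024, 1, 5, 9, 0), "blood_pressure_systolic", 142),
    Event(datetime(2024, 1, 5, 9, 5), "respiratory_rate", 18),
    Event(datetime(2024, 3, 2, 14, 0), "medication_order", "lisinopril"),
]

index_time = datetime(2024, 2, 1)
bp = latest_before(timeline, "blood_pressure_systolic", index_time)
ecg = latest_before(timeline, "ecg", index_time)  # no ECG was ever recorded
```

How a missing value such as `ecg` is handled (dropped, imputed, or encoded as "not measured") is one of the processing choices that can change model performance.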
AI/ML in Healthcare Decision-Making: Classification, Prediction, and Recommendation
- With large amounts of patient timeline data, models can be built to support decision-making in healthcare, including whether to treat a patient and how to treat them 4m1s.
- The decision of whether to treat a patient can be broken down into classification or diagnosis tasks, or prediction tasks, from a computational standpoint 4m18s.
- In everyday usage, the terms prognosis, prediction, and classification are often conflated. Classification is not the same as prediction, and many things in medicine that masquerade as predictions are actually classifications 4m29s.
- A classification is when a model identifies something that already exists, such as diagnosing pneumonia or identifying a dog in an image, whereas a prediction would imply that the model is forecasting a future event 4m46s.
- The distinction between classification and prediction is crucial, as it affects how we approach treatment and prevention, and it's essential to understand that many models, such as sepsis predictors, are actually classifiers and not predictors 5m13s.
- Recommendation is the hardest task, given the data's limitations and biases, and it's been a 40-year journey in medicine to figure out reliable recommendation systems 5m48s.
- There are three things that can be done with AI and machine learning in medicine: classification, prediction, and recommendation, and it's essential to consider whether these technical exercises are advancing the science of medicine, the practice of medicine, or the delivery of medical care 6m13s.
- An example of advancing the science of medicine is the discovery of three subtypes of heart failure with preserved ejection fraction, which would be a classification that advances our scientific understanding of the disease 6m42s.
- An example of advancing the practice of medicine would be developing a test that can identify the subtype of heart failure and provide a treatment plan accordingly, which would improve patient outcomes and reduce costs 6m58s.
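One way to picture the classification versus prediction distinction is by asking when the labeled outcome occurs relative to the last data the model sees. The sketch below is an illustrative assumption, not a definition given in the talk; the function name `task_type` and its arguments are hypothetical.

```python
from datetime import datetime


def task_type(index_time, outcome_time):
    """Label a modeling task by outcome timing relative to the input cutoff.

    index_time:   the latest timestamp of any data the model is given
    outcome_time: when the labeled condition or event actually occurs
    """
    # If the condition already exists at the cutoff, the model is recognizing
    # something present (classification), not forecasting it (prediction).
    return "classification" if outcome_time <= index_time else "prediction"


# Pneumonia visible on today's chest X-ray: the condition already exists.
xray_task = task_type(datetime(2024, 5, 1), datetime(2024, 5, 1))

# Hospital readmission within the next 30 days: the event is in the future.
readmit_task = task_type(datetime(2024, 5, 1), datetime(2024, 5, 20))
```

Under this framing, a "sepsis predictor" that fires after sepsis has begun is recognizing an existing state, which is why the talk calls such models classifiers.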
Advancing Medical Science, Practice, and Delivery with AI
- An example of advancing the delivery of medical care is the "Green Button Project," which aimed to query similar patients' data to inform treatment decisions at the bedside, and it has been successful in improving patient outcomes and reducing costs 8m1s.
- A bedside consultation service was developed to provide written reports with recommendations for patient care by analyzing timeline objects from millions of other patients, which helped make better decisions than would have been made otherwise 8m14s.
- A study found that about 80% of the time, physicians did not have prior published data to inform their decisions, and less than 3% of the decisions had a study specific to the question at hand that the physician knew 9m4s.
- The lack of access to prior published data highlights the need to analyze data on demand, which was achieved through a prior project that was later spun out into a company called Atropos Health 9m36s.
- Atropos Health reduced the time it took to conduct bedside studies from a day or two to under 24 hours, and sometimes even a few hours, and with the advent of generative AI, studies can now be done in a few minutes 10m3s.
AI-Driven Healthcare Advancements and Cost Savings
- The use of different technologies such as machine learning, chatbots, and generative AI enables rapid decision-making and can improve patient care 10m37s.
- A relatively simple AI model was used to predict which patients would become medically costly the next year, and proactive action was taken to enroll them in care programs, resulting in an estimated 10-15% cost savings without sacrificing quality 10m54s.
- The use of AI and machine learning can drive advancements in healthcare delivery, including predicting cost, biology, practice, and delivery outcomes, such as no-shows, patient transport, and image classification 11m42s.
- A consistent theme in AI applications is that the AI provides a risk estimate, and the value comes from taking action based on that estimate, such as early intervention in cost blooms or advanced care planning in mortality prediction 12m34s.
- The model stratifies by risk value, and the actual intervention necessary may vary, such as providing transportation support in the case of predicting no-shows 12m51s.
- A three-star logo is used to remember the interplay between the model, work capacity, and the action taken, with the yellow box representing computer science and statistics, the green box representing the number from the model, and the red box representing the action taken 13m10s.
- The interplay between the model, work capacity, and action taken is studied, with about five or six faculty members working on it and publishing around 25 papers on the topic 13m36s.
- The key insight from this research is that the focus should be on what can be achieved given the work capacity, rather than just building a model 13m42s.
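The interplay between a model's risk estimate and work capacity can be sketched as follows. This is a minimal illustration under assumed numbers; `select_for_action`, the patient identifiers, and the risk values are all hypothetical.

```python
def select_for_action(risk_by_patient, capacity):
    """Return the patient ids the team can act on, highest risk first.

    capacity is the number of patients the care team can actually reach,
    which caps the achievable benefit regardless of model quality.
    """
    ranked = sorted(risk_by_patient.items(), key=lambda kv: kv[1], reverse=True)
    return [patient_id for patient_id, _ in ranked[:capacity]]


# Illustrative risk estimates from some model (made-up values).
risks = {"p1": 0.12, "p2": 0.81, "p3": 0.47, "p4": 0.66}

chosen = select_for_action(risks, capacity=2)

# If the risk estimates are well calibrated, their sum approximates the
# expected number of true cases among the patients acted upon.
expected_true_cases = sum(risks[p] for p in chosen)
```

With a capacity of two, patient `p3` is never reached even though the model ranks them above `p1`, so a better model only helps to the extent that it changes who lands inside the capacity limit.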
The FURM Approach and Responsible AI Development
- A plot is used to illustrate the cumulative benefit of taking action on cases ranked by probability, with the goal of determining how far down the list action can be taken before diminishing returns are seen 13m56s.
- The approach used is called Fair, Useful, and Reliable Models (FURM), a multi-step process that includes usefulness simulations, financial projections, ethical considerations, and prospective evaluation 14m59s.
- The FURM approach is used routinely in the healthcare system. It was developed because the current way of doing things is unsustainable: there are 220 atomic pieces of guidance on how to do good AI, but only half of them focus on how to build a model 16m23s.
- A model was developed over 10 years at a cost of $28 million to determine who should receive immediate attention in the emergency department (ED) and who can wait for registration, highlighting the unsustainable nature of current medical research practices 16m45s.
- The current organization of work in medical research is unsustainable, which is why processes like the FURM assessment are needed to make healthcare activities more responsible and sustainable 17m20s.
- A three-step approach is proposed for responsible and sustainable innovation in healthcare: Discovery (solving for the science), Development (validating the intent), and Dissemination (scaling) 17m43s.
- For standard AI, the Discovery stage is too slow and costly, and the Development stage needs to focus on achievable benefits and financial sustainability, which may require changing business models 18m12s.
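The cumulative-benefit plot described at the start of this section can be approximated with a short calculation. The sketch below is not the usefulness simulation from the talk: it assumes a fixed benefit per true case and a fixed cost per action, both made-up values, and the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_net_benefit(probs, benefit_if_positive, cost_per_action):
    """Running total of expected net benefit when acting down a ranked list."""
    ranked = sorted(probs, reverse=True)
    # Expected gain of acting on one case: chance it is a true case times
    # the benefit of acting, minus what the action costs either way.
    gains = [p * benefit_if_positive - cost_per_action for p in ranked]
    return list(accumulate(gains))


# Illustrative probabilities and economics (assumptions, not source data).
probs = [0.9, 0.2, 0.6, 0.05, 0.4]
curve = cumulative_net_benefit(probs, benefit_if_positive=10.0, cost_per_action=3.0)

# The rank at which the curve peaks is how far down the list to act.
best_k = max(range(len(curve)), key=curve.__getitem__) + 1
```

In this toy example the curve rises for the first three cases and falls afterwards, so acting on more than three cases gives diminishing and then negative returns.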
FURM Assessment for Responsible AI Deployment
- The FURM assessment is a tool used to evaluate the responsible development and deployment of AI in healthcare; a blog post and a website (fm.stan.edu) are given for more information 18m49s.
- The FURM assessment involves evaluating the workflow, including the steps involved in using a classifier to identify patients with undiagnosed diseases, and ensuring clarity on responsible action 19m8s.
- The assessment also includes an ethics evaluation, considering factors such as equity, reliability, governance, autonomy, and decision-making processes 20m6s.
- The FURM assessment process evaluates the impact of algorithms on patients, considering factors such as the number of people affected, sustainability, and potential ethical problems, to ensure that projects are fair, useful, and reliable 20m46s.
- The assessment process involves analyzing six cases, with the goal of identifying projects that can benefit a large number of people without causing harm to certain subgroups 20m57s.
- Capacity planning is necessary to ensure that responsible AI can be implemented in a healthcare system, and operational engineering work is done to determine the number of concurrent assessments needed to achieve a certain goal 21m36s.
- Little's law is used to calculate the required team size, which indicates that a team that can handle two assessments at the same time is needed to complete at least one assessment per month 22m12s.
- Good governance is essential to ensure that everything that needs to be done is actually done, and a life cycle is established to make sure that the FURM assessment is integrated into the workflow 22m35s.
- The workflow consists of four key components: standard work, IT support, governance, and the FURM assessment, which are necessary to make informed decisions about AI projects 23m8s.
- The FURM assessment process produces numbers and analyses that help the governance body make decisions about AI projects, considering factors such as patients affected, sustainability, and potential harm to certain subgroups 23m20s.
- The goal of the FURM assessment process is to ensure that AI projects are fair, useful, and reliable and do not cause harm to certain subgroups, and to make informed decisions about which projects to pursue 23m49s.
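The capacity calculation mentioned above can be reproduced with Little's law, L = λ × W, where L is the average number of items in progress, λ is the completion rate, and W is the time each item spends in the system. The notes give the completion rate (one assessment per month) and the result (two concurrent assessments); the two-month duration per assessment used below is an assumption chosen to be consistent with those figures, and the function name is hypothetical.

```python
import math


def concurrent_capacity_needed(completions_per_month, months_per_assessment):
    """Little's law: items in progress = throughput x time in system."""
    return math.ceil(completions_per_month * months_per_assessment)


# Assumed: each assessment takes about two months from start to finish.
team_slots = concurrent_capacity_needed(
    completions_per_month=1.0,
    months_per_assessment=2.0,
)
```

Under these assumptions the team must be able to run two assessments at once. If each assessment took three months instead, the same target would need three concurrent slots.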
Large Language Models (LLMs) and Their Application in Healthcare
- The introduction of Large Language Models (LLMs) in 2022 has changed the landscape and raised new questions about the FURM assessment process 24m2s.
- A language, in the context of computers, is a sequence of tokens from a finite vocabulary, such as a dictionary, and can include natural languages like English, Spanish, and Gujarati, as well as the "EHR language" used in electronic health records (EHRs) consisting of codes like ICD, CPT, and LOINC codes 24m24s.
- The EHR language is a sequence of tokens representing events and actions in a patient's timeline, such as visits, prescriptions, and diagnoses, which can be used to build language models for forecasting and generative AI in healthcare 24m55s.
- There are two ways to build language models: the classical approach using natural language text and documents, and the use of patient timelines to forecast future events and outcomes 25m17s.
- A study was conducted to evaluate the effectiveness of language models, specifically GPT 3.5 and GPT 4, in answering questions from a bedside service, with results showing an increase in agreement and a decrease in disagreement and uncertainty among 12 physicians 26m19s.
- However, the study also found that around 40-50% of the time, physicians couldn't decide whether the language model's answers were correct or not, limiting its utility 26m53s.
- Another project, MedAlign, aimed to assess the alignment of language model outputs with medical needs, with results showing a 35% error rate in answering medical prompts, even in the best-case scenario 27m57s.
- Research is being conducted to train models that can forecast patient outcomes, with a focus on using positive examples to develop forecasting classifiers 28m19s.
- A receiver operating characteristic (ROC) curve was used to compare the performance of different models, including logistic regression, random forest, and a timeline-trained language model, with the language model showing the highest accuracy 28m35s.
- The language model achieved an accuracy of around 78%, which is better than the highest accuracy achieved by classical methods, and it trains eight times faster and uses 95% less training data 29m16s.
- The models, called CLMBR and MOTOR, have been publicly released and can be found on GitHub 29m53s.
- The focus should be on verifying the benefits of language models and generative AI, rather than just building them, as many tech companies are doing 30m9s.
- Academic sites have a crucial role in asking hard questions about the effectiveness of these models, despite not having the same resources as tech companies 30m25s.
- The development and dissemination of a worldview for generative AI is needed, but it is unclear how this will scale 30m38s.
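The "EHR language" idea, a patient timeline as a sequence of tokens from a finite vocabulary, can be sketched as below. This is a toy illustration, not the tokenization used by the released models; the code strings, the `<unk>` convention, and all function names are assumptions made for the example.

```python
def build_vocab(timelines):
    """Assign an integer id to every distinct code seen in the timelines."""
    vocab = {"<unk>": 0}  # id 0 is reserved for codes never seen before
    for timeline in timelines:
        for code in timeline:
            vocab.setdefault(code, len(vocab))
    return vocab


def encode(timeline, vocab):
    """Turn a time-ordered list of codes into a list of token ids."""
    return [vocab.get(code, vocab["<unk>"]) for code in timeline]


def next_token_pairs(token_ids):
    """(current, next) pairs: the training signal for forecasting."""
    return list(zip(token_ids[:-1], token_ids[1:]))


# Illustrative, time-ordered code sequences for two patients.
training_timelines = [
    ["CPT/99213", "ICD10/E11.9", "LOINC/4548-4"],
    ["CPT/99213", "ICD10/I10"],
]
vocab = build_vocab(training_timelines)

new_patient = ["CPT/99213", "ICD10/E11.9", "RXNORM/6809"]
ids = encode(new_patient, vocab)
pairs = next_token_pairs(ids)
```

A language model trained on such sequences learns to forecast the next event on a timeline in the same way a text model learns to forecast the next word.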
Collaboration between Data Engineers and Data Scientists
- Building fair, useful, and reliable models requires collaboration between data engineers and data scientists, who should work together as part of the same team 31m56s.
- Data engineering and data science are two functions that need to work together, with data engineers playing a crucial role in extracting, cleaning, and preparing data for analysis 32m4s.
- The arrangement between data engineers and data scientists should be collaborative, with decisions made during data cleanup and extraction affecting the kind of science that can be done 32m28s.
- The time it takes to analyze or model data and come up with something presentable depends on the team's maturity, with the first time taking significantly longer and subsequent replications becoming faster, cheaper, and more reliable 32m44s.
- The first end-to-end project took around five to seven years to complete, while the next project took a year and a half, and the third project took four months, with the goal of reducing the time by 50% with each replication 33m14s.
Data Integrity and Model Development in Healthcare
- Electronic Health Record (EHR) data is noisy and contains errors, so it's essential to look for multiple lines of corroboration to confirm any conclusion, such as verifying a diabetes diagnosis with HbA1c levels and medication data 34m2s.
- Stanford is not training its own Large Language Model (LLM) for medical analysis, instead opting to use an open-source model with no copyright issues and fine-tuning it, as training from scratch is expensive 34m37s.
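The diabetes example above, confirming a diagnosis from several independent lines of evidence, might look like the following sketch. The rule of requiring at least two lines and the function signature are assumptions for illustration. The 6.5% HbA1c cutoff is the commonly used diagnostic threshold, but any real phenotype definition should be validated against chart review.

```python
def corroborated_diabetes(has_dx_code, max_hba1c_percent,
                          on_glucose_lowering_med, min_lines=2):
    """Return True when enough independent evidence lines agree.

    has_dx_code:             a diabetes diagnosis code appears in the record
    max_hba1c_percent:       highest HbA1c lab value, or None if never measured
    on_glucose_lowering_med: a glucose-lowering medication was ordered
    """
    lines_of_evidence = [
        has_dx_code,
        max_hba1c_percent is not None and max_hba1c_percent >= 6.5,
        on_glucose_lowering_med,
    ]
    return sum(lines_of_evidence) >= min_lines


# A lone diagnosis code with a normal lab and no medication is not enough.
code_only = corroborated_diabetes(True, 5.4, False)

# A code plus an elevated HbA1c corroborate each other.
code_and_lab = corroborated_diabetes(True, 7.2, False)
```

Requiring agreement across sources guards against a single miscoded visit or a rule-out diagnosis being treated as a confirmed condition.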
Applications of AI in Medicine: Clinical vs. Non-Clinical
- The most promising applications of machine learning in medicine include operational applications such as transcription, responding to messages, and billing, which tend to have higher adoption rates in academic centers 36m7s.
- In resource-constrained settings, AI may be used for clinical care, such as retinal scanning algorithms for diabetic patients, as the alternative may be no care at all 36m27s.
- The choice of whether to use AI for clinical or non-clinical applications depends on the structural issues of the healthcare system and the environment in which the work is being done 35m29s.
- The circumstances in which technology is deployed can significantly impact its effectiveness, and there are many examples of academic work being applied to solve real-life problems in the healthcare industry 36m52s.
- One such example is the work done by Stanford Healthcare's HMS and HS Davies on predicting mortality for improving Advanced Care planning, which has improved the care of over 6,000 patients 37m12s.
Generative AI, Patient Involvement, and Ethical Considerations
- Generative AI may be applicable in Behavioral Health, but there are concerns about its potential for errors, such as the incident where Gemini told a high schooler to harm themselves while giving homework advice 37m42s.
- Patient involvement is crucial in implementing the FURM approach, and this can be achieved through a patient family advisory council, which reviews the workflow and ensures that patients are comfortable with algorithm-driven care 38m28s.
- Multiple stakeholders, including clinicians, patients, administrators, and developers, should be involved in the development and implementation of algorithms to ensure that they are effective and unbiased 39m23s.
AI Applications in Medical Imaging and Pathology
- Machine learning has practical applications in laboratories, such as in histology, cytology, and flow cytometry, with examples including the use of deep neural nets to sort and analyze cells 39m35s.
- The Pathology Department at Stanford has developed a system called Nuclei, which uses a deep neural net to help pathologists read slides and identify areas of interest 39m50s.
- Other examples of AI in pathology include cell sorters, which use lasers and calculations to analyze cells, and have become a widely used tool in healthcare systems 40m33s.
Challenges and Opportunities with EHR Systems
- The biggest issue with current EHR systems is that there are too many of them: a typical hospital runs anywhere from 500 to 1,000 systems. It is therefore a myth that all medical data is in the EHR; it is scattered over hundreds of systems 41m19s.
- The biggest gap or opportunity is to combine all the data in one place, as the current system makes it difficult to access and utilize the data effectively 41m55s.
Addressing Bias in AI Models and Healthcare Workflows
- When it comes to removing bias in predictions made by models, there are two interpretations of bias: a systematic difference in the model's performance for people belonging to different subgroups, and a systematic difference in the actual benefit or reward that people receive as a result of the model's output 42m12s.
- The latter interpretation is the one to worry about, as it is dependent on policies and workflows driven by the model's output, rather than just the model itself 42m51s.
- To address this issue, it is suggested to focus on policies and workflows rather than only on removing model-side differences, and to consult the Human-Centered AI Institute's blog post on when algorithmic fairness fixes fail 43m25s.
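The two interpretations of bias can be measured separately, as in the sketch below. The record fields, group labels, and data are made up for illustration. The example is constructed so that the model performs identically in both subgroups while the benefit actually received differs.

```python
from collections import defaultdict


def rate_by_group(records, numerator, denominator):
    """Per-group fraction of denominator-eligible records meeting numerator."""
    counts = defaultdict(lambda: [0, 0])  # group -> [hits, eligible]
    for r in records:
        if denominator(r):
            counts[r["group"]][1] += 1
            if numerator(r):
                counts[r["group"]][0] += 1
    return {g: hits / eligible for g, (hits, eligible) in counts.items()}


# Illustrative records: every patient here truly has the condition.
records = [
    {"group": "A", "has_condition": True, "flagged": True, "received_help": True},
    {"group": "A", "has_condition": True, "flagged": False, "received_help": False},
    {"group": "B", "has_condition": True, "flagged": True, "received_help": False},
    {"group": "B", "has_condition": True, "flagged": False, "received_help": False},
]

# Interpretation 1: model performance (sensitivity) per subgroup.
sensitivity = rate_by_group(
    records, lambda r: r["flagged"], lambda r: r["has_condition"])

# Interpretation 2: benefit actually received per subgroup.
benefit = rate_by_group(
    records, lambda r: r["received_help"], lambda r: r["has_condition"])
```

Here the sensitivity is equal across groups, yet group B receives no benefit because the downstream workflow never acts on its flagged patient, which is a gap that no model-side fix would close.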
Traceability, Explainability, and Trust in AI Models
- Traceability and explainability requirements can slow down the progress of AI model advancements, but it depends on the purpose of the explainability, which can be for debugging, mitigating outcomes, or establishing trust 43m55s.
- The need for explainability, interpretability, or transparency can be broken down into different scenarios, each with different requirements, and it is essential to understand the purpose of the request to provide the appropriate explanation 44m53s.
- Establishing trust in AI models for medical diagnosis and treatment requires prospective studies to test their effectiveness, similar to how trust is established in conventional drugs like Tylenol, even when their exact mechanisms of action are not fully understood 45m18s.
- Studies have been conducted to compare error rates in medical diagnosis and treatment with and without AI assistance, such as a study by Jonathan Chen, which found that doctors sometimes make mistakes when using AI, possibly due to suboptimal use of the technology 45m59s.
- The study by Jonathan Chen involved giving case vignettes to physicians with and without access to AI, and the results showed that AI alone performed better than doctors with AI assistance, highlighting the need for more research on how to effectively use AI in medical practice 46m4s.
Data Sources for Training AI in Healthcare
- Electronic Health Records (EHRs) can be considered one of the raw materials to teach AI, but they should not be the only source, as sanitized versions of online sources and textbooks are also necessary 47m7s.
- Many practical medical imaging AI models are in use: the FDA has approved around 992 AI-related medical devices, about half of which are image-based, primarily in radiology and cardiology 47m44s.








