Introduction to Human-Centered AI in Digital Health
- Human-centered AI for digital health uses data streams from wearables, mobile devices, and other sources to make predictions with machine learning models, focusing on recurring health events that are amenable to intervention 10s.
- The goal of predicting these health events is to create a just-in-time digital intervention, which requires collaborating with behavior scientists to deliver interventions at the right moment, such as in the case of substance use and craving 2m6s.
- Research is being conducted in various areas, including predicting blood pressure spikes using Fitbit and stress data, as well as predicting stress using blood pressure and Fitbit data, with the aim of identifying intervenable contexts 4m30s.
- The key to making this work is personalized machine learning, which involves training a separate AI model per person, allowing for repeated predictions and interventions within the same person 6m40s.
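As a rough illustration of what "one model per person" implies for data handling, the sketch below groups records by person and splits each person's data chronologically, so that evaluation happens on data recorded after the training data. The record layout `(person_id, timestamp, features, label)`, the function name `split_per_person`, and the `train_fraction` parameter are assumptions made for this example, not details from the talk.

```python
from collections import defaultdict


def split_per_person(records, train_fraction=0.7):
    """Group records by person and split each person's data chronologically.

    `records` is an iterable of (person_id, timestamp, features, label) tuples
    (a hypothetical layout). Returns {person_id: (train_records, test_records)}.
    """
    by_person = defaultdict(list)
    for rec in records:
        by_person[rec[0]].append(rec)

    splits = {}
    for person_id, recs in by_person.items():
        recs.sort(key=lambda r: r[1])          # order by timestamp
        cut = int(len(recs) * train_fraction)  # earlier part trains, later part tests
        splits[person_id] = (recs[:cut], recs[cut:])
    return splits


# Toy records for two people; values are made up for illustration.
records = [
    ("p1", 3, [0.2], 0), ("p1", 1, [0.9], 1), ("p1", 2, [0.4], 0), ("p1", 4, [0.8], 1),
    ("p2", 2, [0.1], 0), ("p2", 1, [0.7], 1),
]
splits = split_per_person(records, train_fraction=0.5)
```

Each person's training set could then be used to fit that person's own model, with the later records held out to mimic deployment.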
Data Sources and Personalized Machine Learning
- This approach differs from traditional AI for healthcare, where a single model is used for everyone, and instead uses individualized models to make predictions and intervene in real-time, using data sets that are specific to each person 8m10s.
- The data sets used in this approach are unique in that they are divided into training and test sets that occur at different times for the same person, allowing for real-world deployment and evaluation of the interventions 9m40s.
- Researchers, such as Dr. Nungwin at UCSF, are working on developing efficacious interventions, and the role of AI is to deliver these interventions at the right time to maximize adherence and efficacy 3m20s.
- Personalized models are constrained by the small amount of labeled data available per person. Personalized self-supervised learning can address this; a familiar example of the underlying idea is ChatGPT, a fine-tuned version of the GPT foundation model 10s.
- GPT was created by training a model to predict the next word in a sequence, using a large amount of text from the internet. This removed the traditional bottleneck of obtaining labeled data; the only limit was the amount of text that could be processed 2m6s.
- The same approach can be applied to wearable signals: a model is trained to predict a missing portion of a biosignal from the surrounding portions. This can be done in a personalized manner, using data from an individual's wearable device, without any labeled data 4m42s.
- Simply by wearing a device, an individual generates a large amount of unlabeled data, which can be used to train a personalized foundation model. That model can then be fine-tuned to predict challenging health outcomes, such as stress or blood pressure spikes, and the approach can be made multimodal by using multiple biosignals 6m15s.
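The masked-prediction idea described above can be sketched as a data-preparation step: cut an unlabeled signal into windows, hide the middle of each window, and keep the hidden samples as the reconstruction target. This is a minimal sketch under stated assumptions; the function name `masked_pairs`, the window and mask sizes, and the heart-rate values are all hypothetical, and no model is trained here.

```python
def masked_pairs(signal, window=8, mask_len=2):
    """Build (context, target) pairs from an unlabeled 1-D signal.

    Each non-overlapping window has its middle `mask_len` samples hidden;
    the hidden samples become the target a model would learn to reconstruct
    from the remaining context samples.
    """
    start_mask = (window - mask_len) // 2
    pairs = []
    for i in range(0, len(signal) - window + 1, window):
        chunk = signal[i:i + window]
        target = chunk[start_mask:start_mask + mask_len]
        context = chunk[:start_mask] + chunk[start_mask + mask_len:]
        pairs.append((context, target))
    return pairs


# Made-up heart-rate samples standing in for one person's wearable data.
heart_rate = [62, 63, 65, 70, 78, 85, 88, 86, 80, 74, 70, 66, 64, 63, 62, 61]
pairs = masked_pairs(heart_rate, window=8, mask_len=2)
```

No labels are needed to build these pairs, which is the point: the supervision signal comes from the person's own recorded data.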
Challenges in Personalized AI Models and Self-Supervised Learning
- Personalized self-supervised learning on wearable signals can reveal how an individual's biosignals relate to each other without any labeled data, which makes it a powerful tool for predicting health outcomes 8m30s.
- The relationship between physical activity, heart rate, and skin temperature can be affected by individual physiology, and wearable devices can provide this information, but personalized models are needed to account for these differences 10s.
- Researchers experimented with publicly available datasets, including a stress dataset, and found that personalized models performed better than generalized models, requiring only a few labeled data points to converge 42s.
- However, when applying this approach to a dataset of nurses wearing wearables during the COVID-19 pandemic, challenges arose due to inconsistent labeling across and within nurses, as well as noisy data, which made it difficult to work with personalized models 2m6s.
Human-Computer Interaction (HCI) and Model Personalization
- These challenges highlight the importance of addressing fundamental issues with human behavior when building models that serve people, and the need for Human-Computer Interaction (HCI) to reduce user burden while maintaining the efficacy of interventions 2m6s.
- One core HCI question is how to reduce the burden of personalizing AI models for users of digital therapeutics, while maintaining the model's performance and user engagement, and potential approaches include leveraging active learning and machine learning 2m6s.
- Active learning involves using the model's output to determine when to obtain calibration data from the end user, such as when the model is uncertain about its prediction, but this approach is not always simple or practical, especially in real-world scenarios 2m6s.
- A layered approach is necessary when developing AI models for digital health, involving the examination of the model's output, patient feedback, and passive sensing of context, such as location, to intervene at the most critical time 10s.
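The uncertainty-driven form of active learning mentioned above can be sketched very simply: ask the user for a label only when the model's predicted probability is close to 0.5. The function name `should_request_label`, the `margin` value, and the example probabilities are assumptions for illustration; as the notes point out, a practical system would also weigh context and user burden.

```python
def should_request_label(predicted_probability, margin=0.15):
    """Return True when the prediction is close enough to 0.5 to count as uncertain."""
    return abs(predicted_probability - 0.5) < margin


# Hypothetical stream of model outputs for one user over a day.
stream = [0.05, 0.48, 0.91, 0.62, 0.30]

# Only the uncertain predictions would trigger a request for calibration data.
prompts = [p for p in stream if should_request_label(p)]
```

With these toy values, only two of the five predictions would prompt the user, which is how this strategy aims to limit burden.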
User Engagement and Intervention Design
- The timing and content of interventions can impact the burden on the end-user, which in turn affects their receptivity to the intervention, adherence, and engagement, with engagement referring to the use of an app and adherence referring to following the app's recommendations 2m6s.
- The design of studies to measure and account for these complex issues is ongoing, with the goal of addressing key challenges in the field, including the impact of AI model evaluation metrics on end-user concerns 2m6s.
- A common challenge in evaluating AI models is the reliance on default metrics, such as precision, recall, and F1 score, provided by tools like scikit-learn, without considering what these metrics actually tell us about patient and clinician use of the models 4m30s.
Evaluation Metrics and Their Limitations in AI Models
- The precision metric, which calculates the ratio of true positives to true positives plus false positives, has limitations, as it is influenced by the number of positives and negatives in the data set, and may not provide a complete picture of the model's performance 6m40s.
- Santosh Kumar, a mentor at the mDOT Center, has collaborated on the development of causal diagrams to better understand user behavior in mobile health, highlighting the complexity of factors that influence intervention effectiveness 1m30s.
- The same AI model can have different performance metrics in different settings, such as a population screening tool versus a specialty clinic, due to variations in population prevalence, and this can affect the precision and recall of the model 10s.
- Precision is sensitive to population prevalence, meaning that the same model can look good in one setting and bad in another, which is why sensitivity and specificity are often used instead of precision and recall in AI for healthcare 42s.
- Sensitivity is the same as recall, and it is calculated as the number of true positives over the total number of actual positives, while specificity is the number of true negatives over the total number of actual negatives, and these metrics are not sensitive to population prevalence 1m30s.
- Clinicians often prefer precision over sensitivity and specificity, despite the latter being less sensitive to population prevalence, and it is important to understand why they prefer precision and to carefully consider the metrics used to evaluate AI models 3m40s.
- Reporting the entire confusion matrix can provide more information than just using default metrics, and it allows users to calculate different metrics based on their needs, but in practice, people often prefer using interpretable metrics like sensitivity and specificity 5m10s.
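To make the prevalence effect concrete, the sketch below computes sensitivity, specificity, and precision from confusion-matrix counts, then applies one fixed model to two settings with different prevalence. The sensitivity and specificity of 0.90 and the prevalence values of 1% and 50% are hypothetical numbers chosen for illustration, as are the function names.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Sensitivity (recall), specificity, and precision from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def expected_counts(sensitivity, specificity, prevalence, n=10_000):
    """Expected confusion-matrix counts when a fixed model meets a given prevalence."""
    positives = n * prevalence
    negatives = n - positives
    tp = sensitivity * positives
    fn = positives - tp
    tn = specificity * negatives
    fp = negatives - tn
    return tp, fp, tn, fn


# Same hypothetical model (sensitivity 0.90, specificity 0.90) in two settings.
screening = metrics_from_counts(*expected_counts(0.90, 0.90, prevalence=0.01))
clinic = metrics_from_counts(*expected_counts(0.90, 0.90, prevalence=0.50))

print(f"screening precision: {screening['precision']:.2f}")  # roughly 0.08
print(f"clinic precision:    {clinic['precision']:.2f}")     # roughly 0.90
```

Sensitivity and specificity are identical in both settings, while precision changes by an order of magnitude, which is why reporting the full confusion matrix along with the setting matters.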
Threshold Adjustments and Model Performance in Clinical Contexts
- Easy-to-implement AI modeling decisions can have a significant impact on the end user, whether it is the patient or the clinician, and this is an important HCI challenge that needs to be addressed 8m20s.
- A toy example illustrates the effect of changing a model's decision threshold. Red data points represent people with cancer, green data points represent people without cancer, and a purple line represents the model's predicted probability. The default threshold is 0.5, but it can be changed to optimize the model's performance 10s.
- The model's performance can be adjusted by changing the threshold, for example, setting it to 0.3 to identify all cancer patients, but this may result in some people without cancer being misclassified as having cancer, or setting it to a higher threshold to achieve high specificity, but this may result in missing some people with cancer 2m6s.
- In clinical situations, the preferred scenario may vary, and students often prefer the higher sensitivity scenario, where the model catches everyone with cancer, even if it means some people without cancer are misclassified, but this may not always be the preferred approach in every clinical scenario 4m42s.
- The example of hospitalists encountering patients with sepsis is used to illustrate the importance of considering the clinical context when evaluating the performance of a model, and Epic's electronic health record system is mentioned as a platform where AI models can be deployed, including a sepsis model that was rolled out and later validated in real-world hospital settings 6m15s.
- The validation study of the Epic sepsis model found a sensitivity of 86% and a specificity of 81%. These numbers may seem good, but they depend on the chosen threshold and do not tell the whole story: the positive predictive value (also known as precision) and the negative predictive value also need to be considered 10m10s.
- The precision of a model used to predict sepsis was 34%, which is the likelihood that a patient actually has the condition when the model makes a positive prediction, and this low precision led to alert fatigue among clinicians who became overwhelmed with false alerts 10s.
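The threshold trade-off described in the toy example can be sketched by scoring the same predicted probabilities at several thresholds. The probabilities, labels, and threshold values below are made up for illustration and do not come from the talk or from any real model.

```python
def sensitivity_specificity(probabilities, labels, threshold):
    """Classify as positive when probability >= threshold, then score against labels."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 = has the condition, 0 = does not.
probabilities = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 1, 0, 0, 1, 0, 1]

results = {t: sensitivity_specificity(probabilities, labels, t) for t in (0.3, 0.5, 0.8)}
for t, (sens, spec) in results.items():
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data the low threshold catches every positive case at the cost of many false alarms, while the high threshold produces no false alarms but misses most positive cases. Which end is preferable depends on the clinical context, such as the cost of alert fatigue.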
Clinical Validation and Model Deployment
- Clinicians lose trust in models with low precision and tend to prefer models with high specificity. The Apple Watch's hypertension prediction model is an example: its sensitivity is in the low 40% range, but its specificity is around 92% 42s.
- Apple's decision to optimize for specificity was made after consulting with clinicians who informed them about the issue of alert fatigue, and this approach is considered a good move as it allows for more effective use of the model 2m6s.
- The importance of considering the clinical use cases of a model and how simple modeling decisions can affect its use is highlighted, and there are many examples of this across the field that require further study in HCI research 2m6s.
Community Engagement in Parkinson's Disease Assessment
- A research lab created a digital Parkinson's assessment that included mouse tracing tasks, keyboard pressing tasks, and cognitive assessments, and this was developed in collaboration with the community, including people with Parkinson's and their spouses, to ensure that the assessment was informed by their needs 4m30s.
- The development of the Parkinson's assessment involved stakeholder engagement from the onset, including participation in community events, such as those organized by the Hawaii Parkinson's Association, to inform the design of the assessment and ensure that it was relevant and effective 6m10s.
- The development of an assessment for Parkinson's disease involved partnering with Jerry Boster, the president of the Hawaii Parkinson's Association at the time, who provided valuable input on what to build, and there were two levels of community engagement: co-designing the assessment and community-based recruitment afterwards 10s.
- The collected data was used to build AI models, which performed well. Interestingly, the models performed better on Mac devices than on Windows devices, possibly because MacBooks are more homogeneous than Windows machines, and this could be an issue in terms of social determinants of health 42s.
Bias and Disparities in AI Model Performance
- The model also performed better for right-handed individuals compared to left-handed individuals, which can be problematic for motor assessments, and when the model's performance is broken down between different groups, it is found that it performs much better on some groups versus others 1m6s.
- Further research is being conducted to scale up the model and investigate factors such as internet latency, webcam resolution, technological proficiency, and age, which are proxies for social determinants of health, and can be directly measured in these assessments 2m6s.
- Efforts are also being made to algorithmically mitigate the discrepancies in the model's performance using techniques such as adversarial debiasing, and a paper on this topic has been published at the AMIA conference; that work focuses on defining bias and optimizing the model for bias metrics 3m30s.
- There are many quantitative metrics for measuring bias in AI models, including disparate impact, equalized odds, and equal opportunity, but these metrics have different implications and can conflict with each other, making it challenging to determine what to optimize for 4m40s.
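Two of the metrics named above can be computed from per-group rates. Equal opportunity compares true-positive rates across groups, and disparate impact compares the rates at which groups receive a positive prediction. The sketch below uses invented labels and predictions for right- and left-handed participants; the data, the function name `group_rates`, and the variable names are assumptions for illustration only.

```python
def group_rates(labels, predictions, groups):
    """Per-group true-positive rate and positive-prediction (selection) rate."""
    stats = {}
    for y, y_hat, g in zip(labels, predictions, groups):
        s = stats.setdefault(g, {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0})
        s["n"] += 1
        s["pred_pos"] += y_hat
        if y == 1:
            s["pos"] += 1
            s["tp"] += y_hat
    return {
        g: {"tpr": s["tp"] / s["pos"], "selection_rate": s["pred_pos"] / s["n"]}
        for g, s in stats.items()
    }


# Invented toy data: five right-handed, then five left-handed participants.
labels = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
groups = ["right"] * 5 + ["left"] * 5

rates = group_rates(labels, predictions, groups)

# Equal opportunity: difference in true-positive rates (0 means parity).
equal_opportunity_gap = rates["right"]["tpr"] - rates["left"]["tpr"]
# Disparate impact: ratio of selection rates (1 means parity).
disparate_impact = rates["left"]["selection_rate"] / rates["right"]["selection_rate"]
```

Equalized odds would additionally compare false-positive rates across groups. Because these quantities measure different things, improving one does not guarantee improving the others.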
Importance of Stakeholder and Clinician Involvement
- Discussions about Human-Computer Interaction (HCI) and Human-Centered AI for digital health are crucial, and there is a need for formal HCI work to determine the preferable approaches under different contexts 10s.
- The example of Parkinson's disease is used to illustrate the importance of engaging the community and patients, but also highlights the need to engage clinicians in the process, as they can provide valuable insights that may not be immediately apparent 42s.
- Medication timing is a major factor in Parkinson's disease, with the common medication levodopa being prescribed to control motor symptoms, and this cyclic process of symptoms and medication can affect the design of studies and the interpretation of results 2m6s.
- The study on Parkinson's disease did not account for medication timing, which is a significant confounder, and clinicians noted that predicting Parkinson's symptoms would be more useful than predicting the diagnosis itself 2m6s.
- Clinicians also emphasized the importance of predicting symptoms, as this would be more actionable and could help reduce the waiting list times for appointments and the burden on clinicians 2m6s.
- Another important factor that was not accounted for is the asymmetry of Parkinson's disease, which often starts on one side of the body before progressing to the other 2m6s.
- Disease stage is also a crucial factor to consider, and workflow fit is essential to ensure that AI models are integrated into the clinicians' workflow in a way that is easy to use and does not increase their burden 2m6s.
Designing AI Models for Clinical Workflow Integration
- The design challenge is to increase the information presented to clinicians while not making them stressed out, and this requires careful consideration of how to implement AI models in a way that is intuitive and user-friendly 2m6s.
- The University of California, San Francisco (UCSF) and the University of Hawaii are mentioned as institutions where the research was conducted, and the MDS-UPDRS assessment is noted as a common clinical practice for measuring Parkinson's symptoms 2m6s.
Legal and Ethical Considerations in AI Deployment
- The question of legal implications of optimizing sensitivity or specificity in models is raised, and it is noted that depending on the context of use, some models may require FDA approval, which can involve locking the model and its hyperparameters, and any changes would require reapproval 42s.
- In cases where FDA approval is not required, there are ongoing discussions about who should be liable in the event of errors, whether it is the AI model or the clinician, and these discussions are currently unresolved 2m6s.
- The issue of including disclaimers or limitations in original publications that are discovered after the fact is discussed, and it is noted that many of these limitations are often included in the publication's limitation section, but if not, it can be difficult for readers to be aware of them without additional context 4m30s.
Publication Practices and Peer Review in AI Research
- The importance of peer reviewers in identifying limitations and requiring authors to add them to the publication is highlighted, and it is also noted that the point of a paper can influence what limitations are included, with some papers focusing on specific aspects of a model's performance rather than its overall clinical usability 6m15s.
- The example of a Parkinson's paper is given, where the point of the paper was to highlight the lack of robustness of models to device type and handedness, rather than to present a clinically usable diagnostic assessment, and therefore the limitations of the model in terms of clinical usability were not the focus of the paper 8m0s.
Iterative Research and Future Directions
- The process of conducting studies is iterative, similar to the Human-Computer Interaction (HCI) field, and even a formal, large-sample-size preliminary study can provide important lessons that can be applied to future work, such as a study on Parkinson's disease, which is currently being developed into a clinically useful tool through NIH grant proposals 10s.
- There has been a question about standardizing model specifications, especially for private companies like Apple, but the answer to this question is currently unknown and requires further discussion 4m30s.
- The space of interventions being considered includes just-in-time interventions, such as stress prediction, and the example of nurses with high individual variability in stress prediction highlights the complexity of predicting human behavior and the need for alternative approaches 5m40s.
Alternative Interventions and Cognitive Approaches
- Alternative interventions that target cognitive processes, rather than relying on prediction, are being explored, including cognitive behavioral therapy, and other work is being done on chatbot-related projects that focus on safety 7m10s.
- The original Parkinson's study did not reveal any major surprises or findings that were at odds with later discoveries, but it did emphasize the importance of making the study accessible to an older, less technologically proficient population 10m30s.
Cultural and Demographic Considerations in Digital Health Studies
- Other studies, such as a substance use study in Hawaii using Fitbits, have provided interesting insights, including the value of collecting annotations of substance use and craving in real-time, which can inform the development of more effective interventions 12m10s.
- The study was conducted in Hawaii, with an emphasis on Native Hawaiian, Pacific Islander, and Filipino populations, and it was found that participants were hesitant to disclose their substance use during family gatherings, known as pau hana, which typically occur after work hours 10s.
- In other studies, such as those involving Parkinson's disease, simple design elements like font size were found to be important considerations 2m6s.
Measuring Engagement and Adherence in Digital Therapeutics
- Measuring user interaction and effectiveness in digital therapeutics is a significant challenge, and it involves distinguishing between engagement and adherence, with engagement being easier to measure through metrics like logging and app usage 2m6s.
- Adherence, on the other hand, is more difficult to measure, and researchers have been exploring methods such as automatic measurement and self-reporting, including asking participants if they completed a task and how burdensome it was to do so 2m6s.
- The goal is to study the relationship between burden and adherence, which is an open question in the field, and to better understand how to improve engagement and adherence in digital therapeutics 2m6s.








