YouTube video summary

Denys Linkov on Micro Metrics for LLM System Evaluation

Technology

16 Dec 2024 · 10 min summary

Micrometrics and Business Value

A micrometric is a specific metric defined to measure a problem seen in production, or one expected to occur, with the aim of moving a business metric. This distinguishes it from broader measures such as accuracy and data science metrics such as F1 and ROUGE 42s.
The idea of micrometrics is to optimize for business value rather than premature optimization for broader metrics, which may not reflect the user experience 53s.
An example of a micrometric is measuring how often a large language model switches languages, an issue that upset users and was fixed by implementing a retry mechanism 1m28s.
The retry mechanism was able to fix 99% of the language switching issues, and it was a simple solution that was tracked and measured 1m48s.
The use of micrometrics can help identify fundamental flaws within models, such as the language switching issue, which was not caused by a prompt template update 2m8s.
Different industries and domains will have different micrometrics, and it's up to the domain experts to define what's actually happening and what metrics to use 2m36s.
As a platform provider, one can only guess and see the problems that are encountered from customer complaints or interactions, but domain experts should learn to define their own metrics 2m43s.
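
The language-switching micrometric and its retry fix can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `generate` and `detect_language` are hypothetical callables standing in for a real model call and a real language detector, and the toy stand-ins at the bottom exist only so the sketch runs.

```python
from typing import Callable, NamedTuple


class LanguageCheckResult(NamedTuple):
    reply: str
    switches: int   # how many replies came back in the wrong language
    resolved: bool  # True if the final reply is in the expected language


def generate_with_language_retry(
    generate: Callable[[str], str],         # hypothetical: prompt -> model reply
    detect_language: Callable[[str], str],  # hypothetical: text -> language code
    prompt: str,
    expected_lang: str,
    max_retries: int = 2,
) -> LanguageCheckResult:
    """Retry generation while the reply is in the wrong language, counting each switch."""
    switches = 0
    reply = ""
    for _ in range(max_retries + 1):
        reply = generate(prompt)
        if detect_language(reply) == expected_lang:
            return LanguageCheckResult(reply, switches, True)
        switches += 1  # this count is the micrometric being tracked
    return LanguageCheckResult(reply, switches, False)


# Toy stand-ins (assumptions) so the sketch runs without a real model or detector.
_replies = iter(["Bonjour, comment puis-je vous aider ?", "Hello, how can I help?"])


def fake_generate(prompt: str) -> str:
    return next(_replies)


def fake_detect(text: str) -> str:
    return "fr" if text.startswith("Bonjour") else "en"


result = generate_with_language_retry(fake_generate, fake_detect, "Greet the user", "en")
print(result)
```

Logging `switches` and `resolved` per request is what would make a figure like the retry fix rate measurable over time.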

Accuracy Metrics and Model Evaluation

There is a difference between overall accuracy metrics and micrometrics, and the choice of large language model may still depend on the specific use case 3m4s.
Evaluating models is challenging, and accuracy can be measured in different ways, such as exact match or ROUGE, and different models may perform better in different use cases 3m18s.
Evaluating the accuracy of Large Language Models (LLMs) is challenging because each metric has flaws, so any accuracy figure should be treated as an approximation rather than an absolute measure 3m27s.
Human feedback and labeling can also be inconsistent, with low overlap between expert labelers and average people, making it difficult to achieve an agreeable answer 3m41s.
The use of LLMs can be attributed to laziness in defining good training and evaluation sets, but it's crucial to return to defining these sets and knowing exactly what to look for 4m10s.
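
To make the point about accuracy being an approximation concrete, here is a small sketch comparing exact match with a simplified ROUGE-1 recall. The ROUGE function is an assumption-laden simplification (lowercased whitespace tokens, unigram recall only), not the official ROUGE implementation, and the example sentences are invented.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Strict comparison after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def rouge1_recall(prediction: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    return overlap / sum(ref_counts.values())


reference = "the refund takes five days"
prediction = "a refund takes five business days"

print(exact_match(prediction, reference))    # strict metric
print(rouge1_recall(prediction, reference))  # overlap-based metric
```

The same answer fails the strict metric while scoring high on overlap, and neither number says whether the extra word "business" matters to the customer. That gap is where a narrowly defined micrometric helps.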

Defining Training and Evaluation Sets

Defining training and evaluation sets involves understanding the desired metrics, which can be granular and trackable through complex systems like retrieval and generation pipelines 4m19s.
In retrieval pipelines, metrics can include relevancy of retrieved documents, while generation pipelines involve more complex metrics like the BLEU score, which can be challenging to measure, especially for customer-facing agents 5m26s.
Measuring the value or accuracy of compound answers that require multiple sources can be particularly difficult, as seen in datasets like Multihop QA 5m53s.
Generation metrics can include specific requirements, such as mentioning the user's name, using a particular greeting, or adhering to brand voices, which can become defining factors and micrometrics 6m27s.
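
The split between retrieval metrics and generation micrometrics described above might look like this in code. It is only a sketch: the document IDs, the precision-at-k choice for relevancy, and the two generation checks are illustrative assumptions rather than metrics named in the talk.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)


def generation_checks(reply: str, user_name: str, greeting: str) -> dict[str, bool]:
    """Programmatic micrometrics on a single generated reply."""
    lowered = reply.lower()
    return {
        "mentions_user_name": user_name.lower() in lowered,
        "uses_greeting": lowered.startswith(greeting.lower()),
    }


# Hypothetical example data.
retrieval_score = precision_at_k(["d1", "d7", "d3"], {"d1", "d3"}, k=3)
checks = generation_checks("Hi Ana, your order has shipped.", user_name="Ana", greeting="Hi")

print(retrieval_score)
print(checks)
```

Scoring the two stages separately makes it possible to tell whether a bad answer came from retrieving the wrong documents or from generating poorly over the right ones.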

Metrics and Brand Voice

Defining metrics and priorities is not a one-time exercise but a continuous process that involves updating the system to adapt to evolving information and data drift 5m13s.
A list of 100 brand voice characteristics can be used to measure a language model's responses, with company-specific guidelines that can be measured quantitatively, for example by searching for certain keywords 6m53s.
Companies often struggle with defining a brand response, as there isn't a well-defined metric for qualities like kindness, requiring user or human evaluation to determine statistical significance 7m0s.
Programmatic guidelines can be used to measure specific aspects of a brand's voice, such as how to handle situations where a customer orders a product not available at the store 7m37s.
In a multi-model world, the response to complex scenarios could involve escalating to a more expensive model or connecting the customer to a human 8m21s.

Prompt Engineering and Optimization

A balance between short-term and long-term improvements can be achieved by using techniques like prompt engineering, which is still an immature field that requires more rigor and measurement 8m32s.
Auto-optimization frameworks like DSPy can be used to define a training set, test set, and optimizer to find a good set of prompts, and assertions can be used to validate the correctness of a prompt 9m1s.
The development of prompt engineering is expected to continue evolving, with a focus on building rigor and using multiple responses to evaluate the effectiveness of a prompt 9m22s.
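
The train set, test set, optimizer, and assertion idea can be illustrated without any framework. The sketch below is plain Python and deliberately does not use the DSPy API. The "optimizer" is just a search over candidate prompt templates, and `toy_model` is a hypothetical stand-in for an LLM call.

```python
from typing import Callable

Example = tuple[str, str]  # (question, expected answer)


def accuracy(template: str, examples: list[Example], model: Callable[[str], str]) -> float:
    """Exact-match accuracy of one prompt template over a labeled set."""
    correct = sum(model(template.format(question=q)) == a for q, a in examples)
    return correct / len(examples)


def pick_prompt(candidates, train, test, model, min_test_accuracy):
    """Choose the best template on train, then assert it holds up on test."""
    best = max(candidates, key=lambda t: accuracy(t, train, model))
    test_accuracy = accuracy(best, test, model)
    assert test_accuracy >= min_test_accuracy, f"prompt fails on test: {test_accuracy:.2f}"
    return best, test_accuracy


# Hypothetical toy model: answers tersely only when told to.
_ANSWERS = {"2+2": "4", "3+5": "8", "10-7": "3"}


def toy_model(prompt: str) -> str:
    for question, answer in _ANSWERS.items():
        if question in prompt:
            return answer if "number only" in prompt else f"The answer is {answer}."
    return ""


candidates = ["Q: {question}", "Answer with a number only. Q: {question}"]
train = [("2+2", "4"), ("3+5", "8")]
test = [("10-7", "3")]

best_prompt, test_acc = pick_prompt(candidates, train, test, toy_model, min_test_accuracy=0.9)
print(best_prompt, test_acc)
```

Keeping a held-out test set is the part that adds rigor: a prompt that only looks good on the examples it was tuned against would trip the assertion.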

User Experience and Design Patterns

In end-user applications, presenting multiple responses to the user can be useful in specific contexts, such as content evaluation, where the user guides the model by indicating which response they prefer 9m43s.
The design patterns for generative AI are still in the early stages, and more research is needed to develop effective user experience (UX) patterns 10m5s.
Many current chatbots and language models still use methods from 20 years ago, and consistency across answers is a concern, especially when upgrading or switching between models, as it can cause confusion for customers who receive different responses to the same question 10m16s.
The amount of context used can affect the model's performance, and intentional choices must be made about what factors to consider, such as time of day or user behavior, to improve the customer relationship 10m34s.

Model Upgrades and Performance

Model upgrades require evaluation to mitigate regressions, and a test suite is essential for tracking changes in model performance 11m7s.
Upgrading from one model version to another can change performance significantly even when the model is nominally the same, as seen when upgrading from ChatGPT's original version to the November version, which resulted in a 10% drop in accuracy 11m13s.
Model providers often do not provide transparency about changes made to the model or updated metrics, making it necessary for developers to conduct their own evaluations 11m38s.
Having multiple models can lead to conflicts, as different versions may perform better on certain tasks or have specific benefits, such as cost savings, requiring careful evaluation and migration strategies 11m58s.
Large language models can be complex and non-deterministic, making it challenging to understand the impact of changes, even when using traditional model training methods 12m48s.
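
A test suite for model upgrades can start as small as the sketch below, which compares two model versions on the same labeled cases and lists what regressed. The models here are hypothetical lookup tables standing in for real API calls, and the `max_drop` threshold is an arbitrary assumption.

```python
from typing import Callable, Optional

Case = tuple[str, str]  # (input, expected output)


def pass_rate(model: Callable[[str], Optional[str]], suite: list[Case]) -> float:
    """Fraction of suite cases where the model output equals the expected output."""
    return sum(model(x) == expected for x, expected in suite) / len(suite)


def compare_models(old_model, new_model, suite: list[Case], max_drop: float = 0.02) -> dict:
    """Report pass rates, per-case regressions, and whether the upgrade is acceptable."""
    regressions = [
        x for x, expected in suite
        if old_model(x) == expected and new_model(x) != expected
    ]
    old_rate = pass_rate(old_model, suite)
    new_rate = pass_rate(new_model, suite)
    return {
        "old_rate": old_rate,
        "new_rate": new_rate,
        "regressions": regressions,
        "acceptable": (old_rate - new_rate) <= max_drop,
    }


# Hypothetical stand-ins for two versions of the same model.
suite = [("a", "1"), ("b", "2"), ("c", "3")]
old_model = {"a": "1", "b": "2", "c": "3"}.get
new_model = {"a": "1", "b": "x", "c": "3"}.get

report = compare_models(old_model, new_model, suite)
print(report)
```

Because LLM output is non-deterministic, a real suite would likely run each case several times or use a tolerant comparison instead of strict equality.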

Micrometrics and Content Moderation

Micrometrics can be useful for evaluating model performance, and a crawl-walk-run approach can be used to start with macrometrics and gradually move to more specific metrics 13m18s.
Specific metrics can be used to evaluate LLM systems, such as measuring retrieval and generation separately in RAG, and these micrometrics can then target narrower behaviors 13m34s.
Content moderation policy is a common source of metrics: the goal is to decide how to respond when a user says something inappropriate, and this can be measured by tracking how many inappropriate or out-of-domain questions users ask 13m42s.
Different industries have different trade-offs between false positives and false negatives, and measuring how often these occur can be an important metric 14m30s.
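
The false positive and false negative trade-off for content moderation can be measured with a few lines of code. The labels and predictions below are invented example data, where `True` means "inappropriate" in the labels and "flagged" in the predictions.

```python
def moderation_error_rates(labels: list[bool], predictions: list[bool]) -> dict[str, float]:
    """False positive rate and false negative rate for a moderation filter."""
    false_positives = sum(pred and not label for label, pred in zip(labels, predictions))
    false_negatives = sum(label and not pred for label, pred in zip(labels, predictions))
    negatives = sum(not label for label in labels)  # truly acceptable messages
    positives = sum(labels)                         # truly inappropriate messages
    return {
        "false_positive_rate": false_positives / negatives if negatives else 0.0,
        "false_negative_rate": false_negatives / positives if positives else 0.0,
    }


# Hypothetical labeled sample of five user messages.
labels = [True, True, False, False, False]
predictions = [True, False, True, False, False]

rates = moderation_error_rates(labels, predictions)
print(rates)
```

Which of the two rates to minimize is a domain decision: wrongly blocking a customer and wrongly letting something through carry different costs in different industries.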

Voiceflow Platform Overview

Voiceflow is an AI orchestration platform that helps customers build out different workflows for tasks such as customer support and lead generation, and they focus on team collaboration, control, and hosted solutions 14m39s.
Voiceflow's platform allows users to define business logic and build out workflows using a low-code approach, and they also provide features such as content moderation and event logging 15m13s.
The platform allows users to instrument the system in different ways, such as using different prompts or API calls, and this can be used to track milestones that users go through in a workflow 16m29s.
The hosted aspect of the platform allows users to build prototypes quickly and launch to production, and this can be useful for tasks such as building a new bank account opening workflow 16m0s.
The platform provides useful analytics, such as tracking where users drop off in a workflow, and this can be used to improve the overall user experience 16m51s.
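
Drop-off analytics over workflow milestones can be approximated as below. This is a generic sketch and not Voiceflow's API: the step names and the session data are hypothetical, and each session is modeled simply as the set of milestones it reached.

```python
def drop_off_rates(steps: list[str], sessions: list[set[str]]) -> dict[str, float]:
    """Share of users lost between each pair of consecutive workflow steps."""
    reached = [sum(1 for session in sessions if step in session) for step in steps]
    rates = {}
    for i in range(len(steps) - 1):
        before, after = reached[i], reached[i + 1]
        key = f"{steps[i]} -> {steps[i + 1]}"
        rates[key] = 0.0 if before == 0 else 1 - after / before
    return rates


# Hypothetical account-opening workflow with four logged milestones.
steps = ["start", "id_check", "deposit", "done"]
sessions = [
    {"start", "id_check", "deposit", "done"},
    {"start", "id_check"},
    {"start", "id_check"},
    {"start"},
]

funnel = drop_off_rates(steps, sessions)
print(funnel)
```

In this invented data the largest loss is between the ID check and the deposit step, which is the kind of signal that would point to where the workflow needs attention.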

Custom Functions and Multimodal Models

Custom functions and components can be written and reused for specific business requirements, allowing developers to build tailored user experiences 16m58s.
Multimodal models are emerging, but they are not replacing existing tools; instead, they provide another option for people to use, and their application depends on the specific scenario 17m30s.
A platform can be used to orchestrate different models and tools, such as choosing a model to process user-uploaded receipts and then verifying the information against an ERP or API-based system 17m59s.
The platform allows for balancing general knowledge from AI systems with the need to program specific workflows and constrain users to a particular process 18m19s.
The platform provides flexibility, enabling users to build differently and create custom workflows, from simple user input loops to more complex, strictly defined requirements 18m36s.
In certain industries, such as banking, specific legal policies or terms and conditions must be output verbatim, requiring a more controlled workflow and less interpretation by large language models 19m9s.

AI Product Development and Iteration

Many people start by building a basic product and then refine it based on user feedback and production data, treating AI as a product that requires ongoing development and improvement 19m43s.
Launching an AI-powered product is not a one-time event; it requires ongoing evaluation and adaptation to changing business needs and user behaviors 20m1s.
The technology landscape is constantly evolving, so AI systems need to be updated and adapted just like software, using agile release processes so that they remain living artifacts rather than static ones 20m9s.
Building experiences can be interesting, especially when working with a large number of users, including 4 million free users and 60 enterprises, each building different things 20m29s.
Integrating with other APIs can still be a challenge, and having a natural language interface might make it easier, but tool use is still immature for what is needed 20m42s.
Defining business processes and making specific API calls can be more efficient than using natural language queries, which can be too vague 20m59s.

LinkedIn Courses and Learning Preferences

Creating LinkedIn courses on various topics, including prompt engineering, AI pricing, and grounding techniques, has been a successful experience, with a wide range of attendees 21m42s.
The process of creating courses on LinkedIn involves working with a content manager to define priorities and interests, creating a table of contents, writing the content, and collaborating with a producer to bring the course to life 22m25s.
The courses attract a variety of people, including those who are curious about AI, with some having more advanced experience, but the platform is not yet known for expert-curated courses 22m44s.
The platform is available to educational providers, universities, libraries, and companies, which use it as their Learning Management System (LMS), resulting in a diverse range of attendees 23m6s.
The best-performing courses tend to be introductory ones, such as "Intro to GPT-4," as people are looking for a basic understanding of the AI field 23m25s.
Individuals have different learning preferences, with some opting for courses over self-directed learning through resources like Jet open.com 23m37s.

Conclusion

The conversation concludes with appreciation for the guest's participation in the podcast and their presence at QCon San Francisco 23m42s.
