YouTube video summary

Namee Oberst on Small Language Models and How They are Enabling AI-Powered PCs

04 Nov 2024 · 16 min summary

Introduction to AI and Small Language Models (SLMs)

Microsoft Reactor provides events, training, and community resources to help developers, entrepreneurs, and startups build on AI technology; more information is available at aka.ms/infoq/reactor 9s
Namee Oberst is the founder of Ai Bloks, the company behind LLMWare, an open-source LLM framework used for Gen AI applications in the financial services and legal industries 34s
LLMWare aims to make generative AI easy to develop, deploy, and use in the enterprise and in regulated industries, with a focus on doing so securely, safely, and cost-effectively 1m40s
Namee Oberst's background as a corporate attorney in big law motivated her to start an AI company to automate repetitive tasks, and her experience with highly regulated industries led her to focus on small language models 2m22s
Small language models can perform focused, targeted tasks such as contract analysis, information retrieval, and providing concrete facts, and have been found to be more reliable and predictable than OpenAI calls in some ways 3m29s
Ai Bloks started working on small language models 20 months ago and launched its open-source project about a year ago, with the goal of enabling mobile devices and edge computing servers to leverage Gen AI solutions that were previously limited to large language models 2m1s
The emergence of small language models is enabling AI-powered PCs and allowing for the deployment of Gen AI solutions in industries that were previously limited by the need for large language models 1m7s
OpenAI had significant issues with outages and accuracy, prompting the development of alternative models that can be run locally and produce comparable results for specific use cases 3m58s.
The first models released were called Dragon: seven billion parameter models trained not to hallucinate and to provide accurate answers with quality scores, aiming to offer users the accuracy of a corporate lawyer 4m37s.
The use of small language models has evolved to include smaller RAG fine-tunes and function calling models with one to three billion parameters, which can automate workflows and processes 5m11s.

The Rise of Small Language Models (SLMs)

Large language models (LLMs) have been a major innovation in AI and ML, but they require significant computing resources and raise privacy concerns, limiting their adoption 6m7s.
Small language models (SLMs) offer an alternative to LLMs, providing a more feasible solution for companies with limited computing resources and data privacy concerns 7m26s.
The development of SLMs has the potential to make a significant impact on the adoption of AI solutions, offering a more accessible and efficient alternative to LLMs 5m57s.
SLMs can be used in various business and technical use cases, empowering end-users, software engineers, and devops engineers to be more productive 6m39s.
The use of SLMs can also address the challenges associated with LLMs, such as the need for significant computing resources and data privacy concerns 7m5s.
Small language models (SLMs) offer many of the same benefits as large language models (LLMs) but are smaller in size, trained using smaller data sets, and don't require a lot of computing resources 7m31s.
SLMs are valuable for use cases where resources are constrained or model execution needs to be localized, and they open up new opportunities to run models on smartphones and other mobile devices used for edge computing 7m58s.
SLMs keep data within the device, making them great candidates for use cases where privacy, latency, or other concerns exist in sending data to the cloud 8m17s.
The definition of a small language model has changed over time, and it is now possible to run models with up to 14 billion parameters on an Intel-based AI PC 9m23s.
The latest AI PCs, such as those from Intel, are enabling the development of SLMs that can run on commodity laptops, with prices starting from around $1,100 10m27s.
The small language models themselves are getting better by the week, making innovation possible, and the hardware on edge devices is also improving 10m6s.
The innovation in training SLMs is also advancing, with new technologies such as Apple Intelligence being released at the end of the month 11m8s.
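The parameter counts quoted above can be turned into a rough memory figure with simple arithmetic: weight storage is approximately parameters times bits per weight, divided by 8. The sketch below is only an illustration; the precision levels listed and the choice to ignore activations, KV cache, and runtime overhead are assumptions, not figures from the talk.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# Assumption: only the weights are counted; activations, KV cache, and
# runtime overhead are ignored, so real usage will be higher.

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[precision]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Model sizes mentioned in the summary: 1B, 3B, 7B, and 14B parameters.
    for size in (1, 3, 7, 14):
        row = ", ".join(
            f"{name}: {weight_memory_gb(size, name):.1f} GB"
            for name in BITS_PER_WEIGHT
        )
        print(f"{size}B parameters -> {row}")
```

Under these assumptions a 14 billion parameter model needs about 28 GB for its weights at 16 bits per weight and about 7 GB at 4 bits, which suggests why lower-precision model files matter for running on a laptop.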

Advantages and Applications of SLMs

SLMs are not a one-size-fits-all solution and are not suitable for every use case, but they are valuable for specific applications where resources are limited or model execution needs to be localized 7m42s.
A 3 billion parameter model is being developed to run on-device, made possible by pruning a 6.4 billion parameter model, which involves removing unnecessary parts and using distillation techniques to create a smaller version 11m11s.
The use of small language models is becoming increasingly innovative, making it difficult to say that large language models are necessarily better, as the choice of model depends on the specific use case and task 11m47s.
Using the right size model for the right problem is crucial, and leveraging small models with proprietary data can result in comparable accuracy to larger models 12m21s.
A study by Nvidia found that using 40 times fewer training tokens, small language models can achieve comparable results to larger models, and even outperform them in some cases 12m40s.
The combination of distillation, pruning, and fine-tuning with proprietary data can result in a model that is 16% better than a model of the same size trained from scratch 13m2s.
The ability to run these models on a laptop, without the need for a special GPU farm, is democratizing AI and making it more accessible to a wider range of users 13m14s.
The increased accessibility of AI models is expected to revolutionize the field and open up new use cases, making it possible for people to use AI for day-to-day tasks and micro-tasks 13m43s.
The development of small language models is also addressing concerns around data privacy and leakage, as users will be able to query their documents and chat with the model without worrying about data security 14m48s.
Small language models can be used to bring AI to workers' fingertips, increasing productivity by automating various tasks and making information more accessible, especially for those with laptops 15m7s.
The use of small language models can democratize AI use cases, making them more accessible to a wider range of people and organizations 15m41s.
Basic use cases for small language models include finding information in documents, such as searching for specific details in an 80-page contract, which can be done locally on a laptop without the need for internet access 15m54s.
Small language models can also be used for tasks like summarization, SQL queries, and transcription of voice recordings, automating microtasks that are part of day-to-day work life 16m40s.
The use of small language models on local devices can help address data privacy concerns, as sensitive information does not need to be uploaded to the cloud 17m2s.
Small language models can be used to automate workflows, such as creating reports for financial analysts, which can include tasks like looking up company information, stock prices, and historical data 17m42s.
Agent workflows can be created to automate tasks, such as making API calls to services like Yahoo Finance or Wikipedia, to gather information and generate reports 17m59s.
The promise of AI is to serve as a co-pilot for everyday working life, making it accessible to users on their devices, rather than being a behemoth use case only large companies can access 18m20s.
Small language models and retrieval-augmented generation (RAG) are a good fit for each other, as the models can be trained with a company's private information and used to ask domain-specific questions, making them suitable for commodity hardware 19m0s.
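The document-lookup and RAG ideas above can be sketched as a retrieval step followed by prompt construction. This is a minimal illustration, not the LLMWare implementation: the bag-of-words cosine similarity stands in for a real embedding model, and the passages, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list, top_k: int = 1) -> list:
    """Return the top_k passages most similar to the question."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, context: list) -> str:
    """Assemble the prompt that would be passed to a local model."""
    joined = "\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# Hypothetical contract excerpts, invented for illustration only.
PASSAGES = [
    "The term of this agreement is five years from the effective date.",
    "Either party may terminate with ninety days written notice.",
    "The governing law of this agreement is the State of New York.",
]

if __name__ == "__main__":
    question = "What is the governing law?"
    print(build_prompt(question, retrieve(question, PASSAGES)))
```

In a real deployment the prompt would go to a locally running small model, so neither the contract text nor the question leaves the laptop.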

Technical Aspects and Performance of SLMs

RAG is not necessarily better with large models, and studies have shown that large language models are not designed for complex RAG; the key to a successful RAG deployment is the workflow and the accuracy of the chain from document ingestion to inferencing 19m35s.
Combining RAG with a specialized embedding model that understands the domain can lead to fast inference speeds, especially on new AI PCs with integrated GPUs, which show a noticeable performance difference compared with older Intel-based devices 20m14s.
The difference between running PyTorch and running OpenVINO on the GPU can be significant: a 5-second to 15-second difference on a 21-question inference test with a 1.1 billion parameter model, which allows subsecond response times with the right inferencing technique and hardware 20m43s.
AI is becoming more accessible, coming to users at their fingertips, and bringing value to those who can utilize it, rather than sending data to the cloud 21m1s.
The AI, ML and Data Engineering Trends report, recorded in August, provides more information on the topic and will be linked for reference 21m20s.
Small language models are being used in applications like auditing and compliance, enabling proactive compliance and regulations by design rather than by accident 21m38s.
Features related to compliance and auditability, such as AI explainability and guardrails, are crucial for AI-powered PCs 22m3s.
Small language models shine in AI explainability, allowing for visibility into every single step of the decision-making process 22m16s.
Chaining workflows with small language models can create decision trees based on model inferences and answers, enabling course correction and fault identification 22m32s.
Unlike large language models, small language models provide visibility into the decision-making process, allowing for the identification of mistakes and the rationale behind the workflow design 23m24s.
AI explainability is critical for Enterprise applications, enabling the exposure of options considered by the model and the chosen outcome at every step of the process 24m0s.
Observability and explainability factors are crucial for systematic and observable decision-making, especially when deploying new processes 24m36s.
The use of small language models captures every interaction, inference, and decision, providing all necessary data for auditability and compliance purposes 24m46s.
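The chaining and auditability points above can be sketched as a workflow runner that records every step's input, output, and elapsed time. This is a hedged illustration: the two "models" are hard-coded stub functions standing in for small-model inferences, and the class and function names are hypothetical rather than part of any framework.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """One audited inference: what went in, what came out, how long it took."""
    step: str
    input: str
    output: str
    elapsed_s: float


@dataclass
class AuditedChain:
    """Runs steps one at a time and keeps a record of each for later review."""
    records: list = field(default_factory=list)

    def run(self, name, fn, text):
        start = time.perf_counter()
        output = fn(text)
        elapsed = time.perf_counter() - start
        self.records.append(StepRecord(name, text, output, elapsed))
        return output


# Stub "models" (assumption): real steps would call a local small model.
def classify_sentiment(text: str) -> str:
    return "negative" if "late" in text.lower() else "positive"


def choose_action(sentiment: str) -> str:
    return "escalate to account manager" if sentiment == "negative" else "no action"


def handle(chain: AuditedChain, message: str) -> str:
    """A two-step decision chain: classify first, then branch on the result."""
    sentiment = chain.run("sentiment", classify_sentiment, message)
    return chain.run("action", choose_action, sentiment)


if __name__ == "__main__":
    chain = AuditedChain()
    print(handle(chain, "The shipment was late again"))
    # The audit trail shows each intermediate answer, not just the final one.
    print(json.dumps([asdict(r) for r in chain.records], indent=2))
```

Because each intermediate answer is stored, a reviewer can see which step produced a wrong result and swap out or retune only that step.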

Comparing SLMs with LLMs and Their Combined Use

Large language models are good at preserving the context of conversations, making them suitable for customer-facing chatbots that can go on for days, due to their large context window that can keep the conversation going and preserve context over long sessions 26m40s.
Small language models can be used for processes that run in the background, such as hourly, daily, weekly, or monthly tasks, and can be chained together on CPUs to run on inexpensive hardware 27m36s.
Small language models can be powerful on the edge in devices, doing real-time analytics for finding defects or other use cases in manufacturing processes, and then sending results to the cloud for offline analytics to generate more insights 28m30s.
The combination of large and small language models can be the best choice for certain use cases, where large models can capture conversations and preserve context, and small models can drive insights and analytics off of that conversation in batch processes 26m1s.
Large language models are not necessarily better than small models, as small models can be just as good in terms of performance or accuracy, despite being smaller in size 25m51s.
The use of small language models can provide visibility into every single step in the process, which is beneficial for auditors who want to know all the under-the-hood details 25m30s.
Small language models can be used for specific use cases, such as Edge on-device analytics, real-time defect detection, and other use cases in manufacturing processes 28m11s.
Local modeling can generate insights that are sent to the cloud to train large language models, creating a feedback loop in which localized small language models and cloud-based large language models improve each other 28m35s.

Cost-Effectiveness and Adoption of SLMs

Using a large language model for everything can be overkill and a tremendous waste of resources, especially for startups that are financially constrained 29m5s.
Small language models can be a first step in the learning and adoption process for companies, allowing them to learn the process, solutions, and invest in a bigger solution later 29m38s.
Enterprises are extremely cost-sensitive and look for cost efficiency, security, safety, and performance, making small language models a viable option 29m52s.
During the exploration phase, companies can try low-cost, high-performance, and easy-to-run models and then grow into large models 30m30s.

Tools and Resources for Using SLMs

To adopt or try small language models, the required infrastructure or tools depend on the laptop, with different recommendations for Mac and Intel-based machines 30m48s.
For Mac users, the GGUF quantized version and a solution like Ollama are recommended, while for Intel-based machines, the OpenVINO library is the preferred choice 31m17s.
The OpenVINO library is supported by LLMWare, but users need to download it themselves to work with the library 31m45s.
The Microsoft ONNX version of a model is a cross-platform approach that offers middle-of-the-road performance, with negligible differences compared to the GGUF model on a Dell machine 32m4s.
ONNX is a good option for non-Mac and non-Intel based machines, while GGUF is the fastest way to run inferencing on a Mac, and OpenVINO is the best option for Intel-based machines 32m34s.
A four-year-old Dell machine can achieve the same performance as an M3 in terms of inference speed using the right software, highlighting the importance of matching software to hardware 32m48s.
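The recommendations above amount to a small decision rule, sketched below. The function name is hypothetical, and detecting an Intel CPU from the string returned by `platform.processor()` is a rough heuristic (the value varies by operating system and may be empty), so treat this as an illustration of the rule rather than reliable hardware detection.

```python
import platform


def recommend_format(system: str, cpu_info: str = "") -> str:
    """Map a machine description to the model format suggested in the talk.

    system   -- value such as platform.system() returns ("Darwin", "Windows", "Linux")
    cpu_info -- free-text CPU description; checked for "intel" (heuristic, assumption)
    """
    if system == "Darwin":
        return "GGUF"       # Mac: GGUF quantized models
    if "intel" in cpu_info.lower():
        return "OpenVINO"   # Intel-based machines
    return "ONNX"           # cross-platform fallback


if __name__ == "__main__":
    choice = recommend_format(platform.system(), platform.processor())
    print(f"Suggested model format for this machine: {choice}")
```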

The Future of AI and SLMs

The development of small language models is enabling AI-powered PCs, allowing for the democratization of AI and making it as ubiquitous as regular software 34m17s.
Model HQ is a product that allows users to run openVINO models without needing to know C++, making it easy to use and accessible to a wider audience 34m36s.
The future of AI development and PC hardware is expected to lead to better and more ubiquitous AI, with the potential for AI to be as common as regular software in the next three years 35m5s.
Small language models can operate at the edge, without the need for cloud connectivity, and can be used in a variety of devices, including laptops, smartphones, IoT devices, and sensors in manufacturing plants or autonomous vehicles 33m49s.
The emergence of small models is accelerating the power of AI PCs and devices, with limitless use cases and potential applications 34m7s.
AI will become ubiquitous in the future, making it a standard feature in applications, similar to software, with some applications having to explicitly state that they do not include AI 35m29s.
The power of small language models is increasing due to advancements in distillation, pruning, and combination techniques, allowing them to run on smaller hardware footprints 35m46s.
The definition of a small language model is changing, enabling the possibility of running large models, such as a 20 billion parameter model, on a laptop 36m6s.
Smaller models, like three billion parameter models, are becoming increasingly powerful, making them a viable option for many applications 36m26s.
Small language models and AI-powered PCs are complementary technologies that will drive innovation in each other at a rapid pace 36m39s.
The price point of AI-powered PCs is relatively inexpensive, considering their powerful capabilities, with options available for around $1,000 to $2,000 37m35s.
The Lunar Lake version of AI-powered PCs is becoming available for consumers, offering powerful GPU capabilities 37m12s.

Best Practices and Recommendations for SLMs

Best practices for using small language models include starting with Microsoft models, such as the Phi series, and being aware of the rapidly evolving nature of these models 38m20s.
It is recommended to try out small language models and experiment with different options, such as using LLMWare for inference on Mac devices 38m35s.
Small language models can be easily installed and used, and they are capable of performing various tasks, with the caveat that they may not be as good as larger models for tasks like video or image generation 38m39s.
These models can be stacked together to create workflows, allowing for workflow automation, and can be used for tasks such as sentiment analysis, named entity recognition, and information extraction 39m24s.
There are a dozen or so models available that have specific functions and can be chained together in a workflow, and there are also many examples and YouTube videos available to help users get started 40m1s.
When deploying small language models to production, they are more secure in some ways because they are less susceptible to suggestions and hacks, and are less likely to respond to prompt injection attacks 40m52s.
Small language models are also great for observability, as they can be easily tested and debugged, and can be swapped out or fine-tuned if they are not performing as desired 41m35s.
The use of small language models can also make it easier to identify and fix issues, as it is clear where the model is failing or succeeding, and the data set can be examined to identify any problems 41m43s.

SLM Ops and Deployment Considerations

The deployment of small language models, or "SLM Ops", is an important consideration, and users should be aware of the potential benefits and challenges of using these models in production 40m34s.
When creating an AI workflow, it's often better to start with small language models, chain them together, and then increase the model size if necessary, rather than starting with a large model such as one from OpenAI 42m16s.
Starting with a small model, such as a 1 billion parameter model, and then substituting larger models, like 3 billion or 7 billion, can be an effective approach 42m36s.
A 10 billion parameter model can likely solve most workflows, except for hard exceptions like video creation and image generation 42m51s.
It's recommended to start small, explore, and iterate when working with AI models 43m18s.
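The start-small approach above can be sketched as an escalation ladder: try the smallest model first and move to the next size only when the answer fails a check. Everything here is illustrative; the stub models, the ladder labels, and the acceptance check are assumptions, and a real check would be task-specific.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]


def run_with_escalation(
    prompt: str,
    ladder: Sequence[Tuple[str, Model]],
    accept: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try models from smallest to largest; return (model_name, answer).

    Stops at the first answer that passes `accept`. If none passes, the
    answer from the last (largest) model is returned.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one model")
    name, answer = "", ""
    for name, model in ladder:
        answer = model(prompt)
        if accept(answer):
            break
    return name, answer


# Stub models (assumption): stand-ins for 1B, 3B, and 7B parameter models.
def model_1b(prompt: str) -> str:
    return ""                      # pretend the smallest model fails

def model_3b(prompt: str) -> str:
    return "Net revenue: 4.2M"     # pretend this one succeeds

def model_7b(prompt: str) -> str:
    return "Net revenue: 4.2M USD"


if __name__ == "__main__":
    ladder = [("1B", model_1b), ("3B", model_3b), ("7B", model_7b)]
    used, answer = run_with_escalation(
        "Extract net revenue.", ladder, accept=lambda a: bool(a.strip())
    )
    print(f"Answered by the {used} model: {answer}")
```

The larger models are only invoked when a smaller one falls short, which keeps the common case cheap.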

Online Resources and Communities for AI

For online resources, MarkTechPost is a good source for leading-edge AI research, and YouTube channels like AI Anytime and World of AI are great for tutorials and exposure to the latest AI developments 43m35s.
Hugging Face's LinkedIn page is also a good resource, as it promotes new models, and InfoQ's website has good information on AI, including articles on Apple's foundation models 44m18s.
AI is becoming ubiquitous and is being integrated into various communities, including architecture, devops, cloud, security, and machine learning 44m52s.
The question remains when AI will become a regular thing, like having a personal website, and no longer be a topic of excitement 45m14s.
The integration of AI in everyday technology is expected to become the norm, similar to how it is now understood that everything will have the internet in it 45m27s.
Looking back, it will be clear when AI became an integral part of daily life, but it may not be immediately apparent when it happens 45m42s.
Namee encourages the InfoQ community to keep experimenting and playing around with AI-powered technologies 46m0s.
LLMWare, an open-source project, offers an end-to-end solution for small language models and is free for anyone to try out 46m7s.
Small language models have the potential to commoditize and localize language model solutions, allowing for a bigger impact on the software development community 46m24s.
The AI/ML and data engineering community page on the InfoQ website is a resource for learning more about AI/ML topics, including recent podcasts and the AI/ML Trends report for 2024 46m40s.
The AI/ML Trends report for 2024 covers topics such as small language models, AI-powered PCs, coding assistance, and other trends in the field 46m56s.
