Yash Patil's Background and Journey
- Yash Patil, the founder and CEO of Applied Compute, is a recent Stanford graduate who joined OpenAI research directly after completing his undergraduate studies. He talks about his journey and how he built Applied Compute into a successful business 10s.
- Yash walks through a journey full of tough choices, including deciding what to study, from growing up in Austin, Texas, to coming to Stanford for school 2m6s.
- During his time at Stanford, Yash was a self-described "bad student" who never went to class and instead watched online lectures, but he ended up building a bunch of projects on campus and got connected to Sam Altman, the CEO of OpenAI, through mutual friends 4m30s.
- Yash met Sam Altman during the summer after his freshman year. Sam gave him a small check to cover food and rent so that he could work on a project instead of doing a summer internship, and this experience ultimately led to him joining OpenAI 6m15s.
Joining OpenAI and Early Career
- In late 2022, Yash was introduced to ChatGPT, and he was so impressed that he decided to join OpenAI, specifically the post-training team, where he worked with people he admired in the language model universe 8m45s.
- During his time at OpenAI, Yash worked on evaluations, which he recommends to anyone joining a new company, as it allows you to work on challenging tasks and gain recognition, and he also witnessed the emergence of reasoning models that achieved massive performance increases 11m30s.
Innovation at OpenAI and the Formation of Applied Compute
- The idea of applying models to tasks outside of competitive coding and math led to the development of an agent that could browse the internet and write code. It was showcased to leadership and resulted in the formation of a team called Long Horizon Tasks, focused on agentic coding research 10s.
- The team's research eventually became Codex, and Yash later left to start Applied Compute, a company that helps enterprises create specialized models to enhance their business using frontier technology 10s.
Evolution of Deep Learning and Model Development
- The advancement in model development over the past century, particularly in the last four years, is notable, with the introduction of AlexNet being a pivotal moment for deep learning 2m6s.
- Deep learning is a method of machine learning that allows for the learning of underlying representations from data, resulting in smart models with millions or billions of parameters, but the exact workings of these models are not fully understood, 2m6s.
- The use of GPUs and large datasets, such as ImageNet, has led to significant gains in predictive accuracy, and model training has become a complex process with various aspects, including the development of new architectures like the transformer, 2m6s.
- The transformer, introduced by researchers at Google Brain, is a new architecture that allows for the scaling of language model training and has been a key factor in the advancement of model development in recent years, 2m6s.
- The development of AI models has led to significant improvements in performance on existing hardware, with techniques such as attention enabling better performance in next token prediction and scaling to long sequences in language 10s.
Scaling Laws and the Emergence of General Intelligence
- The era of pre-training, from 2018 to 2019, involved teaching models to predict the next token by minimizing the prediction loss, which led to better performance. It was followed by the era of scaling laws, which showed that scaling up models leads to improved performance 42s.
- The OpenAI (Kaplan) scaling laws demonstrated the importance of scaling up models, with GPT-3 being a breakthrough moment in achieving general intelligence, and the Chinchilla scaling laws provided a compute-optimal way to scale models 1m6s.
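As a rough illustration of what "compute-optimal" means, the sketch below splits a training budget between model size and data. It assumes two commonly cited rules of thumb that are not taken from the talk: training cost of about 6 FLOPs per parameter per token, and about 20 training tokens per parameter. The function name and constants are illustrative only.

```python
import math

# Assumptions (rules of thumb, not figures from the talk):
FLOPS_PER_PARAM_TOKEN = 6.0  # training FLOPs C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # Chinchilla-style ratio D ≈ 20 * N


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) that spend the budget at the assumed ratio."""
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N^2
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    n, d = compute_optimal_split(1e23)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions, both model size and token count grow with the square root of the compute budget, which is why data availability becomes a constraint as budgets grow.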
Model Usability and Safety through Reinforcement Learning
- The development of models that are generally useful has led to a focus on creating interfaces that make them accessible to everyday people, with reinforcement learning and human feedback preference tuning being used to steer models and improve their safety and usefulness 2m6s.
- The introduction of GPT-4 and models like o1 has led to significant improvements in model quality. Chain of thought appeared as a completely emergent behavior, enabling models to reason and correct themselves, and combining this with tool use has led to the development of AI co-workers 4m6s.
Ingredients and Bottlenecks in Model Development
- The ingredients that go into making a great model include data, compute, talent, and algorithms, and understanding the bottlenecks in the past, present, and future is crucial for continued progress in AI development 6m6s.
- The bottlenecks in developing AI models have shifted over time, from having sufficient compute to train models, to finding the correct architecture, to scaling pre-training data, and now to making models usable through preference tuning, with the current bottleneck being continual learning, which enables models to learn from extremely sparse rewards 10s.
- Continual learning is considered the holy grail of AI development, as it would allow models to learn from real-world interactions with minimal data, similar to how humans learn from a single experience such as burning a hand on a stove. The hope is that real-world use will provide a similarly loud feedback signal for models 2m6s.
Code as a Focus for AI Research and Applications
- The convergence of labs on focusing on software engineering, particularly code, is due to the unique properties of code, including the ability to verify rewards through compilation, unit tests, and the ease of generating synthetic data, making code a valuable area of study for reinforcement learning 4m30s.
- Code is also considered a general language for interacting with the real world, and many researchers believe that coding models are AGI-complete, meaning that every task can be boiled down to a coding task, which is why models like Claude are being developed to write code for various tasks 6m20s.
- There is also a need to develop AI models that can excel in non-code or non-code-adjacent jobs, such as creating presentations or slides. One example given is a slide generated entirely with Claude Cowork from an initial set of conversations 9m40s.
Slide Generation and Non-Code Applications
- The relationship between code and slide generation is explored, with the ability of models to generate slides being linked to their proficiency in code, and the possibility of combining model outputs with auxiliary rewards to optimize for aesthetics, such as using a reward model trained on human preferences for aesthetically pleasing slides 10s.
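A minimal sketch of how a primary task reward could be blended with an auxiliary aesthetics reward. The weighting scheme, the `aesthetic_weight` default, and the assumption that both scores lie in [0, 1] are illustrative choices, not details from the talk; in practice the aesthetic score would come from a reward model trained on human preferences.

```python
def combined_reward(
    task_score: float,
    aesthetic_score: float,
    aesthetic_weight: float = 0.2,  # hypothetical default
) -> float:
    """Blend a task reward with an auxiliary aesthetics reward.

    Both scores are assumed to be normalized to [0, 1].
    """
    for name, value in (
        ("task_score", task_score),
        ("aesthetic_score", aesthetic_score),
        ("aesthetic_weight", aesthetic_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Convex combination keeps the result in [0, 1].
    return (1.0 - aesthetic_weight) * task_score + aesthetic_weight * aesthetic_score


if __name__ == "__main__":
    # A slide that renders correctly but looks mediocre.
    print(combined_reward(task_score=1.0, aesthetic_score=0.5))
```

Keeping the auxiliary weight small is one way to stop the model from trading correctness for looks, though the right balance would need to be tuned.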
Pre-training and Post-training in Language Models
- Pre-training and post-training are two key concepts in language modeling, with pre-training involving massive training efforts using internet-scale data and compute to learn patterns in language, resulting in a form of intelligence, and post-training being the process of aligning the pre-trained model with specific tasks and guidelines, such as chat formats and safety guidelines 4m6s.
- Pre-training is a process of compression, where human knowledge is condensed into a set of weights that understand patterns in language, but it requires significant compute and data, and the resulting model may not always produce sensible or safe outputs, highlighting the need for post-training to refine the model's performance 6m6s.
- Data scarcity is a major challenge in both pre-training and post-training. The objective in pre-training is to minimize loss on next-token prediction, but the supply of data is limited, and labs are approaching the frontier of what is available 10m6s.
Challenges and Solutions in Data Scarcity
- The Chinchilla scaling laws are mentioned, which involve scaling model size and data together to improve performance, but this approach is limited by the availability of data, and new approaches may be needed to continue improving model performance 12m6s.
- The current state of AI training requires significant computational power and large amounts of data, which only large labs can afford because of the huge capital expenditure involved. For post-training, various methods such as supervised approaches and reinforcement learning are being explored 10s.
- In 2025, RLVR (Reinforcement Learning from Verifiable Rewards) gained prominence, and an article by Karpathy discusses its development and applications, which are explored further in the discussion 2m6s.
- The issue of running out of data is being addressed by generating new data through AI, with most tokens in the world expected to be AI-generated in the future, and companies like Scale and Mercor are working on creating new data sources 4m42s.
- To overcome the data limitation, researchers are focusing on pre-training data, which involves scaling up existing data through methods like scanning ancient books and synthetic generation, as well as developing new architectural advances to make better use of available data 6m15s.
- Another approach is to construct RL environments, where models can learn by interacting with a simulated world, exchanging compute for lower-quality data, and learning from verifiable rewards, which can lead to more effective training than pre-training alone 8m30s.
Reinforcement Learning Environments and Training
- The use of RL environments allows models to learn by trying to implement specific features or tasks, with the model attempting the task hundreds or thousands of times, and receiving rewards based on its performance, enabling more efficient learning 10m45s.
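The idea of a verifiable reward can be sketched as follows: each attempt is scored by running it against unit tests, so no human judgment is needed. This is a toy illustration, not how any lab's environment is implemented; the function names and test format are hypothetical, and a real environment would execute candidates in a sandbox rather than with a bare `exec`.

```python
TestCase = tuple[tuple, object]  # (positional args, expected result)


def verifiable_reward(candidate_source: str, func_name: str, tests: list[TestCase]) -> float:
    """Return the fraction of test cases the candidate passes (0.0 to 1.0)."""
    if not tests:
        raise ValueError("at least one test case is required")
    namespace: dict = {}
    try:
        # WARNING: toy only. Real environments sandbox untrusted code.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # does not compile or does not define the function
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests)


if __name__ == "__main__":
    tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    attempts = [
        "def add(a, b): return a - b",  # wrong logic
        "def add(a, b) return a + b",   # syntax error
        "def add(a, b): return a + b",  # correct
    ]
    for source in attempts:
        print(verifiable_reward(source, "add", tests))
```

In an RL run, the model would produce many such attempts per task, and the rewards would drive the policy update.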
Evaluations and Benchmarking in AI
- The discussion also touches on the topic of evaluation (eval), which is an important aspect of AI training, but the details of eval are not fully explored in this segment 12m50s.
- Evaluations, or evals, are crucial because they serve as a benchmark for models, allowing them to understand how they perform on specific tasks and defining what good and bad outcomes look like, which is essential for training models on reward functions 10s.
- Evals are highly protected assets for labs because they set the roadmap for model development, and having a well-defined eval enables labs to optimize their models towards a specific goal. SWE-bench is given as an example of an eval that started the code model race 1m20s.
- The concept of evals is also important for enterprises, as each enterprise has its own definition of good and bad, and they require specialized models that can optimize towards their specific standards, leading to a tiered effect with different evals for labs and enterprises 2m40s.
Applied Compute and Enterprise Applications
- Applied Compute was founded to address the need for specialized models for enterprises, with the goal of helping them optimize their models towards their specific needs and differentiate themselves from competitors 4m50s.
- The company's approach involves creating specialized systems and training models for individual enterprises, with an example being their work with DoorDash, where they helped the company with tasks such as onboarding over 100,000 merchants to their platform every year 7m10s.
- Merchants coming to the platform supply unstructured information about their business, including menus. Extracting and digitizing a menu is a hard task, especially when following a specific style guide like DoorDash's, which requires understanding what can be mixed and matched, what's an add-on, and what's a special ingredient 10s.
- To solve this task, using general models was not effective, but having humans correct the model's outputs and then optimizing the model directly against reducing the error rate was successful, allowing the company to define what good and bad looks like and optimize towards the desired outcomes 42s.
- In the past, the problem of digitizing menus would have been addressed with an OCR model or a conventional vision model, but in this case a Vision Language Model (VLM), a vision model built on the transformer architecture, was used 2m6s.
- Specializing an existing model can be more effective than waiting for a new, potentially better model like GPT-7, as enterprises care about being at the frontier at any point in time, and the time to value and ROI of training their own models today can be significant 4m30s.
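To make "optimizing against the error rate" concrete, here is one possible way to score model outputs against human-corrected versions. It is a sketch under stated assumptions: the field names (`name`, `price`, `is_add_on`) and the per-field comparison are hypothetical, not the actual DoorDash or Applied Compute metric.

```python
def field_error_rate(
    predicted: list[dict],
    corrected: list[dict],
    fields: tuple[str, ...],
) -> float:
    """Fraction of (item, field) pairs where the model disagrees with the human correction."""
    if len(predicted) != len(corrected):
        raise ValueError("predicted and corrected must have the same number of items")
    total = len(corrected) * len(fields)
    if total == 0:
        return 0.0
    errors = sum(
        1
        for pred, gold in zip(predicted, corrected)
        for field in fields
        if pred.get(field) != gold.get(field)
    )
    return errors / total


if __name__ == "__main__":
    fields = ("name", "price", "is_add_on")  # hypothetical schema
    model_output = [
        {"name": "Margherita", "price": 12.0, "is_add_on": False},
        {"name": "Extra cheese", "price": 2.0, "is_add_on": False},
    ]
    human_corrected = [
        {"name": "Margherita", "price": 12.0, "is_add_on": False},
        {"name": "Extra cheese", "price": 2.0, "is_add_on": True},
    ]
    print(field_error_rate(model_output, human_corrected, fields))
```

A metric like this doubles as an eval and as a training signal: one minus the error rate can serve as the reward when tuning the model.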
Efficiency in Model Training and Compute Budgets
- The investment in post-training is significantly lower than in pre-training: an estimated 5% or so of pre-training compute, based on the example of DeepSeek-V3 and DeepSeek-R1, which were trained on 2.4-2.5 million H800 hours and 150K hours, respectively 8m30s.
- The trend of compute going mostly to pre-training is starting to change, with people now doing datacenter-wide and multi-datacenter reinforcement learning (RL) runs, which massively increase the batch size of each training step and result in better performance 10s.
- There are three scaling laws: pre-training scaling, post-training scaling, and test-time (inference) scaling. Post-training scaling can significantly increase the batch size and lead to better performance 10s.
- The compute spent on RL is increasing quite heavily and is expected to go up as a relative percent of the total training budget, with more compute resulting in better performance 42s.
- A model was recently put into production with Cognition's Windsurf that can check code for bugs in under two seconds, an example of a particularly useful application of RL beyond converting a menu into DoorDash's format 2m6s.
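The post-training share quoted earlier in this section can be checked with simple arithmetic. The snippet below uses only the GPU-hour figures as stated in the talk summary (they are not independently verified here); the function name is illustrative.

```python
def post_training_share(pretrain_gpu_hours: float, post_train_gpu_hours: float) -> float:
    """Post-training compute as a fraction of pre-training compute."""
    if pretrain_gpu_hours <= 0:
        raise ValueError("pretrain_gpu_hours must be positive")
    return post_train_gpu_hours / pretrain_gpu_hours


if __name__ == "__main__":
    post_hours = 150_000  # figure as quoted in the summary
    for pretrain_hours in (2_400_000, 2_500_000):  # quoted range
        share = post_training_share(pretrain_hours, post_hours)
        print(f"{pretrain_hours:,} H800 hours -> {share:.2%}")
```

With the quoted figures, the ratio comes out at roughly 6%, the same order of magnitude as the "about 5%" estimate.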
Model, Harness, and Context Co-Development
- The value added for Cognition and Windsurf is that they are extending their product suite from just writing code to also testing and bug checking, and this is an example of model, harness, and context co-development 2m6s.
- Model, harness, and context co-development involves working across multiple layers at once, including the application layer, the harness, and the context. Being able to plug into different data sources is extremely important for squeezing value out of models 2m6s.
- Expanding their product frontier in this way may be a true competitive advantage for Cognition, as it allows them to provide more value to their customers and differentiate themselves from others 2m6s.
Combining General and Specialized Models
- The development of AI models is advancing through the combination of general models and specialized models, such as fast sub-agents or agents trained on proprietary data, to create powerful systems 10s.
- Companies like Ramp Labs are using RL-trained models to improve product experiences, for example by enabling fast search inside spreadsheets 42s.
Continual Learning and Model Adaptation
- Continual learning is a key area of focus, which involves training AI models to learn from sparse rewards and adapt to changing environments, allowing them to improve over time 2m6s.
- Continual learning is a gradual process that relies on access to the right data, including feedback from users and understanding of the context in which the AI model is being used 2m6s.
- Examples of continual learning in practice include companies like Cursor, which has developed a model called Composer that can learn from user interactions and improve its performance through online training 2m6s.
- The process of continual learning can involve collecting data, taking training steps, and repeating the process to achieve improvement, with companies like Cursor seeing results in a matter of days or weeks 2m6s.
- Unlike offline training, continual learning in production environments presents unique challenges, such as dynamic environments and limited ability to replay and test scenarios 2m6s.
Online Training and Real-World Challenges
- Researchers experimented with taking a massive batch of conversations, denoising the gradient, and then taking a step to improve the model, with each step being quite big and involving a lot of samples, which can take hours per step 10s.
- The concept of context base is being explored, where agents can expend compute offline to analyze documents and past human interactions with agents, extracting learnings to improve performance downstream, resulting in a massive increase in performance while using the same amount of tokens 1m6s.
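The point about large batches "denoising the gradient" can be illustrated with a toy simulation: averaging more noisy per-sample gradients gives a steadier estimate of the true gradient. The sketch assumes each sample's gradient is the true value plus independent Gaussian noise; the constants and function names are hypothetical and unrelated to any production system.

```python
import random
import statistics

TRUE_GRADIENT = 1.0  # hypothetical scalar "true" gradient
NOISE_STD = 5.0      # hypothetical per-sample noise level


def batch_gradient_estimate(rng: random.Random, batch_size: int) -> float:
    """Average of noisy per-sample gradients for one training step."""
    samples = [rng.gauss(TRUE_GRADIENT, NOISE_STD) for _ in range(batch_size)]
    return statistics.fmean(samples)


def estimate_spread(batch_size: int, trials: int = 200, seed: int = 0) -> float:
    """Standard deviation of the batch estimate across repeated steps."""
    rng = random.Random(seed)
    estimates = [batch_gradient_estimate(rng, batch_size) for _ in range(trials)]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    for batch_size in (16, 256, 1024):
        print(batch_size, round(estimate_spread(batch_size), 3))
```

Under this noise model the spread shrinks roughly with the square root of the batch size, which is one reason a step built from a massive batch of conversations can be both large and slow to assemble.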
Innovations in Model Architecture and Context
- Innovations are expected at the level of weight updates, context, and the harness itself to capture information, with examples including context base and other approaches 2m6s.
- There is discussion about non-transformer models, with some arguing that transformers are not an efficient architecture and that alternative models like the Mamba architecture could be a dominant player in the future 3m42s.
Debates on Transformer Architecture and Future Models
- The opinion is that scaling transformers is currently working, and it is likely that AI will discover better architectures through scaling rather than human innovation, although there are smart people on the other side of the debate, including Ilya Sutskever and Yann LeCun 5m6s.
- The core insight of those who disagree is that pre-training levels of data are not necessary to learn underlying representations of language, and that humans do not need this level of data, therefore, the architecture should not require it 6m42s.
- Investments are being made in compute scale outs, and some people are optimizing for the architecture directly in the chips, but currently, most labs are investing in the transformer architecture, with research on new architectures being experimental 8m6s.
Compute Scarcity and Hardware Innovation
- Many companies, including Applied Compute, are facing a scarcity of compute, which is driving the demand for innovations in energy sources and more efficient chips, and this could lead to massive innovations in these areas 42s.
- To address the scarcity of compute, companies could focus on building better hardware and chip designs that reduce the cost of training, which could be a potential area of exploration 2m6s.
- Nvidia is considered a long-term winner in the compute and chip provider market, with a 75% margin on top of their chips, and they will continue to supply labs with their products 4m6s.
- However, there is a risk that labs might decide to in-house their own chip design, which could potentially disrupt Nvidia's dominance in the market 5m30s.
Data Market and Synthetic Data Generation
- The data market is considered a challenging space, with models getting smarter and making it harder to create new tasks, which could lead to a shift towards synthetic data generation 7m10s.
- Synthetic data generation is becoming more prevalent, especially for RL tasks, where models can exploit generator-verifier gaps, and this could change the data market landscape 9m20s.
- The best founders in the data market are those who can pivot and adapt to the changing landscape, and they will be the ones to shape the next wave of innovation in this space 11m30s.
Robotics Data and RL Environments
- Robotics data, such as egocentric data collected using a GoPro, is being utilized, and it is expected that RL environments will evolve and improve over time 10s.
Favorite AI Products and Visual Learning Tools
- The conversation shifted to discussing favorite AI products, with a mention of GPT, specifically image GPT2, also referred to as Image Duo, which is enjoyed for its ability to provide visual representations and walkthroughs of concepts 42s.
- Image Duo is particularly useful for individuals who struggle with design or prefer visual learning, as it can generate nice visual representations of how things work, such as when a course syllabus is fed into the tool 2m6s.








