Introduction 0s
- The presentation is given by Sourabh Shirhatti, a product manager at Uber, and his colleague Matas Rastenis, a senior software engineer at Uber, who discuss their experience building agentic systems with Visual Studio Code extensibility and GitHub Copilot 11s.
- Uber is a well-known company that operates in over 10,000 cities, does 30 million trips a day, and has a massive code base of tens of millions of lines of code 42s.
- The company faces challenges with its large code base and the constant addition of new code, which led to the development of a tool to solve a specific problem, particularly around authoring tests 51s.
- The presentation will cover a packed agenda, including a live demo, and will also allow time for questions 59s.
- The speakers are excited about the announcements around Copilot extensibility, including Copilot's multi-file edit feature, and believe that their product has the right abstractions in place to take advantage of these features 1m38s.
- Although the product has not been updated to use the new features announced in the past 24 hours, the speakers are confident that the audience can extrapolate from what they will be showing 1m45s.
Problem Statement 1m50s
- Writing unit tests is extremely important while developing code, but it is hard on engineers, and it is difficult to make the tests complete enough and consistent with company-specific conventions 1m54s.
- Maintenance after the tests are written is also often a challenge, and a very complete suite carries a significant maintenance burden 2m15s.
- The process of writing a unit test at Uber involves 29 steps, which can be time-consuming and take away from the productivity of the engineer 2m30s.
- Most engineers do not enjoy writing unit tests every day, and the goal is to provide a tool to speed up the process and help arrive at better unit tests 2m45s.
- The initial tooling for writing unit tests consisted of prompt presets that could generate unit tests with Uber conventions, but they were clunky and not very usable in practice 3m28s.
- The initial tooling required engineers to read the documentation, get the prompts, put them in, generate the test, splice them in, test them, and iterate, which led to people not using them very often 3m43s.
- The goal is to provide a better tool to solve the problem of writing unit tests and make the process more efficient and productive 2m55s.
Introducing AutoCover 3m55s
- AutoCover is a product that is still being iterated on. It uses GitHub Copilot to generate tests through a chat participant and, after some internal looping, produces a table test that includes edge cases 4m22s.
- An agentic system is a system comprised of multiple steps, each performing a specific task, and these steps are chained together to allow agents to collaborate and specialize in solving problems 4m47s.
- By adding guardrails from deterministic tools like compilers and linters to the output of a probabilistic model, the overall precision and quality of the output can be greatly increased 5m24s.
- The approach involves breaking down a large problem like writing a test into smaller, solvable problems, similar to what a human would do when writing a unit test, and then feeding that into a model to get better results 6m1s.
- The process of writing a unit test involves diving into the code base, understanding how it works, figuring out the state of coverage, trying to write a test, compiling it, and feeding all that back into the system 6m22s.
- The presentation will walk through each of the systems involved in the process, starting with an example of a complex code snippet that is indicative of something that may be seen in a code base 6m43s.
How does it work? 6m59s
- The example code used for demonstration is a piece of source code that contains functionality to parse resource identifiers, such as URIs, and can be found in a util folder in prod, with one Go file and no existing tests. 7m0s
- With no test files, the code's coverage can simply be inferred to be zero, but as soon as at least one test case is added, coverage can no longer be inferred and must be calculated. 7m45s
- AutoCover is introduced as a tool that can be dropped into a code base to calculate coverage, starting with the preparer agent, which sets up the environment so that the rest of the agents can operate properly. 8m2s
- The preparer agent performs a setup step, including setting up the file, creating the file if needed, scaffolding out the file for some languages, and collecting context such as surrounding files and summaries of imports. 8m10s
- The preparer agent also performs other setup tasks, such as regenerating mocks, to prevent issues in subsequent steps. 8m48s
- The preparer agent's tasks can be done deterministically, and this is a key aspect of the system that will be referred to later. 9m3s
- The generator is the first probabilistic agent, which uses the collected intelligence to generate test cases for the uncovered lines of code, producing candidate test cases that can be refactored later. 9m21s
- The generator takes the target area and generates test cases, which are individual and atomic units that can be deterministically checked against build tooling and coverage tooling. 9m33s
- The initial phase of the process involves generating test cases along with the setup and teardown sequences for each specific test case, which are then packaged and put in shared memory for other agents to use 10m32s.
- The test generation is similar to the "/test" command in Copilot, which generates a test; everything else around it is machinery that helps produce better results 10m41s.
- The executor is a deterministic agent that does a lot of work to determine whether the test cases are initially good, and it runs tooling to make sure dependencies are imported and the file is syntactically correct 11m7s.
- The executor tries out the test cases one by one, and if a test case fails, the feedback is included in the shared memory for future agents to use 11m39s.
- Building the executor requires knowledge of all languages and frameworks used in the organization, and it can be challenging to consolidate and prepare the organization for this 12m16s.
- The results of the executor are then passed to the fixer, a probabilistic step that has a specific purpose of fixing failed test cases 13m4s.
- The fixer sends parallel prompts or calls to the language model (LM), one per test case, which gives each call more of the context window to elaborate and produce better results 13m26s.
- The process is built in LangGraph, which provides a nice abstraction for doing these types of things 13m38s.
- Fixer is a tool that addresses issues in test cases by analyzing compilation errors and making necessary changes, such as removing or adding versions, to make the test cases work properly 13m58s.
- Fixer can collaborate with the executor to mark candidate test cases for re-execution, as the initial fix may not be valid, and it may take multiple iterations to get it right 14m32s.
- The system is a complex state machine that manages the flow of test cases, and it's not just a linear flow through the system 14m51s.
- Once fixed, test cases are passed to the refactor agent, which ensures that they adhere to Uber's best practices and conventions and combines them into larger table tests 15m9s.
- The refactor agent follows a directive to consult Uber's best practices and conventions, which are modular and split up by language 15m13s.
- The abstraction used for an agent must support composability of multiple sub-agents, allowing for the integration of different types of tooling, such as syntax tree-based linters 15m42s.
- The refactor agent can integrate multiple different types of tooling, such as gofmt, Ruff, or Black, to reduce verbosity and improve test cases 16m9s.
- The validator is the QA component that checks if the generated test case asserts anything and if it asserts the right things 16m31s.
- The validator ensures that the test cases are not just blindly executing code and not asserting anything, which is a big worry for engineers 16m41s.
- The validator has multiple sub-agents, and it's the first one that is used to validate the test cases 16m56s.
- The sub-agents include a simple deterministic sub-agent that checks for assertions, a probabilistic sub-agent (the LLM Critic) that validates tests against a list of conventions, and a mutation tester. 17m0s
- The LLM Critic sub-agent goes through conventions and maps them back to the test, flagging inconsistencies and providing feedback, such as suggesting additional test cases for concurrency functionality. 17m14s
- The system uses a multi-shot approach to adherence, where a model is asked to generate or refactor a test, and then another model validates whether the first model complied with the style guide. 18m7s
- The agent abstraction needs to account for situations where the model cannot infer what's going on and needs to return control to the human, such as when there are insufficient doc comments or the code is written in an obtuse way. 18m43s
- The system is being developed to add more interaction paradigms and UI elements in the IDE, allowing the model to ask for more context when it doesn't understand the code. 19m5s
- An example of the validator's feedback is shown, where it suggests adding edge cases to a table test, such as checking for Unicode symbols or empty identifiers. 19m25s
- The system iterates on the test cases based on the feedback, passing it back to the fixer and executor to arrive at a new iteration of the test case with more comprehensive coverage. 19m50s
- The mutation testing subcomponent is being developed to further improve the test cases. 20m10s
- The concept of a mutant is introduced: a deliberate, faulty change is made to the source code, and the test suite is run to check whether it catches the change, with the goal of identifying potential bugs or shortcomings in the test suite 20m31s.
- An example of a mutant is provided, where the code is changed to parse a string as V1 instead of V2, which is a breaking change that the test suite should catch 21m5s.
- Another example of a mutant is given, where an error is swallowed instead of being returned, which is a contract-breaking change that the test suite should detect 21m31s.
- The mutants that survive (that is, the test suite still passes with the mutant in place) are bundled up into feedback and used to add more test cases that address the shortcomings found 21m55s.
- The generation of mutants can be done using both deterministic and model-based approaches, such as parsing the syntax tree and inverting if conditions, or using a large language model (LLM) to introduce subtle bugs 22m7s.
- The system uses a complex state machine to generate and test code, and the candidate code travels through each state without the user's knowledge 22m37s.
- A live demo is announced, in which the execution of AutoCover will be stopped at pre-selected states to show what the model or system was thinking and what the feedback is before the next iteration 23m2s.
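The deterministic flavor of mutant generation described above (parse the syntax tree, invert `if` conditions, rerun the tests) can be illustrated with Python's standard `ast` module. This is a sketch under stated assumptions rather than AutoCover's code: `classify`, `InvertIfConditions`, `load`, `weak_suite`, `strong_suite`, and `mutant_survives` are all hypothetical names, and the target is Python rather than the Go code used in the talk.

```python
import ast
from typing import Callable, Dict

# Hypothetical source under test.
SOURCE = '''
def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"
'''


class InvertIfConditions(ast.NodeTransformer):
    """Wrap every `if` condition in `not (...)` to create a mutant."""

    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node


def load(tree: ast.Module) -> Dict[str, object]:
    """Compile a syntax tree and return the namespace it defines."""
    namespace: Dict[str, object] = {}
    exec(compile(ast.fix_missing_locations(tree), "<mutant>", "exec"), namespace)
    return namespace


def weak_suite(ns: Dict[str, object]) -> bool:
    """Executes the code but asserts nothing, so it can never fail."""
    ns["classify"](5)
    return True


def strong_suite(ns: Dict[str, object]) -> bool:
    """Checks the actual results on both sides of the branch."""
    return ns["classify"](-1) == "negative" and ns["classify"](5) == "non-negative"


def mutant_survives(suite: Callable[[Dict[str, object]], bool]) -> bool:
    """Return True if the suite still passes against the mutated code."""
    mutant = InvertIfConditions().visit(ast.parse(SOURCE))
    return suite(load(mutant))


weak_result = mutant_survives(weak_suite)      # mutant survives: the suite is too weak
strong_result = mutant_survives(strong_suite)  # mutant is killed by the assertions
```

A surviving mutant is exactly the kind of signal that would be bundled into feedback: here it shows that `weak_suite` runs the code without checking anything, which is the failure mode the validator is meant to catch.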
Demo 23m21s
- A sample project is demonstrated, which includes "parse" and "parse V2" functionality, as well as a stringification tool, but lacks tests and has 0% coverage 23m22s.
- The project's source code is highlighted in red, indicating that it hasn't been tested, but some lines are not marked as uncovered because they are just declarations 23m50s.
- A more targeted approach is taken by scoping in on specific functions that need testing, rather than selecting the entire file and generating tests for it 24m22s.
- The process of generating tests is started by invoking the "autocover" command in the chat panel, which opens up a new test file automatically 24m48s.
- The "autocover" command is currently a fire-and-forget process, but there are plans to let users continue chatting with Copilot to ask questions, make corrections, or add more scenarios 25m8s.
- A debug viewer is used to show the intermediate state of the test generation process, which is not intended for human consumption 25m30s.
- The first state in the graph is shown, which is the initial state after the test cases have been generated, and the test configuration panel is introduced, which shows the current node, the number of tests generated, and the number of tests needed to cover the code 25m49s.
- The current state of the code is unknown because it hasn't been run yet, and the test case state is likewise unknown; at this point the system has generated a table test or fixture in a format that makes later refactoring easier 26m36s.
- The feedback section is empty because the test case was just generated, and there's no information on whether it passes or builds 27m5s.
- The imports were split up because sometimes interaction with the build system is necessary, such as running the package manager to pull in new dependencies 27m14s.
- After the initial run of the executor, the test functions remain the same, but a new test case state is added, showing that the generated code builds, the test passes, and coverage is raised 27m36s.
- The system currently uses coverage as its metric but plans to expand to other metrics, such as the mutant survival rate from mutation testing and branch coverage 27m55s.
- If a test fails, the system would show a different state with a build or test failure, and provide feedback in the form of output from the test runner or build system 28m28s.
- For the demo, the feedback is simplified to a plain textual field, but the system can also handle complex structured data that is passed back to the model for subsequent generation phases 29m0s.
- After the initial execution step, the system passes through a few more steps and reaches validation, where the new test state becomes "validation failed", indicating that something is wrong with the test case 29m26s.
- The system generates test cases and provides feedback on how to improve them, including suggestions for adding edge cases and improving readability 29m34s.
- The feedback is based on various heuristics, such as catching edge cases, improving readability, and using named parameters, which are continuously evolving 30m17s.
- The system also incorporates mutation testing results, which show up in the feedback if a surviving mutant is found 30m42s.
- The final state of the system is that it outputs a test case that has been iterated upon and improved based on the feedback, with the exact examples suggested in the feedback added to the test case 31m3s.
- The system can be adjusted to iterate a certain number of times, and the feedback is flagged as addressed once it has been incorporated into the test case 31m33s.
- An AutoCover user would see a populated test file with multiple tests and expansions, including the new edge cases that have been added 31m44s.
- The system is able to achieve 100% coverage of the file, with covered parts highlighted in green and, potentially, indicators for tested branches and covered code 32m20s.
- The system also provides metrics on the number of test cases that cover a particular piece of code, which is a new metric being experimented with 32m41s.
- AutoCover can generate around 110 lines of code from just 15 characters typed to invoke it, making it a keystroke-saving and effective tool, and many engineers at Uber use it 33m0s.
- If there is missing coverage, it is likely that the code is too complex or there is a bug in the source that should not be covered 33m17s.
- The goal is to expand the feature to support other languages, with Java and Python support currently in development and expected to be productionized at Uber 33m39s.
- A standalone mode is being developed to allow users to invoke features like validation logic and mutation testing without having to fully adopt AI-generated code 33m55s.
- The standalone mode aims to provide value to developers who are not ready to fully trust AI-generated code, but still want to use specific features 34m8s.
- The feature will allow users to get feedback earlier in the development cycle, rather than waiting until submitting a PR 34m21s.
- Mutation testing will be made easier by allowing users to invoke a tool to do it for them, making it more accessible to developers 34m30s.
- A consulter mechanism is being added to integrate with native UI or UX in the IDE, allowing agents to cede control back to humans when needed 34m47s.
- The consulter mechanism will integrate with features like chat boxes, doc comments, and other paradigms to prompt users for input 35m0s.
- The feature will also integrate with other entry points in the IDE, such as gutter indicators, CodeLens, and highlighting, to make it easier for users to discover and use 35m25s.