Amanda Guan
Recap: Asana’s LLM Testing Playbook and Its Analysis of Claude 3.5 Sonnet

TLDR

Asana’s LLM Testing Playbook outlines their comprehensive QA process for evaluating large language models like Claude 3.5 Sonnet. The process includes unit testing, integration testing, end-to-end testing, and additional assessments for new models to ensure reliable and high-performance AI-powered features. This rigorous approach helps Asana maintain data integrity, response accuracy, and overall model quality, ensuring their AI tools exceed user expectations.

Unit Testing

Unit testing is the cornerstone of Asana’s LLM QA process. Asana’s LLM Foundations team built an in-house unit testing framework that lets engineers evaluate LLM responses much as they would traditional software units. This matters because LLMs often produce slightly different outputs even when given identical input, so exact-match assertions are too brittle. Instead, Asana uses an LLM to check whether a response satisfies natural-language assertions, ensuring that key details, such as task deadlines, are accurately captured by the model.
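As a rough illustration only (this is not Asana’s internal framework), an LLM-backed assertion in a unit test might look like the sketch below. call_llm() and llm_judge() are hypothetical helpers standing in for a real model client and judge prompt, and the task name and date are made up.

```python
# Minimal sketch of an LLM-backed unit test; all names and data are illustrative.

def call_llm(prompt: str) -> str:
    """Send a prompt to the model under test and return its text response."""
    raise NotImplementedError  # wire up your LLM client here


def llm_judge(response: str, assertion: str) -> bool:
    """Ask a judge model whether the response satisfies a natural-language assertion."""
    verdict = call_llm(
        "Answer only YES or NO. Does the following response satisfy the assertion?\n"
        f"Assertion: {assertion}\nResponse: {response}"
    )
    return verdict.strip().upper().startswith("YES")


def test_task_deadline_is_reported():
    # Outputs vary between runs, so assert on meaning rather than exact strings.
    response = call_llm("When is the 'Launch homepage redesign' task due?")
    assert llm_judge(response, "The response states the task is due on June 14.")
```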

Asana’s unique “needle-in-a-haystack” test is a prime example of their rigorous testing methodology. In this test, the model is required to find relevant data within a vast project, ensuring that it can synthesize accurate answers from large datasets. The diagram below illustrates the elements of Asana's unit testing framework:

[Image: Asana’s unit testing framework]

For instance, one test might involve querying the model to identify a project’s launch date buried within extensive documentation. The model’s ability to consistently find and report this detail accurately demonstrates its effectiveness in real-world applications.
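A needle-in-a-haystack check can be sketched in the same style: bury one known fact in a large body of filler project text and assert that the model surfaces it. The filler generator, position, and launch date below are all assumptions for illustration, and call_llm() is the hypothetical client from the earlier sketch.

```python
# Hedged sketch of a needle-in-a-haystack test; data and positions are invented.

def build_filler_tasks(n: int) -> list[str]:
    return [f"Task {i}: routine status note with no launch information." for i in range(n)]


def test_finds_launch_date_in_large_project():
    needle = "Decision: the project launch date is 2024-09-30."
    haystack = build_filler_tasks(5000)
    haystack.insert(3721, needle)  # bury the fact at an arbitrary position

    prompt = (
        "Using only the project notes below, what is the project launch date?\n\n"
        + "\n".join(haystack)
    )
    response = call_llm(prompt)  # same hypothetical client as above
    assert "2024-09-30" in response
```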

Integration Testing

Integration testing at Asana involves assessing how well the LLM can manage complex workflows that require chaining multiple prompts together. This is particularly important for AI-powered features that rely on the LLM’s ability to retrieve data and generate accurate user-facing responses based on that data.

For example, Asana’s LLM might be tested on its ability to retrieve specific project updates and then summarize those updates in a clear, user-friendly format. The integration tests ensure that these chains of prompts work together cohesively before new features are released. The diagram below represents the integration testing framework:

[Image: Asana’s integration testing framework]

This method ensures that features like Asana’s AI-powered task management systems can reliably assist users in their daily workflows, providing them with the accurate information they need.
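A chained integration test could look roughly like the sketch below, reusing the hypothetical call_llm() and llm_judge() helpers from the unit testing example. The prompts, updates, and assertion are illustrative, not Asana’s actual feature code.

```python
# Illustrative two-step prompt chain (retrieve, then summarize) for an integration test.

def test_retrieve_then_summarize_chain():
    project_updates = [
        "2024-06-03: API migration is 80% complete.",
        "2024-06-10: Launch pushed one week due to QA findings.",
    ]

    # Step 1: ask the model to pull out the updates relevant to the launch.
    relevant = call_llm(
        "List only the updates that affect the launch timeline:\n" + "\n".join(project_updates)
    )

    # Step 2: feed the first step's output into a summarization prompt.
    summary = call_llm("Summarize these updates for an executive in two sentences:\n" + relevant)

    # The chain passes only if the user-facing summary reflects the schedule change.
    assert llm_judge(summary, "The summary mentions that the launch was delayed by one week.")
```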

End-to-End Testing

End-to-end (e2e) testing at Asana is designed to simulate the actual experience of their customers. By using realistic data in sandboxed test instances of Asana, the team can evaluate the LLM’s performance in scenarios that closely mirror real-world usage.

While this type of testing is more time-consuming and requires manual evaluation by product managers, it provides invaluable insights into the model's overall quality, including aspects of intelligence that are difficult to quantify through automated tests. For instance, end-to-end testing might involve a comprehensive scenario where the LLM needs to handle a multi-step project planning task from start to finish, including generating updates and identifying potential risks. The end-to-end testing framework is depicted below:

[Image: Asana’s end-to-end testing framework]

Through these rigorous tests, Asana ensures that their AI-powered tools can handle complex, real-world tasks with a high degree of reliability and intelligence.
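As a loose illustration, such a scenario might be captured as a small spec that product managers then review by hand in a sandboxed instance. The fields and steps below are assumptions, and the actual evaluation stays with human reviewers rather than this code.

```python
# Rough sketch of recording an end-to-end scenario and its manual review notes.

from dataclasses import dataclass, field

@dataclass
class E2EScenario:
    name: str
    steps: list[str]
    reviewer_notes: dict[str, str] = field(default_factory=dict)  # filled in by a PM

scenario = E2EScenario(
    name="Multi-step project planning",
    steps=[
        "Ask the model to draft a project plan from a goal statement.",
        "Request a weekly status update partway through the plan.",
        "Ask it to flag risks that could slip the launch date.",
    ],
)
```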

Additional Tests for New Models

When testing pre-production models like Claude 3.5 Sonnet, Asana employs additional assessments to measure performance metrics such as time-to-first-token (TTFT) and tokens-per-second (TPS). These tests are crucial for ensuring that the LLM can respond quickly and efficiently, providing a smooth user experience.
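The timing side of these checks is straightforward to sketch: wrap whatever streaming client is in use and record when the first token arrives and how quickly the rest follow. Only the timing logic is shown below; stream_tokens() in the usage comment is a placeholder, not a real API.

```python
# Minimal sketch of measuring time-to-first-token (TTFT) and tokens-per-second (TPS)
# from any token stream; the streaming client itself is assumed.

import time
from typing import Iterable

def measure_stream(tokens: Iterable[str]) -> tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    ttft = (first_token_at or end) - start
    generation_time = end - (first_token_at or end)
    tps = count / generation_time if generation_time > 0 else float(count)
    return ttft, tps

# Usage (stream_tokens is hypothetical):
# ttft, tps = measure_stream(stream_tokens("Summarize this project"))
```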

Moreover, Asana’s evaluation of Claude 3.5 Sonnet included a tool-use benchmark, which tested the model’s agentic capabilities. This involved both quantitative benchmarks and qualitative testing using Asana’s internal multi-agent prototyping platform. For example, one test might involve the LLM autonomously managing a series of tasks, making decisions, and adjusting workflows based on the data it receives. The additional testing framework for new models is shown below:

[Image: Additional testing framework for new models]
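A hedged sketch of a single tool-use case: given a small tool catalog and a user request, check that the model selects the expected tool with sensible arguments. call_llm_with_tools() and the tool schema are hypothetical stand-ins; a real benchmark would aggregate scores across many such cases.

```python
# Illustrative tool-use check; the client, tools, and expected arguments are assumptions.

TOOLS = [
    {
        "name": "update_task_due_date",
        "description": "Change the due date of a task.",
        "parameters": {"task_name": "string", "due_date": "YYYY-MM-DD"},
    },
    {
        "name": "post_status_update",
        "description": "Post a status update to a project.",
        "parameters": {"project": "string", "text": "string"},
    },
]

def call_llm_with_tools(prompt: str, tools: list[dict]) -> list[dict]:
    """Hypothetical client: returns tool calls as [{"name": ..., "arguments": {...}}]."""
    raise NotImplementedError  # wire up a real tool-calling client here

def test_model_picks_due_date_tool():
    calls = call_llm_with_tools("Push the 'Design review' task to July 12.", tools=TOOLS)

    assert calls, "expected at least one tool call"
    assert calls[0]["name"] == "update_task_due_date"
    assert calls[0]["arguments"]["due_date"].endswith("07-12")
```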

These additional tests provide deeper insights into the LLM’s capabilities, ensuring that it can be effectively integrated into Asana’s suite of AI tools.

Conclusion

Asana’s rigorous testing framework for evaluating frontier LLMs like Claude 3.5 Sonnet underscores their commitment to delivering reliable, high-performance AI-powered features. By implementing a comprehensive QA process that includes unit testing, integration testing, end-to-end testing, and additional assessments for new models, Asana ensures that their AI teammate remains a valuable and trusted tool for their users.

As the frontier of large language models continues to evolve, Asana’s investment in robust QA processes allows them to stay ahead of the curve, ensuring that their AI-powered features not only meet but exceed user expectations.

For more detailed insights, you can read the full article by Bradley Portnoy on Asana’s official website: Asana's LLM testing playbook: our analysis of Claude 3.5 Sonnet.
