Smart Chaos: LLMs, No More Human Modeling

#chaosengineering #awspartyrock #sre #genai

Modern enterprise distributed systems are very complex in nature, and, of course, it's hard to manage them too. This complexity arises from various issues, including those in cloud hardware, cloud networking services, databases, serverless or complex compute layers, and caching layers. Operating such a setup is highly challenging, and failures can occur at any or all logical layers. Sometimes, failures can happen in combination with other errors. Bugs may manifest long after changes are propagated to production. What's worse is that bugs may propagate to other layers too. Problems tend to exacerbate at higher levels of the system due to recursion.

If you look at the history, even the largest companies like Facebook have had their share of issues. In 2021, Fastly had an outage that impacted Amazon, eBay, Reddit, Spotify, Twitch, The Guardian, and The New York Times. The reason for this is that traditional testing is falling short.

The testing verifies the known, but in the dance with failures, the steps are often unrehearsed and spontaneous. The inability to test these unknowns is the greatest risk systems are carrying.
The answer for this is Chaos Engineering. Chaos engineering is able to identify weak points in the systems, which helps us to:

Develop cost-effective failover and restoration solutions
Observe how systems respond to real-world events
Build confidence against failure
Improve recovery time (MTTR)
Identify weaknesses and fix them proactively
Prepare and educate SREs

The emergence of Generative AI has enabled us to find an innovative solution for this problem. Generative AI refers to machine learning models that can create new, original content like text, images, audio, and more.
Two of the most prominent Generative AI use cases are Text generation and Image generation. Let's park this thought for a moment.
Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in the target environment. Let's see the key components of the chaos engineering workflow.
Below are the key processes of the Chaos Engineering workflow.

One thing to highlight is, parallel to your chaos experiments, you need to measure everything by enabling observability.
Before we go further, feel free to check the app i developed - Smart Chaos powered by AWS

Now let's see how we can leverage GenAI in Chaos Engineering, going through each and every process in the workflow

Discovery Phase: Leverage GenAI for anomaly detection in historical data, uncovering potential weaknesses or areas of interest.
Dependency Analysis: Utilize GenAI to analyze system dependencies, identifying points of failure or vulnerabilities in the architecture.
Steady State Definition: Train GenAI models to predict performance metrics, aiding in defining the system's steady state.
Hypothesis Creation: Implement GenAI for automated hypothesis generation.
Experiment Design: Use GenAI for scenario simulation, assisting in designing controlled experiments and optimizing testing efficiency.
Blast Radius Definition: Utilize GenAI to assess the potential blast radius of chaos experiments, considering dependencies and system topology.
Rumsfeld Matrix - "Known Knowns, Known Unknowns, and Unknown Unknowns": Leverage GenAI for data analysis to categorize knowns and unknowns, aiding in identifying potential unknown unknowns.
Monitoring and Analysis: Integrate GenAI models into real-time monitoring for anomaly detection during chaos experiments and root cause analysis.
Documentation and Reporting: Employ GenAI for automated reporting, summarizing chaos experiment outcomes and consolidating insights.
Iterative Improvement: Establish a continuous feedback loop with GenAI, allowing for adaptive experimentation and improved chaos engineering approaches.

Now, that's what I think a comprehensive solution would look like. Let me walk you through what I have done with the AWS PartyRock.
Smart Chaos Apps use LLMs to automate the Chaos Engineering workflow:

Discovery Phase: Refer to the architecture diagram or service map to discover services using GenAI.
Dependency Analysis: Utilize GenAI to analyze system dependencies, identifying points of failure or vulnerabilities in the architecture.
Steady State Definition: Utilize GenAI for defining the system's steady state.
Hypothesis Creation:
Utilize GenAI for hypothesis generation.
Experiment Design: Use GenAI for scenario/experiment design simulation.
Blast Radius Definition: Utilize GenAI to assess the potential blast radius of chaos experiments, considering dependencies and system topology.
Rumsfeld Matrix - "Known Knowns, Known Unknowns, and Unknown Unknowns": Leverage GenAI for data analysis to categorize knowns and unknowns, aiding in identifying potential unknown unknowns.
Smart Chaos Chatbot: Leverage GenAI-based chatbot to improve the entire process (continuous improvement).

Let me walk through what it is doing:

Users can input a link to the System Architecture Diagram or Service Map details.
Dependency Analysis: LLM analyzes the provided architecture diagram, identifying key system dependencies for potential chaos testing.
Steady State Definition: LLM examines the architecture diagram, performs Dependency Analysis, and generates the top 10 Steady State Definitions for chaos testing.
Hypothesis Generation: LLM utilizes the architecture diagram, Dependency Analysis, and Steady State Definitions to automatically generate hypotheses supporting chaos testing.
Experiment Design: LLM considers the architecture diagram, Steady State Definitions, and Hypothesis Generation to create Chaos testing test cases and an experiment list.
Rumsfeld Matrix - Known Unknown: LLM reviews the architecture diagram and lists the Rumsfeld Matrix – Known Unknown for chaos testing.
Test Case: Users can input one of the generated Test Cases.
Blast Radius: LLM refers to the Test Case and the architecture diagram, generating the Blast Radius for the specified Test Case during chaos testing.
Chat Bot: Having the ability to provide additional on-demand assistance

The main challenges I faced were:
The primary challenge was determining the amount of data to feed to LLM. Although LLM performed well, I decided to simplify the process by allowing users to provide either an architecture diagram or a runtime service map. While additional data such as observability data could provide better details beyond service discovery, I opted for simplicity.

What are the best LLMs for this work?
There are a lot of LLM options available, and let's try to find out which LLMs are suitable for each use case

Discovery Phase - Claude: Claude excels at analyzing and understanding architecture diagrams and service maps. Its visual comprehension skills make it ideal for service discovery.
Dependency Analysis - Jurassic-2: Jurassic-2's capabilities in highly technical and logical analysis of systems architecture make it well-suited for analyzing dependencies.
Steady State Definition - Claude's: Advanced natural language generation capabilities and technical knowledge base make it well-suited for automatically generating potential steady state definitions during chaos engineering experiments.
Hypothesis Creation - Command: Command's conversational nature lends itself to rapid hypothesis ideation and iteration.
Experiment Design - Claude LLM: Claude's safety-focused capabilities help design chaos experiments that minimize the blast radius.
Blast Radius Definition - Jurassic-2: Jurassic-2's technical precision helps accurately assess the potential blast radius across dependencies.
Rumsfeld Matrix - Liama 2: Liama 2's nuanced language understanding aids in categorizing knowns/unknowns and identifying unknown unknowns.
Test cases - Claude: Claude's natural language capabilities and technical knowledge allow it to effectively comprehend and analyze user input test cases, making it well-suited for the Test Case input step in Smart Chaos experiments.
Chatbot - Claude: Claude's natural language capabilities make it the ideal choice for powering a Smart Chaos chatbot.

What is needed to get Smart Chaos develop outside PartyRock?

Finally, let's explore how we can migrate it from PartyRock to host on AWS. Here are some key considerations for moving the Smart Chaos application to a dedicated deployment on AWS:

Frontend: Create an Angular application for the frontend UI. Host the static assets in an S3 bucket and serve them through CloudFront for performance.
Backend: Implement the core backend functions like discovery, dependency analysis etc. as Lambda functions.
Leverage Step Functions to orchestrate the workflow between the Lambda functions.
Connect the Lambdas to AWS Bedrock to access the LLMs for processing.
Storage: Upload architecture diagrams and test cases to S3 buckets.
Use Amazon Elasticsearch Service for the knowledge graph storage and lookups.

In summary, to transition the Smart Chaos application from PartyRock to AWS, you'll need to create an Angular frontend, implement core backend functions as Lambda functions, orchestrate workflows with Step Functions, connect Lambdas to AWS Bedrock for LLM processing, utilize S3 for storage, and employ Amazon Elasticsearch Service for knowledge graph storage.

Btw, if your looking for quick solution to take out your PartyRock app, there is nice app develop by Stephen Sennett - GenStack - Bring PartyRock apps to your place
This post is an extension of the presentation I did as part of Conf42 Chaos Engineering 2024 - 'Smart Chaos: Leveraging Generative AI.' You can watch the video here. Here again I used AWS PartyRock for the POC.

DEV Community

Smart Chaos: LLMs, No More Human Modeling

Top comments (0)

Read next

How a Pod is Deleted - Behind the Scenes Breakdown

How to Set up Disk and Bandwidth Limits in Docker

How To Fix OOMKilled

Generative AI Call Center