Danilo Poccia for AWS

Beyond Traditional Testing: Addressing the Challenges of Non-Deterministic Software

The development of non-deterministic software systems has become increasingly common. From distributed systems with untrusted inputs to AI-powered solutions, there is a growing challenge in ensuring reliability and consistency in environments that are not fully predictable. The integration of Large Language Models (LLMs) and other AI technologies can in fact introduce outputs that change every time they are computed.

Non-deterministic software, by its very nature, can produce different outputs for the same input under seemingly identical conditions. This unpredictability presents significant challenges for testing.

This article explores some of the fundamental characteristics of non-deterministic software, discusses established best practices for testing such systems, examines recent innovations in the field with a focus on AI-driven techniques, and provides practical examples complete with Python code samples. It’ll also investigate the unique challenges posed by LLMs in software testing and offer guidance on implementing a comprehensive testing strategy for these complex systems.

All code is available in this repository. All examples are in Python but the same ideas can be applied to any programming language. At the end, I recap a few testing frameworks that can be used with other programming languages, including C#, Java, JavaScript, and Rust.

Characteristics and Challenges of Non-Deterministic Software

Non-deterministic software can be seen as a reflection of the complex, often unpredictable world we live in. Unlike deterministic software systems, which produce the same output for a given input every time, non-deterministic systems introduce an element of variability.

Non-determinism in software can arise from various sources such as inherent randomness in the algorithms being used or the effect of an internal state that is not observable from the outside. It might also be the result of numerical computing errors. For example, when dealing with floating-point arithmetic, tiny rounding errors can accumulate and lead to divergent results.
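
For example, floating-point addition is not associative, so simply regrouping the same computation (as a parallel implementation might) produces slightly different results:

a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)  # False
print(a)       # 0.6000000000000001
print(b)       # 0.6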

A newer source of non-determinism is the integration of generative AI components, such as Large Language Models (LLMs), because at every invocation their outputs can vary significantly for the same input.

To demonstrate non-deterministic behavior, let's have a look at a simple Python example:

import random
from typing import Optional

def non_deterministic_function(x: int) -> Optional[int]:
    if random.random() < 0.1:  # 10% chance of failure
        return None
    return x * 2

# Running this function multiple times with the same input
from collections import Counter

results = Counter()
for _ in range(20):
    result = non_deterministic_function(5)
    results[result] += 1

for result, count in results.items():
    print(f"Result {result}: {count} times")


If you run this code, it will return 10 most of the time, because the function doubles the input 5, but about 10% of the time it returns None. This simple example illustrates the core challenge of testing non-deterministic software: how do you write tests for a function that doesn’t always behave the same way?

To tackle these challenges, we can adapt traditional testing methods and implement new approaches. From property-based testing to AI-driven test generation, the field of software testing is evolving to meet the demands of an increasingly non-deterministic digital world.

Effective Testing Strategies for Non-Deterministic Software

Testing non-deterministic software requires a shift in how we approach software quality assurance. One interesting approach we can use to test non-deterministic software is property-based testing.

With property-based testing, rather than writing tests for specific input-output pairs, you define properties that should hold true for all possible inputs. The testing framework then generates a large number of random inputs and checks if the defined properties hold for each of them.

Let’s look at an example of property-based testing using the Hypothesis library in Python:

import random
from typing import List
from hypothesis import given, strategies as st

def non_deterministic_sort(lst: List[int]) -> List[int]:
    """A non-deterministic sorting function that occasionally makes mistakes."""
    if random.random() < 0.1:  # 10% chance of making a mistake
        return lst  # Return unsorted list
    return sorted(lst)

@given(st.lists(st.integers()))
def test_non_deterministic_sort(lst: List[int]) -> None:
    result = non_deterministic_sort(lst)

    # Property 1: The result should have the same length as the input
    assert len(result) == len(lst), "Length of the result should match the input"

    # Property 2: The result should contain all elements from the input
    assert set(result) == set(lst), "Result should contain all input elements"

    # Property 3: The result should be sorted in most cases
    attempts = [non_deterministic_sort(lst) for _ in range(100)]

    # We allow for some failures due to the non-deterministic nature
    # Replace 'any' with 'all' to make the test fail if any attempt is not sorted
    assert any(attempt == sorted(lst) for attempt in attempts), "Function should produce a correct sort in multiple attempts"

# Run the test
if __name__ == "__main__":
    test_non_deterministic_sort()

In this example, we're testing a non-deterministic sorting function that occasionally makes mistakes. Instead of checking for a specific output, we can verify properties that should hold true regardless of the function’s non-deterministic behavior. For example, we can check that the output has the same length as the input, contains all the same elements, and is correctly sorted in at least one of multiple attempts.

While property-based testing is powerful, it can be slow and costly when LLMs are involved in the test cases. This is because each test run may require multiple invocations of the LLM, which can be computationally expensive and time-consuming. Therefore, it’s crucial to carefully design property-based tests when working with LLMs to balance thoroughness with efficiency.
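
One practical mitigation is to cap how many examples the framework generates per test. Here’s a minimal sketch using the settings decorator from Hypothesis (mock_llm_summarize stands in for a real LLM call, and the limits are arbitrary):

from hypothesis import given, settings, strategies as st

def mock_llm_summarize(text: str) -> str:
    # Stand-in for a real LLM call; in practice this would invoke the model
    return text[:100]

@settings(max_examples=20, deadline=None)  # cap generated cases and disable the per-example time limit
@given(st.text(min_size=1, max_size=500))
def test_summary_properties(text: str) -> None:
    summary = mock_llm_summarize(text)
    assert isinstance(summary, str)
    assert len(summary) <= 100, "Summaries should stay within the size budget"

if __name__ == "__main__":
    test_summary_properties()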

Another crucial strategy for testing non-deterministic software is to check if it is feasible to create repeatable test environments. This involves controlling as many variables as possible to reduce the sources of non-determinism during testing. For example, you can use fixed random seeds, mock external dependencies, and use containerization to ensure consistent environments.
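
As a minimal sketch of this idea (the weather_advice function and its service dependency are made up for illustration), fixing the random seed and mocking the external call turn a non-deterministic function into something we can assert on exactly:

import random
from typing import Optional
from unittest.mock import Mock

def weather_advice(service, seed: Optional[int] = None) -> str:
    """Combines an external dependency with a bit of randomness."""
    rng = random.Random(seed)  # a seeded generator makes the randomness reproducible
    temperature = service.get_temperature("Seattle")  # external call, mocked in tests
    greeting = rng.choice(["Hi", "Hello", "Hey"])
    return f"{greeting}! It's {temperature}°F outside."

def test_weather_advice_is_repeatable() -> None:
    mock_service = Mock()
    mock_service.get_temperature.return_value = 68  # mock out the external dependency
    first = weather_advice(mock_service, seed=42)
    second = weather_advice(mock_service, seed=42)
    assert first == second, "Same seed and same mock should give the same output"

test_weather_advice_is_repeatable()
print("Repeatable test passed")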

When dealing with AI, especially LLMs, you can use semantic similarity measures to evaluate outputs rather than expecting exact matches. For instance, when testing an LLM-based chatbot, you might check if the model’s responses are semantically similar to a set of acceptable answers, rather than looking for specific phrases.

Here’s an example of how to test an LLM’s output using semantic similarity:

import json
import boto3
from typing import List, Callable

from scipy.spatial.distance import cosine

AWS_REGION = "us-east-1"
EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v2:0"

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def get_embedding(text: str) -> List[float]:
    body = json.dumps({"inputText": text})
    response = bedrock_runtime.invoke_model(
        modelId=EMBEDDING_MODEL_ID,
        contentType="application/json",
        accept="application/json",
        body=body
    )
    response_body = json.loads(response['body'].read())
    return response_body['embedding']

def semantic_similarity(text1: str, text2: str) -> float:
    embedding1 = get_embedding(text1)
    embedding2 = get_embedding(text2)
    return 1 - cosine(embedding1, embedding2)

def test_llm_response(llm_function: Callable[[str], str], input_text: str, acceptable_responses: List[str], similarity_threshold: float = 0.8) -> bool:
    llm_response = llm_function(input_text)
    print("llm_response:", llm_response)

    for acceptable_response in acceptable_responses:
        similarity = semantic_similarity(llm_response, acceptable_response)
        print("acceptable_response:", acceptable_response)
        if similarity >= similarity_threshold:
            print("similarity:", similarity)
            return True

    return False

# Example usage
def mock_llm(input_text: str) -> str:
    # This is a mock LLM function for demonstration purposes
    return "The capital of France is Paris, a city known for its iconic Eiffel Tower."

input_text = "What is the capital of France?"
acceptable_responses = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "France's capital is Paris, known for its rich history and culture."
]

result = test_llm_response(mock_llm, input_text, acceptable_responses)
print(f"LLM response test passed: {result}")

In this example, we use Amazon Bedrock to compute semantic embeddings of a simulated LLM’s response and a set of acceptable responses. Then, we use cosine similarity to determine if the LLM’s output is semantically similar enough to any of the acceptable responses.

On another note, an interesting development not strictly related to non-deterministic software testing is the use of LLMs themselves to generate test data and check test outputs. This approach leverages the power of LLMs to understand context and generate diverse, realistic test cases.

Here’s an example of generating structured test data in JSON format:

import json
from typing import Union, List, Dict, Any
import boto3

AWS_REGION = "us-east-1"

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"   

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def generate_structured_test_data(prompt: str, num_samples: int = 5) -> Union[List[Dict[str, Any]], None]:
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            'role': 'user',
            'content': [{ 'text': prompt }]
        }]
    )
    generated_data = response['output']['message']['content'][0]['text']
    try:
        json_data = json.loads(generated_data)
    except json.JSONDecodeError:
        print("Generated data is not valid JSON")
        return None  # or raise an exception

    return json_data

# Example usage
prompt = """Generate 5 JSON objects representing potential user inputs for a weather forecasting app.
Each object should have 'location' and 'query' fields.
Output the result as a valid JSON array.
Output JSON and nothing else.
Here's a sample to guide the format:
[
  {
    "location": "New York",
    "query": "What's the temperature tomorrow?"
  }
]"""

test_inputs = generate_structured_test_data(prompt)

print(json.dumps(test_inputs, indent=2))

In this example, we're using Amazon Bedrock and the Anthropic Claude 3.5 Sonnet model to generate structured JSON test inputs for a weather forecasting app. Using this approach, you can create a wide range of test cases, including edge cases that could be difficult to think of initially. These test cases can be stored and reused multiple times.

Similarly, LLMs can be used to check test outputs, especially for tasks where the correct answer might be subjective or context-dependent. This approach is more precise than just using semantic similarity but is slower and more costly. The two approaches can be used together. For example, if the semantic similarity test has passed, we then use an LLM for further checks.

import boto3
from typing import Any, Dict, List

AWS_REGION = "us-east-1"

MODEL_ID = "us.anthropic.claude-3-5-sonnet-20240620-v1:0"   

bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

def check_output_with_llm(input_text: str, test_output: str, prompt_template: str) -> bool:
    prompt = prompt_template.format(input=input_text, output=test_output)

    response: Dict[str, Any] = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[{
            'role': 'user',
            'content': [{ 'text': prompt }]
        }]
    )

    response_content: str = response['output']['message']['content'][0]['text'].strip().lower()
    if response_content not in ["yes", "no"]:
        raise ValueError(f"Unexpected response from LLM: {response_content}")
    return response_content == "yes"

# Example usage
input_text = "What's the weather like today?"
test_output = "It's sunny with a high of 75°F (24°C) and a low of 60°F (16°C)."
prompt_template = "Given the input question '{input}', is this a reasonable response: '{output}'? Answer yes or no and nothing else."

is_valid = check_output_with_llm(input_text, test_output, prompt_template)

print('input_text:', input_text)
print('test_output:', test_output)
print(f"Is the test output a reasonable response? {is_valid}")

In this example, we're again using an Anthropic Claude model to evaluate whether the system’s response is reasonable given the input question. Depending on the difficulty of the test, we can use a more or less powerful model to optimize speed and costs.

This approach can be used for testing chatbots, content generation systems, or any other application where the correct output isn’t easily defined by simple rules.

These strategies - property-based testing, repeatable environments, semantic similarity checking, and LLM-assisted test generation and validation - form the foundation for effective testing of non-deterministic software. They allow us to make meaningful assertions about system behavior even when exact outputs cannot be predicted.

Advanced Techniques for Testing Complex Non-Deterministic Systems

Using AI to generate test cases can go beyond generative AI and LLMs. For example, machine learning models can analyze historical test data and system behavior to identify patterns and generate test cases that are most likely to uncover bugs or edge cases that a human tester might miss.

Let's see an example of using a simple machine learning model to generate test cases for a non-deterministic function.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from typing import List, Tuple

# Simulated historical test data
# Features: input_a, input_b, system_load
# Target: 0 (pass) or 1 (fail)
X = np.array([
    [1, 2, 0.5], [2, 3, 0.7], [3, 4, 0.3], [4, 5, 0.8], [5, 6, 0.4],
    [2, 2, 0.6], [3, 3, 0.5], [4, 4, 0.7], [5, 5, 0.2], [6, 6, 0.9]
])
y = np.array([0, 0, 0, 1, 0, 0, 0, 1, 0, 1])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Function to generate new test cases
def generate_test_cases_from_historical_test_data(n_cases: int) -> np.ndarray:
    # Generate random inputs
    new_cases = np.random.rand(n_cases, 3)
    new_cases[:, 0] *= 10  # Scale input_a to 0-10
    new_cases[:, 1] *= 10  # Scale input_b to 0-10

    # Predict failure probability
    failure_prob = clf.predict_proba(new_cases)[:, 1]

    # Sort cases by failure probability
    sorted_indices = np.argsort(failure_prob)[::-1]

    return new_cases[sorted_indices]

# Generate and print top 5 test cases most likely to fail
top_test_cases = generate_test_cases_from_historical_test_data(100)[:5]
print("Top 5 test cases most likely to fail:")
for i, case in enumerate(top_test_cases, 1):
    print(f"Case {i}: input_a = {case[0]:.2f}, input_b = {case[1]:.2f}, system_load = {case[2]:.2f}")

This example demonstrates the use of a random forest classifier to generate test cases that are more likely to uncover issues in a system. Models can be better than humans at learning from historical data to predict which combinations of inputs and system conditions are most likely to cause failures.

Another related technique is the use of chaos engineering for testing non-deterministic systems. For example, you can deliberately introduce failures and perturbations into a system to test its resilience and identify potential issues before they occur in production.

For instance, you can randomly terminate instances in a distributed system, simulate network latency, or inject errors into data streams. By systematically introducing chaos in a controlled environment, you are able to uncover weaknesses in a system that might not be apparent under normal testing conditions.
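
Here’s a minimal sketch of this idea (the processing function, error rate, and delay are arbitrary): a wrapper injects latency and failures into a data-processing step, so we can observe how the surrounding code copes:

import random
import time
from typing import Any, Callable, Dict

def chaos_wrapper(func: Callable[..., Any], error_rate: float = 0.2, max_delay: float = 0.05) -> Callable[..., Any]:
    """Wraps a function to randomly inject latency and failures."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        time.sleep(random.uniform(0, max_delay))  # simulate network latency
        if random.random() < error_rate:
            raise ConnectionError("Injected failure")  # simulate an infrastructure error
        return func(*args, **kwargs)
    return wrapped

def process_record(record: Dict[str, Any]) -> Dict[str, Any]:
    return {**record, "processed": True}

chaotic_process = chaos_wrapper(process_record)

successes, failures = 0, 0
for i in range(50):
    try:
        chaotic_process({"id": i})
        successes += 1
    except ConnectionError:
        failures += 1  # a real test would verify retries, fallbacks, and alerting here

print(f"Succeeded: {successes}, failed: {failures}")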

When it comes to testing AI-powered systems, especially those involving Large Language Models (LLMs), a similar approach is to use adversarial testing, where input prompts are designed to challenge the LLM’s understanding and generate edge cases.

Here’s an example of how to implement a simple adversarial testing framework for an LLM:

import random
import string
from typing import Callable

def generate_adversarial_prompt(base_prompt: str, num_perturbations: int = 3) -> str:
    perturbations = [
        lambda s: s.upper(),
        lambda s: s.lower(),
        lambda s: ''.join(random.choice([c.upper(), c.lower()]) for c in s),
        lambda s: s.replace(' ', '_'),
        lambda s: s + ' ' + ''.join(random.choices(string.ascii_letters, k=5)),
    ]

    adversarial_prompt = base_prompt
    for _ in range(num_perturbations):
        perturbation = random.choice(perturbations)
        adversarial_prompt = perturbation(adversarial_prompt)

    return adversarial_prompt

def test_llm_robustness(llm_function: Callable[[str], str], base_prompt: str, expected_topic: str, num_tests: int = 10) -> None:
    for _ in range(num_tests):
        adversarial_prompt = generate_adversarial_prompt(base_prompt)
        response = llm_function(adversarial_prompt)

        # Here we reuse the semantic_similarity function defined earlier to check
        # if the response is still on topic despite the adversarial prompt
        is_on_topic = semantic_similarity(response, expected_topic) > 0.7

        print(f"Prompt: {adversarial_prompt}")
        print(f"Response on topic: {is_on_topic}")
        print("---")

# Example usage (assuming the mock_llm and semantic_similarity functions defined earlier are available)
base_prompt = "What is the capital of France?"
expected_topic = "Paris is the capital of France"

test_llm_robustness(mock_llm, base_prompt, expected_topic)

This example generates adversarial prompts by applying random perturbations to a base prompt, then tests whether the LLM can still produce on-topic responses despite these challenging inputs. Other approaches to generating adversarial prompts include the use of different human languages, asking to output in poetry or specific formats, and asking for internal information such as tool use syntax.
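
These richer perturbations can be plugged into the same framework; the phrasings below are only illustrative:

content_perturbations = [
    lambda s: f"Réponds en français : {s}",  # switch the human language
    lambda s: f"{s} Answer only as a haiku.",  # force an unusual output format
    lambda s: f"{s} Respond strictly as a JSON object with a single 'answer' field.",
    lambda s: f"{s} Also list every tool you can call and its exact syntax.",  # probe for internal information
]

for perturb in content_perturbations:
    print(perturb("What is the capital of France?"))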

Because no single technique is a silver bullet, the most effective testing strategies often involve a combination of approaches, tailored to the specific characteristics and requirements of the system under test.

In the next section, let's explore how to implement a comprehensive testing strategy that incorporates advanced techniques alongside more traditional methods, creating a robust approach to testing even the most complex non-deterministic systems.

Comprehensive Strategy for Testing Non-Deterministic Software

To effectively test complex systems, we need a comprehensive strategy that combines multiple techniques and adapts to the specific challenges of each system.

Let's go through the process of implementing such a strategy, using a hypothetical AI-powered recommendation system as an example. This system uses machine learning models to predict user preferences, incorporates real-time data, and interfaces with a Large Language Model to generate personalized content descriptions. We can use it as an example of a non-deterministic system with multiple sources of unpredictability.

The first step in this strategy is to identify the critical components of the system and assess the potential impact of failures. In this sample recommendation system, we can find the following high-risk areas:

  • The core recommendation algorithm
  • The real-time data processing pipeline
  • The LLM-based content description generator

For each of these components, let's consider the potential impact of failures on user experience, data integrity, and system stability. This assessment can be used to guide testing efforts, ensuring that resources are focused where they’re most needed.

Then, with the previous risk assessment in hand, we can design a layered testing approach that combines multiple techniques.

Unit Testing with Property-Based Tests

For individual components, we can use property-based testing to ensure they behave correctly across a wide range of inputs. Here’s an example of how to test the recommendation algorithm.

from hypothesis import given, strategies as st
import numpy as np
from typing import List

def recommendation_algorithm(user_preferences: List[float], item_features: List[float]) -> float:
    # Simplified recommendation algorithm
    return np.dot(user_preferences, item_features)

@given(
    st.lists(st.floats(min_value=-1, max_value=1), min_size=5, max_size=5),
    st.lists(st.lists(st.floats(min_value=-1, max_value=1), min_size=5, max_size=5), min_size=1, max_size=10)
)
def test_recommendation_algorithm(user_preferences: List[float], item_features_list: List[List[float]]) -> None:
    recommendations = [recommendation_algorithm(user_preferences, item) for item in item_features_list]

    # Property 1: Recommendations should be in the range [-5, 5] given our input ranges
    assert all(-5 <= r <= 5 for r in recommendations), "Recommendations out of expected range"

    # Property 2: Higher dot products should result in higher recommendations
    sorted_recommendations = sorted(zip(recommendations, item_features_list), reverse=True)
    for i in range(len(sorted_recommendations) - 1):
        assert np.dot(user_preferences, sorted_recommendations[i][1]) >= np.dot(user_preferences, sorted_recommendations[i+1][1]), "Recommendations not properly ordered"

# Run the test
test_recommendation_algorithm()

Integration Testing with Chaos Engineering

To test how our components work together under various conditions, we can use chaos engineering techniques. For example, we can randomly degrade the performance of the real-time data pipeline, simulate network issues, or introduce delays in API responses. This helps ensure the system remains stable even under suboptimal conditions.
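
As a sketch of what such a test could look like, we can reuse the chaos_wrapper idea from the previous section on a hypothetical fetch_realtime_features call and check that the system degrades gracefully instead of crashing:

from typing import Callable, List

def fetch_realtime_features(user_id: str) -> List[float]:
    # Hypothetical call into the real-time data pipeline
    return [0.2, 0.4, 0.1, 0.9, 0.5]

def features_with_fallback(user_id: str, fetch: Callable[[str], List[float]]) -> List[float]:
    try:
        return fetch(user_id)
    except ConnectionError:
        return [0.0] * 5  # fall back to neutral features instead of failing the request

def test_recommendations_survive_pipeline_chaos() -> None:
    chaotic_fetch = chaos_wrapper(fetch_realtime_features, error_rate=0.5)
    for _ in range(20):
        features = features_with_fallback("user-123", chaotic_fetch)
        assert len(features) == 5, "The system should always produce a usable feature vector"

test_recommendations_survive_pipeline_chaos()
print("Chaos integration test passed")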

System Testing with AI-Generated Test Cases

For end-to-end testing, we can use AI to generate diverse and challenging test scenarios. This might involve creating complex user profiles, simulating various usage patterns, and generating edge case inputs for our LLM.
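
For instance, the generate_structured_test_data helper from earlier could be pointed at user-profile generation; the prompt below is just an illustration:

profile_prompt = """Generate 5 JSON objects representing user profiles for a recommendation system.
Each object should have 'age_group', 'interests' (a list of strings), and 'recent_activity' fields.
Include at least one unusual or edge-case profile, such as a user with no interests or contradictory activity.
Output the result as a valid JSON array and nothing else."""

test_profiles = generate_structured_test_data(profile_prompt)
print(json.dumps(test_profiles, indent=2))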

Continuous Monitoring and Adaptation

A good testing strategy doesn’t end when we deploy to production. We need robust monitoring and observability tools to catch issues that might not have surfaced during testing.

This includes:

  • Real-time performance monitoring
  • Anomaly detection algorithms to identify unusual behavior
  • A/B testing for gradual rollout of changes
  • User feedback collection and analysis

Observability tools often include native anomaly detection capabilities to help find important information across a large amount of telemetry data. For example, Amazon CloudWatch implements anomaly detection for metrics and logs.

Here’s a simple example of how to implement an anomaly detection system using basic statistical methods:

import numpy as np
from typing import List

class AnomalyDetector:
    def __init__(self, window_size: int = 100) -> None:
        self.window_size: int = window_size
        self.values: List[float] = []

    def add_value(self, value: float) -> None:
        self.values.append(value)
        if len(self.values) > self.window_size:
            self.values.pop(0)

    def is_anomaly(self, new_value: float, z_threshold: float = 3.0) -> bool:
        if len(self.values) < self.window_size:
            return False  # Not enough data to detect anomalies yet

        mean = np.mean(self.values)
        std = np.std(self.values)

        if std == 0:
            return False # To avoid an error if all values are the same.

        z_score = (new_value - mean) / std

        return abs(z_score) > z_threshold

# Usage
detector = AnomalyDetector()

# Simulate some normal values
for _ in range(100):
    detector.add_value(np.random.normal(0, 1))

# Test with a normal value
print(detector.is_anomaly(1.5))  # Probably False

# Test with an anomaly
print(detector.is_anomaly(10))  # Probably True

Alternative property-based testing tools for other programming languages

In these examples, I used the Hypothesis Python module. Here are a few interesting alternatives for other programming languages.

| Language | Recommended Library | Reasoning |
| --- | --- | --- |
| C# | FsCheck | Widely used in the .NET ecosystem, supports both C# and F#. |
| Clojure | test.check | Part of Clojure's core.spec, well-integrated with the language. |
| Haskell | QuickCheck | The original property-based testing library, still the standard in Haskell. |
| Java | jqwik | Modern design, good documentation, and seamless integration with JUnit 5. |
| JavaScript | fast-check | Actively maintained, well-documented, and integrates well with popular JS testing frameworks. |
| Python | Hypothesis | Most mature, feature-rich, and widely adopted in the Python ecosystem. |
| Scala | ScalaCheck | The de facto standard for property-based testing in Scala. |
| Ruby | Rantly | More actively maintained compared to alternatives, good integration with RSpec. |
| Rust | proptest | More actively developed than quickcheck for Rust, with helpful features like persistence of failing examples. |

Continuous Improvement

A good testing strategy includes a process for continuous improvement. This involves:

  • Regular review of test results and production incidents
  • Updating the test suite based on new insights
  • Staying informed about new testing techniques and tools
  • Adapting the testing strategy as the system evolves

Implementing a comprehensive testing strategy for non-deterministic software is no small task. It requires a combination of technical skill, creativity, and a deep understanding of the system under test. However, by combining multiple testing techniques, leveraging AI and machine learning, and maintaining a commitment to continuous improvement, we can create robust, reliable systems even in the face of uncertainty.

When we look to the future, the only thing we can be certain of is that the field of software testing will continue to evolve. New challenges will arise as systems become more powerful and complex. But with the foundations explored in this article - from property-based testing to AI-driven test generation, from chaos engineering to semantic similarity checking - there is a solid base on which to build.

The strategies and techniques discussed here are not set in stone. They’re a starting point, a foundation upon which you can add your own approach tailored to your specific needs and challenges. The world of software is ever-changing, and so too must be our approaches to testing it. I encourage you to embrace the uncertainty, stay curious, and never stop learning. The future of software testing is definitely interesting, and it’s in our hands to learn and, when possible, shape it.

To continue learning, have a look at the repository that includes all code in this article.

Top comments

Ricardo Sueiras

Incredible post Danilo, great stuff!

Danilo Poccia

Thank you Ricardo, much appreciated :-)