A Practical Guide to Reducing LLM Hallucinations with Sandboxed Code Interpreter

Most LLMs and SLMs (small language models) are not designed for calculations (reasoning models such as OpenAI's o1 or o3 aside). Just imagine the following dialogue:

  • Company: Today is Wednesday; you can return the delivery parcel within 24 hours.
  • Client: Okay, let's do it on Tuesday.

Are you sure the next AI response will be correct? As a human, you understand that next Tuesday is six days away, while 24 hours is just one day. Most LLMs, however, cannot reliably handle such logic: their responses are non-deterministic.
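
A few lines of ordinary Python, by contrast, resolve this deterministically every time. A minimal sketch (the helper name is hypothetical):

from datetime import date, timedelta

def days_until_next(weekday: int, today: date) -> int:
    """Days from today until the next given weekday (Mon=0 ... Sun=6)."""
    return (weekday - today.weekday()) % 7 or 7

today = date(2025, 1, 15)              # a Wednesday
print(days_until_next(1, today))       # next Tuesday -> 6 days away
print(today + timedelta(hours=24))     # the 24-hour deadline -> Thursday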

This issue worsens as the context grows. If you have 30 rules and a conversation history of 30 messages, the AI loses focus and makes mistakes easily.

Common Use-Case

  • You're developing an AI scheduling chatbot or AI agent for your company.
  • The company has scheduling rules that are frequently updated.
  • Before scheduling, the chatbot must validate customer input parameters.
  • If validation fails, the chatbot must inform the customer.

What Can We Do?

Combine traditional code execution with LLMs. This idea is not new but remains underutilized:

  • OpenAI integrates this feature into its Assistants API, but not into the Completions API.
  • Google recently introduced code interpreter capabilities in Gemini 2.0 Flash.


Our Solution Tech Stack

  • Docker (Podman)
  • LangGraph.js
  • Piston

Code Interpreter Sandbox

To run generated code securely, you need a sandbox. The most popular cloud code interpreters are e2b and, as mentioned above, those from OpenAI and Google.

However, I was looking for an open-source, self-hosted solution for flexibility and cost-effectiveness. Two good options stood out:

  • Piston
  • Jupyter

I chose Piston for its ease of deployment.


Piston Installation

It took me a while to figure out how to add a Python execution environment to Piston.

0. Enable cgroup v2

For Windows WSL, this article was helpful.
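
In my setup this came down to adding one kernel flag to %UserProfile%\.wslconfig and restarting WSL with wsl --shutdown (this assumes WSL2; verify against your own environment):

[wsl2]
kernelCommandLine = cgroup_no_v1=all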

1. Run a Container

docker run --privileged -p 2000:2000 -v d:\piston:'/piston' --name piston_api ghcr.io/engineer-man/piston

2. Clone the Piston Repository

git clone https://github.com/engineer-man/piston

3. Add Python Support

Run the following command:

node cli/index.js ppman install python

By default, this command uses your container API running on localhost:2000 to install Python.
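
To confirm the runtime was installed, you can query Piston's runtimes endpoint (GET http://localhost:2000/api/v2/runtimes) and check that a python entry appears in the list.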

Example Code Execution

Using the Piston Node.js Client:

import piston from "piston-client";

const codeInterpreter = piston({ server: "http://localhost:2000" });

const result = await codeInterpreter.execute('python', 'print("Hello World!")');

console.log(result);
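
If everything is wired up correctly, the returned object should contain the execution results under result.run; with Piston's response shape, result.run.stdout should be "Hello World!\n" and result.run.code should be 0.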

AI Agent Implementation

Source code on GitHub

We're going to use some advanced techniques:

  • Graph and subgraph architecture
  • Parallel node execution
  • Qdrant for storage
  • Observability via LangSmith
  • GPT-4o-mini, a cost-efficient LLM

Refer to the LangSmith trace for a detailed overview of the flow:
https://smith.langchain.com/public/b3a64491-b4e1-423d-9802-06fcf79339d2/r

Step 1: Extract Datetime-Related Scheduling Parameters from User Input

Example: "Tomorrow, last Friday, in 2 hours, at noon time."
We use code interpreter to ensure reliable extraction, as LLMs can fail even with current date-time contextual information.

Example Prompt for Python Code Generation:

Your task is to transform natural language text into Python code that extracts datetime-related scheduling parameters from user input.  

## Instructions:  
- You are allowed to use only the "datetime" and "calendar" libraries.  
- You can define additional private helper methods to improve code readability and modularize validation logic.  
- Do not include any import statements in the output.  
- Assume all input timestamps are provided in the GMT+8 timezone. Adjust calculations accordingly.  
- The output should be a single method definition with the following characteristics:  
  - Method name: \`getCustomerSchedulingParameters\`  
  - Arguments: None  
  - Return: A JSON object with the keys:  
    - \`appointment_date\`: The day of the month (integer or \`None\`).  
    - \`appointment_month\`: The month of the year (integer or \`None\`).  
    - \`appointment_year\`: The year (integer or \`None\`).  
    - \`appointment_time_hour\`: The hour of the day in 24-hour format (integer or \`None\`).  
    - \`appointment_time_minute\`: The minute of the hour (integer or \`None\`).  
    - \`duration_hours\`: The duration of the appointment in hours (float or \`None\`).  
    - \`frequency\`: The recurrence of the appointment. Can be \`"Adhoc"\`, \`"Daily"\`, \`"Weekly"\`, or \`"Monthly"\` (string or \`None\`).  

- If a specific value is not found in the text, return \`None\` for that field.  
- Focus only on extracting values explicitly mentioned in the input text; do not make assumptions.  
- Do not include print statements or logging in the output.  

## Example:  

### Input:  
"I want to book an appointment for next Monday at 2pm for 2.5 hours."  

### Output:  
def getCustomerSchedulingParameters():  
    """Extracts and returns scheduling parameters from user input in GMT+8 timezone.  

    Returns:  
        A JSON object with the required scheduling parameters.  
    """  
    def _get_next_monday():
        """Helper function to calculate the date of the next Monday."""
        # The sandbox imports the datetime module, so references are fully qualified.
        current_time = datetime.datetime.utcnow() + datetime.timedelta(hours=8)  # Adjust to GMT+8
        today = current_time.date()
        days_until_monday = (7 - today.weekday()) % 7  # Monday is 0
        return today + datetime.timedelta(days=days_until_monday)

    next_monday = _get_next_monday()  
    return {  
        "appointment_date": next_monday.day,  
        "appointment_month": next_monday.month,  
        "appointment_year": next_monday.year,  
        "appointment_time_hour": 14,  
        "appointment_time_minute": 0,  
        "duration_hours": 2.5,  
        "frequency": "Adhoc"  
    }

### Notes:
Ensure the output is plain Python code without any formatting or additional explanations.

Step 2: Fetch Rules from Storage

The agent then transforms these rules into Python code for validation.

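The exact rules are company-specific, so here is a purely hypothetical illustration of what the generated method might look like. It contains no imports (the sandbox wrapper in Step 3 provides datetime and calendar), and its signature matches the call made there:

def validateCustomerSchedulingParameters(appointment_year, appointment_month, appointment_date, appointment_time_hour, appointment_time_minute, duration_hours, frequency):
    """Returns a list of human-readable validation errors (empty if the input is valid)."""
    errors = []

    # Hypothetical rule: appointments only between 9:00 and 18:00.
    if appointment_time_hour is not None and not 9 <= appointment_time_hour < 18:
        errors.append("Appointments are only available between 9am and 6pm.")

    # Hypothetical rule: closed on Sundays (calendar.weekday: Mon=0 ... Sun=6).
    if None not in (appointment_year, appointment_month, appointment_date):
        if calendar.weekday(appointment_year, appointment_month, appointment_date) == 6:
            errors.append("We are closed on Sundays.")

    # Hypothetical rule: appointments cannot exceed 4 hours.
    if duration_hours is not None and duration_hours > 4:
        errors.append("Appointments cannot exceed 4 hours.")

    return errors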

Step 3: Run Generated Code in Sandbox

const pythonCodeToInvoke = `
import sys
import datetime
import calendar
import json

${state.pythonValidationMethod}

${state.pythonParametersExtractionMethod}

parameters = getCustomerSchedulingParameters()

validation_errors = validateCustomerSchedulingParameters(parameters["appointment_year"], parameters["appointment_month"], parameters["appointment_date"], parameters["appointment_time_hour"], parameters["appointment_time_minute"], parameters["duration_hours"], parameters["frequency"])

print(json.dumps({"validation_errors": validation_errors}))`;

const traceableCodeInterpreterFunction = traceable((pythonCodeToInvoke: string) => codeInterpreter.execute('python', pythonCodeToInvoke, { args: [] }));
const result = await traceableCodeInterpreterFunction(pythonCodeToInvoke);
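
Since the generated program prints a single JSON document, the agent only needs to JSON-parse the sandbox's stdout (result.run.stdout in the Piston response) to obtain the validation_errors array and report any failures back to the customer.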


Source code on GitHub


Potential Improvements

  • Implement an iterative loop so the LLM can debug and refine the generated Python code dynamically.
  • Add a human-in-the-loop review step for generated validation code.
  • Cache the generated code (a minimal sketch follows below).
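
For the last point, a hypothetical sketch of the idea (shown in Python for brevity, though the agent itself is TypeScript): key the cache on a hash of the rule set, so the expensive LLM call runs only when the rules actually change.

import hashlib

_code_cache: dict[str, str] = {}

def get_validation_code(rules_text: str, generate_with_llm) -> str:
    """Return cached generated code for this exact rule set."""
    key = hashlib.sha256(rules_text.encode("utf-8")).hexdigest()
    if key not in _code_cache:
        _code_cache[key] = generate_with_llm(rules_text)  # expensive LLM call
    return _code_cache[key]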

Final Thoughts

Deterministic code execution and token-based LLMs are highly complementary technologies, unlocking a new level of flexibility. This synergistic approach has a bright future: AWS's recently announced "Bedrock Automated Reasoning" appears to offer a similar capability within its enterprise ecosystem, and Google and Microsoft will likely follow with similar offerings soon.
