Introduction
The dream of an autonomous AI agent isn’t just about generating smart responses — it’s about making those responses fast, interactive, and context-aware. To achieve this, you need to manage state across asynchronous tasks, handle real-time communication, and separate logic cleanly.
In this blog, you’ll learn how to design an ideal AI agent by:
- Using asynchronous server-sent events (SSE) to create live, real-time AI responses.
- Simplifying state management with context variables (contextvars).
- Decoupling logic from network operations to build a scalable, maintainable architecture.
By the end, you’ll have a step-by-step understanding of how to design an agent that’s efficient, elegant, and easy to scale.
1️⃣ The Architecture of an Ideal AI Agent
Most "basic" AI agents are tangled in a mess of network calls, domain logic, and asynchronous event management. This makes it difficult to debug and hard to scale.
The ideal agent separates these concerns:
- Event-driven Communication: Uses asynchronous Server-Sent Events (SSE) to stream updates to users in real time.
- Context-Aware State Management: Manages context across multiple async calls using Python’s contextvars.
- Decoupled Business Logic: Avoids tightly coupling logic with network operations, making it easier to maintain.
Diagram: Ideal AI Agent Architecture
+----------------------------+
| Client (Web Browser) |
| Listens for SSE Events |
+----------------------------+
⬇
+----------------------------+
| FastAPI Backend |
| (Async Streaming) |
+----------------------------+
⬇
+----------------------------+
| Agent Logic |
| 1️⃣ Generates Output |
| 2️⃣ Emits Real-Time Events |
+----------------------------+
⬇
+----------------------------+
| Event Queue + Context |
| Context-Aware State |
+----------------------------+
2️⃣ Key Concepts to Build the Ideal Agent
1. Asynchronous Event Streaming (SSE)
Instead of waiting for the entire AI response to finish, we can stream each "chunk" of the response to the user. This makes the interaction feel faster, even if the total response time is the same.
How It Works:
- The client opens an event stream (text/event-stream).
- Every time the agent generates new content (like a sentence or paragraph), it streams that chunk to the client.
- When the full response is complete, the event stream closes.
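For reference, here is roughly what the raw text/event-stream body looks like for the {"content": ...} chunks built later in this post. Each event is a data: line followed by a blank line:
data: {"content": "Hello"}

data: {"content": "How are you?"}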
Why It’s Important:
- Feels more interactive to users.
- Allows for partial responses — users can see content as it’s created.
2. Context-Aware State Management (contextvars)
Agents often run asynchronous tasks in parallel. Without per-request context, keeping the state of those tasks separate becomes difficult.
Problem:
Two user requests hit the server at the same time. How do you ensure their states are separate?
Solution:
Use Python’s contextvars module. It lets you manage request-specific variables even when multiple requests are in flight at once. Think of it like thread-local storage for async code.
How It Works:
- When a new request arrives, a queue is created in the context.
- This queue holds the event messages (chunks) for that specific request.
- As the agent generates output, it emits chunks into the queue.
- Once the queue is empty and the task is done, the context is cleaned up.
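Here is a minimal, runnable sketch of that isolation (request_id_var and handle are illustrative names, not part of the project code):
import asyncio
import contextvars

request_id_var = contextvars.ContextVar("request_id")

async def handle(request_id):
    request_id_var.set(request_id)  # Set this task's own value
    await asyncio.sleep(0.1)        # Yield control so the tasks interleave
    # Each task still sees the value it set, not the other task's
    print(request_id, "->", request_id_var.get())

async def main():
    # asyncio.gather wraps each coroutine in a task,
    # and each task gets its own copy of the current context
    await asyncio.gather(handle("req-1"), handle("req-2"))

asyncio.run(main())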
3. Decoupled Agent Logic
The best AI agents keep network logic and business logic separate. Instead of directly streaming from the agent, we push events into a queue and handle event streaming separately.
This separation makes it easier to test, debug, and maintain the system.
Concepts You’ll Need:
- emit_event(): Adds events to the queue.
- close(): Closes the queue when the task finishes.
- Streaming Response: Sends chunks from the queue to the client.
3️⃣ Step-by-Step Guide to Building the Ideal Agent
Step 1: Setup the Environment
Install the necessary libraries:
pip install fastapi uvicorn pydantic
Step 2: Build the Context-Aware Event System
This system tracks and streams events from the agent to the client. Here’s how:
- Use a ContextVar to store the event queue.
- Create functions to emit events and close the queue.
- The agent will generate chunks, add them to the queue, and the queue will stream them.
context.py
import asyncio
import contextvars

# Create a context variable to store request-specific data
chat_context_var = contextvars.ContextVar("chat_context")

# Function to initialize the context (per request)
def build_chat_context():
    queue = asyncio.Queue()

    async def emit_event(event):
        await queue.put(event)

    async def close():
        await queue.put(None)  # Signals to close the stream

    return emit_event, close, queue
Step 3: The Chat Service
This is the core logic where the agent processes a request.
- When a client sends a chat request, we create a new context.
- The context tracks the queue where messages (events) are stored.
- Each message chunk from the agent is streamed to the client in real time.
chat.py
import asyncio
import json

from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

from context import build_chat_context, chat_context_var

router = APIRouter()

async def process_messages(messages):
    # Pull the request-specific helpers out of the context
    emit_event, close, _queue = chat_context_var.get()
    for message in messages:
        await asyncio.sleep(1)  # Simulate a delay to send chunks one by one
        await emit_event({"content": message})
    await close()  # End of stream

@router.post("/api/stream")
async def stream(request: Request):
    emit_event, close, queue = build_chat_context()
    chat_context_var.set((emit_event, close, queue))

    # asyncio.create_task copies the current context, so the
    # background task sees the same queue as this request
    task = asyncio.create_task(
        process_messages(["Hello", "How are you?", "Goodbye"])
    )

    async def event_generator():
        while True:
            event = await queue.get()
            if event is None:  # End-of-stream sentinel
                break
            yield f"data: {json.dumps(event)}\n\n"  # One SSE frame per chunk
        await task  # Surface any exception raised by the background task

    return StreamingResponse(event_generator(), media_type="text/event-stream")
Step 4: Run the Service
Run the FastAPI server. Note that chat.py only defines a router; main.py (shown in the full example below) creates the app and includes it:
uvicorn main:app --reload
Make a POST request to:
http://localhost:8000/api/stream
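For example, with curl (the -N flag disables output buffering so chunks print as they arrive):
curl -N -X POST http://localhost:8000/api/stream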
Watch the real-time response as chunks appear.
4️⃣ Advanced Techniques
1. Use emit_event() Instead of yield
Generating events with yield ties streaming to a single generator function, which becomes awkward when events originate from several places. Pushing events onto a queue with emit_event() avoids this: any sub-function can send events without a generator being threaded through the call stack.
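As a sketch, assuming the chat_context_var from context.py (generate_answer and lookup_sources are hypothetical helpers), any function in the call stack can emit events:
from context import chat_context_var

async def generate_answer():
    emit_event, close, _ = chat_context_var.get()
    await lookup_sources()  # The sub-function emits its own events
    await emit_event({"content": "final answer"})
    await close()

async def lookup_sources():
    emit_event, _, _ = chat_context_var.get()
    await emit_event({"status": "searching sources..."})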
2. Manage Long-Running Tasks
Use asyncio.create_task()
to process long-running tasks without blocking the entire stream. This allows multiple users to receive independent updates.
3. Use WebSockets Instead of SSE
For more interactive experiences, use WebSockets. Unlike SSE, WebSockets support bi-directional communication.
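Here is a minimal FastAPI WebSocket sketch of the same idea; the echoed reply is a stand-in for real agent output:
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/chat")
async def chat_ws(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            prompt = await websocket.receive_text()  # Client -> server
            await websocket.send_text(f"You said: {prompt}")  # Server -> client
    except WebSocketDisconnect:
        pass  # Client closed the connection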
5️⃣ Key Takeaways
- Context-aware agents can separate network logic from agent logic.
- Use SSE (Server-Sent Events) for real-time feedback.
- Manage agent state using Python’s contextvars to keep each request’s state isolated.
- emit_event() makes it simple to send updates from any part of the agent logic.
6️⃣ Full Code Example
Here’s the complete file structure:
├── context.py # Handles contextvars and event system
├── chat.py # The core logic for the streaming service
└── main.py # Starts the FastAPI server
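main.py is not shown in the steps above; a minimal version, assuming chat.py exposes router as in Step 3, looks like this:
main.py
from fastapi import FastAPI

from chat import router

app = FastAPI()
app.include_router(router)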
7️⃣ Final Thoughts
Building an "ideal" AI agent isn’t just about improving its intelligence — it’s about making it more interactive, more maintainable, and more scalable. By using async events, contextvars, and real-time streams, you can create an agent that "feels" fast and responsive.
If you’re ready to level up your agents, apply these principles to your next AI project.
Want to learn more about building Responsive LLMs? Check out my course on newline: Responsive LLM Applications with Server-Sent Events
I cover:
- How to design systems for AI applications
- How to stream the answer of a Large Language Model
- Differences between Server-Sent Events and WebSockets
- Importance of real-time for GenAI UI
- How asynchronous programming in Python works
- How to integrate LangChain with FastAPI
- What problems Retrieval Augmented Generation can solve
- How to create an AI agent ... and much more.
Worth checking out if you want to build your own LLM applications.