Louis Sanna

Mastering Real-Time AI: A Developer’s Guide to Building Streaming LLMs with FastAPI and Transformers

Introduction: Why Real-Time Streaming AI is the Future

Real-time AI is transforming how users experience applications. Gone are the days when users had to wait for entire responses to load. Instead, modern apps stream data in chunks.

For developers, this shift isn't just a "nice-to-have" — it's essential. Chatbots, search engines, and AI-powered customer support apps are now expected to integrate streaming LLM (Large Language Model) responses. But how do you actually build one?

This guide walks you through the process, step-by-step, using FastAPI, Transformers, and a healthy dose of asynchronous programming. By the end, you'll have a working streaming endpoint capable of serving LLM-generated text in real-time.

💡 Who This Is For:

  • Software Engineers who want to upgrade their back-end skills with text streaming and event-driven programming.
  • Data Scientists who want to repurpose ML skills for production-ready AI services.

Table of Contents

  1. What Is a Streaming LLM and Why It Matters?
  2. Tech Stack Overview: The Tools You'll Need
  3. Project Walkthrough: Building the Streaming LLM Backend
    • Environment Setup
    • Setting Up FastAPI
    • Building the Streaming Endpoint
    • Connecting the LLM with Transformers
  4. Client-Side Integration: Consuming the Stream
  5. Deploying Your Streaming AI App
  6. Conclusion and Next Steps

1️⃣ What Is a Streaming LLM and Why It Matters?

When you type into ChatGPT or ask a question in Google Bard, you'll notice the response appears a few words at a time. Streaming LLMs send chunks of text as they're generated instead of waiting for the entire message to finish, so the output reaches the user in real time.

Here’s why you should care as a developer:

  • Faster User Feedback: Users see responses sooner.
  • Lower Latency Perception: Users feel like the system is faster, even if total time is the same.
  • Improved UX for AI Chatbots: Streaming text "feels" human, mimicking natural conversation.

If you’ve used ChatGPT, you’ve already experienced this. Now it’s time to learn how to build one yourself.


2️⃣ Tech Stack Overview: The Tools You'll Need

To build your streaming LLM backend, you’ll need the following tools:

📦 Core Technologies

  • FastAPI: Handles API requests and real-time streaming
  • Uvicorn: Runs the FastAPI app as an ASGI server
  • Transformers: Provides access to pre-trained language models
  • asyncio: Handles asynchronous event loops (standard library)
  • contextvars: Keeps track of context across async tasks (standard library)
  • Server-Sent Events (SSE): Streams messages to the client
  • Docker: Optional, for containerization and deployment

💡 Note: Server-Sent Events (SSE) is different from WebSockets. SSE lets the server push data to the client over a plain HTTP connection, while WebSockets support bi-directional communication. For LLM streaming, where data only needs to flow from server to client, SSE is the simpler option.
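
Under the hood, an SSE response is just plain text served with the text/event-stream content type: each message is one or more data: lines followed by a blank line, which is exactly the format the endpoints below emit. A stream carrying two messages looks like this on the wire:

data: This is the first message

data: This is the second message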


3️⃣ Project Walkthrough: Building the Streaming LLM Backend

Step 1: Environment Setup

  1. Install Python and pip: Ensure Python 3.8+ is installed (matching the Docker image used later in this guide).
  2. Create a Virtual Environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
    
  3. Install Dependencies:

    pip install fastapi uvicorn transformers torch

  Note: asyncio and contextvars ship with the Python standard library, so they don't need to be pip-installed. torch is required as the backend for running Transformers models.


Step 2: Set Up FastAPI

Create a file named app.py. Here’s the basic FastAPI setup.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Welcome to Real-Time LLM Streaming!"}


Run the server:

uvicorn app:app --reload


Visit http://127.0.0.1:8000/ in your browser. You should see:

{ "message": "Welcome to Real-Time LLM Streaming!" }


Step 3: Build the Streaming Endpoint

Instead of returning a single response, we’ll stream it chunk-by-chunk. Here’s the idea:

  1. The client makes a request to /stream.
  2. The server "yields" parts of the response as they are generated.

Here’s the code for the streaming endpoint:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def event_stream():
    for i in range(10):
        await asyncio.sleep(1)  # Simulate response delay
        yield f"data: Message {i}\n\n"

@app.get("/stream")
async def stream_response():
    return StreamingResponse(event_stream(), media_type="text/event-stream")


🔥 Test It:

Run the server and visit http://127.0.0.1:8000/stream — you'll see "Message 0", "Message 1", etc., appear every second.
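
You can also watch the raw stream from a terminal; curl's -N flag disables output buffering so each chunk prints as soon as it arrives:

curl -N http://127.0.0.1:8000/stream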


Step 4: Connect the LLM with Transformers

Now, let’s swap out the dummy messages for LLM-generated responses.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import pipeline
import asyncio

app = FastAPI()
llm_pipeline = pipeline("text-generation", model="gpt2")

async def generate_response(prompt):
    # The pipeline call is synchronous and compute-heavy, so run it in a
    # worker thread to keep the event loop responsive.
    loop = asyncio.get_running_loop()
    results = await loop.run_in_executor(
        None, lambda: llm_pipeline(prompt, max_length=50, return_full_text=False)
    )
    # The pipeline returns one dict per generated sequence; each one is
    # sent to the client as a server-sent event.
    for chunk in results:
        yield f"data: {chunk['generated_text']}\n\n"
        await asyncio.sleep(0.1)

@app.get("/stream")
async def stream_response(prompt: str):
    return StreamingResponse(generate_response(prompt), media_type="text/event-stream")


🔥 Test It:

Run the server and visit:

http://127.0.0.1:8000/stream?prompt=Once upon a time


You'll see the generated text arrive as server-sent events. With this basic pipeline call, each SSE message carries a complete generated sequence; for true token-by-token streaming, see the sketch below.
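
The pipeline produces the whole generation before anything is sent, so it can't stream individual tokens. For genuine token-by-token output, Transformers provides TextIteratorStreamer, which yields decoded text pieces while generate() runs in a background thread. Here is a minimal sketch of how generate_response could be rewritten around it, reusing the same gpt2 model; treat it as a starting point rather than production code:

import asyncio
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

async def generate_response(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    # The streamer receives decoded text fragments as the model produces them.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() is blocking, so run it in a background thread and consume
    # the streamer from the request handler as tokens become available.
    generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=50)
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    for text in streamer:
        yield f"data: {text}\n\n"
        await asyncio.sleep(0)  # hand control back to the event loop between tokens

Note that iterating over the streamer still blocks briefly while waiting for each token; a production service would typically push that wait into a worker thread or queue as well.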


4️⃣ Client-Side Integration: Consuming the Stream

On the front end, you can use EventSource (a native browser API) to consume the stream.

Here’s the simplest way to do it:

<!DOCTYPE html>
<html lang="en">
<body>
  <h1>LLM Streaming Demo</h1>
  <pre id="stream-output"></pre>

  <script>
    const output = document.getElementById('stream-output');
    const eventSource = new EventSource('http://127.0.0.1:8000/stream?prompt=Tell me a story');

    eventSource.onmessage = (event) => {
      output.innerText += event.data + '\n';
    };
  </script>
</body>
</html>


This will display a live feed of the AI response on your webpage.
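
One practical caveat: if you open this HTML file straight from disk (a file:// URL) or serve it from a different origin than the API, the browser will block the EventSource request unless the server sends CORS headers. FastAPI ships CORSMiddleware for this; a permissive setup for local experimentation (tighten allow_origins before deploying) looks like:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # fine for local testing; restrict to your frontend's origin in production
    allow_methods=["GET"],
    allow_headers=["*"],
)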


5️⃣ Deploying Your Streaming AI App

You’ve got it working locally, but now you want to deploy it to the world. Here’s how:

Step 1: Dockerize the App

Create a file called Dockerfile:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

WORKDIR /app
COPY . /app

RUN pip install -r /app/requirements.txt

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

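The Dockerfile installs from requirements.txt, which we haven't created yet. A minimal file covering the dependencies used in this guide could look like this (pin exact versions for reproducible builds):

fastapi
uvicorn
transformers
torch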

Step 2: Build and Run the Docker Image

docker build -t streaming-llm .
docker run -p 80:80 streaming-llm

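Once the container is up, you can hit the streaming endpoint on port 80 just as you did locally:

curl -N "http://localhost/stream?prompt=Once upon a time"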

6️⃣ Conclusion: What’s Next?

Congratulations! 🎉 You've built a real-time streaming LLM backend from scratch using FastAPI, Transformers, and Server-Sent Events. Here's what you've learned:

  • How streaming works (and why it matters).
  • How to use FastAPI for streaming endpoints.
  • How to stream LLM responses with Hugging Face Transformers.

Where to Go Next?

  • Optimize Your LLM: Try other Hugging Face models, such as distilGPT2 for a smaller, faster option or GPT-J for higher-quality output.
  • Explore WebSockets: For two-way streaming (not just server->client).
  • Deploy to Cloud: Deploy your app to AWS, GCP, or Heroku.

🧠 Pro Tip: Add interactive client-side UI, like a chat interface, to create your own mini ChatGPT!

With this guide, you're ready to level up your developer skills and build interactive, AI-driven experiences. 🚀

Want to learn more about building Responsive LLMs? Check out my course on newline: Responsive LLM Applications with Server-Sent Events

I cover:

  • How to design systems for AI applications
  • How to stream the answer of a Large Language Model
  • Differences between Server-Sent Events and WebSockets
  • Importance of real-time for GenAI UI
  • How asynchronous programming in Python works
  • How to integrate LangChain with FastAPI
  • What problems Retrieval Augmented Generation can solve
  • How to create an AI agent ... and much more.
