Your health data is probably the most sensitive information you own. Yet, in the age of AI, most people blindly upload their blood work and MRI results to cloud-based LLMs just to get a summary. Stop right there!
In this tutorial, we are going to build a local RAG (Retrieval-Augmented Generation) system. We will leverage Apple Silicon's unified memory, the high-performance MLX framework, and Llama-3 to create a private medical assistant that never sends a single byte to the internet: complex semantic search and data extraction over your medical PDFs, with your data staying strictly on-device.
The Architecture: Why MLX?
Traditional RAG stacks often rely on heavy Docker containers or cloud APIs. However, if you are on a Mac (M1/M2/M3), the MLX framework (developed by Apple Machine Learning Research) allows you to run Llama-3 with incredible efficiency by utilizing the GPU and unified memory architecture.
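Before wiring up the pipeline, it helps to confirm MLX is actually running on the GPU. Here is a minimal sanity check (not part of the RAG pipeline, just a quick sketch using MLX's core API):
import mlx.core as mx

# On Apple Silicon the default device is the GPU, backed by unified memory
print(mx.default_device())

# Small matrix multiply as a smoke test; MLX is lazy, so eval() forces the computation
a = mx.random.normal((2048, 2048))
b = a @ a
mx.eval(b)
print(b.shape)  # (2048, 2048)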
Here is how the data flows from your dusty PDF report to a meaningful conversation:
graph TD
A[Medical PDF Report] -->|PyMuPDF| B(Text Extraction & Cleaning)
B --> C{Chunking Strategy}
C -->|Sentence Splitting| D[ChromaDB Vector Store]
E[User Query: 'Is my cholesterol high?'] -->|SentenceTransformer Embedding| F(Vector Search)
D --> F
F -->|Retrieve Relevant Context| G[Prompt Augmentation]
G -->|Context + Query| H[Llama-3-8B via MLX]
H --> I[Private Local Answer]
style H fill:#f96,stroke:#333,stroke-width:2px
style D fill:#bbf,stroke:#333,stroke-width:2px
Prerequisites
Before we dive into the code, ensure you have an Apple Silicon Mac and the following stack installed:
- Llama-3-8B: We'll use the 4-bit quantized version for speed.
- MLX: Apple's native array framework.
- ChromaDB: Our lightweight vector database.
- PyMuPDF (fitz): For high-accuracy PDF parsing.
pip install mlx-lm chromadb pymupdf sentence-transformers
Step 1: Parsing Sensitive PDFs with PyMuPDF
Medical reports are notoriously messy: tables, signatures, and weird formatting. We use PyMuPDF for its speed and reliability in extracting clean text.
import fitz # PyMuPDF
def extract_medical_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text("text") + "\n"
    # Simple cleaning: collapse extra whitespace
    clean_text = " ".join(text.split())
    return clean_text
# Usage
raw_data = extract_medical_text("my_blood_report_2024.pdf")
print(f"Extracted {len(raw_data)} characters.")
Step 2: Vector Embeddings and Local Storage
To find relevant information (like "What was my Glucose level?"), we need to convert text into vectors. We'll store these in ChromaDB.
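If you have never looked at an embedding before, this tiny standalone snippet shows what the conversion produces with the same all-MiniLM-L6-v2 model we use below; each sentence becomes a 384-dimensional vector:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode([
    "Fasting glucose: 105 mg/dL",
    "What was my glucose level?",
])
print(vectors.shape)  # (2, 384): two sentences, 384 dimensions each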
Pro-Tip: For more production-ready examples and advanced RAG patterns, check out the detailed guides on the WellAlly Tech Blog, where we dive deep into optimizing local inference.
import chromadb
from chromadb.utils import embedding_functions
# Initialize local ChromaDB
client = chromadb.PersistentClient(path="./medical_db")
# Using a local embedding model
emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection(name="medical_reports", embedding_function=emb_fn)
def add_to_vector_store(text, metadata):
    # Chunking text into 500-character pieces
    chunks = [text[i:i+500] for i in range(0, len(text), 500)]
    ids = [f"id_{i}" for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids,
        metadatas=[metadata] * len(chunks)
    )
add_to_vector_store(raw_data, {"source": "annual_checkup_2024"})
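Note that the diagram above mentions sentence splitting, while this snippet uses fixed 500-character chunks for brevity. If you want chunks that respect sentence boundaries (usually cleaner retrieval for lab reports), here is a rough sketch; add_sentence_chunks is a hypothetical helper that reuses the same collection:
import re

def add_sentence_chunks(text, metadata, max_chars=500):
    # Naive sentence split on ., ! or ? followed by whitespace; good enough for most reports
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    collection.add(
        documents=chunks,
        ids=[f"{metadata['source']}_{i}" for i in range(len(chunks))],  # unique per source document
        metadatas=[metadata] * len(chunks),
    )

# add_sentence_chunks(raw_data, {"source": "annual_checkup_2024"})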
Step 3: Local Inference with Llama-3 & MLX
Now for the magic. We use mlx-lm to load a quantized Llama-3-8B. This allows the model to run comfortably even on a MacBook Air with 16GB of RAM.
from mlx_lm import load, generate
# Load the model and tokenizer
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")
def query_private_ai(user_question):
    # 1. Retrieve context from ChromaDB
    results = collection.query(query_texts=[user_question], n_results=3)
    context = "\n".join(results['documents'][0])
    # 2. Construct the prompt
    prompt = f"""
You are a private medical assistant. Use the provided medical report context to answer the user's question.
If you don't know the answer based on the context, say so.
Context: {context}
---
Question: {user_question}
Answer:
"""
    # 3. Generate response using MLX
    response = generate(model, tokenizer, prompt=prompt, verbose=False, max_tokens=500)
    return response
# Example Query
print(query_private_ai("What are the key concerns in my blood report?"))
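One optional refinement: Llama-3-Instruct was trained on a specific chat format, and in my experience you often get tighter answers if you let the tokenizer apply that template instead of a raw prompt string. Here is a sketch of that variation (query_private_ai_chat is a hypothetical helper; it relies on tokenizer.apply_chat_template, which mlx-lm's tokenizer wrapper exposes):
def query_private_ai_chat(user_question):
    # Retrieve the same top-3 context chunks from ChromaDB
    results = collection.query(query_texts=[user_question], n_results=3)
    context = "\n".join(results["documents"][0])

    messages = [
        {"role": "system", "content": "You are a private medical assistant. Answer only from the provided report context. If the context is not enough, say so."},
        {"role": "user", "content": f"Context:\n{context}\n---\nQuestion: {user_question}"},
    ]
    # Let the tokenizer wrap the conversation in Llama-3's chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return generate(model, tokenizer, prompt=prompt, max_tokens=500, verbose=False)

# print(query_private_ai_chat("What are the key concerns in my blood report?"))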
Taking it Further: The "Official" Way
While this script gets you started, building a production-grade medical AI requires handling multi-modal data (like X-rays) and ensuring rigorous HIPAA-like compliance even on local edge devices.
The team at WellAlly has been pioneering "Privacy-First AI" architectures. If you're interested in scaling this to multiple users or integrating it into a secure healthcare workflow, I highly recommend reading their latest deep-dives at https://www.wellally.tech/blog. They cover how to fine-tune Llama-3 for clinical terminology, which significantly reduces hallucinations.
Conclusion
You just built a private, high-performance medical RAG system! By combining Llama-3, MLX, and ChromaDB, you've achieved:
- Zero Data Leakage: Your health data never leaves your Mac.
- High Performance: MLX makes local LLMs feel snappy.
- Intelligence: Llama-3 provides reasoning that simple keyword searches can't match.
What's next?
- Try implementing a "Table Parser" for more accurate lab result extraction.
- Add a Streamlit UI to make it look like a real app.
- Let me know in the comments: What's your biggest concern with Cloud AI?
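As a starting point for the table parser idea above: recent PyMuPDF releases (1.23+) include page.find_tables(), which can recover simple table structures. The sketch below (extract_lab_tables is a hypothetical helper) just collects whatever rows it finds, page by page; real lab reports will still need per-lab tweaking.
import fitz  # requires PyMuPDF 1.23+ for find_tables()

def extract_lab_tables(pdf_path):
    doc = fitz.open(pdf_path)
    rows = []
    for page_number, page in enumerate(doc, start=1):
        for table in page.find_tables().tables:
            for row in table.extract():  # one list of cell strings per table row
                cells = [cell.strip() for cell in row if cell]
                if cells:
                    rows.append({"page": page_number, "cells": cells})
    doc.close()
    return rows

# for row in extract_lab_tables("my_blood_report_2024.pdf"):
#     print(row["page"], row["cells"])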
Stay private, stay healthy!