I have been trying to build a document Q&A chatbot for the past two months. I initially started off with an application that doesn't rely on GPT but uses the impira/layoutlm-document-qa model. It didn't work out, as the model is not suited for carrying out conversations.
My next attempt involved a Python application that extracts the content from the PDF document using tesseract-ocr + pytesseract and passes it to the OpenAI API to drive the entire Q&A process. I was just getting started with LangChain, so there was a lot of boilerplate and I wasn't using it right. That's when I decided to scrap it all and start afresh.
Third time is a charm
With the learnings from the second attempt, I stuck to GPT as the core LLM and chose Next.js 13 as the one-stop solution for building both the UI and the backend of the application.
Before getting into the details, below are the core tools used to build the application
And here is a sneak peek of everything put together
The flow
The user journey of the application and what goes on behind the scenes for each user action are as follows:
⬆️ Uploading the document
The journey starts with the user uploading the PDF document. For this, I used the react-dropzone library. The library supports both drag-and-drop and click-based uploads. Once the user drops the file into the input, the file is sent to the backend for processing
import { useDropzone } from 'react-dropzone';

const { getRootProps, getInputProps, isDragActive } = useDropzone({
  onDrop: (acceptedFiles) => {
    // check if the acceptedFiles array is empty before proceeding
    if (acceptedFiles.length === 0) return;
    // call the API with the file as the payload
    const formData = new FormData();
    formData.append('file', acceptedFiles[0]);
    axios.post("/api/upload", formData);
  },
  multiple: false, // to prevent multi-file upload
  accept: {
    'application/pdf': ['.pdf'] // if the file is not a PDF, then the list will be empty
  }
});
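For completeness, here is a rough sketch of how the dropzone props could be wired into the markup (the exact component structure is an assumption, not the project's code):

// Spread the dropzone props onto a container and a hidden file input
<div {...getRootProps()}>
  <input {...getInputProps()} />
  {isDragActive ? (
    <p>Drop the PDF here...</p>
  ) : (
    <p>Drag and drop a PDF here, or click to select one</p>
  )}
</div>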
The /upload API takes care of three things:
- Uploading the document to the supabase storage bucket
- Extracting the content from the document
- Persisting the document details in the supabase database
// Route to handle the upload and document processing
import { NextResponse } from 'next/server';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { supabase } from '../supabase';

export const POST = async (req) => {
  const form = await req.formData();
  const file = form.get('file');

  // Using langchain PDFLoader to extract the content of the document
  const docContent = await new PDFLoader(file, { splitPages: false })
    .load()
    .then((doc) => {
      return doc.map((page) => {
        return page.pageContent
          .replace(/\n/g, ' '); // It is recommended to use the context string with no new lines
      });
    });

  const fileBlob = await file.arrayBuffer();
  const fileBuffer = Buffer.from(fileBlob);

  // Uploading the document to the supabase storage bucket
  // (`bucket` is the bucket name and `checksum` is a hash of the file, both defined elsewhere)
  await supabase
    .storage.from(bucket)
    .upload(`${checksum}.pdf`, fileBuffer, {
      cacheControl: '3600',
      upsert: true,
      contentType: file.type
    });

  // storing the document details to the supabase DB
  await supabase
    .from("documents_table")
    .insert({
      document_content: docContent,
      // insert other relevant document details
    });

  // return the extracted content so the UI can include it in the chat messages later
  return NextResponse.json({ message: "success", content: docContent }, { status: 200 });
};
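The checksum used as the storage object name isn't shown in the trimmed snippet above; a minimal way to derive it, assuming a simple content hash (not necessarily what the project does), would be:

import { createHash } from 'crypto';

// Hash the file buffer so the same document always maps to the same storage object
const checksum = createHash('md5').update(fileBuffer).digest('hex');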
🔃 Initialising Socket.io
The application uses socket.io to send and receive messages in a non-blocking way. After uploading the document successfully, the UI invokes an API - /api/socket - to open a socket server connection.
Setting up a socket.io server is usually easy, but it was a bit challenging with Next.js 13. The recent versions of Next.js introduced a new paradigm called the App Router, and setting up a socket server is not possible with it (or at least I couldn't find any documentation for it anywhere). So I had to use the old Page Router paradigm to initialise the socket server connection.
Handler file => src/pages/api/socket.js
Dependencies required => yarn add socket.io socket.io-client
import { Server } from 'socket.io';

export default function handler(req, res) {
  const io = new Server(res.socket.server, {
    path: '/api/socket_io',
    addTrailingSlash: false
  });
  res.socket.server.io = io;

  // When the UI invokes the /api/socket endpoint, it opens a new socket connection
  io.on('connection', (socket) => {
    socket.on('message', async (data) => {
      // For every user message from the UI, this event will be triggered
      const { message } = data;
      // pass on the question and content from the message to Langchain
    });
  });

  res.end();
}
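On the client side, the UI can first hit /api/socket to make sure the server above is initialised and then connect through socket.io-client using the same custom path. A minimal sketch (the exact wiring in the project may differ):

import { io } from 'socket.io-client';

// Initialise the socket server, then connect to it over the custom path
await fetch('/api/socket');
const socket = io({ path: '/api/socket_io' });

socket.on('connect', () => {
  console.log('socket connected:', socket.id);
});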
💬 The real chatting
Now that we have the document content handy and the socket open to receive events, it's time to do some chatting.
The UI emits a socket event called message every time the user enters a new message. This message includes two important things in the payload:
- The actual question
- The content extracted from the document (we get this as the response from the /upload API)
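For reference, the emit from the UI could look roughly like this (the variable names are assumptions):

// Send the user's question along with the content extracted by /api/upload
socket.emit('message', {
  question: userInput,
  content: documentContent
});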
We have already set up an event listener when we initialised the socket server, and we will do the actual LLM work within this listener to get the answer from GPT.
Dependencies required => yarn add openai pdf-parse langchain
import { Server } from 'socket.io';
import { Document } from 'langchain/document';
import { loadQAStuffChain } from 'langchain/chains';
import { OpenAI } from 'langchain/llms/openai';

export default function handler(req, res) {
  const io = new Server(res.socket.server, {
    path: '/api/socket_io',
    addTrailingSlash: false
  });
  res.socket.server.io = io;

  // When the UI invokes the /api/socket endpoint, it opens a new socket connection
  io.on('connection', (socket) => {
    socket.on('message', async (data) => {
      // For every user message from the UI, this event will be triggered
      const { question, content } = data;

      const llm = new OpenAI({
        openAIApiKey: process.env.OPENAI_API_KEY,
        modelName: 'gpt-3.5-turbo'
      });

      // We will be using the stuff QA chain
      // This is a very simple chain that sets the entire doc content as the context
      // Will be suitable for smaller documents
      const chain = loadQAStuffChain(llm, { verbose: true });
      const docs = [new Document({ pageContent: content })];
      const { text } = await chain.call({
        input_documents: docs,
        question
      });

      // Emitting the response from GPT back to the UI
      socket.emit('ai_message', { message: text });
    });
  });

  res.end();
}
In the above snippet, we use the Stuff QA chain, which is a simple chain well suited for smaller documents. We create a Document object array with the content of our target document. The final step is invoking the call method to pass the payload to the OpenAI API and get the response.
Behind the scenes, LangChain generates a prompt in the following format and sends it to the OpenAI API to get the answer:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<the_entire_document_content>
Question: <some question>?
Helpful Answer:
The above prompt will be the same for every question you ask and the document content will be passed over as the context.
On receiving the response from the API, we emit a new event called ai_message. The UI has an event listener for this event, where we handle the logic of displaying the chat bubbles based on the message.
socket?.on("ai_message", async (data) => {
setConversations((prev) => {
return [
...prev,
{
user: "ai",
message: data.message,
},
];
});
});
This concludes the entire flow
⚙ Behind The Scenes
The code snippets above are trimmed versions with only the key items required for this article. Below are some of the other things that the application handles along with the document QA:
Persisting the document details: The document details, such as the checksum of the document, the original document name, and the content of the document, are stored in the supabase DB. This data is used to display the chat history on the UI and lets the user return to their conversations anytime
Persisting the conversations: The chat messages sent by the user and generated by the AI are also persisted in the database. The user can click on any document's chat history and see all the exchanged conversations
Storing the actual document: The original PDF document uploaded by the user is stored in supabase storage bucket. This is for enabling the user to download the document from the chat section to see the originally uploaded content
User authentication: I wanted to try supabase authentication, so I added a login flow to the application. With supabase's Row Level Security (RLS) and policies, the chat history and the conversations will be shown only to the authenticated users
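To give a rough idea of how this fits together, here is a hypothetical sketch of the documents table with an RLS policy (the column names beyond what appears in the snippets above are assumptions, not the project's exact schema):

-- Hypothetical shape of the documents table
create table documents_table (
  checksum text primary key,
  document_name text,
  document_content text,
  user_id uuid references auth.users not null
);

-- Only the owner can read their documents and chat history
alter table documents_table enable row level security;
create policy "Users can read their own documents"
  on documents_table for select
  using (auth.uid() = user_id);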
🙋🏻 One more thing...
If your aim is just to chat with a small document of a couple of pages, you can skip this section.
Whatever we have seen until this point works fine for documents with no more than about 4 pages of textual content. But if you want to chat with, say, a research paper that often runs over 10 pages, sending the entire content of the document back and forth for every question becomes a scalability issue. I initially tested it with the well-known transformers research paper, which has over 40k characters, and the application choked on that content. So I had to rethink the solution.
Embeddings to the rescue... Embeddings are vectors of numbers that capture the meaning of a piece of text, which lets us measure the similarity or closeness between different sentences or words. To tackle the problem, I did the following:
- Split the extracted document content into smaller chunks
- Generate OpenAI embeddings for each chunk
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { CharacterTextSplitter } from 'langchain/text_splitter';
import { MemoryVectorStore } from 'langchain/vectorstores/memory';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';

export const extractDocumentContent = async (file) => {
  // Load the PDF and split it into chunks of ~5000 characters
  const chunks = await new PDFLoader(file).loadAndSplit(
    new CharacterTextSplitter({ chunkSize: 5000, chunkOverlap: 100, separator: ' ' })
  );

  const openAIEmbedding = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'text-embedding-ada-002'
  });

  const store = await MemoryVectorStore.fromDocuments(chunks, openAIEmbedding);

  const splitDocs = chunks.map((doc) => {
    return doc.pageContent.replace(/\n/g, ' ');
  });

  return {
    wholeContent: splitDocs.join(''),
    chunks: {
      content: splitDocs,
      embeddings: await store.embeddings.embedDocuments(splitDocs)
    }
  };
};
- Store the chunk and its respective embeddings in the supabase DB. This blog can be referred to for how to work with vector data types in supabase
const saveDocumentChunks = async (file) => {
  // Invoke the function from above to get the chunks
  const { chunks } = await extractDocumentContent(file);
  const { content, embeddings } = chunks;

  // Store the content of each chunk and its respective embedding in the DB.
  // For context, a simple embedding vector will look something like this:
  // [-0.021596793, 0.0027229148, 0.019078722, -0.019771526, ...]
  for (let i = 0; i < content.length; i++) {
    const { error } = await supabase
      .from('document_chunks')
      .insert({
        chunk_number: i + 1,
        chunk_content: content[i],
        chunk_embedding: embeddings[i] // chunk_embedding is of type `vector` in the Database
      });
    if (error) {
      // Be mindful of implementing a rollback strategy even if storing a single chunk fails
      return { error };
    }
  }

  return { error: null };
};
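For reference, the document_chunks table used above could be defined along these lines (a sketch based on the columns that appear in the snippets; the pgvector extension must be enabled first):

-- Enable the pgvector extension and create the chunks table
create extension if not exists vector;

create table document_chunks (
  id bigserial primary key,
  document_checksum varchar,
  chunk_number int,
  chunk_content text,
  chunk_embedding vector(1536) -- text-embedding-ada-002 returns 1536-dimensional vectors
);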
- Do a similarity search on the vector database using the user's question to filter only the relevant chunks (we need to set up a supabase plpgsql function to rank the chunks based on similarity)
create function match_documents (
  query_embedding vector(1536),
  match_count int default null,
  filter_checksum varchar default ''
) returns table (
  document_checksum varchar,
  chunk_content text,
  similarity float
)
language plpgsql
as $$
#variable_conflict use_column
begin
  return query
  select
    document_checksum,
    chunk_content,
    1 - (document_chunks.chunk_embedding <=> query_embedding) as similarity
  from document_chunks
  where document_checksum = filter_checksum
  order by document_chunks.chunk_embedding <=> query_embedding
  limit match_count;
end;
$$;
- Use the filtered chunks as the context for answering the questions
import { Document } from 'langchain/document';
import { loadQAStuffChain } from 'langchain/chains';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { llm } from '@/app/api/openai'; // a preconfigured OpenAI model instance

const inference = async (question) => {
  const openAIEmbedding = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
    modelName: 'text-embedding-ada-002'
  });

  // Rank the stored chunks against the question using the match_documents function
  // (`supabase` is the client instance used in the earlier snippets)
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: await openAIEmbedding.embedQuery(question),
    match_count: 5,
    filter_checksum: "unique_document_checksum"
  });
  if (error) return { error };

  // set the `content` as the context of the prompt and ask the questions
  const content = data.map((v) => v.chunk_content).join(' ');
  const chain = loadQAStuffChain(llm, { verbose: true });
  const docs = [new Document({ pageContent: content })];
  const { text } = await chain.call({ input_documents: docs, question });
  return { answer: text };
};
With these improvements in place, I uploaded the same research paper and started a conversation. To my surprise, it worked on the first try and gave back the answers with ease
In this approach, you no longer need to pass the document content with every message, because the relevant content is fetched from the DB based on the question.
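Tying it back to the socket handler, the message listener now only needs the question (and a way to identify the document); a rough sketch, assuming the inference helper above:

socket.on('message', async (data) => {
  // Only the question travels over the wire; the relevant chunks are looked up in the DB
  const { question } = data;
  const { answer, error } = await inference(question);
  socket.emit('ai_message', { message: error ? 'Something went wrong' : answer });
});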
🚀 Where is the code?
I have published the entire project on GitHub with instructions to set up the project locally and to set up the supabase project.
🏁 Conclusion
Like they say, "third time is a charm". Finally, after two failed attempts, I have built an application that works well with all the integrations, and the accuracy with which GPT handles the questions feels mystical sometimes. You can try out the application locally by cloning the repository and running it. Just ensure that you have set up a working supabase project and are ready to spend a few bucks on OpenAI.
Happy hacking!
📚 References
https://supabase.com/blog/openai-embeddings-postgres-vector
https://js.langchain.com/docs/modules/data_connection/vectorstores/integrations/supabase
https://js.langchain.com/docs/api/document_loaders_fs_pdf/classes/PDFLoader
https://js.langchain.com/docs/api/text_splitter/classes/CharacterTextSplitter
Update
The socket.io server implementation with Next.js was more of a hack than a stable solution, and I realised it after using the application for a while.
When deployed to Vercel, which runs the API routes as serverless functions, the implementation breaks completely. It works if you deploy it to other platforms or containerise the app and run it as a service, but it comes with a weird content-length limitation: the socket server cannot emit events if the payload exceeds 1KB or if multiple events are emitted at the same time. As this solution is a workaround, it is not documented anywhere.
To overcome these, I chose supabase realtime, which is easy to implement and well suited for non-blocking, event-driven communication: supabase.com/docs/guides/realtime/...
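For a flavour of what that looks like, broadcasting messages over a Supabase Realtime channel is roughly this (a minimal sketch of the supabase-js broadcast API, not the project's exact code):

// Subscribe to a channel and listen for AI responses
// (`supabase` is the client instance used in the earlier snippets)
const channel = supabase.channel('document-chat');
channel
  .on('broadcast', { event: 'ai_message' }, ({ payload }) => {
    // render the AI response in the UI
  })
  .subscribe();

// Send the user's question as a broadcast event
channel.send({
  type: 'broadcast',
  event: 'message',
  payload: { question }
});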
I replaced the socket implementation with event streams. The reasons for it and how to implement it are covered in the article below:
dev.to/neeldev96/you-dont-need-soc...