In this post, we'll do some more coding and build a simple chat app that we can use to ask questions, limiting the LLM's answers to a specific topic of our choice. But before we start coding, we need to define a few concepts that will help us understand the whole picture.
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique for interacting with LLMs in which a source of truth provides the base knowledge the LLM uses to respond to a user prompt.
This approach was proposed in May 2020 to address issues that fine-tuned models face, such as limited long-term memory, lack of accuracy for specific outputs, time-consuming training, and high costs. But it was only widely adopted in 2023, when it became consolidated as one of the most used techniques for working with LLMs.
How does it work?
I won't go too deep into the explanation; I'll stick to what's important to know from a practical perspective.
RAG basically needs 4 things to work: a prompt, a store, a retriever, and an LLM.
- Prompt: generally, it's textual data provided as input by a user or as instructions by the application
- Store: a vector store where the embedded source of truth data (RAG knowledge) is stored
- Retriever: it's a method used to find relevant pieces of data in the store
- LLM: it's used for interpreting the user prompt, following the application instructions, reading the retrieved context, and generating a response
And the workflow goes as follows (there's a rough code sketch right after the list):
- A data source is selected to be used (PDFs, DOCX, Web Pages, Video, Audio, Image, etc)
- The data source is split into smaller pieces of documents ("chunks")
- An embedding model is used to embed (convert into a vector representation) each document chunk
- The embedded data is stored in a vector store
- A user sends a prompt
- The application may provide further instructions about how the LLM should behave as a "system message"
- The user prompt is embedded
- The embedded user prompt is used to search the vector store for semantically similar document chunks
- The most similar documents are retrieved and converted back to textual representation
- The user prompt, system prompt, and document chunks are sent to a generative LLM
- The LLM generates a response in natural language that'll be sent back to the user
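To make that flow a bit more concrete, here's a rough TypeScript sketch of the two phases. Every helper in it (splitIntoChunks, embedText, vectorStore, generateAnswer) is a hypothetical placeholder, not a real API; later in this post we'll implement the real thing with LangChain.
// Hypothetical placeholders for the pieces a RAG pipeline needs
declare function splitIntoChunks(text: string): string[]
declare function embedText(text: string): Promise<number[]>
declare const vectorStore: {
  add(entry: {vector: number[]; text: string}): Promise<void>
  search(vector: number[], k: number): Promise<{text: string}[]>
}
declare function generateAnswer(input: {system: string; context: string; question: string}): Promise<string>

// Indexing phase: runs once per data source
async function indexSource(sourceText: string) {
  for (const chunk of splitIntoChunks(sourceText)) {
    await vectorStore.add({vector: await embedText(chunk), text: chunk})
  }
}

// Query phase: runs for every user prompt
async function answerQuestion(userPrompt: string, systemInstructions: string) {
  const promptVector = await embedText(userPrompt)
  const similarChunks = await vectorStore.search(promptVector, 4) // top 4 most similar chunks
  return generateAnswer({
    system: systemInstructions,
    context: similarChunks.map((chunk) => chunk.text).join('\n'),
    question: userPrompt,
  })
}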
There are many use cases where RAG can be the best choice for working with LLMs, such as Questions & Answers, Virtual Assistants, Content Generation, and many others.
What's LangChain?
LangChain is a framework originally written in Python - but it also has a JavaScript version - that helps with the development of solutions using AI APIs in many ways. It has integrations with the most used LLM APIs and vector stores, and it handles document splitting, retrievers, file loaders, embeddings, prompt templates, and more.
It also has a feature called "chain" that links calls and results from one LLM to another. That's very helpful when you have to handle multiple LLM calls in a single request. For example, you can ask an LLM to summarize some content, then take its response and send it to another LLM, asking it to generate a nice response for the user that includes the summarized data. Or you can split the user request across different LLMs to take advantage of their specializations, or even to reduce costs.
There are many different use cases for chains. This is really a great feature. And, I'd say LangChain is a must-have for most use cases.
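As a quick illustration (not part of the app we'll build below), here's a minimal sketch of a chain that pipes a summarization step into a second step that writes the user-facing reply. The prompts are made up, and it assumes OPENAI_API_KEY is set in your environment:
import {StringOutputParser} from '@langchain/core/output_parsers'
import {ChatPromptTemplate} from '@langchain/core/prompts'
import {ChatOpenAI} from '@langchain/openai'

const llm = new ChatOpenAI({modelName: 'gpt-3.5-turbo', temperature: 0})

// Step 1: summarize the raw content
const summarize = ChatPromptTemplate.fromTemplate('Summarize the following content:\n{content}')
  .pipe(llm)
  .pipe(new StringOutputParser())

// Step 2: turn the summary into a friendly reply for the user
const reply = ChatPromptTemplate.fromTemplate('Write a friendly answer for the user based on this summary:\n{summary}')
  .pipe(llm)
  .pipe(new StringOutputParser())

// Chain them: the output of the first step becomes the input of the second
const chain = summarize.pipe((summary) => reply.invoke({summary}))

const result = await chain.invoke({content: 'Some long article text...'})
console.log(result)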
Now that the basic concepts are covered, let's move to the code.
The resource you won't find on Google
I'd like to bring your attention to this because I found it to be very important. When working with open source projects, it's quite common for the documentation to miss some features or details of the API, and that's no different with LangChain. People are doing incredible work with it, but sometimes it's hard to find methods, attribute definitions, and other possibilities by just following the official documentation, especially the JS one. It covers a lot, but not everything. So I highly recommend you keep the API reference (https://api.js.langchain.com/index.html) at hand when you work with it. There are many cases where you'll try to find details in the official docs or by googling and won't find any relevant resource, but you will find it in the API reference. So keep that in mind. And thanks to the LangChain team for providing it; this is tremendously helpful. If you ever lose this link, you can get it from the references in their GitHub repo: https://github.com/langchain-ai/langchainjs?tab=readme-ov-file#-documentation
Building a RAG Chat
We'll reuse the same base project we did for the OpenAI chat app post. If you missed that, I recommend you take a look at it to understand the project details and follow the OpenAI API setups. You can also download the code from this GitHub repo: https://github.com/soutot/ai-series/tree/main/nextjs-chat-base
First, create a new directory and name it nextjs-langchain-openai. Then initialize the project just like we did before.
Once you get it up and running, we can start by installing the LangChain packages:
pnpm install @langchain/community @langchain/core @langchain/openai langchain
The main package is langchain, but we'll also need @langchain/community to use packages developed by the community, and @langchain/openai for the OpenAI API integrations. @langchain/core is the base package required by all the other packages. You can find more details in the official docs: https://js.langchain.com/docs/get_started/installation
Now we need to install our vector store dependency. For this sample, we'll use HNSWLib, as it's very simple to set up. More details here: https://js.langchain.com/docs/integrations/vectorstores/hnswlib
pnpm install hnswlib-node
Then, update your next.config.mjs as follows:
/** @type {import('next').NextConfig} */
const nextConfig = {
  webpack(config) {
    // Keep hnswlib-node out of the webpack bundle so it's loaded as a native module at runtime
    config.externals = [...(config.externals || []), 'hnswlib-node']
    // Stub out the Node.js 'fs' module so client-side bundling doesn't break
    config.resolve.alias['fs'] = false
    return config
  },
}

export default nextConfig
This is needed to prevent Next.js errors when importing hnswlib-node and using fs.
With everything set up, we'll now add a new endpoint to upload our file and create the embeddings.
Inside the api directory, create an embed/route.ts file. We'll start by importing the packages we need:
import {HNSWLib} from '@langchain/community/vectorstores/hnswlib'
import {OpenAIEmbeddings} from '@langchain/openai'
import {RecursiveCharacterTextSplitter} from 'langchain/text_splitter'
import {NextResponse} from 'next/server'
Then, we'll create our POST method and read the file it'll receive from the form data
export async function POST(request: Request) {
  const data = await request.formData()
  const file: File | null = data.get('file') as unknown as File

  if (!file) {
    return NextResponse.json({message: 'Missing file input', success: false})
  }

  const fileContent = await file.text()
We'll now initialize the text splitter. We're going to use RecursiveCharacterTextSplitter, as it's the recommended splitter to start with. You can find more splitters in the official docs: https://js.langchain.com/docs/modules/data_connection/document_transformers/
  const textSplitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 100,
    separators: ['\n'],
  })
This is how we split the file content into chunks, using the createDocuments method (note that it expects an array of texts):
  const splitDocs = await textSplitter.createDocuments([fileContent])
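If you're curious about what the splitter produces, each entry in splitDocs is a LangChain Document with a pageContent string and a metadata object. A quick way to inspect it while developing (purely for debugging, you don't need this in the final code):
  console.log(`Generated ${splitDocs.length} chunks`)
  console.log(splitDocs[0]?.pageContent.slice(0, 200)) // first 200 characters of the first chunk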
Then we initialize the embedding model. In this case, we'll use OpenAIEmbeddings, which by default uses text-embedding-ada-002. You can use different models, even ones other than OpenAI's. More details in the official docs: https://js.langchain.com/docs/integrations/text_embedding/openai
  const embeddings = new OpenAIEmbeddings({
    openAIApiKey: process.env.OPENAI_API_KEY,
  })
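Under the hood, this object turns text into an array of numbers. If you want to see what an embedding actually looks like, embedQuery returns the raw vector (1,536 dimensions for text-embedding-ada-002). Again, just for illustration; HNSWLib will call the embeddings for us in the next step:
  const sampleVector = await embeddings.embedQuery('What is RAG?')
  console.log(sampleVector.length) // 1536 dimensions for text-embedding-ada-002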
And store it in the HNSW vector store:
  const vectorStore = await HNSWLib.fromDocuments(splitDocs, embeddings)
  await vectorStore.save('vectorstore/rag-store.index')

  return new NextResponse(JSON.stringify({success: true}), {
    status: 200,
    headers: {'content-type': 'application/json'},
  })
}
Okay, the embedding processing is done. We're now able to generate the vector representation of our data source and store it in a vector store for later use.
Next, we have to update the frontend to send the document to be embedded.
Open page.tsx and add this new function to perform the upload:
async function uploadFile(file: File) {
  try {
    const formData = new FormData()
    formData.append('file', file)

    const response = await fetch('/api/embed', {
      method: 'POST',
      body: formData,
    })

    if (response.ok) {
      console.log('Embedding successful!')
    } else {
      const errorResponse = await response.text()
      throw new Error(`Embedding failed: ${errorResponse}`)
    }
  } catch (error) {
    throw new Error(`Error during embedding: ${error}`)
  }
}
Now, go to the handleFileSelected method and call this function, passing the selected file:
const handleFileSelected = async (event?: ChangeEvent<HTMLInputElement>) => {
  if (!event) return clearFile()

  const {files} = event.currentTarget
  if (!files?.length) {
    return
  }

  setIsUploading(true)
  const selectedFile = files[0]
  await uploadFile(selectedFile)
  setFile(selectedFile)
  setIsUploading(false)

  event.target.value = '' // clear the input as we handle the file selection in state
}
Cool, we're done here. Now, as the last part, let's update our main API route to retrieve a response based on the user prompt.
First, let's import all dependencies:
import {HNSWLib} from '@langchain/community/vectorstores/hnswlib'
import {BaseMessage} from '@langchain/core/messages'
import {ChatPromptTemplate} from '@langchain/core/prompts'
import {ChatOpenAI, OpenAIEmbeddings} from '@langchain/openai'
import {LangChainStream, StreamingTextResponse} from 'ai'
import {ConversationalRetrievalQAChain} from 'langchain/chains'
import {ChatMessageHistory, ConversationTokenBufferMemory} from 'langchain/memory'
import {NextResponse} from 'next/server'
import {z} from 'zod'
You can find more details of each dependency in the official docs: https://api.js.langchain.com/ and https://js.langchain.com/docs
Now, let's create a prompt template that'll be used to instruct the LLM how it should behave:
const QA_PROMPT_TEMPLATE = `You are a good assistant that answers questions. Your knowledge is strictly limited to the following piece of context. Use it to answer the question at the end.
If the answer can't be found in the context, just say you don't know. *DO NOT* try to make up an answer.
If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context.
Give a response in the same language as the question.
Context: """"{context}"""
Question: """{question}"""
Helpful answer in markdown:`
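If you want to preview what the final prompt looks like before it reaches the model, you can fill the placeholders by hand. This snippet is only an illustration with made-up values; the chain we'll set up below fills {context} and {question} for us:
const preview = await ChatPromptTemplate.fromTemplate(QA_PROMPT_TEMPLATE).format({
  context: 'LangChain is a framework for developing applications powered by LLMs.',
  question: 'What is LangChain?',
})
console.log(preview)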
Let's create the POST method and read the user prompt
export async function POST(request: Request) {
  const body = await request.json()

  const bodySchema = z.object({
    prompt: z.string(),
  })

  const {prompt} = bodySchema.parse(body)
Now let's prepare the retriever to read the data from the vector store
  try {
    const embeddings = new OpenAIEmbeddings({
      openAIApiKey: process.env.OPENAI_API_KEY,
    })

    const vectorStore = await HNSWLib.load('vectorstore/rag-store.index', embeddings)
    const retriever = vectorStore.asRetriever()
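By default, the retriever returns the 4 most similar chunks for a query. If you want to tune that or debug what's being retrieved, you can pass a different k to asRetriever and fetch the documents directly. A quick sketch with an example query:
    const debugRetriever = vectorStore.asRetriever(6) // retrieve the 6 most similar chunks instead of the default 4
    const docs = await debugRetriever.getRelevantDocuments('your test question here')
    console.log(docs.map((doc) => doc.pageContent))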
We now initialize the LLM that'll be used to generate the response. Pay attention to temperature: 0. For RAG, we want the LLM to stick closely to what's in the document, so lower temperatures are better to prevent it from introducing terms or concepts that differ from what's retrieved from the store.
    const {stream, handlers} = LangChainStream()

    const llm = new ChatOpenAI({
      temperature: 0,
      openAIApiKey: process.env.OPENAI_API_KEY,
      streaming: true,
      modelName: 'gpt-3.5-turbo',
      callbacks: [handlers],
    })
And, finally, perform the LLM request and send back the streaming response
    const chain = ConversationalRetrievalQAChain.fromLLM(llm, retriever, {
      returnSourceDocuments: true,
      qaChainOptions: {
        type: 'stuff',
        prompt: ChatPromptTemplate.fromTemplate(QA_PROMPT_TEMPLATE),
      },
    })

    // Don't await: the chain streams tokens through the handlers while we return the stream below
    chain.invoke({question: prompt, chat_history: ''})

    return new StreamingTextResponse(stream)
  } catch (error) {
    // If the vector store hasn't been created yet or anything else fails, return a generic error
    console.error(error)
    return NextResponse.json({message: 'Something went wrong', success: false}, {status: 500})
  }
}
Note that we're passing chat_history: '' since we're not handling it yet. So, for now, the LLM won't have the context of previously sent messages. This is good enough for our testing purposes; I can walk through how to work with history and memory in a future post.
Now, just run
pnpm run dev
and, if everything's correct, you'll be able to see the app running.
Attach a file so it'll be uploaded to the backend and the vector store will be created from the embedded document. Then, ask a question that can be answered by the content of the document you've just uploaded.
In the example below, I used Part I of this AI series:
Well, that's just the beginning. There are so many things that can be done, even by changing this simple example.
Hope someone finds it helpful.
See you in the next part.
GitHub code repository: https://github.com/soutot/ai-series/tree/main/nextjs-chat-rag