Leveraging the capabilities of the GPT language model within a Node.js framework can substantially improve the ease of data access and engagement. This article provides a comprehensive guide on employing GPT alongside diverse vector stores and JavaScript libraries to build applications that are more dynamic and efficient.
The focus of this article is on working with local PDF documents, thereby illustrating the transformative potential of AI in information processing.
A pinch of theory
Before delving into the coding aspect, let's gain a deeper comprehension of the theoretical elements surrounding language models, which will subsequently facilitate an understanding of GPT adaptation to a specific context.
Two predominant techniques exist for customizing a model:
- Fine-tuning: further train the pre-existing model on task-specific labeled data, updating its parameters so that it aligns better with the desired context.
- Embeddings: rather than updating the whole model, inject additional elements (embeddings) at runtime. This strategy introduces new snippets of knowledge alongside the existing model.
Fine-tuning, although effective, is a laborious and costly process. It requires amassing and processing a large dataset, modifying the model's architecture by integrating specific layers, and subsequently training, assessing, and refining the model.
Embedding, on the other hand, is less resource-intensive and proves to be exceptionally beneficial when labeled data is limited. This approach aligns perfectly with my situation, as my goal is to engage with local PDF documents while capitalizing on the NLP capacities of a Large Language Model (LLM) like GPT.
What are embeddings?
Simply put, embeddings are numerical representations that encode the semantic meaning of words or sentences. They allow machine learning models to process and understand textual data, enabling a wide range of NLP applications.
Raw words and sentences are not immediately usable by machine learning models, which predominantly function on numerical data. This issue is mitigated by embeddings, as they transform textual information into a vector - a numerical format that algorithms can efficiently process and compare.
Vectors... like in Math?
Indeed, vectors correspond to the mathematical constructs you might have encountered while studying linear algebra. They represent points in a multidimensional space and contain real-valued numbers in each dimension. Language Models generally employ large-dimensional vector spaces to represent each word or token within the model's vocabulary. These dimensionalities are indicative of each model's total capacity and expressive capabilities. For instance, GPT-3 utilizes up to 12,288 dimensions and PaLM up to 18,432.
You bet we speak of LARGE Language Models!
The vectors used for embeddings are of smaller dimensionalities, primarily for efficiency and computational considerations. By implementing a lower-dimensional representation, a model can minimize its memory footprint and computational complexity. Furthermore, lower-dimensional embeddings can aid in generalization and diminish the likelihood of overfitting.
The dimension count varies based on the particular embedding model. For example, OpenAI's text-embedding-ada-002 model produces vectors of 1536 dimensions.
Embeddings by OpenAI
Typically, embeddings are derived from extensive text data using techniques like Transformer-based models such as BERT or GPT. These models assess the co-occurrence of words in the training data and learn to represent words or sentences in a manner that maintains their semantic connections. Clearly, this is a pre-training technique (the 'P' in GPT), a resource-intensive task executed by OpenAI during the models' training phase.
Upon learning the embeddings, they can be applied as input features for a variety of NLP tasks, such as chatbots, sentiment analysis, machine translation, and more. By utilizing embeddings, machine learning models can understand and operate on textual data more effectively.
A particularly interesting application is using embeddings to gauge similarities between words or sentences. For instance, you can calculate the cosine similarity between two word embeddings to discern their semantic similarity (yet again, this involves linear algebra).
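To make this concrete, here is a minimal sketch of a cosine similarity computation in plain JavaScript. The vectors and their values are made up for readability; real embeddings have hundreds or thousands of dimensions.
// cosine-similarity.js: compare two embedding vectors (illustrative values only)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Tiny 3-dimensional vectors for the sake of the example
const king = [0.9, 0.4, 0.1];
const queen = [0.85, 0.45, 0.12];
const banana = [0.1, 0.2, 0.95];
console.log(cosineSimilarity(king, queen));  // close to 1: semantically similar
console.log(cosineSimilarity(king, banana)); // much lower: unrelated
A result close to 1 indicates strong semantic similarity, while values near 0 (or negative) indicate unrelated meanings.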
Generation and preservation of custom Embeddings
As mentioned previously, the objective is to inject new embeddings at runtime when invoking the OpenAI API. But how do we generate these new vectors that embody the context of our PDF files? GPT's Embeddings API serves this purpose! It delivers a vector in GPT's multidimensional space when we supply textual data – in our scenario, the content of our local documents.
These embeddings are unique to us; they neither alter nor persistently update GPT's model, as is the case with fine-tuning. Hence, we must store them locally for future use, in Vector Stores.
Vector Stores fall into three categories, each with differing functional scopes and uses:
- Low-level libraries: integrated into our code for vector storage, indexing, and searching with solutions such as FAISS, HNSWLib, or Annoy.
- Specialized solutions: databases with additional vector-based functionalities like Weaviate, Pinecone, or Chroma.
- Backend platforms: comprehensive solutions offering more than just vectors (APIs, authentication, and other data types), like Supabase, Elastic, or more recently MongoDB.
For our use case, we require such a component in the architecture to persist our embeddings and to retrieve and select them when needed.
Augmenting input with most relevant Vectors
Before sending an input to GPT, we first utilize the Embeddings API to obtain the vector of this input. Then, by querying our local vector store, we select the vectors closest to it (representing the most contextually relevant passages sourced from our local PDF files). This proximity between vectors is computed with linear algebra (typically a cosine similarity or nearest-neighbor search), a process often called similarity search. The most relevant vectors are subsequently injected into the final request sent to GPT's Completions API.
This method allows us to tailor GPT to our specific domain knowledge or context without modifying the entire model's architecture or parameters. In essence, although we might have a vast array of vectors in storage, we only use the most relevant ones to enhance each input sent to the pre-trained model.
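To make the flow concrete, here is a rough, illustrative outline in JavaScript. It reuses the cosineSimilarity function sketched above; embedText and callLLM stand for hypothetical helpers wrapping the Embeddings and Chat Completions APIs, and the whole thing deliberately ignores token limits and error handling.
// retrieval-sketch.js: outline of retrieval-augmented prompting (hypothetical helpers)
async function answerWithContext(question, storedChunks, embedText, callLLM) {
  // 1. Embed the user question
  const questionVector = await embedText(question);
  // 2. Rank the stored chunks by similarity to the question
  const ranked = storedChunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(chunk.embedding, questionVector) }))
    .sort((a, b) => b.score - a.score);
  // 3. Keep only the top chunks as context
  const context = ranked.slice(0, 3).map((c) => c.text).join("\n---\n");
  // 4. Send the augmented prompt to the LLM
  return callLLM(`Answer using this context:\n${context}\n\nQuestion: ${question}`);
}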
Welcoming Node.js to the Party!
Now, let's venture into coding and utilizing the GPT API via Node.js.
Why not Python? Mainly because JavaScript tutorials are less abundant, and because Node.js pairs naturally with a JavaScript UI for user interaction (uploading new PDF documents, submitting initial inputs, displaying GPT-generated completions, and so on).
However, I will start by focusing on backend operations. There are several methods for interacting with GPT APIs, handling PDFs, and managing embeddings with varying degrees of abstraction.
Abstraction level 0 would entail directly invoking the APIs via curl, for example, but we seek to code and automate. Thus, let's aspire beyond a mere cot for a bed!
Level 1 (Single Bed): OpenAI Library
We'll commence by employing the Node.js library provided by OpenAI, representing abstraction level 1. A single bed, already an improvement over the cot.
The primary objective here is to call the OpenAI APIs to better discover and understand how they work.
We begin by installing the OpenAI library.
npm install openai
To access OpenAI APIs, a paid account and an API key are needed. We rely on dotenv to store and use this key.
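Assuming the key lives in a .env file at the project root, the setup looks something like this (key value truncated, of course).
npm install dotenv
# .env: OpenAI API key loaded by dotenv
OPENAI_API_KEY="sk-..."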
Note that the Completions API is deprecated, so we use the Chat Completions API and the newly launched gpt-3.5-turbo model.
// Level1-ChatCompletion.js: simply connect to OpenAI Chat Completions API
const dotenv = require("dotenv");
dotenv.config();
const { Configuration, OpenAIApi } = require("openai");
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
async function getCompletion() {
  const completion = await openai.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages: [
      {role: "system", content: "You are a helpful assistant."},
      {role: "user", content: "Say hello in Japanese"}
    ],
  });
  console.log(completion.data.choices[0].message);
}
getCompletion();
Let's put this code to the test.
node Level1-ChatCompletion.js
{
role: 'assistant',
content: 'こんにちは (Konnichiwa)'
}
The Chat Completions API leverages distinct roles to facilitate multi-turn conversations seamlessly.
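For instance, a follow-up question simply carries the conversation history by appending the assistant's reply and the new user message to the messages array. A minimal sketch with the same openai client (to be placed inside an async function):
// Multi-turn sketch: the conversation history travels in the messages array
const followUp = await openai.createChatCompletion({
  model: "gpt-3.5-turbo",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Say hello in Japanese" },
    { role: "assistant", content: "こんにちは (Konnichiwa)" },
    { role: "user", content: "Now say goodbye" },
  ],
});
console.log(followUp.data.choices[0].message.content);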
Next, let's convert a local PDF file into a vector.
We require a library to parse the PDF file, such as pdf-parse.
npm install pdf-parse
Here we present a highly simplified code for illustrative purposes, bypassing error handling and token limit considerations.
// Level1-Embeddings.js: simply connect to OpenAI Embeddings API
const dotenv = require("dotenv");
dotenv.config();
const { Configuration, OpenAIApi } = require("openai");
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});
const openai = new OpenAIApi(configuration);
const fs = require("fs");
const pdfParse = require("pdf-parse");
async function extractTextFromPdf(file) {
  const dataBuffer = fs.readFileSync(file);
  const pdfData = await pdfParse(dataBuffer);
  return pdfData.text;
}
async function getEmbedding() {
  const pdfContent = await extractTextFromPdf("./MyLocalFile.pdf");
  const response = await openai.createEmbedding({
    model: "text-embedding-ada-002",
    input: pdfContent,
  });
  console.log(response.data.data);
  console.log(response.data.usage.total_tokens);
}
getEmbedding();
The response shows what a vector looks like.
node Level1-Embeddings.js
[
{
object: 'embedding',
index: 0,
embedding: [
0.0043319045, -0.0056783073, 0.01652873, -0.017465359,
-0.004114965, 0.017134784, -0.00021855372, -0.0050102714,
-0.005781612, -0.033360485, 0.02173528, 0.0064599784,
-0.020247694, -0.01245853, 0.026652576, -0.00377406,
0.02253417, -0.008897967, 0.0025016922, -0.022616813,
-0.0023295179, -0.003553677, -0.0023742833, 0.011873138,
-0.0058573685, 0.005667977, 0.01816783, -0.025468018,
0.010234038, 0.008195495, 0.00047778361, -0.023209091,
0.01834689, 0.00377406, -0.014145838, -0.043938875,
0.028512059, -0.002849484, 0.01977938, -0.0040288777,
0.014986048, 0.03239631, -0.014917179, -0.016335895,
-0.026198037, 0.027795814, 0.014283578, -0.024490068,
0.011301519, 0.029366044, 0.012740896, 0.027561657,
-0.02173528, -0.014035647, -0.016294573, -0.008395217,
-0.031294394, 0.018374437, 0.009242315, -0.003753399,
-0.00163996, -0.0115976585, -0.025206313, 0.009772612,
-0.012403434, 0.0006138013, -0.035096005, -0.00838833,
-0.004875975, -0.0076101026, 0.025550662, 0.022010759,
0.026501063, 0.0007536929, 0.0040977476, -0.0095660025,
0.018457081, -0.0030750325, 0.017258748, 0.003412494,
0.01820915, -0.003245485, -0.031211752, 0.010840092,
0.016928174, 0.015068692, 0.009187219, 0.0035140768,
-0.020426756, 0.0033539548, 0.014779439, 0.01911823,
0.007575668, 0.039971977, -0.033938993, -0.010833205,
-0.008532957, 0.0005638707, -0.009249202, -0.012975053,
... 1436 more items
]
}
]
4106
The input PDF file was broken down into 4106 tokens, and the corresponding vector has 1536 dimensions, as expected from OpenAI's text-embedding-ada-002 embedding model.
However, the model has a maximum context length, expressed in tokens. The text we embed from our local documents must therefore be split into chunks, so that the best contextual information is retained while adhering to this maximum length. This calls for some semantic search hacks to select only the text chunks most pertinent to the prompt. This is where things get a tad more intricate, and it proves beneficial to raise the abstraction level and let frameworks carry out the heavy lifting on our behalf.
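Just to give an idea of what that heavy lifting involves, here is a deliberately naive, character-based chunking sketch reusing the openai client from the previous example. It ignores token-aware splitting, overlapping chunks, and other refinements a real framework would handle.
// naive-chunking.js: split the extracted text before embedding (illustrative only)
function splitIntoChunks(text, chunkSize = 1000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}
async function embedChunks(text) {
  const chunks = splitIntoChunks(text);
  const response = await openai.createEmbedding({
    model: "text-embedding-ada-002",
    input: chunks, // the Embeddings API also accepts an array of inputs
  });
  // One embedding per chunk, stored alongside its source text for later retrieval
  return response.data.data.map((item, i) => ({ text: chunks[i], embedding: item.embedding }));
}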
Let's therefore transition to a King-size bed and welcome some new faces to our vibrant party!
Level 2 (King-size bed): LangChain and HNSWLib
We are delighted to introduce LangChain, a comprehensive framework for developing applications powered by large language models.
LangChain simplifies the integration of language models with other data sources and facilitates the creation of interactive applications. Its features make it user-friendly, modular, and extensible.
The framework acts as a proxy, decoupling our code from the large language models (LLMs). Given the current stiff competition amongst LLMs, it's a good idea to use LangChain to ensure smooth transitions between different LLMs, if needed, without causing disruptions.
LangChain also offers a range of off-the-shelf chains to automate common tasks such as chatbots, summarization, and question answering. It is available in two versions: Python and JavaScript, making it an ideal choice for our needs.
We start with a simple Completions API call, similar to our previous attempt.
// Level2-LC-Completion.js: simple use of LangChain for completion
const dotenv = require("dotenv");
dotenv.config();
const { OpenAI } = require("langchain/llms/openai");
const llm = new OpenAI({ openAIApiKey: process.env.OPENAI_API_KEY, });
async function getCompletion() {
  const res = await llm.call(
    "Say hello in Japanese"
  );
  console.log(res);
}
getCompletion();
Just like before, LangChain requires OpenAI credentials and provides high-level functions to communicate with the LLM.
This is the result.
node Level2-LC-Completion.js
Konnichiwa!
As you may have noticed, the LLM's responses aren't always identical, even for the same question. We can verify this by running our Node.js script again.
node Level2-LC-Completion.js
こんにちは!
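This variability comes from sampling. If more deterministic answers are preferred, the temperature can be lowered when instantiating the LLM wrapper, for example:
// A temperature of 0 makes completions (almost) deterministic
const llm = new OpenAI({ openAIApiKey: process.env.OPENAI_API_KEY, temperature: 0 });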
But the true power of LangChain resides in its chains. A chain is a sequence of modular components (or other chains) that work together to meet specific requirements. You can either create your own or use the pre-defined chains provided by the framework.
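As a small taste before our main course, here is a minimal custom chain combining a prompt template with the LLM, based on LangChain's PromptTemplate and LLMChain building blocks (to be run inside an async function, reusing the llm instance created earlier):
// Minimal chain sketch: a prompt template piped into the LLM
const { PromptTemplate } = require("langchain/prompts");
const { LLMChain } = require("langchain/chains");
const prompt = new PromptTemplate({
  template: "Say hello in {language}",
  inputVariables: ["language"],
});
const chain = new LLMChain({ llm, prompt });
const res = await chain.call({ language: "Japanese" });
console.log(res.text);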
And the best part? There's a chain designed just for our use case, named Retrieval QA. Let's put it to the test.
Our first step is to install the lightweight HNSWLib, an in-memory vector store that can also be saved to a file.
npm install hnswlib-node
Here's the revised code, which now employs LangChain's components to create a new Retrieval QA chain. This chain communicates with OpenAI and uses the contextual content from a local PDF document to answer questions.
// Level2-LC-ChatPDF.js: use of LangChain with a local PDF file
const dotenv = require("dotenv");
dotenv.config();
const { OpenAI } = require("langchain/llms/openai");
const { RetrievalQAChain } = require("langchain/chains");
const { HNSWLib } = require("langchain/vectorstores/hnswlib");
const { OpenAIEmbeddings } = require("langchain/embeddings/openai");
const { PDFLoader } = require("langchain/document_loaders/fs/pdf");
async function getPDFCompletion() {
  // Initialize the LLM to use to answer the question
  const llm = new OpenAI({ openAIApiKey: process.env.OPENAI_API_KEY });
  // Load our local PDF document
  const loader = new PDFLoader("TechSquad.pdf");
  const docs = await loader.load();
  // Create a HNSWLib vector store from the embeddings
  const vectorStore = await HNSWLib.fromDocuments(
    docs,
    new OpenAIEmbeddings()
  );
  // Create a chain that uses the OpenAI LLM and our vector store
  const chain = RetrievalQAChain.fromLLM(
    llm,
    vectorStore.asRetriever()
  );
  const res = await chain.call({
    query: "What are the ambition and objectives of the TechSquad?"
  });
  console.log({ res });
}
getPDFCompletion();
Here is the result of its execution.
node Level2-LC-ChatPDF.js
{
res: {
text: "The ambition and objectives of the TechSquad are to amplify Worldline's technical excellence to the world and to empower cooperation, international acceleration, and elected technical voicing of expertise."
}
}
Impressive, isn't it? The content from the PDF has influenced the LLM, which has been able to provide a meaningful response about the TechSquad initiative of Worldline, a topic that GPT wouldn't have known about.
LangChain accomplishes all this heavy lifting behind the scenes. Fit for a king!
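One nicety worth noting: since HNSWLib can persist its index to disk, the embeddings don't have to be recomputed (and paid for) on every run. A small sketch, assuming LangChain's save and load helpers:
// Persist the vector store to a local directory...
await vectorStore.save("./vectorstore");
// ...and reload it later without re-embedding the PDF
const reloadedStore = await HNSWLib.load("./vectorstore", new OpenAIEmbeddings());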
Level 3 (Hotel Room): EmbedChain and Chroma
Let's elevate our discussion to EmbedChain, a framework that makes it easy to create large language model (LLM) powered bots capable of handling any dataset. The beauty of EmbedChain lies in its abstraction of the entire process from loading the dataset, chunking it, creating embeddings, to storing them in a vector database.
EmbedChain employs a dockerized Chroma vector store, and, of course, it's built on the foundation provided by LangChain! Given its higher level of abstraction, EmbedChain fits into our Level 3 solution category.
Let's explore this new level of abstraction. Our first step is to install the necessary components.
npm install embedchain
Chroma is an open-source embedding database. It offers developers a straightforward API and seamlessly integrates with LangChain.
We can install Chroma in a Docker container using the following commands.
git clone https://github.com/chroma-core/chroma.git
cd chroma
docker-compose up -d --build
With Chroma running locally, it's now ready to function as a vector store for EmbedChain. Now let's see how easy it is to set up a bot for question-answering with our local PDF document.
// Level3-EmbedChain-bot.js: use of EmbedChain with a local PDF file
const dotenv = require("dotenv");
dotenv.config();
const { App } = require("embedchain");
async function askBot() {
  const myBot = await App();
  // Embed local resource
  await myBot.add(
    "pdf_file",
    "./TechSquad.pdf"
  );
  const myPrompt = "What are the Core Team activities within the TechSquad?";
  const res = await myBot.query(myPrompt);
  console.log(res);
}
askBot();
When executed, the bot responds with:
node Level3-EmbedChain-bot.js
Successfully saved ./TechSquad.pdf. Total chunks count: 3536
The Core Team activities within the TechSquad include strategic vision and management, operational activities, and communication coordination.
Using EmbedChain is even simpler than LangChain: with just a few lines of code, we're good to go! Chroma can store and maintain many embeddings, derived from several local PDFs or other sources such as remote files, text, and Markdown formats. For more information, take a look at the project's documentation. Despite its relative novelty, EmbedChain already shows great promise.
EmbedChain marks our Level 3 on the abstraction scale, building on top of the lower-level LangChain framework. In this category, we also have Flowise, an open-source solution that allows for customizing LLM flow from a drag-and-drop UI.
As we ascend the abstraction ladder, we leave behind the King-sized bed for a Hotel Room, where a broader range of services are offered. Let's continue to climb and move into what can be considered a Hotel Suite, complete with almost all the accommodations required by our initial use case.
Level 4 (Hotel Suite): 7-docs and Supabase
Several Level 4 solutions exist today, offering all-in-one web applications to ingest content, generate embeddings, store and search corresponding vectors, and facilitate contextual interactions with GPT APIs. They also deliver a user interface, enabling end users to engage in a chat-like interaction with their documents via prompts similar to ChatGPT. Indeed, Chat is the new UX! And no real party without UX/UI...
There are numerous commercial SaaS solutions for this purpose, such as ChatPDF, HiPDF, PDF.ai, CustomGPT, or AskYourPDF. Simultaneously, there are open-source solutions that can be used in SaaS mode or containerized for deployment, such as Markprompt (focused on Markdown documents), ChainDesk (formerly DataBerry), and the very promising Quivr.
Here, we illustrate a Level 4 solution with 7-docs, an open-source project compatible with Node.js (and Deno). Note that 7-docs chooses not to use external frameworks like LangChain or EmbedChain; rather, it directly implements its own use of the OpenAI APIs.
In this example, we will also utilize the Supabase Backend Platform to demonstrate another type of vector store.
Let's start by installing 7-docs, which can be done by copying this prepared template on GitHub: https://github.com/7-docs/template-next-supabase/generate. After copying, we clone the newly created repository and install its dependencies.
git clone https://github.com/raphiki/7-docs.git
cd 7-docs
npm install
On the Supabase side, I have already created a free account and an organization that hosts a project for this demonstration. We need the project's URL and API key.
These details, along with our OpenAI API key, are needed to configure 7-docs using environmental variables in the .env.local file.
cp .env.example .env.local
vi .env.local
# .env.local: fill your APIs key and Supabase URL
OPENAI_API_KEY="sk-..."
SUPABASE_URL="https://xxxxxxxxxxxxxxxxxxxx.supabase.co"
SUPABASE_API_KEY="ey..."
We also duplicate this file to configure the 7-docs CLI, which will be used to ingest the documents.
cp .env.local .env
Although it's possible to configure the NextJS app through the config.ts file, we'll stick to the default parameters for this demonstration.
Next, we generate the database schema to be created in Supabase.
npx 7d supabase-create-table --namespace my-namespace
Choose or create your project in https://app.supabase.com and paste the following in the SQL Editor:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS my_namespace (id uuid PRIMARY KEY, metadata jsonb, embedding vector(1536));
CREATE OR REPLACE function match_my_namespace (
query_embedding vector(1536),
similarity_threshold float,
match_count int
)
returns TABLE (id uuid, metadata jsonb, similarity float)
LANGUAGE plpgsql
AS $$
BEGIN
return query
SELECT my_namespace.id, my_namespace.metadata, 1 - (my_namespace.embedding <=> query_embedding) AS similarity
FROM my_namespace
WHERE 1 - (my_namespace.embedding <=> query_embedding) > similarity_threshold
ORDER BY my_namespace.embedding <=> query_embedding
LIMIT match_count;
END; $$;
As directed by 7-docs, we use the Supabase console and SQL Editor to execute the SQL statements and create our table.
Once the table is ready, we can ingest a PDF file using the 7-docs CLI.
npx 7d ingest --files '../PDF/TechSquad.pdf' --namespace my-namespace
✔ Fetching files
✔ Creating and upserting vectors
ℹ Fetched 1 file(s) from fs, used 873 OpenAI tokens, upserted 2 vectors to Supabase
In the Supabase console, we can see the two inserted vectors.
You'll notice that 7-docs also saves some metadata in Supabase, which maintains the textual context of the embeddings.
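Behind the scenes, retrieval boils down to calling the match_my_namespace function we created earlier. We don't need to do this ourselves since the 7-docs app handles it, but a rough sketch of such a query with the supabase-js client could look like this (the threshold and count values are arbitrary):
// query-supabase.js: illustrative similarity search against the my_namespace table
const { createClient } = require("@supabase/supabase-js");
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_API_KEY);
async function findRelevantChunks(queryEmbedding) {
  // queryEmbedding is a 1536-dimension vector obtained from the Embeddings API
  const { data, error } = await supabase.rpc("match_my_namespace", {
    query_embedding: queryEmbedding,
    similarity_threshold: 0.75,
    match_count: 5,
  });
  if (error) throw error;
  return data; // [{ id, metadata, similarity }, ...]
}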
With our embeddings ready, we can start the 7-docs NextJS application (in dev mode for simplicity).
npm run dev
> template-next-supabase@0.1.0 dev
> next dev
ready - started server on 0.0.0.0:3000, url: http://localhost:3000
info - Loaded env from /home/s239922/Dev/7-docs/.env.local
info - Loaded env from /home/s239922/Dev/7-docs/.env
event - compiled client and server successfully in 902 ms (167 modules)
wait - compiling...
event - compiled successfully in 78 ms (134 modules)
wait - compiling / (client and server)...
event - compiled client and server successfully in 961 ms (992 modules)
wait - compiling /api/completion (client and server)...
event - compiled successfully in 193 ms (140 modules)
Once the application is running locally, we can navigate to http://localhost:3000 and begin interacting with our PDF.
Notice that the bot displays the sources used to generate responses to user prompts. This showcases the use of metadata.
This concludes our test and demo with 7-docs. Keep in mind that while 7-docs is still a young project, it can already ingest multiple files at once, including those from GitHub and web pages, and supports the Markdown format. Like many things in Generative AI on the web today, it has a lot of potential.
Now it's time to enjoy the PJ party!
I trust that this article has offered you insight into the potential of the GPT language model within a Node.js framework, specifically targeting the scenario of managing local PDF documents.
Through the power of embedding techniques, tailoring language models becomes feasible even in scenarios with limited labeled data. Furthermore, the adoption of abstraction-driven solutions like LangChain, EmbedChain, or 7-docs paves the way for revolutionary modes of document interaction.
So, step into this new world of information processing, where PDFs are not static objects but interactive resources. Revel in the possibilities and harness the power of language models to transform your data access, understanding, and user engagement. Enjoy the PDF Jam party!