Introduction
As the amount of information available online continues to grow at an unprecedented pace, traditional search engines are struggling to keep up. This is where semantic search comes in: a technique that aims to understand the intent behind a user's query and provide more accurate results. In this tutorial, we will explore how to implement semantic search using Supabase and PostgreSQL for the database, Next.js for the frontend, and OpenAI's GPT-3 models for the natural language processing. By the end of this tutorial, you will have a fully functional semantic search engine that can provide relevant and accurate results to your users.
One of the key components of implementing semantic search is the ability to represent textual data in a way that allows for meaningful comparison and analysis. This is where vector representations come in—they allow us to transform words or phrases into high-dimensional vectors that capture their meaning and context.
In PostgreSQL, the `pgvector` extension provides a way to store and manipulate vector data within the database. With `pgvector`, we can represent our text data as vectors and perform vector operations such as cosine similarity to compare how closely different documents or search queries relate to one another.
To implement semantic search with Supabase and PostgreSQL, we first create a table to store our documents and their vector representations using the `vector` data type. Then, when a user submits a search query, we convert the query into a vector using the same embedding model as the documents, and perform a cosine similarity search to find the most relevant results.
By using `pgvector` in conjunction with Supabase and Next.js, we can create a powerful semantic search engine that provides accurate and relevant results to our users. This approach is especially useful when dealing with large amounts of text data, where traditional keyword-based search engines may struggle to provide meaningful results.
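To build some intuition for the comparison at the heart of this approach: cosine similarity measures the angle between two vectors, independent of their magnitude. A minimal sketch in plain JavaScript (pgvector's `<=>` operator computes the cosine *distance*, i.e. `1 - similarity`, natively inside the database, so you'd never actually do this in application code):

```javascript
// Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
// Identical directions score 1; orthogonal vectors score 0.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

In our case the vectors have 1,536 dimensions rather than 2, but the math is exactly the same.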
This tutorial will be using code snippets from my portfolio site (which is a public repository). Links are at the bottom.
The Process
The secret to making this work is OpenAI's embeddings API, which converts each text snippet into a very large array of numbers called a vector (1,536 of them, to be precise). A chunk size of about 200 tokens gives the embeddings API enough information to match relevant results to the query, without overloading the answer request we later send to ChatGPT, which interprets that information.
The process is as follows:
- We parse the document text into chunks of approximately 200 tokens each
- We run each chunk through OpenAI's embeddings API to get its semantic vector
- We store each chunk, along with its vector, in our PG table
That's all done in advance. The continuous process is as follows:
- A user asks a question on the front-end form
- The question is converted to a vector via OpenAI on-the-fly
- We run a vector search query against our database to get the top 3 results
- We concatenate those results and feed them and the original question to ChatGPT
- ChatGPT gives us the answer
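The query-time steps above can be sketched as one composed function. The step names here (`embedQuery`, `searchChunks`, `askChatGpt`) are placeholders for the handlers we build below, not part of any real SDK; they're injected as parameters so the flow can be read and tested independently of OpenAI and Supabase:

```javascript
// Query-time pipeline sketch: embed the question, fetch the nearest chunks,
// then ask the chat model to answer using only those chunks as context.
async function answerQuestion(question, { embedQuery, searchChunks, askChatGpt }) {
  const vector = await embedQuery(question);    // question -> embedding vector
  const chunks = await searchChunks(vector, 3); // top 3 most similar chunks
  const context = chunks.map((c) => c.content).join(' ');
  return askChatGpt(question, context);         // answer grounded in the chunks
}

// Exercise the flow with stand-in implementations:
const fakes = {
  embedQuery: async () => [0.1, 0.2],
  searchChunks: async () => [{ content: 'IBM 701' }, { content: 'core memory' }],
  askChatGpt: async (q, ctx) => `Q: ${q} | CTX: ${ctx}`,
};
answerQuestion('Who built it?', fakes).then(console.log);
// → "Q: Who built it? | CTX: IBM 701 core memory"
```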
Setting Up Supabase
- First, we need to create a Supabase project and database. If you haven't done so already, sign up for a free Supabase account and create a new project.
- Once you have created your project, navigate to the SQL editor and connect to your database. You can do this by clicking on the "SQL" button in the left sidebar, and then clicking "Connect to Database".
- In the SQL editor, create a new table to store your documents. For this example, we will create a table called `vsearch` with the following columns:
-- RUN 1st
create extension vector;
-- RUN 2nd
create table vsearch (
  id bigserial primary key,
  document_title text,
  page_no int2,
  content text,
  content_length bigint,
  content_tokens bigint,
  embedding vector(1536)
);
-- RUN 3rd after running the scripts
create or replace function vector_search (
  query_embedding vector(1536),
  similarity_threshold float,
  match_count int
)
returns table (
  id bigint,
  document_title text,
  page_no int2,
  content text,
  content_length bigint,
  content_tokens bigint,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    vsearch.id,
    vsearch.document_title,
    vsearch.page_no,
    vsearch.content,
    vsearch.content_length,
    vsearch.content_tokens,
    1 - (vsearch.embedding <=> query_embedding) as similarity
  from vsearch
  where 1 - (vsearch.embedding <=> query_embedding) > similarity_threshold
  order by vsearch.embedding <=> query_embedding
  limit match_count;
end;
$$;
-- RUN 4th
create index on vsearch
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
Here is a breakdown of what each section of the code is doing:
- `create extension vector;` creates the `vector` extension, which provides support for vector data types and operations in PostgreSQL.
- `create table vsearch ...` creates a new table called `vsearch` with columns for the document title, page number, content, content length, content tokens, and embedding vector.
- `create or replace function vector_search ...` defines a function called `vector_search` that takes in a query embedding vector, a similarity threshold, and a match count, and returns a table of document information with a similarity score. This function is written in PL/pgSQL, a procedural language for PostgreSQL.
- `create index on vsearch ...` creates an index on the `embedding` column using the IVFFlat method with 100 lists. This index speeds up similarity searches by clustering vectors together in groups, reducing the number of comparisons needed to find the most similar vectors.
Parsing A Document Into the PG-Vector Table
Ok, hopefully everything went to plan and you have your table and the RPC function created. If not, just ask ChatGPT—he'll know what to do. 🤣
In my experiment, I used a PDF file, a short book about early IBM programmers titled *True Hackers*, as my body of information. I wrote several scripts that I run from my dev environment to pre-process the information and then insert it into the table.
Here's the script for parsing the PDF:
import fs from 'fs';
import { encode } from 'gpt-3-encoder';
import PDF from 'pdf-scraper';

const inFile = 'data/hackers.pdf';
const outFile = 'data/hackers.json';
const dataBuffer = fs.readFileSync(inFile);
const tokensPerChunk = 200;
const documentTitle = 'True Hackers';

PDF(dataBuffer).then(function (data) {
  console.log(`Successfully parsed ${data.numpages} pages from ${inFile}`);
  const output = [];
  // Iterate over PDF pages
  for (let pageIndex = 0; pageIndex < data.pages.length; pageIndex++) {
    console.log(`Parsing page #${pageIndex + 1}...`);
    // Append a chunk (and its token count) to the output buffer
    const pushChunk = (chunk) => {
      const trimmed = chunk.trim();
      if (!trimmed) return; // skip empty chunks
      const tokenLength = encode(trimmed).length;
      console.log('Creating chunk of token length:', tokenLength);
      output.push({
        documentTitle,
        pageNo: pageIndex + 1,
        tokens: tokenLength,
        content: trimmed
      });
    };
    // Normalize all whitespace to a single space
    const content = data.pages[pageIndex].replace(/\s+/g, ' ');
    // If page content is longer than the token limit, parse it into sentences
    if (encode(content).length > tokensPerChunk) {
      // Split content into sentences
      const sentences = content.split('. ');
      let chunk = '';
      for (const sentence of sentences) {
        const sentenceTokenLength = encode(sentence).length;
        const chunkTokenLength = encode(chunk).length;
        // If adding this sentence would exceed the token limit, flush the chunk
        if (chunkTokenLength + sentenceTokenLength > tokensPerChunk) {
          pushChunk(chunk);
          chunk = '';
        }
        // The split ate the sentence's period; restore it if the sentence
        // ends in a letter or digit, otherwise just append a space
        if (sentence && sentence[sentence.length - 1].match(/[a-z0-9]/i)) {
          chunk += sentence + '. ';
        } else {
          chunk += sentence + ' ';
        }
      }
      // Append the remaining text
      pushChunk(chunk);
    } else {
      pushChunk(content);
    }
  }
  fs.writeFileSync(outFile, JSON.stringify(output));
});
The code begins by importing the necessary modules: `fs`, `gpt-3-encoder`, and `pdf-scraper`. It then defines the input and output file paths, and reads the content of the input PDF file into a buffer using `fs.readFileSync()`.
Next, the buffer is passed to the `pdf-scraper` module via `PDF(dataBuffer)`. The module extracts the text content from the PDF file and returns a Promise, which is handled in a callback function. The text content is stored in `data.pages`, where each element of the array represents one page of the PDF.
The script then iterates over the pages of the PDF using a `for` loop. For each page, it checks whether the page content exceeds the maximum token length specified by `tokensPerChunk`. If it does, the content is split into sentences, and sentences are concatenated greedily into chunks of at most roughly `tokensPerChunk` tokens each.
The `gpt-3-encoder` module is used to measure the token length of each chunk, counting tokens the same way OpenAI's models do. Each chunk is stored in the `output` array as an object containing the document title, page number, token count, and the content itself.
Finally, the `output` array is written to a JSON file using `fs.writeFileSync()`, at the path specified by `outFile`.
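The chunking strategy is easy to see in miniature. Here's a simplified, dependency-free sketch of the same greedy sentence-packing: it uses a word count as a stand-in for the token count (the real script budgets by GPT tokens via `gpt-3-encoder`):

```javascript
// Greedy chunking sketch: split text on sentence boundaries, then pack
// sentences into a chunk until the size budget would be exceeded.
function chunkSentences(text, wordsPerChunk) {
  const words = (s) => s.split(' ').filter(Boolean).length;
  const sentences = text.replace(/\s+/g, ' ').split('. ');
  const chunks = [];
  let chunk = '';
  for (const sentence of sentences) {
    // Flush the current chunk if adding this sentence would blow the budget
    if (chunk && words(chunk) + words(sentence) > wordsPerChunk) {
      chunks.push(chunk.trim());
      chunk = '';
    }
    // The split ate the period; restore it unless the sentence kept its own
    chunk += /[a-z0-9]$/i.test(sentence) ? sentence + '. ' : sentence + ' ';
  }
  if (chunk.trim()) chunks.push(chunk.trim());
  return chunks;
}

console.log(chunkSentences('One two three. Four five six. Seven eight nine.', 6));
// → [ 'One two three. Four five six.', 'Seven eight nine.' ]
```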
Next, here is the code for generating the embeddings and inserting them into our PG table. Be sure to check the JSON output for validity first; that's why I split these steps into two scripts:
import { loadEnvConfig } from "@next/env";
import { createClient } from "@supabase/supabase-js";
import fs from "fs";
import { Configuration, OpenAIApi } from "openai";
import { encode } from "gpt-3-encoder";

loadEnvConfig("");

(async function () {
  try {
    const configuration = new Configuration({ apiKey: process.env.OPENAI_KEY });
    const openai = new OpenAIApi(configuration);
    const supabase = createClient(process.env.NEXT_PUBLIC_SB_URL, process.env.SB_SERVICE_KEY);
    const inFile = 'data/hackers.json';
    const items = JSON.parse(fs.readFileSync(inFile));
    // Use for...of (not forEach) so each await completes before the next
    // iteration and the script doesn't exit with inserts still in flight
    for (const item of items) {
      // Generate embedding via GPT model ada-002
      const aiRes = await openai.createEmbedding({
        model: "text-embedding-ada-002",
        input: item.content
      });
      const [{ embedding }] = aiRes.data.data;
      // Insert data and embedding into PG table
      const { error } = await supabase
        .from("vsearch")
        .insert({
          document_title: item.documentTitle,
          page_no: item.pageNo,
          embedding,
          content: item.content,
          content_length: item.content.length,
          content_tokens: encode(item.content).length
        });
      if (error) console.error(error);
    }
  } catch (err) {
    console.error(err.message, err.stack);
  }
})();
- We load our JSON data from the file we created earlier
- We use the `text-embedding-ada-002` OpenAI endpoint to get our vector
- We use the Supabase SDK to insert our snippets into our search table
Putting It Into Practice
We'll divide the back-end process into two separate controllers. The first will query our PG table and get the 3 most relevant results. The second will prompt ChatGPT with the pulled data and the original query.
Here's the first:
import { createClient } from "@supabase/supabase-js";
import readRequestBody from "@/lib/api/readRequestBody";
import checkTurnstileToken from "@/lib/api/checkTurnstileToken";
// Use Next.js edge runtime
export const config = {
runtime: 'edge',
}
export default async function handler(request) {
const MATCHES = 3; // max matches to return
const THRESHOLD = 0.01; // similarity threshold
try {
const requestData = await readRequestBody(request);
// Validate CAPTCHA response
if (!await checkTurnstileToken(requestData.token)) {
throw new Error('Captcha verification failed');
}
const supabase = createClient(process.env.NEXT_PUBLIC_SB_URL, process.env.SB_SERVICE_KEY);
// Get embedding vector for search term via OpenAI
const input = requestData.searchTerm.replace(/\n/g, " ");
const result = await fetch("https://api.openai.com/v1/embeddings", {
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${process.env.OPENAI_KEY}`
},
method: "POST",
body: JSON.stringify({
model: "text-embedding-ada-002",
input
})
});
if (!result.ok) {
const mess = await result.text();
throw new Error(mess)
}
const json = await result.json();
const embedding = json.data[0].embedding;
// Perform cosine similarity search via RPC call
const { data: chunks, error } = await supabase.rpc("vector_search", {
query_embedding: embedding,
similarity_threshold: THRESHOLD,
match_count: MATCHES
});
if (error) {
console.error(error);
throw new Error(error.message);
}
return new Response(JSON.stringify(chunks), {
status: 200,
headers: {
'Content-Type': 'application/json'
}
});
} catch (err) {
return new Response(`Server error: ${err.message}`, { status: 500 });
}
}
Here's what it does:
- The `config` object specifies that this is a Next.js edge runtime function. Edge runtime allows the function to run closer to the user, reducing latency.
- The `handler` function is the main function that takes in a `request` object and returns a `Response` object.
- The `MATCHES` variable determines the maximum number of matches to return.
- The `THRESHOLD` variable specifies the similarity threshold. Matches with a similarity score below this threshold will not be returned.
- The `readRequestBody` function reads the request body.
- The `checkTurnstileToken` function validates the CAPTCHA response against Cloudflare's Turnstile verification endpoint.
- The `createClient` function creates a Supabase client with the Supabase URL and service key.
- The `fetch` call sends a request to the OpenAI API to get the embedding vector for the search term.
- The `rpc` call sends a remote procedure call to the Supabase server to perform the cosine similarity search.
- The `Response` object is returned with the matching chunks in JSON format.
Next is a function that sends a prompt to the OpenAI API and returns a `ReadableStream` that receives the response in real time:
import { createParser } from "eventsource-parser";
const openAiStream = async (system, prompt) => {
const encoder = new TextEncoder();
const decoder = new TextDecoder();
const res = await fetch("https://api.openai.com/v1/chat/completions", {
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${process.env.OPENAI_KEY}`
},
method: "POST",
body: JSON.stringify({
model: 'gpt-3.5-turbo',
messages: [
{
role: "system",
content: system
},
{
role: "user",
content: prompt
}
],
max_tokens: 150,
temperature: 0.0,
stream: true
})
});
if (res.status !== 200) {
const mess = await res.text();
throw new Error(mess);
}
const stream = new ReadableStream({
async start(controller) {
const onParse = (event) => {
if (event.type === "event") {
const data = event.data;
if (data === "[DONE]") {
controller.close();
return;
}
try {
const json = JSON.parse(data);
const text = json.choices[0].delta.content;
const queue = encoder.encode(text);
controller.enqueue(queue);
} catch (e) {
controller.error(e);
}
}
};
const parser = createParser(onParse);
for await (const chunk of res.body) {
parser.feed(decoder.decode(chunk));
}
}
});
return stream;
};
export default openAiStream;
- The `createParser` function creates a new parser that parses an `EventSource` stream.
- The `openAiStream` function takes two parameters: `system`, the message that primes the model, and `prompt`, the prompt from the user.
- The function sends a POST request to the OpenAI API with the specified `system` and `prompt` messages.
- If the response status is not 200, an error is thrown.
- The `ReadableStream` is created with an `async start` method that takes a `controller` object. The `controller` is used to manage the stream.
- A callback function, `onParse`, is created to handle parsed events from the parser.
- Inside the `start` method, a `for await...of` loop iterates over the response body chunks.
- Each chunk is decoded from binary to UTF-8 using `decoder.decode(chunk)` and fed into the parser using `parser.feed()`.
- When the parser encounters an "event" type, the `data` field is parsed as JSON, and the response text is encoded and enqueued into the `controller`.
- When the parser encounters the "[DONE]" message, the `controller` is closed and the `start` method exits.
- If there is an error parsing the response, it is reported via `controller.error()`.
Lastly, here is the handler that will put it all together:
import readRequestBody from "@/lib/api/readRequestBody";
import endent from "endent";
import openAiStream from "@/lib/openAi/openAiStream";
// Use Next.js edge runtime
export const config = {
runtime: 'edge',
}
export default async function handler(request) {
try {
const requestData = await readRequestBody(request);
const query = requestData.searchTerm.replace(/\n/g, " ");
const system = endent`
You are a helpful assistant that answers questions based on a provided body of text.
Use only the provided text to answer the question. Try not to copy the text word-for-word.
If you are not certain of the answer, then answer: "Sorry, I can't help you with that."
`
const textBody = requestData.chunks
.map((c) => c.content)
.join(' '); // join, not a bare array, to avoid comma-separated interpolation
const prompt = `Please answer this query: ${query}\n\n`
+ `Use only the following information:\n\n${textBody}`;
const stream = await openAiStream(system, prompt);
return new Response(stream);
} catch (err) {
return new Response(`Server error: ${err.message}`, { status: 500 });
}
}
The first part is the system message that primes ChatGPT on how it should handle subsequent queries. The second is the actual query, containing the user's search term and the data pulled from the database. We feed both to our `openAiStream` function and return the resulting stream to the client.
Creating a Client
You're welcome to view the client I created on my portfolio site (see below), but this is getting to be long, so I won't put the full code here as most of it is not important to making this work. Here is the good stuff, though:
const embedSearchTerm = async () => {
const body = new FormData();
body.append('searchTerm', searchTerm);
body.append('token', token);
const result = await fetch('/api/vsearch/embed', {
method: 'POST',
body
});
if (!result.ok) {
const mess = await result.text();
throw new Error(mess);
}
const chunks = await result.json();
if (!chunks.length) {
setErrorMess('Sorry, I was unable to find a match');
return;
}
setPageNumbers(chunks.map(c => (c.page_no)));
return chunks;
}
const getAnswer = async (chunks) => {
const body = {
searchTerm,
chunks
};
const response = await fetch('/api/vsearch/answer', {
method: 'POST',
body: JSON.stringify(body),
headers: {
'Content-Type': 'application/json'
}
});
if (!response.ok) {
const message = await response.text();
throw new Error(message);
}
const reader = response.body.getReader();
// { stream: true } keeps multi-byte characters intact across chunk boundaries
const decoder = new TextDecoder('utf-8');
while (true) {
const { done, value } = await reader.read();
if (done) {
break;
}
setAnswer((prev) => prev + decoder.decode(value, { stream: true }));
}
};
const handleSearch = async (e) => {
e.preventDefault();
e.stopPropagation();
if (isLoading) return;
setSearchErrorMess('');
if (!searchTerm.length) {
setSearchErrorMess('Please fill this out');
return;
}
setIsLoading(true);
try {
const chunks = await embedSearchTerm();
// embedSearchTerm returns undefined when no match was found
if (chunks?.length) {
await getAnswer(chunks);
}
} catch (err) {
setErrorMess(err.message);
captcha.reset();
} finally {
setIsLoading(false);
}
}
Conclusion
In this tutorial, we've explored how to implement semantic search using Supabase/PostgreSQL, Next.js, and OpenAI. We've set up a database table in Supabase to store our documents and their vector representations, and defined a function for performing semantic search against those vectors. We've also utilized the pgvector extension in PostgreSQL to enable vector operations, and set up an index on the embedding column to speed up similarity searches.
Semantic search is a powerful tool for improving search accuracy and relevance, especially when dealing with large amounts of unstructured data. With the rise of natural language processing and machine learning, it's becoming increasingly feasible to implement semantic search in our own projects. By following the steps outlined in this tutorial, you can get started with semantic search using Supabase/PostgreSQL, Next.js, and OpenAI, and tailor it to your specific use case.
The power of these LLMs will unlock knowledge to people in a way never before possible. The world will be forever changed by this technology.
Links
Thank you for taking the time to read my article and I hope you found it useful (or at the very least, mildly entertaining). For more great information about web dev, systems administration and cloud computing, please read the Designly Blog. Also, please leave your comments! I love to hear thoughts from my readers.
I use Hostinger to host my clients' websites. You can get a business account that can host 100 websites at a price of $3.99/mo, which you can lock in for up to 48 months! It's the best deal in town. Services include PHP hosting (with extensions), MySQL, Wordpress and Email services.
Looking for a web developer? I'm available for hire! To inquire, please fill out a contact form.