In this post, I’ll show you how to use a vector database to lower GPT token costs in a Q&A application. The vector database I chose was Pinecone, which lets you store and query high-dimensional vectors in an efficient and scalable way. The idea is to turn the questions and answers into vectors using a pre-trained natural language model such as text-embedding-ada-002, and then use Pinecone to find the most similar answers to users’ questions.
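To make that flow concrete, here is a minimal sketch of how a question could be turned into a vector with the OpenAI embeddings endpoint. The helper name is mine, and the client interface varies between library versions (this uses the pre-1.0 `openai` package), so treat it as an illustration rather than the exact project code.

```python
# Minimal sketch: turn a piece of text into an embedding vector.
# Assumes the pre-1.0 openai package and an API key in OPENAI_API_KEY.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def embed_text(text: str) -> list:
    """Return the embedding vector for a piece of text."""
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response["data"][0]["embedding"]

vector = embed_text("How much does 1kg of your product cost?")
print(len(vector))  # text-embedding-ada-002 produces 1536-dimensional vectors
```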
Vector databases can provide better text query results than SQL databases because they work on a mathematical representation of the data called a vector, which lets you measure the similarity between documents and queries using measures such as distance or angle. SQL databases, on the other hand, rely on structured queries and are often limited to exact or fuzzy matches of words or phrases, for example through full-text operators such as CONTAINS. SQL databases can also require more resources and time than vector databases to index and search large amounts of text.
To manage the questions, I used Firebase Firestore, a cloud NoSQL database that offers real-time sync and offline support. Each question is stored in a document with a unique identifier and a field for the answer. Firestore also lets you create functions in the cloud that are triggered by events in the database, such as creating, updating, or deleting documents. I used a Cloud Function to fire an event whenever a new question is registered in Firestore. This function is responsible for sending the question to the ChatGPT API, a service that uses the GPT-3.5-turbo model to generate conversational responses. The ChatGPT API returns the answer in text format, which is then converted to a vector using the same natural language model that was used for the questions. This vector is sent to Pinecone, which stores it so it can be matched against future queries.
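My Cloud Function is actually written in Node.js, but to keep the examples in this post in one language, here is the equivalent logic sketched in Python (it reuses the `embed_text` helper from the first sketch). The index name, environment, and metadata fields are assumptions, and the Pinecone client API has changed across versions; in this sketch I embed the question text and store the generated answer as metadata, which is what the lookup in the next section relies on.

```python
# Sketch of the "new question" flow: ask ChatGPT, embed the question,
# and store the vector in Pinecone keyed by the Firestore document id.
# Index name, environment and metadata fields are assumptions.
import os
import openai
import pinecone

openai.api_key = os.environ["OPENAI_API_KEY"]
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="us-east1-gcp")
index = pinecone.Index("questions")

def answer_new_question(doc_id: str, question: str) -> str:
    # 1. Generate the answer with the ChatGPT API (gpt-3.5-turbo).
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    answer = chat["choices"][0]["message"]["content"]

    # 2. Embed the question with the same embedding model used everywhere else.
    embedding = embed_text(question)

    # 3. Upsert the vector into Pinecone so future similar questions
    #    can reuse this answer without spending GPT tokens.
    index.upsert(vectors=[(doc_id, embedding, {"answer": answer})])
    return answer
```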
One advantage of using Pinecone is that it allows you to do similarity queries over the stored question and answer vectors. So when a user asks a question, I don’t need to send the question to ChatGPT and spend GPT tokens. I can simply convert the question into a vector and send it to Pinecone, which returns the identifiers of the most similar stored vectors. Based on the similarity score between the new question and questions that have been asked before, I can adopt an existing answer as the new question’s answer. This significantly reduces GPT token costs, as I only need to use ChatGPT to generate answers to questions that are new or significantly different from existing ones.
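The lookup side could look roughly like this. The 0.9 threshold is an arbitrary illustrative value that would need tuning (which is exactly where the problem described below comes from), and `embed_text` and `index` come from the earlier sketches.

```python
# Sketch of the lookup path: embed the incoming question and ask Pinecone
# for the closest stored vector. Threshold and field names are assumptions.
from typing import Optional

SIMILARITY_THRESHOLD = 0.9  # illustrative value, tune it for your data

def find_cached_answer(question: str) -> Optional[str]:
    query_vector = embed_text(question)
    result = index.query(vector=query_vector, top_k=1, include_metadata=True)
    if result.matches and result.matches[0].score >= SIMILARITY_THRESHOLD:
        # Similar enough to a previous question: reuse the stored answer
        # and spend no GPT tokens.
        return result.matches[0].metadata["answer"]
    return None  # genuinely new question: fall back to ChatGPT
```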
It can be quite a challenge to find the exact answer to the user’s question based on previous questions. For example: ‘How much does 1kg of your product cost?’ versus ‘How much does 1g of your product cost?’. The similarity score between the vectors of these two questions will probably be close to 1.0, even though the correct answers are different. This can be a problem depending on your use case. I haven’t come up with anything concrete to solve it yet, but I believe there are ways around it, such as defining explicit distinctions between words like ‘kilogram’ and ‘gram’.
To compare the cost between Pinecone queries and the GPT-3.5-turbo model, I used the following values:
Tokens are common strings of characters found in text. GPT processes text using tokens and understands the statistical relationships between them. Tokens can include spaces and even subwords. The maximum number of tokens that GPT can take as input depends on the model and tokenizer used. For example, the text-embedding-ada-002 model uses the cl100k_base tokenizer and can receive up to 8191 tokens.
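If you want to check how many tokens a question will consume before sending it, the tiktoken library exposes that same cl100k_base encoding. A quick sketch:

```python
# Count tokens with the cl100k_base encoding mentioned above.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "How much does 1kg of your product cost?"
tokens = encoding.encode(text)
print(len(tokens))  # number of tokens this text will consume
```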
I even considered using Cloud Functions as the hosting platform for my ChatGPT query code, but that would have been expensive. Cloud Functions charges based on your function’s runtime, the number of invocations, and the provisioned resources. Since the ChatGPT API can be slow to respond depending on the complexity of the query, you can end up paying a lot for the time your function spends waiting for the API response.
The way I found to reduce these costs was to use a web service on Render.com: a small Flask application that uses threads so it does not have to wait for the GPT API to respond before replying to the Cloud Function. Render.com is a platform that lets you host web applications simply and inexpensively, with plans starting at $7 per month. The idea is to create an intermediate layer between the Cloud Function and ChatGPT: it receives the question from the Cloud Function and sends an immediate response saying the answer is being generated. The Flask application then creates a thread that sends the question to the ChatGPT API and, once it gets the answer, updates the question document in Firestore, as shown in the sketch below.
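Here is a stripped-down sketch of that intermediate layer. The route, the collection name, and the firebase_admin setup are assumptions about my project rather than literal code from it; the point is the pattern of acknowledging the HTTP request immediately and letting a background thread wait on ChatGPT.

```python
# Minimal sketch of the Flask layer on Render.com: acknowledge the request
# right away, then let a background thread call ChatGPT and update Firestore.
# Collection/field names and the credentials setup are assumptions.
import os
import threading

import firebase_admin
import openai
from firebase_admin import firestore
from flask import Flask, jsonify, request

openai.api_key = os.environ["OPENAI_API_KEY"]
firebase_admin.initialize_app()  # uses GOOGLE_APPLICATION_CREDENTIALS
db = firestore.client()

app = Flask(__name__)

def generate_and_store_answer(doc_id: str, question: str) -> None:
    """Runs in a background thread: ask ChatGPT, then write the answer back."""
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    answer = chat["choices"][0]["message"]["content"]
    db.collection("questions").document(doc_id).update({"answer": answer})

@app.route("/ask", methods=["POST"])
def ask():
    payload = request.get_json()
    thread = threading.Thread(
        target=generate_and_store_answer,
        args=(payload["doc_id"], payload["question"]),
    )
    thread.start()
    # Respond immediately so the Cloud Function is not billed while GPT thinks.
    return jsonify({"status": "generating"}), 202
```

Because the answer is written back to Firestore, the client can see it appear through Firestore’s real-time sync without any extra polling.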
I had thought of describing the project's source code in this post, but I think that for now it is worth reflecting on the infrastructure. Later I will write new posts detailing the NodeJS project that runs in Google Cloud Functions and the Python (Flask) project hosted on Render.com.
The web application for inserting and querying questions in Firestore is still being developed. I intend to publish it soon as well.