UPDATE
I came across a fantastic article that reviews the general process coded here; it's well worth a read.
Introduction
In my two previous posts (one, two) on generative AI and using PostgreSQL's pgvector for preparing database data for natural language parsing, we covered some very simple use cases just to get an idea of how creating custom embeddings can be useful. In this next post let's see if we can apply those learnings to do natural language processing on even more complex database data.
Using arbitrary database data
Previously we used a toy example, a list of local and international food dishes, to show we could use OpenAI's API to first create vector embeddings and then match those vectors against the vectors of a question to find the correct meals. However, I suspect what we really want is a technique that lets us ask questions like this about our existing database data. Here's where it gets fun, and where I suspect we will need to iterate a few times before we get a good result. Right now the best practice on this is still evolving, but we have to begin someplace, so here's what I'm going to do:
First populate a database with embeddings about data we want to query:
- Use SQL to create a JSON record with interesting information we want to be able to do natural language parsing on.
- Convert that JSON into a textual representation that we can create vector embeddings on.
- Get the vector embeddings for that text and put it in the database along with the textual representation.
Next:
- Create a command line query application that takes a question and converts it to its vectors.
- Get all the textual representations that match the vectors.
- Use the OpenAI chat API to complete the question using the textual representations as context.
This really is my best first guess at a way to do this but as always feel free to jump into the comments and tell me everything I'm doing wrong :)
Ok let's write the script to populate the vector table. Here's the table DDL:
CREATE TABLE about_people (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    description TEXT NOT NULL,
    vector_info VECTOR(1536) NOT NULL
);
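One thing to keep in mind if you're following along: the VECTOR type comes from the pgvector extension, so it has to be enabled in your database before that DDL will run. If you haven't already done so from the earlier posts, it's just:

CREATE EXTENSION IF NOT EXISTS vector;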
Now, for the purposes of this example I'm going to use a sample database that contains information about students, such as their name and location, along with information about surveys they took as classroom assignments. You can use any type of data; I just happened to have this lying around. Here's the script:
use warnings;
use strict;
use DBI;
use OpenAI::API;

# Assumption: the API key and DB connection info come from the environment
my $KEY = $ENV{OPENAI_API_KEY};
my @DSN = ($ENV{DB_DSN}, $ENV{DB_USER}, $ENV{DB_PASS});

my $api = OpenAI::API->new(api_key => $KEY);
my $dbh = DBI->connect(@DSN) || die "Can't connect to DB";
my $sql = qq[
    SELECT
        p.person_id,
        p.first_name,
        p.last_name,
        json_build_object(
            'person', json_build_object(
                'first_name', p.first_name,
                'last_name', p.last_name,
                'email', p.email,
                'institution', json_build_object(
                    'name', i.name,
                    'location', i.location
                ),
                'surveys', json_agg(
                    json_build_object(
                        'survey_name', s.survey_name,
                        'survey_type', s.survey_type
                    )
                ),
                'survey_count', COUNT(DISTINCT s.survey_id)
            )
        ) AS nested_json
    FROM person p
    JOIN team t ON t.person_id = p.person_id
    JOIN survey s ON s.survey_id = t.survey_id
    JOIN institutions i ON i.institution_id = p.institution_id
    WHERE p.person_type = 'student'
    GROUP BY p.person_id, p.first_name, p.last_name, p.email, i.name, i.location
    HAVING COUNT(DISTINCT s.survey_id) >= 1
    ORDER BY p.person_id DESC
    LIMIT 100;
];
my $insert_stmt = $dbh->prepare('INSERT INTO about_people (name, description, vector_info) VALUES (?,?,?)');
my $sth = $dbh->prepare($sql);
$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    # Get a textual description of the JSON
    my $json_string = $row->{nested_json};
    my $res_complete = $api->completions(
        model      => 'text-davinci-003',
        prompt     => "Given the following JSON, what's a good textual representation:\n $json_string",
        max_tokens => 50,
    );
    my $text = $res_complete->{choices}[0]{text};

    # Get the embeddings for that text
    my $res_embeddings = $api->embeddings(
        model => "text-embedding-ada-002",
        input => $text,
    );
    my $embedding = join(',', @{$res_embeddings->{data}[0]{embedding}});

    # Store the person's name, the text and its vector in about_people
    $insert_stmt->execute($row->{first_name}.' '.$row->{last_name}, $text, "[${embedding}]")
        or die "Couldn't execute statement: $DBI::errstr";
}
Ok, so what I'm doing here is getting each record as a JSON object first. A typical object might look like this:
{
    "person": {
        "first_name": "Student",
        "last_name": "Student ",
        "email": "student@email.com",
        "institution": {
            "name": "Texas A&M University",
            "location": "College Station"
        },
        "surveys": [{
            "survey_name": "ff",
            "survey_type": "TEAM-MAKER"
        }, {
            "survey_name": "teamtrust",
            "survey_type": "CATME"
        }, {
            "survey_name": "licensetest",
            "survey_type": "CATME"
        }, {
            "survey_name": "long",
            "survey_type": "TEAM-MAKER"
        }, {
            "survey_name": "fsdfsdfsdfeeeee",
            "survey_type": "TEAM-MAKER"
        }],
        "survey_count": 8
    }
}
Next I need to turn that into text. For this example I'm going to cheat and use OpenAI's API for that, but I'm warning you this is not a cheap approach. It's fine for this demo, however. When it's done you might have text like the following:
John Nap is a student at New Age Training, Inc. in New York. He has taken 11 surveys of two types - CATME and TEAM-MAKER. These surveys include Comments, Test01, Fewer, Greater
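If you'd rather not pay for a completion call per row, a cheaper alternative is to flatten the JSON into a sentence yourself. Here's a minimal sketch: the describe_person helper is something I made up for illustration, the field names come from the JSON shown above, and the sentence wording is just my guess at something the embedding model will handle well.

use JSON::PP;

# Hypothetical local alternative to the completions call in the loader:
# decode the nested_json column and assemble a plain English sentence.
sub describe_person {
    my ($json_string) = @_;
    my $data   = JSON::PP->new->decode($json_string);
    my $person = $data->{person};

    my $surveys = join ', ',
        map { "$_->{survey_name} ($_->{survey_type})" } @{ $person->{surveys} };

    return sprintf(
        "%s %s is a student at %s in %s. They have taken %d surveys: %s.",
        $person->{first_name}, $person->{last_name},
        $person->{institution}{name}, $person->{institution}{location},
        $person->{survey_count}, $surveys,
    );
}

You lose a little of the natural phrasing the model produces, but my hunch is the embeddings care more about the facts than the prose. Either way, once you have a sentence per row the rest of the pipeline is the same.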
Then I use the embeddings API to create and store vectors. This time you will note that I store both the vectors AND the text. We'll need the text for the query application.
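A side note on scale: with only a hundred rows PostgreSQL will happily do a sequential scan for the nearest-neighbour lookup in the query application, but if you load a lot more data you'd probably want one of pgvector's approximate indexes. Something like this (the lists value of 100 is just the common starting point, not a tuned number):

CREATE INDEX ON about_people USING ivfflat (vector_info vector_cosine_ops) WITH (lists = 100);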
Now for the query application itself. The process here is pretty simple for the first go. I'm going to take the question and get its vectors and then query the database to get the top textual answers that seem to match the question based on the vector analysis. Then I'm going to use the chat API to ask OpenAI to answer the question using that text as its context. Here's the code:
use strict;
use warnings;
use DBI;
use OpenAI::API;

# Assumption: same environment-based configuration as the loader script
my $KEY = $ENV{OPENAI_API_KEY};
my @DSN = ($ENV{DB_DSN}, $ENV{DB_USER}, $ENV{DB_PASS});

my $api = OpenAI::API->new(api_key => $KEY);
my $dbh = DBI->connect(@DSN) || die "Can't connect to DB";

my @messages = ();
my $sth = $dbh->prepare('SELECT name, description FROM about_people ORDER BY vector_info <=> ? LIMIT 5');
while (1) {
    # Prompt the user for input
    print "Enter your query: ";
    my $query = <STDIN>;
    chomp $query;

    # First, let's find stored text whose embedding matches the question's embedding
    my $res_embeddings = $api->embeddings(
        model => "text-embedding-ada-002",
        input => $query,
    );
    my $embeddings = join(',', @{$res_embeddings->{data}[0]{embedding}});
    $sth->execute("[${embeddings}]") or die "Couldn't execute statement: $DBI::errstr";

    my $text_to_process = '';
    while (my $row = $sth->fetchrow_hashref()) {
        my $description = $row->{description};
        $description =~ s/[\r\n]+/ /g;
        $text_to_process .= $description . '; ';
    }

    # Use this text as the context for answering the question
    push @messages, {role => "user", content => "Use the following context '$text_to_process' to answer the question: $query"};
    my $res = $api->chat(
        model    => "gpt-3.5-turbo",
        messages => [
            {role => "system", content => "Answer questions exactly as they are asked and do not speculate about things outside the question."},
            @messages,
        ],
    );
    my $text = $res->{choices}[0]{message}{content};
    push @messages, {role => "assistant", content => $text};

    print "\nAnswer: $text\n";

    # Prompt the user to continue or quit
    print "Enter 'quit' to exit or press Enter to continue: ";
    my $input = <STDIN>;
    chomp $input;

    # If the user enters "quit", break the loop and exit the program
    last if lc $input eq 'quit';

    # Otherwise, continue to the next iteration of the loop
}
$sth->finish();
$dbh->disconnect();
print "Goodbye! Exiting the program.\n";
You'll also note that I'm storing the previous questions and answers for context. This is a very naive implementation, since I'm not bothering to make sure I stay within the context limits of the API, but again this is for learning and testing so for now it's fine (there's a sketch of a simple fix after the sample run below). Let's try it:
$ perl ./nlp_people.pl
Enter your query: tell me about students with 2 or more surveys
Answer: Based on the provided context, there are two students mentioned: John Nap and John Nap1. Both students have completed multiple surveys. John Nap, a student at New Age Training, Inc. in New York, has completed two surveys called 'rwerwerwer' and 'rank' through the TEAM-MAKER survey type. John Nap1, on the other hand, is a student at Texas A&M University in College Station and has taken five CATME surveys named 'bug,' 'student bug,' 'drew,' 'gfhfghfg,' and 'wefwerwer.' From this information, we can conclude that both students have completed two or more surveys.
Enter 'quit' to exit or press Enter to continue:
Enter your query: how many students are in texas
Answer: Based on the provided context, we have information about two individuals associated with Texas:
1. Tiffany Adams: She is an alumnus of Texas A&M University located in College Station. The context does not explicitly state if she is currently a student or not, but it mentions that she has taken 8 surveys with various names.
2. John Nap1: He is currently a student at Texas A&M University located in College Station. It is mentioned that he has taken 5 CATME surveys.
From this information, we can conclude that at least two individuals, Tiffany Adams and John Nap1, are associated with Texas.
Enter 'quit' to exit or press Enter to continue: quit
Goodbye! Exiting the program.
It's pretty basic, but it's giving accurate answers. With more prompt engineering we could get it to give more concise answers, but as a proof of concept it outlines the possibilities!
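About that context-limit hand waving: the simplest improvement would be to cap how much conversation history gets sent with each chat call. A naive trim, dropped in right before the $api->chat call, might look like this (the limit of 6 messages is just an arbitrary number I picked):

# Hypothetical history cap: keep only the last few messages so the prompt
# doesn't grow without bound; a real version would count tokens instead.
my $max_history = 6;
splice(@messages, 0, scalar(@messages) - $max_history) if @messages > $max_history;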
One More Thing
This is a demo, and I expect that if you have more data, and more complex data, you might need multiple query layers as well as more overall metadata stored in your database. You might find it useful to adopt a hybrid approach and use, for example, PostgreSQL full text search to first narrow down the records rather than hitting the API for everything. But it outlines an approach you can pick up for your own further learning.
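To make that last point concrete, a hybrid lookup could be a single query that lets full text search narrow the candidates and the vector distance do the final ranking. Something along these lines (a sketch against the about_people table above; the two placeholders are the question text and its embedding, respectively):

SELECT name, description
FROM about_people
WHERE to_tsvector('english', description) @@ plainto_tsquery('english', ?)
ORDER BY vector_info <=> ?
LIMIT 5;

That way the vector comparison only has to rank rows that already mention terms from the question.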