This tutorial walks you through how to do semantic embedding completely offline with open source models in NodeJS. No knowledge of AI/ML is required. Source code: https://github.com/OmarShehata/minimal-embedding-template
An "embedding" is a high dimensional vector (x,y,z,...)
that represents a concept. Think of it as the internal representation of words in an LLM. You can compute distances between these vectors.
Example: "Man" is much closer to "boy" & "woman", compared to "chicken". "Coffee" and "wifi" are somewhat close, and are both close to "coffee shop".
I think this is an extremely underutilized feature of modern LLMs, and it's much cheaper compute-wise than text generation. Most of the time I don't really want the LLM to generate text so much as I want to see & manipulate semantic concepts like this.
Setup
Clone the repo: https://github.com/OmarShehata/semantic-embedding-template. This contains a minimal NodeJS template that you can copy/paste and build on.
It uses:
- gpt4all as the LLM engine. This is where the open source model comes from, and it's what converts a word/string/document into a vector.
- Vectra as a local, single-file vector database, which lets us index & search those vectors.
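For reference, creating the Vectra index looks roughly like this. This is a sketch based on Vectra's LocalIndex API (the folder name here is arbitrary, and the template wraps all of this for you):

```js
import path from 'node:path';
import { LocalIndex } from 'vectra';

// Vectra keeps the index in plain local files, so nothing leaves your machine.
const index = new LocalIndex(path.join(process.cwd(), 'my-index'));
if (!(await index.isIndexCreated())) {
  await index.createIndex();
}
```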
Run pnpm install, then run the first example in example-simple-embedding/index.js:
pnpm simple-embedding
This takes an array of strings and converts them to vectors:
await embeddings.insertText(['coffee shop', 'wifi', ...])
You can print the vectors with embeddings.getTextMap(). You can run a search as shown below; this returns a sorted list of the closest vectors in the DB, along with a cosine similarity score.
const results = await embeddings.search('coffee')
// returns:
// [
// [ 'coffee shop', 0.8214959697396015 ],
// [ 'wifi', 0.711907901740376 ],
// [ 'hard work', 0.6709908415581982 ],
// [ 'love peace & joy, relaxation', 0.6495931802131457 ]
// ]
(1 means it's exactly the same vector, -1 means it's exactly opposite, 0 means no correlation)
embeddings is a thin wrapper around gpt4all. lib/embeddings.js implements insertText (sketched below), which:
- checks to see if these words are already in the DB
- inserts them if they are not, with a batch update
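Here's a rough sketch of that flow. embedText is a hypothetical stand-in for whatever call actually produces the vector (gpt4all locally, or OpenAI in the other version of the file), and the Vectra calls (listItems, beginUpdate, insertItem, endUpdate) are from its documented API, which may differ slightly between versions:

```js
// Sketch of insertText, not the exact code from lib/embeddings.js.
async function insertText(index, texts, embedText) {
  // 1. Check which strings are already in the DB
  const existing = new Set(
    (await index.listItems()).map((item) => item.metadata.text)
  );
  const newTexts = texts.filter((text) => !existing.has(text));

  // 2. Insert the missing ones as a single batch update,
  //    so the index file only gets written once
  await index.beginUpdate();
  for (const text of newTexts) {
    const vector = await embedText(text);
    await index.insertItem({ vector, metadata: { text } });
  }
  await index.endUpdate();
}
```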
The search function takes the query string, converts it to a vector, and then runs a query against the Vectra DB.
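And a sketch of search under the same assumptions (the hypothetical embedText helper, plus Vectra's queryItems for the nearest-neighbor lookup):

```js
// Sketch of search: embed the query, then ask Vectra for the closest items.
async function search(index, query, embedText, topK = 10) {
  const queryVector = await embedText(query);
  const results = await index.queryItems(queryVector, topK);
  // Each result carries the stored metadata plus a similarity score
  return results.map((result) => [result.item.metadata.text, result.score]);
}
```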
The specific model I'm using here is nomic-embed-text-v1.5, an open source model that's free to use and runs locally on your machine.
OpenAI embeddings
lib/embeddings-openai.js is a version of this file that has exactly the same API but sends the text to OpenAI. See OpenAI's embedding docs.
The OpenAI model captures more nuance in my experience (for example, it captures the semantic meaning of emojis whereas the open source one doesn't seem to).
Set the OPEN_API_KEY environment variable to use this. To run the example in example-openai-embedding/index.js:
pnpm openai-embedding
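Under the hood, the OpenAI version just swaps the local model call for OpenAI's embeddings endpoint. Here's a minimal sketch using the official openai package; the model name is my own choice here, not necessarily what the template uses:

```js
import OpenAI from 'openai';

// Uses the key from the environment variable mentioned above
const client = new OpenAI({ apiKey: process.env.OPEN_API_KEY });

async function embedText(text) {
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small', // any OpenAI embedding model works
    input: text,
  });
  return response.data[0].embedding;
}
```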
Clustering
To run example-clustering/index.js:
pnpm clustering
This clusters the vectors in the DB using k-means. You tell it the number of clusters you want to create, and it repeatedly assigns each point to its nearest cluster center and recomputes those centers until the clusters settle.
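If you're curious about the mechanics, here's a compact k-means sketch over plain arrays of numbers (the template has its own clustering code; this is just to illustrate the algorithm):

```js
// Minimal k-means: assign each vector to its nearest centroid,
// recompute the centroids, repeat until nothing changes.
function kMeans(vectors, k, maxIterations = 100) {
  // Start from k randomly chosen vectors as the initial centroids
  let centroids = [...vectors].sort(() => Math.random() - 0.5).slice(0, k);
  const assignments = new Array(vectors.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    let changed = false;

    // Assignment step: nearest centroid by Euclidean distance
    vectors.forEach((vector, i) => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach((centroid, c) => {
        const dist = euclideanDistance(vector, centroid);
        if (dist < bestDist) {
          bestDist = dist;
          best = c;
        }
      });
      if (assignments[i] !== best) {
        assignments[i] = best;
        changed = true;
      }
    });
    if (!changed) break;

    // Update step: each centroid becomes the mean of its cluster
    centroids = centroids.map((centroid, c) => {
      const members = vectors.filter((_, i) => assignments[i] === c);
      if (members.length === 0) return centroid; // keep empty clusters in place
      return centroid.map(
        (_, dim) => members.reduce((sum, v) => sum + v[dim], 0) / members.length
      );
    });
  }

  return { centroids, assignments };
}

function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}
```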
Normally, you don't know how many clusters are in the data. There are various techniques to estimate this. One is the "elbow method": you cluster the dataset with increasingly higher cluster counts and compute a "score" for each run. The score measures how close the items in each cluster are to their centroid, so the lower the score, the tighter the clusters of semantically related things. You then pick the point where adding more clusters stops meaningfully lowering the score (the "elbow").
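A sketch of the elbow method, reusing the kMeans and euclideanDistance helpers from the block above: compute the score for increasing cluster counts and look for where it stops dropping sharply.

```js
// Score = total distance of every vector to its assigned centroid
// (lower = tighter, more semantically coherent clusters).
function clusterScore(vectors, { centroids, assignments }) {
  return vectors.reduce(
    (sum, vector, i) => sum + euclideanDistance(vector, centroids[assignments[i]]),
    0
  );
}

// Try k = 1..maxK; the "elbow" is where adding another cluster
// stops noticeably reducing the score.
function elbowScores(vectors, maxK = 10) {
  const scores = [];
  for (let k = 1; k <= maxK; k++) {
    scores.push([k, clusterScore(vectors, kMeans(vectors, k))]);
  }
  return scores;
}
```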
I hope you found this useful! You can use the base code here to recreate something like Neal's Infinite Craft game: put all the words in the dictionary into the vector database, then to combine two words, add their vectors (or average them?) and search for the closest thing to that combined vector.
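That combine step could look something like this, again with the hypothetical embedText helper and Vectra's queryItems (not part of the template, just an illustration of the idea):

```js
// "Combine" two words by averaging their vectors, then find the
// closest word already in the index to that blended vector.
async function combineWords(index, wordA, wordB, embedText) {
  const [vecA, vecB] = await Promise.all([embedText(wordA), embedText(wordB)]);
  const combined = vecA.map((value, i) => (value + vecB[i]) / 2);
  const [closest] = await index.queryItems(combined, 1);
  return closest.item.metadata.text;
}
```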
This is my personal sandbox that I hope to add more stuff to. For example, there are models that can convert an image to a text description. You can then get a vector embedding for that text, and with that you can build an app where you can "CTRL+F" for your images (again, all offline, and free!)
Top comments (3)
Great guideline! Thanks for sharing
thanks Martin!! I wrote this partially because I kept finding dozens of tutorials on this but they're all "ads" (like telling me to use this or that service). And I just wanted to know how to do it in a super simple nodeJS script!!
Awesome post, keep it up.