This tutorial walks you through how to do semantic embedding completely offline with open source models in NodeJS. No knowledge of AI/ML is required. Source code: https://github.com/OmarShehata/minimal-embedding-template
An "embedding" is a high dimensional vector (x,y,z,...)
that represents a concept. Think of it as the internal representation of words in an LLM. You can compute distances between these vectors.
Example: "Man" is much closer to "boy" & "woman", compared to "chicken". "Coffee" and "wifi" are somewhat close, and are both close to "coffee shop".
I think this is an extremely underutilized feature of modern LLMs, and it's much cheaper compute-wise than text generation. Most of the time I don't really want the LLM to generate text so much as I want to see & manipulate semantic concepts like this.
Setup
Clone the repo: https://github.com/OmarShehata/semantic-embedding-template. This contains a minimal NodeJS template that you can copy/paste and build on.
It uses:
- gpt4all as the LLM engine. This is where the open source model comes from, and it's what converts a word/string/document into a vector.
- Vectra as a local, single-file vector database, which lets us index & search those vectors.
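For reference, creating the Vectra index looks roughly like this. This is a sketch based on Vectra's LocalIndex API (the folder name here is arbitrary, and the template wraps all of this for you):

```js
import path from 'node:path';
import { LocalIndex } from 'vectra';

// Vectra keeps the index in plain local files, so nothing leaves your machine.
const index = new LocalIndex(path.join(process.cwd(), 'my-index'));
if (!(await index.isIndexCreated())) {
  await index.createIndex();
}
```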
Run pnpm install, then run the first example in example-simple-embedding/index.js:
pnpm simple-embedding
This takes an array of strings and converts them to vectors:
await embeddings.insertText(['coffee shop', 'wifi', ...])
You can print the vectors with embeddings.getTextMap(). You can run a search as shown below; this returns a sorted list of the closest vectors in the DB, along with a cosine similarity score.
const results = await embeddings.search('coffee')
// returns:
// [
// [ 'coffee shop', 0.8214959697396015 ],
// [ 'wifi', 0.711907901740376 ],
// [ 'hard work', 0.6709908415581982 ],
// [ 'love peace & joy, relaxation', 0.6495931802131457 ]
// ]
(1 means it's exactly the same vector, -1 means it's exactly opposite, 0 means no correlation)
embeddings is a thin wrapper around gpt4all. lib/embeddings.js implements insertText (sketched below), which:
- checks to see if these words are already in the DB
- inserts them if they are not, with a batch update
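Here's a rough sketch of that flow. embedText is a hypothetical stand-in for whatever call actually produces the vector (gpt4all locally, or OpenAI in the other version of the file), and the Vectra calls (listItems, beginUpdate, insertItem, endUpdate) are from its documented API, which may differ slightly between versions:

```js
// Sketch of insertText, not the exact code from lib/embeddings.js.
async function insertText(index, texts, embedText) {
  // 1. Check which strings are already in the DB
  const existing = new Set(
    (await index.listItems()).map((item) => item.metadata.text)
  );
  const newTexts = texts.filter((text) => !existing.has(text));

  // 2. Insert the missing ones as a single batch update,
  //    so the index file only gets written once
  await index.beginUpdate();
  for (const text of newTexts) {
    const vector = await embedText(text);
    await index.insertItem({ vector, metadata: { text } });
  }
  await index.endUpdate();
}
```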
The search function takes the query string, converts it to a vector, and then runs a query against the Vectra DB.
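And a sketch of search under the same assumptions (the hypothetical embedText helper, plus Vectra's queryItems for the nearest-neighbor lookup):

```js
// Sketch of search: embed the query, then ask Vectra for the closest items.
async function search(index, query, embedText, topK = 10) {
  const queryVector = await embedText(query);
  const results = await index.queryItems(queryVector, topK);
  // Each result carries the stored metadata plus a similarity score
  return results.map((result) => [result.item.metadata.text, result.score]);
}
```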
The specific model I'm using here is nomic-embed-text-v1.5, an open source model that's free to use and runs locally on your machine.
OpenAI embeddings
lib/embeddings-openai.js is a version of this file that has exactly the same API but sends the text to OpenAI. See OpenAI's embedding docs.
The OpenAI model captures more nuance in my experience (for example, it captures the semantic meaning of emojis whereas the open source one doesn't seem to).
Set the OPEN_API_KEY environment variable to use this. To run the example in example-openai-embedding/index.js:
pnpm openai-embedding
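Under the hood, the OpenAI version just swaps the local model call for OpenAI's embeddings endpoint. Here's a minimal sketch using the official openai package; the model name is my own choice here, not necessarily what the template uses:

```js
import OpenAI from 'openai';

// Uses the key from the environment variable mentioned above
const client = new OpenAI({ apiKey: process.env.OPEN_API_KEY });

async function embedText(text) {
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small', // any OpenAI embedding model works
    input: text,
  });
  return response.data[0].embedding;
}
```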
Clustering
To run example-clustering/index.js:
pnpm clustering
This clusters the vectors in the DB using k-means. You tell it the number of clusters you want to create, and it repeatedly assigns each point to its nearest cluster center and recomputes those centers until the clusters settle.
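If you're curious about the mechanics, here's a compact k-means sketch over plain arrays of numbers (the template has its own clustering code; this is just to illustrate the algorithm):

```js
// Minimal k-means: assign each vector to its nearest centroid,
// recompute the centroids, repeat until nothing changes.
function kMeans(vectors, k, maxIterations = 100) {
  // Start from k randomly chosen vectors as the initial centroids
  let centroids = [...vectors].sort(() => Math.random() - 0.5).slice(0, k);
  const assignments = new Array(vectors.length).fill(-1);

  for (let iter = 0; iter < maxIterations; iter++) {
    let changed = false;

    // Assignment step: nearest centroid by Euclidean distance
    vectors.forEach((vector, i) => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach((centroid, c) => {
        const dist = euclideanDistance(vector, centroid);
        if (dist < bestDist) {
          bestDist = dist;
          best = c;
        }
      });
      if (assignments[i] !== best) {
        assignments[i] = best;
        changed = true;
      }
    });
    if (!changed) break;

    // Update step: each centroid becomes the mean of its cluster
    centroids = centroids.map((centroid, c) => {
      const members = vectors.filter((_, i) => assignments[i] === c);
      if (members.length === 0) return centroid; // keep empty clusters in place
      return centroid.map(
        (_, dim) => members.reduce((sum, v) => sum + v[dim], 0) / members.length
      );
    });
  }

  return { centroids, assignments };
}

function euclideanDistance(a, b) {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}
```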
Normally, you don't know how many clusters are in the data. There are various techniques to estimate this. One is the "elbow method": you cluster the dataset with increasingly higher cluster counts and compute a "score" for each run. The score measures how close the items in each cluster are to their centroid, so the lower the score, the tighter the clusters of semantically related things. You then pick the point where adding more clusters stops meaningfully lowering the score (the "elbow").
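A sketch of the elbow method, reusing the kMeans and euclideanDistance helpers from the block above: compute the score for increasing cluster counts and look for where it stops dropping sharply.

```js
// Score = total distance of every vector to its assigned centroid
// (lower = tighter, more semantically coherent clusters).
function clusterScore(vectors, { centroids, assignments }) {
  return vectors.reduce(
    (sum, vector, i) => sum + euclideanDistance(vector, centroids[assignments[i]]),
    0
  );
}

// Try k = 1..maxK; the "elbow" is where adding another cluster
// stops noticeably reducing the score.
function elbowScores(vectors, maxK = 10) {
  const scores = [];
  for (let k = 1; k <= maxK; k++) {
    scores.push([k, clusterScore(vectors, kMeans(vectors, k))]);
  }
  return scores;
}
```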
I hope you found this useful! You can use the base code here to recreate something like Neal's Infinite Craft game: put all the words in the dictionary into the vector database, then to combine two words, add their vectors (or average them?) and search for the closest thing to that combined vector.
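That combine step could look something like this, again with the hypothetical embedText helper and Vectra's queryItems (not part of the template, just an illustration of the idea):

```js
// "Combine" two words by averaging their vectors, then find the
// closest word already in the index to that blended vector.
async function combineWords(index, wordA, wordB, embedText) {
  const [vecA, vecB] = await Promise.all([embedText(wordA), embedText(wordB)]);
  const combined = vecA.map((value, i) => (value + vecB[i]) / 2);
  const [closest] = await index.queryItems(combined, 1);
  return closest.item.metadata.text;
}
```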
This is my personal sandbox that I hope to add more stuff to. For example, there are models that can convert an image to a text description. You can then get a vector embedding for that text, and with that you can build an app where you can "CTRL+F" for your images (again, all offline, and free!)
Top comments (3)
Great guideline! Thanks for sharing
thanks Martin!! I wrote this partially because I kept finding dozens of tutorials on this but they're all "ads" (like telling me to use this or that service). And I just wanted to know how to do it in a super simple nodeJS script!!
Awesome post, keep it up.