In today's data-driven world, managing and searching through large datasets has become increasingly important. One powerful tool for handling this challenge is Milvus, an open-source vector database designed for AI applications. In this blog post, we'll explore a practical implementation of Milvus using Python, showcasing how it can be integrated with text embedding techniques to create an efficient search system.
All code for this blog post can be found in this companion GitHub repository.
Milvus: The Vector Database
Milvus is designed to provide scalable, reliable, and fast search capabilities for vector data. It's particularly suited for applications like image and video recognition, natural language processing, and recommendation systems, where data can be represented as high-dimensional vectors.
Setting Up Milvus
Before diving into the code, ensure you have Milvus installed and running. The first step in our Python script is to establish a connection with the Milvus server:
```python
from pymilvus import connections

def connect_to_milvus():
    try:
        connections.connect("default", host="localhost", port="19530")
        print("Connected to Milvus.")
    except Exception as e:
        print(f"Failed to connect to Milvus: {e}")
        raise
```
This function attempts to connect to a Milvus server running on the local machine. Error handling is crucial to catch and understand any issues that might arise during the connection.
Creating a Collection in Milvus
A collection in Milvus is like a table in a traditional database: it's where our data will be stored. Each collection can have multiple fields, akin to columns in a table. In our example, we create a collection with three fields: a primary key (`pk`), a source text (`source`), and embeddings (`embeddings`):
```python
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

def create_collection(name, fields, description):
    schema = CollectionSchema(fields, description)
    collection = Collection(name, schema, consistency_level="Strong")
    return collection

# Define fields for our collection
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=768)
]

collection = create_collection("hello_milvus", fields, "Collection for demo purposes")
```
In this code snippet, the embeddings have a dimension of 768 (for our specific custom embeddings model, mentioned in the next section), which should align with the output of the embedding model you use.
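Because a mismatch between the model's output dimension and the schema's `dim` is a common source of insert errors, a quick guard before inserting can fail fast with a clear message. This `validate_dims` helper is a hypothetical addition for illustration, not part of the companion repository:

```python
def validate_dims(vectors, expected_dim=768):
    """Raise early if any embedding's length doesn't match the collection schema."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Embedding {i} has dimension {len(vec)}, expected {expected_dim}"
            )

# A 768-dimensional vector passes silently; anything else raises ValueError
validate_dims([[0.0] * 768])
```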
Generating Text Embeddings in Python
Before we insert data into our Milvus collection, we need to generate text embeddings. This involves using a pre-trained model from the transformers library to convert text into numerical vectors; in our code, we use the thenlper/gte-base model for this purpose. In our app, this process is abstracted by our `embedding_util.py` module, which handles creating the vector embeddings.

For more details on how our custom `embedding_util.py` module works, check out my blog post on how to use Weaviate to store and query vector embeddings.
Generating and Inserting Data
To generate embeddings from text, we use the previously mentioned pre-trained model from the `transformers` library. This model converts text into numerical vectors that can be stored in our Milvus collection:
```python
from embedding_util import generate_embeddings

documents = [...]

embeddings = [generate_embeddings(doc) for doc in documents]

# Milvus expects column-ordered data: one list per field, in schema order
entities = [
    [str(i) for i in range(len(documents))],  # pk
    [str(doc) for doc in documents],          # source
    embeddings                                # embeddings
]

insert_result = insert_data(collection, entities)
```
The `insert_data` function inserts our data into the Milvus collection and then flushes the operations to ensure data persistence.
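The `insert_data` helper itself isn't shown above. A minimal version, assuming the column-ordered `entities` format from the previous snippet, might look like this:

```python
def insert_data(collection, entities):
    # Insert one list per field (pk, source, embeddings), in schema order
    insert_result = collection.insert(entities)
    # Flush so the inserted data is sealed and persisted before searching
    collection.flush()
    return insert_result
```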
Creating an Index for Efficient Searching
Milvus uses indexes to speed up the search process. Here, we create an IVF_FLAT index on the embeddings field:
```python
def create_index(collection, field_name, index_type, metric_type, params):
    index = {"index_type": index_type, "metric_type": metric_type, "params": params}
    collection.create_index(field_name, index)

create_index(collection, "embeddings", "IVF_FLAT", "L2", {"nlist": 128})
```
Performing a Vector Search
With our data indexed, we can now perform searches based on vector similarity:
```python
def search_and_query(collection, search_vectors, search_field, search_params):
    collection.load()
    result = collection.search(search_vectors, search_field, search_params, limit=3, output_fields=["source"])
    print_search_results(result, "Vector search results:")

query = "Give me some content about the ocean"
query_vector = generate_embeddings(query)
search_and_query(collection, [query_vector], "embeddings", {"metric_type": "L2", "params": {"nprobe": 10}})
```
In this search, we're looking for the top 3 documents most similar to the query "Give me some content about the ocean".
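The `print_search_results` helper used above isn't defined in the snippet. A simple version, assuming pymilvus-style hit objects that expose `id`, `distance`, and an `entity` with a `get` method, could be:

```python
def format_hit(hit):
    """Render one search hit as a single line."""
    return (f"Hit: id: {hit.id}, distance: {hit.distance}, "
            f"source field: {hit.entity.get('source')}")

def print_search_results(result, title):
    print(title)
    # `result` holds one list of hits per query vector
    for hits in result:
        for hit in hits:
            print(format_hit(hit))
```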
If you run the app successfully, you should see the following vector search results, sorted by L2 distance (smaller is more semantically similar):
```
Vector search results:
Hit: id: 6, distance: 0.39819106459617615, entity: {'source': 'The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.'}, source field: The sunset paints the sky with shades of orange, pink, and purple, reflecting on the calm sea.
Hit: id: 4, distance: 0.4780573844909668, entity: {'source': 'The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.'}, source field: The ancient tree, with its gnarled branches and deep roots, whispers secrets of the past.
Hit: id: 0, distance: 0.4835127890110016, entity: {'source': 'A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.'}, source field: A group of vibrant parrots chatter loudly, sharing stories of their tropical adventures.
```
Cleaning Up
After completing our operations, it's good practice to clean up by deleting entities and dropping the collection:
```python
delete_entities(collection, f'pk in ["{insert_result.primary_keys[0]}", "{insert_result.primary_keys[1]}"]')
drop_collection("hello_milvus")
```
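`delete_entities` and `drop_collection` are thin wrappers as well. Minimal versions, assuming the pymilvus `Collection.delete` and `utility.drop_collection` APIs, might look like:

```python
def delete_entities(collection, expr):
    # Delete all entities matching a boolean expression, e.g. 'pk in ["0", "1"]'
    return collection.delete(expr)

def drop_collection(name):
    # Deferred import keeps the delete helper usable in isolation
    from pymilvus import utility
    utility.drop_collection(name)
```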
Conclusion
Milvus offers a powerful and flexible way to work with vector data. By combining it with natural language processing techniques, we can build sophisticated search and recommendation systems. The Python script demonstrated here is a basic example, but the potential applications are vast and varied.
Whether you're dealing with large-scale image databases, complex recommendation systems, or advanced NLP tasks, Milvus can be an invaluable tool in your AI arsenal.