Abstract
In this article, we'll use OpenAI's CLIP model (Contrastive Language-Image Pre-training) to analyse the relationship between text and visual data by encoding and comparing their feature representations. Cosine similarity is calculated between image and text embeddings, and several dimensionality reduction techniques are used to create 2D visualisations of these relationships.
The notebook file used in this article is available on GitHub.
Introduction
In this article, we'll explore OpenAI's CLIP model to evaluate the relationship between image and text data. CLIP is used to encode both text and image features, which are then normalised so that we can compute cosine similarity, a measure of the relevance between the two modalities.
Create a SingleStore Cloud account
A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.
Import the notebook
We'll download the notebook from GitHub.
From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.
Run the notebook
After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.
We'll begin by installing the necessary libraries and importing dependencies.
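The notebook contains the exact cell; based on the code used in the rest of this article, a typical set of installs and imports looks roughly as follows (the install line itself is an assumption):

!pip install ftfy regex tqdm git+https://github.com/openai/CLIP.git umap-learn plotly scikit-learn --quiet

# Imports inferred from the code below
import requests
import torch
import clip
import umap
import plotly.express as px

from io import BytesIO
from IPython.display import Image, display
from PIL import Image as PILImage
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE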
Next, we'll load the CLIP model and preprocess function:
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)
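If you want to see which CLIP variants the package provides, clip.available_models() lists them (this check isn't part of the original notebook):

print(clip.available_models())
# e.g. ['RN50', 'RN101', ..., 'ViT-B/32', 'ViT-B/16', 'ViT-L/14']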
We'll then download a sample image, preprocess it and create some sample text, as follows:
image_url = "https://github.com/VeryFatBoy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"
response = requests.get(image_url)
display(Image(url = image_url))
image = preprocess(
    PILImage.open(
        BytesIO(response.content)
    )
).unsqueeze(0).to(device)

texts = [
    "What makes SingleStoreDB unique",
    "Ultra-Fast Ingestion",
    "Pipelines"
]
Next, we'll encode the image and text features:
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(
        clip.tokenize(texts).to(device)
    )
We'll normalise the features:
image_features /= image_features.norm(dim = -1, keepdim = True)
text_features /= text_features.norm(dim = -1, keepdim = True)
then combine the embeddings:
combined_features = torch.cat([
    image_features,
    text_features
], dim = 0).cpu().numpy()
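ViT-B/32 projects both modalities into a 512-dimensional shared embedding space, so with one image and three texts the combined matrix should have shape (4, 512). A quick sanity check (not in the original notebook):

print(combined_features.shape)  # expected: (4, 512)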
Next, we'll compute the cosine similarities:
similarities = [calculate_similarity(image_features, text_features[i]) for i in range(len(texts))]
labels = ["What makes SingleStoreDB unique (Image)"] + [
f"{text} (Cosine Similarity: {similarity:.6f})" for text, similarity in zip(texts, similarities)
]
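The calculate_similarity helper used above is defined in the notebook but not shown here. Since both sets of features are already L2-normalised, a minimal sketch only needs the dot product, which at that point equals the cosine similarity:

def calculate_similarity(image_feature, text_feature):
    # Both vectors are unit length, so their dot product is the cosine similarity
    return (image_feature @ text_feature).item()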
Before plotting, we'll print the similarity scores:
print(f"{'Text':<35} {'Cosine Similarity':<10}")
print("-" * 60)
for text, similarity in zip(texts, similarities):
print(f"{text:<35} {similarity:<10.6f}")
Example output:
Text                                Cosine Similarity
------------------------------------------------------------
What makes SingleStoreDB unique     0.265887
Ultra-Fast Ingestion                0.155181
Pipelines                           0.153016
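The text matching the thumbnail's title scores noticeably higher than the other two phrases, which is what we'd expect from CLIP's shared embedding space.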
We'll create a function to handle the different plots:
def plot_reduction(data, title, similarities):
    # 2D scatter of the reduced embeddings; marker size reflects the similarity values
    fig = px.scatter(
        x = data[:, 0],
        y = data[:, 1],
        color = labels,
        title = title,
        labels = {"x": "x", "y": "y"},
        size = similarities
    )
    # fig.update_traces(marker = dict(sizemode = "diameter", sizemin = 5))
    fig.show()

image_marker_size = 1  # marker size for the image point; text points are sized by their cosine similarities
First, we'll plot PCA:
pca = PCA(n_components = 2)
pca_result = pca.fit_transform(combined_features)
plot_reduction(
    pca_result,
    "PCA",
    [image_marker_size] + similarities
)
Example output is shown in Figure 1.
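With only four points, it's worth checking how much of the variance the two principal components actually retain; scikit-learn exposes this as explained_variance_ratio_ (this check isn't in the original notebook):

print(pca.explained_variance_ratio_)        # variance explained by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained in the 2D projection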
Next, we'll plot UMAP. Since we only have four embeddings, n_neighbors is capped at the number of samples minus one:
n_neighbors = min(15, combined_features.shape[0] - 1)
umap_model = umap.UMAP(n_components = 2, n_neighbors = n_neighbors, random_state = 42)
umap_result = umap_model.fit_transform(combined_features)
plot_reduction(
    umap_result,
    "UMAP",
    [image_marker_size] + similarities
)
Example output is shown in Figure 2.
Finally, we'll plot t-SNE, where perplexity must also be smaller than the number of samples:
perplexity = min(30, combined_features.shape[0] - 1)
tsne = TSNE(n_components = 2, perplexity = perplexity, random_state = 42)
tsne_result = tsne.fit_transform(combined_features)
plot_reduction(
    tsne_result,
    "t-SNE",
    [image_marker_size] + similarities
)
Example output is shown in Figure 3.
Summary
In this article, we used Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbour Embedding (t-SNE) to visualise the reduced feature space. Plotly charts for each method displayed the embeddings, with the text-image cosine similarities determining marker sizes. This demonstrated CLIP's ability to place text and images in a shared embedding space, offering a simple way to analyse how textual and visual features relate to one another.