Abstract
In this short article, we'll use the ElPiGraph library to construct a principal graph of the Iris dataset. We'll then visualise the graph using Principal Component Analysis (PCA), providing insights into the relationships between the features of different flower species.
The notebook file used in this article is available on GitHub.
Introduction
Scientific literature and various websites refer to Metro Maps. For example, the Wikipedia page for the Iris Flower Data Set shows an image of one such Metro Map. However, example code for rendering such a Metro Map is hard to find. Fortunately, we can draw an analogy between a principal graph constructed using the ElPiGraph library and the concept of a Metro Map in data visualisation. Just as Metro Maps show connected paths and key stations to represent an urban transit system, principal graphs represent the underlying structure of high-dimensional data, showing how different data points (or "stations") are connected or related in a lower-dimensional space.
By visualising this graph with PCA, we can essentially create a simplified, structured map of relationships between different features, similar to how Metro Maps simplify the layout of a city's transit system to highlight key connections and routes.
Create a SingleStore Cloud account
A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.
Import the notebook
We'll download the notebook from GitHub.
From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.
Run the notebook
After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.
We'll begin by installing the necessary libraries and importing dependencies, followed by loading the Iris dataset from scikit-learn.
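As a minimal sketch of the loading step, the snippet below reads the Iris dataset from scikit-learn into a feature matrix. The variable names (`data`, `species`, `feature_names`) are assumptions chosen to match the `data` argument used later; the notebook itself may organise this differently.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset (150 samples, 4 numeric features).
iris = load_iris()

data = iris.data                           # feature matrix, one row per flower
species = iris.target_names[iris.target]   # species name for each row
feature_names = iris.feature_names         # column labels, in column order

print(data.shape)
print(feature_names)
```

The feature matrix is what gets passed to the graph-fitting call; the species labels are only needed later, for colouring the plot.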
The core step in the notebook is fitting the graph, as follows:
elastic_graph = elpigraph.computeElasticPrincipalTree(data, NumNodes=50)
The Iris dataset consists of 150 rows. We'll use 50 nodes to create the graph. In ElPiGraph, the number of nodes in the graph does not need to match the number of rows. Instead, the nodes represent key points or landmarks that summarise the structure of the data. These nodes are meant to capture the most important trends or patterns in the dataset, rather than representing every single data point.
Once the graph has been computed, we'll prepare the data for visualisation using Plotly Express. Figure 1 shows the graph with the data points and edges.
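The preparation step can be sketched as follows, assuming the fitted graph can be reduced to two arrays: node coordinates in the original feature space and pairs of node indices for the edges. The function `project_graph` and the placeholder graph below are hypothetical and only illustrate the projection; the placeholder uses per-species means as nodes and is not ElPiGraph output. The key point is that PCA is fitted once on the data and the same transform is applied to the nodes, so points and graph share one coordinate system.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA


def project_graph(data, node_positions, edges, n_components=2):
    """Project data points and graph nodes with a single PCA fitted on the data."""
    pca = PCA(n_components=n_components).fit(data)
    data_2d = pca.transform(data)
    nodes_2d = pca.transform(node_positions)
    # One (start, end) coordinate pair per edge: shape (n_edges, 2, n_components).
    segments = nodes_2d[np.asarray(edges)]
    return data_2d, nodes_2d, segments


iris = load_iris()
data = iris.data

# Placeholder graph (an assumption for illustration, NOT ElPiGraph output):
# one node per species mean, joined in a chain 0 - 1 - 2.
node_positions = np.vstack([data[iris.target == k].mean(axis=0) for k in range(3)])
edges = np.array([[0, 1], [1, 2]])

data_2d, nodes_2d, segments = project_graph(data, node_positions, edges)
print(data_2d.shape, nodes_2d.shape, segments.shape)
```

The returned arrays can be handed to any plotting library: the projected points as a scatter layer coloured by species, and each segment as a line between two projected nodes.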
The graph highlights clusters, which correspond to different species of the Iris flowers (Setosa, Virginica, and Versicolor). It shows how different data points (flower samples) are connected or related based on their feature values (e.g., petal length, sepal width).
Projected with PCA, the principal graph shows how one species transitions to another or how the species separate in the feature space. For example, one species (Setosa) is clearly separated, while the other two (Virginica, Versicolor) show smoother transitions or overlaps, indicating similarities in their feature profiles.
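The separation described above can be checked directly on a single feature, independently of the graph. The sketch below compares the range of petal length per species; it is a simple sanity check rather than part of the ElPiGraph workflow, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # column 2 is "petal length (cm)"

# Minimum and maximum petal length for each species.
ranges = {}
for k, name in enumerate(iris.target_names):
    values = petal_length[iris.target == k]
    ranges[str(name)] = (float(values.min()), float(values.max()))
    print(f"{name}: {values.min():.1f} to {values.max():.1f} cm")

# Setosa is separated if its largest value is below the smallest value of the others.
setosa_separated = ranges["setosa"][1] < min(
    ranges["versicolor"][0], ranges["virginica"][0]
)
# Versicolor and Virginica overlap if their ranges intersect.
others_overlap = ranges["versicolor"][1] > ranges["virginica"][0]

print("Setosa separated:", setosa_separated)
print("Versicolor/Virginica overlap:", others_overlap)
```

A gap in one feature is enough to isolate Setosa, whereas Versicolor and Virginica share part of their range, which is consistent with the smoother transition between them in the projected graph.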
Summary
Fitting a principal graph with ElPiGraph and projecting it with PCA provides a clearer, more interpretable view of the relationships between different flower species and their feature distributions, highlighting clusters, transitions, and overall data structure.