A categorical variable (sometimes called a nominal variable) is a variable that can take one of a limited number of possible values, called categories, which have no intrinsic ordering. It uses labels, names, or other descriptors (even numbers) to identify mutually exclusive categories or types of things.
An example of a categorical variable is Nationality, with values like Brazilian, Canadian, French, etc. There is no ordering between these values: we cannot say that Brazilian is higher than Canadian, and there is no way to order the categories from highest to lowest or from best to worst.
Other examples of categorical variables could be Region (North, South, East, West), Blood Type (A, B, AB, O) or Smartphone Brand (Apple, Samsung, LG, Xiaomi).
However, if there is a clear order between the categories, we are dealing with an ordinal variable. An ordinal variable is very similar to a categorical variable and is often considered a special kind of one, placed between categorical and quantitative variables. An example of an ordinal variable could be Educational Level (Elementary school education, High school graduate, Some college, College graduate, Graduate degree).
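The difference between nominal and ordinal variables can be made concrete in pandas. The following is only a minimal sketch: the sample values are made up for illustration, and it assumes that `pd.Categorical` with the `ordered` flag is an acceptable way to model both kinds of variable.

```python
import pandas as pd

# Nominal variable: the categories carry no order.
blood_type = pd.Categorical(
    ["A", "O", "AB", "A"],
    categories=["A", "B", "AB", "O"],
    ordered=False,
)

# Ordinal variable: the categories are listed from lowest to highest.
levels = [
    "Elementary school education",
    "High school graduate",
    "Some college",
    "College graduate",
    "Graduate degree",
]
education = pd.Categorical(
    ["Some college", "Graduate degree", "High school graduate"],
    categories=levels,
    ordered=True,
)

# Order-based operations are only meaningful for the ordinal variable.
highest = education.max()
above_high_school = education > "High school graduate"

print(blood_type.ordered, education.ordered)
print(highest)
print(above_high_school.tolist())
```

Asking for the maximum or comparing against a level only makes sense for the ordered variable; for the nominal one, counting is essentially all we can do.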
But in this article we are focusing on pure categorical or nominal variables, so let's check out what we can do with some categorical data.
Frequency Distribution
Given a dataset with some categorical variables, the most common thing we can do is count the occurrences of each category across the whole dataset. This gives us a frequency distribution.
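Before moving to real data, here is a tiny sketch of the idea using only the Python standard library. The `sample` list is a hypothetical toy sample, not taken from any dataset.

```python
from collections import Counter

# Hypothetical toy sample of app categories (not taken from the dataset).
sample = ["GAME", "FAMILY", "GAME", "TOOLS", "FAMILY", "GAME"]

# Counter maps each distinct category to its number of occurrences.
counts = Counter(sample)

# most_common() lists categories from most to least frequent.
for category, count in counts.most_common():
    print(category, count)
```

The resulting mapping from category to count is the frequency distribution of the sample.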
Let's take a look at some real data to demonstrate a frequency distribution. We will use the Kaggle Google Play Store Apps dataset from Lavanya Gupta. This dataset has more than 10,000 rows, each of which represents an app from the Google Play Store, and as features (columns) we can see the App name, Category, Rating, and others.
We will use pandas for handling the data. First, we import pandas and read only the Category column from the CSV file downloaded from Kaggle. Then, we use the unique method to show all the values observed in our data. As we can see, there are 34 distinct values in our categorical variable, like Finance, Sports, Weather and others, and there is no order between them (the Events category is not better or higher than the Shopping category, for instance).
import pandas as pd
df = pd.read_csv("./data/googleplaystore.csv", usecols=['Category'])
categories = df['Category'].unique()
print(f"{len(categories)} categories:")
print(categories)
34 categories:
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']
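The last value in the output above, '1.9', does not look like a category name and may come from a malformed row in the CSV file. A quick way to spot such values is sketched below. It runs on a toy Series standing in for df['Category'] and assumes that a valid label contains only uppercase letters and underscores; that rule is an assumption based on the labels printed above, not something documented by the dataset.

```python
import pandas as pd

# Toy stand-in for df['Category']; the labels mimic the output above.
category = pd.Series(["FINANCE", "SPORTS", "1.9", "WEATHER", "SPORTS"])

# Assumption: a valid label consists only of uppercase letters and underscores.
is_valid = category.str.fullmatch(r"[A-Z_]+")

# Labels that break the rule, and the data with those rows filtered out.
suspicious = category[~is_valid].unique()
clean = category[is_valid]

print(suspicious)
print(clean.nunique())
```

Whether to drop or repair such rows depends on the analysis; the rest of this article keeps the data as it was loaded.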
Now that we know all the category values we can have, let's count how many times each category occurs in our data using the value_counts method.
frequency = df['Category'].value_counts()
# frequency is a pandas Series, so we'll convert it into a DataFrame just for presentation purposes
frequency_dist = pd.DataFrame(frequency)
frequency_dist.columns = ['Frequency']
frequency_dist.index.name = 'Category'
# Using head(10) to show only the first 10 lines
frequency_dist.head(10)
Category | Frequency |
---|---|
FAMILY | 1972 |
GAME | 1144 |
TOOLS | 843 |
MEDICAL | 463 |
BUSINESS | 460 |
PRODUCTIVITY | 424 |
PERSONALIZATION | 392 |
COMMUNICATION | 387 |
SPORTS | 384 |
LIFESTYLE | 382 |
So, we can see above that Family is the most common category, with 1,972 occurrences. Game and Tools are also common categories. On the other hand, there are few apps from the Beauty category.
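Since head(10) shows only the most frequent categories, the rare ones have to be looked up separately. One possible sketch uses `Series.nsmallest`; the counts below are hypothetical, made up for illustration, and are not the real dataset values.

```python
import pandas as pd

# Hypothetical counts, made up for illustration (not the real dataset values).
frequency = pd.Series(
    {"FAMILY": 50, "GAME": 30, "TOOLS": 20, "EVENTS": 4, "BEAUTY": 3},
    name="Frequency",
)

# nsmallest(n) returns the n entries with the lowest counts, smallest first.
rarest = frequency.nsmallest(2)
print(rarest)
```

On the real data, the same call on the `frequency` Series from the article would list the least common categories.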
Relative Frequency
At this point we already know how many apps we have in each category. But what if we wanted to find out what percentage of all apps are Medical apps? For that, we need to calculate the relative frequency of each category by dividing its frequency by the total number of apps (the sample size, n).
Relative frequency of something = Frequency of something / n
Again, we will use the marvelous pandas. A relative frequency always lies between 0 and 1, but here we will multiply it by 100 and show the values as percentages instead. So, as you can see below, Medical apps represent approximately 4.27% of all apps in the Google Play Store according to our dataset.
frequency_dist['Relative Frequency (%)'] = frequency_dist['Frequency'] / frequency_dist['Frequency'].sum() * 100
# Using head(10) to show only the first 10 lines
frequency_dist.head(10)
Category | Frequency | Relative Frequency (%) |
---|---|---|
FAMILY | 1972 | 18.190204 |
GAME | 1144 | 10.552532 |
TOOLS | 843 | 7.776035 |
MEDICAL | 463 | 4.270824 |
BUSINESS | 460 | 4.243151 |
PRODUCTIVITY | 424 | 3.911078 |
PERSONALIZATION | 392 | 3.615903 |
COMMUNICATION | 387 | 3.569781 |
SPORTS | 384 | 3.542109 |
LIFESTYLE | 382 | 3.523660 |
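As an alternative to dividing by the total ourselves, pandas can compute relative frequencies directly through the `normalize` parameter of `value_counts`. The sketch below uses a toy Series with hypothetical values in place of df['Category'].

```python
import pandas as pd

# Toy stand-in for df['Category'] (hypothetical values).
category = pd.Series(["FAMILY"] * 4 + ["GAME"] * 3 + ["TOOLS"] * 1)

# normalize=True divides each count by the total number of rows,
# so the result is the relative frequency between 0 and 1.
relative = category.value_counts(normalize=True)

# Multiply by 100 to express the shares as percentages.
relative_pct = relative * 100
print(relative_pct)
```

The relative frequencies always add up to 1 (or 100 in percentage form), which is a handy sanity check.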
Frequency Bar Chart
Finally, we will plot the frequency variable in a bar chart, which is a pretty common way to visualize categorical data.
import plotly.express as px
fig = px.bar(frequency)
fig.update_layout(title='Frequency Distribution of Google Play Store app categories',
                  xaxis_title='Category',
                  yaxis_title='Frequency')
fig.show()
So, in this article we have seen a bit about categorical (or nominal) variables, a very common data type in Statistics, Data Analysis, Machine Learning, and so on. This was just an introduction, but we may cover the topic in more depth in upcoming posts.