As with any IT project, a structured method helps carry a machine learning project to completion, and CRISP-DM is the one most often adopted. It consists of six phases, the second of which is the Data Understanding phase. This phase stands out as the crucial foundation of any machine learning project. Imagine yourself as an architect planning the construction of a skyscraper; before even laying the first stone, you must understand every nuance of the terrain. Similarly, before diving into sophisticated algorithms and complex models, a deep and detailed understanding of your data is essential. This stage allows you to unveil the secrets hidden in your datasets, discover subtle trends, and identify anomalies. When well executed, it transforms a mere collection of raw data into a goldmine of actionable insights, guiding every decision and adjustment throughout the project. In this article, we will explore how to master this crucial phase, using the tools of the Python ecosystem, Pandas in particular, to demystify and enhance your data. Get ready to delve into the very heart of data science, where every data point tells a story and every insight paves the way to success.
The Data Understanding phase consists of three steps, each resulting in a deliverable:
The identity card of the dataset(s).
The description of the fields.
The statistical analysis of each field.
To accomplish these, it is necessary to load the data and analyze it thoroughly.
Data understanding is an analysis phase, not a modification phase. The only manipulations allowed at this stage are those needed to load the data, format it, or change field types for easier analysis. For example, adjusting the decimal separator is part of formatting.
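Such formatting can often be handled directly when loading the file. Here is a minimal sketch, assuming a hypothetical measures.csv file that uses a semicolon as field separator and a comma as decimal separator:

import pandas as pd

# Hypothetical file: ';' as field separator and ',' as decimal separator.
# Pandas converts the numeric columns directly while reading the file.
df = pd.read_csv('measures.csv', sep=';', decimal=',')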
Loading Data
The first action to perform is to load the data into the Jupyter notebook. For this, Pandas provides a family of methods of the form read_xxx. These methods currently allow reading data in the following formats: pickle, CSV, FWF (fixed-width), table (generic), clipboard, Excel, JSON, HTML, HDF5, Feather, Parquet, ORC, SAS, SPSS, SQL, Google BigQuery, and Stata. This list may grow over time.
To read the Iris and Covid19 datasets (in CSV and XLSX format respectively), the methods to use are read_csv and read_excel:
import pandas as pd

iris_df = pd.read_csv('iris.csv')
covid_df = pd.read_excel('covid19.xlsx')
To preview the loaded DataFrames and verify that they have been correctly loaded, simply use the head method. You can pass a number as a parameter to indicate how many rows to display; otherwise, the first five rows are displayed by default.
iris_df.head()
Creating the Dataset Identity Card
The first step of this phase is to create the dataset identity card. This is important because it provides all the global information about the data that will be used for the process. It includes, but is not limited to:
The dataset name: In case there are multiple files, this allows knowing exactly which ones were used.
Its origin: This includes both the data source (database, flat file, etc.) and the extraction date. Depending on this information, the data quality and its relevance to the machine learning task may be questioned, for example, if using outdated data.
Its size: This ensures that during future loads, all data has been accounted for. Therefore, it is necessary to indicate the number of records, attributes, and the file size if applicable.
Its formatting: This helps to better understand the file structure, facilitating its loading if it needs to be done again later.
The business description of the data: This information is crucial as it allows understanding what the data represents and its connection to the problem to be solved.
Most of these fields do not require technical operations. For the formatting, the loading into Pandas can be reused, since it already reveals the file structure, such as the presence of a specific separator. For the dataset size, the shape attribute, as in NumPy, gives the number of rows and columns.
iris_df.shape
(150, 5)
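A few other elements of the identity card can also be collected programmatically. Here is a minimal sketch, assuming the iris.csv file used above sits next to the notebook:

import os

n_rows, n_cols = iris_df.shape
file_size = os.path.getsize('iris.csv')            # size on disk, in bytes
mem_size = iris_df.memory_usage(deep=True).sum()   # size in memory, in bytes

print(f"{n_rows} records, {n_cols} attributes")
print(f"file size: {file_size} bytes, in-memory size: {mem_size} bytes")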
Field Description
Once the dataset is described, the second step of the Data Understanding phase is to describe each field, typically in the form of a table. This allows for a precise understanding of each variable: its meaning, the expected values, and any known limitations. Without this information, the variables lose their meaning, and no model can be reliably put into production.
A few years ago, a medical article was published showing a link between the treatment for cancer patients and the presence of specific sequences in their genome. It was an impressive breakthrough. However, the article had to be retracted because the authors had reversed the meaning of a variable (presence or absence), making their discovery nonsensical at best and, at worst, potentially life-threatening for patients following the model's recommendations.
This involves providing for each field:
Its name, as it appears in the dataset,
Its type: integer, float, string, date, etc.,
Its format if it is specific: for dates, for example, indicate the format, especially between DD/MM and MM/DD,
Its description: what the variable exactly indicates. For industrial processes, this is often accompanied by a diagram showing where the different measurements were taken,
Its unit: this is very important for verifying the correspondence between the variable's content and its meaning. For example, if it is a liquid water temperature in °C, it should be between 0 and 100 at ambient pressure,
The presence or absence of missing data and, if applicable, the number of missing data points,
Its expected limits, which derive from the previous information (a quick verification is sketched just after this list),
Any other useful information if necessary.
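Once the expected limits are documented, a quick check confirms whether the observed values stay within them. Here is a minimal sketch, using purely illustrative bounds for the sepal_length field of the Iris dataset:

# Illustrative limits only: replace them with the bounds agreed with the business experts.
lower, upper = 0.0, 10.0
outside = ~iris_df['sepal_length'].between(lower, upper)
print(f"{outside.sum()} value(s) of sepal_length outside [{lower}, {upper}]")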
Pandas can provide some of this information: the type and the missing data. The rest of the table will mainly be obtained through discussions with the client and/or the data provider.
Managing Data Types
The type can be determined by the dtypes attribute. However, be cautious, as the type may be incorrect upon import due to incorrect detection. It is possible to change the type of fields using the astype function, by passing the desired type name as a parameter.
After loading the Iris dataset, you can check the types of the fields. Then, you can change the type of the species column to a categorical variable and display the updated types.
iris_df.dtypes
Initial data types
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species object
dtype: object
iris_df['species'] = iris_df['species'].astype('category')
iris_df.dtypes
Updated data types
sepal_length float64
sepal_width float64
petal_length float64
petal_width float64
species category
dtype: object
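When automatic detection gets a type wrong, it can also be corrected at load time rather than after the fact. Here is a minimal sketch, using the dtype and parse_dates parameters of read_csv on a hypothetical file with hypothetical column names:

import pandas as pd

# Hypothetical example: force the types while reading instead of fixing them afterwards.
df = pd.read_csv(
    'measurements.csv',                  # hypothetical file
    dtype={'sensor_id': 'category',      # read directly as a categorical variable
           'value': 'float64'},          # force a floating-point type
    parse_dates=['timestamp'],           # parse this column as a date
)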
Detecting Missing Data
You can determine the number of missing values per variable as well as the number of rows containing missing data. By both measures, the Iris dataset has no missing data at all, whereas the Covid-19 dataset has 5644 rows with missing data (which is quite a lot).
iris_df.isnull().sum()
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
species 0
dtype: int64
iris_df.isnull().any(axis=1).sum()
0
covid_df.isnull().sum()
Patient ID 0
Patient age quantile 0
SARS-Cov-2 exam result 0
Patient addmited to regular ward (1=yes, 0=no) 0
Patient addmited to semi-intensive unit (1=yes, 0=no) 0
...
HCO3 (arterial blood gas analysis) 5617
pO2 (arterial blood gas analysis) 5617
Arteiral Fio2 5624
Phosphor 5624
ctO2 (arterial blood gas analysis) 5617
Length: 111, dtype: int64
covid_df.isnull().any(axis=1).sum()
5644
It is therefore necessary to determine what will be done for each field. During the Data Preparation phase, for example, missing values can be filled in with a predefined value.
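To bring together the information that Pandas can provide for the field description table, the types and the missing value counts can be assembled into a single summary DataFrame. Here is a minimal sketch, applied to the Covid-19 dataset:

import pandas as pd

# One row per field: its type, the number of missing values, and the share of missing values.
summary = pd.DataFrame({
    'type': covid_df.dtypes,
    'missing': covid_df.isnull().sum(),
    'missing_ratio': covid_df.isnull().mean(),
})
summary.head()

The business description, the unit, and the expected limits still have to be filled in by hand, typically after discussion with the client or the data provider.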