DEV Community

Bernice Waweru

Pandas Basics

Pandas is used hand in hand with NumPy in data science. In this article, we will explore pandas and build a solid understanding of the library.

Introduction

Pandas is an open-source Python library used to read, write, manipulate and analyze data.
Pandas integrates well with data visualization libraries such as Matplotlib.
To use pandas, you have to import it using:

import pandas as pd
pd is the conventional alias for pandas.

Pandas has two main data structures: DataFrames and Series.

A Series is a one-dimensional labeled array that can contain elements of different data types. A series is similar to a list in Python but is displayed as a column in a table.

A DataFrame is a two-dimensional table with labeled axes (rows and columns). It is a collection of series that share the same index.

Creating a Series

You can create a series by passing a list or dictionary to pd.Series().

1. Using a list

import pandas as pd
my_list = [1,4,5,8,9,3]
series_A = pd.Series(my_list)
print(series_A)

Output:

0    1
1    4
2    5
3    8
4    9
5    3
dtype: int64

2. Using a dictionary
The keys of the dictionary become the index and the values become the values of the series.

my_dict ={'one':'Jane','two':'Tom','three':'Kamau'} 
series_dict = pd.Series(my_dict)
print(series_dict)

Output:

one       Jane
two        Tom
three    Kamau
dtype: object
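As a small sketch of two points about series: a series built from values of different types falls back to the generic object dtype, and you can supply your own index labels through the index parameter. The values and labels below are made up for illustration.

```python
import pandas as pd

# Mixed element types: pandas stores them under the generic object dtype.
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)  # object

# Custom index labels (illustrative values, not taken from the article's data).
scores = pd.Series([88, 92, 79], index=['a', 'b', 'c'])
print(scores['b'])  # 92
```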

Creating DataFrames

1. Using a list or NumPy array. You can specify the column names and the row index labels.

patient_details = [[101,'Julia','Johns'],
[102,'Jessica','Watkins'],
[103,'Amanda','Elis']]
patient_dataframe = pd.DataFrame(patient_details,columns=['ID','FirstName','LastName'],index=[1,2,3])
print(patient_dataframe)

Output:

    ID FirstName LastName
1  101     Julia    Johns
2  102   Jessica  Watkins
3  103    Amanda     Elis

2. Using a dictionary

data = {'Name':['Kris', 'Kate', 'Gao', 'Anita'],
        'Age':[27, 24, 22, 32],
        'Major':['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)
print(df)

Output:

    Name  Age       Major
0   Kris   27  Statistics
1   Kate   24  Accounting
2    Gao   22   Economics
3  Anita   32    Telecoms

The dataframe consists of columns with different data types. You can check them with the dtypes attribute:

df.dtypes

Output:

Name     object
Age       int64
Major    object
dtype: object

Reading and writing CSV files

We often work with already existing data which can be used to create data frames. A lot of data is stored as CSV files, so it is important to understand how to read and write them.

The read_csv() function in pandas is used for reading CSV files stored locally or from a URL.

df2 = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Standard_Metropolitan_Areas_Data-data.csv")
df2

Output:

    land_area  percent_city  percent_senior  physicians  hospital_beds  graduates  work_force  income  region  crime_rate
 0       1384          78.1            12.3       25627          69678       50.1      4083.9   72100       1       75.55
 1       3719          43.9             9.4       13326          43292       53.9      3305.9   54542       2       56.03
 2       3553          37.4            10.7        9724          33731       50.6      2066.3   33216       1       41.32
 3       3916          29.9             8.8        6402          24167       52.2      1966.7   32906       2       67.38
 4       2480          31.5            10.5        8502          16751       66.1      1514.5   26573       4       80.19
..        ...           ...             ...         ...            ...        ...         ...     ...     ...         ...
94       1511          38.7            10.7         348           1093       50.4       127.2    1452       4       70.66
95       1543          39.6             8.1         159            481       30.3        80.6     769       3       36.36
96       1011          37.8            10.5         264            964       70.7        93.2    1337       3       60.16
97        813          13.4            10.9         371           4355       58.0        97.0    1589       1       36.33

The to_csv() method writes a data frame to a CSV file.

Syntax: df.to_csv('filename.csv')

You can also specify index=False to write the CSV file without the index.

df.to_csv('filename.csv', index=False)
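To see what index=False changes, here is a minimal sketch that skips the file and looks at the CSV text directly. It relies on to_csv() returning a string when no path is given, and it uses a small made-up data frame (df_small) rather than the article's data.

```python
import io
import pandas as pd

# Small illustrative data frame (hypothetical values).
df_small = pd.DataFrame({'Name': ['Kris', 'Kate'], 'Age': [27, 24]})

# With no path, to_csv() returns the CSV text instead of writing a file.
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

print(with_index)     # header starts with an empty field for the index column
print(without_index)  # header is just the column names

# Reading the index-free text back gives the same columns.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())
```

Writing without the index is usually what you want when the index is just the default 0, 1, 2, ... numbering, since read_csv() would otherwise load it back as an extra unnamed column.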

Attributes and Methods

1. shape: Returns the shape of the dataframe as a tuple of (number of rows, number of columns).

Syntax : df.shape

2. dtypes: Returns the data type of each column.

Syntax : df.dtypes

3. axes: Returns a list with the row axis labels and column axis labels.

Syntax : df.axes

4. empty: Returns True if the dataframe has no elements.

Syntax : df.empty

5. ndim: Returns the number of dimensions of the underlying data: 2 for a dataframe and 1 for a series.

Syntax : df.ndim

6. size: Returns the number of elements in the dataframe, which is the number of rows multiplied by the number of columns.

Syntax : df.size
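As a quick check, the sketch below rebuilds the small students dataframe from the dictionary example and prints each attribute. With 4 rows and 3 columns, size is 4 × 3 = 12.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

print(df.shape)   # (4, 3): 4 rows, 3 columns
print(df.ndim)    # 2: a dataframe is two-dimensional
print(df.size)    # 12: rows multiplied by columns
print(df.empty)   # False: the dataframe has elements
print(df.axes)    # row labels first, then column labels
```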

To display the first five observations from the dataframe use the head() method.

df2.head()

Output:

    land_area  percent_city  percent_senior  physicians  hospital_beds  graduates  work_force  income  region  crime_rate
 0       1384          78.1            12.3       25627          69678       50.1      4083.9   72100       1       75.55
 1       3719          43.9             9.4       13326          43292       53.9      3305.9   54542       2       56.03
 2       3553          37.4            10.7        9724          33731       50.6      2066.3   33216       1       41.32
 3       3916          29.9             8.8        6402          24167       52.2      1966.7   32906       2       67.38
 4       2480          31.5            10.5        8502          16751       66.1      1514.5   26573       4       80.19

To display the last five observations from the dataframe use the tail() method.

You can also specify the number of rows to be displayed by passing an argument to the tail() and head() methods.

df2.tail()
Output:

    land_area  percent_city  percent_senior  physicians  hospital_beds  graduates  work_force  income  region  crime_rate
94       1511          38.7            10.7         348           1093       50.4       127.2    1452       4       70.66
95       1543          39.6             8.1         159            481       30.3        80.6     769       3       36.36
96       1011          37.8            10.5         264            964       70.7        93.2    1337       3       60.16
97        813          13.4            10.9         371           4355       58.0        97.0    1589       1       36.33
98        654          28.8             3.9         140           1296       55.1        66.9    1148       3       68.76
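Passing a number to head() or tail() changes how many rows come back. The sketch below uses the small students dataframe from earlier so that it runs without downloading the CSV file.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

first_two = df.head(2)   # rows for Kris and Kate
last_one = df.tail(1)    # row for Anita

print(first_two)
print(last_one)
```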

The info() method prints a concise summary of the dataframe: the columns, their non-null counts and their data types.

df2.info()

Output:

 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   land_area       99 non-null     int64  
 1   percent_city    99 non-null     float64
 2   percent_senior  99 non-null     float64
 3   physicians      99 non-null     int64  
 4   hospital_beds   99 non-null     int64  
 5   graduates       99 non-null     float64
 6   work_force      99 non-null     float64
 7   income          99 non-null     int64  
 8   region          99 non-null     int64  
 9   crime_rate      99 non-null     float64
dtypes: float64(5), int64(5)

The describe() method returns summary statistics such as the count, mean, minimum and maximum. By default it covers the numerical columns only.

To get a summary of the categorical columns, we use the include parameter.
Syntax : df.describe(include='object')

You can also use include='all' to get a summary of all the columns.
Syntax : df.describe(include='all')
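Here is a short sketch of describe() on the small students dataframe, which has one numerical column (Age) and two text columns. The default call only reports on Age, while include='all' adds the text columns with statistics such as the number of unique values.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

# Default: numerical columns only, so the result has a single Age column.
summary_numeric = df.describe()
print(summary_numeric)

# include='all': every column, with NaN where a statistic does not apply.
summary_all = df.describe(include='all')
print(summary_all)
```

The mean age reported by the first call is (27 + 24 + 22 + 32) / 4 = 26.25.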

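The mean() method returns the mean of each numerical column. In pandas 2.0 and later, pass numeric_only=True if the dataframe also contains non-numerical columns.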
df2.mean()
Output:

land_area         2615.727273
percent_city        42.518182
percent_senior       9.781818
physicians        1828.333333
hospital_beds     6345.868687
graduates           54.463636
work_force         449.366667
income            6762.505051
region               2.494949
crime_rate          55.643030
dtype: float64

The median() method returns the median of each numerical column.
df2.median()

Output:

land_area         1951.00
percent_city        39.50
percent_senior       9.70
physicians         774.00
hospital_beds     3472.00
graduates           54.00
work_force         257.20
income            3510.00
region               3.00
crime_rate          56.06
dtype: float64

The value_counts() method returns the number of times each unique row appears in the dataframe. Every row in df2 is distinct, so each count below is 1.
df2.value_counts()
Output:

land_area  percent_city  percent_senior  physicians  hospital_beds  graduates  work_force  income  region  crime_rate
47         41.9          11.9            745         3352           36.3       258.9       3915    1       51.70         1
2966       26.9          10.3            2053        6604           56.3       450.4       6966    1       56.55         1
2766       67.9          7.7             679         3873           56.3       224.0       2598    3       63.22         1
2737       45.0          10.5            602         1462           71.3       131.4       1980    4       63.44         1
2710       63.7          6.2             357         1277           72.8       110.9       1639    4       63.10         1
                                                                                                                        ..
1490       33.1          11.9            827         3818           47.4       300.2       4144    1       30.59         1
1489       58.8          9.5             911         5720           56.5       175.1       2264    3       70.55         1
1465       30.3          6.8             598         6456           50.6       164.7       2201    3       70.66         1
1456       46.7          10.4            2484        8555           56.8       710.4       10104   2       44.64         1
27293      25.3          12.3            2018        6323           57.4       510.6       7399    4       76.03         1
Length: 99, dtype: int64

The unique() method returns the unique values in a column.
df.Age.unique()
Output:
array([27, 24, 22, 32])
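Counting is often more useful on a single column than on a whole dataframe. The sketch below uses a made-up series of region codes (hypothetical values, not read from the CSV file) to show unique(), nunique() and value_counts() side by side.

```python
import pandas as pd

# Hypothetical region codes with repeats.
regions = pd.Series([1, 2, 1, 4, 1, 2])

print(regions.unique())    # the distinct values, in order of first appearance
print(regions.nunique())   # how many distinct values there are

# How many times each value occurs, most frequent first.
counts = regions.value_counts()
print(counts)
```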

To drop rows or columns, use df.drop() and pass the labels to remove.
Use axis=0 (the default) to drop rows by index label.
Use axis=1 to drop columns by name.
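A minimal sketch of both directions on the small students dataframe. Note that drop() returns a new dataframe and leaves the original unchanged unless you assign the result back.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

# Drop the 'Major' column (axis=1 means columns).
without_major = df.drop('Major', axis=1)

# Drop the row whose index label is 0 (axis=0 means rows).
without_first = df.drop(0, axis=0)

print(without_major)
print(without_first)
print(df.shape)  # (4, 3): the original is untouched
```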

Indexing

The loc and iloc indexers are used for indexing. They take square brackets rather than parentheses.

loc is label-based selection and iloc is integer position-based selection.

df.iloc[1]

Output:

Name           Kate
Age              24
Major    Accounting
Name: 1, dtype: object

You can specify the row and column positions to access a specific element. Negative positions, which count from the end, also work.
df.iloc[1,2]

Output:
'Accounting'
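Negative positions with iloc count from the end, just like Python lists. A short sketch on the small students dataframe:

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

last_row = df.iloc[-1]        # the last row (Anita)
last_cell = df.iloc[-1, -1]   # last row, last column

print(last_row)
print(last_cell)  # Telecoms
```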

You can also use slicing to access a range of items.
df.iloc[:3,1]

Output:

0    27
1    24
2    22
Name: Age, dtype: int64

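Using loc you select by label, so you can specify the columns by name. Unlike iloc, a loc slice includes its end label, so 0:2 returns rows 0, 1 and 2.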
df.loc[0:2,['Name','Age']]

Output:

   Name  Age
0  Kris   27
1  Kate   24
2   Gao   22
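The difference between the two indexers is easiest to see with the same slice. In this sketch, loc treats 0:2 as labels and includes the end label, while iloc treats 0:2 as positions and excludes the end, as Python slicing does.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

by_label = df.loc[0:2, 'Name']      # labels 0, 1 and 2
by_position = df.iloc[0:2, 0]       # positions 0 and 1 only

print(len(by_label))     # 3
print(len(by_position))  # 2
```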

Selecting and Assigning data

Attribute based selection

You can use dot selection to select a column using the following syntax: df.columnName
df.Name
Output:

0     Kris
1     Kate
2      Gao
3    Anita
Name: Name, dtype: object

You can also use the bracket based selection to select a column or multiple columns.
When selecting multiple columns we use double square brackets.
df[['Age','Major']]

Output:

   Age       Major
0   27  Statistics
1   24  Accounting
2   22   Economics
3   32    Telecoms

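Conditional Selection

To select rows that satisfy a certain condition, we can use comparison operators such as ==. The comparison returns a boolean series with one value per row, which can be passed inside square brackets to keep only the rows marked True.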
df.Age == 27

Output:

0     True
1    False
2    False
3    False
Name: Age, dtype: bool
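A boolean series like this one becomes a filter when you place it inside square brackets. The sketch below keeps only the matching rows, and combines two conditions with & (each condition needs its own parentheses).

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

# Rows where Age is greater than 24.
older = df[df.Age > 24]
print(older)

# Rows where Age is greater than 22 and Major is not Statistics.
both = df[(df.Age > 22) & (df.Major != 'Statistics')]
print(both)
```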

Assigning Data
To assign data to a column, select it and provide the new value. A single value is assigned to every row of that column.

df.Name = 'Kamau'
df

Output:

    Name  Age       Major
0  Kamau   27  Statistics
1  Kamau   24  Accounting
2  Kamau   22   Economics
3  Kamau   32    Telecoms
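Assigning to a whole column overwrites every row. To change only some values, one option is to select them with loc first. The sketch below starts again from the original students data. The new values and the Enrolled column are made up for illustration.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)

# Change a single cell: row label 1, column 'Age'.
df.loc[1, 'Age'] = 25

# Change only the rows that match a condition.
df.loc[df.Age > 30, 'Major'] = 'Engineering'

# Assigning to a column name that does not exist yet creates the column.
df['Enrolled'] = True

print(df)
```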
