Today we are going to cover the Pandas library.
Pandas is a Python library that is commonly used for data manipulation and data analysis, mostly in Data Science and Machine Learning. In this notebook we are going to show how powerful the pandas library is!
Let's get started!
Let's import the NumPy and pandas libraries into our workspace. Here we are using a Kaggle notebook, where these libraries are already installed.
import numpy as np
import pandas as pd
If these libraries aren't installed in your environment, you have to install them (for example with pip install numpy pandas) before importing them.
1. Pandas Series
Let's create a series with pandas.
a1=['a','b','c']
my_data=[50,70,30]
ar=np.array(my_data)
d={'a':50,'b':70,'c':30}
pd.Series(data=my_data, index=a1)
The same thing can be done with positional arguments (data first, then index):
pd.Series(my_data,a1)
and also with:
pd.Series(d)
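As a quick check, the three calls above should build the same Series. Here is a minimal sketch; the names labels, s_keyword, s_positional and s_dict are only for illustration and are not part of the original notebook.

```python
import pandas as pd

labels = ['a', 'b', 'c']
my_data = [50, 70, 30]
d = {'a': 50, 'b': 70, 'c': 30}

s_keyword = pd.Series(data=my_data, index=labels)  # keyword arguments
s_positional = pd.Series(my_data, labels)          # data first, then index
s_dict = pd.Series(d)                              # dict keys become the index

# equals() checks that the values and the index labels match
print(s_keyword.equals(s_positional))
print(s_keyword.equals(s_dict))
```

Both comparisons should print True, since a dict passed to pd.Series uses its keys as labels and its values as data.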
Indexing in series
series1=pd.Series([1,2,3,4],['A','B','C','D'])
series1
series1['C']
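Besides plain square brackets, a Series can be indexed explicitly by label with .loc or by position with .iloc. The sketch below reuses series1 from above; the variable names by_label, by_position and label_slice are illustrative.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], ['A', 'B', 'C', 'D'])

by_label = series1.loc['C']          # label-based lookup, same as series1['C']
by_position = series1.iloc[2]        # position-based lookup (third element)
label_slice = series1.loc['B':'C']   # label slices include both endpoints

print(by_label, by_position)
print(label_slice)
```

Note that slicing by label keeps the end label 'C', unlike ordinary Python slicing by position.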
2. Pandas DataFrames
Import the libraries required for creating a data frame in Python with pandas.
import numpy as np
import pandas as pd
from numpy.random import randn
We set a fixed seed because we want to draw the same set of random numbers each time we run the code. Otherwise our results would vary from run to run.
np.random.seed(1011)
df=pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df
Here is our data frame.
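To see what the seed does, we can build the frame twice with the same seed and once with a different one. This is a small sketch; make_frame is a hypothetical helper written for this illustration, and the seed 7 is arbitrary.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

def make_frame(seed):
    # hypothetical helper: reseed, then draw a 5x4 table of standard normal values
    np.random.seed(seed)
    return pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

first = make_frame(1011)
second = make_frame(1011)
other = make_frame(7)

print(first.equals(second))  # same seed, same numbers
print(first.equals(other))   # different seed, different numbers
```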
If we want to grab the column 'W', the output is a Series:
df['W']
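Another way to grab a column is dot notation, as in SQL. This only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute: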
df.W
If we want to grab multiple columns, we pass a list of names and the output is a DataFrame:
df[['W','Z']]
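The difference between the two return types is easy to verify. The sketch below rebuilds the same frame and inspects the types; one_col, two_cols and one_col_frame are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(1011)
df = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

one_col = df['W']           # single label -> Series
two_cols = df[['W', 'Z']]   # list of labels -> DataFrame
one_col_frame = df[['W']]   # a one-item list still gives a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(one_col_frame.shape)
```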
Add a column
Let's add a column to the data frame
df['H']=df['W']+df['Z']
Delete a column
To delete a column we will use the drop method. The argument axis=1 tells pandas to look for the label among the columns rather than the rows.
df.drop('H',axis=1)
But if you display the data frame again, the new column is still there: drop returns a new data frame and leaves the original unchanged. To modify df itself, we have to add another argument.
df.drop('H',axis=1,inplace=True)
This removes the column from df itself.
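The behaviour of drop can be checked on a tiny frame. This is a sketch with made-up values; it also shows reassignment with drop(columns=...) as an alternative to inplace=True.

```python
import pandas as pd

df = pd.DataFrame({'W': [1.0, 2.0], 'Z': [10.0, 20.0]}, index=['A', 'B'])
df['H'] = df['W'] + df['Z']            # element-wise sum, aligned on the row labels

without_h = df.drop('H', axis=1)       # returns a new data frame
still_in_original = 'H' in df.columns  # True: df itself was not modified

df = df.drop(columns='H')              # reassigning is an alternative to inplace=True
print(still_in_original, list(without_h.columns), list(df.columns))
```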
Selecting rows with a label-based index:
df.loc[['A','B'],['W','Y']]
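For comparison, .iloc selects the same cells by integer position instead of by label. A minimal sketch on the same seeded frame (by_label, by_position and row_a are illustrative names):

```python
import numpy as np
import pandas as pd

np.random.seed(1011)
df = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

by_label = df.loc[['A', 'B'], ['W', 'Y']]   # rows A, B and columns W, Y by name
by_position = df.iloc[[0, 1], [0, 2]]       # the same cells by integer position
row_a = df.loc['A']                         # a single row comes back as a Series

print(by_label.equals(by_position))
print(row_a.index.tolist())
```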
Conditional selection
Select the rows where the value in column W is greater than zero, and show only columns Y and X for those rows.
df[df['W']>0][['Y','X']]
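The expression above works in two steps: the comparison builds a boolean Series, and that Series is then used to filter rows. The sketch below uses a small hand-made frame (illustrative values, not the random data) and also shows .loc doing the row filter and the column selection in a single step.

```python
import pandas as pd

# small hand-made frame with illustrative values
df = pd.DataFrame({'W': [0.5, -1.0, 2.0], 'X': [1, 2, 3], 'Y': [4, 5, 6]},
                  index=['A', 'B', 'C'])

mask = df['W'] > 0                       # boolean Series, one entry per row
chained = df[mask][['Y', 'X']]           # filter rows, then pick columns
single_step = df.loc[mask, ['Y', 'X']]   # rows and columns in one lookup

print(mask.tolist())
print(chained.equals(single_step))
```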
Multiple conditions: can you explain what result the following code will give?
df[(df['W']>0) & (df['Y']>1)]
df[(df['W']>0) | (df['Y']>1)]
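One way to check your answer is to try both operators on a frame small enough to read by eye. The values below are made up for illustration. Note that each condition needs its own parentheses, because & and | bind more tightly than the comparison operators, and that Python's and/or do not work element-wise on a Series.

```python
import pandas as pd

# illustrative values chosen so each row hits a different case
df = pd.DataFrame({'W': [0.5, -1.0, 2.0, -0.3],
                   'Y': [1.5, 2.0, 0.2, -1.0]},
                  index=['A', 'B', 'C', 'D'])

both = df[(df['W'] > 0) & (df['Y'] > 1)]     # rows where both conditions hold
either = df[(df['W'] > 0) | (df['Y'] > 1)]   # rows where at least one holds

print(both.index.tolist())
print(either.index.tolist())
```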
Multi-level index or index hierarchy
Now we will create a data frame whose index has more than one level.
outside=['G1','G1','G1','G2','G2','G2']
inside=[1,2,3,1,2,3]
hi_index=list(zip(outside,inside))
hi_index=pd.MultiIndex.from_tuples(hi_index)
df=pd.DataFrame(randn(6,2),hi_index,['A','B'])
df
To grab everything under G1
df.loc['G1']
Try to explain which value we grab with the following code:
df.loc['G2'].loc[2]['B']
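The chained lookup can also be written as a single .loc call with a tuple of labels, one per index level. The sketch below rebuilds a frame of the same shape (the random values will differ from the ones shown above) and compares the two forms; the level names 'Group' and 'Num' are hypothetical and are added only so that xs can refer to a level by name.

```python
import numpy as np
import pandas as pd

np.random.seed(1011)
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hi_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
df = pd.DataFrame(np.random.randn(6, 2), hi_index, ['A', 'B'])

chained = df.loc['G2'].loc[2]['B']   # step by step: group G2, inner label 2, column B
direct = df.loc[('G2', 2), 'B']      # the same cell in one lookup

df.index.names = ['Group', 'Num']    # hypothetical level names
by_level = df.xs(2, level='Num')     # inner label 2 from every group

print(chained == direct)
print(by_level.index.tolist())
```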
3. Read CSV file
A CSV file contains plain text in a well-known format that can be read by almost any tool, including pandas.
df = pd.read_csv('/kaggle/input/pandas/data_set.csv')
print(df.to_string())
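If you don't have the data set at hand, read_csv can also read from any file-like object. The sketch below feeds it a small made-up CSV string through io.StringIO; the column names and values are invented for illustration. Since to_string() prints every row, head() is often more convenient for a first look at a large file.

```python
import io
import pandas as pd

# made-up CSV text standing in for a file on disk
csv_text = "name,height,weight\nAna,160,55\nBen,175,72\nCy,182,80\n"

df = pd.read_csv(io.StringIO(csv_text))  # the first line is used as the header

print(df.head(2))   # first two rows only
print(df.shape)     # (rows, columns)
print(df.dtypes)    # column types inferred by pandas
```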
4. Correlations
The relationship between each pair of numeric columns in your data set can be calculated with the corr() method. Here is the correlation matrix for the columns of our data:
df.corr()
Correlation values range from -1 to 1. A negative value indicates a negative relationship: as one variable increases, the other tends to decrease. A positive value indicates a positive relationship: as one variable increases, the other tends to increase too. A value of 1 indicates a perfect positive linear relationship, -1 a perfect negative one, and 0 no linear relationship.
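The two extremes are easy to reproduce with made-up columns. In the sketch below, double_x rises exactly with x and minus_x falls exactly as x rises. It assumes pandas 1.5 or newer, where corr accepts numeric_only; that argument skips the text column, which would otherwise raise an error in pandas 2.x.

```python
import pandas as pd

# invented data: two columns that are exact linear functions of x, plus a text column
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'double_x': [2, 4, 6, 8, 10],
    'minus_x': [-1, -2, -3, -4, -5],
    'label': ['a', 'b', 'c', 'd', 'e'],
})

corr = df.corr(numeric_only=True)  # Pearson correlation by default
print(corr.round(2))
```

The entry for x and double_x should be 1, the entry for x and minus_x should be -1, and every column has correlation 1 with itself.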
You can practice more examples on your own. The notebook link is given below; go to the link and practice.
Notebook Link: [https://www.kaggle.com/code/azizaafrin/powerful-pandas-part-1]
Happy Learning!❤️
Aziza Afrin