
Kaira Kelvin.

Pandas notes

Once you know how to read a CSV file from local storage into memory, reading data from other sources is a breeze.

To read from a specific webpage, pass the URL to pd.read_csv() instead of a local file path:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

pd.concat()

It allows us to join two or more data frames along either rows or columns.
Sometimes the input data frames have generic indexes that overlap, like row numbers in a spreadsheet. For that case, concat has an optional parameter called ignore_index.
By specifying axis=1 in the concat statement, we override the default behavior and join along the columns instead.

pd.concat([df1, df2, ...], axis=1)

pd.concat([df1, df2, ...], ignore_index=True)
joined_df = left_df.merge(right_df)
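A minimal sketch of both operations on two small, made-up frames:

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})

stacked = pd.concat([df1, df2], ignore_index=True)  # rows of df2 appended below df1, index renumbered
side_by_side = pd.concat([df1, df2], axis=1)        # columns placed next to each other
joined = df1.merge(df2, on='key')                   # SQL-style join on the shared 'key' column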

Before creating line plots with Matplotlib, first set up the environment, which includes installing Matplotlib. To install it, use pip, the package installer for Python.
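A minimal setup sketch:

# install once from the command line:
#   pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt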

A. Plotting a Line Graph in Pandas.

This is one of the most used visualizations; line plots are excellent at tracking the evolution of a variable over time.

When plotting a line graph in pandas, here's a sample of the code:

publications_per_year = df['year of publication'].value_counts().sort_index()
publications_per_year.plot(kind='line', marker='o', linestyle='-', color='b', figsize=(10, 6))
plt.xlabel('year of publication')
plt.ylabel('number of publications')
plt.title('A line graph showing number of publications against year')
plt.grid(True)
plt.show()


**color='b'** - the line of the graph takes the color blue; you can use different colors when plotting different line graphs. To create a line plot with Matplotlib directly, use the plt.plot() function, which draws a blue line by default.

plt.plot(dates, closing_price, color='red')

**alpha=0.5** - the alpha parameter controls the transparency of the color, from 0 (completely transparent) to 1 (completely opaque). Setting it to a value less than 1 makes the color more transparent.
**linewidth** - change the line width by passing a linewidth parameter to the plt.plot() function. The linewidth parameter takes a floating-point value representing the line's width.
plt.plot(dates, closing_price, linewidth=3)
The marker parameter in Matplotlib determines the style of marker used to highlight data points on the line.
Specifically, marker='o' specifies that a circular marker will be used.
Below are more examples of line styles and markers, which you can combine to achieve the desired visual effect in your plots; see the sketch below.

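A small sketch of a few line style and marker combinations (the data is made up for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

plt.plot(x, y, linestyle='-', marker='o', label='solid + circles')
plt.plot(x, [v + 2 for v in y], linestyle='--', marker='s', label='dashed + squares')
plt.plot(x, [v + 4 for v in y], linestyle=':', marker='^', label='dotted + triangles')
plt.legend()
plt.show()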
Grid lines
We can also add grid lines to our plot to make it more readable. We achieve this by using the plt.grid() function, which takes a boolean value representing whether the grid should be shown.
plt.grid(True)

B. Bar Plots.

A bar chart ranks data according to the value of multiple categories. It consists of rectangles whose lengths are proportional to the value of each category. They are prevalent since they are easy to read.
Making bar plots instead of line plots is as simple as passing kind='bar' (for vertical bars) or kind='barh' (for horizontal bars).
Stacked bar plots are created from a DataFrame by passing stacked=True.

df.plot(kind='barh', stacked=True, alpha=0.5)
A useful recipe for bar plots is to visualize a Series’s value frequency using value_counts: s.value_counts().plot(kind='bar')
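A minimal sketch of both recipes on made-up data:

import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(['apple', 'banana', 'apple', 'cherry', 'banana', 'apple'])
s.value_counts().plot(kind='bar')   # frequency of each category as vertical bars
plt.show()

scores = pd.DataFrame({'round1': [3, 5], 'round2': [4, 2]}, index=['team A', 'team B'])
scores.plot(kind='barh', stacked=True, alpha=0.5)   # stacked horizontal bars
plt.show()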

C. Histograms and Density Plots
A histogram, with which you may be well acquainted, is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted.
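A minimal sketch, assuming matplotlib.pyplot is imported as plt and df has a numeric 'Calories' column (both are assumptions):

df['Calories'].plot(kind='hist', bins=20)   # histogram with 20 bins
plt.show()

df['Calories'].plot(kind='kde')             # density (KDE) plot of the same column
plt.show()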

D. Plotting a Pie Chart.

A pie chart is a circular statistical graphic that is divided into slices also called (wedges) to illustrate numerical proportions. The area of the chart is the total percentage of the given data.
Syntax

matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None, autopct=None, shadow=False)
e.g.

plt.pie(chart, labels=chart.index, autopct='%1.1f%%', startangle=90)

Explode

Maybe you want one of the wedges to stand out? The explode parameter allows you to do that. If it is specified and not None, it must be an array with one value for each wedge.

e.g. explode=[0.2, 0, 0, 0] will pull the first wedge 0.2 out from the center of the pie.

Shadow
Add a shadow to the pie chart by setting the shadow parameter to True: shadow=True.

Legend
To add a list of explanations for each wedge, use the plt.legend() function. You can add a title to the legend by passing a title argument:
plt.legend(title="Vict Sex")

Autopct.
It is used to format and display the percentage labels on each wedge of a pie chart. It allows you to automatically calculate and format the percentage values based on the sizes of the wedges.

  • '%1.1f%%' - Displays the percentage with one digit before the decimal point and one digit after the decimal point, followed by the percentage symbol, e.g. "25.5%"

  • '%.2f%%' Displays the percentage with two digits after the decimal point for example 43.56%, 47.99%

  • '%.0f%%' - Displays the percentage rounded to the nearest whole number, with no decimal places, e.g. 44%


plt.setp(autotexts, size=10, weight="bold")



fig, ax = plt.subplots(figsize=(6, 8))
explode = [0.0, 0.0, 0.1]
wp = {'linewidth': 1, 'linestyle': '-', 'edgecolor': 'black'}
colors = ("orange", "cyan", "indigo")
chart = Dinosaurs['diet'].value_counts()

To create a pie chart with labels, autopct formatting, wedge properties, and explode:

wedges, texts, autotexts = ax.pie(chart,
                                  labels=chart.index,
                                  autopct='%1.1f%%',
                                  startangle=140,
                                  colors=colors,
                                  explode=explode,
                                  wedgeprops=wp)
ax.legend(wedges, chart.index.tolist(),
          title="types of dinosaurs diet",
          bbox_to_anchor=(0.1, 0.5),
          loc="best")
plt.setp(autotexts, size=12, weight="bold")
plt.show()

The bbox_to_anchor parameter takes a tuple of two values (x, y), where x and y are the coordinates in the figure's normalized coordinate system.
The normalized coordinate system ranges from 0 to 1, where (0, 0) is the bottom-left corner and (1, 1) is the top-right corner of the figure.
Here's a breakdown of the bbox_to_anchor parameter:

(0, 0) corresponds to the bottom-left corner of the figure.
(1, 1) corresponds to the top-right corner of the figure.
(0.5, 0.5) corresponds to the center of the figure.
(Figure: a chart showing possible legend positions within a figure.)

Magic commands start with either % or %%, and the command we need to nicely display plots inline is %matplotlib inline. With this magic in place, all plots created in code cells will automatically be displayed inline.
In newer versions of Jupyter Notebook, %matplotlib inline is not strictly necessary; plots will often be displayed automatically.

The Box-plot.
A box plot is a method for graphically depicting groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data.
The position of the whiskers is set by default to 1.5*IQR (IQR = Q3 - Q1) from the edges of the box. Outlier points are those past the end of the whiskers.

In addition, the box plot allows one to visually estimate various L-estimators notably the interquartile range, midhinge, range, midrange, and trimean. Box plots can be drawn either horizontally or vertically.
Boxplot - it displays the dataset based on the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

  • Minimum (Q0 or 0th percentile): the lowest data point in the data set excluding any outliers

  • Maximum (Q4 or 100th percentile): the highest data point in the data set excluding any outliers

  • Median (Q2 or 50th percentile): the middle value in the data set, where 50% of the data is less than the median and 50% of the data is higher than the median.

  • First quartile (Q1 or 25th percentile): also known as the lower quartile qn(0.25), it is the median of the lower half of the dataset.

  • Third quartile (Q3 or 75th percentile): also known as the upper quartile qn(0.75), it is the median of the upper half of the dataset.

  • Interquartile range (IQR): the distance between the upper and lower quartiles. IQR = Q3 - Q1 = qn(0.75) - qn(0.25); the upper quartile minus the lower quartile.

Outliers- Any values above the "maximum" or below the "minimum".
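A minimal sketch of drawing a box plot with pandas (the DataFrame and column names are assumptions, and plt is assumed imported):

df_food['Calories'].plot(kind='box')        # box plot of a single numeric column
plt.show()

df_food[['Calories', 'Water']].boxplot()    # several columns side by side
plt.show()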

How to read or extract columns from a dataset.

To display or extract columns from a DataFrame in pandas, use these ways:

  • Basic ways.
  • Using loc[]
  • Using iloc[]
  • Using .ix (deprecated and removed in recent pandas versions)

Basic ways


Cal_Wat=df_food[['Calories','Water']]

Using column positions to select the columns, for example:


df_food[df_food.columns[1:4]]

Pandas is a very useful library for manipulating tabular data and is widely used in the field of machine learning. It comprises many methods for its proper functioning; **loc()** and **iloc()** are two of those methods.

The loc() function.

The loc() function is a label-based data-selecting method which means that we have to pass the name of the row or column that we want to select.
Operations that can be performed using the loc() method include:

1. Selecting Data According to Some Conditions

The code uses the loc function to select and display rows that satisfy certain conditions, such as novels with a given book price and a rating above 4.6. For example:

display(df.loc[(df['book price'] == 16.99) & (df['rating'] > 4.6)])


display(crime.loc[(crime['Crm Cd Desc'] == 'VEHICLE - STOLEN') & (crime['AREA NAME'] == 'Central')])

remember

  • Use single quotes (') around the column names if the column names contain spaces or special characters.
  • Use & for the logical AND operation.

2. Selecting a Range of Rows From the DataFrame.

The code utilizes the loc function to extract and display rows with indices ranging from 3 to 8 (inclusive) from the DataFrame.


display(df.loc[3:8])

  3. Updating the Value of Any Column. The code uses the loc function to update the values of a column for rows that match a condition, as in the sketch below.
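A minimal sketch, assuming the same hypothetical book DataFrame with 'book price' and 'rating' columns:

# set the price to 12.99 for every book rated above 4.6
df.loc[df['rating'] > 4.6, 'book price'] = 12.99
display(df)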

Pandas DataFrame.isin()

The **isin()** function is a powerful tool for filtering and selecting data within a data frame based on specified conditions.
It allows you to create boolean masks to identify rows where the values in one or more columns match certain criteria.
e.g.


Display = crime["AREA NAME"].isin(['Devonshire']) & crime["Status Desc"].isin(['Adult Other'])
print(Display)

Here is what this does:

  • The code uses the isin function to check if each element in the specified columns is contained in the provided lists.

  • It creates a boolean Series (Display) where each element corresponds to whether both conditions are true for the corresponding row.
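To get the matching rows rather than the boolean mask itself, index the DataFrame with the mask (a sketch using the same crime DataFrame):

matching_rows = crime[Display]
display(matching_rows)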

Python iloc() function.

The iloc() function is an integer-position-based selecting method, which means that we pass integer indices to select specific rows and columns from a DataFrame based on their position.

  1. Selecting Rows Using Integer Indices. The code employs the iloc function to extract and display specific rows with indices 0, 2, 4, and 7 from the DataFrame, showcasing the information about the rows selected.


display(data.iloc[[0, 2, 4, 7]])

  2. Selecting a Range of Columns and Rows Simultaneously. The code utilizes the iloc function to extract and display a subset of the DataFrame, including rows 1 to 4 and columns 2 to 4 (iloc slices exclude the end index, so 1:5 selects positions 1 through 4).


display(data.iloc[1: 5, 2: 5])

df.dtypes - attribute of a DataFrame which returns a Series with the data type of each column.
int64 - represents a 64-bit integer.
object - typically represents strings or mixed types (i.e., columns with different types of data).
float64 - represents a 64-bit floating-point number.

To convert the data type of a column, use the astype() method:

df['column1'] = df['column1'].astype(float)

When converting object to float, you may first need to clean the data, as in the sketch below.
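A minimal sketch of that cleaning step, assuming a hypothetical 'price' column stored as strings like "$1,200.50":

df['price'] = (df['price']
               .str.replace('$', '', regex=False)   # drop the currency symbol
               .str.replace(',', '', regex=False)   # drop thousands separators
               .astype(float))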

**DataFrame.info()** - useful during exploratory analysis, offering a quick and informative overview of the dataset; it helps you gain insights quickly.
**Functions:**

It lists all columns with their data types and the number of non-null values in each column.

df.info(verbose = False)

With verbose=False it prints a short summary of the dataset instead of listing every column.

df.info(verbose=True, null_counts=False)

This lists every column but excludes the non-null counts (in newer pandas versions this parameter is called show_counts).

df.size, df.shape and df.ndim attributes.
Pandas size, shape, and ndim attributes are used to return the size, shape, and dimensions of data frames and series.

data.size - calculates and prints the total number of elements (size) in the DataFrame, which is the product of the number of rows and the number of columns. It includes both non-null and null values in the count.

df.shape - returns the shape of the DataFrame as a tuple: (number of rows, number of columns).

df.ndim - an attribute of a DataFrame or Series representing the number of dimensions of the underlying data, e.g. 2 for a DataFrame (rows and columns) and 1 for a Series.
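A quick sketch with a tiny, made-up frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.size)    # 6  (3 rows x 2 columns)
print(df.shape)   # (3, 2)
print(df.ndim)    # 2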

To get the top values in a DataFrame or Series:

nlargest() - returns a specified number of rows, starting at the top after sorting the DataFrame by the highest values in a specified column.

Syntax: dataframe.nlargest(n, columns, keep)

  • n - Required. A number specifying the number of rows to return.
  • columns - Required. A string (column label) or a list of column labels, specifying the column(s) to order by.
  • keep - Optional, default 'first'. Specifies what to do with duplicate values: 'first', 'last', or 'all'.


df_food.nlargest(n=5,columns='Vitamin C')

Dataframe.isnull() method.
**df.isnull()** - detects missing values in the given dataset. It returns a boolean same-sized object indicating whether the values are NA: missing values get mapped to True and non-missing values get mapped to False.

To get the total of missing values per column.
df_food.isnull().sum() or df_food.isna().sum()



**DataFrame.sum() method** - used to calculate the sum of values in a DataFrame or Series.
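A minimal sketch (the DataFrame and column names are assumptions):

print(df_food['Calories'].sum())          # total of one column
print(df_food.sum(numeric_only=True))     # column-wise totals for all numeric columns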

DataFrame merge() Method.

We can join, merge, and concatenate data frames using different methods. The df.merge(), df.join(), and pd.concat() methods help in joining, merging, and concatenating different DataFrames.
The merge() method combines the contents of two DataFrames by merging them on specified keys.

Merging DataFrame.

Pandas has options for high-performance in-memory merging and joining. When we need to combine very large data frames, joins serve as a powerful way to perform these operations swiftly. Joins can only be done on two data frames at a time, denoted as the left and right tables. The key is the common column on which the join is performed; choosing it carefully avoids unintended duplication of row values. Pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame objects.

Code 1: Merging data frames with one unique key combination:
res = pd.merge(df, df1, on='key')
Code 2: Merging data frames using multiple join keys:
res = pd.merge(df, df1, on=['key', 'key1'])

Merging data frames using the how argument:
The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or the right table, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

  • left - LEFT OUTER JOIN (use keys from the left frame only)
  • right - RIGHT OUTER JOIN (use keys from the right frame only)
  • outer - FULL OUTER JOIN (use the union of keys from both frames)
  • inner - INNER JOIN (use the intersection of keys from both frames; this is the default)

Now we set how = 'left' to use keys from the left frame only.

res = pd.merge(df, df1, how='left', on=['key', 'key1'])

Now we set how = 'right' to use keys from the right frame only.

res = pd.merge(df, df1, how='right', on=['key', 'key1'])

Now we set how = 'outer' to get the union of keys from both data frames.

res = pd.merge(df, df1, how='outer', on=['key', 'key1'])

Joining DataFrame.

To join data frames, we use the .join() function. It is used for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
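A minimal sketch using two small, hypothetical frames joined on their index:

import pandas as pd

left_df = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
right_df = pd.DataFrame({'B': [3, 4]}, index=['x', 'y'])

joined = left_df.join(right_df)   # .join() aligns on the index by default
print(joined)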
