I will be watching the 10-hour course from freeCodeCamp, https://www.youtube.com/watch?v=GPVsHOlRBBI, and I want to record what I learn from it.
Conda
Conda is an open-source package and environment manager, widely used with Python. If you have two Python projects, each requiring a different version of Python and different packages, Conda is for you. Conda can create, save, load, and switch between environments on your local computer, so you can work on your two projects without having to reinstall Python every time you switch projects.
Jupyter Notebook
Jupyter Notebook is a notebook environment where you can write notes and run code. It is useful for data science because code is organized into cells, so you can run individual cells one at a time. The contents of variables are also kept after you run a cell, so you can reuse them repeatedly without having to restart the entire program. Without a Jupyter notebook, every time I made a change I had to rerun the entire program because variables were not stored, which is very inefficient. But from now on I will use a Jupyter notebook and be more efficient.
It also helps me with writing my blog, since its structure is very similar.
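For example, because state persists between cells, something loaded in one cell can be reused in later cells without reloading. A minimal sketch of that workflow (the file name `data.csv` is just a placeholder, not from the course):

```python
# Cell 1: run once; loading may be slow for a big file
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file name

# Cell 2: edit and re-run this cell as often as needed;
# df is still in memory, so nothing has to be reloaded
print(df.shape)
```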
Common keyboard shortcuts in Jupyter Notebook
Keyboard shortcuts are very useful for speeding things up, so I went ahead and found some useful keyboard shortcuts from this article before starting:
- `Shift + Enter`: run the current cell, select below
- `Alt + Enter`: run the current cell, insert below
- `Ctrl + S`: save and checkpoint
There are two modes. The first is command mode, the default when you have just loaded the document and are not editing anything; in it you can use:
- `A`: insert cell above
- `B`: insert cell below
- `D, D` (press the key twice): delete selected cells
- `Z`: undo cell deletion
- `Y`: change the cell type to Code
- `M`: change the cell type to Markdown
- `Enter`: take you into edit mode
Inside edit mode, where you edit code cells:
- `Esc`: take you back to command mode
- `Tab`: code completion or indent
- `Shift + Tab`: tooltip
Numpy
- `np.genfromtxt()` can read a csv file and return a numpy array
- `np.savetxt()` can save a numpy array to a csv file
- commonly used functions include:
  - Mathematics: `np.sum`, `np.exp`, `np.round`, arithmetic operators
  - Array manipulation: `np.reshape`, `np.stack`, `np.concatenate`, `np.split`
  - Linear algebra: `np.matmul`, `np.dot`, `np.transpose`, `np.linalg.eigvals`
  - Statistics: `np.mean`, `np.median`, `np.std`, `np.max`
- numpy supports array broadcasting, which allows arithmetic operations between two arrays with different numbers of dimensions but compatible shapes (see the sketch after this list)
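Here is a minimal sketch of the csv round trip and of broadcasting; the file name `data.csv` and the values are made up for illustration:

```python
import numpy as np

# Save a small array to csv, then read it back
arr = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])
np.savetxt('data.csv', arr, delimiter=',')           # array -> csv
loaded = np.genfromtxt('data.csv', delimiter=',')    # csv -> array

# Broadcasting: shapes (2, 3) and (3,) are compatible,
# so the 1-D array is applied to every row
offsets = np.array([10.0, 20.0, 30.0])
print(loaded + offsets)
# [[11. 22. 33.]
#  [14. 25. 36.]]
```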
Pandas
- pandas' main data type is `DataFrame`, which is like an Excel spreadsheet
- we can use the `.info()` method to view basic information about the dataframe
- we can use the `.describe()` method to see statistical information about the numeric data within the dataframe
- another important data type is `Series`, which is like an array. You can use the `.index` attribute to get the indexes, which is very useful when you want to plot graphs of a series
- we can pass in a list of columns to select a subset of a dataframe, like `reduced_df = df[['column1', 'column3']]`; the resulting dataframe will only have the two columns, but note that it may still share data with the original dataframe, so modifying values here can affect the original as well; use `.copy()` to create an independent dataframe
- to view data in a dataframe, we can use `.head()` to show the first few rows, `.tail()` to show the last few rows, and `.sample()` to show random rows
- we can sort the dataframe by value using `covid_df.sort_values()`, passing in the column name
- we can convert data to dates using the `pd.to_datetime()` function
- we can use the `.groupby()` method, passing in a column name, to group the data; then we can select some columns and use `.sum()` or `.mean()` to calculate the value for the different groupings (see the sketch after this list)
- we can also merge dataframes by calling `.merge()` on a dataframe, passing in the other dataframe and the column to join on
- to write back to csv, we use `.to_csv()`, passing in the file name
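Here is a short sketch stringing a few of these operations together; the file and column names (`covid.csv`, `date`, `new_cases`) are invented for illustration, not taken from the course:

```python
import pandas as pd

covid_df = pd.read_csv('covid.csv')  # hypothetical file

covid_df.info()                      # column types and non-null counts
print(covid_df.describe())           # summary stats for numeric columns

# Select a subset of columns; .copy() makes it independent of covid_df
reduced_df = covid_df[['date', 'new_cases']].copy()

# Parse the date column, then group by month and total the cases
reduced_df['date'] = pd.to_datetime(reduced_df['date'])
monthly = reduced_df.groupby(reduced_df['date'].dt.month)['new_cases'].sum()

# Sort by value, take the top rows, and write them back out
top_days = covid_df.sort_values('new_cases', ascending=False).head()
top_days.to_csv('top_days.csv')

# Merging would look like this, given a second (hypothetical) dataframe:
# merged = covid_df.merge(locations_df, on='location')
```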
Matplotlib and Seaborn
- in a Jupyter notebook, we can use `%matplotlib inline` after importing matplotlib to ensure that our plots are shown and embedded within the notebook itself
- the most basic chart is a line chart, which can be plotted with `plt.plot()`
- we can add labels to the chart with `plt.xlabel()` and `plt.ylabel()`, and a legend with `plt.legend()`
- it seems plots are only rendered at the end of a cell's execution; that's why we can call `plt.plot()`, then change things like labels afterwards, and at the end of the execution the plot is shown with the correct data and labels
- we can even plot multiple lines on the same graph (see the sketch after this list)
- seaborn is a statistical graphics library built on matplotlib, and is commonly imported as `sns`. According to this StackOverflow post, that's because Samuel Norman "Sam" Seaborn is a fictional character portrayed by Rob Lowe on the television serial drama 'The West Wing'; my guess is the creator loved this TV show
- we can plot a scatterplot using `sns.scatterplot()`, passing in x and y, and optionally a hue
- we can plot a histogram with `plt.hist()`, and we can customize the bins, even making them uneven if we want
- we can stack histograms by passing in the `stacked` argument
- we can plot a bar chart with `plt.bar()`
- to stop labels from overlapping, we can tilt them with `plt.xticks(rotation=75)`
- we can plot a heat map using `sns.heatmap()`; a heat map is a good way to visualize 2D data
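A minimal sketch of these plotting calls, with made-up data purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# %matplotlib inline  (uncomment when running in a Jupyter notebook)

# Invented data
days = np.arange(1, 11)
cases = days ** 2
recoveries = days * 3

# Two lines on the same chart, with labels and a legend
plt.plot(days, cases)
plt.plot(days, recoveries)
plt.xlabel('Day')
plt.ylabel('Count')
plt.legend(['Cases', 'Recoveries'])
plt.xticks(rotation=75)  # tilt tick labels so they don't overlap
plt.show()

# A heat map of random 2-D data
sns.heatmap(np.random.rand(5, 5))
plt.show()
```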
Basic principles
- When starting data analysis, it is important to understand the data: what was collected, how it was collected, and how accurate it is.
- Then we want to do some data preparation and cleaning, to prepare the data for analysis and remove unwanted data.
- Then we can do some exploratory analysis and visualization; this is when we don't have any specific topic in mind, but just poke around and see what the data is like.
- Then, at last, we try to answer questions with the data.
Conclusion
I would like to thank the platform jovian.ai for partnering with freecodecamp.org to give us this free course and a platform to learn by doing, where we can use the same Jupyter notebooks the instructor used. I also want to thank the instructor Aakash N S for his teaching and material; he included practical examples so we could better understand the content.
After this course, I feel confident continuing my data analysis project, which can be found on my Hashnode blog.