DEV Community

Itsdru

Introduction to Python for Data Science

In attempting a simple write-up on this topic, I hope I don't lose myself, or you, along the way. Let us see what the data has to say, because data is like people: "if you torture data long enough, it will confess to anything". All in all, in the famous words of Socrates, "I neither know nor think I know".

What is Python?

Python is a high-level programming language that has become a go-to language for many data scientists. It was created by Guido van Rossum, and first released in 1991. Some of the reasons why it is loved by many include:

  1. Easy to Learn - As a high-level language, Python hides many low-level programming details, and its simple, readable syntax makes it easy to pick up even for new programmers.
  2. Flexible and Versatile - Python is very flexible and is used for a wide range of applications including web development, data analysis, scientific computing, machine learning, web scraping, etc.
  3. Large & Active Community - Many learning resources are available, along with many libraries and tools for data science, most of which are maintained by this community.
  4. Open Source - Python is free to use and modify, which makes it an attractive choice for many.
  5. Portable - Python can run on a variety of platforms, making it a great choice for cross-platform development.
  6. Interpreted Language - Python code is run directly by the interpreter without a separate compile step, making it ideal for rapid development. It is also easy to write and debug code in Python.

What is Data Science?

Data Science is a field that involves using mathematical and computational techniques to extract insights and knowledge from data for operational and research benefits. Applied in different fields, Data Science tries to understand the past, present and future of an entity using data. The insights gained can be used to inform decision-making, identify new opportunities, optimize operational performance, and more. It involves using various techniques and tools to:

  1. Gather Data - Collect data from a source(s).
  2. Wrangle Data - Prepare data for analysis by ensuring it is in a suitable format. This involves cleaning the data to remove errors, inconsistencies, or missing values that could affect the analysis; merging and consolidating data into a single useful dataset; and creating new variables derived from the data that may provide additional insights.
  3. Analyze Data - Explore the data to identify meaningful patterns, trends, relationships between variables and factors driving a particular outcome. Valuable insights found are then used to inform decision-making or solve specific problems.
  4. Visualize Data - Create charts and visualizations to communicate findings effectively and also make sense of large and complex datasets.
  5. Machine Learning - Use artificial intelligence techniques to teach computers to learn from data so that they can improve their performance on a specific task.
  6. Insights and Recommendations - Using the findings of the analysis to provide insights and recommendations.
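
To make these steps a little more concrete, here is a minimal sketch of gathering, wrangling and analyzing a tiny dataset using only Python's standard library. The city names and temperature readings are made up purely for illustration, and a real project would more likely use a library such as Pandas.

```python
import csv
import io
import statistics

# 1. Gather: a made-up CSV source (hypothetical data, not from a real dataset).
raw_csv = """city,temperature_c
Nairobi,24.5
Mombasa,
Kisumu,27.0
Nakuru,21.5
"""
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 2. Wrangle: drop rows with a missing reading and convert text to numbers.
clean = [
    {"city": r["city"], "temperature_c": float(r["temperature_c"])}
    for r in rows
    if r["temperature_c"]
]

# Still wrangling: create a new variable derived from the existing data.
for r in clean:
    r["temperature_f"] = r["temperature_c"] * 9 / 5 + 32

# 3. Analyze: summarize the cleaned data.
mean_c = statistics.mean(r["temperature_c"] for r in clean)
warmest = max(clean, key=lambda r: r["temperature_c"])

# 6. Insights: report the finding in plain language.
print(f"{len(clean)} usable rows, mean {mean_c:.1f} C, warmest: {warmest['city']}")
```

The row with the missing reading is dropped during wrangling, so only three rows reach the analysis step.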

Data Science at Work

Netflix, one of the most popular streaming platforms, featuring hundreds of movies, animations, and TV shows, uses Data Science to build a recommendation system that suggests movies or TV shows a user may like based on their viewing history. According to Recoai, Netflix started using data science and analytical tools in 2000 to recommend videos for users to rent. Furthermore, Netflix Research offers a closer look at this system; according to Netflix, at least 80% of Netflix users play movies or shows based on its recommendation system.

All in all we can see from this example how Data Science impacts business and operations.

Getting Started with Python

Warm Up

Before starting on the Python journey, it is important to note that the snake's script failed because it had a syntax error in its Hiss-togram.

Installation

Download the latest version of Python for your operating system from here. You can also read more about setting up a Python development environment with tools such as VS Code, PyCharm, Jupyter Notebooks, etc.

Hello World

It is a rite of passage in computer science to always change the world by writing the “Hello World” program. Despite its simplicity, it ensures that the development environment is set up correctly.
print('Hello, World!')
Hello, World!

print() is a built-in function that displays whatever is passed inside its parentheses on the screen. Most other languages have an equivalent function under a different name.
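
As a small sketch of how flexible print() is, the snippet below passes it several values and uses its optional sep argument. The variable names are made up for illustration.

```python
language = "Python"
year = 1991

# print() accepts any number of values and joins them with a space by default.
print("Hello,", language)          # Hello, Python

# The optional sep argument changes the separator between values.
print(language, year, sep=" - ")   # Python - 1991

# An f-string builds the message first, then print() displays it.
message = f"{language} was first released in {year}."
print(message)                     # Python was first released in 1991.
```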

Python Program

Python programs have the extension .py. You can run one from the Command Prompt or Terminal, depending on your operating system. Navigate to the folder containing the program in the Terminal, then type the Python command followed by the name of the .py file (on Windows the command is often python or py), for example:

python3 app.py

Executing Python App through Command Line

The other way is to run the code directly from the coding environment, for example VS Code provides a way to run your Python code at the click of a button.

VSCode Run Code options

Basics of Python

To get started with Python for Data Science, one must learn and understand the basics of the language, libraries and data structures. To write more efficient and effective code, it is important to understand the following different Python concepts:

A. Syntax

Python follows specific rules and conventions for writing valid Python code. Basic elements include:

  1. Whitespace and Indentation - Python uses whitespace and indentation to organize the structure of a program. Have a look at the code below and pay close attention to how it is indented.
# A function to calculate the area of a circle based on its radius
def calculate_circle_area(radius):
   pi = 3.14159  # approximate value of pi
   area = pi * radius**2
   return area

area = calculate_circle_area(5)
print(area)

  2. Comments - These are pieces of text within the code that describe it. They are ignored by the interpreter during execution and are only visible in the source code. They begin with a hash (#) symbol and are usually one-liners. In the code above, the first line is a comment.

  3. Identifiers - These are names that identify variables, functions, modules, classes, and other objects in Python. You cannot use a keyword as an identifier.

  4. Keywords - These are words in Python that have special meanings. The list changes as the language evolves (for example, async and await became keywords in Python 3.7): False, class, finally, is, return, None, continue, for, lambda, try, True, def, from, nonlocal, while, and, del, global, not, with, as, elif, if, or, yield, assert, else, import, pass, break, except, in, raise.

  5. String Literals - To denote a string literal, Python uses single quotes (' '), double quotes (" ") and triple quotes (''' ''' or """ """), which allow a string to span multiple lines.
    s = 'This is a string'

    s = "Another string using double quotes"

    s = ''' string can span
        multiple lines '''
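
Since the keyword list differs between Python versions, one way to check it for your own interpreter is the standard keyword module. The sketch below also combines it with str.isidentifier() to test whether a name is usable; the helper is_valid_name is my own illustrative function, not something built into Python.

```python
import keyword

# keyword.kwlist holds the reserved words of the interpreter running this code.
print(len(keyword.kwlist))

# iskeyword() tells you whether a word is reserved.
print(keyword.iskeyword("class"))   # True
print(keyword.iskeyword("radius"))  # False


def is_valid_name(name):
    """Hypothetical helper: True if name can be used as an identifier."""
    # isidentifier() checks the naming rules (letters, digits, underscores,
    # no leading digit) but not keywords, so both checks are combined.
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_name("circle_area"))  # True
print(is_valid_name("2nd_value"))    # False
print(is_valid_name("for"))          # False
```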

B. Variables

Variables are an essential part of programming in Python and other programming languages. In Python, a variable is a name that refers to a value. You can think of it as a labeled container that stores a value. Variables can hold different types of values, such as:

  1. Integers - Holds whole numbers without decimal places, can either be positive or negative.
    age = 25

  2. Float - Holds decimal numbers
    price = 9.99

  3. Boolean - Holds either True or False values.
    is_student = True

  4. String - Holds a sequence of characters enclosed in quotation marks.
    name = "Alice"

Because Python is dynamically typed, you don't need to declare a variable's data type before assigning a value to it.
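
A quick way to see dynamic typing in action is to reassign the same name to values of different types and ask Python what it currently holds. This is only a small sketch; the name value and the list seen_types are made up for illustration.

```python
seen_types = []

value = 25            # an integer
seen_types.append(type(value).__name__)

value = 9.99          # the same name now refers to a float
seen_types.append(type(value).__name__)

value = "Alice"       # ... then a string
seen_types.append(type(value).__name__)

value = True          # ... and finally a boolean
seen_types.append(type(value).__name__)

print(seen_types)     # ['int', 'float', 'str', 'bool']
```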

C. Arithmetic Operators

These operators are used to perform mathematical operations on numeric values. The most common include:
1) + : Adds two numbers
2) - : Subtracts one number from another
3) * : Multiplies two numbers together
4) / : Divides one number by another
5) % : Returns the remainder when one number is divided by another
6) ** : Raises one number to the power of another
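
Here is a short sketch that applies each of these operators to the same pair of numbers. The values 7 and 2 are arbitrary choices for illustration.

```python
a = 7
b = 2

total = a + b        # 9
difference = a - b   # 5
product = a * b      # 14
quotient = a / b     # 3.5 (in Python 3, / always returns a float)
remainder = a % b    # 1
power = a ** b       # 49

print(total, difference, product, quotient, remainder, power)
```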

D. Comparison, Logical and Conditional Operators

These operators are used to make decisions based on meeting or failing set conditions.
Comparison - these compare two values and return a “True” or “False” boolean result. Comparison operators include:
1) == : equal to
2) != : not equal to
3) < : less than
4) > : greater than
5) <= : less than or equal to
6) >= : greater than or equal to

Logical - These operators combine two or more boolean expressions and return a boolean result. Logical operators include:
1) and : returns True where both expressions are True
2) or : returns True if at least one expression is True
3) not : returns True if the expression is False and vice versa

Conditional - Strictly speaking these are statements rather than operators. They are used to make decisions based on whether the result of a comparison or logical operation is True. Conditional statements include:
1) if : executes a block of code if the condition is True
2) elif : checks another condition when the previous ones were False, so you can handle multiple cases
3) else : executes a block of code if none of the conditions are True

# Example values so the snippet runs on its own
x = 10
y = 5
if x > y:
    print("x is greater than y")
else:
    print("y is greater than or equal to x")

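To show comparison operators, logical operators and an if/elif/else chain working together, here is a minimal sketch. The function classify and its grading rules are hypothetical, invented only for this example.

```python
def classify(score, attended):
    """Return a label for a score (hypothetical rules for illustration)."""
    if score < 0 or score > 100:          # 'or': either condition is enough
        return "invalid score"
    elif not attended:                    # 'not': flips True/False
        return "absent"
    elif score >= 70:
        return "distinction"
    elif score >= 50 and score < 70:      # 'and': both conditions must hold
        return "pass"
    else:
        return "fail"


print(classify(85, True))    # distinction
print(classify(55, True))    # pass
print(classify(30, True))    # fail
print(classify(90, False))   # absent
```

Python checks the conditions from top to bottom and runs only the first block whose condition is True.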

E. Loops and Iterables

Loops and Iterables are essential for iterating over collections of data or performing a set of instructions repeatedly.
1) Loops - There are two types of loops in Python, for and while.
For loops loop through an iterable object and perform the same action for each entry.

# Define a list
name_list = ['Kenya', 'Morocco', 'Rwanda', 'Ethiopia', 'Somalia']


# Loop over each name in the list
for name in name_list:
    print(name)

Kenya
Morocco
Rwanda
Ethiopia
Somalia

Note that for loops can iterate over a sequence of numbers using the range function. (xrange existed only in Python 2; in Python 3, range behaves the way xrange used to.)

# Print out the numbers stated in range
for x in range(3):
    print(x)

0
1
2
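
range can also take a start, a stop and a step. The sketch below converts a few ranges to lists so the numbers are visible, and uses the built-in enumerate to number the items of a list; the list countries is a shortened copy of the earlier example.

```python
first_three = list(range(3))         # stop only
two_to_five = list(range(2, 6))      # start and stop (stop is excluded)
every_third = list(range(0, 10, 3))  # start, stop and step
countdown = list(range(5, 0, -2))    # a negative step counts down

print(first_three)   # [0, 1, 2]
print(two_to_five)   # [2, 3, 4, 5]
print(every_third)   # [0, 3, 6, 9]
print(countdown)     # [5, 3, 1]

# enumerate() pairs each item with a counter, here starting from 1.
countries = ['Kenya', 'Morocco', 'Rwanda']
numbered = []
for index, name in enumerate(countries, start=1):
    numbered.append(f"{index}. {name}")

print(numbered)      # ['1. Kenya', '2. Morocco', '3. Rwanda']
```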

While - This type of loop executes a block of statements repeatedly for as long as a given condition remains True.

count = 0
while count < 3:
    print(count)
    count += 1

0
1
2

"break" and "continue" statements - break exits a for loop or a while loop immediately, whereas continue skips the rest of the current iteration and returns to the "for" or "while" statement.

# Use of break statement
count = 0
while True:
    print(count)
    count += 1
    if count >= 3:
        break

0
1
2

# Use of continue statement
# Prints out only odd numbers
for x in range(5):
    # Check if x is even
    if x % 2 == 0:
        continue
    print(x)

1
3

2) Iterables - Any object that can return its elements one at a time. The ‘range()’ function is a built-in function that returns an iterable sequence of numbers.

# define a list
my_list = [5, 2, -4, 6, 8]

# create an iterator from the list
iterator = iter(my_list)

# get the first element of the iterator
print(next(iterator))  # prints 5

# get the second element of the iterator
print(next(iterator))  # prints 2

# get the third element of the iterator
print(next(iterator))  # prints -4

5
2
-4
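
Roughly speaking, a for loop does the same thing behind the scenes: it asks for an iterator and keeps calling next() until the iterator signals that it is empty by raising StopIteration. The sketch below spells that out by hand for a short list; it is a simplified picture, not the interpreter's actual implementation.

```python
my_list = [5, 2, -4]
iterator = iter(my_list)

collected = []
while True:
    try:
        item = next(iterator)   # ask for the next element
    except StopIteration:       # raised when nothing is left
        break
    collected.append(item)

print(collected)                # [5, 2, -4]

# next() accepts a default that is returned instead of raising StopIteration.
leftover = next(iterator, None)
print(leftover)                 # None
```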

F. Basic Data Structures

Python offers several built-in data structures that are used to store and organize data. These data structures are essential for organizing and manipulating data.

  1. Lists - These are ordered collections of values, which can be of any data type. Lists are mutable, meaning that you can add, remove or change their values after creation. They are similar to arrays in other languages. Surrounded by [].
    my_list = [1, 2, 3, "four", 5.5]

  2. Dictionaries - A collection of key-value pairs, where each key is associated with a value. Like lists, dictionaries are mutable. Surrounded by {}.
    my_dict = {"name": "Alice", "age": 25, "is_student": True}

  3. Tuples - Tuples are like lists, but they are immutable: their values cannot be changed after creation. Surrounded by ().
    mytuple = ("apple", "banana", "cherry")

  4. Sets - An unordered collection of unique values, which can be of any hashable type (for example numbers, strings or tuples). Sets are mutable and are surrounded by {}.
    thisset = {"apple", "banana", "cherry"}
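
The difference between mutable and immutable structures is easiest to see by trying to change them. Here is a small sketch; the values, including the city, are made up for illustration.

```python
# Lists are mutable: items can be added and changed.
my_list = [1, 2, 3]
my_list.append(4)
my_list[0] = 10
print(my_list)                      # [10, 2, 3, 4]

# Dictionaries are mutable: values can be updated and new keys added.
my_dict = {"name": "Alice", "age": 25}
my_dict["age"] = 26
my_dict["city"] = "Nairobi"         # hypothetical extra key
print(my_dict["age"], my_dict["city"])

# Tuples are immutable: trying to change an item raises a TypeError.
my_tuple = ("apple", "banana", "cherry")
tuple_changed = True
try:
    my_tuple[0] = "mango"
except TypeError:
    tuple_changed = False
print(tuple_changed)                # False

# Sets keep only unique values: the duplicate "apple" is stored once.
my_set = {"apple", "banana", "apple"}
my_set.add("cherry")
print(sorted(my_set))               # ['apple', 'banana', 'cherry']
```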

G. Methods

Python is an object-oriented language, which means it can use classes and objects to model the real world. In simple terms, a method is a function that belongs to an object: a piece of code that you call on the object to act on it or to return information about it.

# Create a string object
my_string = "Hakuna Matata Kenya!"

# Call the count method on the object to count the letter 'a'
a_count = my_string.count('a')

print(a_count)

6
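
You can also define methods on your own classes. The Circle class below is a hypothetical example written just for this post; it reuses the approximate value of pi from the earlier circle-area function.

```python
class Circle:
    """A hypothetical class used only to illustrate methods."""

    def __init__(self, radius):
        # Runs when the object is created and stores its data.
        self.radius = radius

    def area(self):
        # A method that returns information about the object.
        return 3.14159 * self.radius ** 2

    def scale(self, factor):
        # A method that changes the object itself.
        self.radius *= factor


circle = Circle(5)
original_area = circle.area()
print(original_area)

circle.scale(2)
print(circle.radius)   # 10
```

Inside a method, self refers to the object the method was called on, which is how area() knows which radius to use.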

H. Regular Expressions

Regular expressions are special sequences of characters that describe a search pattern, used to match or find text within strings.

# Import re library
import re

# Define a regular expression pattern
pattern = r'\d+'

# Define a test string
text = "I have 3 cats and 2 dogs."

# Search for the pattern in the text string
matches = re.findall(pattern, text)

# Print the matches
print(matches)

['3', '2']
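
Besides findall, the re module offers other functions worth knowing. The sketch below uses the same sentence with re.search, capture groups and re.sub; the patterns are my own simple examples.

```python
import re

text = "I have 3 cats and 2 dogs."

# re.search returns the first match (or None if nothing matches).
first = re.search(r'\d+', text)
print(first.group())        # 3

# Parentheses create groups, so findall returns (number, word) pairs.
pairs = re.findall(r'(\d+) (\w+)', text)
print(pairs)                # [('3', 'cats'), ('2', 'dogs')]

# The captured numbers are strings, so convert them before adding.
total_pets = sum(int(number) for number, _ in pairs)
print(total_pets)           # 5

# re.sub replaces every match with the given text.
masked = re.sub(r'\d+', '#', text)
print(masked)               # I have # cats and # dogs.
```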

I. File Input/Output

Reading data from and writing data to files is an essential part of many programs. Here are the basics of file Input/Output in Python:

  1. Opening a file - To open a file, you use the built-in 'open()' function, which commonly takes two arguments: the path of the file to be opened and the mode to open it in (e.g. read-only, write-only or read-write). For example, to open a file in read mode, you would use the following code:
    file = open("filename.txt", "r")

  2. Reading from a file - Once a file is open, you can read from it using the 'read()' or 'readline()' methods. 'read()' returns the entire contents of the file as a single string, while 'readline()' reads one line at a time.
    content = file.read()

  3. Writing to a file - To write data to a file, you can use the 'write()' method. The file must have been opened in a mode that allows writing, such as "w" (overwrite) or "a" (append), rather than "r".
    file.write("Hello, world!")

  4. Closing a file
    When we are done performing operations on a file, it is important to close it properly with the 'close()' method to avoid data corruption and other issues. Alternatively, the 'with' statement (a context manager) closes the file automatically when its block ends.

with open("filename.txt", "r") as file:
    content = file.read()
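
Putting the pieces together, here is a minimal sketch that writes, appends to and reads back a file. It works inside a temporary folder so it does not touch any real files; the file name example.txt is just a placeholder.

```python
import os
import tempfile

# A temporary folder that is deleted automatically at the end of the block.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    # "w" creates the file (or overwrites it) for writing.
    with open(path, "w") as file:
        file.write("first line\n")
        file.write("second line\n")

    # "a" appends to the end instead of overwriting.
    with open(path, "a") as file:
        file.write("third line\n")

    # readline() returns one line; read() returns whatever is left.
    with open(path, "r") as file:
        first = file.readline()
        rest = file.read()

    # A file object is also iterable, one line at a time.
    with open(path, "r") as file:
        lines = [line.strip() for line in file]

print(first.strip())   # first line
print(lines)           # ['first line', 'second line', 'third line']
```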

J. Python Libraries

In the literal sense, a library is a collection of books or a place where many books are stored, and the same concept carries over to programming. Libraries are collections of modules containing functions and classes that other programs can use to perform tasks without writing the functionality from scratch.
There are a bunch of these libraries, including:

  1. Pandas - Pandas is a go-to library for structured data manipulation and analysis. It provides tools for reading, cleaning, and transforming data from a variety of sources.
  2. NumPy - A fundamental package for working with large arrays and matrices of numerical data. Suitable for scientific computing.
  3. Matplotlib - Primarily used for data visualization. Provides tools for plotting charts: line, bar, scatter plots and many more.
  4. Seaborn - A statistical data visualization package based on Matplotlib.
  5. SciPy - Designed for scientific and technical computing; it builds on NumPy and is used alongside it.
  6. Scikit-learn - A machine learning library for both supervised and unsupervised learning processes.
  7. BeautifulSoup - Used for web scraping; it provides tools for parsing and extracting data from HTML and XML documents.
  8. Scrapy - Another web-scraping library.
  9. TensorFlow - It can be used for a wide range of tasks but has a particular focus on training and inference on deep neural networks.
  10. Keras - A high-level neural networks API. Provides a simple interface for building and training neural networks.
  11. Django - Django is a high-level web framework for building web applications in Python. It provides tools for database integration, template rendering, user authentication, and more.
  12. Flask - Flask is a micro web framework for building web applications in Python. It provides tools for routing and template rendering, with database integration available through extensions.

K. Data Visualization

Python has many libraries that help in creating visual representations of data. They include Matplotlib, Seaborn, Plotly, etc.

# Import matplotlib library
import matplotlib.pyplot as plt

# Create two lists
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Call plot function
plt.plot(x, y)

# Edit chart: label the axes and add a title
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Line chart')

plt.show()

Visualising Data using Matplotlib

L. Machine Learning

Machine Learning is all about making a computer learn by studying data and statistics. Using data, a computer is then able to make predictions or decisions based on the model it has trained. There are a bunch of Python libraries that make this possible, including Scikit-learn, TensorFlow, PyTorch, etc.

# Build a linear regression model
# Import LinearRegression class from sklearn library
from sklearn.linear_model import LinearRegression

# Define some sample data
X = [[1], [2], [3], [4], [5]]
y = [2, 4, 5, 4, 5]

# Create a LinearRegression object
regressor = LinearRegression()

# Fit the model with the sample data
regressor.fit(X, y)

# Use the model to make a prediction
prediction = regressor.predict([[6]])

# Print the prediction
print(prediction)

[5.8]
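
To see where a prediction of about 5.8 comes from, here is a minimal sketch that fits the same straight line by hand using the ordinary least squares formulas, with no libraries. It illustrates the idea only and is not necessarily how scikit-learn computes the result internally; the variable names are my own.

```python
# The same sample data as above, written as plain lists.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
denominator = sum((x - mean_x) ** 2 for x in xs)
slope = numerator / denominator

# The fitted line passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

# Predict y for x = 6 using y = intercept + slope * x.
prediction = intercept + slope * 6

print(round(slope, 2), round(intercept, 2), round(prediction, 2))  # 0.6 2.2 5.8
```

The line that best fits the five points is roughly y = 2.2 + 0.6x, so plugging in x = 6 gives about 5.8.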

Go Forth & Make Greatness

With Python, you have the power to turn raw data into actionable insights that can drive decision-making and create value. You can unlock a world of possibilities and opportunities, from machine learning to artificial intelligence and beyond.

Whether you’re just starting out or already an experienced data scientist, the limitless potential of Python and Data Science awaits you. So go forth and make greatness!

Exploring the Possibilities: Let's Collaborate on Your Next Data Venture! You can check me out at this Link
