As a Backend Engineer or Data Scientist, there are times when you need to improve the speed of your program, assuming you have already chosen the right data structures and algorithms. One way to do this is to take advantage of Multithreading or Multiprocessing.
In this post, I won't be going into detail on the inner workings of Multithreading or Multiprocessing. Instead, we will write a small Python script to download images from Unsplash. We will start with a version that downloads images synchronously, one at a time. Next, we will use threading to improve execution speed.
I am sure you are excited to learn this...
Multithreading
In a nutshell, threading allows you to run your program concurrently. Tasks that spend much of their time waiting for external events are generally good candidates for threading. They are also called I/O bound tasks, e.g. writing to or reading from a file, network operations, or using an API to download stuff online.
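To make that concrete, here is a minimal sketch using the standard library threading module. The fake_io_task function and its one-second sleep are just stand-ins for real I/O such as a network request:

import threading
import time

def fake_io_task(name):
    time.sleep(1)  # simulate waiting on I/O, e.g. a network response
    print(f"{name} finished")

threads = [threading.Thread(target=fake_io_task, args=(f"task-{i}",)) for i in range(5)]
for t in threads:
    t.start()  # all five tasks start waiting at the same time
for t in threads:
    t.join()   # wait for every thread to finish; total time is ~1 second, not 5

Later in this post we will use ThreadPoolExecutor instead, which handles this start/join bookkeeping for us.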
Let's take a look at an example that shows the benefit of using threads.
Without Threading
In this example, we want to see how long it takes to download 15 images from the Unsplash API by running our program sequentially.
import requests
import time
img_urls = [
'https://images.unsplash.com/photo-1516117172878-fd2c41f4a759',
'https://images.unsplash.com/photo-1532009324734-20a7a5813719',
'https://images.unsplash.com/photo-1524429656589-6633a470097c',
'https://images.unsplash.com/photo-1530224264768-7ff8c1789d79',
'https://images.unsplash.com/photo-1564135624576-c5c88640f235',
'https://images.unsplash.com/photo-1541698444083-023c97d3f4b6',
'https://images.unsplash.com/photo-1522364723953-452d3431c267',
'https://images.unsplash.com/photo-1513938709626-033611b8cc03',
'https://images.unsplash.com/photo-1507143550189-fed454f93097',
'https://images.unsplash.com/photo-1493976040374-85c8e12f0c0e',
'https://images.unsplash.com/photo-1504198453319-5ce911bafcde',
'https://images.unsplash.com/photo-1530122037265-a5f1f91d3b99',
'https://images.unsplash.com/photo-1516972810927-80185027ca84',
'https://images.unsplash.com/photo-1550439062-609e1531270e',
'https://images.unsplash.com/photo-1549692520-acc6669e2f0c'
]
start = time.perf_counter() #start timer
for img_url in img_urls:
    img_name = img_url.split('/')[3] #get image name from url
    img_bytes = requests.get(img_url).content
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes) #save image to disk
finish = time.perf_counter() #end timer
print(f"Finished in {round(finish-start,2)} seconds")
#results
Finished in 23.101926751 seconds
With Threading
Let's see how threading, via the ThreadPoolExecutor from Python's concurrent.futures module, can significantly improve our program's execution time.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def download_images(img_url):
    img_name = img_url.split('/')[3] #get image name from url
    img_bytes = requests.get(img_url).content
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes) #save image to disk
    print(f"{img_name} was downloaded")
start = time.perf_counter() #start timer
with ThreadPoolExecutor() as executor:
    results = executor.map(download_images, img_urls) #this is similar to map(func, *iterables)
finish = time.perf_counter() #end timer
print(f"Finished in {round(finish-start,2)} seconds")
#results
Finished in 5.544147536 seconds
To get a better understanding of ThreadPoolExecutor and the rest of the concurrent.futures module, visit the documentation: https://docs.python.org/3/library/concurrent.futures.html
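If you want more control than executor.map gives you, the same module also provides submit() and as_completed(). Here is a quick sketch reusing the download_images function and img_urls list from above; the behaviour is the same, but you get a Future per task and can handle results (or exceptions) as they finish:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(download_images, url) for url in img_urls]
    for future in as_completed(futures):
        future.result()  # re-raises any exception that happened inside the thread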
We can see that the threaded version improved our speed significantly compared with the synchronous version, i.e. from about 23 seconds down to about 5 seconds.
For this example, please note that there is an overhead in creating threads, so it makes sense to use them for multiple API calls, not just a single call.
Also, for intensive computations like data crunching or image manipulation, Multiprocessing performs better than threading.
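As a quick preview of part 2, the interface for processes mirrors the one for threads, so switching is mostly a matter of swapping the executor. A minimal sketch with a made-up crunch function standing in for CPU bound work:

from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # CPU bound stand-in: keeps the processor busy instead of waiting on I/O
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required so worker processes can import this module safely
    with ProcessPoolExecutor() as executor:
        results = executor.map(crunch, [10_000_000] * 4)  # each call runs in its own process
        print(list(results))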
In conclusion, for I/O bound tasks, any time our program is running synchronously it is actually not doing much on the CPU; it is mostly waiting around for some input. That is a good sign that we can get some benefit from running our program concurrently using Multithreading.
Next week, in part 2, we'll learn how to use Multiprocessing for CPU heavy tasks to speed up our programs :).
After that, we'll learn how to connect a Django application to a dockerized PostgreSQL and pgAdmin 4 image running on your local machine.
Please follow me and turn on your notifications. Thank you!
Happy coding!
Top comments (4)
Hi,
I get why you would compare http requests, but I think the article would benefit from a final step advertising the use of an async http framework such as httpx or aiohttp.
A lot of people unfortunately copy-paste samples from the internet ;-)
True, but this post is an introduction to the main topic. Thanks for pointing that out.
I like the map syntax, I haven't used that before! It seems to make sense to me now. The standard library threading module always confused me before.
I've been using background.py for a few years and really like its simplicity.
Background Tasks in Python for Data Science
Waylon Walker · Dec 8 '19 · 3 min read
Yes, I love the concurrent module more than the old threading module. The methods are straightforward and easy to understand. I have not tried background.py; I will have a look at it this week. Thanks for your feedback.