As a Backend Engineer or Data Scientist, there are times when you need to improve the speed of your program, assuming you have already chosen the right data structures and algorithms. One way to do this is to take advantage of Multithreading or Multiprocessing.
In this post, I won't be going into detail on the inner workings of Multithreading or Multiprocessing. Instead, we will write a small Python script to download images from Unsplash. We will start with a version that downloads images synchronously, one at a time. Next, we will use threading to improve execution speed.
I am sure you are excited to learn this...
Multithreading
In a nutshell, threading allows you to run your program concurrently. Tasks that spend much of their time waiting for external events are generally good candidates for threading. They are also called I/O bound tasks, e.g. writing to or reading from a file, network operations, or using an API to download stuff online.
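To make that concrete, here is a minimal sketch using the standard library threading module. The fake_io_task function and its one-second sleep are just stand-ins for real I/O such as a network request:

import threading
import time

def fake_io_task(name):
    time.sleep(1)  # simulate waiting on I/O, e.g. a network response
    print(f"{name} finished")

threads = [threading.Thread(target=fake_io_task, args=(f"task-{i}",)) for i in range(5)]
for t in threads:
    t.start()  # all five tasks start waiting at the same time
for t in threads:
    t.join()   # wait for every thread to finish; total time is ~1 second, not 5

Later in this post we will use ThreadPoolExecutor instead, which handles this start/join bookkeeping for us.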
Let's take a look at an example that shows the benefit of using threads.
Without Threading
In this example, we want to see how long it takes to download 15 images from the Unsplash API by running our program sequentially.
import requests
import time
img_urls = [
'https://images.unsplash.com/photo-1516117172878-fd2c41f4a759',
'https://images.unsplash.com/photo-1532009324734-20a7a5813719',
'https://images.unsplash.com/photo-1524429656589-6633a470097c',
'https://images.unsplash.com/photo-1530224264768-7ff8c1789d79',
'https://images.unsplash.com/photo-1564135624576-c5c88640f235',
'https://images.unsplash.com/photo-1541698444083-023c97d3f4b6',
'https://images.unsplash.com/photo-1522364723953-452d3431c267',
'https://images.unsplash.com/photo-1513938709626-033611b8cc03',
'https://images.unsplash.com/photo-1507143550189-fed454f93097',
'https://images.unsplash.com/photo-1493976040374-85c8e12f0c0e',
'https://images.unsplash.com/photo-1504198453319-5ce911bafcde',
'https://images.unsplash.com/photo-1530122037265-a5f1f91d3b99',
'https://images.unsplash.com/photo-1516972810927-80185027ca84',
'https://images.unsplash.com/photo-1550439062-609e1531270e',
'https://images.unsplash.com/photo-1549692520-acc6669e2f0c'
]
start = time.perf_counter() #start timer
for img_url in img_urls:
    img_name = img_url.split('/')[3] #get image name from url
    img_bytes = requests.get(img_url).content
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes) #save image to disk
finish = time.perf_counter() #end timer
print(f"Finished in {round(finish-start,2)} seconds")
#results
Finished in 23.101926751 seconds
With Threading
Let's see how threading, via the ThreadPoolExecutor from Python's concurrent.futures module, can significantly improve our program's execution time.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def download_images(img_url):
    img_name = img_url.split('/')[3] #get image name from url
    img_bytes = requests.get(img_url).content
    with open(img_name, 'wb') as img_file:
        img_file.write(img_bytes) #save image to disk
    print(f"{img_name} was downloaded")
start = time.perf_counter() #start timer
with ThreadPoolExecutor() as executor:
    results = executor.map(download_images, img_urls) #this is similar to map(func, *iterables)
finish = time.perf_counter() #end timer
print(f"Finished in {round(finish-start,2)} seconds")
#results
Finished in 5.544147536 seconds
To get a better understanding of ThreadPoolExecutor and the rest of the concurrent.futures module, visit the documentation: https://docs.python.org/3/library/concurrent.futures.html
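If you want more control than executor.map gives you, the same module also provides submit() and as_completed(). Here is a quick sketch reusing the download_images function and img_urls list from above; the behaviour is the same, but you get a Future per task and can handle results (or exceptions) as they finish:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(download_images, url) for url in img_urls]
    for future in as_completed(futures):
        future.result()  # re-raises any exception that happened inside the thread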
We can see that the threaded version improved our speed significantly compared with the synchronous version, i.e. from about 23 seconds down to about 5 seconds.
For this example, please note that there is an overhead in creating threads, so it makes sense to use them for multiple API calls, not just a single call.
Also, for intensive computations like data crunching or image manipulation, Multiprocessing performs better than threading.
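As a quick preview of part 2, the interface for processes mirrors the one for threads, so switching is mostly a matter of swapping the executor. A minimal sketch with a made-up crunch function standing in for CPU bound work:

from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # CPU bound stand-in: keeps the processor busy instead of waiting on I/O
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required so worker processes can import this module safely
    with ProcessPoolExecutor() as executor:
        results = executor.map(crunch, [10_000_000] * 4)  # each call runs in its own process
        print(list(results))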
In conclusion, for I/O bound tasks, any time our program is running synchronously it is actually not doing much on the CPU; it is mostly waiting around for some input. That is a good sign that we can get some benefit from running our program concurrently using Multithreading.
Next week, in part 2, we'll learn how to use Multiprocessing for CPU heavy tasks to speed up our programs :).
After that, we'll learn how to connect a Django application to a dockerized PostgreSQL and pgAdmin 4 image running on your local machine.
Please follow me and turn on your notifications. Thank you!
Happy coding!
Top comments (4)
Hi,
I get why you would compare http requests, but I think the article would benefit from a final step advertising the use of an async http framework such as httpx or aiohttp.
A lot of people unfortunately copy-paste samples from the internet ;-)
True, but this post is an introduction to the main topic. Thanks for pointing that out.
I like the map syntax, I haven't used that before! It seems to make sense to me now. The standard library threading module always confused me before.
I've been using background.py for a few years and really like its simplicity.
Background Tasks in Python for Data Science
Waylon Walker · Dec 8 '19 · 3 min read
Yes, I love the concurrent module more than the old threading module. The methods are straightforward and easy to understand. I have not tried background.py; I will have a look at it this week. Thanks for your feedback.