Today I came across an article on Medium about parallelization in Python (here); I had used that post as an exercise to practice vectorization principles with NumPy - you can read my previous post on DEV here. The performance gain obtained on a single core with NumPy is outstanding.
Can we improve the performance of the vectorized Monte-Carlo approach even further?
Dask offers a NumPy-like interface with automated parallelization. So, let us try it!
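For reference, this is roughly what the single-core vectorized version looks like (a minimal sketch; the sample size and exact code in my previous post may differ slightly):

import time
import numpy as np

start = time.time()
sample = 100_000_000  # small enough to fit in memory as a plain NumPy array

xxyy = np.random.uniform(-1, 1, size=(2, sample))  # random points in the square [-1, 1] x [-1, 1]
norm = np.linalg.norm(xxyy, axis=0)                # distance of each point from the origin
insiders = np.sum(norm <= 1)                       # count the points inside the unit circle

pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
print("Finished in: {:.2f}s".format(time.time() - start))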
This is the solution I came up with to compute the number pi using a Monte-Carlo approach, in other words, reproducing the same algorithm as in the previously referenced posts but with Dask. Here I am using the default configuration; I am not exploring Dask tweaks to gain further performance (see the additional notes below for one knob that might be worth trying). I find it amazing how Dask keeps the memory profile really low: it managed the parallelization across my laptop's 8 threads and the available memory seamlessly.
import time
import dask.array as da

start = time.time()
sample = 10_000_000_000  # <- this is huge!

xxyy = da.random.uniform(-1, 1, size=(2, sample))  # random points in the square [-1, 1] x [-1, 1]
norm = da.linalg.norm(xxyy, axis=0)                # distance of each point from the origin
summ = da.sum(norm <= 1)                           # lazy count of the points inside the unit circle
insiders = summ.compute()                          # trigger the actual (parallel) computation

pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
print("Finished in: {:.2f}s".format(time.time() - start))
On my laptop:
pi ~= 3.141615808
Finished in: 107.14s
CPU~Quad core Intel Core i7-8550U (-MT-MCP-)
speed/max~800/4000 MHz
Kernel~4.15.0-99-generic x86_64
Mem~7178.7/32050.2MB
HDD~2250.5GB(56.6% used)
Procs~300
Client~Shell
inxi~2.3.56
Additional notes:
It is possible to write this statement:
summ = da.sum(norm <= 1)
using masked arrays:
mask = da.ma.masked_inside(norm, 0, 1)  # mask the values with 0 <= norm <= 1
trues = da.ma.getmaskarray(mask)        # boolean array: True where the value is masked
summ = da.sum(trues)                    # count the masked (inside) points
Yet this latter form takes about 20% more time on my machine.
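One tweak I have not benchmarked here, but that might be worth exploring, is setting the chunk size of the random array explicitly instead of relying on Dask's default. This is only a sketch; the chunk value below is an assumption to experiment with, not a recommendation:

import dask.array as da

sample = 10_000_000_000
chunk = 100_000_000  # hypothetical chunk size: smaller chunks lower peak memory, larger chunks reduce scheduling overhead

xxyy = da.random.uniform(-1, 1, size=(2, sample), chunks=(2, chunk))
norm = da.linalg.norm(xxyy, axis=0)
insiders = da.sum(norm <= 1).compute()
pi = 4 * insiders / sample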
What are your thoughts?
Cheers