DEV Community

Aniketh Deshpande
Aniketh Deshpande

Posted on

Predicate Pushdown - Understanding Practically With An Example

What is predicate pushdown?

The immediate theoretical answer that we get on searching is

Predicate pushdown is a query optimisation technique used in database technologies

Okay, I got to know that it is an optimisation technique. But I still did not understand...

  1. How is the optimisation happening? ๐Ÿคจ
  2. What is predicate? ๐Ÿค”
  3. What exactly is the meaning of pushed down here? ๐Ÿคทโ€โ™‚๏ธ

I'm sure since you are also reading this article, even you have these questions in mind!

Now, lets explore this interesting topic practically in PySpark using explain()
(Similar phenomenon could be observed in relational databases as well)

1] Reading a csv file containing employee information.

emp = spark.read.format('csv')\
                .option('header', True)\
                .load('data/employee.csv')
emp.show()
Enter fullscreen mode Exit fullscreen mode

df.show

emp.explain(mode="formatted")
Enter fullscreen mode Exit fullscreen mode

pyspark explain read df

Here, step (1) is related to read csv file

2] Lets do a group by and get number of employees on each department

df = emp.groupBy('deptID').count()
df.show()
Enter fullscreen mode Exit fullscreen mode

emp group by

df.explain(mode="formatted")
Enter fullscreen mode Exit fullscreen mode

df explain

Here, step (2) is related to group by

3] Lets filter data for only dept number 10

dept_10 = df.filter(df.deptID == 10)
dept_10.show()
Enter fullscreen mode Exit fullscreen mode

Image description

dept_10.explain(mode="formatted")
Enter fullscreen mode Exit fullscreen mode

Image description

Here, we can see that step (2) is filtering and step (3) is grouping.

Now here is the catch,
Ideally if we go by the sequence of operations, grouping should be done first and then filtering.

However, the optimiser does filtering first and then grouping, because grouping is an expensive operation and it is optimal to filter first and then group data.

Image description

In the physical plan we see that filtering (predicate) is pushed down with respect to grouping. That is why it is called push down !

Top comments (1)

Collapse
 
rudu1 profile image
Rudram

your articles are self rejuvenating.