What is predicate pushdown?
The immediate theoretical answer that we get on searching is
Predicate pushdown is a query optimisation technique used in database technologies
Okay, I got to know that it is an optimisation technique. But I still did not understand...
- How is the optimisation happening? 🤨
- What is predicate? 🤔
- What exactly is the meaning of pushed down here? 🤷♂️
I'm sure since you are also reading this article, even you have these questions in mind!
Now, lets explore this interesting topic practically in PySpark using explain()
(Similar phenomenon could be observed in relational databases as well)
1] Reading a csv file containing employee information.
emp = spark.read.format('csv')\
.option('header', True)\
.load('data/employee.csv')
emp.show()
emp.explain(mode="formatted")
Here, step (1) is related to read csv file
2] Lets do a group by and get number of employees on each department
df = emp.groupBy('deptID').count()
df.show()
df.explain(mode="formatted")
Here, step (2) is related to group by
3] Lets filter data for only dept number 10
dept_10 = df.filter(df.deptID == 10)
dept_10.show()
dept_10.explain(mode="formatted")
Here, we can see that step (2) is filtering and step (3) is grouping.
Now here is the catch,
Ideally if we go by the sequence of operations, grouping should be done first and then filtering.
However, the optimiser does filtering first and then grouping, because grouping is an expensive operation and it is optimal to filter first and then group data.
In the physical plan we see that filtering (predicate
) is pushed down with respect to grouping. That is why it is called push down
!
Top comments (1)
your articles are self rejuvenating.