In this third in a series blog post we will try to understand how query is processed in Postgres, this is definitely one of the more complicated topic so I'll try to abstract the complexities at times for a clearer core concepts without many implementation specific details.
I have further divided this post into 3 parts but the important part is the first and you can feel free to skip the part 2 and 3 in case it gets too complicated as they are there just for the more curious and you would not loose much by missing those. You can also feel free to refer them back later.
Part 1
Query Processing
The query processing subsystem includes 5 components which are essential to the processing: Parser, Analyzer/Analyser, Rewriter, Planner, and Executor.
Parser
The parser is essentially responsible for generating the parse tree. Parse tree helps create and order of precedence among the operations. If the syntax is not valid it returns an error or else a parse tree is returned. No semantics are checked at this stage so if you logically misspell a table name it wont be detected at this stage.
Analyzer/Analyser
It runs a semantic analysis on the parse tree and returns a valid query tree. So the misspelled table name will be definitely caught here. Let's get a bit familiar with the query tree. It has a target-list of the output columns. Range is the list of relations used in the query. The join tree stores the FROM clause and the WHERE clauses. The sort clause is a list of SortGroupClause. We can surely skip how it looks exactly at this point.
Rewriter
It essentially serves as a rule checker. It transforms the query tree by checking the rules defined in pg_rules system catalog.
Planner and Executor
The planner receives a query tree from the rewriter and generates a plan tree that is the most efficient based on pure cost optimizations. It consists of plan nodes. Each plan node has all the information that would be needed by the executor to execute and the executors executes in a bottom-to-top fashion for single-query table.
Part 2: Optimizing the cost for single-table-query
Cost optimization is done in the Planner and can be split into two types: Single-table-query and multi-table-query. In this part we discuss the first and in part 3 we will discuss the later. Since there's a lot of math going on here so we will just understand the basics and skips the math for now. It's also best to split the details in a few points for better remembering them, I'll also keep them short and sweet.
- Estimating the cost of different operations such as scans, joins, sorts, etc. Each operation has a unique cost and different formula for estimation with different weightages assigned to different factors.
- Factors effecting costs: Disk I/O, CPU and Memory consumption
- PostgreSQL uses statistics collected from the database. Estimates are made for the selectivity of predicates, cardinality of tables, and other relevant factors.
4.EXPLAIN command exists to view the cost of a query.
Part 3: Joins
Multiple-table query involves joining data from multiple tables to retrieve the desired results. 3 join methods are supported by Postgres SQL: nested join, merge join, hash join. In Nested join each row from one table is compared with each row from another table to find matching rows. Merge join is used when both tables being joined are sorted on the join key. Hash join builds hash tables on the join key columns of the participating tables and performs a hash-based lookup system to find the matching rows.
The planner uses the most suitable join based on lower costs. The sequence of joins is explained in the plan tree and then that is executed by the executor. While the plan is executing buffers can be used. Atomic transactions are made sure by using isolation mechanism and concurrency support.
Top comments (0)