DEV Community

Cover image for Understanding the Basics of PostgreSQL's Database Cluster
Hassam Abdullah
Hassam Abdullah

Posted on

Understanding the Basics of PostgreSQL's Database Cluster

PostgreSQL is a powerful and widely used open-source relational database management system. To make the most out of PostgreSQL, it is essential to have a solid understanding of its fundamental components. In this article, we will dive into the logical and physical structure of a database cluster, as well as the internal layout of a heap table file. This knowledge will serve as a foundation for comprehending PostgreSQL and help you become more proficient in working with PostgreSQL.


Logical Structure of a Database Cluster

A database cluster in PostgreSQL refers to a collection of databases managed by a PostgreSQL server. It's important to note that a database cluster does not refer to a group of database servers. Instead, a PostgreSQL server runs on a single host and manages a single database cluster. Within a database cluster, databases themselves are considered database objects, and they are logically separated from each other. Other database objects such as tables, indexes, and views belong to their respective databases.
To manage the relationships between database objects, PostgreSQL internally uses object identifiers (OIDs), which are unsigned 4-byte integers. The mapping between objects and their respective OIDs is stored in system catalogs like pg_database and pg_class. This allows you to retrieve the desired OIDs by executing queries against these system catalogs.

To manage the relationships between database objects, PostgreSQL internally uses object identifiers (OIDs), which are unsigned 4-byte integers. The mapping between objects and their respective OIDs is stored in system catalogs like pg_database and pg_class. This allows you to retrieve the desired OIDs by executing queries against these system catalogs.

Physical Structure of Database Cluster

A database cluster in PostgreSQL is represented by a directory structure, commonly referred to as the base directory. When you initialize a new database cluster using the initdb utility, a base directory is created under a specified location (usually set as the PGDATA environment variable). The base directory contains subdirectories and numerous files that collectively represent the physical structure of the database cluster.

These subdirectories include base/, which contains per-database subdirectories, and global/, which holds cluster-wide tables like pg_database and pg_control. Additionally, there are other subdirectories for specific purposes such as storing transaction commit timestamp data, dynamic shared memory subsystem files, logical decoding status data, replication slot data, and more. PostgreSQL also supports tablespaces, which are directories outside the base directory used to store additional data.


Layout of Databases

Under the base directory, each database is represented by a subdirectory whose name corresponds to the OID of the database. For example, if the OID of a database named sampledb is 16384, its subdirectory would be base/16384. This directory serves as the container for tables, indexes, and associated files belonging to that specific database.

Layout of Files Associated with Tables and Indexes

In PostgreSQL, each table or index is stored as a separate file under the respective database's subdirectory within the base directory. These files are managed by the relfilenode variable and are internally associated with object identifiers (OIDs). The relfilenode values typically match the corresponding OIDs, but they can change due to certain commands like TRUNCATE, REINDEX, or CLUSTER.

Table and index files are accompanied by additional files known as the free space map (FSM) and visibility map (VM). The FSM stores information about the free space capacity on each page within the file, while the VM stores visibility information. Indexes have their own FSM but don't have a visibility map.


Tablespace

PostgreSQL supports the concept of tablespaces, which provide an additional data area outside the base directory. When creating a tablespace, a directory is created under a specified location, and a version-specific subdirectory is created within it. The version-specific subdirectory name follows the format PG_'Major version'_'Catalogue version number'. The tablespace directory is then symbolically linked from the base directory. This allows PostgreSQL to manage data files stored in tablespaces as part of the overall database cluster.

To associate a table or index with a specific tablespace, you can specify the tablespace name during the creation of the table or index. By default, if no tablespace is specified, the object is created in the default tablespace, which is located in the base directory.

Heap Table File Layout

Within a database's subdirectory, each table is represented by a file known as the heap table file. This file stores the actual data rows of the table. The heap table file is identified by the table's OID, which matches the relfilenode value mentioned earlier.
The heap table file consists of multiple fixed-size blocks called pages. Each page typically contains multiple data rows and associated metadata. The size of a page is determined during the initialization of the database cluster and is set to a fixed value, usually 8 KB.

Within each page, the data rows are stored in a specific format. Each data row contains a header that includes information such as the row's length, visibility status, and other metadata. The actual row data follows the header.

When a table row is updated or deleted, PostgreSQL doesn't immediately remove the row from the heap table file. Instead, it marks the row as "dead" and later reuses the space occupied by dead rows when new rows are inserted. This approach helps optimize performance by reducing disk I/O.

In Conclusion

Understanding the basics of a PostgreSQL database cluster is crucial for effectively working with this powerful relational database management system. By comprehending the logical and physical structure of the cluster, as well as the layout of database objects and associated files, you can navigate and manage PostgreSQL databases more efficiently. This knowledge forms the foundation for advanced topics like performance tuning, replication, and high availability in PostgreSQL.

Top comments (0)