Last time we saw possible reasons for high CPU usage. In this blog post we are going to walk through troubleshooting scenarios and queries that can help us identify and fix the issue.
How to troubleshoot - overview
We need to keep in mind that high CPU usage can have many unrelated causes: an expensive operation, memory issues, false sharing, or something else entirely. We saw multiple reasons in the previous part of this series.
We need to check multiple things and verify our assumptions at each stage. It’s worth thinking about recent deployments, weekly patterns, differences in running configuration, locale-specific details, and background processes (CRON jobs, daemons, autovacuum).
In the rest of this post we will walk through various things we can try.
Active and idle connections
The first thing to check is how many connections we have in the database.
Number of connections
We can use the following query:
WITH connections AS (
SELECT
COUNT(*) AS total_connections,
SUM(CASE WHEN state != 'idle' THEN 1 ELSE 0 END) AS non_idle_connections
FROM pg_stat_activity
),
max_connections AS (
SELECT setting AS max_connections
FROM pg_settings
WHERE name = 'max_connections'
)
SELECT
C.total_connections,
C.non_idle_connections,
MC.max_connections,
ROUND((100 * C.total_connections::NUMERIC / MC.max_connections::NUMERIC), 2) AS connections_utilization_percentage
FROM connections AS C
CROSS JOIN max_connections AS MC
+-------------------+----------------------+-----------------+------------------------------------+
| total_connections | non_idle_connections | max_connections | connections_utilization_percentage |
+-------------------+----------------------+-----------------+------------------------------------+
|                 8 |                    1 |             100 |                               8.00 |
+-------------------+----------------------+-----------------+------------------------------------+
We can see the total number of connections and how many of them are not idle. You can also check the maximum number of allowed connections with the following query:
show max_connections;
This will give you the maximum number of connections, like below:
+-----------------+
| max_connections |
+-----------------+
| 100 |
+-----------------+
Typically, the default is 100 connections. You can change the limit to match your needs, but doing so requires restarting your PostgreSQL instance. The more connections you allow, the higher the memory usage, so try to keep the limit just above what you actually need.
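If you have superuser access, you can change the limit with ALTER SYSTEM, which writes the setting to postgresql.auto.conf. A minimal sketch (the value 200 below is just an example, not a recommendation):

```sql
-- Example only: raise the limit to 200 connections.
-- The new value takes effect only after a server restart.
ALTER SYSTEM SET max_connections = 200;

-- After the restart, verify the active value:
SHOW max_connections;
```

Remember that each allowed connection reserves memory, so raise the limit only when you have verified the server has headroom for it.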
Connections per database
The next thing to check is which databases the connections are using. You can use the following query:
SELECT
datname AS database_name,
COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname
ORDER BY 2
Example output:
+----------------+----------------------+
| database_name | non_idle_connections |
+----------------+----------------------+
| demo | 1 |
+----------------+----------------------+
You can also break the result down by the executed query:
SELECT
datname AS database_name,
query,
COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname, query
ORDER BY 3 DESC
The query gives the following output:
+----------------+------------------------------+----------------------+
| database_name | query | non_idle_connections |
+----------------+------------------------------+----------------------+
| demo | SELECT * FROM demo.bookings | 1 |
+----------------+------------------------------+----------------------+
This gives you the queries running at a given point in time. However, a query that is merely running is not necessarily a problem. We are more interested in long-running queries, which we can find like this:
SELECT
datname AS database_name,
query,
now() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY 3 DESC
Example output:
+----------------+------------------------------+-----------------+
| database_name | query | runtime |
+----------------+------------------------------+-----------------+
| demo | SELECT * FROM demo.bookings | 00:00:10.145824 |
+----------------+------------------------------+-----------------+
This shows you clearly which queries have been running for a long time. Queries running for many seconds (or minutes, or hours) are worth investigating further.
Connections waiting for a lock
You can see connections that are waiting:
SELECT A.pid, A.query
FROM pg_stat_activity AS A
JOIN pg_locks AS L ON L.pid = A.pid
WHERE L.granted = false
GROUP BY 1, 2
This gives the following result:
+--------+-----------------------------+
| pid | query |
+--------+-----------------------------+
| 27621 | SELECT * FROM demo.bookings |
+--------+-----------------------------+
By checking this you can figure out which queries are slowed down by other transactions.
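To see not only who is waiting but also who is doing the blocking, PostgreSQL 9.6+ provides the pg_blocking_pids() function. A minimal sketch:

```sql
-- For each waiting backend, list the PIDs of the sessions blocking it
SELECT
    A.pid,
    pg_blocking_pids(A.pid) AS blocked_by,
    A.query
FROM pg_stat_activity AS A
WHERE CARDINALITY(pg_blocking_pids(A.pid)) > 0
```

The blocked_by column is an array of PIDs; you can join it back to pg_stat_activity to see what the blocking sessions are executing.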
Queries
Now we can focus on queries to figure out which ones are slow.
Queries that run often
Let’s see how many times a given query has been executed:
SELECT SUBSTR(query, 1, 100) AS query, calls, total_exec_time, total_exec_time / calls AS call_time
FROM pg_stat_statements
ORDER BY 4 DESC
This gives the following sample output:
+-------------------------------------------------------------------------------------------------------+--------+---------------------+--------------------+
| query | calls | total_exec_time | call_time |
+-------------------------------------------------------------------------------------------------------+--------+---------------------+--------------------+
| EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE, COSTS, FORMAT JSON) SELECT "tconst", "titlet | 2 | 3970.404005 | 1985.2020025 |
| SELECT "tconst", "titletype", "primarytitle", "originaltitle", "isadult", "startyear", "endyear", "r | 2 | 3950.9494290000002 | 1975.4747145000001 |
+-------------------------------------------------------------------------------------------------------+--------+---------------------+--------------------+
The query uses statistics from the pg_stat_statements extension.
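Note that pg_stat_statements is not enabled by default. It must be added to shared_preload_libraries (which requires a restart), and the extension must be created in the database you want to inspect:

```sql
-- In postgresql.conf (server restart required):
-- shared_preload_libraries = 'pg_stat_statements'

-- Then, in each database where you want to query the view:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```

Once enabled, the extension keeps aggregated statistics per normalized query text, which is exactly what the query above reads.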
Metis can show you queries with their statistics easily.
Long transactions
You can see long transactions with the following:
SELECT now() - xact_start AS transaction_time, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction','active')
ORDER BY 1 DESC
Output:
+-------------------+-----------------------------+
| transaction_time | query |
+-------------------+-----------------------------+
| 00:00:07 | SELECT * FROM demo.bookings |
+-------------------+-----------------------------+
This shows you which transactions are long. They are most likely contributing to high CPU usage because of the amount of work they need to do.
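If a long transaction has to be stopped, you can cancel its current query or terminate the whole session. The PID comes from pg_stat_activity; 12345 below is just a placeholder:

```sql
-- Cancel the currently running query (the session stays open)
SELECT pg_cancel_backend(12345);

-- Forcefully terminate the whole session if cancelling is not enough
SELECT pg_terminate_backend(12345);
```

Prefer pg_cancel_backend first, as pg_terminate_backend drops the client connection and rolls back the whole transaction.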
Metis can show you the runtime of each query you execute.
Database hygiene
Several things can lead to an ongoing degradation of the database. Let’s look at some of them.
Statistics
Outdated statistics can cause high CPU usage as well. The query planner may generate inefficient query plans that lead to much longer execution times. You can find tables with outdated statistics using this query (adjust the schema name to your needs):
SELECT
schemaname,
relname,
DATE_TRUNC('minute', last_analyze) last_analyze,
DATE_TRUNC('minute', last_autoanalyze) last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY last_analyze DESC NULLS FIRST, last_autoanalyze DESC NULLS FIRST;
Output:
+-------------+-----------------+---------------+---------------------+
| schemaname | relname | last_analyze | last_autoanalyze |
+-------------+-----------------+---------------+---------------------+
| bookings | aircrafts_data | (null) | (null) |
| bookings | ticket_flights | (null) | 2023-04-06 11:13:00 |
| bookings | tickets | (null) | 2023-04-06 11:12:00 |
+-------------+-----------------+---------------+---------------------+
You can update the statistics manually with this command:
ANALYZE TABLE_NAME
Dead rows
Deleted rows may remain on disk as dead tuples to be removed later. PostgreSQL removes them during the vacuuming process. You can see how many rows are dead with the following query (adjust the schema name to your needs):
SELECT
schemaname,
relname,
n_dead_tup,
DATE_TRUNC('minute', last_vacuum) last_vacuum,
DATE_TRUNC('minute', last_autovacuum) last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY n_dead_tup DESC;
This gives the following output:
+-------------+------------------+-------------+--------------+---------------------+
| schemaname | relname | n_dead_tup | last_vacuum | last_autovacuum |
+-------------+------------------+-------------+--------------+---------------------+
| bookings | boarding_passes | 0 | (null) | 2023-04-06 11:11:00 |
| bookings | tickets | 0 | (null) | 2023-04-06 11:11:00 |
| bookings | airports_data | 0 | (null) | (null) |
+-------------+------------------+-------------+--------------+---------------------+
You can vacuum your table with the following command:
VACUUM TABLE_NAME
Metis can show you the number of dead rows easily.
Indexes
Unused indexes lead to additional write overhead: on every change, the database needs to update not only the table but also all of its indexes.
You can see how indexes are used with the following query:
SELECT
idstat.relname AS table_name,
indexrelname AS index_name,
idstat.idx_scan AS index_scans_count,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
tabstat.idx_scan AS table_reads_index_count,
tabstat.seq_scan AS table_reads_seq_count,
tabstat.seq_scan + tabstat.idx_scan AS table_reads_count,
n_tup_upd + n_tup_ins + n_tup_del AS table_writes_count,
pg_size_pretty(pg_relation_size(idstat.relid)) AS table_size
FROM pg_stat_user_indexes AS idstat
JOIN pg_indexes ON idstat.indexrelname = pg_indexes.indexname AND idstat.schemaname = pg_indexes.schemaname
JOIN pg_stat_user_tables AS tabstat ON idstat.relid = tabstat.relid
WHERE indexdef !~* 'unique'
ORDER BY idstat.idx_scan DESC, pg_relation_size(indexrelid) DESC
Output:
+------------------------+--------------------+--------------------+-------------+--------------------------+------------------------+--------------------+---------------------+------------+
| table_name | index_name | index_scans_count | index_size | table_reads_index_count | table_reads_seq_count | table_reads_count | table_writes_count | table_size |
+------------------------+--------------------+--------------------+-------------+--------------------------+------------------------+--------------------+---------------------+------------+
| title_ratings_indexed | title_ratings_idx | 0 | 28 MB | 0 | 2 | 2 | 1297467 | 74 MB |
+------------------------+--------------------+--------------------+-------------+--------------------------+------------------------+--------------------+---------------------+------------+
Metis can show you unused indexes easily.
Before dropping an index, make sure it is truly unused. It may be unused on your primary database but still used on a replica, so double-check all your databases.
If an index was used in the past but no longer is, you may need to reset the usage statistics to get meaningful results. You can do that with the following query:
SELECT pg_stat_reset()
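Note that pg_stat_reset() clears the counters behind the pg_stat_* views (including index usage) for the current database only, and it does not touch pg_stat_statements. To clear the query statistics as well:

```sql
-- Reset the counters collected by the pg_stat_statements extension
SELECT pg_stat_statements_reset()
```

Keep in mind that resetting statistics also removes history that autovacuum and your own monitoring rely on, so do it deliberately.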
Summary
We have seen how to investigate the reasons for high CPU usage. We can also rely on automated tools like Metis to identify these issues for us. In the next parts we are going to see how Metis actually helped with issues in real-life scenarios.