
Jim Hatcher


Efficiently deleting data

When I delete data in CockroachDB (CRDB), the delete is slow -- how can I speed it up?

This is a topic that comes up occasionally, and it's not super intuitive, so I think it would be worth explaining in more detail.

Setup

To explore this, let's create some data we can play with.

First, let's create a database and then configure it to wait only 60 seconds after a delete before actually removing the data from the storage layer in CRDB (this will be helpful if we need to delete the data multiple times and rerun everything).

CREATE DATABASE deletes;
ALTER DATABASE deletes CONFIGURE ZONE USING gc.ttlseconds = 60;
USE deletes;

Then, let's create a table. I'm going to create a table with a few columns of various cardinality and 100 million rows. There is a helpful function in CRDB called generate_series which is great for this kind of data creation:

CREATE TABLE deletes.order_items ( id PRIMARY KEY, customer_id, order_id, product_id, order_date )
AS
SELECT
  g.i AS id,
  g.i % 77777 AS customer_id,
  g.i % 1000000 as order_id,
  g.i % 11111 as product_id,
  g.i::TIMESTAMP as order_date
FROM generate_series(1578003200, 1678003200, 1) g(i);
-- 100,000,001 records (both endpoints are inclusive)
--  between 2020-01-02 22:13:20 (i.e., 1578003200)
--  and 2023-03-05 08:00:00 (i.e., 1678003200)

I'm leveraging a few handy techniques here:

  • I'm using CREATE TABLE AS (or CTAS as we call it) to generate a new table. This is much faster in CRDB than creating the table first and then doing INSERT INTO ... SELECT, because the new table's data is written in bulk by the CTAS job rather than row-by-row through the normal transactional write path.
  • I'm using the g.i value as my "seed value" and using it directly as my id field.
  • For some of the other fields, I'm using the modulo operator (i.e., %) against the g.i field to create some low-cardinality values.
  • I'm generating values in the range between 1578003200 and 1678003200 so that I can easily convert these values into timestamps.
  • Tip: To figure out which values to use as your start and end values, run SELECT '2023-01-15 2:00 PM'::TIMESTAMP::INT; with your desired dates to get the corresponding integers (see the quick check just after this list).
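
For example, here's a quick check of the bounds used in the generate_series call above (a minimal sanity check, assuming the timestamps are interpreted as UTC):

SELECT
  '2020-01-02 22:13:20'::TIMESTAMP::INT AS start_val, -- 1578003200
  '2023-03-05 08:00:00'::TIMESTAMP::INT AS end_val,   -- 1678003200
  1578003200::TIMESTAMP AS start_ts,                  -- 2020-01-02 22:13:20
  1678003200::TIMESTAMP AS end_ts;                    -- 2023-03-05 08:00:00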

Once we have our table created, let's create indexes on our other fields in order to mimic what a realistic table might do:

CREATE INDEX ON order_items ( customer_id );
CREATE INDEX ON order_items ( order_id );
CREATE INDEX ON order_items ( product_id );
CREATE INDEX ON order_items ( order_date );

To get a sense of the cardinality of our fields, we can run a query:

SELECT
  COUNT(*) AS rec_count,
  COUNT(DISTINCT customer_id) AS cust_count,
  COUNT(DISTINCT order_id) AS ord_count,
  COUNT(DISTINCT product_id) AS prd_count,
  COUNT(DISTINCT order_date) AS date_count
FROM order_items;

Our results:

  rec_count | cust_count | ord_count | prd_count | date_count
------------+------------+-----------+-----------+-------------
  100000001 |      77777 |   1000000 |     11111 |  100000001

Looking at explain plans

Now that we have some data to play with, we can start to understand what happens when we delete data in this table.

Let's start by selecting a single record by the PK value:

SELECT *
FROM order_items
WHERE id = 1600000000;

      id     | customer_id | order_id | product_id |     order_date
-------------+-------------+----------+------------+----------------------
  1600000000 |       49333 |        0 |       4889 | 2020-09-13 12:26:40

Now, let's prepend "EXPLAIN" to our query to look at the execution plan.

EXPLAIN
SELECT *
FROM order_items
WHERE id = 1600000000;
                                                        info
---------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

   scan
    estimated row count: 1 (<0.01% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
    table: order_items@order_items_pkey
    spans: [/1600000000 - /1600000000]

This is about as simple an execution plan as you can get: a scan of a single span in the primary index (i.e., the table itself).

Now, let's query by the order_date field, which is also a very high-cardinality field and has an index on it:

SELECT *
FROM order_items
WHERE order_date = 1600000000::timestamp;
      id     | customer_id | order_id | product_id |     order_date
-------------+-------------+----------+------------+----------------------
  1600000000 |       49333 |        0 |       4889 | 2020-09-13 12:26:40

We get the exact same data back. Let's look at the explain plan:

EXPLAIN
SELECT *
FROM order_items
WHERE order_date = 1600000000::timestamp;
                                                                            info
-------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

   index join
   estimated row count: 1
   table: order_items@order_items_pkey
  
  └──  scan
        estimated row count: 1 (<0.01% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
        table: order_items@order_items_order_date_idx
        spans: [/'2020-09-13 12:26:40' - /'2020-09-13 12:26:40']

  index recommendations: 1
  1. type: index replacement
     SQL commands: CREATE INDEX ON order_items (order_date) STORING (customer_id, order_id, product_id); DROP INDEX order_items@order_items_order_date_idx;

Notice a few differences between this explain plan and the one above:

  1. We do a scan of the order_items_order_date_idx this time
  2. We then do an index join back to the primary index order_items_pkey
  3. CRDB has helpfully given us some suggestions on how we can alter our index to perform better.

It's very efficient to use the order_date index since our predicate is on order_date. But the order_date index only contains the field we're indexing (order_date) and a reference back to the PK (id). To satisfy our query's SELECT * clause, we need to join back to the primary index to get the rest of the field values for this record. The index suggestion is telling us that if we include/store all the other fields from the record in the order_date index, we can skip the index join since we'll already have all the data needed to satisfy the query.

You can prove this by changing the SELECT query to only return the id and order_date fields:

EXPLAIN
SELECT id, order_date
FROM order_items
WHERE order_date = 1600000000::timestamp;
                                                        info
---------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

   scan
    estimated row count: 1 (<0.01% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
    table: order_items@order_items_order_date_idx
    spans: [/'2020-09-13 12:26:40' - /'2020-09-13 12:26:40']

Moving on to deletes

Now that we have a sense of interpreting query plans on our table, let's think about what happens when we delete a record in CRDB.

When we delete a record by the PK value, our execution plan is very simple -- find the record and delete it:

EXPLAIN
DELETE FROM order_items
WHERE id = 1600000000;
                                                          info
-------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

   delete
   from: order_items
   auto commit
  
  └──  scan
        estimated row count: 1 (<0.01% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
        table: order_items@order_items_pkey
        spans: [/1600000000 - /1600000000]

If we do a similar DELETE query using the order_date field instead of the PK, we see that we once again have an index join:

EXPLAIN DELETE FROM order_items
WHERE order_date = 1600000000::timestamp;
                                                                            info
-------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

   delete
   from: order_items
   auto commit
  
  └──  index join
       estimated row count: 1
       table: order_items@order_items_pkey
      
      └──  scan
            estimated row count: 1 (<0.01% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
            table: order_items@order_items_order_date_idx
            spans: [/'2020-09-13 12:26:40' - /'2020-09-13 12:26:40']

  index recommendations: 1
  1. type: index replacement
     SQL commands: CREATE INDEX ON order_items (order_date) STORING (customer_id, order_id, product_id); DROP INDEX order_items@order_items_order_date_idx;

The need for this index join is pretty obvious on the SELECT query, but why is it needed on the DELETE query? The answer is that when CRDB deletes a record, it needs to delete not only the entry in the primary index but also the entries in all of the corresponding secondary indexes. And in order to delete those index entries, it needs the indexed column values to build the index keys -- which is why it reads the full row from the primary index first.
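
If you want to double-check which columns each index actually contains, you can inspect the table's index definitions (just a quick sanity check; the exact output columns vary a bit by CRDB version):

SHOW INDEXES FROM order_items;
-- Each secondary index lists its own indexed column plus an implicit id column
-- (the primary key), which is how an index entry maps back to its row.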

Let's follow the optimizer's advice and create this new "covering" index:

CREATE INDEX ON order_items (order_date) STORING (customer_id, order_id, product_id);
DROP INDEX order_items@order_items_order_date_idx;                                                                                                                         

If we run our DELETE statement again with this new index in place, we can see that the need for the index join has been eliminated.

EXPLAIN DELETE FROM order_items
WHERE order_date = 1600000000::timestamp;
                                                             info
-------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

   delete
   from: order_items
   auto commit
  
  └──  scan
        estimated row count: 1 (<0.01% of the table; stats collected 3 hours ago; using stats forecast for 3 hours ago)
        table: order_items@order_items_order_date_idx1
        spans: [/'2020-09-13 12:26:40' - /'2020-09-13 12:26:40']

Warning: This is a great technique for making deletes efficient. If you're doing a regular batch delete of old data or cleaning up records that have reached some "deletable" state, this can help a lot -- especially on really big tables. However, be aware that if you later add another index on the order_items table, you'll want to remember to include that newly indexed field in the index your deletes leverage -- otherwise, the index join will creep back into your plan.
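
On a related note, for that kind of recurring cleanup it usually works better to delete in limited batches rather than in one giant statement, so that each statement touches a bounded number of rows. Here's a rough sketch, assuming we're purging rows older than some cutoff -- the cutoff date and the 10,000 batch size are arbitrary, and you'd run the statement in a loop from your application or a script until it reports 0 rows deleted:

-- Delete a bounded batch of old rows; repeat until nothing is left to delete.
-- With the covering order_date index in place, this scan avoids the index join.
DELETE FROM order_items
WHERE order_date < '2021-01-01'::TIMESTAMP
ORDER BY order_date
LIMIT 10000;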

Summary

Covering indexes are a great tool in CRDB and other database systems for making specific queries more efficient. It's relatively easy to identify when they can help SELECT queries, but it's a little unintuitive to realize that they can also help with DELETEs. But now you know, so happy deleting!!
