This tutorial follows up on the one where we introduced SQL extensions in QuestDB that make time-series analysis easier. In this tutorial, you will learn in detail about the SAMPLE BY
extension in QuestDB, which will enable you to work with time-series data efficiently because of its simplicity and flexibility.
To get started with this tutorial, you should know that SAMPLE BY
is a SQL extension in QuestDB that helps you group or bucket your time-series data based on the designated timestamp. It removes the need for lengthy CASE WHEN
statements and GROUP BY
clauses. Not only that, the SAMPLE BY
extension helps you quickly deal with many other data-related issues, such as missing data, incorrect timezones, and offsets.
This tutorial assumes you have an up-and-running QuestDB instance ready for use. Let's dive straight into it.
Setup
Import sample data
Similar to the previous tutorial, we'll use the NYC taxi riders data for February 2018. You can use the following script utilizing the HTTP REST API to upload data into QuestDB:
curl https://s3-eu-west-1.amazonaws.com/questdb.io/datasets/grafana_tutorial_dataset.tar.gz > grafana_data.tar.gz
tar -xvf grafana_data.tar.gz
curl -F data=@taxi_trips_feb_2018.csv http://localhost:9000/imp
curl -F data=@weather.csv http://localhost:9000/imp
Alternatively, you can utilize the import functionality in the QuestDB console, as shown in the image below:
For importing large CSV files into partitioned tables, QuestDB recommends using the COPY
command. Thie method is especially useful when you are trying to migrate data from another database into QuestDB.
Create an ordered timestamp column
QuestDB mandates the use of an ordered timestamp column, so you'll have to cast the pickup_datetime
column to TIMESTAMP
in a new table called taxi_trips
with the script below:
CREATE TABLE taxi_trips AS (
SELECT *
FROM 'taxi_trips_feb_2018.csv'
ORDER BY pickup_datetime
) TIMESTAMP(pickup_datetime)
PARTITION BY MONTH;
By converting the
pickup_datetime
column to timestamp, you'll allow QuestDB to use it as the designated timestamp. Using the designated timestamp column, QuestDB is able to index the table to run time-based queries more efficiently.
If it all goes well, you should see the following data after running a SELECT *
query on the taxi_trips
table:
Understanding the basics of SAMPLE BY
The SAMPLE BY
extension allows you to create groups and buckets of data based on time ranges. This is especially valuable for time-series data as you can calculate frequently used aggregates with extreme simplicity. SAMPLE BY
offers you the ability to summarize or aggregate data from very fine to very coarse units of time, i.e., from microseconds to months and everything in between, i.e., millisecond, second, minute, hour, and day. You can derive other units of time, such as a week, fortnight, and year from the ones provided out of the box.
Let's look at some examples to understand how to use SAMPLE BY
in different scenarios.
Hourly count of trips
You can use the SAMPLE BY
keyword with the sample unit of h
to get an hour-by-hour count of trips for the whole duration of the data set. Running the following query, you'll get results in the console:
SELECT pickup_datetime,
COUNT() total_trips
FROM 'taxi_trips'
SAMPLE BY 1h;
There are two ways you can read your data in the QuestDB console: using the grid, which has a tabular form factor or using a chart, where you can draw up a line chart, a bar graph, or an area chart to visualize your data. Here's an example of a bar chart drawn from the query mentioned above:
Three-hourly holistic summary of trips
The SAMPLE BY
extension allows you to group data by any arbitrary number of sample units. In the following example, you'll see that the query is calculating a three-hourly summary of trips with multiple aggregate functions:
SELECT pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM 'taxi_trips'
SAMPLE BY 3h;
You can view the output of the query in the following grid on the QuestDB console:
Weekly summary of trips
As mentioned earlier in the tutorial, although there's no sample unit for a week, a fortnight, or a year, you can derive them simply by utilizing the built-in sample units. If you want to sample the data by a week, use 7d
as the sampling time, as shown in the query below:
SELECT pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM 'taxi_trips'
WHERE pickup_datetime BETWEEN '2018-02-01' AND '2018-02-28'
SAMPLE BY 7d;
Dealing with missing data
If you've worked a fair bit with data, you already know that data isn't always in a pristine state. One of the most common issues, especially with time-series data, is discontinuity, i.e., scenarios where data is missing for specific time periods. You can quickly identify and deal with missing data using the advanced functionality of the SAMPLE BY
extension.
QuestDB offers an easy way to generate and fill missing data with the SAMPLE BY
clause. Take the following example: I've deliberately removed data from 4 am to 5 am for the 1st of February 2018. Notice how the FILL
keyword, when used in conjunction with the SAMPLE BY
extension, can generate a row for the hour starting at 4 am and fill it with some data:
SELECT pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM 'taxi_trips'
WHERE pickup_datetime NOT BETWEEN '2018-02-01T04:00:00' AND '2018-02-01T04:59:59'
SAMPLE BY 1h FILL(LINEAR);
In the example above, we've used an inline WHERE
clause to emulate missing clause with the help of the NOT BETWEEN
keyword. Alternatively, you can create a separate table with missing trips using the same idea, as shown below:
CREATE TABLE 'taxi_trips_missing' AS (
SELECT * FROM 'taxi_trips'
WHERE pickup_datetime
NOT BETWEEN '2018-02-01T04:00:00'
AND '2018-02-01T04:59:59');
Ideally, you should use DROP PARTITION
to emulate missing data, but because the data is partitioned by MONTH
, you cannot run the following query:
ALTER TABLE 'taxi_trips'
DROP PARTITION
WHERE pickup_datetime < ('2018-02-01T04:59:59')
AND pickup_datetime > ('2018-02-01T04:00:00');
The FILL
keyword demands a fillOption
from the following:
fillOption |
Usage scenario | Notes |
---|---|---|
NONE | When you don't want to populate missing data, and leave it as is | This is the default fillOption
|
NULL | When you want to generate rows for missing time periods, but leave all the values as NULLs | |
PREV | When you want to copy the values of the previous row from the summarized data | This is useful when you expect the numbers to be similar to the preceding time period |
LINEAR | When you want to normalize the missing values, you can take the average of the immediately preceding and following row | |
CONST or x | When you want to hardcode values where data is missing | FILL (column_1, column_2, column_3, ...) |
Here's another example of hardcoding values using the FILL(x) fillOption
:
Working with timezones and offsets
The SAMPLE BY
extension also enables you to change timezones and add or subtract offsets from your timestamp columns to adjust for any issues you might encounter when dealing with different source systems, especially in other geographic areas. It is important to note that, by default, QuestDB aligns its sample calculation based on the FIRST OBSERVATION
, as shown in the example below:
SELECT pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM 'taxi_trips'
WHERE pickup_datetime BETWEEN '2018-02-01T13:35:52' AND '2018-02-28'
SAMPLE BY 1d;
Note now the 1d
sample calculation starts at 13:35:52
and ends at 13:35:51
the next day. Apart from the one demonstrated above, there are two other ways to align your sample calculations -- to the calendar time zone
, and to calendar with offset
.
Let's look at the other two alignment methods now.
Aligning sample calculation to another timezone
When moving data from one system to another or via a complex pipeline, you can encounter issues with time zones. For the sake of demonstration, let's assume that you've identified that the data set you've loaded into the database is not for New York City but for Melbourne, Australia. These two cities are far apart and are in very different time zones.
QuestDB allows you to fix this issue by aligning your data to another timezone using the ALIGN TO CALENDAR TIME ZONE
option with the SAMPLE BY
extension. In the example shown below, you can see how an ALIGN TO CALENDAR TIME ZONE ('AEST')
has helped align the pickup_datetime
, i.e., the designated timestamp column to the AEST timezone for Melbourne.
SELECT pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM 'taxi_trips'
SAMPLE BY 3h
ALIGN TO CALENDAR TIME ZONE ('AEST');
Aligning sample calculation with offsets
Similar to the previous example, you can also align your sample calculation by offsetting the designated timestamp column manually by any hh:mm
value between -23:59 to 23:59. In the following example, we're offsetting the sample calculation by -5:30, i.e., negative five hours and thirty minutes:
SELECT pickup_datetime,
COUNT() total_trips,
SUM(passenger_count) total_passengers,
ROUND(AVG(trip_distance), 2) avg_trip_distance,
ROUND(SUM(fare_amount)) total_fare_amount,
ROUND(SUM(tip_amount)) total_tip_amount,
ROUND(SUM(fare_amount + tip_amount)) total_earnings
FROM 'taxi_trips'
SAMPLE BY 3h
ALIGN TO CALENDAR WITH OFFSET '-05:30';
Conclusion
In this tutorial, you learned how to exploit the SAMPLE BY
extension in QuestDB to work efficiently with time-series data, especially in aggregated form. In addition, the SAMPLE BY
extension also allows you to fix specific common problems with time-series data attributable to complex data pipelines, disparate source systems in different geographical areas, software bugs, etc. All in all, SQL extensions like SAMPLE BY
provide a significant advantage when working with time-series data by enabling you to achieve more in fewer lines of SQL.
Top comments (0)