DEV Community

Carlota Soto for CrateDB

Posted on • Edited on

Querying time series data with SQL: examples

(This content is based on a blogpost originally published in Crate.io)

When signing up for CrateDB Cloud, you can import the NYC taxi dataset from the console with only a couple of clicks. We like this dataset because it includes information from tens of millions of taxi trips, with multiple data types—including time-series data.

In a previous post, I showed you how to load the NYC taxi dataset into CrateDB Cloud. Now, I will give you some SQL queries you can try. This post will be especially useful if you are (like me) still a beginner in the world of SQL. 🤓

Let's get into it.

Assuming that you already have the full dataset imported, you now have about 40 million data records in your cluster. Let's start by taking a look at our table:

SELECT * FROM nyc_taxi
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

The result:

Alt Text

If you scroll to the right, you will see all the information in your table: fare amounts, passenger numbers, tips and tolls paid...

To play around with your data, try the following queries:

1. How many rides happened per day during July 2019?

The SQL query:

SELECT
  date_trunc('day', pickup_datetime) AS day,
  COUNT(*) AS rides
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY 1
ORDER BY 1 LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

The result:

Alt Text

2. What is the total distance traveled during July 2019, as well as the average distance per trip?

The SQL query:

SELECT
  date_trunc('day', pickup_datetime) AS DAY,
  COUNT(*) AS rides,
  SUM(trip_distance) AS total_distance,
  SUM(trip_distance) / COUNT(*) AS average_per_ride
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY DAY
ORDER BY 1 LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

The result:

Alt Text

3. What are the busiest hours of the day?

The SQL query:

SELECT
  EXTRACT(HOUR FROM pickup_datetime) AS HOUR,
  COUNT(*),
  SUM(trip_distance) AS total_distance
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY HOUR
ORDER BY COUNT(*) DESC LIMIT 100;
Enter fullscreen mode Exit fullscreen mode

The result:

Alt Text

4. How many unique pickup locations there are in the entire dataset?

With this one, things start getting a little bit more interesting.

For this query, you would normally do something like this:

SELECT COUNT(DISTINCT(pulocationid)) FROM nyc_taxi;
Enter fullscreen mode Exit fullscreen mode

However, the amount of data is too large. If you try this, CrateDB Cloud will immediately trip a circuit breaker: this query can't be performed reasonably, without having to do a ridiculous full scan of the table.

Alt Text

The solution is to use a function called HYPERLOGLOG:

SELECT HYPERLOGLOG_DISTINCT(pulocationid) FROM nyc_taxi;
Enter fullscreen mode Exit fullscreen mode

The result:

Alt Text

This function uses an algorithm to estimate the number of data records, and it is fast even on very large datasets. You can use HYPERLOGLOG_DISTINCT() just like you would use COUNT(DISTINCT) in your queries.

5. Which are the most high-grossing location pairs in NYC for the entire year? For each pair, how much more it grosses if compared to the next best one?

To end, we get advanced for real: this is an example of how to use window functions.

In this case, the query uses the LAG() function:

SELECT pulocationid, dolocationid, cnt, total, next_highest_grossing, total - next_highest_grossing AS leads_by
FROM (
  SELECT
    pulocationid,
    dolocationid,
    COUNT(*) AS cnt,
    SUM(total_amount) AS total,
    LAG(sum(total_amount), 1) OVER (ORDER BY SUM(total_amount)) AS     next_highest_grossing
  FROM nyc_taxi
  WHERE pickup_datetime BETWEEN '2019-01-01T00:00:00' AND
  '2019-12-31T23:59:59'
  GROUP BY pulocationid, dolocationid
  ORDER BY total DESC
  LIMIT 100
) AS sub;
Enter fullscreen mode Exit fullscreen mode

The result:

Alt Text

Hope this was useful! See you next time 👋


There's nothing better than trying things by yourself! Download CrateDB or sign up for a CrateDB Cloud free trial. Experiment... And tell us what you think 😁

Apart from Dev.to, you can reach to the Crate.io team in:

See you around 🐐

Top comments (0)