(This content is based on a blogpost originally published in Crate.io)
When signing up for CrateDB Cloud, you can import the NYC taxi dataset from the console with only a couple of clicks. We like this dataset because it includes information from tens of millions of taxi trips, with multiple data types—including time-series data.
In a previous post, I showed you how to load the NYC taxi dataset into CrateDB Cloud. Now, I will give you some SQL queries you can try. This post will be especially useful if you are (like me) still a beginner in the world of SQL. 🤓
Let's get into it.
Assuming that you already have the full dataset imported, you now have about 40 million data records in your cluster. Let's start by taking a look at our table:
SELECT * FROM nyc_taxi
LIMIT 10;
The result:
If you scroll to the right, you will see all the information in your table: fare amounts, passenger numbers, tips and tolls paid...
To play around with your data, try the following queries:
1. How many rides happened per day during July 2019?
The SQL query:
SELECT
date_trunc('day', pickup_datetime) AS day,
COUNT(*) AS rides
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY 1
ORDER BY 1 LIMIT 100;
The result:
2. What is the total distance traveled during July 2019, as well as the average distance per trip?
The SQL query:
SELECT
date_trunc('day', pickup_datetime) AS DAY,
COUNT(*) AS rides,
SUM(trip_distance) AS total_distance,
SUM(trip_distance) / COUNT(*) AS average_per_ride
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY DAY
ORDER BY 1 LIMIT 100;
The result:
3. What are the busiest hours of the day?
The SQL query:
SELECT
EXTRACT(HOUR FROM pickup_datetime) AS HOUR,
COUNT(*),
SUM(trip_distance) AS total_distance
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY HOUR
ORDER BY COUNT(*) DESC LIMIT 100;
The result:
4. How many unique pickup locations there are in the entire dataset?
With this one, things start getting a little bit more interesting.
For this query, you would normally do something like this:
SELECT COUNT(DISTINCT(pulocationid)) FROM nyc_taxi;
However, the amount of data is too large. If you try this, CrateDB Cloud will immediately trip a circuit breaker: this query can't be performed reasonably, without having to do a ridiculous full scan of the table.
The solution is to use a function called HYPERLOGLOG
:
SELECT HYPERLOGLOG_DISTINCT(pulocationid) FROM nyc_taxi;
The result:
This function uses an algorithm to estimate the number of data records, and it is fast even on very large datasets. You can use HYPERLOGLOG_DISTINCT() just like you would use COUNT(DISTINCT) in your queries.
5. Which are the most high-grossing location pairs in NYC for the entire year? For each pair, how much more it grosses if compared to the next best one?
To end, we get advanced for real: this is an example of how to use window functions.
In this case, the query uses the LAG()
function:
SELECT pulocationid, dolocationid, cnt, total, next_highest_grossing, total - next_highest_grossing AS leads_by
FROM (
SELECT
pulocationid,
dolocationid,
COUNT(*) AS cnt,
SUM(total_amount) AS total,
LAG(sum(total_amount), 1) OVER (ORDER BY SUM(total_amount)) AS next_highest_grossing
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-01-01T00:00:00' AND
'2019-12-31T23:59:59'
GROUP BY pulocationid, dolocationid
ORDER BY total DESC
LIMIT 100
) AS sub;
The result:
Hope this was useful! See you next time 👋
There's nothing better than trying things by yourself! Download CrateDB or sign up for a CrateDB Cloud free trial. Experiment... And tell us what you think 😁
Apart from Dev.to, you can reach to the Crate.io team in:
- Github (We'd love a ⭐️ 🙏)
- Our community page
See you around 🐐
Top comments (0)