How to Downsample Your Data Efficiently

Tired of spending countless minutes downsampling your data? Look no further!

In this article, you’ll learn how to efficiently downsample 6.48 billion high-frequency records into 61 million minute-level records in just 41 seconds with DolphinDB.

The basic configuration of the DolphinDB server is:

  • 16 CPU cores
  • 256 GB memory
  • 4 SSDs

A DolphinDB cluster with 4 data nodes is deployed, and each node uses an SSD.
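Such a single-machine cluster is typically declared in DolphinDB’s cluster.nodes configuration file. The ports and aliases below are hypothetical; the article does not show the actual configuration:

localSite,mode
localhost:8900:controller1,controller
localhost:8901:agent1,agent
localhost:8911:datanode1,datanode
localhost:8912:datanode2,datanode
localhost:8913:datanode3,datanode
localhost:8914:datanode4,datanode

In cluster.cfg, each data node can then be pointed at its own SSD with a node-qualified setting such as datanode1.volumes=/ssd1/dolphindb (the path is ours, for illustration).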

The data we use is:

  • level 1 quotes from the New York Stock Exchange for August 2007
  • around 272 GB, with 6.48 billion records

Downsampling can be performed with a SQL statement in DolphinDB.

As the SQL query may involve multiple partitions, DolphinDB breaks down the job into several tasks and assigns the tasks to the corresponding data nodes for parallel execution. When all the tasks are completed, the system merges the intermediate results from the nodes to return the final result.
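The article does not show how the dfs://TAQ database is partitioned. A plausible sketch, following DolphinDB’s common composite-partition design for TAQ data (the partition values here are assumptions), partitions by date and by symbol range:

// VALUE partition on date, RANGE partition on symbol
dbDate = database("", VALUE, 2007.08.01..2007.08.31)
dbSymbol = database("", RANGE, `A`F`M`S`ZZZZ)
db = database("dfs://TAQ", COMPO, [dbDate, dbSymbol])

With such a scheme, each (date, symbol range) combination is a separate partition, so a month of quotes naturally splits into many tasks that the data nodes can scan in parallel.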

The script is as follows:

// load the distributed database and the quotes table
db = database("dfs://TAQ")
quotes = db.loadTable("quotes")

// check how many records fall in August 2007
select count(*) from quotes where date between 2007.08.01 : 2007.08.31

// use one record as a model to define the schema of the result table
model = select top 1 symbol, date, minute(time) as minute, bid, ofr from quotes where date=2007.08.01, symbol=`EBAY
if(existsTable("dfs://TAQ", "quotes_minute_sql"))
    db.dropTable("quotes_minute_sql")
db.createPartitionedTable(model, "quotes_minute_sql", `date`symbol)

// downsample to 1-minute bars and append the result to the new table
timer{
    minuteQuotes = select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol, date, bar(time, 60) as minute
    loadTable("dfs://TAQ", "quotes_minute_sql").append!(minuteQuotes)
}

// verify the number of downsampled records
select count(*) from loadTable("dfs://TAQ", "quotes_minute_sql")

The frequency can be adjusted as needed simply by modifying bar(time, 60). Here 60 means the data is grouped into 60-second (i.e., 1-minute) intervals, since the timestamp values have second precision.
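For example, based on the script above, 5-minute bars just use 300 seconds per interval (the variable name here is ours):

// downsample to 5-minute bars instead
fiveMinQuotes = select avg(bid) as bid, avg(ofr) as ofr from quotes where date between 2007.08.01 : 2007.08.31 group by symbol, date, bar(time, 300) as minute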

The table "quotes_minute_sql" is created with createPartitionedTable, and the downsampled result can be appended to this table.
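Once the job completes, a quick spot check (a hypothetical query, reusing the EBAY symbol from the model above) confirms the table is populated:

// inspect a few downsampled rows for one symbol
select top 10 * from loadTable("dfs://TAQ", "quotes_minute_sql") where date=2007.08.01, symbol=`EBAY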

We can execute the script and open the web-based cluster manager to check resource usage. It shows that all CPU cores participate in the downsampling, and on each data node 15 tasks run concurrently as data is read from disk.
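Besides the web interface, the same kind of snapshot can be pulled from a script. A minimal sketch, assuming the built-in getClusterPerf function is available in your DolphinDB version:

// snapshot of resource metrics for every cluster node
perf = getClusterPerf()
print(perf)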

When we come back to VS Code and check the execution status, we find that the downsampling takes only 41 seconds and generates 61 million minute-level records.

DolphinDB exhibits outstanding performance in data downsampling for the following reasons:

  1. Jobs are executed in a distributed manner, so the resources of multiple nodes are utilized simultaneously;
  2. Compression reduces disk I/O;
  3. Columnar storage and vectorized computation speed up aggregation.

For a detailed walkthrough of data downsampling, take a look at this demo:
https://youtu.be/0vRuiz1Lf6Y
