DEV Community

DolphinDB
DolphinDB

Posted on

Accelerating Data Analysis: Embracing the Efficient Query and Aggregation Calculations

Image description

Efficient query and calculation capabilities play a pivotal role in handling large-scale datasets, and ensuring timely and accurate data analysis. In this article, we will explore how DolphinDB’s query and advanced calculation features significantly enhance data processing and analysis workflows through practical code examples👇

In this case, we choose TSDB engine as the DolphinDB storage engine. We use a server with 16 CPU cores, 512 GB of memory, and 4 SSDs. The read speed of each disk is about 400 MB/s.

The data we use is the US Stock Exchange level 1 quotes of 4 years. The original data size is around 12.75 TB. After compression, the disk space occupied by DolphinDB is 2.12 TB.

All these data are stored in one table in DolphinDB, and there are almost 270 billion records.

login(`admin, `123456)
pnodeRun(clearAllCache)
quotes = loadTable("dfs://TAQ", "quotes")
Enter fullscreen mode Exit fullscreen mode

Image description

First, let’s do the query.

We can get the data of Lehman Brothers at 3:59:59 PM, August 21, 2007 through the code below. It turns out that DolphinDB takes only 7.9 ms to extract one second’s data of a stock from 270 billion (12.75 TB) records.

select count(*) from quotes 

timer x=select symbol, time, bid, ofr from quotes where symbol='LEH', date=2007.08.21 and time=15:59:59
Enter fullscreen mode Exit fullscreen mode

Image description

The query result is assigned to the variable x, and the content of x can be shown in the data browser.

Image description

Next, let’s calculate the average bid-ask spread of Lehman Brothers in each minute on Aug 31, 2007.

The bid-ask spread is defined as the ask price minus the bid price, and then divided by the average ask price and the bid price, and we also add some filtering conditions to ensure the validity of the data.

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.08.31, symbol=`LEH, time between 09:30:00 : 15:59:59, ofr>bid, ofr>0, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
Enter fullscreen mode Exit fullscreen mode

Run the script, and we can see that execution takes 89 milliseconds.

Image description

The calculation result is plotted in this graph.

Image description

Then we’ll increase the data volume, remove the stock code limitation, and calculate the bid-and-ask price difference per minute.

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.09.24, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
Enter fullscreen mode Exit fullscreen mode

The data of all stocks in the whole market for a whole day is about 10 GB. Run the script, and it takes 2.1 seconds.

Image description

For a well-designed system, the time it takes to finish a query should be linearly related to the amount of days the query consumes.

Image description

So this time, we performs the same calculation on 3 days.

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date between 2007.09.25 : 2007.09.27, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
Enter fullscreen mode Exit fullscreen mode

It takes 5.47 seconds, about 3 times as long as the previous query, which means that the time consumption is directly proportional to the amount of data.

Image description

In a word, performance of DolphinDB is pretty good for both querying a small amount of data and aggregating calculations with large amounts of data.

The complete script is displayed as follows 👇👀

login(`admin, `123456)
pnodeRun(clearAllCache)
quotes = loadTable("dfs://TAQ", "quotes")

select count(*) from quotes 

timer x=select symbol, time, bid, ofr from quotes where symbol='LEH', date=2007.08.21 and time=15:59:59

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.08.31, symbol=`LEH, time between 09:30:00 : 15:59:59, ofr>bid, ofr>0, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date=2007.09.24, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")

timer avgSpread = select avg((ofr-bid)/(ofr+bid)*2) as avgSpread from quotes where date between 2007.09.25 : 2007.09.27, time between 09:30:00 : 15:59:59, ofr>bid, bid>0, ofr/bid<1.2 group by minute(time) as minute
plot(avgSpread.avgSpread, avgSpread.minute, "Average bid-ask spread per minute")
Enter fullscreen mode Exit fullscreen mode

And you can follow the video below, experiencing the efficient query and aggregation calculation.

https://youtu.be/A_XoHP9NboE

Top comments (0)