We often see advertisements for big data products claiming they can run “queries over trillion-row tables in seconds”, meaning they can find and return the rows that meet a specified condition from among one trillion rows within seconds.
Is this true?
Probably you do not think so if you have read the article “How Much Is One Terabyte of Data?”. To traverse one trillion rows of data, which can be dozens or even a hundred terabytes in size, within seconds, we would need tens of thousands, or even hundreds of thousands, of hard disks reading in parallel. That is almost impracticable.
However, “queries in seconds” does not necessarily mean a full traversal. If the query uses a unique identifier (such as a phone number, though there are not one trillion of those), the computation performed against an index is small: the number of retrievals and comparisons is only about forty. A contemporary computer handles that amount of work effortlessly within one second, even with many concurrent queries. The computational complexity of this kind of targeted search is logarithmic: when the data volume grows from one million rows to one trillion rows, the number of comparisons merely doubles. Almost any database can handle such a search as long as it can accommodate the data, so there is nothing surprising here. Of course, creating the index will be very slow, but that is another matter.
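The “about forty” figure is easy to check. Here is a minimal sketch, assuming the index behaves like a balanced structure (a B-tree or plain binary search), so the lookup cost is close to log2 of the row count:

```python
import math

# Lookups through a balanced index (B-tree / binary search) cost roughly
# log2(row_count) comparisons.
for rows in (1_000_000, 1_000_000_000_000):
    print(f"{rows:>17,} rows -> about {math.ceil(math.log2(rows))} comparisons")

# Output:
#         1,000,000 rows -> about 20 comparisons
# 1,000,000,000,000 rows -> about 40 comparisons
```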
Yet what happens when the computation requires a traversal? An index is useless for, say, summing the values of a certain column.
Well, it is impractical to achieve “queries in seconds” for such a task on trillion-row tables, but it is possible on TB-level data. If the table has one hundred columns, an aggregation on one of them only needs to retrieve about 1% of the data, probably around 10GB. Ten hard disks are enough to scan that in seconds, and such a configuration, or even ten times more, is easy to achieve on a contemporary server cluster.
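A back-of-the-envelope check of that claim, assuming a 1TB columnar table with one hundred equally sized columns and roughly 200MB/s of sequential read speed per disk (both figures are assumptions for illustration):

```python
# Back-of-the-envelope check, assuming a 1TB columnar table with 100 equally
# sized columns and ~200MB/s sequential read per hard disk (both assumed).
table_size_gb = 1_000                       # ~1TB table
columns = 100
column_size_gb = table_size_gb / columns    # ~10GB for the aggregated column
disk_speed_gb_per_s = 0.2                   # assumed ~200MB/s per disk
disks = 10

seconds = column_size_gb / (disk_speed_gb_per_s * disks)
print(f"Scanning {column_size_gb:.0f}GB on {disks} disks takes about {seconds:.0f} seconds")
# -> Scanning 10GB on 10 disks takes about 5 seconds
```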
So this is probably what “being able to handle TB-level data in seconds” really means. Every database product can do that.
Some database products like to compare themselves with famous old-brand databases such as Oracle. We often hear statements claiming that a product is N times faster than Oracle. That sounds like bragging: Oracle, the world-class benchmark product, would not be Oracle if it were easily outrun by N times.
Well, they are not bragging.
Oracle is mainly intended for transaction processing (often known as TP) rather than analytical processing (often known as AP). A TP database should use row-based storage, while an AP database usually uses columnar storage. Suppose an operation involves only two of the one hundred columns in a table. The row-based Oracle database basically has to read all one hundred columns, while a columnar AP database only reads the two targeted columns. The amount of data retrieved is dozens of times smaller, so it is natural that the computation is N times faster. This should not be a surprise; if the computation were not N times faster, that would be the real problem.
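A toy sketch of that difference in pure Python (not how any real engine lays out pages, just the shape of the argument):

```python
# Toy layouts: 1,000 rows, 100 columns, and a query that needs only 2 columns.
ROWS, COLS, NEEDED = 1_000, 100, 2

# Row-based storage: each row is stored together, so reading any column of a
# row means reading the whole row.
row_store = [[r * c for c in range(COLS)] for r in range(ROWS)]
values_touched_row = ROWS * COLS          # every value of every row is read

# Columnar storage: each column is stored contiguously and can be read alone.
column_store = [[r * c for r in range(ROWS)] for c in range(COLS)]
values_touched_col = ROWS * NEEDED        # only the two needed columns are read

print(values_touched_row // values_touched_col)   # -> 50, i.e. dozens of times less I/O
```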
Apart from using columnar storage instead of row-based storage, other stratagems include adopting clusters instead of single machines, in-memory techniques instead of external storage, and so on. In a word, they run faster because they use N times more resources. They outpace Oracle, but they do not really outsmart it.
That’s probably what “being N times faster than Oracle” means. The speed is a fact but isn’t worth boasting about.
In fact, Oracle’s optimizer is powerful. Without columnar storage and the support of plenty of resources, many specialized AP databases do not necessarily run faster than Oracle, particularly the Hadoop-based technologies.
Above, we mentioned the power of logarithmic-level algorithms, which shrink the amount of computation on one trillion rows to a few dozen comparisons. However, the opposite situation also exists: a seemingly small volume of data accompanied by an astronomical amount of computation. In the SPL forum there is a computing scenario from the National Astronomical Observatories. The data involves eleven tables, each with only 500,000 rows, and the total volume is less than 10GB. Yet it takes a distributed database 3.8 hours and 100 CPUs to finish the computation, because the complexity is multiplicative: the total amount of computation is about 10*500,000*500,000 = 2.5 trillion basic operations. Being able to get the job done in such a period of time is already pretty satisfactory.
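Reading the figure literally (this is just the arithmetic implied by the formula above, not a description of the Observatories’ actual query plan), it amounts to roughly ten pairwise passes, each comparing every row of one 500,000-row table with every row of another:

```python
# The brute-force pair count behind the 2.5 trillion figure.
passes = 10
rows = 500_000
print(f"{passes * rows * rows:,}")   # -> 2,500,000,000,000
```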
Therefore, when examining the performance of big data products, we cannot simply go by slogans about how fast they run over a huge volume of data. Those statements may be true, but they are meaningless. With an efficient algorithm, we can execute computations on 100TB in seconds; without one, it can take N hours to perform computations on 10GB.
In this sense, the key to examining the performance of a big data technology is whether it provides distinguished, efficient algorithms that can reduce the amount of computation.
But under the SQL system, a bad product can be extremely bad, while a good one is at best only decent. SQL has been developed for decades, and its mature optimization algorithms are already known throughout the industry; it is difficult for the language to come up with anything new. As a result, vendors who lack the optimization techniques make extremely bad products, and those who have mastered them can only design mediocre ones. After all, they can only copy the techniques from one another.
However, by breaking out of the SQL system, it is possible to come up with new algorithms. For the Observatories scenario mentioned above, there is actually an algorithm that logarithmizes one side of the multiplication. The amount of computation becomes about 10*500,000*log(500,000)*2, a reduction of roughly 10,000 times. Using this algorithm, esProc SPL finishes the computation in 2 minutes on a 4-core laptop.
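A minimal sketch of what “logarithmizing one side” means, assuming the heart of the task is matching each row of one table against another by an ordered key (this is only an illustration of the complexity argument, not the Observatories’ actual algorithm or SPL’s implementation):

```python
import bisect
import math
import random

# If one table is kept sorted by the matching key, each row of the other table
# can be located by binary search instead of being compared with every row.
n = 500_000
a = [random.random() for _ in range(n)]
b = sorted(random.random() for _ in range(n))      # sort once

positions = [bisect.bisect_left(b, x) for x in a]  # one pass: ~n * log2(n) comparisons

brute = n * n                        # one brute-force pass
fast = n * math.log2(n) * 2          # the article's formula keeps a factor of 2
print(f"per-pass reduction: about {brute / fast:,.0f}x")
# -> on the order of 10,000x, matching the estimate above
```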
Unfortunately, SQL is unable to express this algorithm. It is very laborious even to describe the computing logic correctly: the code runs into the KBs, the database optimizer is rendered useless, and the execution takes N solid hours.