Abstract
In this final part of the statistics series, we'll use WebAssembly to extend SingleStoreDB with Linear Regression and Analysis of Variance (ANOVA).
Introduction
This is the final part of the short statistics series of articles, and we'll wrap up with details of how to implement two well-known statistical analysis techniques in SingleStoreDB using WebAssembly.
Create a SingleStoreDB Cloud account
A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use Stats Demo Group as our Workspace Group Name and stats-demo as our Workspace Name.
Once we've created our database in the following steps, we'll make a note of our password and host name.
Create a Database
In our SingleStoreDB Cloud account, we'll use the SQL Editor to create a new database, as follows:
CREATE DATABASE IF NOT EXISTS test;
Setup local Wasm development environment
We'll follow the steps described in the previous article to quickly create a local Wasm development environment. We'll also install and use the pushwasm
tool.
Next, let's clone the following GitHub repo:
git clone https://github.com/singlestore-labs/singlestoredb-statistics
Compile
We'll now change to the singlestoredb-statistics/linreg
directory and build the code, as follows:
cargo build --target wasm32-wasi --release
Deploy
Once the code is built, we'll create an environment variable:
export SINGLESTOREDB_CONNSTRING="mysql://admin:<password>@<host>:3306/test"
We'll replace the <password>
and <host>
with the values from our SingleStoreDB Cloud account.
Next, we'll use pushwasm
to load the Wasm modules into SingleStoreDB, one-by-one, in stages:
- Analysis of Variance
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name aov_init
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name aov_iter
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name aov_merge
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name aov_term
- Simple Linear Regression
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name slr_init
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name slr_iter
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name slr_merge
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name slr_term
- Multiple Linear Regression
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name mlr_init
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name mlr_iter
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name mlr_merge
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name mlr_term
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name mlr_terml
- Utility Functions
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name vec_pack_f64
pushwasm udf --force --conn $SINGLESTOREDB_CONNSTRING --wit ./linreg.wit --wasm ./target/wasm32-wasi/release/linreg.wasm --name vec_unpack_f64
All the Wasm UDFs should be successfully created.
Load and run SQL
In the file linreg.sql
, we'll replace the 3 occurrences of blob
with longblob
.
We'll use a MySQL CLI client to connect to SingleStoreDB:
mysql --local-infile -u admin -h <host> -P 3306 --default-auth=mysql_native_password -p
We'll replace the <host>
with the value from our SingleStoreDB Cloud account.
Once connected, we'll switch to the test
database:
USE test;
We'll then execute the SQL statements from the linreg.sql
file, as follows:
SOURCE linreg.sql
This will create some additional functions.
Run Wasm in the database
For the next part, we'll need to download the bottle.csv
file from the CalCOFI project at Kaggle. The downloaded zip file is approximately 50 MB and 257 MB when unpacked. We'll copy the unpacked CSV file to the singlestoredb-statistics/linreg/data
directory.
Next, we'll execute the SQL statements from the schema.sql
file, as follows:
SOURCE schema.sql
This will create and load data into the fitness
and bottle
tables, and run some SQL statements. Further details of the tables and the following queries can be found in the schema.sql
file in the GitHub repo.
1. Example Fitness Queries
- Long output, full model
SELECT mlr(oxygen,vec_pack_f64([Age, Weight, RunTime, RestPulse, RunPulse, MaxPulse]))
FROM fitness;
The result should be similar to the following:
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| mlr(oxygen,vec_pack_f64([Age, Weight, RunTime, RestPulse, RunPulse, MaxPulse])) |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| [102.93447948433982,-0.22697379625619457,-0.07417741369123694,-2.6286528180843582,-0.021533639753211362,-0.36962775849034357,0.3032171294554114] |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
- Long output, small model
SELECT mlrl(oxygen,vec_pack_f64([Age, Weight]))
FROM fitness;
The result should be similar to the following:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| mlrl(oxygen,vec_pack_f64([Age, Weight])) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [31,3,128.24814438231112,723.13340045637744,851.38154483868857,2,28,30,2.4829084373909605,0.10169946158950849,77.290494127549152,-0.37041610365825456,-0.15823203989833429,13.854628842781686,0.18310073707407407,0.11457181170725468,5.5786766289172585,-2.0230180914476681,-1.3810730365566442,5.7196351554988922e-06,0.052713683614924456,0.17818046127909337] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- Regression coefficients
SELECT mlr(oxygen,vec_pack_f64([Age, Weight]))
FROM fitness;
The result should be similar to the following:
+----------------------------------------------------------------+
| mlr(oxygen,vec_pack_f64([Age, Weight])) |
+----------------------------------------------------------------+
| [77.290494127549152,-0.37041610365825456,-0.15823203989833429] |
+----------------------------------------------------------------+
- Simple Linear Regression
SELECT slr(oxygen,Age)
FROM fitness;
The result should be similar to the following:
+-----------------------------------------------------------------------------------------------------------------------------------------+
| slr(oxygen,Age) |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| {"b0":62.22063853036843,"b1":-0.31135980679392494,"n":31,"pvalue":0.09570086024288527,"r2":0.09277653134140218,"sse":772.3933182604706} |
+-----------------------------------------------------------------------------------------------------------------------------------------+
- Regression coefficients
SELECT mlr(oxygen,vec_pack_f64([Age]))
FROM fitness;
The result should be similar to the following:
+-------------------------------------------+
| mlr(oxygen,vec_pack_f64([Age])) |
+-------------------------------------------+
| [62.220638530368433,-0.31135980679392494] |
+-------------------------------------------+
- Detailed output
SELECT mlrl(oxygen,vec_pack_f64([Age]))
FROM fitness;
The result should be similar to the following:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| mlrl(oxygen,vec_pack_f64([Age])) |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [31,2,78.988226578218018,772.39331826047055,851.38154483868857,1,29,30,2.9656633694439267,0.095700860242885266,62.220638530368433,-0.31135980679392494,8.6698326369370626,0.18080133071430227,7.1766827730079648,-1.7221101502064047,6.7106189272081451e-08,0.095700860242885266] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
- Simple Linear Regression with
GROUP BY
SELECT slr(oxygen,Age)
FROM fitness
GROUP BY agegroup;
The result should be similar to the following:
+-----------------------------------------------------------------------------------------------------------------------------------------+
| slr(oxygen,Age) |
+-----------------------------------------------------------------------------------------------------------------------------------------+
| {"b0":67.07355405406196,"b1":-0.3972297297298791,"n":12,"pvalue":0.5382534183098988,"r2":0.039028511587353654,"sse":239.5865736621429} |
| {"b0":17.676813043480454,"b1":0.6239260869564743,"n":13,"pvalue":0.33971793419603324,"r2":0.08303100866375002,"sse":228.18461678696318} |
| {"b0":61.31558399999808,"b1":-0.24194399999995245,"n":6,"pvalue":0.8882375351410527,"r2":0.005572168395395094,"sse":217.63939676800314} |
+-----------------------------------------------------------------------------------------------------------------------------------------+
- Multiple Linear Regression with
GROUP BY
SELECT AgeGroup, mlr(oxygen,vec_pack_f64([Weight, RunPulse]))
FROM fitness
GROUP BY AgeGroup;
The result should be similar to the following:
+----------+-----------------------------------------------------------------+
| AgeGroup | mlr(oxygen,vec_pack_f64([Weight, RunPulse])) |
+----------+-----------------------------------------------------------------+
| > 50 | [89.457181891040094,-0.061111943943463934,-0.23491808981133788] |
| 44-50 | [105.46359183230126,-0.12189672611963881,-0.28551453735354709] |
| < 44 | [133.27903520507476,-0.10553676551069738,-0.42464015851549392] |
+----------+-----------------------------------------------------------------+
2. Example Bottle Queries
- Simple Linear Regression (
Salinity
as a function ofT_degc
)
SELECT slr(Salnty,T_degc)
FROM bottle
WHERE T_degc IS NOT NULL AND Salnty IS NOT NULL;
The result should be similar to the following:
+--------------------------------------------------------------------------------------------------------------------------+
| slr(Salnty,T_degc) |
+--------------------------------------------------------------------------------------------------------------------------+
| {"b0":34.44090915598341,"b1":-0.055207755972974094,"n":814247,"pvalue":0,"r2":0.255293714414324,"sse":129222.9813118148} |
+--------------------------------------------------------------------------------------------------------------------------+
- Multiple Linear Regression (
Salinity
as a function ofT_degc
)
SELECT mlr(Salnty,vec_pack_f64([T_degc]))
FROM bottle
WHERE T_degc IS NOT NULL AND Salnty IS NOT NULL;
The result should be similar to the following:
+--------------------------------------------+
| mlr(Salnty,vec_pack_f64([T_degc])) |
+--------------------------------------------+
| [34.440909155983412,-0.055207755972974094] |
+--------------------------------------------+
Summary
In this article, we have seen how WebAssembly can provide significant new functionality to SingleStoreDB, allowing detailed data analysis without using external systems or exporting the data for use with third-party tools. We have also seen how powerful the Wasm functions can be through the example queries.
Acknowledgements
I thank Oliver Schabenberger for his work on the Wasm modules and the code examples and documentation in the GitHub repo.
Top comments (0)