Inference in Painless
I am an employee of Elastic at the time of writing.
Machine learning inference is just math. You have some parameters, pump them through some functions, and boom, you get a result. While this is simple on the surface, all the tooling can get complex. Could I script simple model inference in Elasticsearch?
What is [Elasticsearch | Painless]
Elasticsearch is a distributed, RESTful, and open data store. The underlying store is Lucene with a bunch of goodies built on top.
Painless is a secure, simple, and flexible scripting language purpose-built for Elasticsearch. You can use custom scripts at search time, in many different aggregations, and even at ingest time. It's crazy powerful and flexible. But with great power comes great responsibility.
Machine Learning inference in Painless
Painless 100.5 (not even 101)
Painless can be used in a couple of ways:
- inline: the whole script is included in the API call.
- stored: the script is stored in Elasticsearch's cluster state and referenced by id.
Painless scripts can reference fields in the given context (doc fields, _source fields). They also have access to a params object, which lets the same script be reused with different input parameters.
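To make the inline/stored difference concrete, here is a minimal Python sketch that only builds the two script objects as dictionaries. The one-feature source, the script id `my_stored_script`, and the helper names are all hypothetical and exist only for illustration; nothing is sent to a cluster.

```python
import json

# Hypothetical one-feature script, used only for illustration.
SOURCE = "return params.intercept + params.coef * doc[params.field].value;"


def inline_script(params):
    """Inline form: the full source travels with every request."""
    return {"lang": "painless", "source": SOURCE, "params": params}


def stored_script_reference(script_id, params):
    """Stored form: the request names the script by id and supplies only params."""
    return {"id": script_id, "params": params}


params = {"intercept": 1.0, "coef": 2.0, "field": "bmi"}
inline_body = inline_script(params)
stored_body = stored_script_reference("my_stored_script", params)

print(json.dumps(inline_body, indent=2))
print(json.dumps(stored_body, indent=2))
```

Either dictionary would sit wherever a request expects a script object. The params stay the same in both forms; only the way the source is located changes.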
Simple models
Linear regression, being intuitive and simple, is a very nice place to start the experiments.
It is trivial to implement single-target linear regression in Painless.
# Storing a simple linear regression function script
PUT _scripts/linear_regression_inference
{
"script": {
"lang": "painless",
"source": """
// This assumes the parameter definitions will be given when used
// This also assumes a single target.
double total = params.intercept;
for (int i = 0; i < params.coefs.length; ++i) {
total += params.coefs.get(i) * doc[params['x'+i]].value;
}
return total;
"""
}
}
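For anyone who wants to sanity check the arithmetic outside of Elasticsearch, the same loop can be mirrored in plain Python. This is only a sketch: `doc` is assumed to be a simple dictionary of field name to value (not a real doc-values object), and the tiny two-feature parameters are made up.

```python
def linear_regression_score(params, doc):
    """Mirror of the stored script: intercept plus the sum of coef * feature value."""
    total = params["intercept"]
    for i, coef in enumerate(params["coefs"]):
        # params["x0"], params["x1"], ... hold the field name for each coefficient
        field_name = params["x" + str(i)]
        total += coef * doc[field_name]
    return total


# Made-up parameters and document, just to exercise the loop.
params = {"intercept": 1.0, "coefs": [2.0, -0.5], "x0": "bmi", "x1": "bp"}
doc = {"bmi": 3.0, "bp": 4.0}

print(linear_regression_score(params, doc))  # 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0
```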
I trained a simple model in scikit-learn on the diabetes data set. Here the model's resulting parameters are passed to the stored script to return a script field.
GET diabetes_test/_search
{
"script_fields": {
"regression_score": {
"script": {
"id": "linear_regression_inference",
# Here are the model parameters. The linear regression coefficients and intercept.
"params": {
# coef_ attribute from sklearn
"coefs": [-35.55683674, -243.1692265, 562.75404632, 305.47203008, -662.78772128, 324.27527477, 24.78193291, 170.33056502, 731.67810787, 43.02846824],
# intercept_ attribute from sklearn
"intercept": 152.53813351954059,
"x0": "age",
"x1": "sex",
"x2": "bmi",
"x3": "bp",
"x4": "s1",
"x5": "s2",
"x6": "s3",
"x7": "s4",
"x8": "s5",
"x9": "s6"
}
}
}
}
}
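Typing ten coefficients and ten field names by hand is error prone, so the params object could be assembled programmatically. The helper below is a hypothetical sketch, not part of any library; with scikit-learn one would presumably pass the fitted model's `coef_` and `intercept_` attributes along with the data set's `feature_names`.

```python
import json


def build_script_params(coefs, intercept, feature_names):
    """Build the params object expected by the stored linear regression script."""
    if len(coefs) != len(feature_names):
        raise ValueError("need exactly one feature name per coefficient")
    # float() turns NumPy scalars into plain floats so json.dumps can serialize them
    params = {"coefs": [float(c) for c in coefs], "intercept": float(intercept)}
    for i, name in enumerate(feature_names):
        params["x" + str(i)] = name
    return params


# Shortened, made-up values; the real model has ten coefficients.
params = build_script_params([-35.5, 562.7], 152.5, ["age", "bmi"])
body = {
    "script_fields": {
        "regression_score": {
            "script": {"id": "linear_regression_inference", "params": params}
        }
    }
}
print(json.dumps(body, indent=2))
```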
Writing custom inference code for every model type could get tiring. More complex models will demand an ever-growing library of functions. There are plenty of inference libraries out there to experiment with. Why reinvent the wheel? Can one be made to work with Painless?
m2cgen to Painless
m2cgen is a Python library that translates trained models into static code. While only specific models are supported, the generated code works great. Painless supports a subset of Java, and m2cgen has Java as a potential output. Generating Painless scripts from trained models is possible!
Well, it's not out of the box. The entry point for Painless is an object called params. So, m2cgen's Java functions have to be adjusted for how Painless accepts outside parameters. Here is an example translating the Java output to a Painless-acceptable one:
import m2cgen as m2c
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% for testing
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)
print(diabetes.feature_names)

# Train a gradient-boosted regressor
model = xgb.XGBRegressor(max_depth=6, learning_rate=0.3, n_estimators=50)
model.fit(X_train, y_train)

# Export to Java, then rename the input array to the Painless params object
java_model = m2c.export_to_java(model)
java_model = java_model.replace("input", "params")
# Swap positional indices for feature names, e.g. [2] becomes ["bmi"]
for idx, val in enumerate(diabetes.feature_names):
    java_model = java_model.replace("[" + str(idx) + "]", "[\"" + val + "\"]")
print(java_model)
Here is the output (truncated):
double var0;
if ((params["s5"]) >= (0.0216574483)) {
if ((params["bmi"]) >= (0.0131946635)) {
var0 = 72.2889786;
} else {
...
return ((((((((((((((((((((((((((((((((((((((((((((((((((0.5) + (var0)) + (var1)) + (var2)) + (var3)) + (var4)) + (var5)) + (var6)) + (var7)) + (var8)) + (var9)) + (var10)) + (var11)) + (var12)) + (var13)) + (var14)) + (var15)) + (var16)) + (var17)) + (var18)) + (var19)) + (var20)) + (var21)) + (var22)) + (var23)) + (var24)) + (var25)) + (var26)) + (var27)) + (var28)) + (var29)) + (var30)) + (var31)) + (var32)) + (var33)) + (var34)) + (var35)) + (var36)) + (var37)) + (var38)) + (var39)) + (var40)) + (var41)) + (var42)) + (var43)) + (var44)) + (var45)) + (var46)) + (var47)) + (var48)) + (var49);
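The two plain string replacements in the Python snippet above touch every occurrence of "input" and of a bracketed index, wherever they appear. A slightly more targeted variant, sketched here as an alternative rather than what I actually ran, rewrites only `input[N]` accesses in a single regular-expression pass. The helper name is hypothetical and the sample string is hand-written, not real m2cgen output.

```python
import re


def java_to_painless_params(java_source, feature_names):
    """Rewrite positional accesses like input[8] into params["s5"]."""

    def replace(match):
        index = int(match.group(1))
        return 'params["%s"]' % feature_names[index]

    # \b keeps identifiers that merely end in "input" from matching
    return re.sub(r"\binput\[(\d+)\]", replace, java_source)


feature_names = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
sample = "if ((input[8]) >= (0.0216574483)) { var0 = input[2]; }"

print(java_to_painless_params(sample, feature_names))
# if ((params["s5"]) >= (0.0216574483)) { var0 = params["bmi"]; }
```

Unlike the blanket replace, this leaves any other mention of `input` (such as a method signature) untouched, so that part would still need to be handled separately.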
The generated script is humongous: almost 7,000 lines. Anybody will tell you that is too much.
But, does it work?
Ugh:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "exceeded max allowed stored script size in bytes [65535] with size [307597] for script [diabetes_xgboost_model]"
}
],
"type" : "illegal_argument_exception",
"reason" : "exceeded max allowed stored script size in bytes [65535] with size [307597] for script [diabetes_xgboost_model]"
},
"status" : 400
}
Script size limits are wise. Stored scripts are put in the cluster state object, so the more stored scripts there are, and the larger they are, the more overall cluster performance starts to drag, which might lead to other problems.
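A quick check before uploading would have saved me the round trip. The sketch below assumes the limit is counted in UTF-8 bytes of the script source, which is my reading of the error message rather than something verified here; the 65535 figure is taken straight from that message, and the function names are hypothetical.

```python
# Limit reported in the error message above.
MAX_STORED_SCRIPT_BYTES = 65535


def stored_script_size(source):
    """Size of the script source in bytes, assuming UTF-8 encoding."""
    return len(source.encode("utf-8"))


def fits_default_limit(source, limit=MAX_STORED_SCRIPT_BYTES):
    """True if the source would fit under the given stored script size limit."""
    return stored_script_size(source) <= limit


small = "return params.intercept;"
large = "x" * 307597  # same size as the rejected script in the error above

print(stored_script_size(small), fits_default_limit(small))
print(stored_script_size(large), fits_default_limit(large))
```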
But what if I didn't care about my cluster health? I want my model and I want it now!
PUT _cluster/settings
{
"transient": {
"script.max_size_in_bytes": 10000000
}
}
Sane limitations can't stop me!
Time to put the script:
PUT _scripts/diabetes_xgboost_model
{
"script": {
"lang": "painless",
"source": """
...very large source...
"""
}
}
Now I can use my stored script!
"regression": {
"bucket_script": {
"buckets_path": {
"age": "age",
"sex": "sex",
"bmi": "bmi",
"bp": "bp",
"s1": "s1",
"s2": "s2",
"s3": "s3",
"s4": "s4",
"s5": "s5",
"s6": "s6"
},
"script": {
"id": "diabetes_xgboost_model"
}
}
}
This example is using it as a bucket_script aggregation.
Should you do this in production? No.
Was it a fun exploration? For me, yes! m2cgen is such a wonderful discovery, and pairing it with the flexibility of Painless was worth exploring.
Here is a gist containing the Python code and the full generated script.