
Tony Hung
Predicting the price of Heating Oil using PyCaret

In this notebook, we'll go over how to perform time series forecasting on the price of heating oil.

Background

In upstate New York, we are unable to get natural gas service to our home for heating. This is because the bedrock is mostly shale, which makes it tough to extract natural gas. So we have to rely on oil for heat.

As the price of gas goes up and down, so does the price of heating oil. There are many companies that distribute oil, and all of them have different prices. These prices also change between seasons. So I wanted to build an application that predicts the price of oil, so that I know when the best time to buy is.

Data

Before building the models, I needed to get data. I was unable to find a dataset with current heating oil prices, so I had to build my own. The website cheapestoil.com shows the price of oil from many companies in the northeast United States. The site shows the latest prices for these companies, but it does not show previous prices.

So in order to get the prices, I built an AWS Lambda function that scrapes the price of oil daily. I used an AWS CloudWatch Events rule to run the Lambda function every 12 hours to fetch the prices at that time. This Lambda extracts the last-updated date, price, and supplier, saves the results as JSON, and writes them to an S3 bucket. After the JSON data is saved, another Lambda function, attached as a trigger, reads each JSON file and saves it into DynamoDB. Please see this GitHub repo for more detail on how I built these Lambda functions.
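
As a rough illustration (not the exact code from the repo), a scraper Lambda along these lines would do the job. The URL, HTML selectors, and bucket name below are assumptions for the sketch:

import json
from datetime import datetime, timezone

import boto3
import requests
from bs4 import BeautifulSoup

s3 = boto3.client("s3")

def handler(event, context):
    # Fetch the listing page (URL and markup structure are assumptions)
    html = requests.get("https://www.cheapestoil.com/heating-oil-prices/NY", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    records = []
    for row in soup.select("table tr"):  # hypothetical selector
        cells = [c.get_text(strip=True) for c in row.select("td")]
        if len(cells) >= 3:
            records.append({
                "supplier": cells[0],
                "last_updated": cells[1],
                "price150": cells[2],
            })

    # Save the snapshot as JSON to S3; a second Lambda, triggered by the
    # S3 put event, loads each file into DynamoDB
    key = f"prices/{datetime.now(timezone.utc).isoformat()}.json"
    s3.put_object(Bucket="heating-oil-prices",  # hypothetical bucket name
                  Key=key, Body=json.dumps(records))
    return {"saved": key, "count": len(records)}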

I started this project back in Dec 2020 in order to build up my dataset. The Lambda function has been running for about 6 months, and I have a decent amount of data to work with. To expand my dataset further, I was able to pull more data using the Wayback Machine on web.archive.org. The Wayback Machine stores snapshots of many pages on the internet. It doesn't have every site, but it did have some snapshots of cheapestoil.com. To get that data, I used https://github.com/hartator/wayback-machine-downloader to download the archived pages. The archive only had 7 snapshots, between the dates of Aug 2020 and Oct 2021.

In all, I have about 5k records of oil prices from the website.

Fetching Saved Data

I used DynamoDB to store the oil price data and Boto3 to fetch it, which I then saved to a CSV.

import boto3
import pandas as pd

dynamodb = boto3.resource('dynamodb')

# Scan the table and dump all items to a CSV
table = dynamodb.Table('heating_oil_prices')
response = table.scan()
df = pd.DataFrame(response["Items"])
df.to_csv("data.csv")
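
One caveat worth noting: a single scan call returns at most 1 MB of data. If the table outgrows that, the fetch needs to follow LastEvaluatedKey, along these lines:

# Paginate through the full table; scan() returns at most 1 MB per call
items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])
df = pd.DataFrame(items)

With data.csv in place, the following helper loads and cleans it:
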
def get_data():
    df = pd.read_csv("data.csv", usecols=["last_updated", "price150", "price500", "price300", "supplier", "state"])
    df["last_updated"] = pd.to_datetime(df["last_updated"])
    df = df.set_index("last_updated")
    # Fill missing states with New York
    df["state"] = df["state"].apply(lambda x: "NewYork" if str(x) == "nan" else x)
    df = df.sort_index()
    return df

df = get_data()

On cheapestoil.com, prices are listed per gallon, but the per-gallon price differs slightly depending on how many gallons you buy. If we look at suppliers in the state of New York, we'll see the following:

    state = "NewYork"
    suppliers_by_state = df[ (df["state"] == state)].dropna()
    suppliers_by_state.iloc[1]

price500          1.449
price300          1.469
price150          1.549
supplier    Suffolk Oil
state           NewYork
Name: 2020-08-03 15:11:16, dtype: object

The row above shows that the price for 500 gallons (price500) is $1.449 per gallon, 300 gallons (price300) is $1.469, and 150 gallons (price150) is $1.549.

Let's see how many price records we have across New York suppliers:

suppliers_by_state["supplier"].value_counts().sum()
2708
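
Note that value_counts().sum() counts price records, not distinct suppliers. To count the unique suppliers themselves, we can use nunique():

# Number of distinct suppliers in New York
suppliers_by_state["supplier"].nunique()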

Since we have so many suppliers, forecasting the average price of oil across all suppliers might be a better way to go, since the prices are similar between companies.

# Daily mean price across all New York suppliers
df = df.reset_index()
data = df[df["state"] == state].resample('d', on='last_updated').mean().dropna()
data = data.reset_index()

Here is the mean price of oil over all the suppliers in New York.

data.head()
   last_updated  price500  price300  price150
0    2020-08-03  1.672000  1.719400   1.76140
1    2020-08-04  1.632429  1.625750   1.68075
2    2020-10-29  1.852500  1.819455   1.85300
3    2020-11-11  1.786250  1.759000   1.76900
4    2020-11-12  1.420000  1.460000   1.54000

For a quick check on our data, let's plot the price for 500 gallons

data["price500"].plot()

<AxesSubplot:>

(Plot of the daily mean price500 over time.)

Model

Now we can start building our model. We'll be using PyCaret to build our time series forecast.
Before modeling, we need to update the dataset to remove the date column and replace it with numeric values. To do this, I've used fastai's add_datepart function to expand the date into a series of features: year, month, day, day of week, and more.

from fastai.tabular.all import *
add_datepart(data, field_name="last_updated")
price500 price300 price150 last_updatedYear last_updatedMonth last_updatedWeek last_updatedDay last_updatedDayofweek last_updatedDayofyear last_updatedIs_month_end last_updatedIs_month_start last_updatedIs_quarter_end last_updatedIs_quarter_start last_updatedIs_year_end last_updatedIs_year_start last_updatedElapsed
0 1.672000 1.719400 1.761400 2020 8 32 3 0 216 False False False False False False 1.596413e+09
1 1.632429 1.625750 1.680750 2020 8 32 4 1 217 False False False False False False 1.596499e+09
2 1.852500 1.819455 1.853000 2020 10 44 29 3 303 False False False False False False 1.603930e+09
3 1.786250 1.759000 1.769000 2020 11 46 11 2 316 False False False False False False 1.605053e+09
4 1.420000 1.460000 1.540000 2020 11 46 12 3 317 False False False False False False 1.605139e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 2.220000 2.260000 2.340000 2021 6 26 28 0 179 False False False False False False 1.624838e+09
57 2.632621 2.626935 2.646290 2021 7 26 2 4 183 False False False False False False 1.625184e+09
58 2.606429 2.582633 2.611437 2021 7 27 5 0 186 False False False False False False 1.625443e+09
59 2.619422 2.609042 2.632478 2021 7 27 6 1 187 False False False False False False 1.625530e+09
60 2.571294 2.545455 2.575238 2021 7 27 7 2 188 False False False False False False 1.625616e+09

61 rows × 16 columns

The add_datepart function generates lots of features from the date, but we don't need to use all of them. For this model, we'll use last_updatedYear, last_updatedMonth, and last_updatedDay. In future models, we can try the other features.

data = data[['price500', 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
# 70/30 chronological split; no shuffling, since this is a time series
mask = round(len(data) * 0.7)
train = data[:mask]
test = data[mask:]
train.shape, test.shape
((43, 4), (18, 4))

After we split the data into train and test sets, we must initialize the regression setup in PyCaret. This includes supplying the dataset, the target feature we are predicting (price500), and the other features to use (last_updatedYear, last_updatedMonth, and last_updatedDay).

from pycaret.regression import *
s = setup(data = train, test_data = test, target = 'price500', fold_strategy = 'timeseries', numeric_features = ['last_updatedYear','last_updatedMonth', 'last_updatedDay'], fold = 3, transform_target = True, session_id = 42)

Next, we call the compare_models function to find the best model by Mean Absolute Error (MAE), which is the mean of the absolute differences between the model's predictions and the expected values.

best = compare_models(sort = 'MAE')
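
As a quick aside, here is a minimal NumPy sketch of what MAE computes (the example numbers are made up):

import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute differences between predictions and actuals
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

mae([2.0, 2.1, 2.2], [1.95, 2.15, 2.30])  # ~0.0667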

From the results above, it looks like the Passive Aggressive Regressor has the lowest MAE (0.0605), so we'll use that model on the test set.

preds = predict_model(best)

Next, we'll use this model to generate a forecast of the coming days' prices (the code below builds eight dates starting 2021-07-08, the day after the last observation in the dataset).

forecast_df = pd.DataFrame()
forecast_df["last_updated"] = pd.date_range(start='2021-07-08', periods=8)
add_datepart(forecast_df, 'last_updated')
forecast_df = forecast_df[['last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]
forecast_df
predictions = predict_model(best, data=forecast_df)
predictions

Once we make our predictions, we'll merge them with the original data and plot the last 15 days in the dataframe.

def pad_value(day):
    # Zero-pad single-digit month/day values (e.g. 7 -> "07")
    value = str(int(day))
    if len(value) == 1:
        return f'0{value}'
    return value

# Stack the historical data and the forecasts, then rebuild a date index
# from the year/month/day feature columns
results_df = pd.concat([data, predictions], axis=0)
dates = []
for idx, x in results_df.iterrows():
    date_str = f'{pad_value(x["last_updatedYear"])}-{pad_value(x["last_updatedMonth"])}-{pad_value(x["last_updatedDay"])}'
    dates.append(pd.to_datetime(date_str, format='%Y-%m-%d'))
results_df["date"] = dates

results_df.drop(["last_updatedYear", "last_updatedMonth", "last_updatedDay"], axis=1, inplace=True)
results_df = results_df.set_index('date')

results_df[-15:].plot()

The blue line above shows the actual prices and the orange shows the predictions, which do not look the best. If we zoom in on the predictions, we'll see an increase in price per day.

results_df[-8:].plot()

Saving The Model

Now that we've trained our model, we can save it and use it for forecasting on other data.

save_model(best, "model")
Transformation Pipeline and Model Succesfully Saved





(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=['last_updatedYear',
                                                           'last_updatedMonth',
                                                           'last_updatedDay'],
                                       target='price500', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_ca...
                                                  regressor=PassiveAggressiveRegressor(C=1.0,
                                                                                       average=False,
                                                                                       early_stopping=False,
                                                                                       epsilon=0.1,
                                                                                       fit_intercept=True,
                                                                                       loss='epsilon_insensitive',
                                                                                       max_iter=1000,
                                                                                       n_iter_no_change=5,
                                                                                       random_state=42,
                                                                                       shuffle=True,
                                                                                       tol=0.001,
                                                                                       validation_fraction=0.1,
                                                                                       verbose=0,
                                                                                       warm_start=False),
                                                  shuffle=True, tol=0.001,
                                                  validation_fraction=0.1,
                                                  verbose=0,
                                                  warm_start=False)]],
          verbose=False),
 'model.pkl')
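
To reuse the saved pipeline later (for example, in the Streamlit app), PyCaret's load_model restores it. A minimal sketch, assuming a forecast_df built as above:

from pycaret.regression import load_model, predict_model

# Load the saved pipeline ("model.pkl") and forecast on new feature rows
model = load_model("model")
predictions = predict_model(model, data=forecast_df)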

Training Model for all Prices

Now that we have an initial model, we can train 3 separate models, one for each price (price150, price300, and price500).
We'll refactor the model-training code into a function that trains a model for each price, picks the model with the best score, and saves it to a separate file.

def train(data, price, state="NewYork"):
    # Average daily prices for the given state
    data = data.reset_index()
    data = data[data["state"] == state].resample('d', on='last_updated').mean().dropna()
    data = data.reset_index()

    # Expand the date into numeric features and keep year/month/day
    add_datepart(data, field_name="last_updated")
    data = data[[price, 'last_updatedYear', 'last_updatedMonth', 'last_updatedDay']]

    # 70/30 chronological split
    mask = round(len(data) * 0.7)
    train = data[:mask]
    test = data[mask:]
    print("train", train.shape)
    print("test", test.shape)

    s = setup(data = train, test_data = test, target = price, fold_strategy = 'timeseries', numeric_features = ['last_updatedYear','last_updatedMonth', 'last_updatedDay'], fold=2, transform_target = True, session_id = 42)
    best = compare_models(sort = 'MAE')
    save_model(best, f'{price}_model')
for feature in ["price150", "price300", "price500"]:
    df = get_data()
    train(df, feature)
Transformation Pipeline and Model Succesfully Saved
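
Here's a sketch of how the three saved models might be used together for a forecast, assuming a forecast_df built the same way as before (the prediction column name, Label, is PyCaret 2.x's convention):

from pycaret.regression import load_model, predict_model

# Forecast each price tier with its own saved model
forecasts = {}
for price in ["price150", "price300", "price500"]:
    model = load_model(f"{price}_model")
    forecasts[price] = predict_model(model, data=forecast_df)["Label"]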




Conclusion and What's Next

In this post, we've seen how to build a time series model for forecasting the price of heating oil. In the next post, we'll go over how to deploy these models into a Streamlit application, and we'll take a closer look at the process of how the data was collected.
