Time series forecasting is an interesting sub-topic within machine learning, mainly because the time component adds complexity to making predictions. Over the past month I’ve grown quite fond of it, and one of the best things I’ve learned is that standard supervised machine learning algorithms can be applied to time series. The process is quite similar to a standard ML workflow, except that you have to structure your data in a specific way to preserve its temporal order.
Environment Setup
For setting up your environment I recommend Anaconda, which is kind of the de facto environment manager for data science. However, if you only have Python on your system, that is more than enough as well. I’m also assuming you have a terminal available with a Unix-like shell such as bash or Git Bash.
$ mkdir tsml-tutorial
$ cd tsml-tutorial
If you have anaconda available on your system:
$ conda create -n tsml jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
$ conda activate tsml
If you don’t have anaconda available on your system, but have python 3.3+ installed:
$ python -m venv venv
$ source venv/bin/activate
$ pip install jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
Now that your environment is set up, you can follow along by starting a local Jupyter server and opening a fresh notebook.
$ jupyter notebook
Data Extraction
For this little tutorial we’ll be using one of the most common univariate time series datasets, which you’ve probably already seen: Daily minimum temperatures in Melbourne, Australia, 1981-1990. As you may have guessed, the data consists of the daily minimum temperature in Melbourne over the course of 10 years. We’ll load it with pandas from a GitHub repository. You can browse the data at https://github.com/jbrownlee/Datasets/blob/master/daily-min-temperatures.csv.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # needed for the sns.set() call below
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
%matplotlib inline
sns.set()
# load our dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"
df = pd.read_csv(url)
# output dataframe info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 3650 non-null object
1 Temp 3650 non-null float64
dtypes: float64(1), object(1)
memory usage: 57.2+ KB
Our dataframe consists of 2 columns, Date and Temp, with no missing values and 3650 observations (365 per year). Our data is typed as follows:
- Date column as a string (object dtype), which we’ll want to convert to a DatetimeIndex
- Temp column as a float64
# set Date as datetimeindex
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")
Data Exploration
Since this is a time series, we’d be remiss if we didn’t plot the data out fully. We’ll also want to inspect our data and see if there is autocorrelation.
# plot full 10 years
fig, ax = plt.subplots(figsize=(16, 9))
df.plot.line(title="Daily minimum temperatures in Melbourne, Australia, 1981-1990", style=".", ax=ax)
df.rolling(30).mean().plot(figsize=(16, 9), style="-", ax=ax)
df.rolling(30).std().plot(figsize=(16, 9), style="-", ax=ax)
plt.legend(["Temperature", "30-Day Rolling Average", "30-Day Rolling Std. Dev."])
plt.show()
Our plot of the ten years of data shows that the temperature oscillates, almost like a sinusoidal wave, and the rolling standard deviation shows that the variance doesn’t grow as time progresses. This would be a good candidate for a SARIMA model, but that isn’t what we are here for.
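The plot suggests strong autocorrelation, and one quick way to put a number on it is pandas’ Series.autocorr. The sketch below uses a synthetic sine-plus-noise series as a stand-in (an assumption for illustration, not the Melbourne data); in the notebook the same call should work on df["Temp"].

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the temperature series: a yearly sine wave plus noise.
# (Illustrative data only, not the Melbourne dataset.)
rng = np.random.default_rng(0)
days = np.arange(3 * 365)
temps = pd.Series(11 + 4 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 1, days.size))

# Series.autocorr(lag) is the Pearson correlation between the series
# and a copy of itself shifted by `lag` steps.
lag_1 = temps.autocorr(lag=1)      # adjacent days
lag_182 = temps.autocorr(lag=182)  # roughly half a year apart

print(f"lag 1: {lag_1:.2f}, lag 182: {lag_182:.2f}")
```

For a seasonal series like this one, adjacent days should be strongly positively correlated, while days half a year apart should be negatively correlated.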
Modeling
This is the crux of our tutorial: we’ll be doing regression (with a random forest regressor) to predict the temperature. To start, we’ll create features such as time lags and calendar features to incorporate the temporal structure into our model. To make this easier to manage as our data grows, we’ll wrap the preprocessing and the model in a pipeline.
Features to create:
- Time lags for the previous week
- Rolling 30-Day Temperature average
- Rolling 7-Day Temperature average
- Month of the year
- Week of the year
- Day of the week
- Next day’s temperature (what we are predicting)
# create our features and new dataframe
data = pd.DataFrame({f"t-{x}": df.Temp.shift(x) for x in range(7, 0, -1)})
data["t"] = df.Temp
data["day"] = df.index.isocalendar().day
data["week_of_year"] = df.index.isocalendar().week
data["month"] = df.index.month
data["7-Day Temp. Avg."] = df.Temp.rolling(7).mean()
data["30-Day Temp. Avg."] = df.Temp.rolling(30).mean()
data["t+1"] = df.Temp.shift(-1)
data = data.dropna()
t is our current time step, and t+1 is the next day’s temperature, which we’ll be predicting. To make our model aware of time we’ve also created day-of-week, week and month features, and included lag values and rolling averages. Next we’ll divide our data into training and testing sets so we can train the model and validate its predictions. However, since we are working with time series data, there is a strict order dependence, so we can’t shuffle our data before splitting; we have to maintain its order.
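Both ideas in this paragraph can be seen on a toy series. This is a minimal sketch with made-up values (the names toy, frame and split are illustrative only): shift builds the lag and target columns, and the split is done by position rather than by shuffling.

```python
import pandas as pd

# Toy series standing in for df.Temp (made-up values, not the real data).
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0])

frame = pd.DataFrame({
    "t-1": toy.shift(1),    # yesterday's value moved down one row
    "t": toy,               # today's value
    "t+1": toy.shift(-1),   # tomorrow's value moved up one row (the target)
}).dropna()                 # first and last rows lack a neighbour

# Chronological 70-30 split: slice by position, never shuffle.
split = int(len(frame) * 0.7)
train, test = frame.iloc[:split], frame.iloc[split:]

print(train)
print(test)
```

Every row in the training slice comes before every row in the testing slice, so the model never sees the future during training.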
We’ll split our data up using a 70-30 split, where the last 30% of our data will be used as our testing data, and the first 70% is for our training.
# split data up into training and testing set, preprocess
num_cols = ['t-7', 't-6', 't-5', 't-4', 't-3', 't-2', 't-1', 't',
            '7-Day Temp. Avg.', '30-Day Temp. Avg.']
col_trans = ColumnTransformer(
    [
        # sparse_output replaces the `sparse` argument, which was removed in
        # scikit-learn 1.4 (use sparse=False on versions older than 1.2)
        ("categorical_cols", OneHotEncoder(drop="first", sparse_output=False), ["week_of_year", "month", "day"]),
        ("numeric_cols", StandardScaler(), num_cols)
    ]
)
pipe = Pipeline([("trans", col_trans), ("regression", RandomForestRegressor(n_jobs=-1))])
X = data.drop(columns="t+1")
y = data["t+1"]
X_train, X_test = X[:int(X.shape[0] * .7)], X[int(X.shape[0] * .7):]
y_train, y_test = y[:int(y.shape[0] * .7)], y[int(y.shape[0] * .7):]
Since the order of our observations matters, we can’t use standard shuffled k-fold cross validation. Instead we’ll use the TimeSeriesSplit class from sklearn, which is essentially the time series counterpart of k-fold validation: each fold trains on earlier observations and validates on the ones that follow. Our alternative would be to train our model on all the data and use an information criterion; realistically, when doing any model selection you should use multiple metrics to select your model.
# perform cross validation on training data
-cross_val_score(pipe, X_train, y_train, cv=TimeSeriesSplit(), scoring="neg_root_mean_squared_error").mean()
2.5270646279116358
Here we have the RMSE score from cross validation. It isn’t anything special, but it verifies that we can apply our standard ML toolset to a time series dataset. From our CV we can see our model’s predictions are off by about 2.5 degrees (RMSE).
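To see what TimeSeriesSplit actually does, it can help to print the fold indices on a tiny array. This is a sketch on 12 dummy rows with n_splits=3 (both choices are assumptions for readability; the call above uses the default of 5 splits on the real training data).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 dummy observations in time order (illustrative only).
X_demo = np.arange(12).reshape(-1, 1)

# Each fold trains on an expanding window and validates on the block after it.
folds = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The training window grows with each fold, and the validation indices always come after the training indices, so no fold is scored on data older than what it was trained on.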
# fit our model and make predictions on testing data
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
# show the predictions
y_test.plot(figsize=(16, 9), title="Predictions on Hold Out Data")
pd.Series(preds, index=y_test.index).plot()
plt.legend(["Observations", "Predictions"])
plt.show()
# output RMSE score on test data
# (take the square root ourselves; the `squared` argument was deprecated in scikit-learn 1.4)
np.sqrt(mean_squared_error(y_test, preds))
2.3153790785018127
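For anyone unsure how to read that number, RMSE is the square root of the average squared error, so it is expressed in the same units as the temperature itself. A small hand calculation on made-up values (not taken from the dataset) shows the steps:

```python
import numpy as np

# Made-up observations and predictions, for illustration only.
y_true = np.array([10.0, 12.0, 14.0])
y_pred = np.array([11.0, 12.0, 12.0])

errors = y_true - y_pred              # [-1, 0, 2]
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((1 + 0 + 4) / 3)

print(rmse)
```

Because the errors are squared before averaging, a few large misses pull the score up more than many small ones.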
Looking at the predictions made by our model, we aren’t going to be telling anyone the weather anytime soon. However, this is a prime example of how to apply standard Machine Learning algorithms to your time series.