Originally published on WordPress.
Recently I have been working on a Kaggle competition where participants are tasked with predicting Russian housing prices. In developing a model for the challenge, I came across a few methods for selecting the best regression
model for a given dataset. Let's load up some data and take a look.
library(ISLR)
set.seed(123)
# the swiss data ships with base R in the datasets package
swiss <- data.frame(datasets::swiss)
dim(swiss)
## [1] 47 6
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
This dataset provides a standardized fertility measure and several
socioeconomic indicators for 47 French-speaking provinces of Switzerland
circa 1888. Suppose we want to predict fertility. Each variable is given
as a percentage. To assess our prediction ability, we'll create training
and test sets.
# hold out 5 of the 47 observations as a test set
index <- sample(nrow(swiss), 5)
train <- swiss[-index, ]
test <- swiss[index, ]
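A quick sanity check on the split sizes (these don't depend on the seed, since we always hold out five of the 47 rows):
dim(train)
## [1] 42 6
dim(test)
## [1] 5 6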
Next, have a quick look at the data.
# scatterplots of Fertility against a few predictors, plus its distribution
par(mfrow = c(2, 2))
plot(train$Education, train$Fertility)
plot(train$Catholic, train$Fertility)
plot(train$Infant.Mortality, train$Fertility)
hist(train$Fertility)
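Alongside the plots, a correlation matrix puts quick numbers on the pairwise relationships:
# pairwise correlations among all six variables, rounded for readability
round(cor(train), 2)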
Looks like there could be some interesting relationships here. The
following block of code takes a model formula (or model matrix) and
performs an exhaustive search for the best combinations of predictors.
library(leaps)
# exhaustive best-subset search; nbest = 10 keeps up to ten models of
# each size (with only five predictors, that is every possible subset)
leap <- regsubsets(Fertility ~ ., data = train, nbest = 10)
leapsum <- summary(leap)
plot(leap, scale = 'adjr2')
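Rather than reading values off the plot, we can also pull them straight out of the summary object we saved; a minimal sketch:
# adjusted R-squared of every candidate model, and the best one found
best <- which.max(leapsum$adjr2)
leapsum$adjr2[best]  # highest adjusted R-squared in the search
coef(leap, best)     # coefficients of that model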
According to the adjusted R-squared value (larger is better), the two
best models are the one with all predictors and the one with all
predictors except Examination. Both have adjusted R-squared values around 0.69, a decent fit. Let's fit these models so we can evaluate them further.
swisslm <- lm(Fertility ~ ., data = train)
swisslm2 <- lm(Fertility ~ . - Examination, data = train)
# use summary() on either fit for a more detailed look
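As a quick aside, since the two models are nested, anova() can run an F-test of whether dropping Examination loses anything:
# F-test comparing the nested models; a small p-value would mean
# Examination carries information beyond the other four predictors
anova(swisslm2, swisslm)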
Now let's compute the mean squared error (MSE) of each model's
predictions on the held-out test set.
mean((test$Fertility - predict(swisslm, test))^2)
mean((test$Fertility - predict(swisslm2, test))^2)
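Since we'll be comparing models this way, a small helper keeps it tidy (test_mse is just a name I'm making up here):
# mean squared prediction error of a fitted model on new data
test_mse <- function(fit, newdata) {
  mean((newdata$Fertility - predict(fit, newdata))^2)
}
test_mse(swisslm, test)   # full model
test_mse(swisslm2, test)  # without Examination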
We're looking for a low value of MSE here, so whichever model posts the
smaller test error is predicting better; with only five held-out
observations, expect the margin to be small and sensitive to the split.
We can also look at the Akaike information criterion (AIC) for more
information. Lower is better here as well.
library(MASS)
# stepwise selection by AIC, considering both adding and dropping terms
step1 <- stepAIC(swisslm, direction = "both")
step1$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 36 1746.780 168.5701
## 2 - Examination 1 53.77608 37 1800.556 167.8436
Here, the model without Examination scores a lower AIC (167.8 versus
168.6) than the full model. It seems that both models are evenly
matched, though I might be inclined to use the model without Examination.
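If you just want a fitted model's AIC without running the stepwise search, base R's AIC() works directly. Note that AIC() and stepAIC()'s table use different additive constants, so the absolute numbers won't match; only the differences between models matter:
# AIC for each model; lower is better, and only the gap is meaningful
AIC(swisslm)
AIC(swisslm2)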
Since we sampled our original data to create the training and test sets,
these results will change with the split. I encourage anyone who wants
to test this themselves to change the set.seed() value at the beginning
and evaluate the results again. Even better: use your own data!
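To get a feel for how much the split matters, here is a rough sketch that repeats the whole split-fit-evaluate cycle over twenty random splits and averages the test errors (the number of repetitions and the variable names are arbitrary choices):
# repeat the train/test comparison over many random splits
set.seed(123)
results <- t(replicate(20, {
  idx <- sample(nrow(swiss), 5)
  tr  <- swiss[-idx, ]
  te  <- swiss[idx, ]
  full    <- lm(Fertility ~ ., data = tr)
  reduced <- lm(Fertility ~ . - Examination, data = tr)
  c(full    = mean((te$Fertility - predict(full, te))^2),
    reduced = mean((te$Fertility - predict(reduced, te))^2))
}))
colMeans(results)  # average held-out MSE for each model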
If you learned something or have questions, feel free to leave a comment
or shoot me an email.
Happy modeling,
Kiefer