Imagine we’re lumberjacks in Oregon. I collected data on some of our trees. Let’s take a peek:

library(knitr)
kable(head(trees))
Girth Height Volume
8.3 70 10.3
8.6 65 10.3
8.8 63 10.2
10.5 72 16.4
10.7 81 18.8
10.8 83 19.7

We sell to a paper company so we only care about the volume of a tree. Let’s visualize the data so we can gain some understanding of the relationship between volume, girth, and height:

plot(trees$Girth, trees$Volume)

plot(trees$Girth, trees$Height)

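A linear model looks reasonable for this relationship. To check that we’re not overfitting, let’s split the dataset into three roughly equal parts (31 trees cannot be divided evenly, so one part gets 11 trees and the others get 10). I’m picking three parts because our sample is so small; with larger samples, five or ten folds are common choices. In general this method is called k-fold cross validation, and in this case we’re doing a 3-fold version. Strictly speaking, k-fold cross validation fits the model on k - 1 folds and scores it on the held-out fold; here we take a simpler look and fit one model per fold to see how much the fit changes from subset to subset.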

# Always set the seed so your folds are the same every time you
# run the code. I like to use the date. Note that R evaluates
# 2016-09-01 as subtraction, so the seed used here is 2006.
set.seed(2016-09-01)

number_of_folds <- 3
tree_data <- trees
tree_data$fold <- sample(rep(1:number_of_folds, length.out = nrow(tree_data)))

library(dplyr)
tree1 <- tree_data %>% filter(fold == 1)
tree2 <- tree_data %>% filter(fold == 2)
tree3 <- tree_data %>% filter(fold == 3)

model1 <- lm(Volume ~ Girth + Height, data = tree1)
model2 <- lm(Volume ~ Girth + Height, data = tree2)
model3 <- lm(Volume ~ Girth + Height, data = tree3)
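
As a quick sanity check on the split, the sketch below rebuilds the fold labels and counts the trees in each fold. It assumes the built-in trees data and the same seed expression as above; `seed_value` and `fold_sizes` are names introduced here for illustration only.

```r
# R parses 2016-09-01 as arithmetic (2016 - 9 - 1), so the seed is 2006.
seed_value <- 2016-09-01
set.seed(seed_value)

number_of_folds <- 3
# Shuffle a balanced vector of fold labels, one label per tree.
fold <- sample(rep(1:number_of_folds, length.out = nrow(trees)))

# 31 trees over 3 folds cannot split evenly: one fold gets 11, the others 10.
fold_sizes <- table(fold)
print(seed_value)
print(fold_sizes)
```

Unequal fold sizes matter later on, because any statistic that sums over observations will be slightly larger for the 11-tree fold.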

Now that we’ve built three models, let’s compare their AICs and RMSEs:

AIC(model1)
## [1] 65.29876
AIC(model2)
## [1] 53.87719
AIC(model3)
## [1] 60.99817
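# Note: this takes the square root of the *sum* of squared residuals,
# so it grows with the number of observations. A conventional RMSE
# uses the mean instead: sqrt(mean(x^2)).
rmse <- function(x){sqrt(sum((x)^2))}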

rmse(model1$residuals)
## [1] 10.85362
rmse(model2$residuals)
## [1] 7.585309
rmse(model3$residuals)
## [1] 10.82937

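The three folds give AICs and RMSEs of similar size, so the fit looks reasonably stable from one subset to the next, which suggests we are probably not overfitting. Keep in mind that these are in-sample numbers: each model is scored on the same trees it was fit to.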
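
For comparison, here is a minimal sketch of the more conventional form of k-fold cross validation, where each fold is held out in turn and the model is scored on trees it has not seen. It is not the procedure used above. The names `holdout_rmse`, `train`, and `test` are hypothetical, and the seed of 2006 is an assumption chosen to match what `set.seed(2016-09-01)` evaluates to.

```r
# Sketch: conventional k-fold cross validation on the built-in trees data.
set.seed(2006)  # assumed seed; any fixed value gives reproducible folds

k <- 3
fold <- sample(rep(1:k, length.out = nrow(trees)))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold.
holdout_rmse <- sapply(1:k, function(i) {
  train <- trees[fold != i, ]
  test  <- trees[fold == i, ]
  fit   <- lm(Volume ~ Girth + Height, data = train)
  pred  <- predict(fit, newdata = test)
  # Use the mean so folds of different sizes are on the same scale.
  sqrt(mean((test$Volume - pred)^2))
})

print(holdout_rmse)        # one out-of-sample RMSE per fold
print(mean(holdout_rmse))  # average across folds
```

If the held-out errors come out much larger than the in-sample errors, that would be a stronger sign of overfitting than any difference between the per-fold fits.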

For fun, let’s take a look at the same values for a model fit to the whole dataset:

model0 <- lm(Volume ~ Girth + Height, data = tree_data)
AIC(model0)
## [1] 176.91
rmse(model0$residuals)
## [1] 20.54072

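Neither number should be read as the full model fitting better or worse than the fold models. AIC is built from the log-likelihood, which is a sum over observations, so it grows with sample size; AIC values are only comparable between models fit to the same data, and there a lower AIC is better. The RMSE is higher for a similar reason: the rmse() function above takes the square root of the sum, not the mean, of the squared residuals, so it also grows with the number of trees. Dividing each value by the square root of its sample size (31 for the full data; 11, 10, and 10 for the folds) gives about 3.69 for the full model against roughly 3.27, 2.40, and 3.42 for the folds, which are all on the same scale.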
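
To make the effect of sample size concrete, the sketch below computes both versions of the error on the full dataset. `rmse_mean` and `root_sse` are hypothetical helper names introduced here; `root_sse` reproduces the calculation used earlier in this post.

```r
# Hypothetical helper: RMSE with the mean inside the square root.
rmse_mean <- function(x) sqrt(mean(x^2))
# The version used earlier in the post: root of the summed squares.
root_sse <- function(x) sqrt(sum(x^2))

full_model <- lm(Volume ~ Girth + Height, data = trees)
res <- residuals(full_model)

print(root_sse(res))   # the post reports 20.54072 for this quantity
print(rmse_mean(res))  # per-tree error, in the same units as Volume
# The two differ only by a factor of sqrt(n).
print(all.equal(root_sse(res), rmse_mean(res) * sqrt(length(res))))
```

Because the mean-based version does not depend on how many trees are in the sample, it can be compared directly between the folds and the full dataset.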