Resampling Methods

Saket Chaturvedi

Resampling methods are useful in almost any machine learning workflow. They involve repeatedly drawing samples from a dataset and refitting a model on each sample in order to estimate the model's performance.

For example, you can draw a different sample from your dataset, train a linear regression model on it, and compare the MSE (Mean Squared Error) to see how the model performs on new data (model assessment). However, this process can be expensive, since it involves fitting the model repeatedly on different samples. The approaches below make the idea practical without being prohibitively expensive computationally.
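To make the MSE comparison concrete, here is a minimal sketch in Python (the numbers are made up purely for illustration):

```python
import numpy as np

# Observed values and a model's predictions (made-up numbers for illustration).
y_observed = np.array([3.1, 2.4, 5.8, 4.0])
y_predicted = np.array([2.9, 2.7, 5.5, 4.4])

# MSE: the average squared gap between observation and prediction.
mse = np.mean((y_observed - y_predicted) ** 2)
print(mse)  # 0.095
```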

Validation Set Approach —

The validation set approach involves keeping separate datasets for training and testing, both randomly selected from the full dataset. A common choice is a 60% (train) / 40% (test) split. However, this becomes difficult when the dataset is small, because an even smaller training set may not allow the model to capture the variability and patterns in the data.

We train the model on the training dataset and then test its performance on the test dataset, comparing the training and test error rates. If the training error is much lower than the test error, the model isn't performing well on unseen data.
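As a rough sketch of the validation set approach with scikit-learn (the synthetic data and the 60/40 split are illustrative assumptions, not a prescription):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

# 60% train / 40% test split, chosen at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

# A test MSE far above the training MSE suggests the model will not generalize.
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```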

However, this method has certain drawbacks —

a. The test error rate depends on which records happen to be chosen for the test set and can therefore be highly variable.

b. It is not useful when the overall dataset is small, since the split further reduces the amount of training data; statistical models tend to perform worse on smaller training sets, which leads to overestimating the test error.

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is similar to the validation set approach but differs in how the validation set is selected. In LOOCV, the model is trained on (n-1) records, and the single left-out record is used as the test set, so nearly the whole dataset is used for training. The held-out record is then used to make a prediction and calculate the error.

For example, if we have a dataset {(x1,y1),(x2,y2),…,(xn,yn)}, then {(x1,y1)} is used as the test set and {(x2,y2),…,(xn,yn)} for training. The error on a single record cannot estimate the overall model performance, so the procedure is repeated for every record and the errors are averaged. In a regression setting with MSE as the metric, each left-out record i contributes MSEi = (yi - ŷi)², and the cross-validation error is their average:

CV(n) = (1/n) Σ MSEi, summing over i = 1, …, n
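Here is a minimal sketch of LOOCV using scikit-learn's LeaveOneOut splitter (the synthetic data is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic data standing in for a real dataset (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# The model is fit n times; each fold leaves out exactly one record,
# so each score is the squared error on that single held-out point.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")

cv_error = -scores.mean()  # average the n single-record errors
print("LOOCV estimate of test MSE:", cv_error)
```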

The advantage of LOOCV over the validation set approach is that it uses almost all of the data for training, so it tends not to overestimate the test error. In addition, because there is no randomness in how the splits are chosen, repeating LOOCV gives the same result every time, whereas the validation set estimate varies with the random split.

K-fold Cross-Validation

K-fold cross-validation can be viewed as a generalization of LOOCV: the dataset is divided into k subsets (folds) of roughly equal size, one fold is used for testing/validation, and the remaining folds are used for training. The process is repeated k times, calculating the error on each held-out fold, and the final test error estimate is the average of these k values.

The k-fold method is considerably less computationally expensive than LOOCV: LOOCV requires fitting the model n times, whereas k-fold requires fitting it only k times. Typical values are k = 5 or k = 10, but you can choose the best value of k based on the dataset and the problem you are working on.
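A minimal sketch of k-fold cross-validation with k = 5 and k = 10, again on illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a real dataset (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=200)

for k in (5, 10):
    folds = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y,
                             cv=folds, scoring="neg_mean_squared_error")
    # Each of the k folds contributes one MSE; the estimate is their average.
    print(f"{k}-fold CV estimate of test MSE: {-scores.mean():.3f}")
```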

Moreover, k-fold cross-validation often gives better estimates of the test error than LOOCV. Why? Because of the bias-variance tradeoff.

In LOOCV, we fit the model on n-1 records and test on the remaining record, so each model is trained on nearly the same data. This keeps the bias low, but for the same reason the variance can be high: the n fitted models are trained on almost identical datasets, so their outputs are highly correlated, and averaging highly correlated quantities does little to reduce the variance of the overall estimate. K-fold cross-validation, on the other hand, trains the models on more distinct subsets, producing outputs that are less correlated with each other, and hence an estimate with lower variance at the cost of some bias. The actual bias-variance tradeoff depends on the value of k chosen for the validation, which can be tuned empirically. As noted above, values of 5 and 10 have been found in practice to give balanced estimates with neither excessively high bias nor high variance.

In this post, the examples involved only quantitative outputs and we discussed only MSE; however, these methods can also be used for classification problems with a different error metric, such as the misclassification rate.
