Rohit Gupta

Cross Validation for Beginners

While solving an ML problem, we usually start with a train-test split. If this split is done randomly, some portion of the data (for example, an entire class or pattern) might end up only in the test set and be absent from the training set, or vice versa. This hurts the accuracy of the model. This is where cross-validation comes into the picture.
Cross-validation is a step in building a machine learning model that helps us ensure the model fits the data accurately and does not overfit. In cross-validation, we divide the training data into a few parts, train the model on some of these parts, and test it on the remaining parts.
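As a quick refresher, here is a minimal sketch of the basic train-test split with scikit-learn; the "target" column name is a placeholder assumption, not from this post.

import pandas as pd
from sklearn.model_selection import train_test_split

# assume train.csv has feature columns and a "target" column (placeholder name)
df = pd.read_csv("train.csv")
X = df.drop("target", axis=1)
y = df["target"]

# a single random 80/20 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)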

Types Of Cross Validation

i. Leave One Out CV :

  • Split the dataset into a training set and a testing set, using all but one observation as the training set.
  • Note that we leave only one observation "out" of the training set. This is where the method gets the name "leave-one-out" cross-validation.
  • Use the left-out observation as the test set.
  • In the next experiment, "leave out" a different observation and take the rest of the data as the training input.
  • Repeat the process until every observation has been used as the test set once (see the sketch below).

Cons: computationally expensive, and it results in low bias (with high variance).

Low bias, high variance: we get good results on the training and test sets, but when we try the model on new data, the accuracy drops and the error rate goes up.
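A minimal sketch of leave-one-out CV using scikit-learn's LeaveOneOut; the dataset, the LogisticRegression model, and the "target" column name are placeholder assumptions, not from the post.

import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")  # assumes a "target" column (placeholder name)
X = df.drop("target", axis=1)
y = df["target"]

# one model fit per row: the number of splits equals the number of
# observations, which is why this method is computationally expensive
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(scores.mean())  # average accuracy over all the fits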

ii. K-Fold CV : We have some data and a value of k. For example, if the number of samples == 1000 and k == 5, then in the first experiment the first 200 samples (1000/5 = 200) are the test data. In the second experiment, the next 200 are the test data, and so on. The process is repeated 5 times.
Out of the 5 iterations we get 5 accuracy scores, and we average them to estimate how well the model performs.
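To see those 5 scores in practice, here is a minimal sketch using cross_val_score; the model and the "target" column name are placeholder assumptions. The full fold-assignment code follows below.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.csv")  # assumes a "target" column (placeholder name)
X = df.drop("target", axis=1)
y = df["target"]

# cv=5 runs plain 5-fold cross-validation and returns one score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # 5 accuracies, one per fold
print(scores.mean())  # the averaged estimate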

Full Code

import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
    # Training data is in a CSV file called train.csv
    df = pd.read_csv("train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # initiate the kfold class from model_selection module
    kf = model_selection.KFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df)):
        df.loc[val_, 'kfold'] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)

iii. Stratified CV : If we have a skewed dataset for classification, with 90% positive samples and only 10% negative samples, we don't use random k-fold cross-validation. Simple k-fold cross-validation on a dataset like this can result in folds with very few, or even no, negative samples. In these cases, we prefer stratified k-fold cross-validation.
Stratified k-fold cross-validation keeps the ratio of labels in each fold constant, so each fold has the same 90% positive and 10% negative samples.

import pandas as pd
from sklearn import model_selection
if __name__ == "__main__":
    # Training data is in a CSV file called train.csv
    df = pd.read_csv("train.csv")
    # we create a new column called kfold and fill it with -1
    df["kfold"] = -1
    # the next step is to randomize the rows of the data
    df = df.sample(frac=1).reset_index(drop=True)
    # fetch the targets; StratifiedKFold needs them to keep the label ratio
    # constant in each fold (the column name "target" is an assumption here;
    # use your own label column)
    y = df.target.values
    # initiate the StratifiedKFold class from model_selection module
    kf = model_selection.StratifiedKFold(n_splits=5)
    # fill the new kfold column
    for fold, (trn_, val_) in enumerate(kf.split(X=df, y=y)):
        df.loc[val_, 'kfold'] = fold
    # save the new csv with kfold column
    df.to_csv("train_folds.csv", index=False)

iv. Time Series CV : Time-series models are cross-validated on a rolling basis. Start with a small subset of the data for training, forecast the later data points, and then check the accuracy of those forecasts. The same forecasted data points are then included in the next training set, and subsequent data points are forecasted.


Full Code
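A minimal sketch using scikit-learn's TimeSeriesSplit, which implements this rolling scheme; that the data in train.csv is already sorted by time is an assumption, not from the post.

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# assume train.csv is already sorted by time (placeholder assumption)
df = pd.read_csv("train.csv")

# each split trains on an expanding window of past rows and tests on the
# rows that immediately follow it, so the model never sees the future
tscv = TimeSeriesSplit(n_splits=5)
for fold, (trn_, val_) in enumerate(tscv.split(df)):
    print(f"fold {fold}: train rows {trn_.min()}-{trn_.max()}, "
          f"test rows {val_.min()}-{val_.max()}")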

That's all folks.

If you have any doubts, ask me in the comments section and I'll try to answer as soon as possible.
If you loved the article, follow me on Twitter: https://twitter.com/guptarohit_kota
If you are the LinkedIn type, let's connect: www.linkedin.com/in/rohitgupta24

Happy Coding and Have an awesome day ahead πŸ˜€!
