machine learning - Why does RSquared increase with # of folds in k-fold cross validation?


I'm tuning a model using k-fold cross-validation, and noticed that the Rsquared accuracy appears to improve with the number of folds. For example, I get a higher Rsquared value when using 30 folds compared to using 10 folds.

Two questions I was hoping to get some insight on:

  1. Why does this occur?
  2. Is there any reason to believe that the Rsquared from k=10 is a better estimate of model accuracy than the one from k=30? Or are both unrelated to the future error rate I can expect on an unseen test set?

Here's a simple example of the effect I'm referring to:

    ############### k = 10 #####################
    > library(caret)
    > data(iris)
    > train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
    > train(Sepal.Length~., data=iris, trControl=train_control, method="rf", metric="Rsquared")

    Random Forest

    150 samples
      4 predictor

    No pre-processing
    Resampling: Cross-Validated (10 fold, repeated 3 times)

    Summary of sample sizes: 137, 135, 134, 134, 135, 136, ...

    Resampling results across tuning parameters:

      mtry  RMSE       Rsquared   RMSE SD     Rsquared SD
      2     0.3381065  0.8404534  0.07692415  0.07583768
      3     0.3247406  0.8502577  0.07311807  0.07326181
      5     0.3228651  0.8517740  0.07213958  0.07315720

    ############### k = 30 #####################
    > data(iris)
    > train_control <- trainControl(method="repeatedcv", number=30, repeats=3)
    > train(Sepal.Length~., data=iris, trControl=train_control, method="rf", metric="Rsquared")

    Random Forest

    150 samples
      4 predictor

    No pre-processing
    Resampling: Cross-Validated (30 fold, repeated 3 times)

    Summary of sample sizes: 143, 145, 146, 144, 145, 144, ...

    Resampling results across tuning parameters:

      mtry  RMSE       Rsquared   RMSE SD     Rsquared SD
      2     0.3238545  0.8580474  0.10327919  0.1352787
      3     0.3119541  0.8679321  0.09734168  0.1236307
      5     0.3109572  0.8717550  0.09727307  0.1123173

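The bigger the number of folds, the bigger each training set and the smaller each test set. You should observe the best result for LOO (leave-one-out, i.e. N folds for N training samples) and the worst for k=2. In the output above, the training sets grow from about 135 samples with k=10 to about 145 with k=30, so each model is fit on slightly more data, while each held-out fold shrinks from about 15 samples to about 5, which is consistent with the larger Rsquared SD reported for k=30.

There is no single answer to the generic question of how many folds to use; it depends solely on the dataset. Furthermore, if there is some underlying relation between the datapoints (for example, they come from a time series), it is also important how the set is divided.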
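To make the split-size argument concrete, here is a minimal sketch in base R. It is not the caret run from the question: it swaps the random forest for a plain linear model (`lm`) so that it needs no extra packages, and it assigns folds by simple random shuffling, whereas caret's own fold construction can give slightly uneven sizes. The names `split_sizes` and `cv_rsq` are made up for this illustration, and the per-fold R² is taken as the squared correlation between predictions and observations, which is my understanding of how caret's default summary defines `Rsquared`.

```r
# Rows per train/test split for several fold counts on iris (n = 150).
n  <- nrow(iris)
ks <- c(2, 10, 30, n)                 # k = n is leave-one-out
split_sizes <- data.frame(
  k          = ks,
  test_rows  = n / ks,                # rows held out in each fold
  train_rows = n - n / ks             # rows each model is fit on
)
print(split_sizes)

# Hypothetical helper: per-fold R^2 for a linear model under k-fold CV.
cv_rsq <- function(k, data = iris, seed = 1) {
  set.seed(seed)
  # Randomly assign every row to one of k folds of (near-)equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    fit  <- lm(Sepal.Length ~ ., data = data[fold != i, ])
    pred <- predict(fit, newdata = data[fold == i, ])
    # Squared correlation between predicted and observed values in fold i.
    cor(pred, data$Sepal.Length[fold == i])^2
  })
}

# Compare the mean and spread of the per-fold R^2 for two fold counts.
for (k in c(10, 30)) {
  r2 <- cv_rsq(k)
  cat(sprintf("k = %2d  mean R^2 = %.3f  sd = %.3f\n", k, mean(r2), sd(r2)))
}
```

The exact numbers depend on the seed and on the model, so treat the printed values as one draw rather than a result. The part that does not change is the first table: more folds always means more training rows and fewer test rows per fold.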

