Problem Set 6
J. Weldemariam
May 18, 2020
This question compares the performance of the “train-validate” model selection
process and the cross-validation model selection process on the task of
identifying the best explanatory variable in a linear regression from among a
large collection of candidates, none of which is structurally related to the
outcome variable. Success in this context requires that the method provide
evidence that the selected predictor is a poor predictor of the outcome.
Anticipating that replication will be required, the
following function generates data with a train-validate split and a cross-validation
index. It exploits the fact that the data are generated in a random order.
library(stringr)  # provides str_c, used to build the predictor names below
n <- 50
p <- 30
k.fold <- 10
dat.make <- function(n, p, k.fold){
  # Outcome and predictors are independent standard normal draws,
  # so no predictor is structurally related to y.
  y <- rnorm(n)
  xs <- matrix(rnorm(n * p), nrow = n)
  # Create train-validate identifiers for a 3/4 to 1/4 split.
  train.ind <- rep(c(1, 0), times = c(round(n * 3/4), n - round(n * 3/4)))
  # Create fold identifiers for k-fold cross-validation.
  xv.ind <- rep(1:k.fold, ceiling(n / k.fold))
  xv.ind <- xv.ind[1:n]
  dat.this <- data.frame(cbind(y, xs, train.ind, xv.ind))
  names(dat.this) <- c("y", str_c("x", 1:p), "train.ind", "xv.ind")
  return(dat.this)
}
set.seed(2345)
dat.this<-dat.make(n,p,k.fold)
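As a quick check that the identifiers behave as intended (this snippet is illustrative and not part of the original assignment code), the split and fold sizes can be tabulated:
table(dat.this$train.ind)  # roughly 3/4 of rows marked 1 (train), 1/4 marked 0 (validate)
table(dat.this$xv.ind)     # each of the k.fold folds contains roughly n/k.fold rows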
Given a data set in the form returned by “dat.make”,
the function “sse.best.tv” identifies the best-performing predictor when a
model fit on the training set is used to predict the outcome on the validation
set.
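The body of “sse.best.tv” is not reproduced in this excerpt, so the sketch below shows one plausible implementation, assuming “best performing” means the single predictor whose one-variable regression fit on the training rows yields the smallest sum of squared prediction errors (SSE) on the validation rows; the actual assignment code may differ in detail.
# Sketch only: one possible implementation of "sse.best.tv", assuming the
# selection criterion is the smallest validation sum of squared errors.
sse.best.tv <- function(dat){
  train <- dat[dat$train.ind == 1, ]
  valid <- dat[dat$train.ind == 0, ]
  # Predictor columns are everything except y and the two index columns.
  x.names <- setdiff(names(dat), c("y", "train.ind", "xv.ind"))
  # Fit y ~ x on the training rows, then compute the validation-set SSE.
  sse <- sapply(x.names, function(nm){
    fit <- lm(as.formula(str_c("y~", nm)), data = train)
    pred <- predict(fit, newdata = valid)
    sum((valid$y - pred)^2)
  })
  # Report the predictor with the smallest validation SSE and that SSE.
  list(best.x = names(sse)[which.min(sse)], sse.min = min(sse))
}
Under these assumptions, sse.best.tv(dat.this) would return the name of the selected predictor together with its validation SSE.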