Problem Set 6
May 18, 2020
This question compares the performance of the “train-validate” model selection process and the cross-validation model selection process in the task of identifying the best explanatory variable in a linear regression from among a large collection of candidates, none of which is structurally related to the outcome variable. Success in this context requires that the method provide evidence that the selected predictor is a poor predictor of the outcome.
Anticipating that replication will be required, the following function generates data with a train-validate split and a cross-validation index. It exploits the fact that the data are generated in a random order.
# Create train-validate identifiers for a 3/4 to 1/4 split
# Create fold identifiers for k-fold cross-validation.
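The data-generating function itself is not shown in the text, so the following is a minimal sketch of what such a function might look like. The name "dat.make", the argument names, and the dimensions are assumptions; the comments above describe its two identifier columns. Because the rows are generated in random order, assigning split and fold labels by row position is equivalent to a random assignment.

```r
# Hypothetical sketch of "dat.make": n observations of an outcome y and p
# predictors x1..xp, with y unrelated to every predictor by construction.
dat.make <- function(n = 1000, p = 50, k = 10) {
  dat <- data.frame(matrix(rnorm(n * p), nrow = n))
  names(dat) <- paste0("x", seq_len(p))
  dat$y <- rnorm(n)  # outcome structurally unrelated to all predictors
  # Create train-validate identifiers for a 3/4 to 1/4 split.
  # Rows are already in random order, so positional assignment is random.
  dat$train <- rep(c(TRUE, TRUE, TRUE, FALSE), length.out = n)
  # Create fold identifiers for k-fold cross-validation.
  dat$fold <- rep(seq_len(k), length.out = n)
  dat
}
```

The same generated data can then feed both selection procedures, so the comparison between them is not confounded by different random draws.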
Given a data set in the form returned by “dat.make”, the function “sse.best.tv” identifies the highest-performing predictor when a model fit on the training set is used to predict the outcome on the validation set.
# Function to return the validation error of predicting "y" by the jth "x" variable
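The body of "sse.best.tv" is not reproduced in the text; the following is a hedged sketch of one way it could work, assuming the column layout of the "dat.make" sketch above (predictors named x1..xp and a logical "train" column). The helper name "sse.tv" is an assumption introduced here for illustration.

```r
# Validation sum of squared errors when the jth predictor alone is used:
# fit on the training rows, predict on the validation rows.
sse.tv <- function(j, dat) {
  train <- dat[dat$train, ]
  valid <- dat[!dat$train, ]
  fit <- lm(reformulate(paste0("x", j), response = "y"), data = train)
  sum((valid$y - predict(fit, newdata = valid))^2)
}

# Sketch of "sse.best.tv": compute the validation SSE for each of the p
# candidate predictors and return the index of the best (smallest SSE).
sse.best.tv <- function(dat, p) {
  sses <- sapply(seq_len(p), sse.tv, dat = dat)
  list(best = which.min(sses), sse = min(sses))
}
```

Since none of the predictors is related to the outcome, the selected variable is best only by chance, and its reported validation SSE is optimistically biased because the same validation set was used to choose it.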