The file bestsubset.py defines a function called bestsubset that takes two arguments, a dataframe X and array y.


Description

The file bestsubset.py defines a function called bestsubset that takes two arguments, a dataframe X and an array y. If X has k columns, then there are k variables under consideration for inclusion in our model y = f(X). We only consider linear models in this assignment, and bestsubset uses 5-fold cross-validation to compare every possible linear model that includes one or more of the predictors in X. The function returns a list with two elements: a list of the column indices included in the best model, and an array of coefficient estimates for that model. You should assume that X does NOT have a column for the intercept, so you should have LinearRegression fit the intercept (i.e. use the default setting, fit_intercept=True).
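To make the description concrete, here is a minimal sketch of what such a function might look like. It is not the assignment's actual bestsubset.py: the scoring metric (negative mean squared error), the choice to return only the slope coefficients without the intercept, and the final refit on the full data are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def bestsubset(X, y):
    """Exhaustive best-subset search scored by 5-fold cross-validation."""
    k = X.shape[1]
    best_score, best_cols = -np.inf, None
    for size in range(1, k + 1):                   # outer loop: subset size
        for cols in combinations(range(k), size):  # inner loop: one subset
            # Assumption: models are compared on mean squared error.
            scores = cross_val_score(
                LinearRegression(),                # fits the intercept by default
                X.iloc[:, list(cols)],
                y,
                cv=5,
                scoring="neg_mean_squared_error",
            )
            if scores.mean() > best_score:
                best_score, best_cols = scores.mean(), list(cols)
    # Assumption: refit the winning subset on all rows and return its slopes.
    model = LinearRegression().fit(X.iloc[:, best_cols], y)
    return [best_cols, model.coef_]
```

A call such as bestsubset(X, y) then returns a two-element list, for example column indices [0, 2] together with an array holding one coefficient per selected column.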

The function has a scaling problem: as the number of predictors grows, it becomes very slow. The inner loop runs once for each combination of one or more predictors, which is 2^k - 1 iterations in total. Thus, this function is O(2^k), considerably slower than any algorithm we discussed in our unit on algorithmic efficiency.
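The growth is easy to check by counting the subsets directly. The short sketch below uses a hypothetical helper, count_models, that enumerates every combination of one or more predictors and compares the count with 2^k - 1.

```python
from itertools import combinations


def count_models(k):
    """Count the candidate models: all nonempty subsets of k predictors."""
    return sum(
        1
        for size in range(1, k + 1)
        for _ in combinations(range(k), size)
    )


for k in (3, 5, 10):
    # The enumerated count should match the closed form 2**k - 1.
    print(k, count_models(k), 2**k - 1)
```

Each extra predictor roughly doubles the number of candidate models, and with 5-fold cross-validation every candidate costs five model fits.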

