## This assignment focuses on material from the end of Chapter 7 and the beginning of Chapter 9. In Chapter 8, various models for Classification were introduced, but it is not practical to try using any of them manually or in Excel.

##### Description

Assignment #4

Due: 11 pm, Friday, November 20th

This assignment focuses on material from the end of Chapter 7 and the beginning of Chapter 9.  In Chapter 8, various models for Classification were introduced, but it is not practical to try using any of them manually or in Excel.  Nonetheless, the concepts and ideas behind these models are important to understand.

This assignment is based upon the student surveys from fall 2019 and 2020.  You looked at this data in Assignment 2.  In particular, you looked at pivot tables that involved percentages.  These percentages can be interpreted as probabilities.  We will revisit these analyses, but manually.
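To illustrate how pivot-table percentages can be read as probabilities, here is a minimal Python sketch. The counts are made up for illustration and are not taken from the survey file; the function name `row_percentages` is also just an assumption for this example. Each row of counts is divided by its row total, so each entry estimates a conditional probability such as P(home region | class year).

```python
# Hypothetical pivot-table counts (NOT the real survey numbers):
# rows are class year, columns are home region.
counts = {
    "2019": {"Nova Scotia": 30, "Canada": 15, "International": 5},
    "2020": {"Nova Scotia": 20, "Canada": 10, "International": 10},
}


def row_percentages(table):
    """Convert each row of counts into proportions that sum to 1."""
    result = {}
    for row_label, row in table.items():
        total = sum(row.values())
        result[row_label] = {col: n / total for col, n in row.items()}
    return result


probs = row_percentages(counts)

# Each entry estimates P(home region | class year).
for year, row in probs.items():
    print(year, {home: round(p, 2) for home, p in row.items()})
```

Dividing by column totals instead would give P(class year | home region), which answers a different question, so it is worth deciding which way the percentages should run before interpreting them.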

You also looked at average summer earnings and found some differences depending upon where students were from.  We will look at this topic again, but using regression analysis with nominal variables.  For the regression analysis, missing values interfere with the necessary calculations.  Regression analysis does not permit missing values for any variable in the model.  For this reason, I have “cleaned” the file by removing records for students who did not report earnings or had earnings of zero.  I have also removed or edited some records that had values that appeared to be invalid.  The edited file is called Survey 2019 and 2020 Assignment4.xlsx

You will observe that approximately 80 observations have been deleted because of missing, zero or invalid summer earnings.
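The cleaning rule described above (drop records with missing or zero earnings) can be sketched in Python as follows. The records and field names here are hypothetical, and the sketch does not attempt to reproduce the manual edits made to records with invalid values.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "earnings": 5200.0},
    {"id": 2, "earnings": None},   # did not report earnings
    {"id": 3, "earnings": 0.0},    # reported zero earnings
    {"id": 4, "earnings": 8100.0},
]


def has_usable_earnings(record):
    """Keep a record only if earnings are present and greater than zero."""
    value = record["earnings"]
    return value is not None and value > 0


cleaned = [r for r in records if has_usable_earnings(r)]
removed = len(records) - len(cleaned)
print(f"kept {len(cleaned)} records, removed {removed}")
```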

In Assignment #2 we observed that summer earnings were lower for international students and highest for Nova Scotian students.  Home was a nominal variable with many values.  Even reducing it to 3 values (Nova Scotia, Canada, and International) makes it a new challenge to include in a regression model.  Two new variables have been created.

Intl = 1 if the student is from outside Canada and 0 otherwise.

Canada = 1 if the student is Canadian but from outside Nova Scotia and 0 otherwise.  Nova Scotian students therefore have Intl = 0 and Canada = 0.
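As a small sketch of this coding, the Python function below maps the three-value Home grouping onto the two indicator variables. The function name `home_dummies` and the exact group labels are assumptions for illustration; in the file the variables have already been created for you.

```python
def home_dummies(home_group):
    """Return (Intl, Canada) for one of the three home groups.

    Nova Scotia is the baseline group and is coded (0, 0).
    """
    intl = 1 if home_group == "International" else 0
    canada = 1 if home_group == "Canada" else 0
    return intl, canada


for group in ["Nova Scotia", "Canada", "International"]:
    print(group, home_dummies(group))
```

Three groups need only two indicators: once Intl and Canada are both known to be 0, the student must be from Nova Scotia, so a third indicator would add no new information.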

We had also observed that summer earnings appeared to be lower in 2020 than in 2019.

Class01 = 1 if the student is in the 2020 class and 0 if s/he is in the 2019 class.

Often when looking at earnings data, we observe that earnings are lower for females.

Gender01 = 1 if female and 0 if male.

Two numeric variables with few missing values were age and high school average.  Older students are likely to have more work experience and thus may have higher earnings.  Some may actually be working full time and studying part time.

Does high school average have any relationship with summer earnings?  I don’t know, but I chose to look and see.
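To see how these variables could fit together, here is one possible form of the regression model, written as a sketch only. It assumes all six inputs are included at once, and the names Earnings, Age, and HSAvg are placeholders rather than the column names in the file.

$$
\text{Earnings} = \beta_0 + \beta_1\,\text{Intl} + \beta_2\,\text{Canada} + \beta_3\,\text{Class01} + \beta_4\,\text{Gender01} + \beta_5\,\text{Age} + \beta_6\,\text{HSAvg} + \varepsilon
$$

Under this form, each coefficient on a 0/1 variable is a shift in expected earnings relative to the group coded 0, with the other variables held fixed.  For example, $\beta_1$ compares international students with Nova Scotian students, and $\beta_3$ compares the 2020 class with the 2019 class.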

1. In Assignment 3, you chose the first variable by looking at the correlation between the outcome and the input variables.  Several of the input variables are binary.  Correlation measures linear relationships.  Does it make sense to calculate correlations with binary variables?  If you recall, our first model for classification had a binary outcome, but we still fit the model.  What happens here?
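One way to explore this question is to compute the correlation by hand on a tiny example. The Python sketch below uses made-up earnings for six students (not survey data) and a 0/1 variable. It computes the ordinary Pearson correlation, the slope of a one-variable regression, and the difference in group means, so you can compare them.

```python
from math import sqrt


def pearson(x, y):
    """Ordinary Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simple_slope(x, y):
    """Least-squares slope for a regression of y on a single input x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: a 0/1 indicator and made-up summer earnings.
intl = [0, 0, 0, 1, 1, 1]
earnings = [6000, 7000, 8000, 3000, 4000, 5000]

r = pearson(intl, earnings)
slope = simple_slope(intl, earnings)

mean_0 = sum(e for i, e in zip(intl, earnings) if i == 0) / intl.count(0)
mean_1 = sum(e for i, e in zip(intl, earnings) if i == 1) / intl.count(1)

print("correlation:", round(r, 3))
print("regression slope:", slope)
print("difference in group means:", mean_1 - mean_0)
```

With a 0/1 input, the regression slope equals the difference between the two group means, and the correlation has the same sign as that difference.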