Please include a full explanation of every step so that I can understand it.

1. Delete every column in the data_hw8 file that is shaded in data_dictionary.

2. For column G (Ans.1st) in the data_hw8 file, remove every row whose value in column G is not 1.

Ex. Delete row 16, since its value in column G is not 1.

3. Use column I (Yes_No) as the response variable. As row 8 of data_dictionary shows, it has already been classified: {123} as Yes and {45} as No.

4. Check data_dictionary: every row highlighted in green or yellow is a dummy variable. For a categorical variable with k levels, use only k-1 of its dummy variables.

Ex. For columns AJ to AO (NYHA to NYHA_5), use only the dummy variables NYHA_1, NYHA_2, NYHA_3, and NYHA_3b, since 5-1=4. The professor also mentioned that NYHA_1=1 when NYHA=1.0 and 0 otherwise; NYHA_2=1 when NYHA=2.0 and 0 otherwise; NYHA_3=1 when NYHA=3.0 and 0 otherwise; and NYHA_3b=1 when NYHA=3.5 and 0 otherwise.

5. Some variables take the same value in almost every row. If a single value accounts for 97% or more of a variable's observations, don’t include that variable.

6. Delete the values -99, 99, 88, -88, and null. Delete only the value, not the entire row. However, the professor said that in some cases these values should not be deleted. For example, if a variable has values like 77, 76, 55, 66, 88, 99, ..., then 88 and 99 are real values and should not be deleted. In a categorical variable, though, they should be deleted. There may be additional cases like this.

7. About null values: the professor said to remove the entire row whenever either y or x contains a null, so that y and x keep the same length. He said the rows are removed temporarily, not in the data itself; I don't quite understand what he means by that.

Ex. x = 1 2 3 4 5, y = 1 2 3 Null 5 becomes x = 1 2 3 5, y = 1 2 3 5.

That is, remove the null values while keeping the independent variable and the dependent variable the same length.

Everything above is the data cleaning. This part is important.
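To make the cleaning steps above concrete, here is a minimal sketch in plain Python (standard library only, so it runs anywhere). It is only an illustration under stated assumptions: the data is assumed to be already loaded as a list of row dictionaries, the column names Smoker, Age, and Site and the helper function names are hypothetical, and deciding which columns the missing-value codes apply to still has to be done by hand from data_dictionary.

```python
from collections import Counter

# Step 6: codes that stand for "missing". They are only blanked in columns
# where 88/99 cannot be real values (which columns those are is an assumption
# that must be checked against data_dictionary).
MISSING_CODES = {-99, 99, 88, -88}


def keep_first_answers(rows, flag="Ans.1st"):
    """Step 2: keep only the rows whose flag column equals 1."""
    return [r for r in rows if r.get(flag) == 1]


def blank_missing_codes(rows, columns):
    """Step 6: replace missing-value codes with None, in the given columns only."""
    cleaned = []
    for r in rows:
        r = dict(r)  # copy, so the original rows are left untouched
        for c in columns:
            if r.get(c) in MISSING_CODES:
                r[c] = None  # delete the value, not the row
        cleaned.append(r)
    return cleaned


def is_near_constant(rows, column, threshold=0.97):
    """Step 5: True if one value makes up `threshold` or more of the non-missing values."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return True  # nothing usable in this column
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) >= threshold


def nyha_dummies(nyha):
    """Step 4: the k-1 NYHA dummies; the level left out acts as the baseline.

    Rows with a missing NYHA should be dropped before calling this (step 7),
    otherwise they would look like the baseline level.
    """
    levels = {"NYHA_1": 1.0, "NYHA_2": 2.0, "NYHA_3": 3.0, "NYHA_3b": 3.5}
    return {name: int(nyha == level) for name, level in levels.items()}


# Tiny made-up data set (hypothetical columns: Smoker is categorical, Age is numeric).
rows = [
    {"Ans.1st": 1, "Yes_No": "Yes", "Smoker": 1, "Age": 77, "Site": 1},
    {"Ans.1st": 1, "Yes_No": "No", "Smoker": 99, "Age": 88, "Site": 1},
    {"Ans.1st": 2, "Yes_No": "Yes", "Smoker": 0, "Age": 55, "Site": 1},
    {"Ans.1st": 1, "Yes_No": "No", "Smoker": 0, "Age": 66, "Site": 1},
]
rows = keep_first_answers(rows)                        # drops the Ans.1st == 2 row
rows = blank_missing_codes(rows, columns=["Smoker"])   # Age 88 is a real age, so Age is not listed
drop_site = is_near_constant(rows, "Site")             # Site is always 1, so it would be excluded
```

Passing the column list explicitly to blank_missing_codes is what handles the professor's exception: 99 in the categorical Smoker column is blanked, while 88 in Age is kept because Age is not in the list.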

8. In each univariate logistic regression, delete the rows that are missing y or that particular variable.

9. Run the logistic regression for each dummy variable separately.

Keep only the significant variables.

10. As an additional part, after choosing the most significant variables, run a multiple logistic regression to choose the best model.

Please don't use overly complicated algorithms, and explain why you use each function and method.
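As a rough illustration of steps 7 to 9, the sketch below drops missing pairs temporarily (on copies, so the data itself is unchanged), fits one logistic regression per variable, and keeps the variables whose slope is significant. It is a hand-written Newton's method fit in plain Python, shown only to make the idea visible; a statistics package would normally do the fitting. The function names, the made-up data, and the 0.05 cutoff are assumptions, and y is assumed to be coded 1 for Yes and 0 for No.

```python
import math


def drop_missing_pairs(x, y):
    """Steps 7-8: keep only the positions where both x and y are present.

    New lists are returned, so the original columns are not modified.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]


def fit_univariate_logit(x, y, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton's method.

    Returns (b0, b1, p_value); p_value is the two-sided Wald test of b1 = 0.
    Assumes x is not constant and does not perfectly separate the y classes.
    """
    x, y = drop_missing_pairs(x, y)
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian) with weights p * (1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add inverse(information) * gradient to the coefficients
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)  # standard error of the slope
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return b0, b1, p_value


def significant_variables(columns, y, alpha=0.05):
    """Step 9: one univariate fit per variable; keep those with p < alpha."""
    kept = {}
    for name, x in columns.items():
        _, _, p_value = fit_univariate_logit(x, y)
        if p_value < alpha:
            kept[name] = p_value
    return kept


# The example from step 7: the pair containing the null is dropped from both lists.
x_clean, y_clean = drop_missing_pairs([1, 2, 3, 4, 5], [1, 2, 3, None, 5])

# Made-up predictor and 0/1 response, with one missing value in each.
x = [1, 2, 3, 4, 5, 6, 7, 8, None, 4]
y = [0, 0, 1, 0, 1, 1, 0, 1, 1, None]
b0, b1, p_value = fit_univariate_logit(x, y)
```

Because the missing rows are dropped inside each fit, every variable uses as many rows as it has available, which seems to be what "temporarily removed" refers to. The variables returned by significant_variables would then be the candidates for the multiple logistic regression in step 10.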
