Description

We will revisit the Lending Club data for this week’s assignment. The company has existed since 2007 and has provided millions of personal loans since then. Lending Club announced its IPO in December 2014, after which the company came into the limelight for negative publicity. Lending Club officials were accused of taking aggressive risks by lending money to borrowers with poor creditworthiness. You are asked to study this phenomenon and determine whether the data provide clues to the validity of the claim that Lending Club behaved irresponsibly.

You are given a single combined file of “approved” loan data covering six years, which supposedly span the periods before and after the controversy.

Step 1 (30 Points)

The first step is to create two new columns as follows:

a)      Comb_Risk_One: Create a binary column by combining categories A and B (Low Risk) into one category and all the remaining categories in another (High Risk).

b)      Comb_Risk_Two: Create a binary column by combining categories A, B and C (Low Risk) into one category and all the remaining categories in another (High Risk).
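
The two recodings above can be sketched in base R as follows. This is only an illustration on a toy data frame: the column name `grade` and the labels "Low" and "High" are assumptions, so adjust them to match the actual file.

```r
# Toy stand-in for the loan data; the column name `grade` is an assumption.
loans <- data.frame(grade = c("A", "B", "C", "D", "E", "F", "G"),
                    stringsAsFactors = FALSE)

# Comb_Risk_One: A and B are Low Risk, every other grade is High Risk.
# Note: a missing grade would fall into "High" here, so check for NAs first.
loans$Comb_Risk_One <- factor(
  ifelse(loans$grade %in% c("A", "B"), "Low", "High"),
  levels = c("Low", "High")
)

# Comb_Risk_Two: A, B and C are Low Risk, every other grade is High Risk.
loans$Comb_Risk_Two <- factor(
  ifelse(loans$grade %in% c("A", "B", "C"), "Low", "High"),
  levels = c("Low", "High")
)

# Cross-tabulate to confirm each grade landed in the intended category.
table(loans$grade, loans$Comb_Risk_One)
table(loans$grade, loans$Comb_Risk_Two)
```

Fixing the factor levels as `c("Low", "High")` makes "High" the second level, which is the level that `glm()` treats as the modeled outcome in a later logistic regression.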

Now split the file into two files: one containing the loans from 2012, 2013, and 2014, and the other containing the loans from 2015, 2016, and 2017.
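
One possible way to perform this split in base R is sketched below. It assumes the issue date is stored in a text column named `issue_d` in "Mon-YYYY" form (for example "Dec-2014"); both the column name and the format are assumptions, and the output file names are hypothetical.

```r
# Toy stand-in for the combined file; `issue_d` and its format are assumptions.
loans <- data.frame(
  issue_d = c("Jan-2012", "Jun-2013", "Dec-2014",
              "Mar-2015", "Aug-2016", "Nov-2017"),
  stringsAsFactors = FALSE
)

# Keep only the part after the hyphen and convert it to an integer year.
loans$issue_year <- as.integer(sub(".*-", "", loans$issue_d))

# Two subsets, one per three-year period.
loans_pre  <- loans[loans$issue_year %in% 2012:2014, ]
loans_post <- loans[loans$issue_year %in% 2015:2017, ]

# Uncomment to write the two files (file names are hypothetical).
# write.csv(loans_pre,  "loans_2012_2014.csv", row.names = FALSE)
# write.csv(loans_post, "loans_2015_2017.csv", row.names = FALSE)
```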

Step 2 (70 Points)

The primary objective is to use the classification techniques learnt so far. Each loan is graded (A to G) based on its risk, with A being the least risky and G the riskiest category. You are asked to predict the Low and High risk categories (for the two new response variables) using various modeling techniques such as Naïve Bayes, KNN, logistic regression, and CART. Make sure to address the following:

a.       Outliers based on the independent columns (predictors)

b.      Multicollinearity

c.       Scaling and standardization of the predictors

d.      Train-test split for both files, comparing the confusion matrices on the test sets.
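
The four checks above can be sketched in base R on simulated data. The predictor names, the 1.5 × IQR outlier rule, the hand-computed VIF, and the 70/30 split are all illustrative assumptions, not requirements of the assignment.

```r
set.seed(42)  # reproducible toy data and split

# Simulated predictors; names and distributions are illustrative only.
n <- 200
loans <- data.frame(
  loan_amnt  = runif(n, 1000, 35000),
  annual_inc = rlnorm(n, meanlog = 11, sdlog = 0.5)
)
# Deliberately built to be nearly collinear with loan_amnt.
loans$installment <- loans$loan_amnt / 36 + rnorm(n, sd = 20)
predictors <- c("loan_amnt", "annual_inc", "installment")

# (a) Outliers: flag values more than 1.5 * IQR beyond the quartiles.
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}
outlier_counts <- sapply(loans[predictors], function(x) sum(is_outlier(x)))

# (b) Multicollinearity: pairwise correlations and a variance inflation
# factor, VIF = 1 / (1 - R^2), from regressing each predictor on the others.
cor_matrix <- cor(loans[predictors])
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = loans))$r.squared
  1 / (1 - r2)
})

# (d) Train-test split (70/30 is an arbitrary choice).
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

# (c) Standardize with the training means and SDs, then reuse them on test
# so that no information from the test set leaks into the scaling.
train_scaled <- scale(train[predictors])
test_scaled  <- scale(test[predictors],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

outlier_counts
round(vif, 1)
```

Scaling matters most for distance-based methods such as KNN. The same steps would be repeated for each of the two period files.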

Produce a “well documented and explained” R Markdown knit file analyzing the data, with findings on the model with the highest classification ability. Also describe the features of the categories that are not classified correctly: create a confusion matrix to identify them, and run descriptive statistics on the misclassified categories. Provide any necessary EDA and visuals to enhance understanding of your analysis.
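
A minimal sketch of the confusion matrix and the misclassification summary is given below. It uses logistic regression only because `glm()` ships with base R; the simulated predictor `dti`, the 0.5 cutoff, and the 70/30 split are assumptions, and the same pattern would apply to predictions from Naïve Bayes, KNN, or CART.

```r
set.seed(1)  # reproducible toy data and split

# Simulated data: one hypothetical predictor and a noisy binary response.
n <- 300
toy <- data.frame(dti = runif(n, 0, 40))
toy$Comb_Risk_One <- factor(
  ifelse(toy$dti + rnorm(n, sd = 5) > 20, "High", "Low"),
  levels = c("Low", "High")
)

train_idx <- sample(seq_len(n), size = 210)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# With a factor response, glm() models the probability of the second
# level, which is "High" here.
fit <- glm(Comb_Risk_One ~ dti, data = train, family = binomial)
prob_high <- predict(fit, newdata = test, type = "response")
test$predicted <- factor(ifelse(prob_high > 0.5, "High", "Low"),
                         levels = c("Low", "High"))

# Confusion matrix on the test set: rows are actual, columns are predicted.
conf_mat <- table(Actual = test$Comb_Risk_One, Predicted = test$predicted)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Descriptive statistics for the misclassified rows, split by actual class.
misclassified <- test[test$predicted != test$Comb_Risk_One, ]
conf_mat
accuracy
tapply(misclassified$dti, misclassified$Comb_Risk_One, summary)
```

Comparing these summaries with the same statistics for the correctly classified rows shows where in the predictor range the model tends to fail.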

