Recall the data cleaning assignments we did on the investments data. We showed that customers who had negative service encounters took out considerably more money than customers who had positive service encounters. One main limitation of that work was that it was carried out across the entire sample, without checking whether different segments of customers responded differently to bad vs. good service. The executives at the investments firm have asked you to reanalyze the data in greater detail. Specifically, they want to know whether certain customer segments are at higher risk of taking out more money. To answer this question, you will perform a cluster analysis on the data and calculate the average difference in dollar change (Chg good service – Chg bad service) within each segment, to see whether there are groups of customers for whom the firm needs to ensure consistently good service encounters.
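As a rough illustration of the per-segment quantity described above, the sketch below computes the mean Inv_Chg for good and bad service within each segment and then takes the difference. The toy values, the "segment" column, and the "good"/"bad" coding of categ are assumptions made for illustration only; the actual file may code these differently.

```r
# Toy data for illustration only; these are not values from hw3_data.txt.
toy <- data.frame(
  segment = c(1, 1, 1, 1, 2, 2, 2, 2),          # hypothetical cluster labels
  categ   = c("good", "good", "bad", "bad",
              "good", "good", "bad", "bad"),    # assumed coding of categ
  Inv_Chg = c(100, 300, -200, -400, 50, 150, 0, -100)
)

# Mean Inv_Chg for every segment-by-service combination
# (rows = segment, columns = service category).
seg_means <- tapply(toy$Inv_Chg, list(toy$segment, toy$categ), mean)

# Chg good service - Chg bad service, one value per segment.
seg_diff <- seg_means[, "good"] - seg_means[, "bad"]
print(seg_diff)
```

A larger difference suggests a segment where the gap between good and bad service matters more in dollar terms.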
You must provide all of your R code in a script file, and we must be able to replicate your results in order for you to receive full credit for this assignment. Please upload both your answers and the R file. The dataset "hw3_data.txt" contains 1759 observations and 10 variables. The most important variables are defined below:
1. “categ”: Describes whether a customer had a good or a bad customer service experience. 860 customers had a bad service experience (answered a 1 or 2 on the customer satisfaction survey), and 899 customers had a good service experience (answered a 4 or 5 on the survey). Customers answering a “3” (average service experience) were omitted from this dataset.
2. “Inv_Chg”: This variable is the primary variable of interest to the firm. It represents the change in each customer’s investment dollars from 1 month before the service encounter to 3 months after it: Inv_Chg = Inv_3M_Aft – Inv_1M_Bef.
3. “Inv_1M_Bef”: This variable represents the total investment dollars customers had with the firm 1 month before they had a service encounter and survey.
4. “cust_age”: This variable represents each customer’s age in years at the time of the survey.
5. “cust_tenure”: This variable represents how long each customer has been with the firm (measured in years) at the time of the survey.
6. “tottrans”: Total monthly transactions the customer had with the firm at the time of the survey.
7. “cust_id”: Customer ID number: unique ID field used to identify each customer.
The firm has asked you to do a segmentation (i.e., cluster analysis) on the following four variables: 1) Inv_1M_Bef, 2) cust_age, 3) cust_tenure, and 4) tottrans. The goal is to see whether you can identify segments of customers who are especially likely to take out a lot of money should a bad service encounter occur. If such segments exist, they can be used to proactively identify future customers who may also be especially at risk of disengaging after a bad customer service experience. The firm believes these four variables are especially important for clustering purposes.
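The assignment text above does not name a clustering method. As one possible illustration, the sketch below runs k-means (an assumption, not necessarily the required approach) on simulated stand-ins for the four variables. The toy data, the object names, and the choice of two clusters are all hypothetical.

```r
set.seed(42)  # k-means uses random starting centers; a fixed seed aids replication

# Simulated stand-in for the four clustering variables (not the real data).
n <- 60
toy <- data.frame(
  Inv_1M_Bef  = c(rnorm(n / 2, 20000, 3000), rnorm(n / 2, 200000, 30000)),
  cust_age    = c(rnorm(n / 2, 30, 4),       rnorm(n / 2, 60, 4)),
  cust_tenure = c(rnorm(n / 2, 3, 1),        rnorm(n / 2, 20, 3)),
  tottrans    = c(rnorm(n / 2, 12, 2),       rnorm(n / 2, 3, 1))
)

# Standardize first so that no single variable dominates the distances.
toy_scaled <- scale(toy)

# centers = 2 is an arbitrary choice for the toy data, not a recommendation.
fit <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Segment sizes and segment profiles on the original (unscaled) scale.
print(table(fit$cluster))
print(aggregate(toy, by = list(segment = fit$cluster), FUN = mean))
```

The cluster labels in fit$cluster could then be attached to the customer records so that Inv_Chg can be compared across service categories within each segment.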
Part A: Standardize Data (20 Points)
Standardize the four variables: Inv_1M_Bef, cust_age, cust_tenure, tottrans to a mean of 0 and a standard deviation of 1. Don’t replace the original data. Instead, create a new data matrix called “X.scaled” that contains the four variables that have been standardized so that they are all scaled the same way. Be sure to show your work in R.
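A minimal sketch of this kind of standardization, using base R's scale() on a small made-up data frame, is shown below. The toy values are assumptions for illustration; in the assignment the input would be the data read from "hw3_data.txt".

```r
# Toy data for illustration only; the real values come from hw3_data.txt.
toy <- data.frame(
  cust_id     = 1:5,
  Inv_1M_Bef  = c(10000, 25000, 40000, 55000, 120000),
  cust_age    = c(25, 34, 47, 58, 66),
  cust_tenure = c(1, 4, 9, 15, 22),
  tottrans    = c(14, 9, 6, 3, 1)
)

vars <- c("Inv_1M_Bef", "cust_age", "cust_tenure", "tottrans")

# scale() centers each column and divides by its sample SD (n - 1 denominator).
# It returns a new matrix, so the original data frame is left unchanged.
X.scaled <- scale(toy[, vars])

# Checks: column means should be (numerically) 0 and column SDs should be 1.
print(round(colMeans(X.scaled), 10))
print(apply(X.scaled, 2, sd))
```

Because scale() returns a separate matrix, the original columns remain available for reporting segment profiles in their natural units.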