

Description

The file SalesData2.xlsx contains 5000 rows of data. This is another set of data from CPCG, the local coffee and gelato shop from Assignment 4. Each row summarizes a customer transaction, as before. However, the set of attributes is different from the previous dataset. In particular, this dataset has a binary attribute called return_visit that is equal to 1 if the customer returned to CPCG at some point in the future, and 0 otherwise. This attribute will be our dependent variable (or “label” in RapidMiner); we want to know, based on the characteristics of a transaction, which customers are likely to return and which customers are not.

 

Other attributes include which of the six CPCG Locations the transaction was from, the Time of day at which the transaction occurred, the Discount the customer received in dollars (usually equal to zero), the total Net Sales of the transaction in dollars, and the Tip that the customer gave.

 

First, import the dataset into RapidMiner, making sure to change the type of return_visit to binominal and the role of return_visit to label.

 

1. Building a logistic regression model

 

a) Create a process in RapidMiner that builds a logistic regression model for this dataset. (For now, you do not need to use Cross Validation, nor make any predictions.) However, you will first need to convert the Location attribute into a set of dummy variables. The easiest way to do this is with the Nominal to Numerical operator; in the “comparison groups” list, choose Location as the “comparison group attribute,” and enter “Bakery” as the comparison group. This will set Bakery as the default location, and create a dummy variable for each of the other five CPCG stores.
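
To make the dummy coding concrete, here is a minimal Python sketch of the idea. It is not RapidMiner's implementation. Only "Bakery" comes from the assignment; the other location names ("Downtown", "Campus"), the column naming, and the helper `dummy_code` are hypothetical.

```python
def dummy_code(values, baseline):
    """Return one 0/1 column per category other than the baseline."""
    # Sorted so the column order is deterministic.
    categories = sorted(set(values) - {baseline})
    return {
        f"Location = {cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical location names; only "Bakery" appears in the assignment.
locations = ["Bakery", "Downtown", "Campus", "Bakery", "Campus"]
dummies = dummy_code(locations, baseline="Bakery")

for name, column in dummies.items():
    print(name, column)
```

A row from the comparison group (Bakery) has 0 in every dummy column, so each dummy's coefficient is read relative to Bakery.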

 

You will also need to use the Remap Binominals operator to force RapidMiner to code 0 as the “negative” value and 1 as the “positive” value of return_visit.

 

Show a screenshot of the Process panel.

 

b) Run the process from part a, and show a screenshot of the logistic regression output.

 

c) Based on your output from part b, identify TWO significant independent variables. If you list more than two, only the first two will be counted as your answer.

 

d) Based on your output from part b, how does Time appear to influence the likelihood of a customer returning? (Remember that Time indicates the time of day that the transaction occurred; higher values mean the transaction was later in the day.)
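
As a reminder of how a logistic regression coefficient is usually read, the sketch below converts a coefficient into a multiplier on the odds of return_visit = 1. The coefficient values are hypothetical and are not taken from the assignment's output; `odds_multiplier` is an illustrative helper.

```python
import math

def odds_multiplier(coefficient, delta=1.0):
    """Factor by which the odds of return_visit = 1 change when an
    attribute increases by `delta`, holding the other attributes fixed."""
    return math.exp(coefficient * delta)

# Hypothetical coefficients, NOT taken from the assignment's output.
for beta in (-0.05, 0.0, 0.05):
    print(f"coefficient {beta:+.2f} -> odds multiplied by {odds_multiplier(beta):.3f}")
```

A negative coefficient gives a multiplier below 1 (lower odds of returning as the attribute increases), a positive one gives a multiplier above 1, and a coefficient of zero leaves the odds unchanged.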

 

 

 

 

2. Evaluating classification models

 

In this question, you will need to use the Cross Validation operator to evaluate the accuracy of the 0-1 classifications made by logistic regression and two other approaches. For ALL parts of the question, use a local random seed of 12345 in the Cross Validation operator’s parameters. Do not remove any attributes from the dataset, even if they were insignificant in the logistic regression model from the first question.

 

a) Create a process that uses Cross Validation to evaluate your logistic regression model for this dataset. Do not change any parameters other than setting the local random seed to 12345. Show a screenshot of the Cross Validation process. (NOT the main Process panel; I want to see what’s happening within the Cross Validation operator).
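
The role of the local random seed can be illustrated with a small Python sketch of k-fold splitting. It assumes 10 folds and a plain shuffled split; it will not reproduce the exact folds RapidMiner generates, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_rows, n_folds=10, seed=12345):
    """Yield (train_indices, test_indices) for each fold."""
    indices = list(range(n_rows))
    # A fixed seed gives the same shuffle, and so the same folds, on every run.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_rows=5000))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 4500 500
```

Every row is used for testing exactly once, and the model evaluated in each fold is trained only on the other nine folds.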

 

b) Run your process from part a. Show a screenshot of the performance output. (It should be a table that includes an overall accuracy percentage above it.)

 

c) How many total customers did the logistic regression model predict would return?
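
The count asked for here can be read off the performance table. The sketch below assumes the table lists predictions in rows and true classes in columns, and uses hypothetical counts that are not results from SalesData2.xlsx.

```python
def summarize(confusion):
    """confusion[predicted][actual] -> count, with classes 0 and 1."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = confusion[0][0] + confusion[1][1]
    # Everyone in the "predicted 1" row, whether or not they truly returned.
    predicted_return = sum(confusion[1].values())
    return {"accuracy": correct / total, "predicted_return": predicted_return}

# Hypothetical counts, not results from SalesData2.xlsx.
confusion = {
    0: {0: 50, 1: 10},  # predicted 0: 50 truly 0, 10 truly 1
    1: {0: 15, 1: 25},  # predicted 1: 15 truly 0, 25 truly 1
}
print(summarize(confusion))  # {'accuracy': 0.75, 'predicted_return': 40}
```

The number of customers predicted to return is the sum of the whole "predicted 1" row, which includes both correct and incorrect predictions.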

 

d) Replace the Logistic Regression operator with a k-Nearest Neighbors operator, set k equal to 25, and uncheck “weighted vote.” Run the process. What is the overall accuracy of this k-NN model?
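
For intuition about what an unweighted vote means, here is a minimal k-NN sketch in Python. It assumes Euclidean distance on unscaled attributes and uses hypothetical toy data with k = 3; the assignment itself uses k = 25 on the full dataset. `knn_predict` is an illustrative helper, not RapidMiner's operator.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by an unweighted majority vote of its k nearest rows."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Each of the k neighbors gets one vote, regardless of how close it is.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (Time, Net Sales) -> return_visit.
train_X = [(8, 4.0), (9, 5.5), (10, 6.0), (17, 12.0), (18, 15.0), (19, 11.0)]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, query=(9, 5.0), k=3))    # 1
print(knn_predict(train_X, train_y, query=(18, 13.0), k=3))  # 0
```

With a weighted vote, closer neighbors would count for more; unchecking that option makes all k neighbors count equally. An odd k such as 25 avoids ties between the two classes.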

 

e) Based on your results, which of these two models is more accurate?

 

