The file SalesData2.xlsx contains 5000 rows of
data. This is another set of data from CPCG, the local coffee and gelato shop
from Assignment 4. Each row summarizes a customer transaction, as before.
However, the set of attributes is different from the previous dataset. In
particular, this dataset has a binary attribute called return_visit that is
equal to 1 if the customer returned to CPCG at some point in the future, and 0
otherwise. This attribute will be our dependent variable (or “label” in
RapidMiner); we want to know, based on the characteristics of a transaction,
which customers are likely to return and which customers are not.
Other attributes include which of the six CPCG
Locations the transaction was from, the Time of day at which the transaction occurred,
the Discount the customer received in dollars (usually equal to zero), the
total Net Sales of the transaction in dollars, and the Tip that the customer left.
First, import the dataset into RapidMiner,
making sure to change the type of return_visit to binominal, and the role of
return_visit to label.
1. Building a logistic regression model
a) Create a process in RapidMiner that builds
a logistic regression model for this dataset. (For now, you do not need to use
Cross Validation, nor make any predictions.) However, you will first need to
convert the Location attribute into a set of dummy variables. The easiest way
to do this is with the Nominal to Numerical operator; in the “comparison
groups” list, choose Location as the “comparison group attribute,” and enter
“Bakery” as the comparison group. This will set Bakery as the default location,
and create a dummy variable for each of the other five CPCG stores.
You will also need to use the Remap Binominals
operator to force RapidMiner to code 0 as the “negative” value and 1 as the
“positive” value of return_visit.
Show a screenshot of the Process panel.
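For intuition only (RapidMiner does this step for you), the dummy coding described above can be sketched in a few lines of plain Python. The location names other than Bakery are made up for illustration, and the generated column names are an assumption, not necessarily what RapidMiner produces.

```python
# Toy rows: only "Bakery" comes from the assignment; the other
# location names are hypothetical placeholders.
rows = [
    {"Location": "Bakery", "return_visit": 0},
    {"Location": "Downtown", "return_visit": 1},
    {"Location": "Campus", "return_visit": 1},
]

def dummy_code(rows, attribute, reference):
    """Replace `attribute` with 0/1 indicator columns, one per level
    other than `reference` (the comparison group)."""
    levels = sorted({r[attribute] for r in rows} - {reference})
    coded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != attribute}
        for level in levels:
            new_row[f"{attribute} = {level}"] = 1 if r[attribute] == level else 0
        coded.append(new_row)
    return coded

coded = dummy_code(rows, "Location", "Bakery")
for row in coded:
    print(row)
```

A Bakery transaction ends up with 0 in every dummy column, which is what makes Bakery the baseline that the other locations are compared against.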
b) Run the process from part a, and show a
screenshot of the logistic regression output.
c) Based on your output from part
b, identify TWO significant independent variables. If you list more than two,
only the first two will be counted as your answer.
d) Based on your output from part b, how does
Time appear to influence the likelihood of a customer returning? (Remember that
Time indicates the time of day that the transaction occurred; higher values
mean the transaction was later in the day.)
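As a reminder of how to read a logistic regression coefficient: a coefficient b multiplies the odds of the positive class by e^b for each one-unit increase in that attribute, holding the other attributes fixed. The short Python sketch below uses invented coefficient values, not values from this dataset, and the function name is just for illustration.

```python
import math

def odds_multiplier(coefficient, increase=1.0):
    """Factor by which the odds of the positive class change when the
    attribute goes up by `increase` units."""
    return math.exp(coefficient * increase)

# Invented coefficients, purely to show the three possible cases.
for b in (-0.05, 0.0, 0.05):
    m = odds_multiplier(b)
    if m < 1:
        direction = "odds go down"
    elif m > 1:
        direction = "odds go up"
    else:
        direction = "no change"
    print(f"coefficient {b:+.2f}: odds multiplied by {m:.3f} ({direction})")
```

A negative coefficient gives a multiplier below 1, a positive coefficient gives a multiplier above 1, and a coefficient of zero leaves the odds unchanged.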
2. Evaluating classification models
In this question, you will need to use the
Cross Validation operator to evaluate the accuracy of the 0-1 classifications made
by logistic regression and two other approaches. For ALL parts of the question,
use a local random seed of 12345 in the Cross Validation operator’s parameters.
Do not remove any attributes from the dataset, even if they were insignificant
in the logistic regression model from the first question.
a) Create a process that uses Cross Validation
to evaluate your logistic regression model for this dataset. Do not change any
parameters other than setting the local random seed to 12345. Show a screenshot
of the Cross Validation process (NOT the main Process panel; I want to see
what’s happening within the Cross Validation operator).
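If it helps to picture what Cross Validation does with the random seed, here is a rough Python sketch of seeded k-fold splitting. It assumes 10 folds and a simple shuffled split; RapidMiner's own sampling will not produce the same folds as this code even with the same seed, so treat it as an illustration only.

```python
import random

def k_fold_indices(n_rows, n_folds, seed):
    """Yield (train_indices, test_indices) pairs, one per fold.
    The seed fixes the shuffle so the folds are reproducible."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into n_folds roughly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_rows=5000, n_folds=10, seed=12345))
for train_idx, test_idx in splits:
    print(len(train_idx), "training rows,", len(test_idx), "testing rows")
```

Each row is used for testing exactly once, and the model evaluated on a fold never sees that fold's rows during training.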
b) Run your process from part a. Show a
screenshot of the performance output. (It should be a table that includes an
overall accuracy percentage above it.)
c) How many total customers did the logistic
regression model predict would return?
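The performance table is a confusion matrix, and both the accuracy and the number of predicted returners can be read off its four cells. The Python sketch below shows the arithmetic with made-up counts; they are not the counts you should expect from this dataset.

```python
def summarize_confusion(tp, fp, fn, tn):
    """Return (accuracy, number predicted positive) from the four cells.
    tp: predicted 1, actually 1    fp: predicted 1, actually 0
    fn: predicted 0, actually 1    tn: predicted 0, actually 0"""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    predicted_positive = tp + fp  # everyone the model labeled as 1
    return accuracy, predicted_positive

# Made-up counts for illustration only.
accuracy, predicted_positive = summarize_confusion(tp=30, fp=10, fn=20, tn=40)
print(f"accuracy = {accuracy:.1%}, predicted to return = {predicted_positive}")
```

Note that the number predicted to return counts everyone the model labeled 1, whether or not that prediction turned out to be correct.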
d) Replace the Logistic Regression operator
with a k-Nearest Neighbors operator, set k equal to 25, and uncheck “weighted
vote.” Run the process. What is the overall accuracy of this k-NN model?
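With “weighted vote” unchecked, each of the k nearest neighbors gets one equal vote regardless of how close it is. A minimal Python sketch of that idea follows, using Euclidean distance on a tiny invented dataset. RapidMiner's operator has its own distance and tie-breaking settings, so this is not a reimplementation of it.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by an unweighted majority vote among the k
    training points closest to it (Euclidean distance)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]  # ties are broken arbitrarily

# Invented two-attribute toy data, not from SalesData2.xlsx.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
labels = [0, 0, 0, 1, 1]

print(knn_predict(points, labels, query=(0.2, 0.2), k=3))
print(knn_predict(points, labels, query=(5.5, 5.0), k=3))
```

Larger values of k, such as the 25 used here, smooth out the influence of any single unusual neighbor.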
e) Based on your results, which of these two
models is more accurate?