Using the data in the Excel file, Sales Data, perform K-Means Clustering on the data.



ITEC 320 Homework 3: Clustering and Classification

Due: November 17


1)      Using the data in the Excel file, Sales Data, perform K-Means Clustering on the data.  Use all the attributes as input except the Customer and Percent Gross Profit attributes.  Review the clusters and create plots of various variable combinations.

Interpret your results in detail.  Do you see any interesting patterns?
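For orientation only, the K-Means procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the tool or workflow the assignment expects: the points are invented toy values rather than rows from Sales Data, the function names are hypothetical, and the centroids are simply seeded with the first k points (real tools usually pick random or smarter starting points).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Return (centroids, assignments) after alternating assign/update steps."""
    # Assumption for this sketch: seed the centroids with the first k points.
    centroids = list(points[:k])
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Invented toy data: two visibly separate groups of two-attribute rows.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

Attributes on very different scales (for example, dollars versus counts) would normally be normalized first, since the distance calculation otherwise lets the largest-valued attribute dominate.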



2)      You are on an analytics team at the Really Big Financial Corporation.  You market specialized financial products geared toward different income levels of potential customers.

You do not want to waste time and money marketing the products to individuals who are not a good fit based on their income level.

You have downloaded and cleaned up a set of data from the US Census Bureau.  The data includes demographic and other data from a Census survey.

·         From the dataset “Census Data”, try to predict who makes more than $50,000 per year and who makes less.

·         Use Training and Testing Sets

·         The Target attribute is “Income”:  <= 50K, >50K.

·         Try both Decision Trees and K-NN.

·         How accurate are your predictions?
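The train/test idea in the steps above can be sketched as follows. This is a hedged, standard-library illustration of K-NN only (no Decision Tree), assuming a simple random hold-out split. The rows are invented two-attribute toy values that reuse the two Income labels; they are not taken from Census Data, and the helper names (`split_rows`, `knn_predict`, `accuracy`) are hypothetical.

```python
import random
from collections import Counter


def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def knn_predict(train, features, k=3):
    """Predict the majority label among the k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-out rows whose predicted label matches the true one."""
    hits = sum(1 for features, label in test if knn_predict(train, features, k) == label)
    return hits / len(test)


# Invented toy rows: (two numeric attributes, Income label).
rows = [((20 + i, 20 + i), "<=50K") for i in range(10)]
rows += [((50 + i, 50 + i), ">50K") for i in range(10)]

train, test = split_rows(rows)
acc = accuracy(train, test)
print(f"train={len(train)} test={len(test)} accuracy={acc:.2f}")
```

Because the two toy groups are cleanly separated, the accuracy here is trivially perfect and says nothing about the real data. The point is only that accuracy is measured on rows the model never saw during training.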


Note: This is a large dataset (over 32,000 rows), so K-NN runs a little slowly; it may take 30 seconds or so to run.


3)       For the Titanic problem from the Lab, now try Support Vector Machine (SVM) Classification.   How does its accuracy compare with that of Decision Trees and K-NN?

·         Caution: SVM only works with numeric input attributes, so use the dataset Titanic passengers numeric, in which sex is coded as 1 = Female, 0 = Male.

·         Note: SVM does not handle missing values, so you will need to use the Replace Missing Values Operator before the Select Attributes Operator to replace missing values for Age with the average.
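The two preparation steps just described (coding sex as 1/0 and replacing missing Age values with the average) can be sketched in plain Python. This is only an illustration of what those steps do, under the assumption that a missing value is represented as `None`. The three passenger records are invented, and the helper names are hypothetical rather than part of any tool.

```python
def encode_sex(value):
    """Code sex as in the numeric dataset: 1 = Female, 0 = Male."""
    return 1 if value.strip().lower() == "female" else 0


def replace_missing_with_mean(rows, column):
    """Return copies of the rows with None in `column` replaced by the column mean."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


# Invented example records; None marks a missing age.
passengers = [
    {"sex": "female", "age": 30.0},
    {"sex": "male", "age": None},
    {"sex": "male", "age": 20.0},
]

encoded = [{**p, "sex": encode_sex(p["sex"])} for p in passengers]
cleaned = replace_missing_with_mean(encoded, "age")
print(cleaned)
```

The average is computed from the known ages only, so the filled-in value (25.0 in this toy case) does not shift the column mean.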

For each of the above problems, please interpret your results and provide all supporting model output and diagrams (e.g., clusters, cluster diagrams, decision trees, performance matrix results).  If you feel ambitious, try Naïve Bayes and Neural Networks as well.
