Predict Diamond Prices


The Diamonds file contains data about 9900 diamonds, including their price. Your task is to use any approach you want to predict the price of diamonds, and then tell me your approach. I will then apply your approach to another dataset of 2000 diamonds that I have, and see how well it predicts those diamonds’ prices. That’s all!

Okay, it’s not THAT simple. But it’s still pretty simple.


You can try out whatever operators from class that you want in RapidMiner, with whatever parameters you want. I strongly recommend that you use a Cross Validation operator, and try several approaches to see what gets you the lowest RMSE. You can also use operators like Filter Examples, Select Attributes, Nominal to Numerical (three of the attributes are qualitative), or any other changes you’d like to make to the dataset. However, here’s the crucial part:

Whatever you do, you must be able to show or tell me clearly enough so that I can replicate it exactly!

It’s not enough to say “We used the Fortune Teller operator after removing three attributes.” You need to tell me which three attributes you removed, and what parameters you changed in the Fortune Teller operator. (You could also include a screenshot of the Fortune Teller parameters instead of writing out the individual changes.)
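To make the cross-validation advice above concrete, here is a minimal sketch in Python rather than RapidMiner. It compares two approaches by their average RMSE over 10 folds. Everything in it is an assumption made for illustration: the data is synthetic (a made-up carat-to-price relationship, not the Diamonds file), and the names `fit_mean`, `fit_linear`, and `cross_validated_rmse` are hypothetical helpers, not RapidMiner operators. Note that the fold count and random seed are written out explicitly, which is the level of detail a replicable description needs.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def fit_mean(xs, ys):
    """Naive approach: always predict the mean training price."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Least-squares line through one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_validated_rmse(xs, ys, fit, k=10, seed=42):
    """Average RMSE over k folds; each fold is held out exactly once."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_scores.append(
            rmse([ys[i] for i in fold], [predict(xs[i]) for i in fold])
        )
    return sum(fold_scores) / len(fold_scores)


# Synthetic stand-in data (NOT the Diamonds file): price grows with carat.
rng = random.Random(0)
carats = [rng.uniform(0.2, 2.5) for _ in range(200)]
prices = [4000 * c + rng.gauss(0, 300) for c in carats]

mean_rmse = cross_validated_rmse(carats, prices, fit_mean)
linear_rmse = cross_validated_rmse(carats, prices, fit_linear)
print(f"mean-only RMSE: {mean_rmse:.0f}")
print(f"linear RMSE:    {linear_rmse:.0f}")
```

Whichever approach gives the lower cross-validated RMSE is the better candidate to submit, since the held-out folds play the role of the 2000 diamonds you do not have.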

That explanation of your approach is due via Blackboard on Sunday, 48 hours after the start of our normal Friday class period. If you’re working individually, that’s all you need to submit. If you’re working in a group, two important things:

1. Only one group member needs to submit the explanation on Blackboard, but all group members’ full names must be listed in the submission.
2. Each group member must complete the peer assessment survey on Blackboard. It’s short. If you are in a group and you do not complete this survey, you will not get credit for the activity.



40% for submitting a clear explanation of your approach that I can understand and replicate.

30% for your approach outperforming a bad naïve approach that I created. This is simply a check to make sure you’re doing something reasonable; your predictions don’t have to be great to get full credit here.

30% for your prediction results. This will be based on the RMSEs of the whole class’ predictions for the prices of the 2000 other diamonds (which you do NOT have). The score will be determined as follows:

The most accurate set of predictions in the class (lowest RMSE on my data) will get 30/30.
The 2nd most accurate will get 29/30.
The 3rd most accurate will get 28/30.


Everyone else’s score will be calculated based on the following formula using your RMSE:

30 – 3*(RMSE/X),

where X is the RMSE of a good approach that I created. If your RMSE is equal to mine, this formula works out to 27. If your RMSE is twice mine, it works out to 24. If your RMSE is more than twice mine, that’s a bad sign, and you probably didn’t get a 30/30 on the previous part (outperforming the naïve approach).
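As a quick check of the arithmetic, the formula can be written as a small Python function. The function name `prediction_score` and the RMSE values below are placeholders chosen for illustration; the actual value of X is not given here.

```python
def prediction_score(rmse, reference_rmse):
    """Score out of 30 for everyone outside the top three.

    rmse           -- your RMSE on the 2000 held-out diamonds
    reference_rmse -- X, the RMSE of the instructor's good approach
    """
    return 30 - 3 * (rmse / reference_rmse)


# Placeholder numbers only: X = 500 is an assumed value, not the real one.
print(prediction_score(500, 500))   # RMSE equal to X
print(prediction_score(1000, 500))  # RMSE twice X
```

With these placeholder inputs the two calls reproduce the 27 and 24 worked out above. The top three submissions receive 30, 29, and 28 by rank, so the formula only applies from fourth place down.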

If you submit after the deadline, your score for the prediction results will be determined using the formula above, regardless of how your RMSE compares to your classmates’.
