1 Description 

The dataset for this assignment contains the prices and other attributes of 50,000 diamonds. Your task is to perform an Exploratory Data Analysis on the dataset. Submit an R Markdown report summarising your findings together with the source code. Check Moodle for the deadline. This assignment is part of the continuous assessment and is worth 30% of the module grade.

2 Dataset

First download the dataset from Moodle. As the dataset contains 50K records, generating the plots may take a few moments. One way to speed this up is to start with a small sample and carry out the analysis on it: for example, you can pick 10,000 observations (without replacement) using the sample function. Run the following code to do so:

# Draw 10,000 row indices without replacement, then keep only those rows
s <- sample(nrow(diamonds.dataset), size = 10000, replace = FALSE, prob = NULL)
diamonds.subset <- diamonds.dataset[s, ]

The above piece of code creates a new dataset named diamonds.subset containing 10,000 observations from the diamonds dataset. You can use the sampled dataset (diamonds.subset) first to write and test your code, and then use the full dataset to complete the task given below.

REMEMBER! You must report your findings on the full dataset. In your final report, there is no need to include your findings on the sampled dataset.

When you load the dataset, you will find the following variables in the dataset:

• carat: weight of diamond (0.2 to 5.01)
• cut: quality of the cut (Fair, Good, Very Good, Premium)
• color: diamond color from D (Best) to J (Worst)
• clarity: a measurement of how clear the diamond is, from I1 (Worst), SI2, SI1, VS2, VS1, VVS2, VVS1, to IF (Best)
• table: width of top of the diamond
• x: length in mm
• y: width in mm
• z: depth in mm
• depth: total depth percentage = 2*z/(x+y)
• price: price in US dollars
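The variable definitions above can be turned into quick sanity checks. The following is a minimal sketch on made-up rows, not on the real dataset: the data frame toy, the column depth.check and the vector clarity.levels are hypothetical names, and multiplying by 100 assumes that depth is stored as a percentage rather than as a fraction.

```r
# Hypothetical toy rows standing in for the real dataset (values are made up)
toy <- data.frame(
  x = c(3.95, 4.05, 5.70),
  y = c(3.98, 4.07, 5.72),
  z = c(2.43, 2.31, 3.57),
  clarity = c("SI2", "VS1", "IF"),
  stringsAsFactors = FALSE
)

# Recompute depth from its definition 2*z/(x+y).
# Assumption: the stored value is a percentage, hence the factor of 100.
toy$depth.check <- 100 * 2 * toy$z / (toy$x + toy$y)

# Encode clarity as an ordered factor, worst to best, in the order listed above
clarity.levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")
toy$clarity <- factor(toy$clarity, levels = clarity.levels, ordered = TRUE)

print(round(toy$depth.check, 1))
print(toy$clarity < "VS2")   # TRUE where clarity is worse than VS2
```

Comparing a recomputed depth with the stored depth column is one possible way to spot rows with inconsistent dimensions; the ordered factor makes comparisons and plot legends follow the quality order instead of alphabetical order.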

3 Task

Your task is to perform EDA and calculate the strength of relationships between the variables of the dataset. Consider the following as a guideline:

1. Your first task is to clean the dataset and prepare it for analysis, e.g. by removing or replacing NAs and incorrect values. (20 points)
2. Begin your analysis with a summary of the variables (use basic statistical methods). Briefly describe your understanding. Prepare 4 plots: pie chart, bar chart, histogram, scatter plot. Each plot should display different variables (do not use the price variable yet). Each plot must have a title and meaningful labels. (20 points)
3. Focus your analysis on the price variable: (20 points)
   (a) Show the histogram of the price variable. Describe it briefly. Include summary statistics like mean, median, and variance.
   (b) Group diamonds by some price ranges (like low, medium, high, etc.) and summarise those groups separately.
   (c) Explore prices for different cut types. You might want to use the boxplot.
   (d) How are the different attributes correlated with the price? Which 3 variables are correlated the most with price?
4. List the frequencies of diamonds for various cuts and clarity levels. Create 2 scatter plots and colour the diamonds' prices by clarity and by cut. (10 points)
5. Now focus your analysis on the carat, depth, table and dimensions (x, y, z) variables: (20 points)
   (a) Compute a volume variable from x, y, z and add it to the dataset. Plot it against the price. Describe your findings.
   (b) Are the carat and volume attributes correlated? Is that a strong relationship? Draw a plot with a regression line.
   (c) Explore the relationships between the table and depth variables.
   (d) Now explore the relationships between table and the rest of the variables. Compute correlations and describe your findings.
6. In your Markdown document, you should use proper headings and commentary for each task. You can get up to 10 points for clarity and quality of the report and the source code.
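Some of the steps in the guideline can be sketched in base R as follows. This is only an illustration on synthetic data, not a model answer: the data frame toy, the planted errors, the tercile cut points for the price groups and the carat/volume coefficient are all assumptions made for the example. Appropriate cleaning rules and price ranges for the real dataset are for you to choose and justify.

```r
set.seed(1)  # arbitrary seed so the toy data is repeatable

# Hypothetical toy data; the real columns come from the Moodle file
n <- 200
toy <- data.frame(x = runif(n, 4, 8), y = runif(n, 4, 8), z = runif(n, 2, 5))
toy$carat <- 0.0061 * toy$x * toy$y * toy$z + rnorm(n, sd = 0.02)  # made-up link
toy$price <- round(4000 * toy$carat + rnorm(n, sd = 300))
toy$z[c(3, 17)] <- 0   # plant impossible dimensions
toy$price[5] <- NA     # plant a missing value

# Task 1: treat zero dimensions as missing, then drop incomplete rows
for (d in c("x", "y", "z")) {
  toy[[d]][toy[[d]] == 0] <- NA
}
clean <- toy[complete.cases(toy), ]

# Task 3(b): bin price into three groups (terciles are an assumed choice)
breaks <- quantile(clean$price, probs = c(0, 1/3, 2/3, 1))
clean$price.group <- cut(clean$price, breaks = breaks,
                         labels = c("low", "medium", "high"),
                         include.lowest = TRUE)
print(tapply(clean$price, clean$price.group, summary))

# Task 5(a)-(b): volume variable, correlation with carat, plot with fitted line
clean$volume <- clean$x * clean$y * clean$z
print(cor(clean$carat, clean$volume))
plot(clean$volume, clean$price, main = "Price vs volume (toy data)",
     xlab = "Volume (cubic mm)", ylab = "Price (US dollars)")
abline(lm(price ~ volume, data = clean), col = "red")
```

Dropping incomplete rows is the simplest option; replacing values (for example with a median) is an alternative that keeps more observations, and the report should say which approach was taken and why.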

Keep in mind the following...

• Acceptable file formats: R Markdown document (.Rmd) and its PDF output. Submit these two files as a zipped folder.
• There will be a late submission penalty as per the College policy.
• I will be very strict on plagiarism. You may want to read Griffith's plagiarism policy. If plagiarism is found, I will award ZERO to both the copier and the copyee.

