Use sklearn.cluster.KMeans to do clustering on the given data set points.csv.


Description

1 Clustering & Classification (60pt) 

1. Use sklearn.cluster.KMeans to do clustering on the given data set points.csv. There are 4 clusters in this data set. Draw a scatter plot for the data and use color to indicate their clusters. 


2. Treating the cluster assignments from your KMeans model as the ground-truth labels, randomly split the data set into training data (80%) and testing data (20%). Create a linear SVM classifier and train it on the training set. Use a confusion matrix to evaluate its performance on the testing set. 


3. Treating the labels in labels.csv as the ground-truth labels, repeat the second question. Compare the performance of the two classifiers, discuss what you observe, and explain how you would account for it. 


4. (Bonus 10pt) Use the tensorflow.keras API to create a fully connected neural network model and repeat the second question. Draw a plot showing how the loss changes as the number of training steps increases. 
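
Questions 1 and 2 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a reference solution: points.csv is not available here, so the sketch generates stand-in data with make_blobs (400 two-dimensional points is an assumption about the file), and LinearSVC is one of several ways to build a linear SVM in sklearn. The random_state values are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in for points.csv (assumption): 400 two-dimensional points in 4 blobs.
# With the real file, the points would be loaded from disk instead.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# Question 1: cluster the points into 4 groups.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Question 2: use the cluster ids as labels and make a random 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.2, random_state=0
)

# Train a linear SVM on the training portion only.
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

# Rows are true cluster ids, columns are predicted ids.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1, 2, 3])
print(cm)
```

For the scatter plot, one option is matplotlib's plt.scatter(X[:, 0], X[:, 1], c=cluster_labels), which colors each point by its cluster id. For question 3, the same split-and-train steps would apply with the labels from labels.csv passed in place of cluster_labels.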


2 Regression (40pt) 

1. In this question, we are going to use the diabetes data set. Use sklearn.datasets.load_diabetes() to load the data and labels. 

2. Randomly split the data into training set (80%) and testing set (20%). 

3. Create a linear regression model using sklearn and fit it to the training data. Evaluate your model on the test data. Report all the coefficients and the R-squared score. 

4. Use 10-fold cross validation to fit and validate your linear regression model on the whole data set. Print the score for each fold. 

5. (Bonus 3pt) Use sklearn to create a RandomForestRegressor model and fit it to the training data. 

6. (Bonus 7pt) Use Grid Search to find the optimal hyper-parameters (max_depth: {None, 7, 4} and min_samples_split: {2, 10, 20}) for RandomForestRegressor.
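
The regression steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the random_state values, n_estimators=50, and the 3-fold setting inside the grid search are assumptions made to keep the run short and repeatable, and none of them comes from the assignment.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Load features and targets, then make a random 80/20 split.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Linear regression: fit on the training set, score on the test set.
lin = LinearRegression().fit(X_train, y_train)
test_r2 = lin.score(X_test, y_test)  # score() returns R-squared for regressors
print("coefficients:", lin.coef_)
print("intercept:", lin.intercept_)
print("test R-squared:", test_r2)

# 10-fold cross validation on the whole data set; one R-squared per fold.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=10)
print("fold scores:", cv_scores)

# Grid search over the two hyper-parameters named in the bonus question.
param_grid = {
    "max_depth": [None, 7, 4],
    "min_samples_split": [2, 10, 20],
}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),  # assumed settings
    param_grid,
    cv=3,  # assumed; the assignment does not fix the number of folds here
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
```

GridSearchCV refits the best combination on the full training set by default, so search.score(X_test, y_test) gives a test R-squared that can be compared directly with the linear model's.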

