This project comprises a number of small tasks. To accomplish them, you need to review the 9 provided datasets (you only need to use 4 of them for the project) and develop research questions that you find plausible and interesting. You will then use Excel to split, merge, or otherwise manipulate the relevant data, perform the analysis, and present the results. Note that not all of the information (rows, columns) needs to be used; selecting the interesting information is part of your task and will depend on your research questions.
Tasks: you need to use ALL quantitative methods introduced in this course to answer your own research questions.
The methods that you need to use include:
1) Descriptive methods, Pie Charts, Bar Charts, Line Charts, and Histograms (Lecture 2);
2) Normal Distribution (Lectures 3 and 4);
3) Simple Random Sampling (Lecture 5);
4) Sampling Distribution (Lecture 5);
5) Confidence Interval (Lecture 5);
6) Hypothesis Testing of one population (Lecture 6);
7) Hypothesis Testing of two populations (Lecture 7);
8) Association Testing, linear regression (Lectures 8 and 9).
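Methods 3) to 5) above can also be illustrated outside Excel. The following Python sketch draws a 5% simple random sample from a synthetic column of 10,000 values and computes a 95% confidence interval for the mean. The data, the seed, and the names `population`, `sample_fraction`, and `confidence_interval` are assumptions made for illustration and are not part of the provided datasets. The interval uses the normal approximation, which is reasonable for a sample of this size.

```python
import math
import random
import statistics
from statistics import NormalDist


def confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Synthetic stand-in for one numeric column (hypothetical values).
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Simple random sampling: every row has the same chance of being selected.
sample_fraction = 0.05
sample = rng.sample(population, int(len(population) * sample_fraction))

low, high = confidence_interval(sample)
print(f"n = {len(sample)}, sample mean = {statistics.mean(sample):.2f}")
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

In Excel, comparable steps are commonly carried out with RAND() to randomize the row order and with AVERAGE, STDEV.S, and CONFIDENCE.NORM to build the interval.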
You need to use at least 4 different datasets to develop your research questions with the quantitative methods. Your research questions do not need to be connected; they can be separate questions on different datasets. (You are also free to use other datasets of interest, but they must contain more than 10,000 rows and 5 columns. Small datasets are not accepted.)
You need to write a short report that summarizes your research questions and the datasets you have used. The report should contain: an introduction, research questions with brief motivations, the corresponding datasets, a brief methodology, a discussion of the results, and conclusions.
1. The report should contain 1200 words (excluding graphs, images and tables).
2. Separate Excel files, with and without formulas, should also be submitted.
RESEARCH QUESTION EXAMPLES
The following are some example research questions using the provided datasets. They are intended only as inspiration. You may simply use the following questions, but you are also strongly encouraged to develop more questions based on your own interests and investigation.
Are the ratings of electronics products and home and office accessories on Amazon the same? (Two population hypothesis testing)
Are the ratings after 2017 better than the average rating before 2017? (One population hypothesis testing)
Illustrate the average wage of football stars across different countries. (Using Pie chart, Bar chart, ...)
Illustrate the histograms of Spanish/Brazilian football stars’ wage/ball control/dribbling.
If we randomly select 5% of the football stars from the dataset as a small sample, can we find the confidence interval of their wages? (confidence interval)
Are Spanish football stars’ wages higher than English football stars’ wages? (Two population hypothesis testing)
Are English footballers faster than Spanish footballers? (Two population hypothesis testing)
Is the dribbling score of Brazilian footballers higher than the average dribbling score of English footballers? (One population hypothesis testing)
Can factors such as Acceleration, Aggression, Agility, Balance, Ball control, Dribbling, etc. explain a footballer’s wage? Which factor is most significant? (linear regression)
Does shooting from “shot_place” #3 have a higher probability of a goal than shooting from all other places? (histogram, normal distribution)
Where does Lionel Messi like to shoot the most? (histogram, normal distribution) Where is Cristiano Ronaldo most likely to score? (histogram, normal distribution)
Do the factors such as “shot_place”, “shot_outcome”, “location”, “assist_method”, etc. significantly contribute to a goal? (logistic regression)
Do hotels in the UK have higher review scores than hotels in the US? (Two population hypothesis testing)
Can the negative review word counts and positive review word counts explain the review score? (linear regression)
Are the scores of restaurants having violation code F001 lower than the scores of restaurants having violation code F030? (Two population hypothesis testing)
Among the restaurants with scores higher than 90, which rule are they most likely to violate? (histogram, normal distribution)
Is Spanish wine more expensive than US wine? (Two population hypothesis testing)
Are sales during holidays higher than during non-holidays? (Two population hypothesis testing)
Is there any relation between the temperature and the sales? Is there any relation between the fuel price and sales? Do the “Unemployment %”, “IsHoliday”, “CPI”, “ Temperature (F)”, “Fuel_Price”, etc. explain the sales? (Association Testing, linear regression)
Do factors such as “host_response_rate”, “is_location_exact”, “minimum_nights”, “maximum_nights”, etc. explain the Airbnb price? (Association Testing, linear regression)
Do people like videos in category 10 more than videos in category 2? (Two population hypothesis testing)
Is there any relationship between the “views” and “likes”? If there is, is it positive or negative? (Association Testing, linear regression)
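Two method tags that recur in the examples above, two-population hypothesis testing and linear regression, can be sketched in Python as follows. All numbers are synthetic, and the names `welch_t`, `fit_line`, `likes_cat10`, `likes_cat2`, `views`, and `likes` are hypothetical stand-ins rather than columns of the provided datasets. The test shown is Welch's unequal-variance version with a normal approximation for the p-value, which is one reasonable choice for large samples and may differ from the variant taught in the lectures.

```python
import math
import random
import statistics
from statistics import NormalDist


def welch_t(a, b):
    """Welch test statistic and one-sided p-value for H1: mean(a) > mean(b)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    t = (statistics.mean(a) - statistics.mean(b)) / se
    # Large samples: approximate the t distribution by the standard normal.
    p_value = 1 - NormalDist().cdf(t)
    return t, p_value


def fit_line(x, y):
    """Least-squares slope, intercept, and Pearson r for y = intercept + slope * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


rng = random.Random(7)

# Synthetic "likes" for two video categories (hypothetical values).
likes_cat10 = [rng.gauss(1200, 300) for _ in range(2000)]
likes_cat2 = [rng.gauss(1000, 300) for _ in range(2000)]
t, p = welch_t(likes_cat10, likes_cat2)
print(f"t = {t:.2f}, one-sided p = {p:.4f}")

# Synthetic views/likes with a built-in positive relationship plus noise.
views = [rng.uniform(1_000, 100_000) for _ in range(2000)]
likes = [0.03 * v + rng.gauss(0, 200) for v in views]
slope, intercept, r = fit_line(views, likes)
print(f"likes = {intercept:.1f} + {slope:.4f} * views, r = {r:.3f}")
```

A small p-value would lead to rejecting the hypothesis that the two means are equal, and the sign of the slope (or of r) indicates whether the relationship is positive or negative. In Excel, comparable results are typically obtained with T.TEST, CORREL, SLOPE, and INTERCEPT, or with the Regression tool in the Analysis ToolPak.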