For the final project, you should work alone. The purpose of the project is for you to gain
experience in applying the methods taught in the class to a real data set of
interest to you.
Project Description
The standard project is to use multiple regression analysis to analyze a data
set that is of interest to you. The final report for the project should be a
2-5 page paper (not counting additional R output) that describes your questions
of interest, how you used your data set to address them (with details on the
steps of your analysis), your findings, and the limitations of your study.
Specifically, your report should contain the following:
You will need a data set to explore your question of interest. I will be happy
to help you with suggestions. The data set should ideally contain at least
30-50 observations and at least 4 variables (pieces of information about the
observations, e.g., BMI, age, gender), although exceptions will be allowed,
subject to my approval, if that is not possible. One of the variables should be
a numerical variable that would be of interest to model or forecast. I will be
happy to discuss ideas with you. Here are a few potential sources of ideas and
data:
National Longitudinal Survey of Youth, 1979 cohort
https://www.nlsinfo.org/content/cohorts/nlsy79
Your task for this project is to write a report addressing the following
problem:
Is there a significant difference in income between men and women? Does the
difference vary depending on other factors (e.g., education, marital status,
criminal history, drug use, childhood household factors, profession, etc.)?
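One common way to formalize the second question is a regression with an interaction term. The model below is only an illustrative sketch: the variable names (a 0/1 indicator `female` and years of education `educ`) are assumptions chosen for the example, not variables prescribed by the assignment.

```latex
\[
\text{income}_i = \beta_0 + \beta_1\,\text{female}_i + \beta_2\,\text{educ}_i
  + \beta_3\,(\text{female}_i \times \text{educ}_i) + \varepsilon_i
\]
```

Under this sketch, $\beta_1$ is the estimated income gap at $\text{educ} = 0$, and $\beta_3$ describes how that gap changes with each additional unit of education. A test of $\beta_3 = 0$ then asks whether the gap varies with education.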
To address this problem you will use the NLSY79 (National Longitudinal Survey
of Youth, 1979 cohort) data set. The NLSY79 data set contains survey responses
on thousands of individuals who have been surveyed every one or two years
starting in 1979.
A natural outcome variable for the data is TOTAL INCOME FROM WAGES AND SALARY
IN PAST CALENDAR YEAR (TRUNC) (2012 survey question). Note that this quantity
is topcoded, meaning that you do not get to see the actual incomes for the top
2% of earners. Instead, for each of those earners the income variable is set to
the average income among the top 2%. The implication of this topcoding is
something you’ll want to discuss as part of your analysis.
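To make the effect of topcoding concrete, here is a minimal sketch in Python (the course itself uses R; Python is used here only for illustration). The incomes are simulated, not NLSY79 data, and the helper `topcode` is a hypothetical name. It replaces the top 2% of values with their mean, as described above, and compares summary statistics before and after.

```python
import random
import statistics


def topcode(incomes, top_fraction=0.02):
    """Return a copy of `incomes` with the top `top_fraction` set to their mean."""
    n = len(incomes)
    k = max(1, round(n * top_fraction))  # how many top earners get masked
    order = sorted(range(n), key=lambda i: incomes[i])  # indices, ascending income
    top_idx = order[-k:]
    top_mean = statistics.fmean([incomes[i] for i in top_idx])
    coded = list(incomes)
    for i in top_idx:
        coded[i] = top_mean
    return coded


random.seed(0)
# Simulated right-skewed incomes (an assumption for illustration only).
raw = [random.lognormvariate(10.5, 0.8) for _ in range(1000)]
coded = topcode(raw)

print("mean   raw / topcoded:", statistics.fmean(raw), statistics.fmean(coded))
print("max    raw / topcoded:", max(raw), max(coded))
print("stdev  raw / topcoded:", statistics.stdev(raw), statistics.stdev(coded))
```

In this sketch the overall mean is unchanged by topcoding, while the maximum and the standard deviation shrink and the masked observations all share one identical value. How that affects a regression on the real data is for your analysis to discuss.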
You are not expected to use all 70 variables in your
analysis. It suffices to choose a total of 6-8 variables (income, gender + 4-6
others) and to perform a thorough analysis using just those variables.
Give a careful presentation of your main findings concerning the main problem
of income (in)equality between men and women. You should provide, where
appropriate:
· Tabular summaries (with carefully labelled column headers)
· Graphical summaries (with carefully labelled axes, titles, and legends)
· Regression output + interpretation of output + interpretation of coefficients
· Assessments of statistical significance (output of tests, models, and
corresponding p-values)
As part of your analysis you must run a regression model. When running
regressions, you should discuss whether the standard diagnostic plots indicate
issues with the model (trends in residuals, variance issues, outliers, etc.).
You will not receive full credit for your regression unless you clearly display
and discuss the diagnostic plots.
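As a rough illustration of what a "variance issue" looks like, the following Python sketch (again for illustration only; the course uses R) fits a one-predictor least-squares line to simulated data whose noise grows with the predictor, then compares the residual spread in the lower and upper halves of the fitted values. The data, the helper `fit_line`, and the half-split check are all assumptions made for this example. The check is a crude numeric stand-in for a residuals-versus-fitted plot and does not replace the plots the report must display.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b


random.seed(1)
# Simulated data (not NLSY79): the noise standard deviation grows with x.
x = [random.uniform(8, 20) for _ in range(500)]
y = [5000 + 3000 * xi + random.gauss(0, 500 * xi) for xi in x]

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Split residuals at the median fitted value and compare their spread.
cut = statistics.median(fitted)
low = [r for r, f in zip(resid, fitted) if f <= cut]
high = [r for r, f in zip(resid, fitted) if f > cut]
spread_ratio = statistics.stdev(high) / statistics.stdev(low)

print("residual spread ratio (upper half / lower half):", round(spread_ratio, 2))
```

A ratio well above 1 corresponds to the fan shape you would see in a residuals-versus-fitted plot, which is one of the patterns worth discussing if it shows up in your own diagnostics.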