For the final project, you should work alone. The purpose of the project is for you to gain
experience in applying the methods taught in the class to a real data set of
interest to you.
The standard project is to use multiple regression
analysis to analyze a data set that is of interest to you. The final report
should be a 2-5 page paper (not counting additional R output) that describes
the questions of interest, how you used your data set to address them (with
details on the steps in your analysis), your findings, and the limitations of
your study. Specifically, your report should
contain the following:
A one-paragraph summary of what you set out to learn and what you ended
up finding. It should summarize the
A discussion of what questions you are interested in.
Data Set: Describe details about how the data set was collected and the
variables in the data set.
Describe how you used multiple regression to analyze the data set. Specifically,
discuss how you carried out the analysis steps covered in class, i.e., exploring
the data to find a reasonable initial model, checking the model, and revising
the model based on those checks.
Provide inferences about the questions of interest and discuss them.
Limitations of study and conclusion: Describe any limitations of your study and how
they might be overcome in future research and provide brief conclusions
about the results of your study.
You will need a data set to explore your question of
interest. I will be happy to discuss ideas with you and help with
suggestions. The data set should
ideally contain at least 30-50 observations and at least 4 variables (pieces of
information about the observations, e.g., BMI, age, gender). If that is not
possible, exceptions will be allowed, subject to my approval. One
of the variables should be a numerical variable that would be of interest to
model or forecast. Here are a few potential sources of
ideas and data:
National Longitudinal Survey of Youth, 1979 cohort
Your task for
this project is to write a report addressing the following problem:
Is there a
significant difference in income between men and women? Does the difference
vary depending on other factors (e.g., education, marital status, criminal
history, drug use, childhood household factors, profession, etc.)?
For this problem you will use the NLSY79 (National Longitudinal Survey of Youth,
1979 cohort) data set. The NLSY79 data set contains survey responses on
thousands of individuals who have been surveyed every one or two years starting
in 1979. The outcome variable for the data is TOTAL INCOME FROM WAGES AND SALARY
IN PAST CALENDAR YEAR (TRUNC) (2012 survey question). Note that this quantity is
topcoded, meaning that you do not get to see the actual incomes for the top 2%
of earners. Instead, for those earners the income variable is set to the average
income among the top 2% of earners. The implications of this topcoding are
something you'll want to discuss as part of your analysis.
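To make the effect of topcoding concrete, here is a minimal sketch in Python (standard library only; the same computation carries over directly to R). The incomes are made-up numbers, not NLSY79 data, and the helper name `topcode` is hypothetical. It assumes topcoding works as described above: the top 2% of values are each replaced by the average of that group.

```python
from statistics import mean, pstdev

def topcode(incomes, top_share=0.02):
    """Replace the highest `top_share` of values with their own average."""
    n_top = max(1, round(len(incomes) * top_share))
    # Positions of the n_top largest incomes
    order = sorted(range(len(incomes)), key=lambda i: incomes[i])
    top_idx = order[-n_top:]
    top_mean = mean([incomes[i] for i in top_idx])
    coded = list(incomes)
    for i in top_idx:
        coded[i] = top_mean
    return coded

# Made-up incomes: 98 ordinary earners plus two very high earners.
incomes = [20_000 + 500 * i for i in range(98)] + [300_000, 900_000]
coded = topcode(incomes)

print("mean before / after:", mean(incomes), mean(coded))
print("max before / after: ", max(incomes), max(coded))
print("sd before / after:  ", pstdev(incomes), pstdev(coded))
```

In this toy example the overall mean is unchanged, but the largest value and the spread both shrink. That suggests summaries of average income are less affected by topcoding than anything that depends on the upper tail (variances, outlier checks, or gaps among the highest earners).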
You are not expected to use all 70 variables in your
analysis. It suffices to choose a total of 6-8 variables (income, gender + 4-6
others) and to perform a thorough analysis using just those variables.
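One common way to ask whether the income gap between men and women varies with another factor is to include an interaction term in the regression. The sketch below is only an illustration of that idea, not necessarily the approach expected for your report. It is written in Python with the standard library only (in R the same model would be fit with the usual regression tools), the eight records are made-up numbers rather than NLSY79 data, and the helper `ols` is a hypothetical name for a small normal-equations solver.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical records: (female, college, income in $1000s). Made-up numbers.
records = [
    (0, 0, 40), (0, 0, 44),
    (0, 1, 60), (0, 1, 64),
    (1, 0, 36), (1, 0, 40),
    (1, 1, 50), (1, 1, 54),
]

# Model: income = b0 + b1*female + b2*college + b3*female*college
X = [[1, f, c, f * c] for f, c, _ in records]
y = [inc for _, _, inc in records]
b0, b_female, b_college, b_interaction = ols(X, y)

gap_no_college = b_female                  # female minus male, no college
gap_college = b_female + b_interaction     # female minus male, college
print("gap without college:", gap_no_college)
print("gap with college:   ", gap_college)
```

With these made-up records the estimated gap is 4 (thousand) among those without college and 10 among those with college, so the interaction coefficient of -6 is what captures "the difference varies with education." In a real analysis you would also look at the standard error of that coefficient before drawing conclusions.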
How to report Findings
Your report should include a careful presentation of your main findings
concerning the main problem of income (in)equality between men and women. You
should provide, where appropriate:
Tabular summaries (with carefully labelled column headers)
Graphical summaries (with carefully labelled axes, titles, and legends)
Regression output + interpretation of output + interpretation of coefficients
of statistical significance (output of tests, models, and corresponding
As part of your
analysis you must run a regression model. When running regressions, you should discuss whether
the standard diagnostic plots indicate issues with the model (trends in
residuals, variance issues, outliers, etc.). You will not receive full credit
for your regression unless you clearly display and discuss the diagnostic