The purpose of the project is for you to gain experience in applying the methods taught in the class to a real data set of interest to you.



For the final project, you should work alone.


Project Description

The standard project is to use multiple regression analysis to analyze a data set that is of interest to you. The final report should be a 2-5 page paper (not counting additional R output) that describes the questions of interest, how you used your data set to address them (with details on the steps of your analysis), your findings, and the limitations of your study. Specifically, your report should contain the following:


  1. Abstract: A one-paragraph summary of what you set out to learn and what you ended up finding. It should summarize the entire report.
  2. Introduction: A discussion of what questions you are interested in.
  3. Data Set: Describe how the data set was collected and the variables it contains.
  4. Analysis: Describe how you used multiple regression to analyze the data set. Specifically, you should discuss how you carried out the steps of the analysis discussed in class, i.e., exploring the data to find a reasonable initial model, checking the model, and revising the model based on those checks.
  5. Results: Provide inferences about the questions of interest and discuss them.
  6. Limitations of study and conclusion: Describe any limitations of your study and how they might be overcome in future research, and provide brief conclusions about the results of your study.



You will need a data set to explore your question of interest, and I will be happy to help you with suggestions. The data set should ideally contain at least 30-50 observations and at least 4 variables (pieces of information about the observations, e.g., BMI, age, gender), although exceptions will be allowed, subject to my approval, if that is not possible. One of the variables should be a numerical variable that would be of interest to model or forecast. Here are a few potential sources of ideas and data:



National Longitudinal Survey of Youth, 1979 cohort


Your task for this project is to write a report addressing the following problem:

Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?
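
Both parts of this question can be framed within a single regression. As a sketch only (the variable names female, an indicator equal to 1 for women, and educ, years of education, are illustrative choices and not names taken from the NLSY79 data), a model with an interaction term might look like:

```latex
\[
\text{income}_i = \beta_0 + \beta_1\,\text{female}_i + \beta_2\,\text{educ}_i
  + \beta_3\,(\text{female}_i \times \text{educ}_i) + \varepsilon_i
\]
```

Under this model the difference in expected income between women and men at a given education level is \beta_1 + \beta_3 educ. A \beta_3 that differs significantly from zero would suggest that the gap varies with education; the same idea applies to any other factor you choose.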


To address this problem you will use the NLSY79 (National Longitudinal Survey of Youth, 1979 cohort) data set. The NLSY79 data set contains survey responses from thousands of individuals who have been surveyed every one or two years starting in 1979.

A natural outcome variable for the data is TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (TRUNC) (2012 survey question). Note that this quantity is topcoded, meaning that you do not get to see the actual incomes for the top 2% of earners. Instead, for each of those earners the income variable is set to the average income among that top 2%. You will want to discuss the implications of this topcoding as part of your analysis.
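
To get a feel for what topcoding does, here is a minimal sketch in Python using only the standard library. It uses simulated log-normal incomes, not NLSY79 data, and the function name topcode and the simulation parameters are assumptions made purely for illustration. The same experiment is easy to reproduce in R.

```python
import random
import statistics

def topcode(values, top_fraction=0.02):
    """Replace the top `top_fraction` of values with the mean of that group."""
    n_top = max(1, round(len(values) * top_fraction))
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    top_idx = order[-n_top:]
    top_mean = statistics.mean([values[i] for i in top_idx])
    coded = list(values)
    for i in top_idx:
        coded[i] = top_mean
    return coded

# Simulated incomes for illustration only (not real survey data).
random.seed(0)
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(1000)]
coded = topcode(incomes)

# Compare the mean, maximum, and standard deviation before and after topcoding.
print("mean :", round(statistics.mean(incomes)), round(statistics.mean(coded)))
print("max  :", round(max(incomes)), round(max(coded)))
print("stdev:", round(statistics.stdev(incomes)), round(statistics.stdev(coded)))
```

The overall mean is unchanged by construction, but the largest values are pulled in and the spread shrinks, so any analysis that depends on the upper tail (including a gender gap among the highest earners) is affected.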

You are not expected to use all 70 variables in your analysis. It suffices to choose a total of 6-8 variables (income, gender + 4-6 others) and to perform a thorough analysis using just those variables.

How to Report Findings

Give a careful presentation of your main findings concerning the main problem of income (in)equality between men and women. You should provide, where appropriate:

  · Tabular summaries (with carefully labelled column headers)
  · Graphical summaries (with carefully labelled axes, titles, and legends)
  · Regression output + interpretation of output + interpretation of coefficients
  · Assessments of statistical significance (output of tests, models, and corresponding p-values)


As part of your analysis you must run a regression model. When running regressions, you should discuss whether the standard diagnostic plots indicate issues with the model (trends in residuals, variance issues, outliers, etc.). You will not receive full credit for your regression unless you clearly display and discuss the diagnostic plots.
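
To illustrate what a "trend in residuals" looks like, here is a minimal sketch in Python with made-up data and a single predictor. The helper fit_line and the data are assumptions for illustration; in your own project the standard diagnostic plots from your regression software serve this purpose. A straight line is fitted to data that actually follow a curve, and the residuals are listed next to the fitted values, which are the points a residuals-versus-fitted plot would show.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Made-up data with a curved relationship (y = x^2), for illustration only.
x = list(range(11))
y = [xi ** 2 for xi in x]

slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Each row is one point of a residuals-vs-fitted plot: (fitted value, residual).
for f, r in zip(fitted, residuals):
    print(f"{f:8.1f} {r:8.1f}")
```

The residuals are positive at both ends and negative in the middle. A U-shaped pattern like this indicates that a straight-line term is missing curvature, and it is the kind of issue your discussion of the diagnostic plots should point out and address (for example, with a transformation or an added term).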
