Introduction
Research Idea
My dependent variable for the final project will be suicides per 100k persons. I will use the rate rather than total suicides because it controls for differences in population across countries. I plan to investigate several independent variables that may influence the suicide rate: age, gender, Human Development Index (HDI), and GDP per capita. Additionally, I would like to see how the impact of these variables varies across countries and how it has changed from 1985 to 2016.
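To illustrate why the rate is more comparable than the raw count, here is a minimal sketch with two made-up countries. The data frame `toy` and its numbers are hypothetical and are not taken from the data set.

```r
# Hypothetical counts for two made-up countries (not from the data set)
toy <- data.frame(
  country     = c("A", "B"),
  suicides_no = c(500, 50),
  population  = c(10000000, 500000)
)

# Rate per 100k persons = suicides / population * 100,000
toy$suicides_per_100k <- toy$suicides_no / toy$population * 1e5
toy
```

In this toy example country A has ten times as many suicides as country B, but country B has the higher rate (10 versus 5 per 100k persons).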
The data set
The data set contains time series data about suicides from 1985 to 2016 with information about country, GDP per capita, Human Development Index (HDI), age, gender, etc.
I would like to use this data set to understand how certain factors influence suicide rates across countries and over time. I plan to do this by fitting multiple regression models that regress the suicide rate on different combinations of the independent variables, in order to identify risk factors that contribute to the suicide rate. The data set contains age groupings, generational data, the raw number of suicides, suicides per 100k persons, population, HDI, and gdp_per_capita values for over 100 countries from 1985 to 2016.
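The model-comparison idea described above could be sketched roughly as follows. This is only an illustration on simulated data: the data frame `sim`, the effect sizes, and the model names `fit_small` and `fit_full` are assumptions made for the example, not results from the actual data set.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical values, not the real data)
sim <- data.frame(
  sex            = factor(sample(c("female", "male"), n, replace = TRUE)),
  gdp_per_capita = runif(n, 1000, 60000)
)

# Simulated outcome: higher baseline for males, a mild GDP effect, plus noise
sim$suicides_per_100k <- 5 + 10 * (sim$sex == "male") +
  0.0001 * sim$gdp_per_capita + rnorm(n, sd = 3)

# Two nested models with different combinations of independent variables
fit_small <- lm(suicides_per_100k ~ sex, data = sim)
fit_full  <- lm(suicides_per_100k ~ sex + gdp_per_capita, data = sim)

summary(fit_full)$coefficients  # estimates, standard errors, t and p values
anova(fit_small, fit_full)      # F-test: does adding GDP per capita improve the fit?
```

With the real data, the same pattern would apply with the actual column names in place of the simulated ones, and further terms such as age or HDI could be added in the same way.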
The data set was downloaded from Kaggle.com, which aggregated a variety of sources to produce this data set. These references are:
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/
library(dplyr)      # group_by(), summarise(), %>%
library(stargazer)  # stargazer() summary table
suicidedata <- read.csv("~/Desktop/Statistics Lab/suicidedata.csv")
Motivation: Graphs and other descriptive statistics
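# Rename only the two GDP columns, matched by name (assumes exactly one column name contains each string)
names(suicidedata)[grep("gdp_for_year", names(suicidedata))] <- "gdp_for_year"
names(suicidedata)[grep("gdp_per_capita", names(suicidedata))] <- "gdp_per_capita"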
suicidedata <- suicidedata[!(suicidedata$year == 2016), ]
options(scipen = 3)
by_year <- group_by(suicidedata, year)
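# Reorder the age groups so the alphabetically fourth level comes first; factor() makes this work even if age was read in as character
suicidedata$age <- factor(suicidedata$age, levels = levels(factor(suicidedata$age))[c(4, 1, 2, 3, 5, 6)])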
by_sex <- suicidedata %>% group_by(sex)
by_sex <- by_sex %>% summarise(suicides_no = sum(suicides_no)/sum(suicidedata$suicides_no))
stargazer(suicidedata, type = "html", digits = 1)
From this summary table we can see that the average number of suicides per 100k persons is 12.8 with a standard deviation of 19. It is not surprising that the standard deviation exceeds the mean here: suicides per 100k persons can never fall below 0 (the minimum column shows 0.0), so the large spread must come from a long right tail. The 75th percentile is 16.6 suicides per 100k persons, but the maximum observed in the data set is 225, which shows that some outliers lie very far above the mean.
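One way to check this kind of right skew and count the extreme values is sketched below. The vector `rate` is simulated from a log-normal distribution purely for illustration (an assumption, not the real data), and the 1.5 × IQR fence is one common outlier convention rather than the only possible choice.

```r
set.seed(42)

# Simulated non-negative, right-skewed rates (hypothetical, not the real data)
rate <- rlnorm(1000, meanlog = 2, sdlog = 1.2)

# With a long right tail, the mean is pulled above the median
c(mean = mean(rate), median = median(rate), sd = sd(rate))

# Flag values above the upper fence: 75th percentile + 1.5 * IQR
upper_fence <- unname(quantile(rate, 0.75)) + 1.5 * IQR(rate)
n_outliers  <- sum(rate > upper_fence)
n_outliers
```

Applied to the real data, `rate` would be replaced by the suicides per 100k persons column.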