My dependent variable for the final project will be suicides per 100k persons. I will use this rate rather than total suicides because it controls for differences in population across countries. I plan to investigate several independent variables that may influence the suicide rate: age, gender, Human Development Index (HDI), and GDP per capita. Additionally, I would like to see how the impact of these variables differs across countries and how it has changed from 1985 to 2016.
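To illustrate why the rate is the more comparable outcome, here is a minimal sketch with two made-up country-years. The data frame `toy` and all of its numbers are hypothetical and are not taken from the data set.

```r
# Hypothetical figures for illustration only; not from the data set.
toy <- data.frame(
  country     = c("A", "B"),
  suicides_no = c(500, 50),
  population  = c(10000000, 500000)
)

# Rate per 100k persons = suicides / population * 100,000
toy$rate_per_100k <- toy$suicides_no / toy$population * 1e5

toy
```

In this toy case country A has ten times as many suicides as country B, yet its rate per 100k persons is half of B's, which is the kind of distortion that raw counts would introduce.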
The data set contains time series data about suicides from 1985 to 2016 with information about country, GDP per capita, Human Development Index (HDI), age, gender, etc.
I would like to use this data set to understand how certain factors influence suicide rates across countries and over time. I plan to do this by fitting multiple regression models on different combinations of the independent variables in the data set to identify risk factors associated with the suicide rate. The data set contains age groupings, generational data, the raw number of suicides and suicides per 100k persons, population, HDI, and gdp_per_capita values for over 100 countries from 1985 to 2016.
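One way the planned comparison of regression models might look is sketched below. The column names `suicides.100k.pop`, `gdp_per_capita`, and `HDI.for.year` follow the data set; the simulated data frame `sim`, the age labels, the log transform of GDP, and the two model formulas are assumptions made only so that the sketch runs on its own.

```r
# Simulated stand-in data so the sketch is self-contained; all values are made up.
set.seed(1)
n <- 200
sim <- data.frame(
  sex            = factor(sample(c("female", "male"), n, replace = TRUE)),
  age            = factor(sample(c("younger", "middle", "older"), n, replace = TRUE)),
  gdp_per_capita = runif(n, 1000, 60000),
  HDI.for.year   = runif(n, 0.5, 0.95)
)
# Hypothetical outcome: a higher rate for males plus noise, floored at 0
sim$suicides.100k.pop <- pmax(0, 5 + 10 * (sim$sex == "male") + rnorm(n, sd = 4))

# Two candidate specifications using different sets of predictors
m1 <- lm(suicides.100k.pop ~ sex + age, data = sim)
m2 <- lm(suicides.100k.pop ~ sex + age + log(gdp_per_capita) + HDI.for.year,
         data = sim)

# m1 is nested in m2, so adjusted R-squared and an F-test can compare them
summary(m1)$adj.r.squared
summary(m2)$adj.r.squared
anova(m1, m2)
```

On the real data the same calls would take `data = suicidedata`; because HDI is missing for many observations, models that include it would be fitted on a smaller sample than models that do not.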
The data set was downloaded from Kaggle.com, which aggregated a variety of sources to produce this data set. These references are:
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country: 1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/
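# load the packages used below: dplyr provides
# group_by(), summarise() and %>%; stargazer
# produces the summary table
library(dplyr)
library(stargazer)
# loads dataset into Rmarkdown
suicidedata <- read.csv("~/Desktop/Statistics Lab/suicidedata.csv")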
# rename columns 10 and 11 with variable names that
# are easy to work with
names(suicidedata)[10] <- "gdp_for_year"
names(suicidedata)[11] <- "gdp_per_capita"
# remove observations from the year 2016 because
# these observations are not from a full year of
# data in the dataset.
suicidedata <- suicidedata[suicidedata$year != 2016, ]
# raise the penalty for scientific notation so that
# axis labels are printed in fixed notation.
options(scipen = 3)
# group observations by year
by_year <- group_by(suicidedata, year)
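# levels(suicidedata$age) was used to inspect the
# level order before releveling with the code below,
# which moves the fourth level to the front. Run
# this line only once: rerunning it on the already
# reordered factor would permute the levels again.
# The inner factor() call keeps this working in
# R >= 4.0, where read.csv() returns character
# columns rather than factors by default.
suicidedata$age <- factor(suicidedata$age,
    levels = levels(factor(suicidedata$age))[c(4, 1, 2, 3, 5, 6)])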
# group total suicides by gender and compute the
# male/female shares of all suicides (as proportions)
by_sex <- suicidedata %>% group_by(sex)
by_sex <- by_sex %>% summarise(suicides_no = sum(suicides_no)/sum(suicidedata$suicides_no))
# stargazer table to show summary statistics of
# dataset
stargazer(suicidedata, type = "html", digits = 1)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
year | 27,660 | 2,001.2 | 8.4 | 1,985 | 1,994 | 2,008 | 2,015 |
suicides_no | 27,660 | 243.4 | 904.5 | 0 | 3 | 132 | 22,338 |
population | 27,660 | 1,850,689.0 | 3,920,658.0 | 278 | 97,535.2 | 1,491,041.0 | 43,805,214 |
suicides.100k.pop | 27,660 | 12.8 | 19.0 | 0.0 | 0.9 | 16.6 | 225.0 |
HDI.for.year | 8,364 | 0.8 | 0.1 | 0.5 | 0.7 | 0.9 | 0.9 |
gdp_per_capita | 27,660 | 16,815.6 | 18,861.6 | 251 | 3,436 | 24,796 | 126,352 |
From this summary table we can see that the average number of suicides per 100k persons is 12.8 with a standard deviation of 19.0. It is not surprising that the standard deviation exceeds the mean: suicides per 100k persons can never fall below 0 (the Min column shows 0.0), so a standard deviation this large relative to the mean indicates a distribution with a long right tail. The 75th percentile is 16.6 suicides per 100k persons, but the maximum observed in the dataset is 225, which shows that some observations lie very far above the mean.
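The right-tail reading above could be checked more directly with a small helper like the one sketched here. The function `tail_summary` is a hypothetical name, the 1.5 x IQR upper fence is one common outlier convention rather than anything prescribed by the data set, and the lognormal sample only stands in for `suicides.100k.pop` so that the sketch runs without the CSV file.

```r
# Summarise how far the upper tail of a non-negative variable sits from its bulk.
tail_summary <- function(x) {
  q <- quantile(x, probs = c(0.5, 0.75, 0.99), na.rm = TRUE)
  # Tukey upper fence: 75th percentile plus 1.5 times the interquartile range
  upper_fence <- q[["75%"]] + 1.5 * IQR(x, na.rm = TRUE)
  list(
    mean              = mean(x, na.rm = TRUE),
    sd                = sd(x, na.rm = TRUE),
    quantiles         = q,
    share_above_fence = mean(x > upper_fence, na.rm = TRUE)
  )
}

# Simulated right-skewed, non-negative values (made up for illustration)
set.seed(42)
x <- rlnorm(5000, meanlog = 1.5, sdlog = 1.3)

res <- tail_summary(x)
res$sd > res$mean               # does the SD exceed the mean?
res$mean > res$quantiles[["50%"]]  # is the mean pulled above the median?
res$share_above_fence           # fraction of values beyond the upper fence
```

On the real data the call would be `tail_summary(suicidedata$suicides.100k.pop)`.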