Mastitis occurs when bacteria enter a cow’s udder and cause an infection. On average, mastitis costs farmers €60 per cow per year. Mastitis accounts for a loss of 20% of the total agricultural revenue in Ireland. Healthy udders are economically profitable, lead to a better-quality product, and improve cow welfare. Somatic cell count (scc) is the total number of cells per millilitre of milk. The scc is composed primarily of leukocytes, or white blood cells, which the cow’s immune system produces to fight a mastitis infection. Since leukocytes in the udder increase as the condition worsens, scc provides an indication of the degree of mastitis in an individual cow.

Automated (robotic) milking systems are becoming more popular in Ireland and provide information on various measures of the composition of milk. Casein and whey protein are the major proteins in milk. Casein constitutes approximately 80% (29.5 g/L) of the total protein in bovine milk, and whey protein accounts for about 20% (6.3 g/L). The objective of this project is to analyze the relationship between the somatic cell count scc and the protein levels recorded by the automated (robotic) milking system, namely protein and casein. We also consider the percentage concentrate feed (supplements) in the cows’ diet, conc_fed.

This data set contains

• protein, the recorded protein in the milk for cow i,

• casein, the casein in the milk for cow i,

• scc, the somatic cell count in the milk for cow i,

• conc_fed, the percentage concentrate feed (supplements) in the diet of cow i,

for i = 1, ..., N, where N is the number of cows recorded in the data set. The observations relate to individual cows on four farms in Ireland.
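As a rough illustration of the layout described above, the following R sketch builds a small simulated stand-in for the data set. The data frame name `milk`, the sample size, the simulated values and the three conc_fed levels are all assumptions made for illustration only; the real file and its values are not given here.

```r
# Hypothetical stand-in for the real data set.
# The name `milk`, the size n and every simulated value are assumptions.
set.seed(1)
n <- 40
milk <- data.frame(
  protein  = rnorm(n, mean = 3.5, sd = 0.3),           # recorded protein for cow i
  casein   = rnorm(n, mean = 2.7, sd = 0.25),          # casein for cow i
  scc      = exp(rnorm(n, mean = 4.5, sd = 1)),        # somatic cell count (right-skewed, positive)
  conc_fed = sample(c(10, 20, 30), n, replace = TRUE)  # percentage concentrate feed (assumed levels)
)

str(milk)   # one row per cow, four columns
head(milk)  # first few observations
```

With the real data, the same `str()` and `head()` calls give a quick check that each of the four variables has been read in with the expected type.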

Exploratory Data Analysis (35 marks):

For each question in the EDA section please provide the lines of R code required to produce your results and the tables and figures produced by R.

1. Using a boxplot, a histogram and the descriptive statistics (mean, min, max, median, and quantiles), describe the distribution of the somatic cell count scc. (5 marks)

2. Using a boxplot, a histogram and the descriptive statistics (mean, min, max, median, and quantiles), describe the distribution of the log of the somatic cell count scc. (5 marks)

3. Using a boxplot, a histogram and the descriptive statistics (mean, min, max, median, and quantiles), describe the distribution of the protein levels protein. (5 marks)

4. Using a boxplot, a histogram and the descriptive statistics (mean, min, max, median, and quantiles), describe the distribution of the casein levels casein. (5 marks)

5. Convert the categorical variable conc_fed to a factor. Describe and illustrate the frequency and proportions of the categorical variable concentrate feed conc_fed. (5 marks)

6. Using the descriptive statistics (mean, standard deviation, median, mad: median absolute deviation (from the median), minimum, maximum, skew and standard error) and a boxplot, describe how the log of the somatic cell count scc varies with respect to the variable concentrate feed conc_fed. (5 marks)

7. Using the correlation and scatter plots, discuss the relationship between log(scc) and each of the variables protein and casein. (3 marks)

8. Based on the results from Q7, which variable, protein or casein, would provide a better predictor variable in your regression model with log(scc) as the response? Provide a justification for your selection. (2 marks)
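The exploratory steps listed above could be approached along the following lines in base R. This is only a sketch on simulated data: the data frame `milk`, the helper `describe_group` and all values are hypothetical. Base R has no built-in skewness function, so a simple moment-based formula is assumed here.

```r
# Simulated stand-in data; the name `milk` and all values are assumptions.
set.seed(1)
n <- 40
milk <- data.frame(
  protein  = rnorm(n, 3.5, 0.3),
  scc      = exp(rnorm(n, 4.5, 1)),
  conc_fed = sample(c(10, 20, 30), n, replace = TRUE)
)
milk$casein  <- 0.78 * milk$protein + rnorm(n, 0, 0.05)
milk$log_scc <- log(milk$scc)

# Questions 1 to 4: descriptive statistics, histogram and boxplot of one variable
summary(milk$scc)                                # min, quartiles, median, mean, max
quantile(milk$scc, probs = c(0.25, 0.5, 0.75))   # selected quantiles
hist(milk$scc, main = "scc", xlab = "scc")
boxplot(milk$scc, main = "scc")
# The same calls apply to milk$log_scc, milk$protein and milk$casein.

# Question 5: factor conversion, frequencies and proportions
milk$conc_fed <- factor(milk$conc_fed)
freq <- table(milk$conc_fed)
prop <- prop.table(freq)
barplot(freq, xlab = "conc_fed", ylab = "Number of cows")

# Question 6: log(scc) by feed group (hypothetical helper)
describe_group <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x),
    mad = mad(x),                                # scaled by 1.4826 by default
    min = min(x), max = max(x),
    skew = mean((x - mean(x))^3) / sd(x)^3,      # simple moment-based skewness
    se = sd(x) / sqrt(length(x)))
}
aggregate(log_scc ~ conc_fed, data = milk, FUN = describe_group)
boxplot(log_scc ~ conc_fed, data = milk, ylab = "log(scc)")

# Question 7: correlations and scatter plots
cors <- c(protein = cor(milk$log_scc, milk$protein),
          casein  = cor(milk$log_scc, milk$casein))
cors
plot(milk$protein, milk$log_scc, xlab = "protein", ylab = "log(scc)")
plot(milk$casein,  milk$log_scc, xlab = "casein",  ylab = "log(scc)")
```

For Question 8, one reasonable criterion is to compare the absolute values in `cors` alongside the scatter plots; the simulated values here say nothing about which variable is better in the real data.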

Regression Model (65 marks):

1. Using R, fit a simple linear regression model to the data with log(scc) as the response variable and the variable chosen in Q8 of the exploratory analysis section as the predictor variable. Define and describe the mathematical equation for the model. (Also provide your R code.) (4 marks)

2. Interpret the estimate of the intercept term. (2 marks)

3. Interpret the estimate of the slope term. (2 marks)

4. Calculate the variance of the estimates of the intercept and slope terms. (2 marks)

5. Calculate and interpret the confidence interval for $\beta_0$. (Provide your R code.) (5 marks)

6. Calculate and interpret the confidence interval for $\beta_1$. (Provide your R code.) (5 marks)

7. Compute and interpret the hypothesis test $H_0: \beta_0 = 0$ vs $H_a: \beta_0 \neq 0$. State the test statistic. Compare the test statistic to the correct distribution value and state your conclusion. Also, report the p-value and the conclusion in the context of the problem. (8 marks)

8. Compute and interpret the hypothesis test $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$. State the test statistic. Compare the test statistic to the correct distribution value and state your conclusion. Also, report the p-value and the conclusion in the context of the problem. (8 marks)

9. Interpret the F-statistic in the output in the summary of the regression model. Hint: State the hypothesis being tested, the test statistic and p-value, and the conclusion in the context of the problem. (6 marks)

10. Interpret the R-squared value. (2 marks)

11. Interpret the residual standard error of the simple linear regression model. (2 marks)

12. Calculate, plot and comment on the shape of the confidence intervals for the estimated values of Y. (Provide your R code.) (4 marks)

13. List the assumptions of the linear regression model required for small sample inference. (5 marks)

14. Examine the residuals of the regression model and comment on whether you think the residuals satisfy
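
The regression questions in this section could be approached along the following lines. This is a minimal sketch on simulated data: the data frame `milk`, the choice of casein as the predictor, the simulated coefficients and the 95% confidence level are all assumptions, not results from the real data set.

```r
# Simulated stand-in data; `milk`, the values and the choice of casein
# as predictor are assumptions made for illustration.
set.seed(1)
n <- 40
casein <- rnorm(n, 2.7, 0.25)
milk <- data.frame(casein  = casein,
                   log_scc = 6 - 0.5 * casein + rnorm(n, 0, 0.8))

# Simple linear regression: log(scc) = b0 + b1 * casein + error
fit <- lm(log_scc ~ casein, data = milk)
s   <- summary(fit)
s   # estimates, t tests, F statistic, R-squared, residual standard error

diag(vcov(fit))             # variances of the intercept and slope estimates
confint(fit, level = 0.95)  # confidence intervals for b0 and b1 (95% assumed)

# t statistics by hand, with the two-sided 5% critical value and p-values
t_stat <- coef(fit) / sqrt(diag(vcov(fit)))
t_crit <- qt(0.975, df = df.residual(fit))
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

# Confidence band for the estimated mean response
grid <- data.frame(casein = seq(min(milk$casein), max(milk$casein),
                                length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
plot(milk$casein, milk$log_scc, xlab = "casein", ylab = "log(scc)")
lines(grid$casein, band[, "fit"])
lines(grid$casein, band[, "lwr"], lty = 2)
lines(grid$casein, band[, "upr"], lty = 2)

# Residual diagnostics: residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit)
```

Each null hypothesis is rejected at the assumed 5% level when the absolute value of the corresponding entry of `t_stat` exceeds `t_crit`, which agrees with comparing `p_val` to 0.05.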
