**“R markdown”**

Course (please circle) : MATH2831 / MATH2931

I (We) declare that this assessment item is my (our) own work, except where acknowledged, and has not been submitted for academic credit elsewhere, and acknowledge that the assessor of this item may, for the purpose of assessing this item:

• Reproduce this assessment item and provide a copy to another member of the University; and/or,

• Communicate a copy of this assessment item to a plagiarism checking service (which may then retain a copy of the assessment item on its database for the purpose of future plagiarism checking).

I (We) certify that I (We) have read and understood the University Rules in respect of Student Academic Misconduct.

Surname    Given name    Student ID    Signature    Date

1

2

3


Please follow the instructions below when completing the assignment, which is worth 20% of your final mark. You may do this assignment in groups of up to 3 people.

Instructions

• Your assignment must be typeset as one continuous LaTeX (.pdf) or R markdown (knitted to .pdf) document (no separate documents stapled together).

• Each question should be numbered using section or enumerate environments in LaTeX or with hashes in R markdown.

• All the content for each part of each question should be consecutive; do not refer to appendices or put questions out of order.

• You must do all your calculations in R and provide all code and relevant output using the verbatim environment in LaTeX or inside R markdown chunks.

• You must submit a printed hard copy of the assignment with a completed cover page (above).

• Font size should be easily readable (10 to 12 pt).

• When stating conclusions for hypothesis tests, you must answer the question. E.g. Using α = 0.05, we have evidence (p = 0.004) that latitude is related to tree height, after controlling for temperature.


Assignment 1 – Questions

1. (MATH2831 and MATH2931)

(a) For a simple linear regression model covered in the lecture notes, derive the relationship between the coefficient of determination $R^2$ and the sample correlation coefficient $r$ given by

$$r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$
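The relationship asked for in part (a) can be checked numerically before it is derived. The sketch below uses a small synthetic data set (the values of `x` and `y` are invented for illustration and are not course data), computes $r$ from the formula above, fits the usual least squares line, and compares $r^2$ with $R^2 = 1 - SS_{res}/SS_{total}$. It is a sanity check only and does not replace the derivation.

```python
import math

# Synthetic data, invented purely for illustration (not from the assignment).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Sample correlation coefficient, as defined in part (a).
r = s_xy / math.sqrt(s_xx * s_yy)

# Least squares fit with an unknown intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Coefficient of determination R^2 = 1 - SS_res / SS_total.
r_squared = 1.0 - ss_res / s_yy

print(f"r^2 = {r ** 2:.6f}, R^2 = {r_squared:.6f}")
```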

(b) Consider a simple linear regression model with a known intercept parameter

$$y_i = \beta_0^* + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n,$$

where $\beta_0^*$ is known, $\beta_1$ is an unknown slope parameter, and the errors $\varepsilon_i$ are uncorrelated with zero mean and common variance $\sigma^2$.

i. Find the least squares estimator of $\beta_1$ (you must justify your answer). Does your estimator differ from the estimator obtained in lectures for the case where $\beta_0$ is unknown?

ii. Find the maximum likelihood estimator of $\beta_1$ (you must justify your answer).

iii. Find the mean and variance of the least squares estimator $b_1$ (you must justify your answer).

iv. Prove for the model above that the following identity holds:

$$\sum_{i=1}^{n} (y_i - \beta_0^*)^2 = \sum_{i=1}^{n} (\hat{y}_i - \beta_0^*)^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i$ denotes the fitted value $\beta_0^* + b_1 x_i$.
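The identity in part iv can be explored numerically before attempting the proof. The sketch below is built on stated assumptions: the data and the value of `beta0_star` are invented, and `b1` is computed as a regression through the origin on the shifted responses $y_i - \beta_0^*$. That form of the estimator is assumed here for illustration; deriving and justifying it is what part i asks for.

```python
# Hypothetical known intercept and synthetic data (illustration only).
beta0_star = 1.5
x = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0]
y = [2.4, 3.6, 5.2, 8.9, 9.3, 13.8]

# Assumed estimator: regression through the origin on z_i = y_i - beta0_star.
z = [yi - beta0_star for yi in y]
b1 = sum(xi * zi for xi, zi in zip(x, z)) / sum(xi ** 2 for xi in x)

# Fitted values beta0_star + b1 * x_i.
fitted = [beta0_star + b1 * xi for xi in x]

# Left-hand side and the two terms on the right-hand side of the identity.
lhs = sum((yi - beta0_star) ** 2 for yi in y)
rhs_fit = sum((fi - beta0_star) ** 2 for fi in fitted)
rhs_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

print(f"LHS = {lhs:.6f}, RHS = {rhs_fit + rhs_res:.6f}")
```

The two printed values agree because the residuals are orthogonal to the $x_i$ under this estimator, which is the key step a proof would need to establish.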


2. (MATH2831 and MATH2931) To answer the following question, download the 'auction.txt' file from Moodle. The data set records the selling prices at auction of 30 antique grandfather clocks, together with the age of each clock and the number of people who made a bid.

Variable    Description
Age         Age of the clock (years)
Bidders     Number of individuals participating in the bidding
Price       Selling price (pounds sterling)

(a) Obtain the R summary output generated by fitting a simple linear regression model with Price as the response and Age as the predictor. Include the summary in your assignment.

(b) What are the least squares estimates of the intercept and slope, and what is the estimated error variance for the fitted model?

(c) How much does price increase or decrease on average when the age of the clock increases by one year?

(d) What percentage of variation in the response is explained by the values of the predictor?

(e) From the R output, state the value of an F test statistic for testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$, where $\beta_1$ is the slope term in the model. Also state the p-value for this test, and the conclusion of the test using a 5% level of significance.

(f) State the observed value of a t test statistic equivalent to the F test considered above in part (e). How would the computation of the p-value for the t test be modified for testing $H_0: \beta_1 = 0$ versus the one-sided alternative $H_1: \beta_1 > 0$?

(g) Forecast the price of an antique grandfather clock if it is 170 years old. Also construct a 95 percent prediction interval for the price, and give a 95 percent confidence interval for the mean price when the age of the clock is 170 years.

(h) Use a Bonferroni adjustment to compute joint confidence intervals for $\beta_0$ and $\beta_1$ with a joint confidence level of at least 95%.
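Parts (b), (e) and (f) rely on quantities that R reports in its summary output. As a rough illustration of how they fit together, the sketch below computes them by hand for a simple linear regression and shows that the squared t statistic for the slope matches the F statistic. The data are synthetic and invented for this example; they are not the auction.txt data, and the assignment itself still requires the calculations to be done in R.

```python
import math

# Synthetic data invented for illustration; NOT the auction.txt data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.8, 4.1, 5.9, 8.4, 9.7, 12.3, 13.8, 16.2]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates of slope and intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Residual and regression sums of squares.
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)

# Estimated error variance on n - 2 degrees of freedom.
s2 = ss_res / (n - 2)

# t statistic for H0: beta1 = 0, and the F statistic on (1, n - 2) df.
se_b1 = math.sqrt(s2 / s_xx)
t_stat = b1 / se_b1
f_stat = (ss_reg / 1) / s2

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```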


3. (MATH2831 and MATH2931) Let $y = (y_1, \ldots, y_n)^\top$ be a set of responses, and consider the linear model

$$y = \boldsymbol{\mu} + \varepsilon,$$

where $\boldsymbol{\mu} = (\mu, \ldots, \mu)^\top$ and $\varepsilon$ is a vector of zero mean, uncorrelated errors with variance $\sigma^2$. This is a linear model in which the responses have a constant but unknown mean $\mu$. We will call this model the location model.

(a) If we write the location model in the usual form of the linear model

$$y = X\beta + \varepsilon,$$

then what is the design matrix $X$? What is $\beta$?

(b) Find $X^\top X$, $(X^\top X)^{-1}$ and $X^\top y$.

(c) What is the least squares estimator of $\mu$? Show that this least squares estimator is unbiased.

(d) Using the results we have proved for the general linear model, derive an expression for an unbiased estimator of $\sigma^2$ in the location model.
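A quick numerical experiment can help check answers to this question. The sketch below assumes the design matrix is a single column of ones (an assumption made for illustration; part (a) asks you to establish the design matrix yourself) and evaluates the general least squares formulas on invented data, comparing the results with Python's built-in sample mean and sample variance.

```python
import statistics

# Synthetic responses, invented for illustration.
y = [4.2, 5.1, 3.8, 6.0, 5.4, 4.9]
n = len(y)

# Assumed design matrix: one column of ones, so p = 1.
X = [[1.0] for _ in range(n)]

# With a single column, X'X and X'y are scalars.
xtx = sum(row[0] * row[0] for row in X)
xty = sum(row[0] * yi for row, yi in zip(X, y))

# Least squares estimate b = (X'X)^{-1} X'y.
mu_hat = xty / xtx

# General-linear-model variance estimate: SS_res / (n - p) with p = 1.
ss_res = sum((yi - mu_hat) ** 2 for yi in y)
s2 = ss_res / (n - 1)

print(f"mu_hat = {mu_hat:.4f}, s2 = {s2:.4f}")
print(f"mean   = {statistics.mean(y):.4f}, var = {statistics.variance(y):.4f}")
```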

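4. (MATH2931 only) In this question, we will prove the sums of squares identity

$$SS_{\mathrm{total}} = SS_{\mathrm{reg}} + SS_{\mathrm{res}}$$

stated in lectures for the general linear model.

(a) If $y$ is the $n \times 1$ vector of response values, show that the vector $\bar{\mathbf{y}}$, the $n \times 1$ vector whose entries are all equal to $\bar{y}$, is given by

$$\bar{\mathbf{y}} = \mathbf{1}(\mathbf{1}^\top \mathbf{1})^{-1} \mathbf{1}^\top y,$$

where $\mathbf{1}$ is the $n \times 1$ vector of ones.

(b) Show that

$$SS_{\mathrm{total}} = y^\top B^\top B y,$$

where $B = I - \mathbf{1}(\mathbf{1}^\top \mathbf{1})^{-1} \mathbf{1}^\top$ and $I$ is the $n \times n$ identity matrix.

(c) Show that the matrix $B$ is symmetric and idempotent.

(d) If $X$ is the design matrix, show that

$$SS_{\mathrm{reg}} = y^\top \left(H - \mathbf{1}(\mathbf{1}^\top \mathbf{1})^{-1} \mathbf{1}^\top\right) y,$$

where $H = X(X^\top X)^{-1} X^\top$ is the $n \times n$ hat matrix. Hint: Write

$$SS_{\mathrm{reg}} = \sum_{i=1}^{n} \hat{y}_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.$$

(e) Recall from lectures that (you don't need to show this)

$$SS_{\mathrm{res}} = y^\top (I - H) y.$$

Hence prove the sums of squares identity

$$SS_{\mathrm{total}} = SS_{\mathrm{reg}} + SS_{\mathrm{res}}.$$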
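The properties of $B$ in parts (b) and (c) can be checked on a small example before proving them in general. The sketch below builds $B$ for $n = 4$, using the fact that $\mathbf{1}^\top \mathbf{1} = n$, so that $\mathbf{1}(\mathbf{1}^\top \mathbf{1})^{-1} \mathbf{1}^\top$ is the matrix with every entry equal to $1/n$. The response values and the helper `matmul` are invented for this illustration. A numerical check on one example is not a proof.

```python
n = 4
y = [3.0, 5.0, 4.0, 8.0]  # synthetic responses, for illustration only


def matmul(a, c):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * c[k][j] for k in range(len(c))) for j in range(len(c[0]))]
        for i in range(len(a))
    ]


# B = I - (1/n) * J, where J is the n x n matrix of ones.
B = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

# Symmetry: B equals its transpose.
is_symmetric = all(B[i][j] == B[j][i] for i in range(n) for j in range(n))

# Idempotency: B B equals B, up to floating point error.
BB = matmul(B, B)
is_idempotent = all(
    abs(BB[i][j] - B[i][j]) < 1e-12 for i in range(n) for j in range(n)
)

# Quadratic form y' B y compared with the total sum of squares.
By = [sum(B[i][j] * y[j] for j in range(n)) for i in range(n)]
quad_form = sum(yi * byi for yi, byi in zip(y, By))
y_bar = sum(y) / n
ss_total = sum((yi - y_bar) ** 2 for yi in y)

print(is_symmetric, is_idempotent, quad_form, ss_total)
```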


5. (MATH2831 and MATH2931) Download the 'Power.txt' file from Moodle. It contains the first n observations from the Combined Cycle Power Plant Data Set on the Machine Learning Repository. The data set contains data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work at full load.

The response and predictor variables of this data set are listed below:

Response      Net Hourly Electrical Energy Output (PE) in MW (megawatts)

Predictors    Hourly average ambient variables:
              Temperature (AT) in °C
              Exhaust Vacuum (V) in cm Hg
              Ambient Pressure (AP) in millibar
              Relative Humidity (RH) in %

(a) Obtain the summary and anova outputs from the multiple linear regression model fitted with all the predictors listed above. Include the outputs in your assignment.

Use the outputs from part (a) to answer the following questions. Use α = 0.01 for all hypothesis tests.

(b) State the value of the F statistic used to test the hypothesis that $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$ versus $\beta_1 \neq 0$ or $\beta_2 \neq 0$ or $\beta_3 \neq 0$ or $\beta_4 \neq 0$. What is the conclusion from this test?

(c) Is there evidence that a model with AT and V is better than a model with just AT? State the relevant test statistic, p-value and conclusion.

(d) Conduct the appropriate sequential F test to test whether a model containing all the predictors is preferred over a model with AT as the only predictor. State the relevant test statistic, p-value and conclusion.

(e) Is there evidence that AP is related to the response in the presence of AT, V and RH? State the relevant test statistic, p-value and conclusion.

(f) Obtain a 99% prediction interval for the response PE when the observed values of the predictors are given by (AT, V, AP, RH) = (27, 51, 1017, 44).
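For parts (c) and (d), the test statistic compares nested models using the extra sum of squares explained by the added predictors. The sketch below assumes the usual form of that statistic, namely the extra sum of squares per added parameter divided by the residual mean square of the larger model. The function name `partial_f` and all numbers are hypothetical and are not taken from Power.txt; check the form of the statistic against the lecture notes before relying on it.

```python
def partial_f(extra_ss, extra_df, ss_res_full, df_res_full):
    """F statistic for comparing a reduced model with a larger nested model.

    extra_ss     sum of the sequential sums of squares for the added predictors
    extra_df     number of added predictors
    ss_res_full  residual sum of squares of the larger model
    df_res_full  residual degrees of freedom of the larger model
    """
    mse_full = ss_res_full / df_res_full
    return (extra_ss / extra_df) / mse_full


# Hypothetical values, chosen only to show the arithmetic.
f_stat = partial_f(extra_ss=120.0, extra_df=3, ss_res_full=400.0, df_res_full=95)
print(f"F = {f_stat:.3f} on (3, 95) degrees of freedom")
```

In R, the sequential sums of squares and residual row needed as inputs appear in the anova output requested in part (a).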


6. (MATH2931 only) Suppose we have the full rank linear model $y = X\beta + \varepsilon$ with $n \times p$ design matrix $X$ and normal errors $\varepsilon \sim N(0, \sigma^2 I_{n \times n})$. Let $b$ be the least squares estimator of $\beta$.

(a) Show that for any symmetric and idempotent matrix $A$, all eigenvalues are either zero or one, and $\mathrm{rank}(A) = \mathrm{tr}(A)$.

Hint: Apply the spectral decomposition to $A$.

(b) Prove that $\mathrm{rank}(H) = \mathrm{tr}(H) = p$ and $\mathrm{rank}(I - H) = \mathrm{tr}(I - H) = n - p$, where $H = X(X^\top X)^{-1} X^\top$ is the hat matrix.

(c) Prove that

$$\frac{(b - \beta)^\top X^\top X (b - \beta)}{\sigma^2}$$

follows the $\chi^2_p$ distribution.

Hint: Write $Xb$ in terms of $X$, $\beta$ and $\varepsilon$.

(d) Hence derive the $100(1 - \alpha)\%$ joint confidence region for $\beta$ given in the notes,

$$\frac{(b - \beta)^\top X^\top X (b - \beta)}{p \hat{\sigma}^2} \le F_{\alpha; p, n-p},$$

where $F_{\alpha; p, n-p}$ denotes the upper $\alpha$th quantile of the $F_{p, n-p}$ distribution.
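The trace results in part (b) can be illustrated on a small design matrix. The sketch below assumes a simple linear regression design with $n = 5$ and $p = 2$ (an intercept column and one invented predictor), forms $H = X(X^\top X)^{-1} X^\top$ with a hand-written 2 x 2 inverse, and checks that $H$ is idempotent with trace $p$. The helpers `transpose` and `matmul` are written for this example only, and the check does not replace the general proof.

```python
# Synthetic predictor values, invented for illustration.
x = [1.0, 2.0, 4.0, 5.0, 7.0]
X = [[1.0, xi] for xi in x]  # n x p design matrix with an intercept column
n, p = len(X), len(X[0])


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, c):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * c[k][j] for k in range(len(c))) for j in range(len(c[0]))]
        for i in range(len(a))
    ]


# X'X is 2 x 2 here, so it can be inverted in closed form.
XtX = matmul(transpose(X), X)
(a11, a12), (a21, a22) = XtX
det = a11 * a22 - a12 * a21
XtX_inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]

# Hat matrix H = X (X'X)^{-1} X'.
H = matmul(matmul(X, XtX_inv), transpose(X))

# Trace of H and of I - H.
trace_H = sum(H[i][i] for i in range(n))
trace_I_minus_H = n - trace_H

# Largest entrywise difference between H H and H (idempotency check).
HH = matmul(H, H)
max_dev = max(abs(HH[i][j] - H[i][j]) for i in range(n) for j in range(n))

print(f"tr(H) = {trace_H:.6f}, tr(I - H) = {trace_I_minus_H:.6f}")
```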