MAST90044 Thinking and Reasoning with Data

Semester 1 2020

Assignment 1

Due: 8am, Monday 27 April

Instructions

• Assignments are to be submitted (uploaded) via Canvas.

• Please label your assignment with the following information:

– your name;

– your student number;

– your lab class;

– your tutor’s name.

• You must sign the plagiarism declaration. The link is available on the subject’s Canvas website.

• Your assignment should show all working and reasoning, as marks will be given for method as well as

for correct answers. Please spell check your document.

• Paste any R code and output into the appropriate places so that it can be seen easily along with your

other work. Graphics from R can be resized within your document; make them smaller as necessary.

• Assignments count for 50% of the assessment in this subject. This one is worth 15%, and covers the

work done in weeks 1 to 4.

• Tutors will not help you directly with assignment questions. However, they may give some help with

R.

• Solutions to the assignment questions will be made available later.

• When constructing a panel of graphs with multiple plots, it is helpful to use the R command

par(mfrow = c(nrows, ncols)), where nrows is the number of rows and ncols the number of columns

in the panel. The default is c(1, 1).
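
As a rough illustration of the par(mfrow = ...) tip above, the sketch below places two plots side by side. It uses simulated data only (the object names x and old_par are made up for this example) and is not a model answer to any question.

```r
# Simulated data only; not the assignment data.
set.seed(1)
x <- rnorm(100)

# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns

hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")

par(old_par)                      # restore the previous layout
```

Restoring the saved settings at the end means later plots are not unintentionally squeezed into the same panel.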

Q.1. The data set unesco.csv, available on the LMS, contains demographic and economic information from

the 1990 UNESCO yearbook on about half the world’s countries. Definitions of the variables in the

data set are as follows:

• Birth rate per 1,000 of population

• Death rate per 1,000 of population

• Infant deaths per 1,000 of population

• Life expectancy at birth for males

• Life expectancy at birth for females

• Gross National Product (GNP) per capita

• Geopolitical group

1 Eastern Europe (former Soviet Satellite)

2 South America and Mexico

3 Western Europe, North America, Japan

4 Middle East

5 Asia

6 Africa

• Country

Ignoring geopolitical group:

(a) Summarise the GNP values using summary statistics and two graphical tools. Briefly describe any

obvious features of the distribution.

(b) Use two graphical tools to compare the observed distribution of infant deaths with a normal

distribution. Briefly comment.

(c) Graphically examine the relationship between the infant death rate and GNP. Calculate the

correlation coefficient between the two variables. Comment on how useful it is in this situation.

(d) Graphically examine the relationship between life expectancy at birth for females and the birth

rate. Comment on the strength or otherwise of the relationship. Formulate a statistical model to

describe the relationship. Graphically fit the model.

Taking geopolitical group into account:

(e) Use two graphical tools to examine the relationship between life expectancy at birth for males and

geopolitical group. Use suitable R functions to calculate the mean and standard deviation for each

group, and the number of countries in each group. Comment on any obvious differences between

the groups and identify any clear outliers.

(f) Calculate the net population growth rate per 1,000 of population (we will call this “net growth”).

Type library(lattice) in R to ensure that the xyplot() function is available. Use xyplot()

to examine the relationship between net growth and GNP for each geopolitical group separately.

Note that in the matrix of plots, group 1 will be placed in the bottom left-hand corner, and you

proceed across the row of plots. Comment on what the plots show in regard to the relationship,

and any limitations of this type of plot here.

(g) Create a plot of net growth vs GNP for group 2 on its own. Calculate the correlation coefficient,

and comment on the strength and direction of the relationship.

[4 + 3 + 4 + 5 + 8 + 7 + 5 = 36 marks]
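
For readers unfamiliar with lattice, the following sketch shows the general shape of a conditioned scatter plot and of group-wise summaries. It runs on simulated data; the data frame sim and its columns x, y and group are hypothetical names and do not correspond to the variables in unesco.csv.

```r
library(lattice)

# Simulated data with hypothetical column names; not the UNESCO data.
set.seed(2)
sim <- data.frame(
  x     = runif(60, 0, 10),
  group = factor(rep(1:6, each = 10))
)
sim$y <- 2 + 0.5 * sim$x + rnorm(60)

# One scatter plot per level of `group`. With the default settings the
# first level is drawn in the bottom left-hand panel and the panels then
# fill across each row. layout = c(3, 2) asks for 3 columns and 2 rows.
p <- xyplot(y ~ x | group, data = sim, layout = c(3, 2))
print(p)

# Group-wise mean, standard deviation and count.
group_mean <- tapply(sim$y, sim$group, mean)
group_sd   <- tapply(sim$y, sim$group, sd)
group_n    <- table(sim$group)
```

tapply() is only one of several ways to obtain per-group summaries; aggregate() would serve equally well.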

Q.2. It is well known that quitting smoking is difficult. Many people who are trying to quit use nicotine

replacement methods like nicotine patches or nicotine gum to ease nicotine withdrawal symptoms. As

an alternative, medical researchers investigated whether the use of an antidepressant medication might

be a more effective aid to those attempting to give up cigarettes. In a study reported in the March 4,

1999 issue of the New England Journal of Medicine, researchers compared the effectiveness

of nicotine patches with that of the antidepressant bupropion, which is marketed under the

brand name Zyban. The study consisted of 893 participants who were randomly allocated to four

(i = 1, 2, 3, 4) treatment groups, listed in the table below. They did not know to which treatment

they were allocated, i.e. this was a single-blind study. The table shows the number of people not

smoking 6 months following the study, for each treatment.

Treatment                   Subjects not smoking (x_i)   Total subjects (n_i)
Placebo only                30                           160
Nicotine patch              52                           244
Zyban                       85                           244
Zyban and nicotine patch    95                           245

(a) Calculate the Wald, Agresti-Coull and Jeffreys prior 95% confidence intervals for each treatment

group separately. Draw the confidence intervals.

Comment briefly on your findings.

(b) Comment on the validity or otherwise of the assumptions made in these calculations.

(c) Find a point and an interval estimate of the difference in the proportions of those not smoking after

6 months between the ‘Zyban and nicotine patch’ group and the Zyban-only group.

Give an interpretation of the confidence interval. Make one comment, with supporting evidence

from above, on the claim that using a patch in addition to Zyban is effective for quitting.

(d) Construct a Wald confidence interval to test the claim that using a nicotine patch is no more

effective than using nothing at all. Interpret the confidence interval and give a reason for your

choice of confidence interval method.

(e) Provide a single Wald confidence interval to test the claim that Zyban, with or without a patch,

is better than doing nothing or using a patch. Interpret the confidence interval and give a reason

for your choice of confidence interval method.

[5 + 4 + 4 + 5 + 6 = 24 marks]
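
The three single-proportion intervals named in part (a) can be sketched in R as below. This assumes the usual textbook definitions (Agresti-Coull adds z²/2 successes and z²/2 failures; Jeffreys uses quantiles of a Beta(x + 1/2, n − x + 1/2) distribution); the version taught in lectures may differ slightly, for example in endpoint adjustments. The function names prop_ci and diff_wald_ci are made up for this example, and the counts are illustrative rather than taken from the table.

```r
# Three 95% intervals for a single proportion (assumed textbook forms).
prop_ci <- function(x, n, level = 0.95) {
  alpha <- 1 - level
  z <- qnorm(1 - alpha / 2)

  # Wald: estimate plus/minus z standard errors.
  p_hat <- x / n
  wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

  # Agresti-Coull: add z^2/2 successes and z^2/2 failures, then use Wald.
  n_tilde <- n + z^2
  p_tilde <- (x + z^2 / 2) / n_tilde
  ac <- p_tilde + c(-1, 1) * z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)

  # Jeffreys: quantiles of the Beta(x + 1/2, n - x + 1/2) distribution.
  jeff <- qbeta(c(alpha / 2, 1 - alpha / 2), x + 0.5, n - x + 0.5)

  out <- rbind(Wald = wald, AgrestiCoull = ac, Jeffreys = jeff)
  colnames(out) <- c("lower", "upper")
  out
}

# Wald interval for a difference of two independent proportions.
diff_wald_ci <- function(x1, n1, x2, n2, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  c(estimate = p1 - p2, lower = p1 - p2 - z * se, upper = p1 - p2 + z * se)
}

# Illustrative counts only; not the study data.
ci   <- prop_ci(x = 20, n = 100)
diff <- diff_wald_ci(x1 = 30, n1 = 100, x2 = 20, n2 = 100)
```

Returning the intervals as a matrix with named rows makes it straightforward to draw them afterwards, for instance with segments() on an empty plot.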

Q.3. The chi-squared distribution, denoted by $X \sim \chi^2_\nu$, is used a great deal in statistics and science, and we

will meet it again later. The exact shape of the distribution depends on the degrees of freedom ($\nu$): at

larger values of $\nu$ the chi-squared distribution approaches a normal distribution, while smaller values give a stronger

departure from the normal distribution. Here we will examine how quickly the sampling distribution of the sample

mean taken from a $\chi^2_2$ distribution converges to normality (or at least to symmetry).

(a) Take a large sample from the $\chi^2_2$ distribution and assess its departure from normality using two

graphical tools. You will need the R function rchisq. Comment on the result.

(b) Examine the sampling distribution of the sample mean from samples of size 5, by generating 1000

such samples and looking at a plot of the density (make a comment about the distribution).

(c) Compare the sampling distribution of the sample mean for a range of sample sizes (e.g. 1, 5, 10,

20, 40, 80), and use your results to suggest how large the sample size needs to be for adequate

convergence. The mean of a $\chi^2_\nu$ distribution is $\nu$.

[5 + 3 + 5 = 13 marks]
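
One possible way to simulate a sampling distribution of the sample mean with rchisq is sketched below. The helper name sample_means is made up for this example, and the sample size (10) and degrees of freedom (4) are arbitrary illustrative values, not the ones asked for in the question.

```r
set.seed(3)

# Draw `reps` samples of size `n` from a chi-squared distribution with
# `df` degrees of freedom and return the mean of each sample.
sample_means <- function(n, df, reps = 1000) {
  replicate(reps, mean(rchisq(n, df = df)))
}

# Arbitrary illustrative settings.
means <- sample_means(n = 10, df = 4)

# Two graphical checks side by side: a density plot and a normal Q-Q plot.
old_par <- par(mfrow = c(1, 2))
plot(density(means), main = "Density of sample means")
qqnorm(means)
qqline(means)
par(old_par)

# The average of the sample means should be close to df.
avg <- mean(means)
```

Calling sample_means() inside sapply() or a loop over several values of n would give a set of vectors that can be plotted in one panel for comparison.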

Total marks = 73