STA304H1F/1003HF Winter 2021 Assignment # 2

Posted: Friday, February 26, 2021

Due: Online into Quercus Assignment 2 by 8pm on Monday, March 8, 2021

Note: E-mail submissions will NOT be accepted. Late assignments will be accepted but are subject to a penalty of 1% of the total assignment marks per hour late. Submissions will not be accepted more than 48 hours after the due date.

Students who would like additional accommodations should email the instructional team at sta304@utoronto.ca

at least 48 hours before the assignment is due.

Instructions: • Answer both (2) questions of this assignment.

• Each assignment should be written up independently. If you work with other students on Question 1,

indicate the names of the students on your solutions. Question 2 should contain unique answers.

• Presentation of solutions is important. Assignments should be word-processed and presented neatly.

• Use proper statistical terminology and proper English language.

• Supporting material, such as unrequested R code and extraneous output, is optional. However, if you choose to include it, please place it in a separate appendix at the end of your assignment.

• Compile your entire solution, including your Appendix, as a PDF or Word file (LaTeX or R Markdown can be your base). Submit your assignment as a PDF or Word file into the Quercus Assignment named ‘Assignment 2’.

Grading: The grand total is 33 marks which includes 3 marks for excellent presentation. A general

marking scheme for most parts is given below:

Per Question Part

• 3 points: Complete, correct and clearly written answers. Answers model individual preparation and academic honesty (where applicable).

• 2 points: Good answers that are unclear, contain a few mistakes, or have missing components. Answers demonstrate some individual preparation and some academic honesty (where applicable).

• 1 point: Poor answers or many missing components. Most answers do not demonstrate individual preparation or academic honesty (where applicable).

• 0 points: Missing or incomprehensible answers. Answers do not demonstrate academic integrity.

Presentation

• 3 points: well presented, easy to read, proper English used, R code shown only where required.

• 2 points: good presentation, some unnecessary R code and unformatted output

• 1 point: poor presentation, handwritten, hand-drawn diagrams, unnecessary R code and unformatted output

• 0 points: illegible, missing, unclear presentation

1. (10 marks) Consider the Mainstreet Research Survey report from December 8, 2020, found at the following link:

https://www.mainstreetresearch.ca/poll/ontario-survey-doug-fords-handling-of-hydro-rates-carbon-tax-and-covid-19-pandemic/


(a) (1 mark) Choose one of the survey questions and identify one parameter of interest.

(b) Based on the relevant cross-tabulation table below your selected survey question, choose one stratification variable and show the following:

i. (3 marks) use weighted frequency to compute an estimate of your population parameter,

and place a bound on the error of estimation, and

ii. (3 marks) use unweighted frequency to compute an estimate of your population parameter,

and place a bound on the error of estimation.

(c) (3 marks) Compare the two estimates in part (b) above. Explain which is a post-stratified

estimate.
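As a reminder of the mechanics behind parts (b) and (c), the following is a minimal sketch of a stratum-weighted proportion and a pooled (unweighted) proportion, each with a bound of two standard errors. All numbers (`W_h`, `n_h`, `y_h`) are hypothetical and are not taken from the Mainstreet report. The sketch also assumes the finite population correction can be ignored and that the pooled bound may treat the respondents as a simple random sample.

```r
# Hypothetical inputs for three strata (NOT from the survey report)
W_h <- c(0.30, 0.50, 0.20)   # assumed population shares of the strata
n_h <- c(150, 150, 100)      # respondents per stratum
y_h <- c(60, 90, 35)         # respondents per stratum giving the answer of interest

p_h <- y_h / n_h             # stratum sample proportions

# Weighted estimate: stratum proportions combined with population shares
p_w <- sum(W_h * p_h)
B_w <- 2 * sqrt(sum(W_h^2 * p_h * (1 - p_h) / (n_h - 1)))

# Unweighted estimate: all respondents pooled together
n   <- sum(n_h)
p_u <- sum(y_h) / n
B_u <- 2 * sqrt(p_u * (1 - p_u) / (n - 1))

round(c(weighted = p_w, bound_w = B_w, unweighted = p_u, bound_u = B_u), 4)
```

The two estimates differ here only because the hypothetical sample shares `n_h / n` do not match the assumed population shares `W_h`.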

2. (20 marks) Consider the baseball dataset describing the population of baseball players in the data

file baseball.csv. Once, at the beginning of your R code, set the seed of your randomization to be the last 4 digits of your student number.

The R package ‘sampling’, which includes the functions strata and getdata, is useful for this question. The following R code shows how to install and load the package.

install.packages("sampling")

# load the sampling package to use the functions strata and getdata

library(sampling)

(a) (3 marks) Take a stratified random sample of 150 players, using proportional allocation with

the different teams as strata (teams are in column 1 of the data file). Describe how you selected

the sample. Show the R code used to obtain your stratified sample.
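For orientation, here is a small sketch of how `strata` and `getdata` can be combined with proportional allocation. It runs on a made-up data frame `toy` with hypothetical columns `team` and `salary`, a hypothetical seed, and a sample size of 20; none of these come from baseball.csv. It assumes the `sampling` package is installed.

```r
library(sampling)

set.seed(1234)  # hypothetical seed, for illustration only

# Hypothetical toy population: three teams of unequal size
toy <- data.frame(
  team   = rep(c("A", "B", "C"), times = c(50, 30, 20)),
  salary = round(runif(100, 200, 5000))
)

# strata() expects the data sorted by the stratification variable,
# with the sizes given in the same order as the strata appear
toy <- toy[order(toy$team), ]

n   <- 20
N_h <- as.vector(table(toy$team))   # stratum sizes in sorted order
n_h <- round(n * N_h / sum(N_h))    # proportional allocation
# Rounding can make sum(n_h) differ from n; check and adjust if needed

# Simple random sampling without replacement within each stratum
idx  <- strata(toy, stratanames = "team", size = n_h, method = "srswor")
samp <- getdata(toy, idx)

table(samp$team)
```

With real data the rounded stratum sizes may not add up to the target total, so the allocation usually needs a manual check before sampling.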

(b) (3 marks) Find the mean of the variable logsal = ln(salary), using your stratified sample, and

give a 95% CI.
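A stratified mean and its confidence interval can be sketched as follows. The stratum sizes `N_h`, sample sizes `n_h`, and the simulated `logsal` values are hypothetical, and the interval assumes a normal approximation with the finite population correction applied in each stratum.

```r
set.seed(1234)  # hypothetical seed, for illustration only

# Hypothetical stratum population sizes and sample sizes
N_h <- c(A = 50, B = 30, C = 20)
n_h <- c(A = 10, B = 6,  C = 4)

# Simulated stratified sample (NOT the baseball data)
samp <- data.frame(
  team   = rep(names(n_h), times = n_h),
  logsal = rnorm(sum(n_h), mean = 6, sd = 1)
)

# Per-stratum sample means and variances, in the same order as N_h
ybar_h <- tapply(samp$logsal, samp$team, mean)[names(N_h)]
s2_h   <- tapply(samp$logsal, samp$team, var)[names(N_h)]

W_h <- N_h / sum(N_h)   # stratum weights

# Stratified estimate of the mean and its estimated variance
ybar_st <- sum(W_h * ybar_h)
var_st  <- sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h)

# Approximate 95% confidence interval
ci <- ybar_st + c(-1, 1) * qnorm(0.975) * sqrt(var_st)
ci
```

The same pattern applies to a proportion if the variable is coded as 0/1.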

(c) (3 marks) Estimate the proportion of players in the data set who are pitchers, using your

stratified sample, and give a 95% CI.

(d) (3 marks) Take a simple random sample of 150 players and repeat part (c). How does your

estimate compare with that of part (c)?

(e) (3 marks) Examine the sample variances of logsal in each stratum. Do you think optimal

allocation would be worthwhile for this problem?

(f) (5 marks) Using the sample variances from (e) to estimate the population stratum variances,

determine the optimal allocation for a sample in which the cost is the same in each stratum

and the total sample size is 150. How much does the optimal allocation differ from proportional

allocation for this scenario?
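When the cost per unit is the same in every stratum, optimal allocation reduces to Neyman allocation, with stratum sample sizes proportional to N_h S_h. The sketch below compares it with proportional allocation using hypothetical stratum sizes `N_h`, hypothetical standard deviations `s_h`, and a total sample size of 20; these values are assumptions for illustration only.

```r
# Hypothetical stratum sizes and sample standard deviations
N_h <- c(A = 50,  B = 30,  C = 20)
s_h <- c(A = 1.2, B = 0.6, C = 0.9)
n   <- 20

# Neyman allocation (equal costs): n_h proportional to N_h * s_h
neyman <- n * N_h * s_h / sum(N_h * s_h)

# Proportional allocation: n_h proportional to N_h
prop <- n * N_h / sum(N_h)

# Side-by-side comparison before rounding to whole units
round(cbind(proportional = prop, neyman = neyman), 2)
```

Strata with larger variability receive more of the sample under Neyman allocation, so the two allocations are close whenever the stratum standard deviations are similar.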