Tutorial 4

Learning objectives

After this tutorial the student should be able to:

  • recognize a continuous probability distribution (i.e., probability density function - PDF);

  • mention the three properties and the statistical notation of a Normal distribution;

  • mention and explain the statistical notation of the Standard Normal distribution;

  • mention and apply the formula for the Z-transformation;

  • determine the probability for a given z-value using the table of the Standard Normal Distribution;

  • determine the z-value for a given probability using the table of the Standard Normal Distribution;

  • give the estimators for the population mean \(\mu\), the population variance \(\sigma^2\), and the population standard deviation \(\sigma\) of a continuous random variable;

  • explain the concept of bias with respect to the estimators for \(\mu\), \(\sigma^2\), and \(\sigma\) of a continuous random variable;

  • determine the estimates for the population mean \(\mu\), the population variance \(\sigma^2\), and the population standard deviation \(\sigma\) given the data;

  • mention and apply the rule of thumb for validity of the normal approximation, i.e., the empirical rule;

  • interpret a Q-Q plot.

Pre-class activity

Watch:

The clip is linked on Brightspace.

The empirical rule

Read:

    • paragraph 3.5 pp.100-103 starting at line 5 on p.100 up to end of example 3.13, or
    • paragraph 3.5 pp.93–96 starting just below example 3.10 to the end of example 3.12,

for a discussion of the empirical rule and its application.

Note: The empirical rule

Given a set of \(n\) measurements possessing a bell-shaped histogram (representing the density function), then:

  • the interval \(\bar{y} \pm 1 \times s\) contains approximately \(68\%\) of the measurements,

  • the interval \(\bar{y} \pm 2 \times s\) contains approximately \(95\%\) of the measurements,

  • the interval \(\bar{y} \pm 3 \times s\) contains approximately \(99.7\%\) of the measurements.
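The three percentages correspond to the probability that a normally distributed variable falls within 1, 2 or 3 standard deviations of its mean. The following is a minimal Python sketch (standard library only) that computes these theoretical values and also counts the observed share in a sample. The function name `share_within` and the small data set are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist, mean, stdev

# Theoretical share within k standard deviations of the mean for a normal variable.
std_normal = NormalDist(mu=0, sigma=1)
theoretical = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

def share_within(data, k):
    """Fraction of observations inside the interval y_bar +/- k * s."""
    y_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divisor n - 1)
    lower, upper = y_bar - k * s, y_bar + k * s
    return sum(lower <= y <= upper for y in data) / len(data)

# Hypothetical illustration data, not taken from the textbook.
example = [2, 4, 4, 5, 5, 5, 6, 6, 8]

for k in (1, 2, 3):
    print(k, round(theoretical[k], 4), round(share_within(example, k), 3))
```

Comparing the observed shares with the theoretical ones is how the empirical rule is checked for a given data set. With very few observations, as in this toy example, large deviations from 68%, 95% and 99.7% are to be expected.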

Random variables: continuous random variables

Read:

    • paragraph 4.9 pp.177-180, or
    • paragraph 4.9 pp.168–171,

where the term probability density function \(f(y)\) for a continuous random variable is discussed.

Read:

    • paragraph 4.10 pp.180-187, or
    • paragraph 4.9 pp.171–178,

discussing the best-known example of a continuous distribution: the normal distribution (also called the Gaussian distribution) and the standard normal distribution.

The normal distribution with parameters \(\mu\) and \(\sigma\) is denoted as \(y \sim \mbox{N}(\mu,\ \sigma)\).

  • Probability density function: \(f(y) = \frac{1}{\sigma \sqrt{2\pi}}\ e^{-\frac{1}{2} \left(\frac{y-\mu}{\sigma}\right)^2}\)

  • \(\mbox{E}(y) = \mu,\ \mbox{var}(y) = \sigma^2,\ \sqrt{\mbox{var}(y)} = \sqrt{\sigma^2} = \sigma\)

  • Variable \(y\) follows a standard normal distribution when \(\mu_y = 0\) and \(\sigma_y = 1\). This is denoted as: \(y \sim \mbox{N}(\mu = 0,\ \sigma = 1)\)

  • When \(y \sim \mbox{N}(\mu, \sigma)\), the standardized variable \(z = \frac{y - \mu}{\sigma}\) follows a standard normal distribution: \(z \sim \mbox{N}(\mu = 0,\ \sigma = 1)\).
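The Z-transformation can be sketched in a few lines of Python. The function name `z_transform` and the value \(y = 8\) are illustrative assumptions; the parameters \(\mu = 5\) and \(\sigma = 3\) are simply chosen as an example of a non-standard normal distribution.

```python
from statistics import NormalDist

def z_transform(y, mu, sigma):
    """Standardize y: number of standard deviations that y lies from mu."""
    return (y - mu) / sigma

# Illustrative values: y ~ N(mu = 5, sigma = 3), evaluated at y = 8.
mu, sigma, y = 5, 3, 8
z = z_transform(y, mu, sigma)  # (8 - 5) / 3 = 1.0

# P(y <= 8) under N(5, 3) equals P(z <= 1) under the standard normal.
p_original = NormalDist(mu, sigma).cdf(y)
p_standard = NormalDist(0, 1).cdf(z)
print(z, round(p_original, 4), round(p_standard, 4))
```

Both probabilities agree, which is why a single table of the standard normal distribution suffices for every normal distribution.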

Important: Remark about calculating normal probabilities

For calculating normal probabilities, see O&L Table \(1\) (on the inside of the cover of both editions) and:

  • O&L 7th Edition pp.1086-1087, or
  • O&L 6th Edition pp.1170–1171,

or use a graphing calculator.
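As a further alternative, the same lookups can be done in software. The sketch below assumes Python's standard library class `statistics.NormalDist`; the z-value 1.96 and the probability 0.975 are example inputs only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability for a given z-value: P(z <= 1.96), the area to the left of 1.96.
p_left = std_normal.cdf(1.96)

# Upper-tail and two-sided areas follow from symmetry.
p_right = 1 - p_left
p_between = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# z-value for a given probability: the z with P(z <= z_value) = 0.975.
z_value = std_normal.inv_cdf(0.975)

print(round(p_left, 4), round(p_right, 4), round(p_between, 4), round(z_value, 2))
```

Here `cdf` plays the role of reading a probability from the table for a given z-value, and `inv_cdf` the role of searching the table for the z-value that belongs to a given probability.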

The probability density function of the normal distribution is symmetric, see Figure 1 (a) and Figure 1 (b) for examples.

Figure 1: Normal distributions with (a) \(\mu = 0,\ \sigma = 1\) and (b) \(\mu = 5,\ \sigma = 3\).
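The symmetry can also be checked numerically. The sketch below writes out the density formula given earlier as a hypothetical helper `normal_pdf`, evaluates it at equal distances on either side of \(\mu\) for the parameters of panel (b), and cross-checks the result against Python's `statistics.NormalDist`. The distance 2 is an arbitrary choice.

```python
import math
from statistics import NormalDist

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma) at y, written out from the formula."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 5, 3  # parameters of panel (b)

# Symmetry: the density is the same at equal distances on either side of mu.
left = normal_pdf(mu - 2, mu, sigma)
right = normal_pdf(mu + 2, mu, sigma)

# Cross-check against the standard library implementation.
library_value = NormalDist(mu, sigma).pdf(mu + 2)
print(round(left, 5), round(right, 5), round(library_value, 5))
```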

Estimators for the population mean, variance and standard deviation of a continuous variable

This section will introduce a few extra theoretical considerations, which are not included in the textbook.

Population parameters (such as the mean and the variance) can be determined with certainty only when all possible outcomes are known, i.e., when the whole population is known. Usually, when undertaking a study, only a random sample from the population is available. The population parameters are then estimated based on the sample.

As an example, consider the population mean \(\mu\) of a quantitative variable \(y\). When taking a random sample, the population mean \(\mu\) is estimated by the sample mean \(\bar{y}\). Notation: \(\hat{\mu} = \bar{y}\).

A desired property of an estimator of a population parameter is that it is unbiased, which means that the expected value (i.e., the population mean) of the estimator is equal to the parameter itself. The estimator \(\bar{y}\) is an unbiased estimator of \(\mu\), because \(\mbox{E}(\bar{y}) = \mu\). It is very important to note that \(\bar{y}\) is itself a random variable, whose value varies from sample to sample even though the samples are taken from the same population.
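That the sample mean varies from sample to sample, while being correct on average, can be illustrated with a small simulation. This is only a sketch: the population values `MU = 500` and `SIGMA = 30`, the sample size `N = 10`, the number of repetitions and the random seed are all hypothetical choices.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

MU, SIGMA, N = 500, 30, 10  # hypothetical population and sample size

def draw_sample_mean():
    """Draw one random sample of size N and return its sample mean."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    return mean(sample)

# Each sample gives a different estimate of MU ...
sample_means = [draw_sample_mean() for _ in range(5000)]
print([round(m, 1) for m in sample_means[:5]])

# ... but averaged over many samples the estimates centre on MU.
print(round(mean(sample_means), 2))
```

The first line of output shows five different sample means; the second shows that their long-run average lies close to the population mean, which is what unbiasedness expresses.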

If you have a random sample of independent observations \(y_1, y_2, \ldots, y_n\) with \(\mbox{E}(y) = \mu\) and \(\mbox{var}(y) = \sigma^2\), then:

  • \(\hat{\mu} = \bar{y}\) is an unbiased estimator of \(\mu\).

  • \(\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}\) is an unbiased estimator of \(\sigma^2\).

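  • \(\hat{\sigma} = s = \sqrt{s^2}\) is the usual estimator of \(\sigma\). Strictly speaking, \(s\) is slightly biased (on average it underestimates \(\sigma\)), but the bias is small unless \(n\) is very small.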
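The three estimators can be written out directly from their formulas. In the sketch below the function name `estimates` and the five observations are hypothetical; the last line compares the results with the corresponding functions of Python's `statistics` module, which also divide by \(n - 1\).

```python
import math
from statistics import mean, variance, stdev

def estimates(data):
    """Return (mu_hat, sigma2_hat, sigma_hat) computed from the formulas."""
    n = len(data)
    y_bar = sum(data) / n
    s2 = sum((y - y_bar) ** 2 for y in data) / (n - 1)  # divisor n - 1, not n
    return y_bar, s2, math.sqrt(s2)

# Hypothetical observations, for illustration only.
data = [12.1, 9.8, 11.4, 10.3, 10.9]

mu_hat, sigma2_hat, sigma_hat = estimates(data)
print(mu_hat, sigma2_hat, sigma_hat)

# The standard library functions use the same definitions.
print(mean(data), variance(data), stdev(data))
```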

Please note that statisticians use the word “mean” loosely, also for the population mean, which is actually an expected value. Whether “mean” refers to a population mean or to a sample mean can be derived from the context. It is nevertheless of vital importance for statistical inference to distinguish between the two. The sample mean is an unbiased estimator of the population mean (i.e., the expected value).

Checking whether or not a population distribution is normal: Q-Q plot

For checking the normality assumption of observations, a normal probability plot is used, also called a Quantile-Quantile plot or Q-Q plot. The observations are plotted on the vertical axis against a percentile score on the horizontal axis. The horizontal probability scale is non-linear and chosen such that the points for normally distributed data are expected to lie close to a straight line. The aim of this type of graph is to judge by eye whether (isolated) outliers or extremes occur in the data, and whether the shape deviates (strongly) from the straight line representing normality. R shows a linear quantile scale on the horizontal axis instead, hence the name Quantile-Quantile plot or Q-Q plot.
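The coordinates behind a Q-Q plot with a linear quantile scale can be sketched as follows. The helper `qq_points` and the eight observations are hypothetical, and the plotting positions \((i - 0.5)/n\) are one common convention assumed here; statistical software may use slightly different positions.

```python
from statistics import NormalDist

def qq_points(data):
    """Pair each ordered observation with a standard normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist(0, 1)
    # Plotting positions (i - 0.5) / n: one common convention, assumed here.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical observations, for illustration only.
data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4]

for q, y in qq_points(data):
    print(f"{q:6.2f}  {y:5.1f}")
```

Plotting the second column against the first gives the Q-Q plot: points that lie close to a straight line are compatible with normality, whereas a clear curve or isolated points far from the line suggest a deviation.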

For a more elaborate explanation, read:

    • paragraph 4.14 pp.203-206 including lines 6-10 under Figure 4.28, but skipping the part about the R code in lines 1-5, or
    • paragraph 4.14 pp.194-197 including lines 1-3 under Figure 4.28,

and watch:

Exercises to be done during the tutorial

Exercise 4.1 and Exercise 4.2 are in the presentation handouts of Tutorial 4. For answers/feedback check Brightspace.

Post-class activity

Watch:

All of the clips are linked on Brightspace.

Exercises to be done after the tutorial

For answers/feedback check Brightspace.

Exercise 4.3

A machine produces packages of coffee. The random variable \(y\) is the weight of a randomly selected package of coffee: \(\mbox{E}(y) = \mu\) and \(\mbox{var}(y) = \sigma^2\).

A random sample of 10 packages is taken; the weights (in g) of these packages are: 485, 540, 505, 510, 465, 455, 515, 560, 525, 510

Use the R Commander output in Table 1 to answer the questions below.

Table 1: Numerical summary of the coffee weights.
mean  sd       IQR   0%   25%  50%  75%    100%  n
507   32.0763  32.5  455  490  510  522.5  560   10
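For readers without R Commander, the summary in Table 1 can be reproduced from the ten weights with Python's standard library. This is a sketch, not the method used to produce the table: the choice `method="inclusive"` for the quartiles is an assumption, made because it matches the values shown.

```python
from statistics import mean, stdev, quantiles

weights = [485, 540, 505, 510, 465, 455, 515, 560, 525, 510]  # in g

# method="inclusive" interpolates between the ordered observations; it is
# assumed here because it reproduces the quartiles shown in Table 1.
q1, q2, q3 = quantiles(weights, n=4, method="inclusive")

summary = {
    "mean": mean(weights),
    "sd": round(stdev(weights), 4),
    "IQR": q3 - q1,
    "0%": min(weights),
    "25%": q1,
    "50%": q2,
    "75%": q3,
    "100%": max(weights),
    "n": len(weights),
}
print(summary)
```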

a. Give the estimate of \(\mu\).

b. Give the estimate of \(\sigma\).

c. Calculate the estimate of \(\sigma^2\).

d. Give the median.

e. Although 10 is a very small sample size and we can’t be sure, would you expect, based on the weights observed in this sample, that the weight of the packages of coffee is normally distributed? Explain your answer.

Exercise 4.4

This exercise uses the slightly modified data of Exercise 3.29 O&L \(7^{\mbox{th}}\) Edition p.133 [O&L \(6^{\mbox{th}}\) Edition p.125]. You can use Table 2 (with the modified ordered data) given below, to answer this question.

The mean and the standard deviation of this sample of treatment times, as given in Table 2, are: \(\bar{y} = 20.58\), \(s \approx 9.19\).

Table 2: Treatment times at a health clinic for 50 patients.
7 11 13 15 17 20 21 24 28 32
8 12 13 16 17 20 22 24 29 33
10 12 14 16 18 21 22 24 29 35
10 12 15 16 18 21 24 26 29 45
11 12 15 16 19 21 24 27 31 54

a. Check whether the empirical rule applies to these data (i.e., are approximately 68% of the measurements within \(\bar{y} \pm 1 \times s\); are approximately 95% of the measurements within \(\ldots\); are \(\ldots\)?).

b. Is it likely that this sample comes from a normal distribution?

Exercise 4.5

Do either

Exercise 4.6

Do either