Tutorial 1

Learning objectives

After this tutorial the student should be able to:

  • identify: population, sample, unit, variables (quantitative: discrete, continuous; qualitative: nominal, ordinal);
  • recognize and interpret a bar chart and a histogram;
  • name and draw (by hand) an appropriate plot for a given variable;
  • interpret and construct a frequency table and a relative frequency table;
  • interpret and calculate cumulative frequencies;
  • choose the correct measure for central tendency for a given variable;
  • determine and interpret the mode, median and mean.

Important concepts

Read for an introduction to the important concepts of population, sample, unit and variable:

    • paragraphs 1.1 and 1.2 pp.2-9, and

    • paragraph 4.6 pp.164-166, or

    • paragraphs 1.1 and 1.2 pp.2-8, and

    • paragraph 4.6 pp.155-157.

Descriptive analysis for one variable: visualization

Read:

    • paragraphs 3.1 and 3.2 pp.60-66, and

    • paragraph 3.3 pp.66-75 (up to stem-and-leaf plot), or

    • paragraphs 3.1 and 3.2 pp.56-62, and

    • paragraph 3.3 pp.62-72 (first 9 lines).

Two different graphical representations for a single variable are discussed: the bar chart and the histogram.

A bar chart or bar plot is applied for categorical or discrete data.

A bar chart displays the count for each distinct category or value as a separate bar, allowing you to compare categories visually.

There are small gaps between the bars. They indicate that the data is categorical or discrete. There are many variations of the bar chart.

Example 1.1: college majors

University officials periodically review the distribution of undergraduate majors within the colleges of the university to help determine a fair allocation of resources to departments within the colleges. At one review, the following data were obtained (see Table 1), which were presented in a bar chart as shown in Figure 1.

Table 1: Number of majors per college with college name abbreviation (colAbbrev).

college                  noMajors  colAbbrev
Agriculture                  1500  Agric.
Arts and Sciences           11000  ArtsandSc.
Business Administration      7000  BusAdm.
Education                    2000  Educ.
Engineering                  5000  Engin.

Figure 1: Bar chart of number of undergraduate majors per college.
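
A bar chart like Figure 1 could be produced in R along the following lines. This is a minimal sketch, not necessarily the code used for the figure: the data frame name `majors` is an assumption made here, while the column names follow Table 1.

```r
# Data from Table 1; the object name `majors` is chosen here for illustration.
majors <- data.frame(
  college   = c("Agriculture", "Arts and Sciences", "Business Administration",
                "Education", "Engineering"),
  noMajors  = c(1500, 11000, 7000, 2000, 5000),
  colAbbrev = c("Agric.", "ArtsandSc.", "BusAdm.", "Educ.", "Engin.")
)

# One bar per college; barplot() leaves gaps between the bars by default,
# which matches the convention for categorical data.
barplot(height = majors$noMajors,
        names.arg = majors$colAbbrev,
        xlab = "College",
        ylab = "Number of majors")
```

The abbreviations are used as bar labels because the full college names tend to overlap on the horizontal axis.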

Note: In newspapers and non-scientific journals, data like these are often presented in a so-called pie chart (see Figure 2). However, in scientific papers bar charts are preferred, because they are often clearer.

Figure 2: Pie chart of number of undergraduate majors per college.

A histogram can be used for continuous data and for discrete data with many different outcomes.

It displays the frequency (or relative frequency) of the data. The range is subdivided into classes, usually of equal width. The height of each rectangle (bar) corresponds to the count (or proportion) of values of the variable falling within that class interval.

Important remarks about histograms:
  • A histogram also has rectangles but now these cover the full class interval without gaps in between; the rectangles are plotted along an interval scale.

  • A histogram shows the shape, center, and spread of the distribution. The choice of class width or the number of classes can heavily influence the shape/impression of the histogram.

  • A histogram may also be suitable for a discrete variable with many distinct outcomes: when the outcomes are grouped into classes, the variable can be treated approximately as continuous.

Table 2: Area of cultivation under glass for 33 growers.
53 39 73 98 49 50 42 63 61 63 19
30 39 100 30 30 20 20 40 59 25 22
44 25 22 24 36 49 39 35 29 43 31

Example 1.2: Cultivation in greenhouses

A researcher carried out a small study on cultivation under glass. He asked 33 growers for their area of cultivation under glass. The results are shown in Table 2.

Figure 3: Histogram of area of cultivation under glass.

Most statistical software programs, like R, choose the classes automatically when creating a histogram (see Figure 3). Here R has chosen classes with a width of 10 units. The histogram shows that the sample distribution is skewed to the right. The two largest observations could be called outliers (or extreme values).
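
A histogram of the Table 2 data could be drawn in R as sketched below. The vector name `area` and the axis labels are assumptions made here; no unit is added to the labels because the table does not state one. The second call only illustrates the remark that the class width influences the impression of the histogram.

```r
# Data from Table 2; the vector name `area` is chosen here for illustration.
area <- c(53, 39, 73, 98, 49, 50, 42, 63, 61, 63, 19,
          30, 39, 100, 30, 30, 20, 20, 40, 59, 25, 22,
          44, 25, 22, 24, 36, 49, 39, 35, 29, 43, 31)

# hist() chooses the class boundaries itself unless `breaks` is supplied.
h <- hist(area,
          main = "Area of cultivation under glass",
          xlab = "Area")

h$breaks  # class boundaries chosen by R
h$counts  # number of growers in each class

# The same data with a class width of 20 instead of the default choice.
hist(area,
     breaks = seq(0, 100, by = 20),
     main = "Same data, class width 20",
     xlab = "Area")
```

Inspecting `h$breaks` and `h$counts` is a quick way to check which classes were used before interpreting the shape of the plot.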

Descriptive analysis for one variable: measures of central tendency

Read:

    • paragraph 3.4 pp.82-90 (skip grouped data median and Example 3.4), or
    • paragraph 3.4 pp.78-85 (skip grouped data median and Example 3.4).

Measures of central tendency for a sample are discussed: the mode, the median, and the (arithmetic) mean.

The mode of a set of measurements is the measurement value with the highest frequency.

The median of a set of measurements is the middle value when the measurements are arranged from lowest to highest. With an even number of observations there is no single middle value, so an additional rule is needed; a common convention is to take the average of the two middle values.

The (arithmetic) mean \(\bar{y}\) of a set of measurements \(y\) is the sum of the measurements divided by the total number of measurements.
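
The three definitions can be tried out on a small data set. The sketch below uses made-up measurements (the vector `y` is hypothetical and not taken from the tutorial) with an even number of observations, so that the median rule for that case is visible. Base R has no function that returns the statistical mode, so it is derived here from a frequency table.

```r
# Hypothetical measurements, made up for illustration only.
y <- c(2, 3, 3, 5, 7, 8)

# Mode: the value with the highest frequency.
freq <- table(y)
mode_y <- as.numeric(names(freq)[freq == max(freq)])

# Median: with an even number of observations, median() averages
# the two middle values (here 3 and 5).
median_y <- median(y)

# Mean: sum of the measurements divided by the number of measurements.
mean_y <- sum(y) / length(y)
```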

Example 1.3: Number of plots

A researcher registers the number of plots (variable \(y\)) for \(55\) farmers whose companies are of nearly the same size. The results are given in Table 3, and the corresponding bar chart is displayed in Figure 4.

Table 3: Frequency of the number of plots per farmer (total of 55 farmers).

Number of plots  Frequency
              1          3
              2          5
              3          5
              4          7
              5          9
              6          7
              7          8
              8          4
              9          3
             10          2
             11          0
             12          2

Figure 4: Bar chart of frequencies for number of plots per farmer.

The mode equals \(5\).
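With \(55\) observations, the median is the \((55 + 1)/2 = 28^{\mbox{th}}\) observation in the ordered data. The cumulative frequency first reaches \(28\) at the value \(5\) (cumulative frequencies \(3, 8, 13, 20, 29\)). Therefore, the median is equal to \(5\).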
The mean is \(\bar{y} = (3 \times 1 + 5 \times 2 + \ldots + 2 \times 12)\ /\ 55 \approx 5.491\).
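
These values can be checked directly from the frequency table. The sketch below is one possible way to do so, with object names (`plots_freq`, `rel_freq`, `cum_freq`) chosen here for illustration. It also adds the relative and cumulative frequencies. The median step assumes an odd number of observations, as is the case here.

```r
# Frequencies from Table 3; object names are chosen here for illustration.
plots_freq <- data.frame(plots = 1:12,
                         freq  = c(3, 5, 5, 7, 9, 7, 8, 4, 3, 2, 0, 2))
n <- sum(plots_freq$freq)                       # total number of farmers

plots_freq$rel_freq <- plots_freq$freq / n      # relative frequencies
plots_freq$cum_freq <- cumsum(plots_freq$freq)  # cumulative frequencies

# Mode: the value with the highest frequency.
mode_plots <- plots_freq$plots[which.max(plots_freq$freq)]

# Median (odd n): first value whose cumulative frequency reaches
# position (n + 1) / 2 in the ordered data.
median_plots <- plots_freq$plots[which(plots_freq$cum_freq >= (n + 1) / 2)[1]]

# Mean: frequency-weighted sum of the values divided by n.
mean_plots <- sum(plots_freq$plots * plots_freq$freq) / n
```

Expanding the table with `rep(plots_freq$plots, times = plots_freq$freq)` gives a vector of the individual observations, to which `median()` and `mean()` can be applied as usual.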

R and R Commander can both provide a convenient summary of the data, using the commands shown below.

# R
summary(object = farmers_count$plots)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   1.000   4.000   5.000   5.491   7.000  12.000

# R Commander (numSummary() is provided by the RcmdrMisc package)
library(RcmdrMisc)
numSummary(data = farmers_count$plots)
#>      mean      sd IQR 0% 25% 50% 75% 100%  n
#>  5.490909 2.64486   3  1   4   5   7   12 55

Exercises to be done during the tutorial

Exercise 1.1 up to and including Exercise 1.5 are in the presentation handouts of Tutorial 1. For answers/feedback check Brightspace.

Exercises to be done after the tutorial

For answers/feedback check Brightspace.

Exercise 1.6

Do either

Exercise 1.7

Do either