Computer Practicum 5

This computer practicum contains the following four parts:

Exact Binomial testing.
A simulation of correlation.
Correlation applied to Tasting Coffee.
Simple Linear Regression: Training Forest.

Learning objectives

After this computer practicum the student should be able to perform the following in R Commander, explain why the chosen test is the appropriate one given the research question and data, and interpret the R Commander output:

Hypothesis test for \(\pi\): Exact Binomial test,
Correlation coefficient,
Simple Linear Regression: provide estimates for \(\beta_0\), \(\beta_1\) and \(\sigma_{\varepsilon}\),
paired samples \(t\)-test.

Part 1 - Exact Binomial Testing: Binge Drinking

This part is based on:

Example 10.5 O&L 7th Edition p.489, or
Example 10.5 O&L 6th Edition p.506

Questions:

Read the example introduction, but not the solution.
Formulate the research question and write it on the answer form.
Explain why the exact binomial test is the appropriate test to answer the research question.
Define the parameter \(\pi\), i.e., what is defined as the proportion (probability) of success in the given situation.
Write down the first five steps of the exact binomial test to answer the research question.
Load the data set into R Commander from the Excel file “BSP5_Binge_Drinking.xlsx” via: Data > Import data > from Excel file\(\ldots\). Provide a name for the data set in the field Enter name of data set:, e.g., “binge_drinking”. Remove the check mark in front of Variable names in first row of spreadsheet, and click the OK button to confirm. Then select the Excel file mentioned above and click Open to load the data set.
Before applying the binomial test, the variable, currently a discrete quantitative variable containing only \(0\) and \(1\) values, needs to be converted into a nominal qualitative variable with two levels, a so-called factor variable in R/R Commander. Go to: Data > Manage variables in active data set > Convert numerical variables into factors\(\ldots\). Provide a new variable name in the field New variable name or prefix for multiple variables:, e.g., “student”, and click OK to confirm. Supply level names for the numerical values, e.g., for \(0\) level name: “non-binge”, and for \(1\) level name: “binge”.
Click the View data set button. The data set will now contain two variables, one numerical column with the obscure name “...1” and one named “student” displaying the drinking behavior as “non-binge” or “binge” for each student in the set. Verify that there are \(1300\) non-binge drinking students and \(1200\) binge drinking students in the data set by: Statistics > Summaries > Frequency distributions\(\ldots\), and click the OK button to execute. The results are shown as counts and percentages in the Output part of the R Commander window.
Analyses in R/R Commander are performed based on the numerical values underlying qualitative variables. Here \(0\) is the lowest numerical value, representing “non-binge” drinking students. Therefore, in this particular case all analyses are by default performed on the group of “non-binge” drinking students. Reformulate the first five steps of the exact binomial testing procedure to match the calculations performed in R/R Commander.
Perform the exact binomial test in R Commander using \(\alpha = 0.05\). Go to: Statistics > Proportions > Single-sample proportion test\(\ldots\). Click on the Options tab, select the required Alternative Hypothesis as well as the correct Type of Test using the radio buttons, and fill in the Null hypothesis: p = field so that it matches the steps written down for question i).
Provide the last three steps of the test procedure for this exact binomial test.
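The menu steps in this part can also be sketched as plain R commands. This is only a minimal sketch, not the code R Commander itself generates: the vector `raw` stands in for the imported column, and the null value `p0` and the alternative `alt` are placeholders that you should replace with the values from your own test steps.

```r
# Stand-in for the imported column: 1300 zeros (non-binge) and 1200 ones
# (binge), the counts the practicum asks you to verify.
raw <- c(rep(0, 1300), rep(1, 1200))

# Convert the 0/1 values into a factor. The first level ("non-binge") is
# the group on which the test below is performed.
student <- factor(raw, levels = c(0, 1), labels = c("non-binge", "binge"))

# Frequency distribution: counts and percentages.
counts <- table(student)
percentages <- round(100 * prop.table(counts), 2)
print(counts)
print(percentages)

# Placeholder settings (assumptions, NOT taken from the textbook example).
p0 <- 0.5           # hypothetical null value for the proportion
alt <- "two.sided"  # other options are "less" and "greater"

# Exact binomial test on (successes, failures) = (non-binge, binge).
result <- binom.test(as.vector(counts), p = p0, alternative = alt,
                     conf.level = 0.95)
print(result)
```

The printed output reports the number of successes, the number of trials, the p-value, and a confidence interval for the proportion of the first level.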

Part 2 - A Simulation of Correlation

This part of the computer practicum will give you a sense of how the shape shown in a scatter plot relates to the strength of the straight-line relationship, as measured by Pearson’s correlation.

Start by opening the file “BSP5_Simulation_Correlation_Scatterplots.xlsx” in Microsoft Excel.

In the blue cell, “D4” in the spreadsheet, the value for the population correlation \(\rho\) can be changed. Initially it will be \(0.50\) (do not change it yet!). By pressing the F9 function key, a new sample is drawn from the population for variables \(x\) and \(y\). Pearson’s correlation coefficient \(r\), the estimate of \(\rho\) based on the sample, is given in the green cell (“D5” in the spreadsheet). The data for \(x\) and \(y\) are given in columns “A” and “B” and are visualized in the scatter plot.
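The same idea can be sketched in R. The sketch below makes two assumptions: that the population is bivariate normal, and that the sample size is `n = 50`. How the spreadsheet actually generates its samples, and how many observations it uses, may differ. The function name `draw_r` is hypothetical.

```r
# Draw one sample of size n from a bivariate normal population with
# correlation rho and return the sample correlation coefficient r.
draw_r <- function(rho, n) {
  x <- rnorm(n)
  noise <- rnorm(n)
  # y has variance rho^2 + (1 - rho^2) = 1 and covariance rho with x.
  y <- rho * x + sqrt(1 - rho^2) * noise
  cor(x, y)  # Pearson's r is the default method
}

set.seed(1)  # only to make this sketch reproducible
rho <- 0.50  # initial value of cell D4
n <- 50      # assumed sample size, not taken from the spreadsheet

# Five "F9 presses": five independent samples, five estimates r.
r_values <- replicate(5, draw_r(rho, n))
print(round(r_values, 4))
```

Changing `rho` or `n` and rerunning the last two lines shows how much the estimates vary from sample to sample.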

Press F9 five times sequentially, observe the shape of the scatter plot and the estimate of the correlation coefficient \(r\).
What is the shape of the scatter plots? Are the estimates close to the population value of \(\rho\)? Explain why or why not.
Change the value for the population correlation \(\rho\) such that there is a strong negative correlation in the population. Write the value you used here for \(\rho\) on the answer form.
Now again press F9 five times sequentially, observe the shape of the scatter plot and the estimate of the correlation coefficient \(r\).
What is the shape of the scatter plots for the strong negative correlation you have chosen? Are the estimates close to the chosen population value of \(\rho\)? Explain why or why not.
Next change the population correlation \(\rho\) such that there is a very weak correlation in the population (choose whether the correlation is positive or negative yourself). Write down the chosen value for \(\rho\) on the answer form.
Describe the expected shape of the scatter plot, when the correlation is weak. Explain the expected shape with arguments.
Press F9 once (i.e., draw one new sample) and check the shape of the scatter plot. Does it match your expectation?
Next press F9 five times and write down the five estimated correlation coefficients \(r\).
Are the estimates close to the population value of \(\rho\)? Explain why or why not.

Part 3 - Correlation: Tasting Coffee

For this part of the computer practicum, the data from Part 1 of Computer Practicum 4 are used, about consumers tasting coffee at home and in a laboratory environment. That case considered paired data, where the observations are linked per consumer, i.e., the judgement score given for the taste at home and the judgement score given in the laboratory environment are dependent (coming from the same consumer). Because of this dependence, the two scores may show a trend: consumers who give a high score at home may also tend to give a high score in the laboratory. When this trend is linear, Pearson’s correlation coefficient can be used to determine the strength of this linear relationship.

Load the data “BSP5_Tasting_Coffee.RData” into R Commander and view the data.
Make the appropriate plot to check whether there is a linear relation between the judgement scores for tasting in the laboratory environment and at home.
Calculate Pearson’s correlation coefficient using Statistics > Summaries > Correlation matrix\(\ldots\). Select the two variables for which you want Pearson’s correlation coefficient: hold Shift and click to select a contiguous range of variables, or hold CTRL and click to select variables individually. Click the OK button to calculate the correlation matrix.
What is the estimated Pearson’s correlation coefficient? Write the value on your answer form using \(4\) decimals.
Interpret the strength of the linear relationship between the judgement scores at home and in the laboratory environment (see form).
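The plot and the correlation matrix can be sketched in R as follows. The data frame below is synthetic and only illustrates the commands: the name `coffee` and the column names `home` and `lab` are hypothetical, and the values are not those in “BSP5_Tasting_Coffee.RData”.

```r
set.seed(2)  # only to make this sketch reproducible

# Synthetic stand-in for the coffee data (NOT the practicum data).
home <- round(runif(30, min = 1, max = 10), 1)
lab <- round(0.7 * home + rnorm(30, sd = 1), 1)
coffee <- data.frame(home = home, lab = lab)

# Scatter plot to judge whether the relation looks linear.
plot(coffee$home, coffee$lab,
     xlab = "Score at home", ylab = "Score in laboratory")

# Correlation matrix of the two selected variables.
cor_matrix <- cor(coffee[, c("home", "lab")],
                  method = "pearson", use = "complete.obs")
print(round(cor_matrix, 4))

# The off-diagonal element is Pearson's r between the two scores.
r <- cor_matrix["home", "lab"]
```

The diagonal of the matrix is always 1, since each variable correlates perfectly with itself. The value to report is the off-diagonal one.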

Part 4 - Simple Linear Regression: Training Forest

Wageningen University has several training forests, where students can carry out measurements during field work. In this computer practicum, data will be used that were collected by students in such a forest in 2015. Whether there is a linear relationship between \(y\), the height of the trees (in meters), and \(x\), the diameter at breast height (in centimeters at \(1.30\) meters height) will be studied. Assume that the \(y\) values are independent and normally distributed with constant variance \(\sigma_{\varepsilon}^2\).

Write down the above-mentioned assumptions in mathematical terms. This will provide you a concise statistical model for the observed \(y\) values. Indicate for each element, whether it belongs to the systematic part or the stochastic (random) part of the model.
Load the file “BSP5_Training_Forest.RData” into R Commander and view the data. How many cases are there?
Type in the “R Script” part of the R Commander window: attr(training_forest, "variable.labels"), keep the cursor on the same line and click Submit to get information on the meaning of the variables in the columns of the data set. Fill in the answer form.
Generate the appropriate plot to display the relationship between \(y\) and \(x\): Graphs > Scatter plot\(\ldots\), choose the correct variable for the \(y\)- and \(x\)-axis. On the Options tab place a check mark in the box in front of Least-squares line. Is there a (roughly) linear relationship between the height of trees (in meters) and the diameter of trees at breast height (in cm at 1.30 m) in 2015?
Generate simple linear regression output for the relationship in d): Statistics > Fit models > Linear regression\(\ldots\). Choose the correct Response variable (pick one), representing \(y\), and Explanatory variables (pick one or more), representing \(x\). Click the OK button to generate the model output.
Give the equation of the least squares regression line (often referred to as the estimated regression line).
Give the interpretation of the estimated coefficients of the least squares regression line, using the height of trees and diameter of trees at breast height in 2015 in your description.
To answer the research question of whether the diameter of a tree at breast height has predictive value for the height of a tree, a test should be applied. Start by stating the first five steps of the appropriate test (\(\alpha = 0.05\)).
Find the needed values in the R/R Commander output produced at question e) to proceed with the test procedure and fill in the answer form.
Give the estimate for \(\sigma_{\varepsilon}^2\).
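The sketch below shows where the requested quantities can be found in R's regression output. It uses synthetic data: the data frame `forest` and the variable names `diameter` and `height` are hypothetical, so the numbers differ from those for “BSP5_Training_Forest.RData”.

```r
set.seed(3)  # only to make this sketch reproducible

# Synthetic stand-in for the forest data (NOT the practicum data).
diameter <- runif(40, min = 10, max = 60)          # cm, at breast height
height <- 5 + 0.4 * diameter + rnorm(40, sd = 2)   # m
forest <- data.frame(diameter = diameter, height = height)

# Fit the simple linear regression of height on diameter.
fit <- lm(height ~ diameter, data = forest)
fit_summary <- summary(fit)
print(fit_summary)

# Estimates of beta_0 and beta_1.
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["diameter"]]

# "Residual standard error" in the output estimates sigma_epsilon;
# squaring it gives the estimate of the variance.
s_eps <- fit_summary$sigma
s2_eps <- s_eps^2

# Row of the coefficient table used to test whether the slope is zero.
slope_row <- coef(fit_summary)["diameter", ]
t_value <- slope_row[["t value"]]
p_value <- slope_row[["Pr(>|t|)"]]
df_resid <- df.residual(fit)  # n - 2 for simple linear regression
```

Note that the output prints the residual standard error, not the variance, so the value must be squared when an estimate of \(\sigma_{\varepsilon}^2\) is asked for.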