Computer Practicum 6

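This computer practicum contains the following three parts:

  • Part 1 - Simple Linear Regression: Effect of Fertilizer on Lettuce Plants

  • Part 2 - Simple Linear Regression: Species per Island

  • Part 3 - Simple Linear Regression: Paying for Bread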

Learning objectives

After this computer practicum the student should be able to do the following in R Commander:

  • Apply a simple linear regression model.

  • Perform a \(t\)-test for the regression coefficients \(\beta_0\) and \(\beta_1\).

  • Give Confidence Intervals for the regression coefficients \(\beta_0\) and \(\beta_1\).

  • Give the estimated population mean and predicted value for a single random unit given a value for \(x\).

  • Give the standard error for the population mean given a value for \(x\).

  • Give a Confidence Interval for the population mean \(\mu_x = \beta_0 + \beta_1 \times x_{n + 1}\).

  • Give a Prediction Interval for a future response \(y\) at \(x_{n+1}\).

  • Produce plots to check the assumptions for a simple linear regression model.

Important: R Commander Plugin for Statistical Analysis and Data Display: Heiberger and Holland

Before starting with Part 1, a plug-in for R Commander needs to be loaded. This plug-in is needed to make confidence and prediction intervals for any value of \(x\) in Simple Linear Regression.

To load the plug-in use: Tools > Load Rcmdr plug-in(s)\(\ldots\) Select “RcmdrPlugin.HH” and click the OK button to load the plug-in. The following message will appear: “The plug-in(s) will not be available until the Commander is restarted. Restart now?” Click the Yes button to proceed.

When the option for “RcmdrPlugin.HH” is not available in the list of plug-ins:

  • Close R Commander.
  • In the R Console go to: Packages > Install package(s)\(\ldots\)
  • Select “0-Cloud” as CRAN mirror.
  • In the list of packages select “RcmdrPlugin.HH” and click the OK button to install.
  • Close the R Console.
  • Restart the R GUI (R 4.3.1 in the R folder inside your start menu).
  • Restart R Commander.
  • Load the plug-in as described above.

Part 1 - Simple Linear Regression: Effect of Fertilizer on Lettuce Plants

In an experiment, \(10\) lettuce plants are grown in soil to which different amounts of fertilizer are added. The relative amounts of fertilizer used are \(0,\ 1.0, 2.0,\) and \(3.0\). The weights (g) of the lettuce plants are measured after \(1\) week.

The measured weight of each of the \(10\) plants is assumed to be an outcome of a normally distributed variable whose mean depends linearly on the relative amount of added fertilizer. The following general research questions are addressed in this part:

  1. Is there a linear relationship between mean weight and the amount of fertilizer added?

  2. Assuming that any relationship is linear, is there a relationship at all?

  3. How strong is this relationship?

  4. How uncertain is the estimate of the mean weight of a lettuce plant, given the amount of added fertilizer?

  5. How uncertain is the estimate of the weight of an individual lettuce plant, given the amount of added fertilizer?

The data are in the file “BSP6_Lettuce_Plants.Rdata”.

  1. Load the data and inspect the data by viewing.

  2. Display the data in a useful graph. Does the relationship appear linear?

  3. Should fertilizer be considered a qualitative or a quantitative variable and why?

  4. Assume that a straight-line relationship applies and fit the model using R Commander. Give estimates for the regression parameters and (when applicable) the corresponding estimated standard errors. [Hint: the three regression parameters are \(\ldots\) (see also the handouts of Tutorial \(10\))].

  5. Determine a \(95\%\) Confidence Interval for \(\beta_1\): Models > Confidence intervals\(\ldots\)

  6. Test (with \(\alpha = 0.05\)) whether adding fertilizer has a positive effect on the weight of lettuce plants. Begin with the first five steps on the answer form; next, look at the output generated for question 4 to fill in the final steps of the appropriate test.

  7. Get the estimated mean weight, the associated standard error, and a \(95\%\) Confidence Interval for the mean weight of lettuce plants grown in soil with a relative amount of added fertilizer equal to \(2.5\): Models > Prediction Intervals\(\ldots\)(HH). Under Enter X values, fill in the desired value for fertilizer (note that the confidence interval for the mean is the default setting). Place a check mark in the box in front of Standard Error and click the OK button to execute.
    Write the answers on the answer form.

  8. Give the predicted weight of a randomly selected lettuce plant grown in soil with a relative amount of added fertilizer equal to \(2.5\). Also give a \(95\%\) Prediction Interval (check the appropriate option in R Commander). Compare your answer with the answer to question 7.

  9. The Confidence Intervals for the mean and the Prediction Intervals for individual data points can be plotted in one graph: Models > Confidence interval Plot\(\ldots\) Why is the confidence interval narrower near \(\bar{x}\)?

  10. To check the quality of the regression model, create some plots: Models > Graphs > Basic diagnostic plots. In this course only the top two plots have been discussed; the other two plots can be ignored.

  11. Look at the Normal Q-Q plot. What assumption does it test? Explain whether the assumption is met.

  12. Check the Residuals vs Fitted plot. What assumptions does it test? Explain whether the assumptions are met.
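
For readers who want to cross-check the R Commander output by hand, the quantities asked for in questions 4 to 8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method R Commander uses internally: the data values are made up for illustration and are not those in “BSP6_Lettuce_Plants.Rdata”, the function names are hypothetical, and the critical value 2.306 is the 0.975 quantile of the \(t\) distribution with \(10 - 2 = 8\) degrees of freedom, taken from a \(t\)-table.

```python
import math

# Hypothetical illustration data: NOT the values in BSP6_Lettuce_Plants.Rdata.
fertilizer = [0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0]
weight = [1.0, 3.0, 2.0, 4.0, 6.0, 7.0, 8.0, 9.0, 11.0, 10.0]


def fit_simple_linear_regression(x, y):
    """Least squares fit of y = b0 + b1 * x with the usual standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    return {
        "n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1, "s": s,
        "se_b0": s * math.sqrt(1 / n + x_bar ** 2 / sxx),
        "se_b1": s / math.sqrt(sxx),
    }


def intervals_at(fit, x0, t_crit):
    """Confidence interval for the mean and prediction interval at x0."""
    estimate = fit["b0"] + fit["b1"] * x0
    leverage = 1 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["sxx"]
    se_mean = fit["s"] * math.sqrt(leverage)      # for the population mean
    se_pred = fit["s"] * math.sqrt(1 + leverage)  # for a single new unit
    return {
        "estimate": estimate,
        "se_mean": se_mean,
        "ci_mean": (estimate - t_crit * se_mean, estimate + t_crit * se_mean),
        "pi": (estimate - t_crit * se_pred, estimate + t_crit * se_pred),
    }


T_CRIT_DF8 = 2.306  # 0.975 quantile of t with 8 df, read from a t-table

fit = fit_simple_linear_regression(fertilizer, weight)
result = intervals_at(fit, 2.5, T_CRIT_DF8)
ci_b1 = (fit["b1"] - T_CRIT_DF8 * fit["se_b1"],
         fit["b1"] + T_CRIT_DF8 * fit["se_b1"])
t_b1 = fit["b1"] / fit["se_b1"]  # test statistic for H0: beta_1 = 0

print(fit)
print(ci_b1, t_b1)
print(result)
```

Replacing the illustration data with the values from the data file should reproduce the estimates, standard errors, and intervals that R Commander reports, up to rounding.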

Part 2 - Simple Linear Regression: Species per Island

In this second part, data from the article “Plant species richness – The effect of the Island Size and Habitat Diversity” (Kohn and Walsh, Journal of Ecology, 1994) will be used. The article describes a study of the plant richness of the Shetland Islands. Three variables are measured for each of \(47\) of the Shetland Islands: the area of the island (“Area”, in hectares), the number of dicotyledon plant species per island (“Species_per_island”), and the number of different habitat types per island (“Number_of_habitat_types”). This part investigates whether “Species_per_island” can be predicted with simple linear regression, using either the area of the island (“Area”) or the number of habitat types (“Number_of_habitat_types”) as explanatory variable.

The data set is available in the file named “BSP6_Species_Island.RData”.

  1. Load and view the data.

  2. Generate two useful graphical displays to visualize the relations between:

    • “Species_per_island” and “Area”

    • “Species_per_island” and “Number_of_habitat_types”

  3. Are both relations linear?

  4. To overcome the problem of the nonlinear relation between “Species_per_island” and “Area”, the variable “Area” will be transformed with the natural logarithm: Data > Manage variables in active data set > Compute new variable\(\ldots\) Use as Expression to compute: “log(Area)”, and use, e.g., “log_area” as New variable name.

  5. Make a new visualisation to see the relation between “Species_per_island” and “log_area”. Can simple linear regression be applied now?

  6. Carry out a simple linear regression analysis to predict the number of species per island with log(Area) as explanatory variable. Give the estimated regression line. Do not forget to state what the variables in your equation mean.

  7. Give the coefficient of determination for this simple linear regression model.

  8. Generate basic diagnostic plots. Are the assumptions for the simple linear regression model of question 6 met?

  9. Carry out a simple linear regression analysis to predict the number of species per island with “Number_of_habitat_types” as explanatory variable. Give the estimated regression line. Do not forget to state what the variables in your equation mean.

  10. Give the coefficient of determination for this simple linear regression model.

  11. Are the assumptions for the simple linear regression model of question 9 met?

  12. Which model (question 6 or question 9) would you prefer to predict the number of species per island? Why?

  13. Estimate the population mean number of species of a Shetland island with \(10\) different types of habitats. Also give the standard error of this estimate: Models > Prediction intervals\(\ldots\)(HH). Set the radio button to point estimate only and place a check mark in front of Standard Error; next, click the OK button to execute.
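
The effect of the logarithmic transformation on the coefficient of determination can be illustrated with a small Python sketch. It is only an illustration under assumptions: the areas and species counts below are constructed, not taken from “BSP6_Species_Island.RData”, and the helper `r_squared` is a hypothetical name. The counts are built to be exactly linear in the natural logarithm of the area, so the transformed model fits perfectly while the untransformed one does not.

```python
import math


def r_squared(x, y):
    """Coefficient of determination of the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical areas (hectares) and species counts; NOT the Shetland data.
# The counts are constructed as 5 + 10 * ln(area), so the relation is
# exactly linear in log(area) and curved in area itself.
area = [1.0, 5.0, 20.0, 100.0, 500.0, 2000.0]
species = [5 + 10 * math.log(a) for a in area]

# math.log is the natural logarithm, like log(Area) in R.
log_area = [math.log(a) for a in area]

r2_raw = r_squared(area, species)
r2_log = r_squared(log_area, species)
print(f"R^2 with Area:      {r2_raw:.3f}")
print(f"R^2 with log(Area): {r2_log:.3f}")
```

With real data the fit after transformation will not be perfect, so the diagnostic plots remain necessary to judge the model assumptions.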

Part 3 - Simple Linear Regression: Paying for Bread

Load and view “BSP6_Pay_for_Bread.RData” in R Commander. This data set contains two variables from a survey. Consumers were asked to give their age (“Age”, in years) and the maximum price they are willing to pay for a whole loaf of bread that is healthier than regular bread (“Willing_to_Pay”, in euros). The research question is: “Is there a negative relationship between the maximum price consumers are willing to pay for a healthier loaf of bread and age?”

  1. Give the estimated equation of the least squares regression line, with a description of the variables, as well as the coefficient of determination for this problem. Does the coefficient of determination indicate a strong relationship between “Willing_to_Pay” and “Age”? Why (not)?

  2. Give the null and alternative hypotheses, the \(p\)-value, and the conclusion corresponding to the formulated research question (use \(\alpha = 0.05\)).

  3. Does the conclusion of question 2 indicate that there is a strong relationship between “Willing_to_Pay” and “Age”? Why (not)?

  4. Generate a plot to visualize this problem.

  5. Reflect on the practical relevance of this significant relationship. Use the words relevant, significant, and sample size in your answer.
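
The interplay between sample size and significance can be made concrete with a short Python sketch. In simple linear regression the \(t\) statistic for \(\beta_1\) can be written in terms of the sample correlation \(r\) (whose square is the coefficient of determination) and the sample size \(n\) as \(t = r\sqrt{n-2}/\sqrt{1-r^2}\). The correlation and sample sizes below are assumed values chosen for illustration, not results from the bread survey, and the function name is hypothetical.

```python
import math


def t_statistic_from_r(r, n):
    """t statistic for H0: beta_1 = 0, from sample correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical weak negative correlation; NOT estimated from the bread survey.
r = -0.10
for n in (20, 200, 2000):
    t = t_statistic_from_r(r, n)
    print(f"n = {n:5d}  R^2 = {r ** 2:.2f}  t = {t:6.2f}")
```

The coefficient of determination stays the same in every row, while the \(t\) statistic moves further from zero as \(n\) grows.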