R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. In this homework, and many others in the future, we will use the following R packages:
dplyr
: for data wranglingggplot2
: for data visualizationmosaic
: for basic statistical computingYou need to load these packages in your working environment. We do this with the library()
function. Run the following three lines in your console.
library(dplyr)
library(tidyr)
library(ggplot2)
library(mosaic)
Note that you only need to load then with the library()
function each time you relaunch RStudio.
Let us start with the BF[1] example of plant trampling. Deputy director of the Pawnee Parks and Rec department, Leslie Knope, needs to know how resistant different vegetative types are to trampling so that the number of visitors can be controlled in sensitive areas. Twenty lanes of a park are established, each .5 m wide and 1.5 m long. These twenty lanes are randomly assigned to five treatments: 0, 25, 75, 200, or 500 walking passes. Each pass consists of a 70-kg individual wearing boots, walking in a natural gait. One year after trampling, the average height of the vegetation along the lanes are measured. The data from this experiment is entered into a data frame using the code below.
<- data.frame(passes = c(rep("0", 4), rep("25", 4),
parks rep("75", 4), rep("200", 4),
rep("500", 4)),
height = c(20.7, 15.9, 17.8, 17.6,
12.9, 13.4, 12.7, 9.0,
11.8, 12.6, 11.4, 12.1,
7.6, 9.5, 9.9, 9.0,
7.8, 9.0, 8.5, 6.7))
Using the parks
data frame created from the code above, we can make a parallel dot graph using the ggplot()
function from the ggplot2
package.
ggplot(parks, aes(x = passes, y = height)) +
geom_point()
What do you notice? For which treatment do the plants fair the best? We can also look at parallel dot graphs to get a sense of the validity of the S, same standard deviations, Fisher assumption.
We can add the averages for each condition to the plot with an additional geom_point()
layer. In this second geom_point()
layer we add map the mean to the y aesthetic instead of the response variable. Note that we can compute the averages for each condition easily with group_by()
and mutate()
.
%>%
parks group_by(passes) %>%
mutate(mean = mean(height)) %>%
ggplot(aes(x = passes, y = height)) +
geom_point() +
geom_point(aes(y = mean), color = "mediumvioletred", size = 4, shape = 6)
If we feel comfortable with that assumption, we can move forward with the formal ANOVA for BF[1] design. We will use the lm()
function and save this model in an object, here we name that object mod
. Note that inside of the lm()
function, the model is given as response ~ factor
using the formula notation, y ~ x
. The data is also given by typing data = parks
. The formula and the data are arguments passed to the lm()
function. Next, to get the ANOVA decomposition, F-ratios, and p-values for those F-ratios by passing the anova()
function our mod
object.
<- lm(height ~ passes, data = parks)
mod
anova(mod)
What can we conclude about the effect of foot traffic, passes, on the height of plants?
After running our formal ANOVA, we should look at the model residuals to get a sense of how valid our N assumption is.
<- parks %>%
parks mutate(residuals = residuals(mod),
SD = sd(residuals))
Now that we have a variable saved in our data frame that contains the residuals, we can visualize these residuals with the ggplot()
function. In a layer on top of our histogram, we want to add two vertical lines that cut off 1 SD above and below the mean (of zero). We can accomplish this with the geom_vline()
function. The v
stands for vertical. The geom_vline()
function needs to know the xintercept
aesthetic to know where to draw the line. We’ll make these lines blue by adding the color = "blue"
argument to geom_vline()
.
ggplot(parks, aes(x = residuals)) +
geom_histogram(bins = 10) +
geom_vline(aes(xintercept = SD), color = "blue") +
geom_vline(aes(xintercept = -SD), color = "blue")
To create a data frame for data from a bf[1] design, we will want two variables: 1) the response variable observations and 2) the level for each observation of the factor.
The code below creates a data frame for the leafhopper data (p. 169). First, notice that observations on the response variable, days
are specified inside of the c()
function. The c
stands for concatenate, and this concatenate function makes lists in R. Next, levels of the diet
variable are created, again with the c()
function, but also making use of the rep()
function. The rep()
function will repeat the first argument how ever many times you specify in the second argument.
<- c(2.3, 1.7, 3.6, 4.0, 2.9, 2.7, 2.1, 2.3)
days <- c(rep("control", 2), rep("sucrose", 2), rep("glucose", 2), rep("fructose", 2)) diet
Right now these two variables are just lists of information that are unrelated to each other.
Finally, to relate the variables to each other, both variables are contain in a data frame with the data.frame()
function, and the data frame is saved in an object called leaf
. A data frame is a list of lists!
<- data.frame(days, diet) leaf
Take a look at the data frame with the View()
function.
View(leaf)
For homework 6, complete the following exercises. When you are done, please knit your homework to an HTML file and submit the HTML file on Moodle.
Refer to the walking babies example on page 150 of your textbook for exercises 1-3. For exercises 1-3, you will be conducting an analysis to test if there are differences in age in months when babies first walked depending on the group they were assigned to.
Using the method describe above for the leafhopper data, create a data frame in your R Studio environment called walking
with the data given in Table 5.1 on page 150 of your textbook. What kind of design is this?
Create a parallel dot graph of the walking data. What do you notice? Also, using favstats()
, check the standard deviations between groups (the S assumption). Is the largest SD >3 times as large as the smallest?
Conduct an ANOVA to test the research question: Do special exercises speed up the process of babies learning to walk? What do you conclude about these baby exercise programs from your ANOVA?
Make a histogram of residuals for the model you ran in exercise 2, complete with vlines
marking 1 SD above and below the mean. Using this histogram, discuss the validity of the N and Z assumptions.
Create a data frame in your R Studio environment called fluids
with the data given in Table 5.2 on page 153 of your textbook. For help, review the discussion above on how the leaf
data was entered. What are the units for this data frame? How many observations are there?
Conduct an ANOVA on this data set, complete with a parallel dot graph and making sure to check your residual assumptions along the way. What do you conclude about the purity of intravenous fluids across these three drug companies?
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted by Randi Garcia from labs originally written by Mark Hansen, further adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.
In this course we will be using R packages called dplyr
for data wrangling and ggplot2
for data visualization. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling “scatterplot in R”, google “scatterplot in R with ggplot2”.
These cheatsheets may come in handy throughout the semester:
Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown here. It includes examples showing working with R Markdown files in RStudio recorded as GIFs.
Note that some of the code on these cheatsheets may be too advanced for this course, however majority of it will become useful throughout the semester.