

a_thing <- 4
another_thing <- 1
another_Thing <- 7

both_things <- a_thing + another_thing

Then we created a tiny data set.

a_data_thing <- data.frame(x = 2, y = 8)

a_data_thing$x
## [1] 2
#How would we print the variable y? Type your answer below on line 17.

Write notes for yourself in the white space. Maybe explain to your future self what dollar signs do.

ERASE THIS AND TYPE HERE
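As a starting point for those notes, here is a minimal sketch of what the dollar sign does. The data frame `pets` and its columns are made up for illustration and are not part of the workshop data.

```r
# A hypothetical toy data frame, invented for illustration
pets <- data.frame(name = c("Rex", "Tom"), legs = c(4, 4))

# The dollar sign pulls one column out of a data frame by name
pets$legs

# Double square brackets with the column name pull the same column
pets[["legs"]]
```

Both lines return the same vector; the dollar sign is simply the shorter way to type it.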

Enough playing around. Let’s load some data!

riggsi <- read.csv("riggsi.csv")
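If you want to see how read.csv() behaves without the workshop file, the sketch below writes a small made-up data set to a temporary file and reads it back. The names `tmp` and `toy` and the values in them are assumptions for illustration only.

```r
# Write a small made-up data set to a temporary csv file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(10, 12, 9)), tmp, row.names = FALSE)

# read.csv() takes a file path and returns a data frame
toy <- read.csv(tmp)
toy

# A bare file name such as "riggsi.csv" is looked up in the working directory
getwd()
```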

Next, you want to look at your data.

head(riggsi)
##   Dyad Partner Age Gender Abuse Sat     Avoid   Anxiety Rel_Length
## 1    1       1  21     -1     5  48 0.2777778 2.2777778   1.083333
## 2    1       2  20      1     5  51 0.0000000 3.8888889   1.083333
## 3    2       1  22      1    15  43 1.4444444 3.4444444   3.416667
## 4    2       2  21     -1    16  47 0.3333333 0.8888889   3.416667
## 5    3       1  32     -1     7  52 0.3333333 1.5555556   1.666667
## 6    3       2  21      1    13  38 0.8333333 5.6111111   1.666667
##   Genderstring
## 1          Man
## 2        Woman
## 3        Woman
## 4          Man
## 5          Man
## 6        Woman
str(riggsi)
## 'data.frame':    310 obs. of  10 variables:
##  $ Dyad        : int  1 1 2 2 3 3 5 5 6 6 ...
##  $ Partner     : int  1 2 1 2 1 2 1 2 1 2 ...
##  $ Age         : int  21 20 22 21 32 21 18 19 22 21 ...
##  $ Gender      : int  -1 1 1 -1 -1 1 1 -1 -1 1 ...
##  $ Abuse       : num  5 5 15 16 7 13 5 5 5 5 ...
##  $ Sat         : int  48 51 43 47 52 38 50 45 50 57 ...
##  $ Avoid       : num  0.278 0 1.444 0.333 0.333 ...
##  $ Anxiety     : num  2.278 3.889 3.444 0.889 1.556 ...
##  $ Rel_Length  : num  1.08 1.08 3.42 3.42 1.67 ...
##  $ Genderstring: Factor w/ 2 levels "Man","Woman": 1 2 2 1 1 2 2 1 1 2 ...
names(riggsi)
##  [1] "Dyad"         "Partner"      "Age"          "Gender"      
##  [5] "Abuse"        "Sat"          "Avoid"        "Anxiety"     
##  [9] "Rel_Length"   "Genderstring"

There is also documentation about functions.

?head

You probably also want descriptive statistics.

summary(riggsi)
##       Dyad           Partner         Age           Gender  
##  Min.   :  1.00   Min.   :1.0   Min.   :17.0   Min.   :-1  
##  1st Qu.: 43.25   1st Qu.:1.0   1st Qu.:19.0   1st Qu.:-1  
##  Median : 88.00   Median :1.5   Median :21.0   Median : 0  
##  Mean   : 87.69   Mean   :1.5   Mean   :21.9   Mean   : 0  
##  3rd Qu.:131.75   3rd Qu.:2.0   3rd Qu.:23.0   3rd Qu.: 1  
##  Max.   :174.00   Max.   :2.0   Max.   :46.0   Max.   : 1  
##      Abuse             Sat            Avoid           Anxiety     
##  Min.   : 4.404   Min.   :20.00   Min.   :0.0000   Min.   :0.000  
##  1st Qu.: 5.000   1st Qu.:41.00   1st Qu.:0.5556   1st Qu.:1.847  
##  Median : 7.000   Median :46.00   Median :1.2222   Median :2.778  
##  Mean   : 8.888   Mean   :44.82   Mean   :1.4591   Mean   :2.780  
##  3rd Qu.:11.000   3rd Qu.:50.00   3rd Qu.:2.1111   3rd Qu.:3.611  
##  Max.   :25.000   Max.   :61.00   Max.   :5.7222   Max.   :5.667  
##    Rel_Length     Genderstring
##  Min.   :0.4167   Man  :155   
##  1st Qu.:0.8333   Woman:155   
##  Median :1.4583               
##  Mean   :1.7386               
##  3rd Qu.:2.4896               
##  Max.   :5.0000

We can also select pieces of a data frame using square brackets. The first number is the row, the second is the column.

riggsi[2, 6]
## [1] 51
#You try it! Find a number you want to pull from the dataset.
#riggsi[ ?, ?]

If you are working with a single variable (one column) instead, you can select one of its values by position.

riggsi$Sat[2]
## [1] 51
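The two selection styles above can be compared on a small made-up data frame. This is only a sketch; `toy` and its columns are hypothetical.

```r
# Hypothetical toy data, invented for illustration
toy <- data.frame(id = 1:3, score = c(10, 12, 9))

toy[2, 2]        # row 2, column 2: a single value
toy[2, ]         # leave the column blank to get the whole second row
toy[, "score"]   # leave the row blank and name a column to get that column
toy$score[2]     # same value as toy[2, 2], reached through the column name
```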

More Intro to R

More descriptive stats and frequencies with sample proportions. We’ll use the package mosaic. To download a cheat sheet for mosaic click here.

library(mosaic)
favstats(~Sat, data = riggsi)
##  min Q1 median Q3 max     mean      sd   n missing
##   20 41     46 50  61 44.81613 7.30707 310       0
tally(~Gender, data = riggsi, format = "proportion")
## Gender
##  -1   1 
## 0.5 0.5

Descriptives split by gender.

favstats(Sat ~ Gender, data = riggsi)
##   Gender min Q1 median Q3 max     mean       sd   n missing
## 1     -1  24 40     46 50  61 44.85806 7.171769 155       0
## 2      1  20 41     46 50  60 44.77419 7.462937 155       0
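For comparison, the same kinds of summaries can be sketched in base R without any packages. This is an alternative route, not what mosaic does internally, and the data frame `toy` is made up for illustration.

```r
# Hypothetical toy data, invented for illustration
toy <- data.frame(group = c("a", "a", "b", "b"),
                  score = c(1, 3, 10, 20))

# Sample proportions: count each group, then divide by the total
props <- prop.table(table(toy$group))
props

# A statistic split by group: apply mean() to score within each group
group_means <- tapply(toy$score, toy$group, mean)
group_means
```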

Data Manipulation

We’ll use the package dplyr. For the cheatsheet click here.

First, let’s filter cases.

library(dplyr)

We can make a dataset of men only.

menOnly <- filter(riggsi, Gender == -1)

We can save this new data set in our files as a csv.

write.csv(menOnly, "men.csv")

It’s in the same folder where this .Rmd file is saved! You can be more specific with the file path if you choose.

How about only the men of legal drinking age (older than 20)?

legal_drinkers <- menOnly %>% filter(Age > 20)

Note the use of the pipe, %>%, above. The two statements below are equivalent.

legal_drinkers <- filter(menOnly, Age > 20)

legal_drinkers <- menOnly %>% filter(Age > 20)
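To convince yourself that the two forms really give the same result, you can check them on a small made-up data frame. The names `toy`, `nested`, and `piped` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(id = 1:4, age = c(19, 21, 25, 20))

# The pipe hands the left-hand side to filter() as its first argument
nested <- filter(toy, age > 20)
piped  <- toy %>% filter(age > 20)

# TRUE when the two data frames match exactly
identical(nested, piped)
```

The pipe pays off when several steps are chained, because the code then reads top to bottom in the order the steps happen.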

Adding new variables.

menOnly <- menOnly %>%
  mutate(legal = Age > 20)

favstats(Age ~ legal, data = menOnly)
##   legal min Q1 median Q3 max     mean        sd   n missing
## 1 FALSE  18 19     19 20  20 19.12963 0.7781521  54       0
## 2  TRUE  21 22     23 26  43 24.61386 4.4810050 101       0

An alternative method for descriptive statistics: the dplyr way.

riggsi %>%
  summarize(mean = mean(Sat),
            sd = sd(Sat),
            min = min(Sat))
##       mean      sd min
## 1 44.81613 7.30707  20

We can split the file and view results grouped by some variable.

riggsi %>%
  group_by(Gender) %>%
  summarize(mean = mean(Sat),
            sd = sd(Sat),
            min = min(Sat))
## # A tibble: 2 x 4
##   Gender  mean    sd   min
##    <int> <dbl> <dbl> <dbl>
## 1     -1  44.9  7.17   24.
## 2      1  44.8  7.46   20.
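Here is the same grouped pattern on made-up data, with a group count added through dplyr's n(). The data frame `toy` and the result name `by_group` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- data.frame(group = c("a", "a", "b", "b"),
                  score = c(1, 3, 10, 20))

# One row of output per group: the mean score and the number of rows
by_group <- toy %>%
  group_by(group) %>%
  summarize(mean = mean(score),
            n = n())
by_group
```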

Save a smaller subset of variables.

small <- riggsi %>%
  select(Dyad, Partner, Sat, Avoid)

Visualizing Data

Let’s make a histogram. We’ll use the package ggplot2 to make our visualizations. To download a cheatsheet for ggplot2 click here.

qplot(x = Sat, data = riggsi, bins = 20)

Scatterplot. qplot, which stands for quick plot, guesses which kind of figure you want.

qplot(x = Avoid, y = Sat, data = riggsi)

Side-by-side boxplots. In this case it does NOT know what to do, so we tell it we want boxplots. It also helps to have a gender variable that is stored as a categorical string variable. Our data set already has one, Genderstring, so we use that.

qplot(y = Sat, x = Genderstring, data = riggsi, geom = "boxplot")

We can add a regression line.

qplot(x = Avoid, y = Sat, data = riggsi) + geom_smooth(method = "lm", se = 0)

We can also split by gender. Note that for more complex plots we need to move away from the qplot function in favor of the heavy-duty ggplot function.

ggplot(riggsi, aes(x = Avoid, y = Sat, group = Genderstring, color = Genderstring)) +
         geom_point() + 
         geom_smooth(method = "lm", se = 0)
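Recent versions of ggplot2 mark qplot() as deprecated, so it may be worth knowing how the earlier plots look when written with ggplot(). The sketch below uses made-up data; `toy` and the plot names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy <- data.frame(group = rep(c("a", "b"), each = 5),
                  x = 1:10,
                  y = c(2, 4, 3, 6, 5, 9, 8, 12, 11, 14))

# Histogram: the same idea as qplot(x = y, data = toy, bins = 5)
p_hist <- ggplot(toy, aes(x = y)) +
  geom_histogram(bins = 5)

# Side-by-side boxplots: the same idea as qplot(..., geom = "boxplot")
p_box <- ggplot(toy, aes(x = group, y = y)) +
  geom_boxplot()

# Scatterplot with a straight-line fit and no confidence band
p_fit <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

Typing the name of a plot, for example `p_hist`, draws it.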

Statistical Modeling and Inference in R

Independent samples t-test, making use of the t.test() function in the mosaic package.

t.test(Sat ~ legal, data = menOnly, var.equal = TRUE)
## Sat ~ legal
## 
##  Two Sample t-test
## 
## data:  Sat by legal
## t = -0.12502, df = 153, p-value = 0.9007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.547837  2.244573
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            44.75926            44.91089
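The argument var.equal = TRUE asks for the pooled-variance test, which is why the degrees of freedom above are 155 - 2 = 153. The sketch below shows the difference from the default (Welch) test using base R's t.test() on simulated data; the data and object names are made up.

```r
set.seed(1)  # simulated data, for illustration only

toy <- data.frame(group = rep(c("a", "b"), each = 10),
                  score = c(rnorm(10, mean = 50, sd = 5),
                            rnorm(10, mean = 52, sd = 5)))

# Pooled-variance test: degrees of freedom are n1 + n2 - 2
pooled <- t.test(score ~ group, data = toy, var.equal = TRUE)

# Default Welch test: does not assume equal variances
welch <- t.test(score ~ group, data = toy)

pooled$parameter  # degrees of freedom of the pooled test
welch$parameter   # degrees of freedom of the Welch test
```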

Paired samples t-test, run as a one-sample t-test on the difference between two variables measured on the same person.

t.test(~(Sat-Avoid), data = menOnly)
## ~(Sat - Avoid)
## 
##  One Sample t-test
## 
## data:  Sat
## t = 71.688, df = 154, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  42.25412 44.64887
## sample estimates:
## mean of x 
##  43.45149
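A one-sample test on the differences gives the same answer as asking for a paired test directly. Here is a sketch with base R's t.test() on simulated data; `before` and `after` are hypothetical measurements.

```r
set.seed(2)  # simulated data, for illustration only

before <- rnorm(12, mean = 50, sd = 5)
after  <- before + rnorm(12, mean = 1, sd = 2)

# Paired test on the two measurements
paired <- t.test(after, before, paired = TRUE)

# One-sample test on the differences
one_sample <- t.test(after - before)

# TRUE when the two t statistics agree
all.equal(unname(paired$statistic), unname(one_sample$statistic))
```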

Linear regression.

mod <- lm(Sat ~ Avoid + legal, data = menOnly)

summary(mod)
## 
## Call:
## lm(formula = Sat ~ Avoid + legal, data = menOnly)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.480  -3.620   0.316   4.650  14.401 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47.4177     1.1953  39.671  < 2e-16 ***
## Avoid        -1.8745     0.5183  -3.617 0.000405 ***
## legalTRUE     0.1182     1.1677   0.101 0.919476    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.927 on 152 degrees of freedom
## Multiple R-squared:  0.07934,    Adjusted R-squared:  0.06722 
## F-statistic: 6.549 on 2 and 152 DF,  p-value: 0.001869
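Individual pieces of the summary can also be pulled out of a model object. The sketch below fits a model to simulated data, so `toy` and `fit` and the numbers it produces are illustrative only.

```r
set.seed(3)  # simulated data, for illustration only

toy <- data.frame(x = runif(30, min = 0, max = 5))
toy$y <- 47 - 2 * toy$x + rnorm(30, sd = 3)

fit <- lm(y ~ x, data = toy)

coef(fit)               # intercept and slope estimates
confint(fit)            # 95% confidence intervals for the coefficients
summary(fit)$r.squared  # multiple R-squared
```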

Regression diagnostics.

plot(mod)
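By default plot() on a model draws several diagnostic plots one after another. One way to see them together is to split the plotting window into a 2 x 2 grid first. This is a sketch on simulated data; `toy` and `fit` are made up.

```r
set.seed(4)  # simulated data, for illustration only

toy <- data.frame(x = runif(30, min = 0, max = 5))
toy$y <- 47 - 2 * toy$x + rnorm(30, sd = 3)
fit <- lm(y ~ x, data = toy)

# Split the plotting window into 2 rows and 2 columns, saving the old settings
old_par <- par(mfrow = c(2, 2))

# Draws the default diagnostic plots, such as residuals vs fitted and normal Q-Q
plot(fit)

# Restore the previous plotting settings
par(old_par)
```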

