a_thing <- 4
another_thing <- 1
another_Thing <- 7
both_things <- a_thing + another_thing
Then we created a tiny data set.
a_data_thing <- data.frame(x = 2, y = 8)
## [1] 2
#How would we print the variable y? Type your answer below on line 17.
Write notes for yourself in the white space. Maybe explain to your future self what dollar signs do.
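As a starting point for those notes, here is a minimal sketch of what the dollar sign does. The data frame `pets` and its columns are made up for illustration; they are not part of the workshop data.

```r
# Hypothetical toy data frame (not part of the workshop data).
pets <- data.frame(name = c("Rex", "Tom"), legs = c(4, 4))

# `$` pulls one variable (column) out of a data frame by name.
pets$name

# Double brackets with the column name in quotes return the same column.
pets[["legs"]]

# The result is an ordinary vector, so the usual functions work on it.
sum(pets$legs)
```

The same pattern, data frame name, dollar sign, variable name, works for any column in any data frame.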
Enough playing around, let’s load some data!
riggsi <- read.csv("riggsi.csv")
Next, you want to look at your data.
## Dyad Partner Age Gender Abuse Sat Avoid Anxiety Rel_Length
## 1 1 1 21 -1 5 48 0.2777778 2.2777778 1.083333
## 2 1 2 20 1 5 51 0.0000000 3.8888889 1.083333
## 3 2 1 22 1 15 43 1.4444444 3.4444444 3.416667
## 4 2 2 21 -1 16 47 0.3333333 0.8888889 3.416667
## 5 3 1 32 -1 7 52 0.3333333 1.5555556 1.666667
## 6 3 2 21 1 13 38 0.8333333 5.6111111 1.666667
## Genderstring
## 1 Man
## 2 Woman
## 3 Woman
## 4 Man
## 5 Man
## 6 Woman
## 'data.frame': 310 obs. of 10 variables:
## $ Dyad : int 1 1 2 2 3 3 5 5 6 6 ...
## $ Partner : int 1 2 1 2 1 2 1 2 1 2 ...
## $ Age : int 21 20 22 21 32 21 18 19 22 21 ...
## $ Gender : int -1 1 1 -1 -1 1 1 -1 -1 1 ...
## $ Abuse : num 5 5 15 16 7 13 5 5 5 5 ...
## $ Sat : int 48 51 43 47 52 38 50 45 50 57 ...
## $ Avoid : num 0.278 0 1.444 0.333 0.333 ...
## $ Anxiety : num 2.278 3.889 3.444 0.889 1.556 ...
## $ Rel_Length : num 1.08 1.08 3.42 3.42 1.67 ...
## $ Genderstring: Factor w/ 2 levels "Man","Woman": 1 2 2 1 1 2 2 1 1 2 ...
## [1] "Dyad" "Partner" "Age" "Gender"
## [5] "Abuse" "Sat" "Avoid" "Anxiety"
## [9] "Rel_Length" "Genderstring"
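The calls that printed the three outputs above are not shown. Their shape matches what `head()`, `str()`, and `names()` produce, so here is a sketch of those functions on a small made-up data frame (`toy` is hypothetical and only borrows a few column names from riggsi).

```r
# Hypothetical stand-in for a real data set.
toy <- data.frame(
  Dyad    = c(1, 1, 2, 2),
  Partner = c(1, 2, 1, 2),
  Sat     = c(48, 51, 43, 47)
)

head(toy, 3)   # first 3 rows; head() shows 6 rows when no number is given
str(toy)       # one line per variable: its type and first few values
names(toy)     # the column names as a character vector
dim(toy)       # number of rows, then number of columns
```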
There is also documentation about functions.
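A short sketch of the usual ways to reach that documentation from the console. All of these are base R; the functions looked up here are just examples.

```r
# Open the help page for a function by name.
help("head")

# The question mark is shorthand for the same thing.
?head

# Show only the argument list of a function, without the full help page.
args(write.csv)
```

In RStudio the help page opens in the Help pane; elsewhere it may print to the console or open in a browser.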
You probably also want descriptive statistics.
## Dyad Partner Age Gender
## Min. : 1.00 Min. :1.0 Min. :17.0 Min. :-1
## 1st Qu.: 43.25 1st Qu.:1.0 1st Qu.:19.0 1st Qu.:-1
## Median : 88.00 Median :1.5 Median :21.0 Median : 0
## Mean : 87.69 Mean :1.5 Mean :21.9 Mean : 0
## 3rd Qu.:131.75 3rd Qu.:2.0 3rd Qu.:23.0 3rd Qu.: 1
## Max. :174.00 Max. :2.0 Max. :46.0 Max. : 1
## Abuse Sat Avoid Anxiety
## Min. : 4.404 Min. :20.00 Min. :0.0000 Min. :0.000
## 1st Qu.: 5.000 1st Qu.:41.00 1st Qu.:0.5556 1st Qu.:1.847
## Median : 7.000 Median :46.00 Median :1.2222 Median :2.778
## Mean : 8.888 Mean :44.82 Mean :1.4591 Mean :2.780
## 3rd Qu.:11.000 3rd Qu.:50.00 3rd Qu.:2.1111 3rd Qu.:3.611
## Max. :25.000 Max. :61.00 Max. :5.7222 Max. :5.667
## Rel_Length Genderstring
## Min. :0.4167 Man :155
## 1st Qu.:0.8333 Woman:155
## Median :1.4583
## Mean :1.7386
## 3rd Qu.:2.4896
## Max. :5.0000
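The table above has the layout that `summary()` prints for a data frame, although the call itself is not shown. A minimal sketch on made-up data (`toy` is hypothetical):

```r
# Hypothetical data: one numeric variable and one factor.
toy <- data.frame(
  Sat    = c(48, 51, 43, 47, 52, 38),
  Gender = factor(c("Man", "Woman", "Woman", "Man", "Man", "Woman"))
)

summary(toy)       # quartiles and mean for numbers, counts for factors
summary(toy$Sat)   # the same summary for a single variable

# Individual statistics are also available as separate functions.
mean(toy$Sat)
median(toy$Sat)
sd(toy$Sat)
```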
We can also select pieces of a data frame using square brackets. The first number in the brackets is the row, the second is the column.
riggsi[2, 6]
## [1] 51
#You try it! Find a number you want to pull from the dataset.
#riggsi[ ?, ?]
If you are working with a single variable instead of the whole data frame, you can also select a piece of it by position, for example riggsi$Sat[2].
## [1] 51
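To compare the two styles of indexing side by side, here is a sketch on a made-up data frame (`toy` is hypothetical).

```r
# Hypothetical data frame with two variables.
toy <- data.frame(Age = c(21, 20, 22), Sat = c(48, 51, 43))

toy[2, 2]       # row 2, column 2
toy[2, "Sat"]   # the same cell, naming the column instead of counting
toy$Sat[2]      # pull out the variable first, then take its 2nd value

toy[2, ]        # leave the column blank to get the whole row
toy[, "Sat"]    # leave the row blank to get the whole column
```

A single variable has only one dimension, so it takes one number in the brackets instead of two.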
More descriptive stats and frequencies with sample proportions. We’ll use the package mosaic. To download a cheat sheet for mosaic, click here.
favstats(~Sat, data = riggsi)
## min Q1 median Q3 max mean sd n missing
## 20 41 46 50 61 44.81613 7.30707 310 0
tally(~Gender, data = riggsi, format = "proportion")
## Gender
## -1 1
## 0.5 0.5
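A proportion table is just the counts divided by the total. Here is a base R sketch of that calculation on a made-up vector; it illustrates the arithmetic and is not necessarily how mosaic computes it internally.

```r
# Hypothetical gender codes, using the same -1 / 1 coding as riggsi.
gender <- c(-1, 1, 1, -1, -1, 1)

counts <- table(gender)        # how many of each value
props  <- prop.table(counts)   # counts divided by the total

counts
props
```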
Descriptives split by gender.
favstats(Sat ~ Gender, data = riggsi)
## Gender min Q1 median Q3 max mean sd n missing
## 1 -1 24 40 46 50 61 44.85806 7.171769 155 0
## 2 1 20 41 46 50 60 44.77419 7.462937 155 0
We’ll use the package dplyr. For the cheatsheet, click here.
First, let’s filter cases.
We can make a dataset of men only.
menOnly <- filter(riggsi, Gender == -1)
We can save this new data set in our files as a csv.
write.csv(menOnly, "men.csv")
It’s in the same folder where this .Rmd file is saved! You can be more specific with the file path if you choose.
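Here is a sketch of being more specific with the file path. The folder is a temporary one so the example can run anywhere; `toy` and `out_file` are made-up names. Setting `row.names = FALSE` is an optional choice that keeps the row numbers from being written as an extra column.

```r
# Hypothetical data to save.
toy <- data.frame(Age = c(21, 32), Sat = c(48, 52))

# file.path() glues folder and file names together with the right separator.
out_file <- file.path(tempdir(), "men.csv")

write.csv(toy, out_file, row.names = FALSE)

# Check where R is currently working and that the file can be read back.
getwd()
back <- read.csv(out_file)
back
```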
How about only the men of drinking age?
legal_drinkers <- menOnly %>% filter(Age > 20)
Note the use of the pipe, %>%, above. The two statements below are equivalent.
legal_drinkers <- filter(menOnly, Age > 20)
legal_drinkers <- menOnly %>% filter(Age > 20)
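The pipe hands whatever is on its left to the function on its right as the first argument. The sketch below shows the idea without loading any package: it uses base R's native pipe `|>` (available in R 4.1 and later), which behaves the same way for simple calls like these, and `subset()` stands in for `filter()`. The data frame `toy` is made up.

```r
# Hypothetical data.
toy <- data.frame(Age = c(19, 23, 20, 31), Sat = c(45, 50, 41, 47))

# Nested: read from the inside out.
nested <- nrow(subset(toy, Age > 20))

# Piped: read from left to right, one step at a time.
piped <- toy |> subset(Age > 20) |> nrow()

nested
piped
```

Piping pays off once several steps are chained, because each step sits on its own line in the order it happens.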
Adding new variables.
menOnly <- menOnly %>%
  mutate(legal = Age > 20)
favstats(Age ~ legal, data = menOnly)
## legal min Q1 median Q3 max mean sd n missing
## 1 FALSE 18 19 19 20 20 19.12963 0.7781521 54 0
## 2 TRUE 21 22 23 26 43 24.61386 4.4810050 101 0
An alternative method for descriptive statistics, using dplyr.
riggsi %>%
  summarize(mean = mean(Sat),
            sd = sd(Sat),
            min = min(Sat))
## mean sd min
## 1 44.81613 7.30707 20
We can split the file and view results grouped by some variable.
riggsi %>%
  group_by(Gender) %>%
  summarize(mean = mean(Sat),
            sd = sd(Sat),
            min = min(Sat))
## # A tibble: 2 x 4
## Gender mean sd min
## <int> <dbl> <dbl> <dbl>
## 1 -1 44.9 7.17 24.
## 2 1 44.8 7.46 20.
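Grouped summaries follow a split, apply, combine pattern: split the variable by group, compute the statistics in each piece, and stack the results. Here is a base R sketch of that pattern on made-up data. It illustrates the logic only and is not how dplyr implements `group_by()`.

```r
# Hypothetical data with the same -1 / 1 gender coding as riggsi.
toy <- data.frame(
  Gender = c(-1, 1, 1, -1, -1, 1),
  Sat    = c(48, 51, 43, 47, 52, 38)
)

# Split: one vector of Sat scores per gender.
by_gender <- split(toy$Sat, toy$Gender)

# Apply and combine: the same three statistics for each group.
stats <- sapply(by_gender, function(s) c(mean = mean(s), sd = sd(s), min = min(s)))

t(stats)   # transpose so each group is a row
```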
Save a smaller subset of variables.
small <- riggsi %>%
  select(Dyad, Partner, Sat, Avoid)
Let’s make a histogram. We’ll use the package ggplot2 to make our visualizations. To download a cheatsheet for ggplot2, click here.
qplot(x = Sat, data = riggsi, bins = 20)
Scatterplot. qplot, which stands for quick plot, guesses which kind of figure you want.
qplot(x = Avoid, y = Sat, data = riggsi)
Side-by-side boxplots. In this case qplot does NOT know what to do, so we tell it we want boxplots. It will also be helpful to have a gender variable that is stored as a categorical string variable. We do that first.
qplot(y = Sat, x = Genderstring, data = riggsi, geom = "boxplot")
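The recoding step mentioned before the boxplot call is not shown. One way it could be done is with `factor()`, sketched here on made-up data. The mapping of -1 to "Man" and 1 to "Woman" is an assumption read off the first rows of the data printed earlier.

```r
# Hypothetical data using the -1 / 1 coding.
toy <- data.frame(Gender = c(-1, 1, 1, -1))

# Assumption: -1 codes men and 1 codes women.
toy$Genderstring <- factor(toy$Gender,
                           levels = c(-1, 1),
                           labels = c("Man", "Woman"))

# Cross-tabulate old against new to check the recoding.
table(toy$Gender, toy$Genderstring)
```

With a factor on the x axis, the plotting functions treat gender as two categories instead of a number line running from -1 to 1.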
We can add a regression line.
qplot(x = Avoid, y = Sat, data = riggsi) + geom_smooth(method = "lm", se = FALSE)
We can also split by gender. Note that for more complex plots we need to move away from using the qplot function in favor of the heavy-duty ggplot function.
ggplot(riggsi, aes(x = Avoid, y = Sat, group = Genderstring, color = Genderstring)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Independent samples t-test, making use of the t.test() function in the mosaic package.
t.test(Sat ~ legal, data = menOnly, var.equal = TRUE)
## Sat ~ legal
## Two Sample t-test
## data: Sat by legal
## t = -0.12502, df = 153, p-value = 0.9007
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.547837 2.244573
## sample estimates:
## mean in group FALSE mean in group TRUE
## 44.75926 44.91089
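The pieces of that printout can also be pulled out one at a time. Here is a sketch using base R's `t.test()` on made-up data (`toy` is hypothetical). With `var.equal = TRUE` the test pools the variances, so the degrees of freedom are the total sample size minus 2; without it, R runs Welch's test instead.

```r
# Hypothetical data: 4 people in each group.
toy <- data.frame(
  Sat   = c(44, 47, 41, 50, 46, 43, 49, 45),
  legal = rep(c(FALSE, TRUE), each = 4)
)

res <- t.test(Sat ~ legal, data = toy, var.equal = TRUE)

res$statistic   # the t value
res$parameter   # degrees of freedom: 8 - 2 = 6
res$p.value     # the p-value
res$conf.int    # confidence interval for the difference in means
```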
Paired samples t-test, run here as a one-sample t-test on the difference Sat - Avoid.
t.test(~(Sat-Avoid), data = menOnly)
## ~(Sat - Avoid)
## One Sample t-test
## data: Sat
## t = 71.688, df = 154, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 42.25412 44.64887
## sample estimates:
## mean of x
## 43.45149
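A paired t-test is the same thing as a one-sample t-test on the differences between the two measurements. Here is a sketch on made-up numbers showing that the two calls agree; `first` and `second` are hypothetical scores from the same five people.

```r
# Hypothetical paired scores.
first  <- c(48, 51, 43, 47, 52)
second <- c(45, 50, 44, 41, 49)

paired     <- t.test(first, second, paired = TRUE)
one_sample <- t.test(first - second)

# Same t value, degrees of freedom, and p-value either way.
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)
```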
Linear regression.
mod <- lm(Sat ~ Avoid + legal, data = menOnly)
## Call:
## lm(formula = Sat ~ Avoid + legal, data = menOnly)
## Residuals:
## Min 1Q Median 3Q Max
## -22.480 -3.620 0.316 4.650 14.401
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.4177 1.1953 39.671 < 2e-16 ***
## Avoid -1.8745 0.5183 -3.617 0.000405 ***
## legalTRUE 0.1182 1.1677 0.101 0.919476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 6.927 on 152 degrees of freedom
## Multiple R-squared: 0.07934, Adjusted R-squared: 0.06722
## F-statistic: 6.549 on 2 and 152 DF, p-value: 0.001869
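The table above has the layout that `summary()` prints for a fitted `lm` model, although the call itself is not shown. Here is a sketch on made-up data of fitting a model and getting at its pieces; `toy` and `mod_toy` are hypothetical names.

```r
# Hypothetical data: satisfaction tends to drop as avoidance rises.
toy <- data.frame(
  Avoid = c(0.3, 0.0, 1.4, 0.3, 0.8, 2.1, 1.0, 0.5),
  Sat   = c(48, 51, 43, 47, 45, 38, 44, 50)
)

mod_toy <- lm(Sat ~ Avoid, data = toy)

summary(mod_toy)   # coefficient table, R-squared, F-statistic
coef(mod_toy)      # just the estimates
confint(mod_toy)   # 95% confidence intervals for the estimates
```

Saving the model in an object first, as done with mod above, lets every later step reuse the same fit.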
Regression diagnostics.