The following code is required for proper rendering of the document. Please do not modify it. When working in the Rmd version of this file, do not attempt to run this chunk.
knitr::opts_chunk$set(error = TRUE)
Last updated on 2023-Feb-03.
Original text: Chad M. Eliason, PhD
Revisions: Nick M. A. Crouch, PhD; Lucas J. Legendre, PhD; and Carlos A. Rodriguez-Saltos, PhD
Exercises: Lucas J. Legendre, PhD
Rmarkdown implementation: Carlos A. Rodriguez-Saltos, PhD
Principal course instructor: Julia A. Clarke, PhD
These modules are part of the course “Curiosity to Question: Research Design, Data Analysis and Visualization”, taught by Dr. Julia A. Clarke and Dr. Adam Papendieck at UT Austin.
For questions or comments, please send an email to Dr. Clarke (julia_clarke@jsg.utexas.edu).
Eliason, C. M., Proffitt, J. V., Crouch, N. M. A., Legendre, L. J., Rodriguez-Saltos, C. A., Papendieck, A., & Clarke, J. A. (2020). The Clarke Lab’s Introductory Course to R. Retrieved from https://juliaclarke.squarespace.com
For this section we will work with the `mtcars` dataset, which comes preloaded with R.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Let’s sort the dataset by `mpg` and `cyl`. First, plot `mpg` in its original, unsorted order for comparison.
plot(mtcars$mpg)
newdata <- mtcars[order(mtcars$mpg, mtcars$cyl), ]
order(mtcars$mpg, mtcars$cyl)
## [1] 15 16 24 7 17 31 14 23 22 29 12 13 11 6 5 10 25 30 1 2 32 4 21 3 9
## [26] 8 27 26 19 28 18 20
plot(newdata$mpg)
We can sort by mpg and cyl in ascending and descending order, respectively.
newdata <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]
plot(newdata$cyl)
head(newdata)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
Let’s say you have two datasets and want to merge them based on a common ID variable.
df1 <- data.frame(id=c(1:6), prod=c(rep("toaster", 3), rep("radio", 3)))
df1
## id prod
## 1 1 toaster
## 2 2 toaster
## 3 3 toaster
## 4 4 radio
## 5 5 radio
## 6 6 radio
df2 <- data.frame(id = c(2, 4, 6), state = c(rep('Alabama', 2), 'Ohio'))
df2
## id state
## 1 2 Alabama
## 2 4 Alabama
## 3 6 Ohio
We can merge them as follows:
merge(df1, df2, by="id")
## id prod state
## 1 2 toaster Alabama
## 2 4 radio Alabama
## 3 6 radio Ohio
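By default, `merge` performs an inner join: it keeps only the rows whose `id` appears in both data frames (here, ids 2, 4, and 6). If you also want to keep unmatched rows, here is a quick sketch using the standard `all.x` and `all` arguments of `merge`:
merge(df1, df2, by = "id", all.x = TRUE)  # keep every row of df1; unmatched state becomes NA
merge(df1, df2, by = "id", all = TRUE)    # keep every row of both data frames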
To append the rows of one data frame to another, use `rbind`.
rbind(df1, df1)
## id prod
## 1 1 toaster
## 2 2 toaster
## 3 3 toaster
## 4 4 radio
## 5 5 radio
## 6 6 radio
## 7 1 toaster
## 8 2 toaster
## 9 3 toaster
## 10 4 radio
## 11 5 radio
## 12 6 radio
`rbind`, however, won’t work if your column names differ:
rbind(df1, df2)
## Error in match.names(clabs, names(xi)): names do not match previous names
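If you do need to stack data frames with different columns, one simple workaround (a sketch, not the only option) is to add the missing columns, filled with `NA`, so that the names match before calling `rbind`:
df1b <- df1; df1b$state <- NA        # copy of df1 with the column it is missing
df2b <- df2; df2b$prod <- NA         # copy of df2 with the column it is missing
rbind(df1b, df2b[, names(df1b)])     # reorder df2b's columns to match, then stack
The function `bind_rows` from the `dplyr` package (introduced later in this module) does this filling automatically.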
`table` returns a table with the number of occurrences of each level in a factor.
mtcars$cyl
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
You can use it for more than one column at the same time. The output is called a contingency table.
table(mtcars$cyl, mtcars$am)
##
## 0 1
## 4 3 8
## 6 4 3
## 8 12 2
To show the names of the columns in the table, use the following code:
xtabs(~ cyl + am, data = mtcars)
## am
## cyl 0 1
## 4 3 8
## 6 4 3
## 8 12 2
You can pass the output of `xtabs` to `plot` and `summary`.
xt <- xtabs(~cyl + am, data = mtcars)
xt
## am
## cyl 0 1
## 4 3 8
## 6 4 3
## 8 12 2
plot(xt)
summary(xt)
## Call: xtabs(formula = ~cyl + am, data = mtcars)
## Number of cases in table: 32
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 8.741, df = 2, p-value = 0.01265
## Chi-squared approximation may be incorrect
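The warning about the chi-squared approximation likely appears because some expected counts in this small table are low. In such cases, Fisher's exact test, available in base R as `fisher.test`, is a commonly used alternative; a quick sketch applied to the same table:
fisher.test(xt)  # exact test of independence for the contingency table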
We often want to apply a function to each row or column of a data frame.
Let’s try some examples on the `iris` dataset.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
To get row-wise averages, we take the four numeric columns (shown below) and apply `mean` across rows (`MARGIN = 1`):
iris[, 1:4]
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5.0 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## 11 5.4 3.7 1.5 0.2
## 12 4.8 3.4 1.6 0.2
## 13 4.8 3.0 1.4 0.1
## 14 4.3 3.0 1.1 0.1
## 15 5.8 4.0 1.2 0.2
## 16 5.7 4.4 1.5 0.4
## 17 5.4 3.9 1.3 0.4
## 18 5.1 3.5 1.4 0.3
## 19 5.7 3.8 1.7 0.3
## 20 5.1 3.8 1.5 0.3
## 21 5.4 3.4 1.7 0.2
## 22 5.1 3.7 1.5 0.4
## 23 4.6 3.6 1.0 0.2
## 24 5.1 3.3 1.7 0.5
## 25 4.8 3.4 1.9 0.2
## 26 5.0 3.0 1.6 0.2
## 27 5.0 3.4 1.6 0.4
## 28 5.2 3.5 1.5 0.2
## 29 5.2 3.4 1.4 0.2
## 30 4.7 3.2 1.6 0.2
## 31 4.8 3.1 1.6 0.2
## 32 5.4 3.4 1.5 0.4
## 33 5.2 4.1 1.5 0.1
## 34 5.5 4.2 1.4 0.2
## 35 4.9 3.1 1.5 0.2
## 36 5.0 3.2 1.2 0.2
## 37 5.5 3.5 1.3 0.2
## 38 4.9 3.6 1.4 0.1
## 39 4.4 3.0 1.3 0.2
## 40 5.1 3.4 1.5 0.2
## 41 5.0 3.5 1.3 0.3
## 42 4.5 2.3 1.3 0.3
## 43 4.4 3.2 1.3 0.2
## 44 5.0 3.5 1.6 0.6
## 45 5.1 3.8 1.9 0.4
## 46 4.8 3.0 1.4 0.3
## 47 5.1 3.8 1.6 0.2
## 48 4.6 3.2 1.4 0.2
## 49 5.3 3.7 1.5 0.2
## 50 5.0 3.3 1.4 0.2
## 51 7.0 3.2 4.7 1.4
## 52 6.4 3.2 4.5 1.5
## 53 6.9 3.1 4.9 1.5
## 54 5.5 2.3 4.0 1.3
## 55 6.5 2.8 4.6 1.5
## 56 5.7 2.8 4.5 1.3
## 57 6.3 3.3 4.7 1.6
## 58 4.9 2.4 3.3 1.0
## 59 6.6 2.9 4.6 1.3
## 60 5.2 2.7 3.9 1.4
## 61 5.0 2.0 3.5 1.0
## 62 5.9 3.0 4.2 1.5
## 63 6.0 2.2 4.0 1.0
## 64 6.1 2.9 4.7 1.4
## 65 5.6 2.9 3.6 1.3
## 66 6.7 3.1 4.4 1.4
## 67 5.6 3.0 4.5 1.5
## 68 5.8 2.7 4.1 1.0
## 69 6.2 2.2 4.5 1.5
## 70 5.6 2.5 3.9 1.1
## 71 5.9 3.2 4.8 1.8
## 72 6.1 2.8 4.0 1.3
## 73 6.3 2.5 4.9 1.5
## 74 6.1 2.8 4.7 1.2
## 75 6.4 2.9 4.3 1.3
## 76 6.6 3.0 4.4 1.4
## 77 6.8 2.8 4.8 1.4
## 78 6.7 3.0 5.0 1.7
## 79 6.0 2.9 4.5 1.5
## 80 5.7 2.6 3.5 1.0
## 81 5.5 2.4 3.8 1.1
## 82 5.5 2.4 3.7 1.0
## 83 5.8 2.7 3.9 1.2
## 84 6.0 2.7 5.1 1.6
## 85 5.4 3.0 4.5 1.5
## 86 6.0 3.4 4.5 1.6
## 87 6.7 3.1 4.7 1.5
## 88 6.3 2.3 4.4 1.3
## 89 5.6 3.0 4.1 1.3
## 90 5.5 2.5 4.0 1.3
## 91 5.5 2.6 4.4 1.2
## 92 6.1 3.0 4.6 1.4
## 93 5.8 2.6 4.0 1.2
## 94 5.0 2.3 3.3 1.0
## 95 5.6 2.7 4.2 1.3
## 96 5.7 3.0 4.2 1.2
## 97 5.7 2.9 4.2 1.3
## 98 6.2 2.9 4.3 1.3
## 99 5.1 2.5 3.0 1.1
## 100 5.7 2.8 4.1 1.3
## 101 6.3 3.3 6.0 2.5
## 102 5.8 2.7 5.1 1.9
## 103 7.1 3.0 5.9 2.1
## 104 6.3 2.9 5.6 1.8
## 105 6.5 3.0 5.8 2.2
## 106 7.6 3.0 6.6 2.1
## 107 4.9 2.5 4.5 1.7
## 108 7.3 2.9 6.3 1.8
## 109 6.7 2.5 5.8 1.8
## 110 7.2 3.6 6.1 2.5
## 111 6.5 3.2 5.1 2.0
## 112 6.4 2.7 5.3 1.9
## 113 6.8 3.0 5.5 2.1
## 114 5.7 2.5 5.0 2.0
## 115 5.8 2.8 5.1 2.4
## 116 6.4 3.2 5.3 2.3
## 117 6.5 3.0 5.5 1.8
## 118 7.7 3.8 6.7 2.2
## 119 7.7 2.6 6.9 2.3
## 120 6.0 2.2 5.0 1.5
## 121 6.9 3.2 5.7 2.3
## 122 5.6 2.8 4.9 2.0
## 123 7.7 2.8 6.7 2.0
## 124 6.3 2.7 4.9 1.8
## 125 6.7 3.3 5.7 2.1
## 126 7.2 3.2 6.0 1.8
## 127 6.2 2.8 4.8 1.8
## 128 6.1 3.0 4.9 1.8
## 129 6.4 2.8 5.6 2.1
## 130 7.2 3.0 5.8 1.6
## 131 7.4 2.8 6.1 1.9
## 132 7.9 3.8 6.4 2.0
## 133 6.4 2.8 5.6 2.2
## 134 6.3 2.8 5.1 1.5
## 135 6.1 2.6 5.6 1.4
## 136 7.7 3.0 6.1 2.3
## 137 6.3 3.4 5.6 2.4
## 138 6.4 3.1 5.5 1.8
## 139 6.0 3.0 4.8 1.8
## 140 6.9 3.1 5.4 2.1
## 141 6.7 3.1 5.6 2.4
## 142 6.9 3.1 5.1 2.3
## 143 5.8 2.7 5.1 1.9
## 144 6.8 3.2 5.9 2.3
## 145 6.7 3.3 5.7 2.5
## 146 6.7 3.0 5.2 2.3
## 147 6.3 2.5 5.0 1.9
## 148 6.5 3.0 5.2 2.0
## 149 6.2 3.4 5.4 2.3
## 150 5.9 3.0 5.1 1.8
apply(iris[, 1:4], MARGIN = 1, FUN = mean)
## [1] 2.550 2.375 2.350 2.350 2.550 2.850 2.425 2.525 2.225 2.400 2.700 2.500
## [13] 2.325 2.125 2.800 3.000 2.750 2.575 2.875 2.675 2.675 2.675 2.350 2.650
## [25] 2.575 2.450 2.600 2.600 2.550 2.425 2.425 2.675 2.725 2.825 2.425 2.400
## [37] 2.625 2.500 2.225 2.550 2.525 2.100 2.275 2.675 2.800 2.375 2.675 2.350
## [49] 2.675 2.475 4.075 3.900 4.100 3.275 3.850 3.575 3.975 2.900 3.850 3.300
## [61] 2.875 3.650 3.300 3.775 3.350 3.900 3.650 3.400 3.600 3.275 3.925 3.550
## [73] 3.800 3.700 3.725 3.850 3.950 4.100 3.725 3.200 3.200 3.150 3.400 3.850
## [85] 3.600 3.875 4.000 3.575 3.500 3.325 3.425 3.775 3.400 2.900 3.450 3.525
## [97] 3.525 3.675 2.925 3.475 4.525 3.875 4.525 4.150 4.375 4.825 3.400 4.575
## [109] 4.200 4.850 4.200 4.075 4.350 3.800 4.025 4.300 4.200 5.100 4.875 3.675
## [121] 4.525 3.825 4.800 3.925 4.450 4.550 3.900 3.950 4.225 4.400 4.550 5.025
## [133] 4.250 3.925 3.925 4.775 4.425 4.200 3.900 4.375 4.450 4.350 3.875 4.550
## [145] 4.550 4.300 3.925 4.175 4.325 3.950
To get column-wise averages:
apply(iris[, 1:4], MARGIN = 2, FUN = mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
Standard deviation for each column:
apply(iris[, 1:4], MARGIN = 2, FUN = sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377
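For the common case of means, base R also provides the shortcuts `rowMeans` and `colMeans`, which give the same results as the `apply` calls above:
rowMeans(iris[, 1:4])  # same as apply(iris[, 1:4], MARGIN = 1, FUN = mean)
colMeans(iris[, 1:4])  # same as apply(iris[, 1:4], MARGIN = 2, FUN = mean)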
When you want to apply a function to each level of a factor, use `tapply`:
# Mean of sepal length for each species
tapply(iris$Sepal.Length, iris$Species, FUN = mean)
## setosa versicolor virginica
## 5.006 5.936 6.588
The `aggregate` function produces similar output, but in a nicer format. It splits the dataset into subsets according to the levels of the factors and computes summary statistics for each subset.
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
You can use it for all variables at once.
aggregate(. ~ Species, data = iris, FUN = mean)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
Or just some of the variables:
aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species, data = iris, FUN = mean)
## Species Sepal.Length Sepal.Width
## 1 setosa 5.006 3.428
## 2 versicolor 5.936 2.770
## 3 virginica 6.588 2.974
If, more generally, you are interested in applying a function to each element of an object, check the functions `sapply` and `lapply`. They can be used on the elements of a vector or a list (the latter is a special type of vector that can hold elements of different types).
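As a minimal sketch, using the numeric columns of `iris` that we already loaded (a data frame is itself a list of columns):
lapply(iris[, 1:4], mean)  # returns a list with one element per column
sapply(iris[, 1:4], mean)  # simplifies the result into a named numeric vector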
The package `dplyr` can also help to apply functions to each column of a data frame. It includes the function `summarise`, which makes it easy to calculate several statistics at once to summarize our dataset. We will use `summarise` in conjunction with the “pipe” operator (`%>%`), which feeds the output of one function into the input of the next (think of it as water flowing through several segments of a pipe).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris %>%
group_by(Species) %>%
summarise(
sepal.mean = mean(Sepal.Length),
sepal.sd = sd(Sepal.Length)
)
## # A tibble: 3 × 3
## Species sepal.mean sepal.sd
## <fct> <dbl> <dbl>
## 1 setosa 5.01 0.352
## 2 versicolor 5.94 0.516
## 3 virginica 6.59 0.636
For more information, check the help files of the above functions.
?group_by
?summarise
The function `mutate`, also from `dplyr`, allows you to add a variable based on a calculation performed on other variables in the dataset.
mutate(iris, myvar = Sepal.Length * Sepal.Width)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species myvar
## 1 5.1 3.5 1.4 0.2 setosa 17.85
## 2 4.9 3.0 1.4 0.2 setosa 14.70
## 3 4.7 3.2 1.3 0.2 setosa 15.04
## 4 4.6 3.1 1.5 0.2 setosa 14.26
## 5 5.0 3.6 1.4 0.2 setosa 18.00
## 6 5.4 3.9 1.7 0.4 setosa 21.06
## 7 4.6 3.4 1.4 0.3 setosa 15.64
## 8 5.0 3.4 1.5 0.2 setosa 17.00
## 9 4.4 2.9 1.4 0.2 setosa 12.76
## 10 4.9 3.1 1.5 0.1 setosa 15.19
## 11 5.4 3.7 1.5 0.2 setosa 19.98
## 12 4.8 3.4 1.6 0.2 setosa 16.32
## 13 4.8 3.0 1.4 0.1 setosa 14.40
## 14 4.3 3.0 1.1 0.1 setosa 12.90
## 15 5.8 4.0 1.2 0.2 setosa 23.20
## 16 5.7 4.4 1.5 0.4 setosa 25.08
## 17 5.4 3.9 1.3 0.4 setosa 21.06
## 18 5.1 3.5 1.4 0.3 setosa 17.85
## 19 5.7 3.8 1.7 0.3 setosa 21.66
## 20 5.1 3.8 1.5 0.3 setosa 19.38
## 21 5.4 3.4 1.7 0.2 setosa 18.36
## 22 5.1 3.7 1.5 0.4 setosa 18.87
## 23 4.6 3.6 1.0 0.2 setosa 16.56
## 24 5.1 3.3 1.7 0.5 setosa 16.83
## 25 4.8 3.4 1.9 0.2 setosa 16.32
## 26 5.0 3.0 1.6 0.2 setosa 15.00
## 27 5.0 3.4 1.6 0.4 setosa 17.00
## 28 5.2 3.5 1.5 0.2 setosa 18.20
## 29 5.2 3.4 1.4 0.2 setosa 17.68
## 30 4.7 3.2 1.6 0.2 setosa 15.04
## 31 4.8 3.1 1.6 0.2 setosa 14.88
## 32 5.4 3.4 1.5 0.4 setosa 18.36
## 33 5.2 4.1 1.5 0.1 setosa 21.32
## 34 5.5 4.2 1.4 0.2 setosa 23.10
## 35 4.9 3.1 1.5 0.2 setosa 15.19
## 36 5.0 3.2 1.2 0.2 setosa 16.00
## 37 5.5 3.5 1.3 0.2 setosa 19.25
## 38 4.9 3.6 1.4 0.1 setosa 17.64
## 39 4.4 3.0 1.3 0.2 setosa 13.20
## 40 5.1 3.4 1.5 0.2 setosa 17.34
## 41 5.0 3.5 1.3 0.3 setosa 17.50
## 42 4.5 2.3 1.3 0.3 setosa 10.35
## 43 4.4 3.2 1.3 0.2 setosa 14.08
## 44 5.0 3.5 1.6 0.6 setosa 17.50
## 45 5.1 3.8 1.9 0.4 setosa 19.38
## 46 4.8 3.0 1.4 0.3 setosa 14.40
## 47 5.1 3.8 1.6 0.2 setosa 19.38
## 48 4.6 3.2 1.4 0.2 setosa 14.72
## 49 5.3 3.7 1.5 0.2 setosa 19.61
## 50 5.0 3.3 1.4 0.2 setosa 16.50
## 51 7.0 3.2 4.7 1.4 versicolor 22.40
## 52 6.4 3.2 4.5 1.5 versicolor 20.48
## 53 6.9 3.1 4.9 1.5 versicolor 21.39
## 54 5.5 2.3 4.0 1.3 versicolor 12.65
## 55 6.5 2.8 4.6 1.5 versicolor 18.20
## 56 5.7 2.8 4.5 1.3 versicolor 15.96
## 57 6.3 3.3 4.7 1.6 versicolor 20.79
## 58 4.9 2.4 3.3 1.0 versicolor 11.76
## 59 6.6 2.9 4.6 1.3 versicolor 19.14
## 60 5.2 2.7 3.9 1.4 versicolor 14.04
## 61 5.0 2.0 3.5 1.0 versicolor 10.00
## 62 5.9 3.0 4.2 1.5 versicolor 17.70
## 63 6.0 2.2 4.0 1.0 versicolor 13.20
## 64 6.1 2.9 4.7 1.4 versicolor 17.69
## 65 5.6 2.9 3.6 1.3 versicolor 16.24
## 66 6.7 3.1 4.4 1.4 versicolor 20.77
## 67 5.6 3.0 4.5 1.5 versicolor 16.80
## 68 5.8 2.7 4.1 1.0 versicolor 15.66
## 69 6.2 2.2 4.5 1.5 versicolor 13.64
## 70 5.6 2.5 3.9 1.1 versicolor 14.00
## 71 5.9 3.2 4.8 1.8 versicolor 18.88
## 72 6.1 2.8 4.0 1.3 versicolor 17.08
## 73 6.3 2.5 4.9 1.5 versicolor 15.75
## 74 6.1 2.8 4.7 1.2 versicolor 17.08
## 75 6.4 2.9 4.3 1.3 versicolor 18.56
## 76 6.6 3.0 4.4 1.4 versicolor 19.80
## 77 6.8 2.8 4.8 1.4 versicolor 19.04
## 78 6.7 3.0 5.0 1.7 versicolor 20.10
## 79 6.0 2.9 4.5 1.5 versicolor 17.40
## 80 5.7 2.6 3.5 1.0 versicolor 14.82
## 81 5.5 2.4 3.8 1.1 versicolor 13.20
## 82 5.5 2.4 3.7 1.0 versicolor 13.20
## 83 5.8 2.7 3.9 1.2 versicolor 15.66
## 84 6.0 2.7 5.1 1.6 versicolor 16.20
## 85 5.4 3.0 4.5 1.5 versicolor 16.20
## 86 6.0 3.4 4.5 1.6 versicolor 20.40
## 87 6.7 3.1 4.7 1.5 versicolor 20.77
## 88 6.3 2.3 4.4 1.3 versicolor 14.49
## 89 5.6 3.0 4.1 1.3 versicolor 16.80
## 90 5.5 2.5 4.0 1.3 versicolor 13.75
## 91 5.5 2.6 4.4 1.2 versicolor 14.30
## 92 6.1 3.0 4.6 1.4 versicolor 18.30
## 93 5.8 2.6 4.0 1.2 versicolor 15.08
## 94 5.0 2.3 3.3 1.0 versicolor 11.50
## 95 5.6 2.7 4.2 1.3 versicolor 15.12
## 96 5.7 3.0 4.2 1.2 versicolor 17.10
## 97 5.7 2.9 4.2 1.3 versicolor 16.53
## 98 6.2 2.9 4.3 1.3 versicolor 17.98
## 99 5.1 2.5 3.0 1.1 versicolor 12.75
## 100 5.7 2.8 4.1 1.3 versicolor 15.96
## 101 6.3 3.3 6.0 2.5 virginica 20.79
## 102 5.8 2.7 5.1 1.9 virginica 15.66
## 103 7.1 3.0 5.9 2.1 virginica 21.30
## 104 6.3 2.9 5.6 1.8 virginica 18.27
## 105 6.5 3.0 5.8 2.2 virginica 19.50
## 106 7.6 3.0 6.6 2.1 virginica 22.80
## 107 4.9 2.5 4.5 1.7 virginica 12.25
## 108 7.3 2.9 6.3 1.8 virginica 21.17
## 109 6.7 2.5 5.8 1.8 virginica 16.75
## 110 7.2 3.6 6.1 2.5 virginica 25.92
## 111 6.5 3.2 5.1 2.0 virginica 20.80
## 112 6.4 2.7 5.3 1.9 virginica 17.28
## 113 6.8 3.0 5.5 2.1 virginica 20.40
## 114 5.7 2.5 5.0 2.0 virginica 14.25
## 115 5.8 2.8 5.1 2.4 virginica 16.24
## 116 6.4 3.2 5.3 2.3 virginica 20.48
## 117 6.5 3.0 5.5 1.8 virginica 19.50
## 118 7.7 3.8 6.7 2.2 virginica 29.26
## 119 7.7 2.6 6.9 2.3 virginica 20.02
## 120 6.0 2.2 5.0 1.5 virginica 13.20
## 121 6.9 3.2 5.7 2.3 virginica 22.08
## 122 5.6 2.8 4.9 2.0 virginica 15.68
## 123 7.7 2.8 6.7 2.0 virginica 21.56
## 124 6.3 2.7 4.9 1.8 virginica 17.01
## 125 6.7 3.3 5.7 2.1 virginica 22.11
## 126 7.2 3.2 6.0 1.8 virginica 23.04
## 127 6.2 2.8 4.8 1.8 virginica 17.36
## 128 6.1 3.0 4.9 1.8 virginica 18.30
## 129 6.4 2.8 5.6 2.1 virginica 17.92
## 130 7.2 3.0 5.8 1.6 virginica 21.60
## 131 7.4 2.8 6.1 1.9 virginica 20.72
## 132 7.9 3.8 6.4 2.0 virginica 30.02
## 133 6.4 2.8 5.6 2.2 virginica 17.92
## 134 6.3 2.8 5.1 1.5 virginica 17.64
## 135 6.1 2.6 5.6 1.4 virginica 15.86
## 136 7.7 3.0 6.1 2.3 virginica 23.10
## 137 6.3 3.4 5.6 2.4 virginica 21.42
## 138 6.4 3.1 5.5 1.8 virginica 19.84
## 139 6.0 3.0 4.8 1.8 virginica 18.00
## 140 6.9 3.1 5.4 2.1 virginica 21.39
## 141 6.7 3.1 5.6 2.4 virginica 20.77
## 142 6.9 3.1 5.1 2.3 virginica 21.39
## 143 5.8 2.7 5.1 1.9 virginica 15.66
## 144 6.8 3.2 5.9 2.3 virginica 21.76
## 145 6.7 3.3 5.7 2.5 virginica 22.11
## 146 6.7 3.0 5.2 2.3 virginica 20.10
## 147 6.3 2.5 5.0 1.9 virginica 15.75
## 148 6.5 3.0 5.2 2.0 virginica 19.50
## 149 6.2 3.4 5.4 2.3 virginica 21.08
## 150 5.9 3.0 5.1 1.8 virginica 17.70
A commonly asked question in research is whether two or more populations differ in a statistic, such as the mean. We can use parametric statistics to ask this question when each population is normally distributed. In statistics, the word normal, rather than meaning “typical”, is used to describe frequency curves that have the shape of a bell.
T-tests are used to compare two groups. We will present an example of how to use these tests by working with the `cats` dataset. Specifically, we will test whether cat body weight depends on sex.
First, we load the dataset.
install.packages("MASS")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data(cats)
head(cats)
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
This is the distribution of weight according to sex.
?cats
boxplot(Bwt ~ Sex, data = cats)
We can also visualize the distributions using a histogram.
malewt <- with(cats, Bwt[Sex=='M'])
femalewt <- with(cats, Bwt[Sex=='F'])
hist(
x = femalewt,
# The function `rgb` lets you set colors for each plot
col = rgb(r=0.25, g=0, b=1, 0.5),
ylim = c(0, 25),
xlim = range(c(femalewt, malewt))
)
hist(
x = malewt,
col = rgb(r=0, g=1, b=0, 0.5),
ylim = c(0, 25),
add=TRUE
)
legend(
x = "topright",
legend=c("Female", "Male"),
fill=c(
rgb(r=0.25, g=0, b=1, 0.5),
rgb(r=0, g=1, b=0, 0.5))
)
?hist
From the two types of plots, it looks like the distribution of weight varies according to sex. Males seem to be larger than females. Beyond what we seem to see in the plots, can we objectively confirm that there are differences? For that, we use a T-test.
T-tests require that the data are normally-distributed. That means that within each group, the data should appear as if they come from a bell-shaped curve (to learn more about normal distributions, check this link: https://www.analyticsvidhya.com/blog/2020/04/statistics-data-science-normal-distribution/).
Do the data look normally distributed? Apart from trying to guess the shape of the distributions, we can use a quantile-quantile (Q-Q) plot to find out.
qqnorm(femalewt)
qqline(femalewt)
In this type of plot, if the dots fall along the line, then data are normally distributed. As we can see, female weights do not seem to be normally distributed.
qqnorm(malewt)
qqline(malewt)
Male weights look pretty good as assessed with a Q-Q plot.
Having checked the distributions, we will use a t-test to look for a difference between the sexes. We can use a version of the test that assumes that the variance of the two groups is equal:
t.test(malewt, femalewt, var.equal=TRUE)
##
## Two Sample t-test
##
## data: malewt and femalewt
## t = 7.3307, df = 142, p-value = 1.59e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3946927 0.6861584
## sample estimates:
## mean of x mean of y
## 2.900000 2.359574
Or one that assumes unequal variance:
t.test(malewt, femalewt, var.equal=FALSE)
##
## Welch Two Sample t-test
##
## data: malewt and femalewt
## t = 8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4177242 0.6631268
## sample estimates:
## mean of x mean of y
## 2.900000 2.359574
If your data are not normally distributed, you can use a non-parametric test. For each parametric test there is a non-parametric version. For the two-sample t-test, it is the Wilcoxon rank-sum test (also known as the Mann-Whitney U test).
wilcox.test(malewt, femalewt)
##
## Wilcoxon rank sum test with continuity correction
##
## data: malewt and femalewt
## W = 3801.5, p-value = 8.201e-11
## alternative hypothesis: true location shift is not equal to 0
Tip: If your sample size is large enough, you may still be able to use a parametric test, even when your data are not normally distributed. Be careful; although many statisticians will say that the sample size needs to be at least 30, that number may need to be much larger depending on your case. To learn more, read about the central limit theorem, which you can do by following this link: http://strata.uga.edu/8370/lecturenotes/means.html
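To build some intuition for the central limit theorem, here is a small simulation sketch (the exponential distribution and the sample size of 30 are arbitrary choices for illustration): even though the raw data are strongly skewed, the means of repeated samples pile up into a roughly bell-shaped curve.
set.seed(1)                            # make the simulation reproducible
skewed <- rexp(10000, rate = 1)        # a strongly skewed, non-normal population
sample.means <- replicate(1000, mean(sample(skewed, size = 30)))
par(mfrow = c(1, 2))
hist(skewed, main = "Raw data")            # skewed to the right
hist(sample.means, main = "Sample means")  # approximately bell-shaped
par(mfrow = c(1, 1))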
Most tests in R return several pieces of information as a list. We have to do a bit of extra work if we need one particular piece of it. At a minimum, we want the p-value, which tells us how likely it would be to observe a difference at least as large as the one we measured if there were no real difference between the populations of males and females (consequently, the lower the p-value, the stronger the evidence that the difference is real).
mod <- wilcox.test(malewt, femalewt)
str(mod)
## List of 7
## $ statistic : Named num 3802
## ..- attr(*, "names")= chr "W"
## $ parameter : NULL
## $ p.value : num 8.2e-11
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "location shift"
## $ alternative: chr "two.sided"
## $ method : chr "Wilcoxon rank sum test with continuity correction"
## $ data.name : chr "malewt and femalewt"
## - attr(*, "class")= chr "htest"
mod$p.value
## [1] 8.200502e-11
mod
##
## Wilcoxon rank sum test with continuity correction
##
## data: malewt and femalewt
## W = 3801.5, p-value = 8.201e-11
## alternative hypothesis: true location shift is not equal to 0
As you can see, the p value is very low, and therefore, we can be confident that body weight differs between males and females.
Many other tests are also available in R, including Kolmogorov-Smirnov, Mann-Whitney, Shapiro-Wilk, Fisher-Snedecor, and Bartlett tests, among others. A quick online search is usually enough to find a basic function to perform them!
Warning: Knowing the function in R does NOT always mean that you know what the test does! ALWAYS take the time to read about the actual statistical method you want to use. Make sure you understand its assumptions and parameters.
T-tests are good for comparing two groups. If you want to compare more than two, use an ANOVA.
We will work again with the iris dataset.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Let’s say that our hypothesis is that petal length varies significantly among species. Therefore:
Alternative hypothesis: setosa != versicolor != virginica.
Null hypothesis: setosa = versicolor = virginica.
Let’s first look at the distribution of our data.
boxplot(Petal.Length ~ Species, data = iris)
Just by eyeballing the plot we seem to get a good idea of which hypothesis is supported by the data. Now, let’s run the statistical analysis:
iris.anova <- aov(Petal.Length ~ Species, data = iris)
summary(iris.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given our p-value, we can reject the null hypothesis even at an alpha level of 0.001. Therefore, at least one of the species is significantly different from the others. But the ANOVA does NOT tell us which one(s)! To know that, we need to run post-hoc pairwise comparisons between species. They are called post-hoc because they are done after the ANOVA.
The Tukey HSD test allows us to run post-hoc comparisons between all levels of our variable.
TukeyHSD(iris.anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 2.798 2.59422 3.00178 0
## virginica-setosa 4.090 3.88622 4.29378 0
## virginica-versicolor 1.292 1.08822 1.49578 0
Here, the adjusted p-values are reported as 0 (i.e., extremely small) for all pairs, which means that all groups are different from each other.
What does “p adj” mean? When you run many significance tests on a dataset, you increase the likelihood of getting p < 0.05 just by chance. This comic explains it quite well: https://xkcd.com/882/. The Tukey HSD test adjusts the p-values for multiple testing; “p adj” simply means adjusted p-values.
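If you ever need to adjust p-values yourself, for example after running several separate tests, base R provides `p.adjust`. A minimal sketch (the three p-values below are made up for illustration):
pvals <- c(0.01, 0.04, 0.30)            # hypothetical p-values from three separate tests
p.adjust(pvals, method = "bonferroni")  # simple, conservative correction
p.adjust(pvals, method = "holm")        # the default method, slightly less conservative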
The output of `TukeyHSD` allows us to plot confidence intervals for the pairwise differences in means between species.
plot(TukeyHSD(iris.anova))
As a rule of thumb, if the confidence interval for a particular pairwise comparison does not overlap zero, then the means of the two groups are different.
What if we need a non-parametric test (due to non-normal data and small sample sizes)? Then we use the Kruskal-Wallis test, the non-parametric equivalent of ANOVA:
kruskal.test(Petal.Length ~ Species, data = iris)
##
## Kruskal-Wallis rank sum test
##
## data: Petal.Length by Species
## Kruskal-Wallis chi-squared = 130.41, df = 2, p-value < 2.2e-16
Let’s work with the `crabs` dataset.
library(MASS)
data(crabs)
Let’s explore how the size of the crabs’ frontal lobes may be affected by different factors.
head(crabs)
## sp sex index FL RW CL CW BD
## 1 B M 1 8.1 6.7 16.1 19.0 7.0
## 2 B M 2 8.8 7.7 18.1 20.8 7.4
## 3 B M 3 9.2 7.8 19.0 22.4 7.7
## 4 B M 4 9.6 7.9 20.1 23.1 8.2
## 5 B M 5 9.8 8.0 20.3 23.0 8.2
## 6 B M 6 10.8 9.0 23.0 26.5 9.8
boxplot(FL ~ sp, data = crabs)
boxplot(FL ~ sex, data = crabs)
Now that we are using more factors in our plots, it is a good time to introduce `ggplot2`, which is a package for advanced plotting. Look how easy it is to produce boxplots for two different factors in the same plot.
library(ggplot2)
ggplot(data = crabs, aes(x = sp, y = FL, fill = sex)) +
geom_boxplot()
We can try different colors.
ggplot(data = crabs, aes(x = sp, y = FL, fill = sex)) +
geom_boxplot() +
scale_fill_manual(values = c('F' = 'purple', 'M' = 'chartreuse3'))
(A cheat sheet for colors in R is available at: https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf)
We want to know whether frontal lobe size depends on species or sex, or on a particular combination of species and sex. The effect of each variable by itself is called a main effect, whereas the effect of a combination of two variables is called an interaction. In the `aov` function, we can ask R to test main effects and interactions simply by separating the variables with an asterisk (*).
crab.anova <- aov(FL ~ sex * sp, data = crabs)
summary(crab.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 4.6 4.6 0.476 0.49128
## sp 1 466.3 466.3 48.627 4.63e-11 ***
## sex:sp 1 80.6 80.6 8.409 0.00416 **
## Residuals 196 1879.7 9.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output above, we see that frontal lobe size is affected by species and by an interaction between species and sex. As with the one-way ANOVA, we will run post-hoc tests.
TukeyHSD(crab.anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = FL ~ sex * sp, data = crabs)
##
## $sex
## diff lwr upr p adj
## M-F 0.302 -0.5617106 1.165711 0.4912819
##
## $sp
## diff lwr upr p adj
## O-B 3.054 2.190289 3.917711 0
##
## $`sex:sp`
## diff lwr upr p adj
## M:B-F:B 1.572 -0.03289788 3.1768979 0.0572712
## F:O-F:B 4.324 2.71910212 5.9288979 0.0000000
## M:O-F:B 3.356 1.75110212 4.9608979 0.0000010
## F:O-M:B 2.752 1.14710212 4.3568979 0.0000868
## M:O-M:B 1.784 0.17910212 3.3888979 0.0227162
## M:O-F:O -0.968 -2.57289788 0.6368979 0.4022824
plot(TukeyHSD(crab.anova), las=1)
We see that the confidence interval overlaps zero when comparing the sexes, which means that we cannot be confident about a difference between the sexes. For the remaining comparisons, we can be confident about the difference between species (the bar does not overlap zero), and about differences for four of the comparisons arranged by sex and species (can you identify which ones?).
We use correlation and linear regression when our independent and dependent variables are continuous.
Using the `cats` dataset, we want to know whether body weight and heart weight are correlated.
head(cats)
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
plot(Hwt ~ Bwt, cats)
We will use `cor.test`.
cor.cats <- cor.test(x = cats$Bwt, y = cats$Hwt)
cor.cats$estimate
## cor
## 0.8041274
The estimate returned by `cor.test` (the same value computed by `cor`) is the correlation coefficient, which ranges from -1 (negative correlation) to 1 (positive correlation). In this case, we have a positive correlation.
Squaring the correlation coefficient gives R^2, which in this case is the proportion of the variation in heart weight that can be explained by body weight.
cor.cats$estimate ^ 2
## cor
## 0.6466209
Linear models in R give similar results to `cor.test`. Models represent hypothesized relationships among variables. Usually one is the response variable (y) and one or more are predictors (x’s). With a model you can build an equation that will allow you to predict the dependent variable from the independent variable.
The formula for a linear model is as follows:
y = mx + b + error
In the equation above, m is often called the slope, and b the intercept.
We will apply a linear model to the `cats` dataset…
cats.lm <- lm(Hwt ~ Bwt, data = cats)
summary(cats.lm) ## same r-squared value
##
## Call:
## lm(formula = Hwt ~ Bwt, data = cats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5694 -0.9634 -0.0921 1.0426 5.1238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3567 0.6923 -0.515 0.607
## Bwt 4.0341 0.2503 16.119 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared: 0.6466, Adjusted R-squared: 0.6441
## F-statistic: 259.8 on 1 and 142 DF, p-value: < 2.2e-16
… and plot the model.
plot(Hwt ~ Bwt, cats)
abline(cats.lm, lty = 2)
We can extract the coefficients of the model (the intercept and the slope; in the output, the slope is named after the predictor).
coef(cats.lm)
## (Intercept) Bwt
## -0.3566624 4.0340627
If you want to predict the average value of `y` for a given `x`, all you need to do is plug these coefficients into the equation of the linear model shown above, but without the error term.
For example, when `x` is 3, we predict that the average value of `y` will be… (check the plot of the linear model to confirm the prediction)
y <- coef(cats.lm)[2] * 3 + coef(cats.lm)[1]
print(y)
## Bwt
## 11.74553
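Instead of plugging the coefficients in by hand, you can let R do the arithmetic with the `predict` function (a quick sketch; `newdata` must contain a column named after the predictor, here `Bwt`):
predict(cats.lm, newdata = data.frame(Bwt = 3))  # predicted average Hwt when Bwt is 3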
Of course, due to the expected error, when `x` is 3, `y` will not be exactly equal to 11.7455257 (that is why we called that number an “average value of `y`”). When the error is larger, the observed values of `y` spread further from the predicted average, and the ability of our equation to predict `y` decreases. We can use R^2 as a measure of that ability. It can be calculated from sums of squares of residuals. A residual is the difference between an observed y-value and its corresponding value on the line fitted to the data. Just for fun, we can write a function to estimate sums of squares:
ss <- function(x) {
sum((x - mean(x))^2)
}
ss(x = c(1, 2, 3))
## [1] 2
R^2 compares two numbers: 1) the sum of squares of the residuals of our fitted model and 2) the sum of squares of the residuals of a model in which body weight does not predict heart weight. In other words, we compare the alternative hypothesis to the null hypothesis. To come up with a model for the null hypothesis, we simply draw a horizontal line at the mean heart weight.
plot(Hwt ~ Bwt, cats)
abline(h = mean(cats$Hwt))
Just by eyeballing the graph we can see that such a model is unlikely to explain the data. When that is the case, we expect the sum of squares of residuals to be high.
ss1 <- ss(cats$Hwt - mean(cats$Hwt))
ss1
## [1] 847.6256
In contrast, look at the sums of squares of residuals for the model that we fitted to the data.
ss2 <- ss(residuals(cats.lm))
ss2
## [1] 299.5331
To calculate R^2, we use the following formula:
(SS1 - SS2) / SS1
(ss1 - ss2) / ss1
## [1] 0.6466209
As you can see, it is the same value that we got before. R^2 can take any value from 0 to 1. The higher it is, the higher the predictive power of the model will be.
Multiple regression is used when you have multiple x-variables but only one y-variable.
We will use multiple regression with the `airquality` dataset, and test whether air temperature is correlated with solar radiation and wind speed.
data(airquality)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
To explore patterns, let’s plot all variables in a pairwise fashion.
pairs(airquality)
lm.airquality <- lm(Temp ~ Wind + Solar.R, data = airquality)
summary(lm.airquality)
##
## Call:
## lm(formula = Temp ~ Wind + Solar.R, data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6521 -5.0358 0.9309 5.0250 18.0943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.899708 2.476595 34.281 < 2e-16 ***
## Wind -1.155726 0.188311 -6.137 7.82e-09 ***
## Solar.R 0.025697 0.007337 3.503 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.944 on 143 degrees of freedom
## (7 observations deleted due to missingness)
## Multiple R-squared: 0.2687, Adjusted R-squared: 0.2585
## F-statistic: 26.27 on 2 and 143 DF, p-value: 1.916e-10
`mfrow` defines the number of rows and columns in the grid of plots (here, 2 and 2). To design more complex arrangements of plots, with non-matching numbers of rows and columns, you can also check the `layout` function.
par(mfrow=c(2, 2)) ##the 'par' function changes parameters of a plot
?layout
termplot(lm.airquality, partial = TRUE)[1]
## [1] 2
plot(Temp ~ Wind, data = airquality)
plot(Temp ~ Solar.R, data = airquality)
How much is the position of the regression line affected by chance? Had we added a couple more observations to our sample, or had we taken measurements at different times, would we expect the regression line to stay in exactly the same position? We can answer these questions by visualizing the confidence intervals of the regression lines.
library(visreg)
par(mfrow=c(1,2))
visreg(lm.airquality, "Wind")
visreg(lm.airquality, "Solar.R")
In this case, the confidence intervals indicate regions of the plot where we expect the regression line to lie if we repeated the analysis with a new sample. The regions hug each regression line tightly; thus, we can be confident about using the fitted regression model to explain the relationship between our variables or to predict the average value of `y`.
We have already seen how to check graphically for normality using a Q-Q plot. Now, let us check normality with one of the many available normality tests, the Shapiro-Wilk test, again using the crabs dataset:
library(MASS)
data(crabs)
shapiro.test(crabs$FL)
##
## Shapiro-Wilk normality test
##
## data: crabs$FL
## W = 0.99037, p-value = 0.2023
shapiro.test(crabs$CW)
##
## Shapiro-Wilk normality test
##
## data: crabs$CW
## W = 0.99106, p-value = 0.2542
The Shapiro-Wilk test, like many other tests of normality, tests whether the distribution of the data is significantly different from a normal distribution, which means that a significant p-value indicates that the data are not normally distributed. In our case, p > 0.05 for both variables; therefore, the data are consistent with a normal distribution.
In a regression model, normality is checked for the residuals of the regression and not the raw data (Remember: residuals are the differences between raw data and values predicted by the regression model).
crabs.lm <- lm(FL ~ CL, data = crabs)
summary(crabs.lm)
##
## Call:
## lm(formula = FL ~ CL, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86395 -0.51746 -0.02826 0.50456 1.77009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.15316 0.23477 0.652 0.515
## CL 0.48060 0.00714 67.313 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.717 on 198 degrees of freedom
## Multiple R-squared: 0.9581, Adjusted R-squared: 0.9579
## F-statistic: 4531 on 1 and 198 DF, p-value: < 2.2e-16
residuals <- crabs.lm$residuals
residuals
## 1 2 3 4 5 6
## 0.209214133 -0.051982222 -0.084520582 -0.213178578 -0.109298213 -0.406913293
## 7 8 9 10 11 12
## -0.491391835 -0.327810559 0.016368894 -0.464229284 -1.073485457 -0.733186368
## 13 14 15 16 17 18
## -0.865724728 -0.425425639 -0.521545275 -0.133186368 -0.606023817 -0.654083635
## 19 20 21 22 23 24
## -0.213784546 -0.286621995 -0.895878168 -0.884237074 -0.243937986 -0.724536163
## 25 26 27 28 29 30
## -0.772595981 -0.476476346 -0.612895070 -1.658569968 -0.593493248 -0.297373612
## 31 32 33 34 35 36
## -0.826031608 -1.254689603 -0.962450332 -0.910510150 -1.863945776 -0.750809239
## 37 38 39 40 41 42
## -1.179467234 -0.546928874 -0.691108327 -1.023646687 -1.052304683 -1.332902861
## 43 44 45 46 47 48
## -0.992603772 -1.586338487 -0.846039398 -0.509620674 -0.590218852 -1.114996483
## 49 50 51 52 53 54
## -0.730517940 -1.489329376 -0.017948418 -0.428700035 0.055778507 -0.280640218
## 55 56 57 58 59 60
## -0.072879489 -0.157358031 -0.097657120 -0.089896391 -0.182135662 -0.166614204
## 61 62 63 64 65 66
## 0.033385796 0.089206343 -0.523930195 -0.375870377 0.056667983 -0.420049831
## 67 68 69 70 71 72
## -0.748707826 0.024129623 -0.360348919 -0.096767644 -0.377365821 -0.185126550
## 73 74 75 76 77 78
## -0.761844364 -0.273485457 -0.317664910 -0.361844364 -0.402143453 -0.198263088
## 79 80 81 82 83 84
## -0.671100537 0.080839645 0.280839645 -0.388117439 -0.916775434 -0.336177257
## 85 86 87 88 89 90
## -0.384237074 -0.288117439 -1.245433430 -0.520655799 -0.324536163 -0.705134341
## 91 92 93 94 95 96
## -0.316775434 -0.845433430 -0.601253977 -0.933792337 -0.533792337 -0.189612883
## 97 98 99 100 101 102
## -1.043048510 -1.112005594 -1.204244865 -0.609620674 0.920855227 0.338761605
## 103 104 105 106 107 108
## 0.598462516 0.337266160 1.196967072 0.716368894 0.051292174 0.022634179
## 109 110 111 112 113 114
## 0.462933267 0.330394907 -0.186621995 0.057557459 -0.038562177 0.153677094
## 115 116 117 118 119 120
## 0.253677094 0.686215454 0.013378005 0.076959281 0.609497641 0.432779828
## 121 122 123 124 125 126
## 0.096361103 0.732779828 0.311882561 0.523523654 0.959942379 0.125908575
## 127 128 129 130 131 132
## -0.102749421 0.814267481 0.718147846 0.345310397 0.012772037 0.208891673
## 133 134 135 136 137 138
## -0.600364501 0.032173859 0.280233677 0.712772037 0.768592584 -0.192603772
## 139 140 141 142 143 144
## 0.455456046 0.119037322 0.434558779 -0.122757211 0.229182971 -0.520372291
## 145 146 147 148 149 150
## -0.516491927 0.588883882 -0.072312474 0.512166069 0.162610807 0.070371536
## 151 152 153 154 155 156
## 0.262043791 0.817864338 0.764428712 0.431890352 0.347411810 0.870693996
## 157 158 159 160 161 162
## 0.005617276 0.642036001 0.321138734 0.328899463 0.380839645 0.452181650
## 163 164 165 166 167 168
## 0.259942379 0.548301285 0.648301285 0.759942379 0.856062014 0.859942379
## 169 170 171 172 173 174
## 1.023523654 1.039045112 0.766207663 0.766207663 1.342925477 1.106506752
## 175 176 177 178 179 180
## 1.170088028 -0.367826141 1.277848757 1.137549668 0.705011308 0.564712219
## 181 182 183 184 185 186
## 1.770088028 1.441430032 1.301130944 0.780233677 0.732173859 0.370977504
## 187 188 189 190 191 192
## 0.811276593 0.474857868 0.430678415 0.819037322 1.211276593 1.122917686
## 193 194 195 196 197 198
## 0.502020420 1.570977504 0.096644611 1.446199873 1.505900784 1.273362424
## 199 200
## 1.681123153 0.743208984
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99496, p-value = 0.7446
Many parametric analyses require homoscedasticity; that is, variances between groups in a dataset should be homogeneous. Homoscedasticity can be checked graphically by plotting the residuals.
par(mfrow=c(1,1))
plot(
x = crabs$CL,
y = residuals,
ylab="Residuals",
xlab="Carapace length",
main="Normalized Residuals v Fitted Values"
)
abline(c(0,0))
…but it looks better when using a function from the `car` package:
library("car")
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
residualPlot(crabs.lm)
If the assumption of homoscedasticity holds, the dots in the previous plots should look homogeneous: a point cloud with roughly constant spread, no funnel shape, and no clear outliers.
To check for outliers, you can use studentized residuals, which are residuals divided by an estimate of their standard deviation.
spreadLevelPlot(crabs.lm)
##
## Suggested power transformation: -0.8908105
The data do not look very homogeneous.
If you want to test for homoscedasticity, use a non-constant error variance test, such as the Breusch-Pagan test.
ncvTest(crabs.lm)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 22.77853, Df = 1, p = 1.8178e-06
Here, p is significant, which means that the residuals are not homoscedastic. In this case, using weighted least squares (WLS) might be better than using ordinary least squares when estimating the linear model. If you want to know more about this topic, check the following link: https://towardsdatascience.com/when-and-how-to-use-weighted-least-squares-wls-models-a68808b1a89d
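As a rough sketch of how weighted least squares could be fitted here (the weights below, which assume that the residual variance grows with the square of the fitted values, are one common heuristic and an assumption on our part, not a prescription from the linked article):
ols <- lm(FL ~ CL, data = crabs)                      # the ordinary least squares fit
w <- 1 / fitted(ols)^2                                # heuristic weights: downweight observations with larger fitted values
crabs.wls <- lm(FL ~ CL, data = crabs, weights = w)   # refit using those weights
summary(crabs.wls)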
When you have multiple predictors, you need to check whether the independent variables (x variables) are correlated with each other. If they are, the individual effect of each predictor on the dependent variable cannot be properly estimated.
Correlation among independent variables can be assessed by estimating the variance inflation factor (VIF):
library(car)
# Fitting the linear model
crabs.lm <- lm(FL~CL+CW,data=crabs)
summary(crabs.lm)
##
## Call:
## lm(formula = FL ~ CL + CW, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73089 -0.36896 -0.05988 0.28899 1.82283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59245 0.22046 2.687 0.00782 **
## CL 0.92406 0.06443 14.342 < 2e-16 ***
## CW -0.40305 0.05827 -6.917 6.29e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6448 on 197 degrees of freedom
## Multiple R-squared: 0.9663, Adjusted R-squared: 0.966
## F-statistic: 2826 on 2 and 197 DF, p-value: < 2.2e-16
# Testing for multicolinearity
vif(crabs.lm)
## CL CW
## 100.7036 100.7036
sqrt(vif(crabs.lm))
## CL CW
## 10.03512 10.03512
The square root of the VIF indicates how much larger the standard error is compared to a model in which the predictors are independent. Here, it is about 10 times larger! Carapace length and width are highly collinear.
Observations for each variable in a linear model are assumed to be independent of each other. If you just want to test for autocorrelation in your dataset, use a Durbin-Watson test:
durbinWatsonTest(crabs.lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.5563061 0.8747409 0
## Alternative hypothesis: rho != 0
A p-value that small (close to 0!) suggests that the samples are not independent. In cases such as this, it is better to use generalized least squares when fitting a regression. The function `gls` from the package `nlme` allows you to implement this type of analysis in R.
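A minimal sketch of what that could look like (the first-order autoregressive correlation structure chosen below is an assumption for illustration, and `nlme` must be installed):
library(nlme)
# Generalized least squares with an AR(1) correlation structure over the row order
crabs.gls <- gls(FL ~ CL + CW, data = crabs, correlation = corAR1())
summary(crabs.gls)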
The function `gvlma` can help you quickly check several assumptions of a linear model:
install.packages("gvlma")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(gvlma)
model<-gvlma(crabs.lm)
summary(model)
##
## Call:
## lm(formula = FL ~ CL + CW, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73089 -0.36896 -0.05988 0.28899 1.82283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59245 0.22046 2.687 0.00782 **
## CL 0.92406 0.06443 14.342 < 2e-16 ***
## CW -0.40305 0.05827 -6.917 6.29e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6448 on 197 degrees of freedom
## Multiple R-squared: 0.9663, Adjusted R-squared: 0.966
## F-statistic: 2826 on 2 and 197 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = crabs.lm)
##
## Value p-value Decision
## Global Stat 31.580 2.331e-06 Assumptions NOT satisfied!
## Skewness 7.355 6.686e-03 Assumptions NOT satisfied!
## Kurtosis 2.130 1.445e-01 Assumptions acceptable.
## Link Function 0.187 6.655e-01 Assumptions acceptable.
## Heteroscedasticity 21.908 2.861e-06 Assumptions NOT satisfied!
In the following link, find a lot of great examples on how to analyze linear models, using a geological dataset on soil content indicators: http://www.css.cornell.edu/faculty/dgr2/_static/files/R_PDF/mhw.pdf
Coding with style can help make your scripts readable. Hadley Wickham, creator of many popular packages such as `ggplot2`, has written this very useful post on coding with style in R: http://adv-r.had.co.nz/Style.html (also a good example of well-formatted Markdown output!). In particular, we recommend that you read the “Line length” section first, because, in our experience, most problems with code readability are related to having very long lines of code. But do check other parts of the post, and use the document as a reference that you go back to as your code starts increasing in complexity.
The first time that you write a piece of code it does not need to be perfect and you do not need to spend much time working on the format. Revising your code, however, will make it easier for your readers to understand your analyses. Programmers have a special name for this process: “refactoring”, which is just a synonym of “revising”. This is one of the many ways in which coding is like writing a manuscript.
Tip: When you have some spare time, go back to your code and ask yourself: How can I make it more readable? Do I need to change the style? Can I rewrite variables and functions in a way that produces the same output with less code? If you find something you can change, do it.
Download the dataset Weather from this link: https://vincentarelbundock.github.io/Rdatasets/csv/mosaicData/Weather.csv.
This dataset belongs to the `mosaicData` package. To learn more about the dataset, install and load `mosaicData` and then search for the corresponding help file.
In each corresponding chunk:
Import the dataset into R as a data frame.
What was the maximum temperature ever reached in Mumbai in December 2017? On which days of the month did it occur?
For how many days was there fog, rain, and a thunderstorm in Auckland in 2016?
Does average humidity influence average visibility in Beijing?
Are the data used for the previous question normal? Homoscedastic? Check these assumptions graphically and with a test.
Focusing on the last three months of 2017 in Chicago, does minimal sea pressure vary with any specific weather event (column “events”)?
Do a boxplot of the average temperature for each city.
Test if cities have significantly different average temperatures.
If yes, which cities have different temperatures?
We want to compare how temperature profiles differ between Beijing and Chicago through 2017. Plot two histograms side by side of the average temperature, one for each city. The histograms must have different colors.