This repository was archived by the owner on Oct 7, 2019. It is now read-only.
-
-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy path03-data-analysis.Rmd
258 lines (200 loc) · 8.89 KB
/
03-data-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
---
layout: page
title: Using dplyr for data manipulation and analysis
subtitle: dplyr for data analysis
minutes: 60
---
```{r setup, echo=FALSE, purl=FALSE}
source("setup.R")
```
> ## Learning Objectives
>
> * Use `dplyr` to explore and understand the structure of a dataset
> * Use `dplyr` to calculate basic statistics from the dataset
> * Understand the concept of missing data and how to deal with them
> * Join two datasets
# Presentation of the Survey Data
```{r, echo=FALSE, purl=TRUE}
# Presentation of the survey data
```
We are studying the species and weight and hindfoot length of animals caught in plots in our study
area. The dataset is stored as a `.csv` files in an (online repository)[http://figshare.com/articles/Portal_Project_Teaching_Database/1314459]: each row
holds information for a single animal, and the columns represent `record_id`
`month`, `day`, `year`, `plot_id`, `species_id` (a 2 letter code, see the
`species.csv` file for correspondence), `sex` ("M" for males and "F" for
females), `hindfoot_length` (in millimeters), and `weight` (in grams).
This is an example of what the survey dataset looks like:
"63","8","19","1977","3","DM","M","35","40"
"64","8","19","1977","7","DM","M","37","48"
"65","8","19","1977","4","DM","F","34","29"
"66","8","19","1977","4","DM","F","35","46"
"67","8","19","1977","7","DM","M","35","36"
We will use `read.csv()` to load into memory (as a `data.frame`) the content of
the CSV file.
Next, read the `surveys.csv` into memory using `read.csv()`:
```{r demo-data-load, eval=TRUE, purl=FALSE}
surveys <- read.csv('data/surveys.csv')
```
This statement doesn't produce any output because assignment doesn't display
anything. If we want to check that our data has been loaded, we can print the
variable's value: `surveys`
Wow... that was a lot of output. At least it means the data loaded
properly. Let's check the top (the first 6 lines) of this `data.frame` using the
function `head()`:
```{r, results='show', purl=FALSE}
head(surveys)
```
## Calculating statistics
```{r, echo=FALSE, purl=FALSE}
## Calculating statistics
```
Let's get a closer look at our data. For instance, we might want to know how
many animals we trapped in each plot, or how many of each species were caught.
To get all the species caught during this study, we are going to use the
`distinct()` function on the `species_id` column or the `surveys` data frame:
```{r distinct-species, purl=FALSE}
library(dplyr)
distinct_species <- surveys %>%
select(species_id) %>% distinct()
```
### Challenge
1. How many species were caught?
1. Why is the species listed on line 14 blank?
The function `group_by` from `dplyr` allows you to calculate statistics based on
the levels of a factor. For instance, we can calculate the number of individuals
caught for each species by combining the functions `summarize()` (or
`summarise()` if you prefer) and `n()` on the "grouped by" dataset:
```{r count-species, purl=FALSE}
species_count <- surveys %>% group_by(species_id) %>%
summarize(count=n())
```
Here `count` is the column name and the function `n()` gets applied to each
unique value found in the `species_id` column.
This function returns an object of class `tbl`, and by default it only shows the
first 10 lines. If you want to see more of the data, you can use the function
`print` and use the `n` argument to specify the number of lines you'd like to
see. For instance, as there are 49 unique values in our dataset, we can use:
```{r summary-species}
print(species_count, n=50)
```
R has a lot of built in statistical functions, like `mean()`, `median()`,
`max()`, `min()`. Let's start by calculating the average weight of all the
animals using the function `mean()` in combination with `group_by()`:
```{r species-mean-weight-1}
surveys %>% group_by(species_id) %>%
summarize(count=n(),
mean_weight=mean(weight))
```
Hmm, we just get `NA`. That's because we don't have the weight for every animal
and missing data is recorded as `NA`. By default, all R functions operating on
data that contain missing value will return `NA`. It's a way to make sure that
users know they have missing values, and make a conscious decision on how to deal
with them.
When dealing with simple statistics like the mean, the easiest way to ignore
`NA` (the missing data) is to use `na.rm=TRUE` (`rm` stands for remove):
```{r species-mean-weight-2}
surveys %>% group_by(species_id) %>%
summarize(count=n(),
mean_weight=mean(weight, na.rm=TRUE))
```
Slightly better but still not great. For 9 out of the first 10 species listed,
we get `NaN` ('Not a Number'). Let's try to understand why this happens by
comparing the number of observations to the number of `NA` for each species:
```{r count-NAs, purl=FALSE}
surveys %>% group_by(species_id) %>%
summarize(count = n(),
n_na = sum(is.na(weight)),
mean_weight = mean(weight, na.rm=TRUE))
```
Here we can see that we were trying to calculate the mean for species that
completely lacked weight measurements. We were therefore trying to divide
missing data by 0 leading to the `NaN`. One way to deal with this issue is to
first remove all missing data for weight using `filter()`, and because there
will be no missing data left, we can remove the `na.rm=TRUE` inside the `mean()`
function:
```{r remove-NAs, purl=FALSE}
surveys %>% group_by(species_id) %>% filter(!is.na(weight)) %>%
summarize(count=n(),
n_na = sum(is.na(weight)),
mean_weight = mean(weight)
)
```
Let's remove the `n_na` column now that we figured out the problem with the
mean, let's add the standard deviation for weight, and let's save the results
in a variable called `species_stats`:
```{r mean-weight, purl=FALSE}
species_stats <- surveys %>% group_by(species_id) %>% filter(!is.na(weight)) %>%
summarize(count=n(),
mean_weight = mean(weight),
mean_sd = sd(weight)
)
```
### Challenge
1. Modify the `species_stats` object so that it also includes the mean and
standard deviation for hindfoot_length
## Joining datasets
A common operation when managing data is to join (merge) the content of two
tables based on unique identifiers. In our example, we have a `species` dataset
that stores the correspondence between a two-letter code (`species_id`) and the
full species name; and this two-letter code is also used in the `surveys`
dataset to record the name of the species. We can use the `species_id` in these
two datasets to join them.
```{r example-datasets, echo=FALSE, results='markup', purl=TRUE}
spex <- data.frame(species_id = c("AB", "AH", "AS", "BA"),
genus = c("Amphispiza", "Ammospermophilus", "Ammodramus", "Baiomys"),
species = c("bilineata", "harrisi", "savannarum", "taylori"),
stringsAsFactors=FALSE)
suex <- data.frame(species_id = c("BA", "AS", "", "BA"),
sex = c("M", "M", "F", "F"),
hindfoot_length = c(33, 26, 15, 25),
weight = c(NA, 45, 33, 25),
stringsAsFactors=FALSE)
```
A simplified version of the `surveys` dataset:
```{r}
kable(suex)
```
A simplified version of the `species` dataset:
```{r}
spex
```
There are four options depending on how you want to treat the missing values:
- `inner_join(x, y)` only keeps the rows that have a match in both datasets;
- `left_join(x, y)` keeps all the rows in `x` and fills with missing data the
rows that do not match in `y`
- `right_join(x, y)` is the equivalent to `left_join(y, x)` but the order of the
columns will be different
- `full_join(x, y)` includes all observations from `x` and `y`
Note: `right_join(x, y)` and `full_join(x, y)` were introduced in dplyr version 0.4.0
With our simplified examples:
* **inner join** only data for survey records where species is not missing:
```{r inner_join}
inner_join(suex, spex)
```
* **left join** only survey data, missing species gets `NA` for genus and species:
```{r left_join}
left_join(suex, spex)
```
* **right join** only survey data without missing species, also include species
not observed:
```{r right_join}
right_join(suex, spex)
```
* **full join** all data from both datasets
```{r full_join}
full_join(suex, spex)
```
## Challenge
1. Which one of these type of joins are you going to use most often?
1. If you want to add the species data (i.e., genus, species, taxa) to this
dataset, what type of join are you going to use?
1. The file `species.csv` contains more information about the observed species. It uses the same `species_id` values to identify each species.
We can load this file with `species <- read.csv('data/species.csv')`
Create a new dataset, `surveys_species`, that adds the species data to the
surveys data
<!--
```{r challenge-join}
species <- read.csv('data/species.csv')
surveys_species <- left_join(surveys, species)
```
-->