03-data-analysis.Rmd

---
layout: page
title: Using dplyr for data manipulation and analysis
subtitle: dplyr for data analysis
minutes: 60
---


```{r setup, echo=FALSE, purl=FALSE}
source("setup.R")

```

> ## Learning Objectives
>
> * Use `dplyr` to explore and understand the structure of a dataset
> * Use `dplyr` to calculate basic statistics from the dataset
> * Understand the concept of missing data and how to deal with them
> * Join two datasets

# Presentation of the Survey Data

```{r, echo=FALSE, purl=TRUE}
# Presentation of the survey data
```

We are studying the species and weight and hindfoot length of animals caught in plots in our study
area.  The dataset is stored as a `.csv` files in an (online repository)[http://figshare.com/articles/Portal_Project_Teaching_Database/1314459]: each row
holds information for a single animal, and the columns represent `record_id`
`month`, `day`, `year`, `plot_id`, `species_id` (a 2 letter code, see the
`species.csv` file for correspondence), `sex` ("M" for males and "F" for
females), `hindfoot_length` (in millimeters), and `weight` (in grams).

This is an example of what the survey dataset looks like:

    "63","8","19","1977","3","DM","M","35","40"
    "64","8","19","1977","7","DM","M","37","48"
    "65","8","19","1977","4","DM","F","34","29"
    "66","8","19","1977","4","DM","F","35","46"
    "67","8","19","1977","7","DM","M","35","36"

We will use `read.csv()` to load into memory (as a `data.frame`) the content of
the CSV file.

Next, read the `surveys.csv` into memory using `read.csv()`:

```{r demo-data-load, eval=TRUE,  purl=FALSE}
surveys <- read.csv('data/surveys.csv')
```

This statement doesn't produce any output because assignment doesn't display
anything. If we want to check that our data has been loaded, we can print the
variable's value: `surveys`

Wow... that was a lot of output. At least it means the data loaded
properly. Let's check the top (the first 6 lines) of this `data.frame` using the
function `head()`:

```{r, results='show', purl=FALSE}
head(surveys)
```

## Calculating statistics

```{r, echo=FALSE, purl=FALSE}
## Calculating statistics
```

Let's get a closer look at our data. For instance, we might want to know how
many animals we trapped in each plot, or how many of each species were caught.

To get all the species caught during this study, we are going to use the
`distinct()` function on the `species_id` column or the `surveys` data frame:

```{r distinct-species, purl=FALSE}
library(dplyr)
distinct_species <- surveys %>%
                    select(species_id) %>% distinct()
```

### Challenge

1. How many species were caught?
1. Why is the species listed on line 14 blank?

The function `group_by` from `dplyr` allows you to calculate statistics based on
the levels of a factor. For instance, we can calculate the number of individuals
caught for each species by combining the functions `summarize()` (or
`summarise()` if you prefer) and `n()` on the "grouped by" dataset:

```{r count-species, purl=FALSE}
species_count <- surveys %>% group_by(species_id) %>%
                 summarize(count=n())
```

Here `count` is the column name and the function `n()` gets applied to each
unique value found in the `species_id` column.

This function returns an object of class `tbl`, and by default it only shows the
first 10 lines. If you want to see more of the data, you can use the function
`print` and use the `n` argument to specify the number of lines you'd like to
see. For instance, as there are 49 unique values in our dataset, we can use:

```{r summary-species}
print(species_count, n=50)
```

R has a lot of built in statistical functions, like `mean()`, `median()`,
`max()`, `min()`. Let's start by calculating the average weight of all the
animals using the function `mean()` in combination with `group_by()`:

```{r species-mean-weight-1}
surveys %>% group_by(species_id) %>%
            summarize(count=n(),
                      mean_weight=mean(weight))
```

Hmm, we just get `NA`. That's because we don't have the weight for every animal
and missing data is recorded as `NA`. By default, all R functions operating on
data that contain missing value will return `NA`. It's a way to make sure that
users know they have missing values, and make a conscious decision on how to deal
with them.

When dealing with simple statistics like the mean, the easiest way to ignore
`NA` (the missing data) is to use `na.rm=TRUE` (`rm` stands for remove):

```{r species-mean-weight-2}
surveys %>% group_by(species_id) %>%
            summarize(count=n(),
                      mean_weight=mean(weight, na.rm=TRUE))
```

Slightly better but still not great. For 9 out of the first 10 species listed,
we get `NaN` ('Not a Number'). Let's try to understand why this happens by
comparing the number of observations to the number of `NA` for each species:

```{r count-NAs, purl=FALSE}
surveys %>% group_by(species_id) %>%
            summarize(count = n(),
                      n_na = sum(is.na(weight)),
                      mean_weight = mean(weight, na.rm=TRUE))
```

Here we can see that we were trying to calculate the mean for species that
completely lacked weight measurements. We were therefore trying to divide
missing data by 0 leading to the `NaN`. One way to deal with this issue is to
first remove all missing data for weight using `filter()`, and because there
will be no missing data left, we can remove the `na.rm=TRUE` inside the `mean()`
function:

```{r remove-NAs, purl=FALSE}
surveys %>% group_by(species_id) %>% filter(!is.na(weight)) %>%
    summarize(count=n(),
              n_na = sum(is.na(weight)),
              mean_weight = mean(weight)
              )
```

Let's remove the `n_na` column now that we figured out the problem with the
 mean, let's add the standard deviation for weight, and let's save the results
 in a variable called `species_stats`:


```{r mean-weight, purl=FALSE}
species_stats <- surveys %>% group_by(species_id) %>% filter(!is.na(weight)) %>%
    summarize(count=n(),
              mean_weight = mean(weight),
              mean_sd = sd(weight)
              )
```

### Challenge

1. Modify the `species_stats` object so that it also includes the mean and
   standard deviation for hindfoot_length

## Joining datasets

A common operation when managing data is to join (merge) the content of two
tables based on unique identifiers. In our example, we have a `species` dataset
that stores the correspondence between a two-letter code (`species_id`) and the
full species name; and this two-letter code is also used in the `surveys`
dataset to record the name of the species. We can use the `species_id` in these
two datasets to join them.

```{r example-datasets, echo=FALSE, results='markup', purl=TRUE}
spex <- data.frame(species_id = c("AB", "AH", "AS", "BA"),
                   genus = c("Amphispiza", "Ammospermophilus", "Ammodramus", "Baiomys"),
                   species = c("bilineata", "harrisi", "savannarum", "taylori"),
                   stringsAsFactors=FALSE)
suex <- data.frame(species_id = c("BA", "AS", "", "BA"),
                   sex = c("M", "M", "F", "F"),
                   hindfoot_length = c(33, 26, 15, 25),
                   weight = c(NA, 45, 33, 25),
                   stringsAsFactors=FALSE)

```

A simplified version of the `surveys` dataset:

```{r} 
kable(suex)
```


A simplified version of the `species` dataset:

```{r} 
spex
```


There are four options depending on how you want to treat the missing values:
- `inner_join(x, y)` only keeps the rows that have a match in both datasets;
- `left_join(x, y)` keeps all the rows in `x` and fills with missing data the
  rows that do not match in `y`
- `right_join(x, y)` is the equivalent to `left_join(y, x)` but the order of the
  columns will be different
- `full_join(x, y)` includes all observations from `x` and `y`

Note: `right_join(x, y)` and `full_join(x, y)` were introduced in dplyr version 0.4.0

With our simplified examples:

* **inner join** only data for survey records where species is not missing: 
```{r inner_join} 
inner_join(suex, spex)
```
* **left join** only survey data, missing species gets `NA` for genus and species: 
```{r left_join} 
left_join(suex, spex)
```
* **right join** only survey data without missing species, also include species
  not observed: 
```{r right_join} 
right_join(suex, spex)
```
* **full join** all data from both datasets
```{r full_join}
full_join(suex, spex)
```

## Challenge

1. Which one of these type of joins are you going to use most often?
1. If you want to add the species data (i.e., genus, species, taxa) to this
   dataset, what type of join are you going to use?
1. The file `species.csv` contains more information about the observed species. It uses the same `species_id` values to identify each species.
   We can load this file with `species <- read.csv('data/species.csv')`
   Create a new dataset, `surveys_species`, that adds the species data to the
   surveys data

<!--
```{r challenge-join}
species <- read.csv('data/species.csv')
surveys_species <- left_join(surveys, species)
```
-->