---
title: "Examining data"
output: html_notebook
---
As an example, we will load a data set based on the registration data for this workshop.
Before we begin, use the broom icon in the Environment tab to clear the environment.
# Loading packages
Now, load the required packages. On your own computer, you might have to install them first using:

    install.packages("tidyverse")
    install.packages("ggExtra")
```{r message=FALSE, warning=FALSE, include=FALSE, results='hide'}
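# library() stops with an error if a package is missing;
# require() would only warn and return FALSE.
library(ggplot2)   # plotting
library(dplyr)     # data manipulation and the %>% pipe
library(lubridate) # date-time parsing
library(ggExtra)   # add-ons for ggplot2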
```
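If you are unsure which of these packages are already installed, a quick check along the following lines can help. This is only a sketch: the package names are copied from the chunk above, and the install step is left commented out so that nothing is installed without you asking for it.

```r
# Packages the notebook loads (copied from the chunk above).
needed = c("ggplot2", "dplyr", "lubridate", "ggExtra")

# requireNamespace() returns TRUE/FALSE without attaching the package.
is_installed = vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs = needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All packages are available.")
}
```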
# Loading data
The data is in comma-separated (CSV) format, so the default settings of `read.csv` work well:
```{r}
df = read.csv("./registration.csv")
```
There is no output but you can see the data has loaded in the Environment tab.
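To see what the default settings do, here is a minimal sketch that reads a tiny CSV from a string instead of a file. The header and rows are invented for illustration and are not taken from `registration.csv`. The point is that `read.csv` treats the first line as column names, replaces spaces in those names with dots, and guesses a type for each column.

```r
# Hypothetical CSV content; the real registration.csv may look different.
csv_text = "Registered on,Position,On Waitlist
04/22/2022 14:51,Student,FALSE
04/22/2022 15:02,Faculty,FALSE
04/23/2022 09:30,Student,TRUE"

toy = read.csv(text = csv_text)

names(toy)               # spaces in the header become dots
class(toy$On.Waitlist)   # TRUE/FALSE values are read as logical
class(toy$Registered.on) # date-times are not recognized automatically
```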
We can get basic information about the data, such as its column names and dimensions, using functions:
```{r}
names(df)
dim(df)
```
We might also want to view a summary and first few lines of the file.
```{r paged.print=FALSE}
#str(df) # just see Env tab
head(df)
```
R was able to guess that the `On.Waitlist` variable was logical (Boolean). On the other hand, the `Registered.on` column should represent a date and time, but R did not auto-convert it, so we should convert it to a datetime ourselves.
The `mdy_hm()` function from the lubridate package is a convenient choice. For example:
```{r}
mdy_hm("04/22/2022 14:51 UTC")
```
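lubridate names its parsers after the order of the components in the text, so the same moment written year-first is read with `ymd_hm()` instead. A short sketch, using a made-up timestamp, of how the two relate and what kind of object comes back:

```r
library(lubridate)

a = mdy_hm("04/22/2022 14:51")   # month/day/year hour:minute
b = ymd_hm("2022-04-22 14:51")   # year-month-day hour:minute

a == b      # both describe the same instant
class(a)    # a POSIXct date-time
tz(a)       # lubridate assumes UTC unless tz = is given
hour(a)     # components can be extracted with accessor functions
```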
Note: Many functions, including `mdy_hm()`, work on entire data columns (vectors) at once.
```{r}
head(mdy_hm(df$Registered.on))
```
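As a small illustration of this vectorized behavior, the sketch below parses a character vector in one call. The values are invented. The last one is deliberately malformed to show that entries lubridate cannot parse come back as `NA` (with a warning, suppressed here) rather than stopping the whole conversion.

```r
library(lubridate)

stamps = c("04/22/2022 14:51", "04/23/2022 09:30", "not recorded")

# One call parses every element; the unparseable one becomes NA.
parsed = suppressWarnings(mdy_hm(stamps))

parsed
is.na(parsed)
sum(is.na(parsed))  # how many values failed to parse
```

Counting the `NA` values after a conversion like this is a cheap way to spot rows that need cleaning.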
Let's update that column, convert `Position` to a factor, and check the structure again:
```{r}
df$Registered.on = mdy_hm(df$Registered.on)
df$Position = as.factor(df$Position)
str(df)
```
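Converting `Position` to a factor tells R that the column holds a fixed set of categories rather than free text. A minimal sketch with invented position labels shows what changes:

```r
# Hypothetical labels; the real Position values may differ.
position = c("Student", "Faculty", "Student", "Staff", "Student")
position_f = as.factor(position)

levels(position_f)      # the distinct categories, sorted alphabetically
as.integer(position_f)  # each value is stored as an index into levels()
table(position_f)       # counts per category
```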
# ggplot2 & dplyr
Hadley Wickham, currently working at the company that develops RStudio, produced a collection of add-on packages for R that address major shortcomings in the R experience and make R more pleasant to use. Collectively, this set of packages is known as the *tidyverse*; it includes the well-known *ggplot2* and *dplyr* packages and a host of smaller utility packages.
Wickham has a vision for how R can enable data analysis by using a common *grammar* to describe how to visualize or analyze data. When you use this grammar to produce results, you can focus more on what you want the analysis to do and less on how to accomplish it. With this grammar of graphics or grammar of data analysis, the code that produces your results describes the steps succinctly.
RStudio PBC publishes documentation, tutorials and training, including these helpful cheatsheets <https://www.rstudio.com/resources/cheatsheets/>.
## "pipe" operator: `%>%`
The tidyverse uses an operator to chain function calls in a way that is easy for humans to read.
Without the pipe operator, your code might look like this:

    sorted_df = arrange(df, Registered.on)
    sorted_df_with_total = mutate(sorted_df, cumtotal = row_number(Registered.on))
    plot_df = filter(sorted_df_with_total, Registered.on >= ymd_hm("20220422 14:00"))

Even worse, you might be tempted to nest the functions:

    plot_df = filter(mutate(arrange(df, Registered.on), cumtotal = row_number(Registered.on)), Registered.on >= ymd_hm("20220422 14:00"))

With the pipe operator, you can express the same steps as a sequence without creating several intermediate variable names. Arranged as a sequence, the code shows that we sort the data.frame, add a column holding the running count of registrants, and then filter the data to keep registrations from a certain date and time onward.
I prefer to wrap the steps in parentheses and start each line with the `%>%` operator, so I can comment out individual lines. It is far more common to omit the parentheses and end each line with a `%>%`.
```{r}
plot_df = (
  df
  %>% arrange(Registered.on)
  %>% mutate(cumtotal = row_number(Registered.on))
  %>% filter(Registered.on >= ymd_hm("20220422 14:00"))
)
plot_df %>% head # same as `head(plot_df)`
```
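To check that the pipe is only a different way of writing the same calls, the sketch below builds a small made-up data frame (`toy`, with hypothetical columns `id` and `score`) and compares the nested and piped forms. `x %>% f(y)` is evaluated as `f(x, y)`, so both versions should give the same result.

```r
library(dplyr)

# Invented data, used only to compare the two styles.
toy = data.frame(id = c("a", "b", "c", "d"), score = c(3, 1, 4, 2))

# Nested calls: read from the inside out.
nested = filter(mutate(arrange(toy, score), rank = row_number(score)), score >= 2)

# Piped calls: read from top to bottom.
piped = (
  toy
  %>% arrange(score)
  %>% mutate(rank = row_number(score))
  %>% filter(score >= 2)
)

piped
identical(nested, piped)
```

R 4.1 and later also ships a native pipe, `|>`, which behaves the same way for simple chains like this one.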
# Plotting the data
Suppose we want to examine how registration changed over time. We can plot a histogram of the number of registrants in each time interval.
ggplot2 composes plots in a similar way, by applying a sequence of steps, but it uses the `+` operator to combine them.
```{r}
plot_df = (
  df
  %>% arrange(Registered.on)
  %>% mutate(cumtotal = row_number(Registered.on))
  %>% filter(Registered.on >= ymd_hm("20220422 14:00"))
)
(
  ggplot(plot_df, aes(x=Registered.on))
  # binwidth is in seconds on a datetime axis: 60*60*24 s = 1 day
  + geom_histogram(binwidth = 60*60*24) #, color="darkgrey", fill="grey")
  #+ geom_freqpoly(binwidth = 60*60*6) # 60*60*6 s = 6 hours
  #+ theme_bw()
  #+ ggtitle("Registrations for R workshop")
)
```
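Because `Registered.on` is a date-time, `binwidth` is measured in seconds, which is why the code multiplies out `60*60*24`. One way to make the intent more readable, sketched here as a suggestion rather than the notebook's method, is to derive the number of seconds from lubridate durations. The data frame `toy` and its `when` column are invented for illustration.

```r
library(ggplot2)
library(lubridate)

# Durations convert to a plain number of seconds.
one_day   = as.numeric(ddays(1))
six_hours = as.numeric(dhours(6))
c(one_day, six_hours)

# Invented timestamps, only to have something to plot.
toy = data.frame(
  when = ymd_hm(c("2022-04-22 14:05", "2022-04-22 14:40",
                  "2022-04-22 16:10", "2022-04-23 09:00"))
)

p = (
  ggplot(toy, aes(x = when))
  + geom_histogram(binwidth = six_hours)
)
p
```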
```{r}
plot_df = (
  df
  %>% arrange(Registered.on)
  %>% mutate(cumtotal = row_number(Registered.on))
  %>% filter(
    Registered.on >= ymd_hm("20220422 14:00")
    & Registered.on < ymd_hm("20220422 18:00")
  )
)
(
  ggplot(plot_df, aes(x=Registered.on, fill=Position))
  + geom_histogram(binwidth = 60*60, show.legend=FALSE) # 60*60 s = 1 hour
  #+ geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
  + theme_bw()
  + ggtitle("Registrations for R workshop")
  + facet_wrap(~ Position)
)
```
```{r}
plot_df = (
  df
  %>% arrange(Registered.on)
  %>% mutate(cumtotal = row_number(Registered.on))
  %>% filter(
    Registered.on >= ymd_hm("20220422 14:00")
    #& Registered.on < ymd_hm("20220426 00:00")
  )
)
(
  ggplot(plot_df, aes(x=Registered.on, y=cumtotal, color=On.Waitlist))
  + geom_line()
  #+ geom_point()
  + theme_bw()
  + ggtitle("Total Registrations for R workshop")
  + ylab("Total Registrations")
  + geom_hline(yintercept = 50, linetype=2)
  + annotate("text", x=mdy_hm("04/23/2022 14:00 UTC"), y=50, label="Course Capacity", hjust=0, vjust=-.5)
)
```
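The `cumtotal` column plotted above is a running count: once the rows are sorted by time, `row_number()` numbers them 1, 2, 3, and so on in that order. A minimal sketch on invented timestamps (the `toy` data frame and its `when` column are hypothetical) makes this visible:

```r
library(dplyr)
library(lubridate)

# Invented registrations, deliberately out of order.
toy = data.frame(
  when = ymd_hm(c("2022-04-22 16:10", "2022-04-22 14:05",
                  "2022-04-23 09:00", "2022-04-22 14:40"))
)

running = (
  toy
  %>% arrange(when)                       # earliest registration first
  %>% mutate(cumtotal = row_number(when)) # position in time order
)

running
# After sorting, the running count is simply 1..n.
all(running$cumtotal == seq_len(nrow(running)))
```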
# Summarizing with dplyr
```{r}
df %>% count(Position, sort = TRUE)
```
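`count()` is shorthand for grouping, counting the rows in each group, and sorting. The sketch below, on invented data (the `toy` data frame and its `position` column are hypothetical), spells out the longer form and checks that the counts agree:

```r
library(dplyr)

toy = data.frame(
  position = c("Student", "Faculty", "Student", "Staff", "Student", "Faculty")
)

# Short form: one call.
short_form = toy %>% count(position, sort = TRUE)

# Long form: the same steps written out.
long_form = (
  toy
  %>% group_by(position)
  %>% summarize(n = n())
  %>% arrange(desc(n))
)

short_form
identical(short_form$n, long_form$n)
```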
```{r}
(
  df
  %>% filter(Registered.on > ymd_hm("20220422 14:00"))
  %>% group_by(Position)
  %>% summarize(mean_reg = mean(Registered.on), n=n())
  %>% arrange(mean_reg)
)
```
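The same `group_by()` and `summarize()` pattern works with any function that reduces a column to one value per group, and date-times can be averaged like numbers. A sketch on invented data (`toy`, `position`, and `when` are hypothetical names) that adds the earliest registration per group next to the mean:

```r
library(dplyr)
library(lubridate)

toy = data.frame(
  position = c("Student", "Student", "Faculty"),
  when = ymd_hm(c("2022-04-22 14:00", "2022-04-22 16:00", "2022-04-22 15:30"))
)

by_position = (
  toy
  %>% group_by(position)
  %>% summarize(
    first_reg = min(when),  # earliest registration in the group
    mean_reg  = mean(when), # average registration time
    n         = n()         # group size
  )
  %>% arrange(mean_reg)
)
by_position
```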