-
Notifications
You must be signed in to change notification settings - Fork 1
/
box_uhs.Rmd
218 lines (178 loc) · 7.32 KB
/
box_uhs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
title: "BOX UHS Download"
output:
word_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, comment = "")
```
```{r}
library(tidyverse)
kprint <- function(x) {
if(interactive()) {
print(x)
} else {
knitr::kable(x)
}
}
```
## Useful Columns for Modelers
The following columns would be useful to share with campus modelers. They appear to contain no PHI or PII but have sufficient data for modelers to study patterns of infection. Please note that this is based on data through 5 October 2020. See details below for study of fields
- RANDID (carefully randomized from PatientID)
- Race
- ResultDate, CollectionDate, ConfirmedDate, TracedDate (dropping time of day)
- QuarantineStatus, HasSymptoms, NumberOfContacts
- TestLocation, LabName, Result
- Hall_Name
- Neighborhood (LakeShore or SouthEast)
- RANDHOUS (new randomization within hall and across houses and floors)
- RANDROOM (room number modulus 100 to remove floor)
- Positive
The following are removed from the data but kept locally in lookup table:
- House_Name (recoded with Floor as RANDROOM)
- Floor (100s number from Room)
- Room (recoded as RANDID using FLIP)
## Access Original Data
```{r eval=FALSE}
From: "[email protected]" <[email protected]>
Date: Tuesday, October 13, 2020 at 1:13 PM
The Box Client ID and Client Secret identify the Box-app that you setup to connect to your Box account. This something that you would need to create for yourself via the Box Dev console:
https://uwmadison.app.box.com/developers/console
The UW-Madison Box team does not offer any custom Box development or consulting services but here are some links that might help you get started:
https://developer.box.com/guides/authentication/select/
https://developer.box.com/guides/applications/custom-apps/
Please keep in mind that the only authorization type that is supported with UW-Madison Box is Oauth 2.0 authentication:
https://kb.wisc.edu/103326
Finally, be aware that we can no longer offer unlimited storage with the Box service. Beginning on 6/15/2020 all accounts have a storage quota set. For questions regarding that change please see our Contract & Quota FAQ:
https://kb.wisc.edu/102615
Please let me know if you have additional questions.
Regards,
Jeannine
```
```{r eval=FALSE}
library(boxr)
# Ideally use boxr package to box_read() file. However
# Need Client Secret, which I don't yet understand. See email above.
box_auth("[email protected]")
filename <- scan("data/housing_data_url.txt", "text", n = 1)
cdc0 <- box_read(filename)
```
```{r eval=FALSE}
# This sometimes work, but generally gives empty file if done in Rmarkdown.
# What works is to download file and use that directly.
filename <- scan("data/housing_data_url.txt", "text", n = 1)
cdc0 <- read.csv(filename, row.names = NULL)
```
Read data, select columns, do initial cleaning. Focus on students on campus.
```{r}
#filename <- "data/MasterQuery_WithHousing_20201120.csv"
#filename <- "data/MasterQuery_HousingSubset_20201120.csv"
filename <- "data/MasterQuery_HousingSubset_20201202.csv"
cdc0 <- read.csv(filename, row.names = NULL) %>%
select(PatientID, Group, hall_housingfile, House_Name, Room, Race,
ResultDate, CollectionDate, ConfirmedDate, TracedDate,
TestLocation, LabName, Result,
HasSymptoms, QuarantineStatus, NumberOfContacts) %>%
rename(Hall_Name = "hall_housingfile") %>%
filter(!is.na(PatientID),
Result %in% c("Positive","Negative")) %>%
mutate(Race = ifelse(Race %in% c("", "DECLINED", "OTHER RACE"), NA, Race),
Positive = 1 * (Result == "Positive")) %>%
filter(Hall_Name != "Off campus", Group == "Student")
```
Replace `PatientID` with randomized `RANDID`
```{r}
IDs <- (cdc0 %>% distinct(PatientID) %>% arrange(PatientID))$PatientID
set.seed(1)
Rand <- order(runif(length(IDs)))
cdc0 <- cdc0 %>%
mutate(RANDID = Rand[match(PatientID, IDs)]) %>%
select(-PatientID, -Group)
```
Assign `Hall_Name`s to `Neighborhood`.
```{r}
lakeshore <- c("Adams", "Bradley", "Cole", "Dejope", "Kronshage", "Leopold", "Phillips", "Slichter", "Sullivan", "Tripp", "Waters")
southeast <- c("Barnard", "Chadbourne", "Davis", "Merit", "Ogg", "Sellery", "Smith", "Witte")
cdc0 <- cdc0 %>%
mutate(Neighborhood = "LakeShore",
Neighborhood = ifelse(Hall_Name %in% southeast, "SouthEast", Neighborhood))
```
Randomize `House_Name` and `Floor` to `RANDHOUS` and `Room` to `RANDROOM`.
```{r}
# Re-randomize House as RANDHOUS within RANDHALL
Rehouse <- cdc0 %>%
# Break up houses with multiple floors
mutate(Floor = floor(Room / 100)) %>%
distinct(Hall_Name, House_Name, Floor) %>%
group_by(Hall_Name) %>%
mutate(Floor = Floor,
RANDHOUS = order(runif(n()))) %>%
ungroup
cdc0 <-
left_join(
cdc0 %>%
mutate(Floor = floor(Room / 100),
RANDROOM = Room %% 100), # Could flip a coin on this.
Rehouse,
by = c("Hall_Name", "House_Name", "Floor"))
```
Flip room numbers randomly.
```{r}
cdc0 <- cdc0 %>%
group_by(Hall_Name, RANDHOUS) %>%
mutate(Flip = sign(runif(1) - 0.5),
RANDROOM = ifelse(Flip < 0,
1 + max(RANDROOM) - Flip * RANDROOM, # Flip room number if negative
RANDROOM)) %>%
ungroup
```
Write out codebook.
```{r}
write.csv(cdc0 %>%
distinct(Neighborhood, Hall_Name, House_Name, RANDHOUS, Floor, Flip) %>%
arrange(Neighborhood, Hall_Name, House_Name, Floor),
"data/uhs_codebook.csv", row.names = FALSE)
```
Pivot to have one event per row per student.
```{r}
# Pivot longer; create RANDROOM, filter out missing Date.
cdc <- cdc0 %>%
select(-House_Name, -Room, -Result, -Floor, -Flip) %>%
pivot_longer(ResultDate:TracedDate, names_to = "Event", values_to = "Date") %>%
mutate(Event = factor(Event, c("CollectionDate", "ResultDate",
"ConfirmedDate", "TracedDate")),
# Remove time of day from Date
Date = as.Date(Date, "%m/%d/%Y")) %>%
# For now, remove data with weird dates
filter(!is.na(Date),
Date > as.Date("2020-01-01")) %>%
# Filter out Confirmed or Traced Date that have Positive = 0
filter(!(Event %in% c("ConfirmedDate", "TracedDate")) | Positive == 1) %>%
# This step is tricky.
# ConfirmedDate and TracedDate are repeated sometimes with old CollectionDate.
distinct(RANDID, Date, Event, .keep_all = TRUE) %>%
arrange(Hall_Name, RANDID, Date) %>%
select(RANDID, RANDROOM, RANDHOUS, Hall_Name, Neighborhood, Date, Event, Positive, everything())
```
```{r}
ggplot(bind_rows(
cdc %>%
filter(Event == "CollectionDate") %>%
count(RANDID, Positive, name = "collections") %>%
count(Positive, collections, name = "frequency") %>%
mutate(Positive = c("Negative","Positive")[1 + Positive]),
cdc %>%
filter(Event == "CollectionDate") %>%
count(RANDID, name = "collections") %>%
count(collections, name = "frequency") %>%
mutate(Positive = "Combined"))) +
aes(collections, frequency, col = Positive, group = Positive) +
geom_point() +
geom_line() +
ggtitle("Frequency of collections across students")
```
```{r}
write.csv(cdc, "data/uhs_rand.csv", row.names = FALSE)
```
This document summarizes housing information. The R code file can be found in the [pandemic github repository](https://github.com/UW-Madison-DataScience/pandemic/blob/master/box_uhs.Rmd).