-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.Rmd
3115 lines (2355 loc) · 159 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Open Case Studies: Vaping Behaviors in American Youth"
css: style.css
output:
html_document:
includes:
in_header: GA_Script.Rhtml
self_contained: yes
code_download: yes
highlight: tango
number_sections: no
theme: cosmo
toc: yes
toc_float: yes
pdf_document:
toc: yes
word_document:
toc: yes
---
<style>
#TOC {
background: url("https://opencasestudies.github.io/img/icon-bahi.png");
background-size: contain;
padding-top: 240px !important;
background-repeat: no-repeat;
}
</style>
<!-- Open all links in new tab-->
<base target="_blank"/>
<div id="google_translate_element"></div>
<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>
<script type="text/javascript">
function googleTranslateElementInit() {
new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>
```{r, echo=FALSE}
knit_time_start <- Sys.time()
```
```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width = 10, fig.height = 8, dpi = 300)
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
message = FALSE, warning = FALSE, cache = FALSE,
fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)
remotes::install_github("opencasestudies/OCSdata")
library(OCSdata)
remotes::install_github("https://github.com/wch/extrafont")
library(extrafont)
extrafont::font_import()
rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```
#### {.outline }
```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "final_plot.png"))
```
####
#### {.disclaimer_block}
**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
####
#### {.license_block}
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"} United States License.
####
#### {.reference_block}
To cite this case study please use:
Wright, Carrie and Ontiveros, Michael and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-vaping-case-study. Vaping Behaviors in American Youth (Version v1.0.0).
####
To access the GitHub Repository with the data for this case study see here: https://github.com/opencasestudies/ocs-bp-vaping-case-study
You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:
```{r, eval=FALSE}
install.packages("OCSdata")
```
This case study is part of a series of public health case studies for the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/open-case-studies).
***
The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below:
```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```
***
**Readability Score: **
A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age.
```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>% # reading table into dataframe, selecting relevant factors
tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
age = round(as.numeric(age))
)
knitr::kable(df, format="markdown")
```
***
Please help us by filling out our survey.
<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>
## **Motivation**
***
In the United States, there have been significant and historical declines in cigarette smoking. In the 1970s, 75% of high school seniors were smoking, that number is below 10% now. This progress is largely due to the tobacco control movement and their focus on initiatives like ending advertising to children (like Joe Camel), passing indoor smoking laws, health communication, etc.
According to a recent [report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w){target="_blank"}, overall tobacco/nicotine use **increased** in youths (middle school and high school students) in the United States in 2017 and 2018, despite previous years of declining use.
This major increase is attributed to an increase in the use of electronic cigarette (e-cigarette) products.
Forms of tobacco/nicotine include these categories:
1) Cigarette and other combustible tobacco (pipes, cigars, cigarillo, etc.)
2) E-cigarettes and vaporized tobacco/nicotine (hookah, Juul, etc.)
3) Other non-combustible, non-vapor tobacco/nicotine products (snus, chewing tobacco, etc.)
Youths are more likely to use e-cigarettes vs. combustible cigarettes these days, which is concerning because e-cigarettes are really efficient nicotine delivery devices that are reinforcing and easy to initiate. By contrast, it takes quite a while to become accustomed to cigarettes (eg. because of coughing) and become dependent. It is also harder for parents to detect e-cigarette use and intervene (eg. the smell is not as strong). This means that youths may be becoming physically dependent on nicotine more quickly than in past years, and that cessation services designed for youths will be needed.
Whereas in previous decades the focus was on advertising, the current era requires attention to the marketing broadly. Juul caught on through Instagram influencers. New policies that regulate these innovative marketing strategies will be critical.
E-cigarettes are referred to by many different names, including but not limited to:
1) Electronic nicotine delivery systems (ENDS)
2) Vapes
3) e-hookahs
4) vape pens
5) tanks
6) mods
The devices vary greatly:
```{r, echo = FALSE, fig.align ="center"}
include_graphics("https://www.lung.org/getmedia/8ac8ab8c-e7fc-497b-8384-441615f50ff0/ecigs_K.jpg.jpg")
```
##### [[source](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health)]
See this [CDC guide](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/pdfs/ecigarette-or-vaping-products-visual-dictionary-508.pdf){target="_blank"} and the [American Lung Association website](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health){target="_blank"} for more information.
The report found that:
> During 2017–2018, current use of any tobacco[/nicotine]$^*$ product **increased 38.3%** (from 19.6% to 27.1%) among high school students and **28.6%** (from 5.6% to 7.2%) among middle school students; e-cigarette use **increased 77.8%** (from 11.7% to 20.8%) among high school students and **48.5%** (from 3.3% to 4.9%) among middle school students.
$^*$ *Note: we added "[/nicotine]" to this quote from the report because e-cigarettes deliver nicotine, but are not actually tobacco.*
In 2018, the [Federal Drug Administration (FDA) in the United States](https://acsjournals.onlinelibrary.wiley.com/doi/full/10.1002/cncr.31868){target="_blank"} stated that e-cigarette usage use among youth reached:
> “nothing short of an **epidemic proportion of growth**”
Additionally, you may learn more about e-cigarette or vaping use-associated lung injury (EVALI) from [CDC's Morbidity and Mortality Weekly Report (MMWR)](https://www.cdc.gov/mmwr/ecigarette_lung_injury.html){target="_blank"}.
In this case study, we will be investigating the same data used in the report that generated the above findings. This data comes from the [The National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"}.
#### {.reference_block}
Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King. “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” **MMWR. Morbidity and Mortality Weekly Report** 68 (6): 157–64 (2019).
####
## **Main Questions**
***
#### {.main_question_block}
<b><u> Our main question: </u></b>
1) How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
2) How does e-cigarette use compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?
We will base this on the following survey questions:
> "During the past 30 days, what brand of e-cigarettes did you usually use?"
> "What flavors of tobacco products have you used in the past
30 days?"
4) Is there a relationship between e-cigarette/vaping use and other tobacco use?
####
## **Learning Objectives**
***
In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the `srvyr` package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"} for wrangling data, such as `tidyr` and `dplyr` and for visualization, such as as `ggplot2`. The Tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.
```{r, out.width = "20%", echo = FALSE, fig.align ="center"}
include_graphics("https://tidyverse.tidyverse.org/logo.png")
```
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
<u>**Data Science Learning Objectives:**</u>
1. Import data from Excel files
2. Merge data from multiple similar but not identical data structures
3. Create effective longitudinal data visualizations
4. Write functions in R
5. Apply functions across data subsets using `purrr` and `dplyr` functionality.
<u>**Statistical Learning Objectives:**</u>
1. Understanding of different types of longitudinal data
2. Usage of codebooks
3. Conceptual understanding of survey weighting
4. Implementing logistic regression with survey weighting
***
We will begin by loading the packages that we will need:
```{r}
library(here)
library(readxl)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(scales)
library(viridis)
library(forcats)
library(naniar)
library(srvyr)
library(cowplot)
library(broom)
library(survey)
library(OCSdata)
```
<u>**Packages used in this case study:**</u>
Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data
[readxl](https://readxl.tidyverse.org/){target="_blank"} | to import the data in the excel files
[magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use the compound assignment pipe operator `%<>%`
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"} | to manipulate the character strings within the data
[purrr](https://purrr.tidyverse.org/){target="_blank"} | to import the data in all the different excel and csv files efficiently
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to arrange/filter/select/compare specific subsets of the data
[readr](https://readr.tidyverse.org/){target="_blank"} | to import the CSV file data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to rearrange data in wide and long formats
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"} | to make visualizations with multiple layers
[scales](https://cran.r-project.org/web/packages/scales/scales.pdf){target="_blank"} | to allow us to look at the colors within the viridis package
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html){target="_blank"} | to make plots with a color palette that is compatible with color blindness
[forcats](https://forcats.tidyverse.org/){target="_blank"} | to allow for reordering of factors in plots
[naniar](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html){target="_blank"} | to make a visualization of missing data
[syrvr](https://cran.r-project.org/web/packages/srvyr/srvyr.pdf){target="_blank"} | to use survey weights
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined
[broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html){target="_blank"} | to create nicely formatted model output
[survey](http://r-survey.r-forge.r-project.org/survey/index.html){target="_blank"} | to fit survey-weighted logistic regression
[OCSdata](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files
The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
## **Context**
***
According to the cited [Morbidity and Mortality Weekly Report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w) this was what was already known about this topic and the implications of this study:
```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}
knitr::include_graphics(here::here("img", "context.png"))
```
#### [[source](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w)]{target="_blank"}
Importantly, the vapors used in e-cigarettes contain harmful chemicals:
```{r, echo = FALSE, fig.align ="center"}
include_graphics(here::here("img", "e-cigarette-aerosol-can-contain-harmful-ingredients.jpg"))
```
#### [[source](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/images/e-cigarette-aerosol-can-contain-harmful-ingredients-desktop-700.jpg)]{target="_blank"}
E-cigarette usage has also been associated with [lung injury]((https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)){target="_blank"}:
```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}
knitr::include_graphics(here::here("img", "lung.png"))
```
#### [[source](https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)]{target="_blank"}
See [here](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/Quick-Facts-on-the-Risks-of-E-cigarettes-for-Kids-Teens-and-Young-Adults.html){target="_blank"} for additional information about the potential health effects of e-cigarettes in teens and young adults.
## **Limitations**
***
There are some important considerations regarding this data analysis to keep in mind:
1) The [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} does not follow the same individual student respondents over time. A [longitudinal study](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"} that does follow the same individuals over time collects data called [panel data](https://en.wikipedia.org/wiki/Panel_data){target="_blank"}. The data in this study is called pooled [cross-sectional data](https://en.wikipedia.org/wiki/Cross-sectional_data){target="_blank"}, and is obtained from random collection of observations across time.
According to Wikipedia:
> Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods
2) The data include percentages of student respondents reporting use of each particular tobacco product, but the survey questions did not ask the relative amount of use of one product compared to another. For example, the survey included questions like:"What flavors of tobacco products have you used in the past 30 days?" but did not ask how often one flavor was used by the same individual over another.
While [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contain information for groups of individuals who answered the survey questions as male or female.
## **What are the data?**
***
The data in this case study comes from the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} which is an annual survey that asks students in high school and middle school (grades 6-12) about tobacco usage in the United States of America.
The data for this survey is freely available online at this [website](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html){target="_blank"} with data from 1999, 2000, 2002, 2004, 2006, 2009, and 2011-2019. We will be using data from **2015-2019** due to the fact that these years are the most recent that asked questions regarding e-cigarette usage.
Each year includes documentation, such as a [codebook](https://www.lib.ncsu.edu/data/icpsrfaq#whatare){target="_blank"} and an excel file containing the data:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}
knitr::include_graphics(here::here("img", "data.png"))
```
Therefore, since we are using data from **2015-2019**, the data we are interested in are located in 5 different excel files, one for each year, each with its own [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"} (this is the one for 2019).
The [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html){target="_blank"} contains information describing the data within the excel file.
As you can see the excel file contains very short variable names and numeric values, and it is not clear what they mean without the codebook:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}
knitr::include_graphics(here::here("img", "excel.png"))
```
The codebook explains what the variables (the columns) are:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}
knitr::include_graphics(here::here("img", "variables.png"))
```
And the codebook explains what the values for each variable are:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}
knitr::include_graphics(here::here("img", "qn1.png"))
```
We will explain more later about what the values on the right indicate.
The reason that there are codebooks for each year is because the questions asked each year varied slightly.
The data in this survey are pooled cross-sectional data. In this study design, different subsets of student respondents are surveyed each year and it is not clear which, if any, individuals participate more than once from one year to the next.
## **Data Import**
***
### **Reading in the excel files**
***
Since these excel files are so large (each has roughly 20,000 rows), it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only the columns relevant to our questions of interest, and saved these data subsets as comma-separated (.csv) files.
You may download these files using the `OCSdata` package:
```{r, eval=FALSE}
# install.packages("OCSdata")
library(OCSdata)
simpler_import_data("ocs-bp-vaping-case-study", outpath = getwd())
```
You may also obtain these files from the GitHub Repository. If you have trouble accessing the GitHub Repository, these CSV files can be downloaded from [here](https://github.com/opencasestudies/ocs-bp-vaping-case-study/tree/master/data/data_reduced){target="_blank"}.
***
<details><summary> Click here for details on how the data were originally imported </summary>
If you have trouble accessing the GitHub Repository, you can follow the links to download the excel files for [2015](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2015-nyts-dataset-and-codebook-microsoft-excel/nyts2015.xlsx){target="_blank"}, [2016](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2016-nyts-dataset-and-codebook-microsoft-excel/nyts2016.xlsx){target="_blank"}, [2017](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2017-nyts-dataset-and-codebook-microsoft-excel/nyts2017.xlsx){target="_blank"}, [2018](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2018-nyts-dataset-and-codebook-microsoft-excel/nyts2018.xlsx){target="_blank"}, and [2019](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2019-nyts-dataset-and-codebook-microsoft-excel/nyts2019.xlsx){target="_blank"}.
First we created a vector of file names of all the different excel files. Using the `here()` function of the `here` package, we looked in all the directories of the project.
The `list.files()` function looked for all files with .xlsx within these sub-directories.
```{r}
excel_files <- list.files(here::here(), recursive = TRUE,
pattern = "*.xlsx")
excel_files
```
All the files were read using `read_excel()` of the `readxl` package. Using the `map()` function of the `purrr` package this was done efficiently for all of the excel files in the vector using one command. The `.` is used to indicate that we want to apply the `read_excel()` function to each element of the data that we just piped into the `map()` function.
Here we also used the `%>%` pipe which can be used to pass the input from one function to another for later sequential steps. You can use pipes if you load the `dplyr` package or the `magrittr` package.
This created a single list of tibbles (one for each file).
```{r, eval = FALSE}
tbl_files <- excel_files %>%
map(~ readxl::read_excel(.))
```
The elements of this list are in the same order as the elements of the `excel_files` vector, so we can extract the data for a particular file (year) by selecting a certain element of the list. However, it is safer to be able to select the data from this list for a specific year using a name based on the original vector of file names. We extracted a name from each Excel file name using the `str_extract()` function of the `stringr` package. Here we are keeping occurrences of the character string "nyts201" followed by a "5", "6", "7", "8", or "9".
```{r}
tbl_names <- excel_files %>%
str_extract("nyts201[5-9]")
tbl_names
```
These names became the names of the tibbles in the list of tibbles.
```{r, eval = FALSE}
names(tbl_files) <- tbl_names
```
Specific columns were selected using the `select()` function of `dplyr` from each of the tibbles using the variable name, as identified in the codebook as being of interest for our analysis.
In some cases functions like `starts_with()` of the `dplyr` package were used to select several variables at once. Most of the survey questions about tobacco use start with an `"E"` or a `"C"` according to the codebooks.
```{r, eval = FALSE}
tbl_files[["nyts2015"]] <- tbl_files[["nyts2015"]] %>%
dplyr::select(psu, # Primary Sampling Unit
finwgt, # Analysis Weight
stratum, # Sampling stratum
Qn1, # Age
Qn2, # Sex
Qn3, # Grade
starts_with("E",
ignore.case = FALSE),
starts_with("C",
ignore.case = FALSE),
)
tbl_files[["nyts2016"]] <- tbl_files[["nyts2016"]] %>%
dplyr::select(psu,
finwgt,
stratum,
Q1, # Age
Q2, # Sex
Q3, # Grade
starts_with("E",
ignore.case = FALSE),
starts_with("C",
ignore.case = FALSE),
Q50A, # Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
Q50B, # Clove or spice
Q50C, # Fruit
Q50D, # Chocolate
Q50E, # Alcoholic Drink
Q50F, # Candy/Desserts/Other Sweets
Q50G # Some Other Flavor Not Listed Here
)
tbl_files[["nyts2017"]] <- tbl_files[["nyts2017"]] %>%
dplyr::select(psu,
finwgt,
stratum,
Q1, # Age
Q2, # Sex
Q3, # Grade
starts_with("E",
ignore.case = FALSE),
starts_with("C",
ignore.case = FALSE),
Q50A, # Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
Q50B, # Clove or spice
Q50C, # Fruit
Q50D, # Chocolate
Q50E, # Alcoholic Drink
Q50F, # Candy/Desserts/Other Sweets
Q50G # Some Other Flavor Not Listed Here
)
tbl_files[["nyts2018"]] <- tbl_files[["nyts2018"]] %>%
dplyr::select(psu,
finwgt,
stratum,
Q1, # Age
Q2, # Sex
Q3, # Grade
starts_with("E",
ignore.case = FALSE),
starts_with("C",
ignore.case = FALSE),
Q50A, # Menthol #What flavors of tobacco products have you used in the past 30 days? (Select one or more)
Q50B, # Clove or spice
Q50C, # Fruit
Q50D, # Chocolate
Q50E, # Alcoholic Drink
Q50F, # Candy/Desserts/Other Sweets
Q50G # Some Other Flavor Not Listed Here
)
tbl_files[["nyts2019"]] <- tbl_files[["nyts2019"]] %>%
dplyr::select(psu,
finwgt,
stratum,
Q1, # Age
Q2, # Sex
Q3, # Grade
starts_with("E",
ignore.case = FALSE),
starts_with("C",
ignore.case = FALSE),
Q40, # Brand, e-cigarettes
Q62A, # Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
Q62B, # Clove or spice
Q62C, # Fruit
Q62D, # Chocolate
Q62E, # Alcoholic Drink
Q62F, # Candy/Desserts/Other Sweets
Q62G, # Some Other Flavor Not Listed Here
)
```
A directory called `reduced` was created for the new .csv files using the base `dir.create()` function.
A csv file was created for each of the tibbles in the list using the `write_csv()` function of the `readr` package.
This was done all at once using the `map()` function.
```{r, eval = FALSE}
dir.create("data/reduced")
names(tbl_files) %>% map(~ write_csv(tbl_files[[.]],
path = paste0("data/reduced/", ., ".csv")))
```
</details>
***
Now we will show how to read in the data from the five csv files that were created from the five different Excel files.
### **Reading in the CSV files**
***
Using the `here()` function of the `here` package, we looked in all the directories of the project. The `here` package automatically starts looking for files based on where you have a `.Rproj` file which is created when you start a new RStudio project.
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects){target="_blank"} to learn more about using RStudio projects.
</details>
***
The `list.files()` function looked for all files with .csv within the `data/reduced` sub-directories.
All the files were read using `read_csv()` of the `readr` package. Using the `map()` function of the `purrr` package this was done efficiently for all of the CSV files in the vector using one command. The `.` is used to indicate that we want to apply the `read_csv()` function to each element of the data that we just piped into the `map()` function. For more information about the `map()` function, see [here](https://purrr.tidyverse.org/reference/map.html){target="_blank"}.
Here we also used the `%>%` pipe which can be used to pass the input from one function to another for later sequential steps.
```{r}
nyts_data <- list.files(here::here(), recursive = TRUE,
pattern = "*.csv") %>%
stringr::str_subset(., "wrangled", negate = TRUE) %>%
map(~ read_csv(.))
nyts_data_names <- list.files(recursive = TRUE,
pattern = "*.csv") %>%
stringr::str_subset(., "wrangled", negate = TRUE) %>%
str_extract("nyts201[5-9]")
names(nyts_data) <- nyts_data_names
```
We can save our imported data as an rda file (stands for R data file) using the `save()` function.
```{r}
save(nyts_data, file = here::here("data", "imported", "imported_data.rda"))
```
## **Data Exploration and Wrangling**
***
If you have been following along but stopped, we could load our imported data from the "data" directory like so:
```{r}
load(here::here("data", "imported", "imported_data.rda"))
```
***
<details> <summary> If you skipped the data import section click here. </summary>
First you need to install and load the `OCSdata` package:
```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```
Then, you may load the imported data using the following code:
```{r, eval=FALSE}
imported_data("ocs-bp-vaping-case-study", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_data.rda"))
```
If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-vaping-case-study/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-vaping-case-study/master/data/imported/imported_data.rda). Download this file and then place it in your current working directory within a subdirectory called "imported" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily.
```{r}
load(here::here("data", "imported", "imported_data.rda"))
```
***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>
You can create a project by going to the File menu of RStudio like so:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```
You can also do so by clicking the project button:
```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```
See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.
</details>
***
</details>
***
Our goal in this section is to adjust or "wrangle" the data from each year into a common format so that we can combine the datasets across years for our analysis, and so that we have values in our variables that are correct and easy to interpret. We will need to understand what is the same and what is different across the data from different years, rename and recode the variables (e.g., by replacing the numbers 1 and 2 with the values "Male" and "Female" for the `Sex` variable), and combine the data. We will walk through these steps below.
First, let's take a look at our data. We can get a good sense of it using the `glimpse()` function of the `dplyr` package.
#### {.scrollable }
```{r}
dplyr::glimpse(nyts_data[["nyts2015"]])
```
####
### **Updating the set of variables and their names**
***
The easiest way of making it so that the data from the different years can be combined is by making sure that the same type of data in different datasets share the same names. In addition, giving the columns informative names will help make our code more readable. Currently, it isn't very clear what most of the variables indicate since the variable names are uninformative on their own, without the [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"}.
We want to rename variables like `Qn1` to something more meaningful like `Age`.
To do this we will use the `rename()` function of the `dplyr` package. The new name is always listed first before the `=`. This function will replace the old variable names with the new ones, i.e., after running the code below, there will no longer be a `Qn1` variable in the dataset, but there will be an `Age` variable instead. We will start working with the 2015 data, and then move on to the other years down below.
```{r}
nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
dplyr::rename(Age = Qn1,
Sex = Qn2,
Grade = Qn3)
```
Ultimately we will be combining the data from each year using the `bind_rows()` function of the `dplyr` package, which will fill in `NA` values for any columns that do not exist for one of the years.
The data for 2016-2018 have many common attributes, so we will want to write code that can be applied to all three datasets. To do this, we will use a **function** in R, which is basically a piece of code that can be applied to similar but different objects in R (e.g., the data tibbles from each of these three years). Recall that you are using functions from packages all the time, like the `rename()` function of the `dplyr` package. Now you are going to write your own function! For more information on functions, see [here](https://r4ds.had.co.nz/functions.html){target="_blank"}.
These next 3 years have the same structure for many of the questions we are interested in. For example, they all have flavor questions, but not a brand question. Moreover, their variable names are consistent across the years; for each year, we want to replace the vague question name `Q50A` with the value `menthol` in all three datasets, and the same is true for the other flavor variables.
Since we want to perform the same modifications on the data from all three years, rather than repeating the same somewhat messy piece of code three times, we can do this more efficiently if we create a function to do all of these steps at once. Then we can use the `map_at()` function of the `purrr` package, which is an extension of the `map()` function that we used in the [Data Import] (https://www.opencasestudies.org/ocs-bp-vaping-case-study/#Data_Import){target="_blank"} section to apply the function we just created (for renaming variables etc.) specifically to the data from 2016-2018 within the `nyts_data`. By using `vars()` inside of the `map_at()` function we can specify what tibbles within our `nyts_data` list we want to include or exclude.
#### **Functions**
***
So how do you create a function?
You first need to specify that you are creating a function by using the `function()` base function.
Yes, that's right it is a function for creating functions called function!
First we specify our input within the parentheses of `function()`. Thus if our function will apply something to an input called `x` then we would use `function(x)`. Theoretically our input can be named whatever we want, but we we need to refer to it consistently within our function to indicate what we want done with the input data. We can actually have more than one input as well, we would indicate two inputs like this: `function(x, b)`. Here we would be using both `x` and `b` to do something in our function.
The next part of a function is defined within curly brackets `{}`. This is where we write what we want done to our inputs.
***
<details> <summary> Click here to see a simple example </summary>
Our function will be called `simple_function`, which will add 2 to our input. It will take the input `vector_data` (note that we usually want to use an informative input argument like `vector_data`).
```{r}
simple_function <- function(vector_data) {
vector_data + 2
}
```
We can use an input that is a vector called `x`.
```{r}
x = c(1, 2, 3, 4)
x
```
Now we specify using the argument ` vector_data =` to indicate that `x` is our input that we want to perform the function on.
```{r}
simple_function(vector_data = x)
```
This function also works for other vector input. For example, vector `y` is now our input that we want to perform the function on.
```{r}
y = c(5, 10, 15, 20)
simple_function(vector_data = y)
```
</details>
***
In our case we will be applying our function to the variable names for the dataset for each year. The output of our function is the result of renaming these variables for each year.
```{r}
update_survey <- function(dataset) { dataset %>%
rename(Age = Q1,
Sex = Q2,
Grade = Q3,
menthol = Q50A,
clove_spice = Q50B,
fruit = Q50C,
chocolate = Q50D,
alcoholic_drink = Q50E,
candy_dessert_sweets = Q50F,
other = Q50G)
}
# options to apply the function to the data:
# nyts_data <- nyts_data %>% map_at(vars(nyts2016, nyts2017, nyts2018), Update_survey)
nyts_data <- nyts_data %>% map_at(vars(-nyts2015, -nyts2019), update_survey)
```
The final year, 2019, has a slightly different data structure compared to these earlier datasets. It actually has a `brand_ecig` variable already and different question numbers correspond to our flavor questions of interest. So we will rename the variables in this dataset individually. We could also write this as a function, but since we are only applying this one time, there is no need to. Functions are really helpful for repeating the same task repeatedly using different data inputs.
```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
rename(brand_ecig = Q40,
Age = Q1,
Sex = Q2,
Grade = Q3,
menthol = Q62A,
clove_spice = Q62B,
fruit = Q62C,
chocolate = Q62D,
alcoholic_drink = Q62E,
candy_dessert_sweets = Q62F,
other = Q62G)
```
Now let's take a look at the variable names for each of the years using the `map` function from `purrr`.
#### {.scrollable }
```{r}
map(nyts_data, names)
```
####
It's looking better! The data that overlap across years have the same variable names.
### **Updating Values**
***
Now that we have made some progress on the selection and names of the variables themselves, we will work on the values contained in the different variables.
We can start with updating the values for `Age` and `Grade`, so that they are more understandable.
Recall from the codebook for this year's dataset that `Age` isn't listed in the way one might expect, i.e., it is not just a number of years, but a numerically valued categorical variable.
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}
knitr::include_graphics(here::here("img", "qn1.png"))
```
The same is true for `Grade`:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}
knitr::include_graphics(here::here("img", "grade.png"))
```
**This is why it is so important to always check the codebook!!**
We also want to replace the value of `19` for `Age` to be `">18"` and the value of `13` for `Grade` to be replaced with `"Ungraded/Other"` Also according to the codebooks, numeric values of `1` indicate a survey answer of `FALSE`, while a value of `2` indicates `TRUE`. `Sex` also needs to be recoded. If we take a look at the codebooks carefully (make sure you look at the questions that we pulled, not the recoded values), we will see that males are indicated by `1` and females are indicated by `2`. Finally some values are indicated with `"*"` or`"**"` when they are missing. We want to replace these with `NA`.
Let's create a function to make all these updates. We will use the `mutate` function of the `dplyr` package to modify these variables. This function can also be used to create new variables. We will also use the `recode()` function of the `dplyr` package to replace specific values of certain variables.
```{r}
update_values <- function(dataset) {
dataset %>%
dplyr::mutate(Age = as.numeric(Age) + 8,
Grade = as.numeric(Grade) + 5) %>%
mutate(Age = as.factor(Age),
Grade = as.factor(Grade),
Sex = as.factor(Sex)) %>%
mutate(Sex = recode(Sex,
`1` = "male",
`2` = "female")) %>%
dplyr::mutate_all( ~ replace(., . %in% c("*", "**"), NA)) %>%
mutate(Age = recode(Age,
`19` = ">18"),
Grade = recode(Grade,
`13` = "Ungraded/Other")) %>%
dplyr::mutate_at(vars(
starts_with("E", ignore.case = FALSE),
starts_with("C", ignore.case = FALSE)
),
list( ~ recode(
.,
`1` = TRUE,
`2` = FALSE,
.default = NA,
.missing = NA
)))
}
nyts_data <- nyts_data %>% map(., update_values)
```
Now if we wanted to check that everything is expected we could do something like this to check the `Sex` variable using the `count()` function of the `dplyr` package. It is advisable to check your data frequently to make sure that it is as expected!
According to the codebook, we should have:
1) 8,958 males in 2015
2) 10,438 males in 2016
3) 8,881 males in 2017
4) 10,069 males in 2018
5) 9,803 males in 2019
```{r}
count_sex <- function(dataset) {dataset %>% count(Sex)}
nyts_data %>% map(., count_sex)
```
Looks good!
The years (2016-2019) that have flavors also need the flavor data to be logical (meaning TRUE or FALSE):
In this case we also are setting missing values to `FALSE` because then the `TRUE` values will represent those who reported using a specific flavor out of all users, rather than those that used a specific flavor compared to those who used a different flavor.
```{r}
update_flavors <- function(dataset) {dataset %>%
mutate_at(vars(menthol:other),
list(~ recode(.,
`1` = TRUE,
.default = FALSE,
.missing = FALSE))) }
nyts_data <- nyts_data %>% map_at(vars(-nyts2015), update_flavors)
```
Now there are just a few changes needed that are specific to 2019. Specifically, some of the 2019 questions use the values ".N", ".S", and ".Z" to indicate different types of missing data (see for example Q2 of the 2019 [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"}); we just want them to be replaced with `NA` values.
```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
mutate_all(~ replace(., . %in% c(".N", ".S", ".Z"), NA)) %>%
mutate(psu = as.character(psu)) %>%
mutate(brand_ecig = recode(brand_ecig,
`1` = "Other", # levels 1,8 combined to `Other`
`2` = "Blu",
`3` = "JUUL",
`4` = "Logic",
`5` = "MarkTen",
`6` = "NJOY",
`7` = "Vuse",
`8` = "Other"))
```
Great! Now our values don't need to be handled any differently for any of the years, thus we can combine the data across years.
Even though we have different numbers of variables for each year, we can coerce the data to be combined into one tibble by using the `bind_rows()` function of `dplyr`. Importantly, this function does not require that the columns be the same. This will create NA values for any variable that is not present in given data frame but is present in one of the other data frames that is being combined. Note that the `bind_cols()` function does expect that the rows match. The `.id` argument will create a new variable with values to link each row to its original data frame. For more information see [here](https://dplyr.tidyverse.org/reference/bind.html).
```{r}
nyts_data <- nyts_data %>%
map_df(dplyr::bind_rows, .id = "year")
glimpse(nyts_data)
```
We will want to do some of our analysis split by year, so we would like to be sure we have one variable that has the correct value for year. It looks like we just need to remove `"nyts"` from the year variable that we created from the names of the tibbles in our list and we should be all set. We will use another function from the `stringr` package to do this. The `str_remove()` function takes a string followed by a pattern and removes the pattern from the string.
```{r}
nyts_data <- nyts_data %>%
mutate(year = as.numeric(str_remove(year, "nyts")))
```
Here is our clean and wrangled data:
#### {.scrollable}
```{r}
glimpse(nyts_data)
```
####
Note that there are several variables where there are similar names, but with a `C` compared to an `E` in the variable name. Those starting with `C` are related to questions about current usage (last 30 days), while those with an `E` are related to usage across the student respondent's whole life ("ever" usage). We will discuss these groups further below.
Now we will save our wrangled data. We will save it as an rda file for ourselves and as csv files, as this is often a good option for collaborators. We will save this file in a directory called "wrangled" within our "data" directory of our project.
```{r}
save(nyts_data, file = here::here("data", "wrangled", "wrangled_data.rda"))
write_csv(nyts_data, path = here::here("data", "wrangled", "nyts_data.csv"))
```
### **Variable Table**
***
<details><summary> Click here to see a table about the final variables in our data set. </summary>
Variable | Details
---------- |-------------
**year** | the year that the survey results from a particular student respondent were acquired
**psu** | the primary sampling unit for the survey weighting
**finwgt** | the final analysis weight for the survey weighting
**stratum** | the stratum used for variance estimation for the survey weighting
**Age** | the age of the student when they took the survey
**Sex** | the sex of the student when they took the survey
**Grade** | the grade of the the student when the took the survey
**ECIGT** | student reported having ever tried cigarette smoking, even one or two puffs
**ECIGAR** | student reported having ever tried cigar, cigarillo, or little cigar smoking, even one or two puffs