index.Rmd

---
title: "Open Case Studies: Vaping Behaviors in American Youth"

css: style.css
output:
  html_document:
    includes:
      in_header: GA_Script.Rhtml
    self_contained: yes
    code_download: yes
    highlight: tango
    number_sections: no
    theme: cosmo
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes

---
<style>
#TOC {
 background: url("https://opencasestudies.github.io/img/icon-bahi.png");
  background-size: contain;
  padding-top: 240px !important;
  background-repeat: no-repeat;
}
</style>

<!-- Open all links in new tab-->  
<base target="_blank"/> 
 
<div id="google_translate_element"></div>

<script type="text/javascript" src='//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit'></script>

<script type="text/javascript">
function googleTranslateElementInit() {
  new google.translate.TranslateElement({pageLanguage: 'en'}, 'google_translate_element');
}
</script>


```{r, echo=FALSE}
knit_time_start <- Sys.time()
```

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width = 10, fig.height = 8, dpi = 300)
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(include = TRUE, comment = NA, echo = TRUE,
                      message = FALSE, warning = FALSE, cache = FALSE,
                      fig.align = "center", out.width = '90%')
library(here)
library(knitr)
library(magrittr)
remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE)
remotes::install_github("alistaire47/read.so")
library(wordcountaddin)
library(read.so)
remotes::install_github("opencasestudies/OCSdata")
library(OCSdata)
remotes::install_github("https://github.com/wch/extrafont")
library(extrafont)
extrafont::font_import()

rmarkdown:::perf_timer_reset_all()
rmarkdown:::perf_timer_start("render")
```


#### {.outline }

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "final_plot.png"))
```

####


#### {.disclaimer_block}


**Disclaimer**: The purpose of the [Open Case Studies](https://opencasestudies.github.io){target="_blank"} project is **to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data**. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts. 

####


#### {.license_block}

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 [(CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/us/){target="_blank"}  United States License.

#### 

#### {.reference_block}

To cite this case study please use:

Wright, Carrie and Ontiveros, Michael and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-vaping-case-study. Vaping Behaviors in American Youth (Version v1.0.0).

####

To access the GitHub Repository with the data for this case study see here: https://github.com/opencasestudies/ocs-bp-vaping-case-study

You may also access and download the data using our `OCSdata` package. To learn more about this package including examples, see this [link](https://github.com/opencasestudies/OCSdata). Here is how you would install this package:

```{r, eval=FALSE}
install.packages("OCSdata")
```

This case study is part of a series of public health case studies for the [Bloomberg American Health  Initiative](https://americanhealth.jhu.edu/open-case-studies).

***

The total reading time for this case study is calculated via [koRpus](https://github.com/unDocUMeantIt/koRpus) and shown below: 

```{r, echo=FALSE}
readtable = text_stats("index.Rmd") # producing reading time markdown table
readtime = read.so::read.md(readtable) %>% dplyr::select(Method, koRpus) %>% # reading table into dataframe, selecting relevant factors
  dplyr::filter(Method == "Reading time") %>% # dropping unnecessary rows
  dplyr::mutate(koRpus = paste(round(as.numeric(stringr::str_split(koRpus, " ")[[1]][1])), "minutes")) %>% # rounding reading time estimate
  dplyr::mutate(Method = "koRpus") %>% dplyr::relocate(koRpus, .before = Method) %>% dplyr::rename(`Reading Time` = koRpus) # reorganizing table
knitr::kable(readtime, format="markdown")
```

***

**Readability Score: **

A readability index estimates the reading difficulty level of a particular text. Flesch-Kincaid, FORCAST, and SMOG are three common readability indices that were calculated for this case study via [koRpus](https://github.com/unDocUMeantIt/koRpus). These indices provide an estimation of the minimum reading level required to comprehend this case study by grade and age. 

```{r, echo=FALSE}
rt = wordcountaddin::readability("index.Rmd", quiet=TRUE) # producing readability markdown table
df = read.so::read.md(rt) %>% dplyr::select(index, grade, age) %>%  # reading table into dataframe, selecting relevant factors
  tidyr::drop_na() %>% dplyr::mutate(grade = round(as.numeric(grade)), # dropping rows with missing values, rounding age and grade columns
                                     age = round(as.numeric(age))
                                     )
knitr::kable(df, format="markdown")
```

***

Please help us by filling out our survey.


<div style="display: flex; justify-content: center;"><iframe src="https://docs.google.com/forms/d/e/1FAIpQLSfpN4FN3KELqBNEgf2Atpi7Wy7Nqy2beSkFQINL7Y5sAMV5_w/viewform?embedded=true" width="1200" height="700" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe></div>


## **Motivation**
*** 

In the United States, there have been significant and historical declines in cigarette smoking. In the 1970s, 75% of high school seniors were smoking, that number is below 10% now. This progress is largely due to the tobacco control movement and their focus on initiatives like ending advertising to children (like Joe Camel), passing indoor smoking laws, health communication, etc. 

According to a recent [report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w){target="_blank"}, overall tobacco/nicotine use **increased** in youths (middle school and high school students) in the United States in 2017 and 2018, despite previous years of declining use.

This major increase is attributed to an increase in the use of electronic cigarette (e-cigarette) products.

Forms of tobacco/nicotine include these categories:

1) Cigarette and other combustible tobacco (pipes, cigars, cigarillo, etc.)
2) E-cigarettes and vaporized tobacco/nicotine (hookah, Juul, etc.)
3) Other non-combustible, non-vapor tobacco/nicotine products (snus, chewing tobacco, etc.)

Youths are more likely to use e-cigarettes vs. combustible cigarettes these days, which is concerning because e-cigarettes are really efficient nicotine delivery devices that are reinforcing and easy to initiate. By contrast, it takes quite a while to become accustomed to cigarettes (eg. because of coughing) and become dependent. It is also harder for parents to detect e-cigarette use and intervene (eg. the smell is not as strong). This means that youths may be becoming physically dependent on nicotine more quickly than in past years, and that cessation services designed for youths will be needed.

Whereas in previous decades the focus was on advertising, the current era requires attention to the marketing broadly. Juul caught on through Instagram influencers. New policies that regulate these innovative marketing strategies will be critical. 

E-cigarettes are referred to by many different names, including but not limited to:

1) Electronic nicotine delivery systems (ENDS)
2) Vapes
3) e-hookahs
4) vape pens
5) tanks
6) mods

The devices vary greatly:

```{r, echo = FALSE, fig.align ="center"}

include_graphics("https://www.lung.org/getmedia/8ac8ab8c-e7fc-497b-8384-441615f50ff0/ecigs_K.jpg.jpg")
```

##### [[source](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health)]

See this [CDC guide](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/pdfs/ecigarette-or-vaping-products-visual-dictionary-508.pdf){target="_blank"} and the [American Lung Association website](https://www.lung.org/quit-smoking/e-cigarettes-vaping/lung-health){target="_blank"} for more information. 

The report found that:

> During 2017–2018, current use of any tobacco[/nicotine]$^*$ product **increased 38.3%** (from 19.6% to 27.1%) among high school students and **28.6%** (from 5.6% to 7.2%) among middle school students; e-cigarette use **increased 77.8%** (from 11.7% to 20.8%) among high school students and **48.5%** (from 3.3% to 4.9%) among middle school students.

$^*$ *Note: we added "[/nicotine]" to this quote from the report because e-cigarettes deliver nicotine, but are not actually tobacco.*

In 2018, the [Federal Drug Administration (FDA) in the United States](https://acsjournals.onlinelibrary.wiley.com/doi/full/10.1002/cncr.31868){target="_blank"} stated that e-cigarette usage use among youth reached:  

> “nothing short of an **epidemic proportion of growth**”

Additionally, you may learn more about e-cigarette or vaping use-associated lung injury (EVALI) from [CDC's Morbidity and Mortality Weekly Report (MMWR)](https://www.cdc.gov/mmwr/ecigarette_lung_injury.html){target="_blank"}.

In this case study, we will be investigating the same data used in the report that generated the above findings. This data comes from the [The National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"}.

#### {.reference_block}

Gentzke, Andrea S., Melisa Creamer, Karen A. Cullen, Bridget K. Ambrose, Gordon Willis, Ahmed Jamal, and Brian A. King.  “Vital Signs: Tobacco Product Use Among Middle and High School Students - United States, 2011-2018.” **MMWR. Morbidity and Mortality Weekly Report** 68 (6): 157–64 (2019).

####


## **Main Questions**
*** 

#### {.main_question_block}
<b><u> Our main question: </u></b>

1) How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
2) How does e-cigarette use compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   

> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
> "What flavors of tobacco products have you used in the past
30 days?" 

4) Is there a relationship between e-cigarette/vaping use and other tobacco use?

####

## **Learning Objectives** 
*** 

In this case study, we will cover how to import data from multiple files efficiently, how to import data from excel files, and how to make a variety of visualizations to compare multiple groups across time. We will also demonstrate how to work with codebooks. We will cover the concept of survey weighting and introduce the `srvyr` package. We will discuss the difference between pooled cross-sectional data and panel data. We will especially focus on using packages and functions from the [`Tidyverse`](https://www.tidyverse.org/){target="_blank"} for wrangling data, such as `tidyr` and `dplyr` and for visualization, such as as `ggplot2`. The Tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.

```{r, out.width = "20%", echo = FALSE, fig.align ="center"}

include_graphics("https://tidyverse.tidyverse.org/logo.png")
```


The skills, methods, and concepts that students will be familiar with by the end of this case study are:

<u>**Data Science Learning Objectives:**</u>

1. Import data from Excel files
2. Merge data from multiple similar but not identical data structures
3. Create effective longitudinal data visualizations
4. Write functions in R
5. Apply functions across data subsets using `purrr` and `dplyr` functionality.

<u>**Statistical Learning Objectives:**</u> 

1. Understanding of different types of longitudinal data 
2. Usage of codebooks
3. Conceptual understanding of survey weighting
4. Implementing logistic regression with survey weighting


*** 


We will begin by loading the packages that we will need:

```{r}
library(here)
library(readxl)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
library(scales)
library(viridis)
library(forcats)
library(naniar)
library(srvyr)
library(cowplot)
library(broom)
library(survey)
library(OCSdata)
```

<u>**Packages used in this case study:**</u>

 Package   | Use in this case study                                                                        
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readxl](https://readxl.tidyverse.org/){target="_blank"}      | to import the data in the excel files 
[magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use the compound assignment pipe operator `%<>%`
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data  
[purrr](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to rearrange data in wide and long formats 
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[scales](https://cran.r-project.org/web/packages/scales/scales.pdf){target="_blank"}    | to allow us to look at the colors within the viridis package
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html){target="_blank"}    | to make plots with a color palette that is compatible with color blindness
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to allow for reordering of factors in plots
[naniar](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html){target="_blank"}  | to make a visualization of missing data
[syrvr](https://cran.r-project.org/web/packages/srvyr/srvyr.pdf){target="_blank"} | to use survey weights
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined 
[broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html){target="_blank"} | to create nicely formatted model output
[survey](http://r-survey.r-forge.r-project.org/survey/index.html){target="_blank"} | to fit survey-weighted logistic regression
[OCSdata](https://github.com/opencasestudies/OCSdata){target="_blank"} | to access and download OCS data files


The first time we use a function, we will use the `::` to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.


## **Context**
*** 

According to the cited [Morbidity and Mortality Weekly Report](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w) this was what was already known about this topic and the implications of this study:

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "context.png"))
```

#### [[source](https://www.cdc.gov/mmwr/volumes/68/wr/mm6806e1.htm?s_cid=mm6806e1_w)]{target="_blank"} 


Importantly, the vapors used in e-cigarettes contain harmful chemicals:

```{r, echo = FALSE, fig.align ="center"}

include_graphics(here::here("img", "e-cigarette-aerosol-can-contain-harmful-ingredients.jpg"))
```

#### [[source](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/images/e-cigarette-aerosol-can-contain-harmful-ingredients-desktop-700.jpg)]{target="_blank"} 

E-cigarette usage has also been associated with [lung injury]((https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)){target="_blank"}:

```{r, echo = FALSE, fig.align ="center", out.width = "800 px"}

knitr::include_graphics(here::here("img", "lung.png"))
```

#### [[source](https://www.frontiersin.org/articles/10.3389/fphar.2019.01619/full)]{target="_blank"} 

See [here](https://www.cdc.gov/tobacco/basic_information/e-cigarettes/Quick-Facts-on-the-Risks-of-E-cigarettes-for-Kids-Teens-and-Young-Adults.html){target="_blank"} for additional information about the potential health effects of e-cigarettes in teens and young adults.

## **Limitations**
*** 
There are some important considerations regarding this data analysis to keep in mind: 

1) The [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} does not follow the same individual student respondents over time.  A [longitudinal study](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"} that does follow the same individuals over time collects data called [panel data](https://en.wikipedia.org/wiki/Panel_data){target="_blank"}. The data in this study is called pooled [cross-sectional data](https://en.wikipedia.org/wiki/Cross-sectional_data){target="_blank"}, and is obtained from random collection of observations across time.

According to Wikipedia:

> Panel data differs from pooled cross-sectional data across time, because it deals with the observations on the same subjects in different times whereas the latter observes different subjects in different time periods

2) The data include percentages of student respondents reporting use of each particular tobacco product, but the survey questions did not ask the relative amount of use of one product compared to another. For example, the survey included questions like:"What flavors of tobacco products have you used in the past 30 days?" but did not ask how often one flavor was used by the same individual over another.

While [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contain information for groups of individuals who answered the survey questions as male or female. 

## **What are the data?**
*** 
 
The data in this case study comes from the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"} which is an annual survey that asks students in high school and middle school (grades 6-12) about tobacco usage in the United States of America.

The data for this survey is freely available online at this [website](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html){target="_blank"} with data from 1999, 2000, 2002, 2004, 2006, 2009,  and 2011-2019. We will be using data from **2015-2019** due to the fact that these years are the most recent that asked questions regarding e-cigarette usage.

Each year includes documentation, such as a [codebook](https://www.lib.ncsu.edu/data/icpsrfaq#whatare){target="_blank"} and an excel file containing the data:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "data.png"))
```
Therefore, since we are using data from **2015-2019**, the data we are interested in are located in 5 different excel files, one for each year, each with its own [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"} (this is the one for 2019).

The [codebook](https://www.icpsr.umich.edu/icpsrweb/content/shared/ICPSR/faqs/what-is-a-codebook.html){target="_blank"} contains information describing the data within the excel file. 

As you can see the excel file contains very short variable names and numeric values, and it is not clear what they mean without the codebook:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "excel.png"))
```

The codebook explains what the variables (the columns) are:
```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "variables.png"))
```

And the codebook explains what the values for each variable are:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

We will explain more later about what the values on the right indicate.

The reason that there are codebooks for each year is because the questions asked each year varied slightly.


The data in this survey are pooled cross-sectional data. In this study design, different subsets of student respondents are surveyed each year and it is not clear which, if any, individuals participate more than once from one year to the next.

## **Data Import**
*** 

### **Reading in the excel files**
***

Since these excel files are so large (each has roughly 20,000 rows), it takes a bit of time for the data to load. To make the process faster, we previously imported these files, selected only the columns relevant to our questions of interest, and saved these data subsets as comma-separated (.csv) files. 

You may download these files using the `OCSdata` package:

```{r, eval=FALSE}
# install.packages("OCSdata")

library(OCSdata)
simpler_import_data("ocs-bp-vaping-case-study", outpath = getwd())
```

You may also obtain these files from the GitHub Repository. If you have trouble accessing the GitHub Repository, these CSV files can be downloaded from [here](https://github.com/opencasestudies/ocs-bp-vaping-case-study/tree/master/data/data_reduced){target="_blank"}.

***

<details><summary> Click here for details on how the data were originally imported </summary>

If you have trouble accessing the GitHub Repository, you can follow the links to download the excel files for [2015](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2015-nyts-dataset-and-codebook-microsoft-excel/nyts2015.xlsx){target="_blank"}, [2016](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2016-nyts-dataset-and-codebook-microsoft-excel/nyts2016.xlsx){target="_blank"}, [2017](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2017-nyts-dataset-and-codebook-microsoft-excel/nyts2017.xlsx){target="_blank"}, [2018](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2018-nyts-dataset-and-codebook-microsoft-excel/nyts2018.xlsx){target="_blank"}, and [2019](https://github.com/opencasestudies/ocs-bp-vaping-case-study/blob/master/docs/2019-nyts-dataset-and-codebook-microsoft-excel/nyts2019.xlsx){target="_blank"}.

First we created a vector of file names of all the different excel files. Using the `here()` function of the `here` package, we looked in all the directories of the project.
The `list.files()` function looked for all files with .xlsx within these sub-directories.

```{r}
excel_files <- list.files(here::here(), recursive = TRUE,
                  pattern = "*.xlsx")
excel_files
```

All the files were read using `read_excel()` of the `readxl` package. Using the `map()` function of the `purrr` package this was done efficiently for all of the excel files in the vector using one command. The `.` is used to indicate that we want to apply the `read_excel()` function to each element of the data that we just piped into the `map()` function.

Here we also used the `%>%` pipe which can be used to pass the input from one function to another for later sequential steps. You can use pipes if you load the `dplyr` package or the `magrittr` package.

This created a single list of tibbles (one for each file). 
```{r, eval = FALSE}
tbl_files <- excel_files %>%
       map(~ readxl::read_excel(.))
```

The elements of this list are in the same order as the elements of the `excel_files` vector, so we can extract the data for a particular file (year) by selecting a certain element of the list. However, it is safer to be able to select the data from this list for a specific year using a name based on the original vector of file names. We extracted a name from each Excel file name using the `str_extract()` function of the `stringr` package. Here we are keeping occurrences of the character string "nyts201" followed by a "5", "6", "7", "8", or "9".

```{r}
tbl_names <- excel_files %>%
  str_extract("nyts201[5-9]")

tbl_names
```

These names became the names of the tibbles in the list of tibbles.
```{r, eval = FALSE}
names(tbl_files) <- tbl_names
```


Specific columns were selected using the `select()` function of `dplyr` from each of the tibbles using the variable name, as identified in the codebook as being of interest for our analysis.
In some cases functions like `starts_with()` of the `dplyr` package were used to select several variables at once. Most of the survey questions about tobacco use start with an `"E"` or a `"C"` according to the codebooks. 

```{r, eval = FALSE}

tbl_files[["nyts2015"]] <- tbl_files[["nyts2015"]] %>%
    dplyr::select(psu, # Primary Sampling Unit
                  finwgt, # Analysis Weight
                  stratum, # Sampling stratum
                  Qn1, # Age
                  Qn2, # Sex
                  Qn3, # Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  )


tbl_files[["nyts2016"]] <- tbl_files[["nyts2016"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, # Age
                  Q2, # Sex
                  Q3, # Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, # Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, # Clove or spice
                  Q50C, # Fruit
                  Q50D, # Chocolate
                  Q50E, # Alcoholic Drink
                  Q50F, # Candy/Desserts/Other Sweets
                  Q50G  # Some Other Flavor Not Listed Here
                  )

tbl_files[["nyts2017"]] <- tbl_files[["nyts2017"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, # Age
                  Q2, # Sex
                  Q3, # Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, # Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, # Clove or spice
                  Q50C, # Fruit
                  Q50D, # Chocolate
                  Q50E, # Alcoholic Drink
                  Q50F, # Candy/Desserts/Other Sweets
                  Q50G  # Some Other Flavor Not Listed Here
                  )

tbl_files[["nyts2018"]] <- tbl_files[["nyts2018"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, # Age
                  Q2, # Sex
                  Q3, # Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q50A, # Menthol #What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q50B, # Clove or spice
                  Q50C, # Fruit
                  Q50D, # Chocolate
                  Q50E, # Alcoholic Drink
                  Q50F, # Candy/Desserts/Other Sweets
                  Q50G  # Some Other Flavor Not Listed Here
                  )

tbl_files[["nyts2019"]] <- tbl_files[["nyts2019"]] %>%
    dplyr::select(psu,
                  finwgt,
                  stratum,
                  Q1, # Age
                  Q2, # Sex
                  Q3, # Grade
                  starts_with("E",
                              ignore.case = FALSE),
                  starts_with("C",
                              ignore.case = FALSE),
                  Q40, # Brand, e-cigarettes
                  Q62A, # Menthol # What flavors of tobacco products have you used in the past 30 days? (Select one or more)
                  Q62B, # Clove or spice
                  Q62C, # Fruit
                  Q62D, # Chocolate
                  Q62E, # Alcoholic Drink
                  Q62F, # Candy/Desserts/Other Sweets
                  Q62G, # Some Other Flavor Not Listed Here
                  )
```

A directory called `reduced` was created for the new .csv files using the base `dir.create()` function. 
A csv file was created for each of the tibbles in the list using the `write_csv()` function of the `readr` package.
This was done all at once using the `map()` function.

```{r, eval = FALSE}
dir.create("data/reduced")

names(tbl_files) %>% map(~ write_csv(tbl_files[[.]], 
                                     path = paste0("data/reduced/", ., ".csv")))
```

</details>

***

Now we will show how to read in the data from the five csv files that were created from the five different Excel files.

### **Reading in the CSV files**
***

Using the `here()` function of the `here` package, we looked in all the directories of the project. The `here` package automatically starts looking for files based on where you have a `.Rproj` file which is created when you start a new RStudio project.

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects){target="_blank"} to learn more about using RStudio projects. 

</details>

***

The `list.files()` function looked for all files with .csv within the `data/reduced` sub-directories.

All the files were read using `read_csv()` of the `readr` package. Using the `map()` function of the `purrr` package this was done efficiently for all of the CSV files in the vector using one command. The `.` is used to indicate that we want to apply the `read_csv()` function to each element of the data that we just piped into the `map()` function. For more information about the `map()` function, see [here](https://purrr.tidyverse.org/reference/map.html){target="_blank"}.

Here we also used the `%>%` pipe which can be used to pass the input from one function to another for later sequential steps. 

```{r}
nyts_data <- list.files(here::here(), recursive = TRUE,
                        pattern = "*.csv") %>%
  stringr::str_subset(., "wrangled", negate = TRUE) %>%
  map(~ read_csv(.))


nyts_data_names <- list.files(recursive = TRUE,
                              pattern = "*.csv") %>%
  stringr::str_subset(., "wrangled", negate = TRUE) %>%
  str_extract("nyts201[5-9]")

names(nyts_data) <- nyts_data_names
```

We can save our imported data as an rda file (stands for R data file) using the `save()` function. 

```{r}
save(nyts_data, file = here::here("data", "imported", "imported_data.rda"))
```


## **Data Exploration and Wrangling**
*** 

If you have been following along but stopped, we could load our imported data from the "data" directory like so:

```{r}
load(here::here("data", "imported", "imported_data.rda"))
```

***
<details> <summary> If you skipped the data import section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the imported data using the following code:

```{r, eval=FALSE}
imported_data("ocs-bp-vaping-case-study", outpath = getwd())
load(here::here("OCSdata", "data", "imported", "imported_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-vaping-case-study/tree/master/data/imported) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-vaping-case-study/master/data/imported/imported_data.rda). Download this file and then place it in your current working directory within a subdirectory called "imported" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "imported", "imported_data.rda"))
```

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>

***


Our goal in this section is to adjust or "wrangle" the data from each year into a common format so that we can combine the datasets across years for our analysis, and so that we have values in our variables that are correct and easy to interpret. We will need to understand what is the same and what is different across the data from different years, rename and recode the variables (e.g., by replacing the numbers 1 and 2 with the values "Male" and "Female" for the `Sex` variable), and combine the data. We will walk through these steps below. 

First, let's take a look at our data. We can get a good sense of it using the `glimpse()` function of the `dplyr` package.

#### {.scrollable }

```{r}
dplyr::glimpse(nyts_data[["nyts2015"]])
```
####

### **Updating the set of variables and their names**
*** 

The easiest way of making it so that the data from the different years can be combined is by making sure that the same type of data in different datasets share the same names. In addition, giving the columns informative names will help make our code more readable. Currently, it isn't very clear what most of the variables indicate since the variable names are uninformative on their own, without the [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"}.

We want to rename variables like `Qn1` to something more meaningful like `Age`.

To do this we will use the `rename()` function of the `dplyr` package. The new name is always listed first before the `=`. This function will replace the old variable names with the new ones, i.e., after running the code below, there will no longer be a `Qn1` variable in the dataset, but there will be an `Age` variable instead. We will start working with the 2015 data, and then move on to the other years down below.

```{r}
nyts_data[["nyts2015"]] <- nyts_data[["nyts2015"]] %>%
  dplyr::rename(Age = Qn1,
                Sex = Qn2,
                Grade = Qn3)
```


Ultimately we will be combining the data from each year using the `bind_rows()` function of the `dplyr` package, which will fill in `NA` values for any columns that do not exist for one of the years.

The data for 2016-2018 have many common attributes, so we will want to write code that can be applied to all three datasets. To do this, we will use a **function** in R, which is basically a piece of code that can be applied to similar but different objects in R (e.g., the data tibbles from each of these three years). Recall that you are using functions from packages all the time, like the `rename()` function of the `dplyr` package. Now you are going to write your own function! For more information on functions, see [here](https://r4ds.had.co.nz/functions.html){target="_blank"}.

These next 3 years have the same structure for many of the questions we are interested in. For example, they all have flavor questions, but not a brand question. Moreover, their variable names are consistent across the years; for each year, we want to replace the vague question name `Q50A` with the value `menthol` in all three datasets, and the same is true for the other flavor variables.  

Since we want to perform the same modifications on the data from all three years, rather than repeating the same somewhat messy piece of code three times, we can do this more efficiently if we create a function to do all of these steps at once. Then we can use the `map_at()` function of the `purrr` package, which is an extension of the `map()` function that we used in the [Data Import] (https://www.opencasestudies.org/ocs-bp-vaping-case-study/#Data_Import){target="_blank"} section to apply the function we just created (for renaming variables etc.) specifically to the data from 2016-2018 within the `nyts_data`. By using `vars()` inside of the `map_at()` function we can specify what tibbles within our `nyts_data` list we want to include or exclude. 


#### **Functions**
*** 

So how do you create a function?
You first need to specify that you are creating a function by using the `function()` base function.
Yes, that's right it is a function for creating functions called function!

First we specify our input within the parentheses of `function()`. Thus if our function will apply something to an input called `x` then we would use `function(x)`. Theoretically our input can be named whatever we want, but we we need to refer to it consistently within our function to indicate what we want done with the input data. We can actually have more than one input as well, we would indicate two inputs like this: `function(x, b)`. Here we would be using both `x` and `b` to do something in our function.

The next part of a function is defined within curly brackets `{}`. This is where we write what we want done to our inputs.

***

<details> <summary> Click here to see a simple example </summary> 

Our function will be called `simple_function`, which will add 2 to our input. It will take the input `vector_data` (note that we usually want to use an informative input argument like `vector_data`). 

```{r}
simple_function <- function(vector_data) {
  vector_data + 2
}
```

We can use an input that is a vector called `x`.

```{r}
x = c(1, 2, 3, 4)
x
```

Now we specify using the argument ` vector_data =` to indicate that `x` is our input that we want to perform the function on.

```{r}
simple_function(vector_data = x)
```

This function also works for other vector input. For example, vector `y` is now our input that we want to perform the function on.

```{r}
y = c(5, 10, 15, 20)

simple_function(vector_data = y)
```

</details>

***

In our case we will be applying our function to the variable names for the dataset for each year. The output of our function is the result of renaming these variables for each year. 

```{r}

update_survey <- function(dataset) { dataset %>%
           rename(Age = Q1,
                  Sex = Q2,
                Grade = Q3,
              menthol = Q50A,
          clove_spice = Q50B,
                fruit = Q50C,
            chocolate = Q50D,
      alcoholic_drink = Q50E,
 candy_dessert_sweets = Q50F,
                other = Q50G)
}

# options to apply the function to the data:
# nyts_data <- nyts_data %>% map_at(vars(nyts2016, nyts2017, nyts2018), Update_survey)
nyts_data <- nyts_data %>% map_at(vars(-nyts2015, -nyts2019), update_survey)
```

The final year, 2019, has a slightly different data structure compared to these earlier datasets. It actually has a `brand_ecig` variable already and different question numbers correspond to our flavor questions of interest. So we will rename the variables in this dataset individually. We could also write this as a function, but since we are only applying this one time, there is no need to. Functions are really helpful for repeating the same task repeatedly using different data inputs.

```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
    rename(brand_ecig = Q40,
                  Age = Q1,
                  Sex = Q2,
                Grade = Q3,
              menthol = Q62A,
          clove_spice = Q62B,
                fruit = Q62C,
            chocolate = Q62D,
      alcoholic_drink = Q62E,
 candy_dessert_sweets = Q62F,
                other = Q62G)
```


Now let's take a look at the variable names for each of the years using the `map` function from `purrr`.

#### {.scrollable }

```{r}
map(nyts_data, names)
```
####

It's looking better! The data that overlap across years have the same variable names.

### **Updating Values**
***
Now that we have made some progress on the selection and names of the variables themselves, we will work on the values contained in the different variables.

We can start with updating the values for `Age` and `Grade`, so that they are more understandable. 

Recall from the codebook for this year's dataset that `Age` isn't listed in the way one might expect, i.e., it is not just a number of years, but a numerically valued categorical variable.

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "qn1.png"))
```

The same is true for `Grade`:

```{r, echo = FALSE, fig.align ="center", out.width = "600 px"}

knitr::include_graphics(here::here("img", "grade.png"))
```

**This is why it is so important to always check the codebook!!**

We also  want to replace the value of `19` for `Age` to be `">18"` and the value of `13` for `Grade` to be replaced with `"Ungraded/Other"` Also according to the codebooks, numeric values of `1` indicate a survey answer of `FALSE`, while a value of `2` indicates `TRUE`. `Sex` also needs to be recoded. If we take a look at the codebooks carefully (make sure you look at the questions that we pulled, not the recoded values), we will see that males are indicated by `1` and females are indicated by `2`. Finally some values are indicated with `"*"` or`"**"` when they are missing. We want to replace these with `NA`.

Let's create a function to make all these updates. We will use the `mutate` function of the `dplyr` package to modify these variables. This function can also be used to create new variables. We will also use the  `recode()` function of the `dplyr` package to replace specific values of certain variables.

```{r}
update_values <- function(dataset) {
  dataset %>%
    dplyr::mutate(Age = as.numeric(Age) + 8,
                  Grade = as.numeric(Grade) + 5) %>%
    mutate(Age = as.factor(Age),
           Grade = as.factor(Grade),
           Sex = as.factor(Sex)) %>%
    mutate(Sex = recode(Sex,
                        `1` = "male",
                        `2` = "female")) %>%
    dplyr::mutate_all( ~ replace(., . %in% c("*", "**"), NA)) %>%
    mutate(Age = recode(Age,
                        `19` = ">18"),
           Grade = recode(Grade,
                          `13` = "Ungraded/Other")) %>%
    dplyr::mutate_at(vars(
      starts_with("E", ignore.case = FALSE),
      starts_with("C", ignore.case = FALSE)
    ),
    list( ~ recode(
      .,
      `1` = TRUE,
      `2` = FALSE,
      .default = NA,
      .missing = NA
    )))
}

nyts_data <- nyts_data %>% map(., update_values)
```


Now if we wanted to check that everything is expected we could do something like this to check the `Sex` variable using the `count()` function of the `dplyr` package. It is advisable to check your data frequently to make sure that it is as expected!

According to the codebook, we should have:  

1) 8,958 males in 2015  
2) 10,438 males in 2016  
3) 8,881 males in 2017  
4) 10,069 males in 2018  
5) 9,803 males in 2019   

```{r}
count_sex <- function(dataset) {dataset %>% count(Sex)}
nyts_data %>% map(., count_sex)
```


Looks good!

The years (2016-2019) that have flavors also need the flavor data to be logical (meaning TRUE or FALSE):

In this case we also are setting missing values to `FALSE` because then the `TRUE` values will represent those who reported using a specific flavor out of all users, rather than those that used a specific flavor compared to those who used a different flavor.


```{r}
update_flavors <- function(dataset) {dataset %>%
   mutate_at(vars(menthol:other),
              list(~ recode(.,
                           `1` = TRUE,
                      .default = FALSE,
                      .missing = FALSE))) }

nyts_data  <- nyts_data  %>% map_at(vars(-nyts2015), update_flavors)
```


Now there are just a few changes needed that are specific to 2019. Specifically, some of the 2019 questions use the values ".N", ".S", and ".Z" to indicate different types of missing data (see for example Q2 of the 2019 [codebook](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-codebook-p.pdf){target="_blank"}); we just want them to be replaced with `NA` values.


```{r}
nyts_data[["nyts2019"]] <- nyts_data[["nyts2019"]] %>%
  mutate_all(~ replace(., . %in% c(".N", ".S", ".Z"), NA)) %>%
  mutate(psu = as.character(psu)) %>%
  mutate(brand_ecig = recode(brand_ecig,
                                             `1` = "Other", # levels 1,8 combined to `Other`
                                             `2` = "Blu",
                                             `3` = "JUUL",
                                             `4` = "Logic",
                                             `5` = "MarkTen",
                                             `6` = "NJOY",
                                             `7` = "Vuse",
                                             `8` = "Other"))
```

Great! Now our values don't need to be handled any differently for any of the years, thus we can combine the data across years.

Even though we have different numbers of variables for each year, we can coerce the data to be combined into one tibble by using the `bind_rows()` function of `dplyr`. Importantly, this function does not require that the columns be the same.  This will create NA values for any variable that is not present in given data frame but is  present in one of the other data frames that is being combined. Note that the `bind_cols()` function does expect that the rows match. The `.id` argument will create a new variable with values to link each row to its original data frame. For more information see [here](https://dplyr.tidyverse.org/reference/bind.html).


```{r}
nyts_data <- nyts_data %>%
  map_df(dplyr::bind_rows, .id = "year")

glimpse(nyts_data)
```

We will want to do some of our analysis split by year, so we would like to be sure we have one variable that has the correct value for year. It looks like we just need to remove `"nyts"` from the year variable that we created from the names of the tibbles in our list and we should be all set. We will use another function from the `stringr` package to do this. The `str_remove()` function takes a string followed by a pattern and removes the pattern from the string.

```{r}
nyts_data <- nyts_data %>%
  mutate(year = as.numeric(str_remove(year, "nyts")))
```

Here is our clean and wrangled data:

#### {.scrollable}
```{r}
glimpse(nyts_data)

```

####


Note that there are several variables where there are similar names, but with a `C` compared to an `E` in the variable name. Those starting with `C` are related to questions about current usage (last 30 days), while those with an `E` are related to usage across the student respondent's whole life ("ever" usage). We will discuss these groups further below.

Now we will save our wrangled data. We will save it as an rda file for ourselves and as csv files, as this is often a good option for collaborators. We will save this file in a directory called "wrangled" within our "data" directory of our project.

```{r}
save(nyts_data, file = here::here("data", "wrangled", "wrangled_data.rda"))

write_csv(nyts_data, path = here::here("data", "wrangled", "nyts_data.csv"))
```


### **Variable Table**
***
<details><summary> Click here to see a table about the final variables in our data set. </summary>


Variable   | Details                                                                        
---------- |-------------
**year**  | the year that the survey results from a particular student respondent were acquired  
**psu** | the primary sampling unit for the survey weighting  
**finwgt** | the final analysis weight for the survey weighting  
**stratum** | the stratum used for variance estimation for the survey weighting  
**Age** | the age of the student when they took the survey  
**Sex** | the sex of the student when they took the survey  
**Grade** | the grade of the the student when the took the survey  
**ECIGT** | student reported having ever tried cigarette smoking, even one or two puffs  
**ECIGAR** | student reported having ever tried cigar, cigarillo, or little cigar smoking, even one or two puffs  
**ESLT** | student reported having ever tried chewing tobacco, snuff, or dip  
**EELCIGT** | student reported having ever tried e-cigarettes  
**EROLLCIGTS** | student reported having ever tried roll-your-own cigarettes  
**EFLAVCIGTS** |  (2015 only) based on answer to "Which of the following tobacco products that you used in the past 30 days were flavored?"  
**EBIDIS** | student reported having ever tried bidis (small brown cigarettes wrapped in a leaf)  
**EFLAVCIGAR** | student reported having ever tried a flavored cigar (2015-2016)
**EHOOKAH** | student reported having ever smoked tobacco from a hookah or a waterpipe  
**EPIPE** | student reported having ever smoked tobacco from a pipe (not hookah)  
**ESNUS** | student reported having ever used snus, such as Camel or Malboro Snus  
**EDISSOLV** | student reported having ever tried dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips  
**CCIGT** | student reported they smoked cigarettes on >= 1 of the past 30 days  
**CCIGAR** | student reported they smoked cigars on >= 1 of the past 30 days  
**CSLT** | student reported they used chewing tobacco, snuff, or dip on >= 1 of the past 30 days  
**CELCIGT** | student reported they used electronic cigarettes or e-cigarettes one or more days in the past 30
**CROLLCIGTS** | student reported they smoked roll-your-own cigarettes during the past 30 days  
**CFLAVCIGTS**| (2015 only) based on answer to "Which of the following tobacco products that you used in the past 30 days were flavored?" 
**CBIDIS** | student reported they smoked bidis during the past 30 days  
**CHOOKAH** | student reported they smoked tobacco in a hookah on >= 1 of the past 30 days  
**CPIPE** | student reported they smoked tobacco in a pipe (not hookah) during the past 30 days  
**CSNUS** | student reported they used snus during the past 30 days    
**CDISSOLV** | student reported they used dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips during the past 30 days  
**brand_ecig** | student answer to "During the past 30 days, what brand of e-cigarettes did you usually use?"  
**menthol** | student selected Menthol or mint as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**clove_spice** | student selected clove or spice as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**fruit** | student selected fruit as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**chocolate** | student selected chocolate as the answer to "What flavors of tobacco products have you used in the past 30 days? (select one or more)"  
**alcoholic_drink** | student selected alcoholic drink (such as wine, cognac, margarita, or other cocktails) as the answer to "What flavors of tobacco products have you used in the past 30 days? (choose one or more)"  
**candy_dessert_sweets** |  student selected candy, desserts or other sweets as the answer to "What flavors of tobacco products have you used in the past 30 days? (choose one or more)" 
**other** | student selected some other flavor not listed as the answer to "What flavors of tobacco products have you used in the past 30 days? (choose one or more)" 
**EHTP** | student reported having ever tried heated (also known as "heat-not-burn") tobacco products  
**CHTP** | student reported they used heated tobacco products during the past 30 days  
</details>


## **Data Visualization**
*** 

If you have been following along but stopped, we could load the wrangled data from the "data" directory like so:

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> If you skipped the data exploration and wrangling section click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-vaping-case-study", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-vaping-case-study/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-vaping-case-study/master/data/wrangled/wrangled_data.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data.rda"))
```

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>

***


Recall that our main questions were:

1) How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
2) How does e-cigarette use compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   
> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
> "What flavors of tobacco products have you used in the past
30 days?" 

4) Is there a relationship between e-cigarette/vaping use and other tobacco use?


We are now going to create data visualizations to explore each of these questions.

For many of these questions we will be interested in both **current** and **ever** users, so we will want to create a variable for labeling individuals who are current users of any tobacco product (or not, i.e., who do not currently use a tobacco product) and a variable for labeling individuals who are "ever users" of any tobacco product (or not, i.e., who have never used a tobacco product).

We define these two groups as follows:

1) **current** = students who used a product for >=1 day in the past 30 days  
2) **ever** =  students who report having used or tried a product at any point in time

All **current** users are therefore **ever** users but not all **ever** users are **current** users. Thus, **current** users are a subset of **ever** users.

To add these grouping variables to our data we will do a bit more wrangling using the `mutate()` function again of the `dplyr` package. As discussed above, our dataset contains a set of questions that relate to whether the student has ever used the particular tobacco product (questions that start with the letter "E"), and questions that relate to whether the student currently uses the particular tobacco product (questions that start with the letter "C"). 

Here are some examples for these data entries:  

 - EPIPE: Students who reported they have smoked tobacco from a pipe (not hookah).  
 - CPIPE: Students who reported they smoked tobacco in a pipe (not hookah) during the past 30 days. 
 - EROLLCIGTS: Students who reported they have tried smoking roll-your-own cigarettes. 
 - CROLLCIGTS: Students who reported they smoked roll-your-own cigarettes during the past 30 days. 
 
Based on many questions like this:  
 
 In the past 30 days, which of the following products have you used on at least one day? (Select one or more)   
A. Roll-your-own cigarettes  
B. Pipes filled with tobacco (not hookah or waterpipe)   
C. Snus, such as Camel, Marlboro, or General Snus  
D. Dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks,
or Camel strips  
E. Bidis (small brown cigarettes wrapped in a leaf)  
F. I have not used any of the products listed above in the past 30 days  

 Which of the following tobacco products have you ever tried, even just one time? (Select one or more)   
A. Roll-your-own cigarettes  
B. Pipes filled with tobacco (not hookah or waterpipe)   
C. Snus, such as Camel, Marlboro, or General Snus  
D. Dissolvable tobacco products such as Ariva, Stonewall, Camel orbs, Camel sticks, Marlboro sticks, or Camel strips   
E. Bidis (small brown cigarettes wrapped in a leaf)  
F. I have never tried any of the products listed above   
 
 
We will sum across the variables that relate to ever or current tobacco usage questions to determine if the student answered yes to any of the ever or current questions. To do this we will use the base `rowSums` function.

We will then use the `case_when()` function of the `dplyr` package to convert the sum values to `TRUE` or `FALSE` based on the threshold of zero. If the sum is greater than zero, then we know the student answered yes to at least one question. 


```{r}
nyts_data %<>%
  mutate(tobacco_sum_ever = rowSums(select(., starts_with("E", 
                                    ignore.case = FALSE)), na.rm = TRUE),
      tobacco_sum_current = rowSums(select(., starts_with("C", 
                                    ignore.case = FALSE)), na.rm = TRUE)) %>%
      mutate(tobacco_ever = case_when(tobacco_sum_ever > 0 ~ TRUE,
                                      tobacco_sum_ever == 0 ~ FALSE),
          tobacco_current = case_when(tobacco_sum_current > 0 ~ TRUE,
                                      tobacco_sum_current == 0 ~ FALSE))
```
 
 
#### {.scrollable}
```{r}
glimpse(nyts_data)
```
####

We are also interested in e-cigarette/vaping product usage compared to other tobacco products, so we will create some variables related to the sum of all e-cigarette usage question variables and the sum of all tobacco usage question variables excluding those that are about e-cigarettes. There is only one variable about e-cigarette usage ever (EELCIGT) and one about current usage (CELCIGT).


```{r}
nyts_data <- nyts_data %>% 
  mutate(ecig_sum_ever = rowSums(select(., EELCIGT), na.rm = TRUE),
      ecig_sum_current = rowSums(select(., CELCIGT), na.rm = TRUE),
     non_ecig_sum_ever = rowSums(select(., starts_with("E", 
                                           ignore.case = FALSE), 
                                           -EELCIGT), na.rm = TRUE),
  non_ecig_sum_current = rowSums(select(., starts_with("C",
                                           ignore.case = FALSE), 
                                           -CELCIGT), na.rm = TRUE)) %>%
      mutate(ecig_ever = case_when(ecig_sum_ever > 0 ~ TRUE,
                                   ecig_sum_ever == 0 ~ FALSE),
          ecig_current = case_when(ecig_sum_current > 0 ~ TRUE,
                                   ecig_sum_current == 0 ~ FALSE),
         non_ecig_ever = case_when(non_ecig_sum_ever > 0 ~ TRUE,
                                   non_ecig_sum_ever == 0 ~ FALSE),
      non_ecig_current = case_when(non_ecig_sum_current > 0 ~ TRUE,
                                   non_ecig_sum_current == 0 ~ FALSE))
```                       

#### {.scrollable}
```{r}
glimpse(nyts_data)
```

####

Finally, we are also interested in grouping students that only use e-cigarettes and those that only use other forms of tobacco.

Recall that current users are a subset of ever users, thus students would typically answer yes to having tried vaping products if they had used them one or more days in the past 30 days.

First we will make a small toy dataset called `test` to show what we will do with the larger dataset:
```{r}
test <- tibble(ecig_ever = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE",
                             "FALSE", "TRUE", "FALSE", "FALSE"),
               non_ecig_ever = c("TRUE", "FALSE", "FALSE", "FALSE", "FALSE",
                                 "FALSE", "TRUE", "TRUE", "TRUE"),
               ecig_current = c("TRUE", "FALSE", "FALSE", "TRUE", "TRUE", 
                                "FALSE", "FALSE", "FALSE", "FALSE"),
               non_ecig_current = c("TRUE", "FALSE", "TRUE", "FALSE", "TRUE",
                                    "FALSE", "FALSE", "FALSE", "TRUE"))

test
```

Now, let's look at identifying students who have tried e-cigarettes, but are not current users, and who have never tried other tobacco products (and are therefore not current users). We will again use the `case_when()` and the `mutate` function to create new variables with specific values when certain conditions are met. In this case, we will specify that several conditions must be met by using the `&` symbol. For a value of `TRUE` for the new `ecig_only_ever` variable, all of the conditions combined with `&` must be met.  If any of the conditions are not met then the `ecig_only_ever` value will be `FALSE` based on the last line `TRUE ~ FALSE`.

```{r}

test <- test %>% mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                               non_ecig_ever == FALSE &
                                                ecig_current == FALSE &
                                            non_ecig_current == FALSE ~ TRUE,
                                                         TRUE ~ FALSE))
test
```

We can see from the second row, that the `ecig_only_ever` is `TRUE` when we would expect it to be.
We can also see from the fourth row, that even though the student reported yes to ever trying e-cigarettes, because they also reported yes to currently using e-cigarettes the value for only ever trying e-cigarettes is `FALSE`. Additionally we can see from the seventh row that similarly even though the student reported yes to ever trying e-cigarettes, they also reported yes to ever trying other products, and the value for only ever trying e-cigarettes is `FALSE`. Importantly, we can see from the 6th row, that if all responses are negative than the value is `FALSE`.

Now we will expand this to the other possible categories. In this case we note that since current users are a subset of ever users, it doesn't matter if a user reports yes to ever trying  e-cigarettes, they can still be a current user.


```{r}
test <- test %>%
         mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                       non_ecig_ever == FALSE &
                                        ecig_current == FALSE &
                                    non_ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE),
          ecig_only_current = case_when(ecig_current == TRUE &
                                       non_ecig_ever == FALSE &
                                    non_ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE),
        non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                           ecig_ever == FALSE &
                                        ecig_current == FALSE &
                                    non_ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE),
  non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                           ecig_ever == FALSE &
                                        ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE),
                    no_use = case_when(non_ecig_ever == FALSE &
                                           ecig_ever == FALSE &
                                        ecig_current == FALSE &
                                    non_ecig_current == FALSE ~ TRUE,
                                                        TRUE ~ FALSE))
glimpse(test)
```


Take a minute to check that the values are what we would expect.

OK, now we are going to make a `Group` variable based on the new variables we just made to classify students into one of four mutually exclusive and exhaustive categories. In this case we will have a particular value based on one condition **or** another. This **or** conditional is specified by the `|` symbol. Only one of the conditions needs to exist for that particular value, whereas when we used the `&` symbol, all of the conditions had to be met. 

If a student has ever tried or currently uses e-cigarettes, but has never tried other tobacco products, the value will be `Only e-cigarettes`. If a student has ever tried or is a current user of other tobacco products, but has never tried e-cigarettes, the value will be `Only other products`. If the value of the `no_use` variable is simply `TRUE`, then the `Group` variable value will be `Neither`. Finally, if a student has tried or currently uses both e-cigarettes and other tobacco products the `Group` variable value will be `Combination of products`. Thus in this case the values for the usage of the variables based on **only** using e-cigarettes or **only** other products will all be `FALSE`. 

```{R}

test <- test %>%
  mutate(Group = case_when(ecig_only_ever == TRUE |
                        ecig_only_current == TRUE ~ "Only e-cigarettes",
                       non_ecig_only_ever == TRUE |
                    non_ecig_only_current == TRUE ~ "Only other products",
                                   no_use == TRUE ~ "Neither",
                           ecig_only_ever == FALSE &
                        ecig_only_current == FALSE &
                       non_ecig_only_ever == FALSE &
                    non_ecig_only_current == FALSE &
                                   no_use == FALSE ~ "Combination of products"))


test %>% count(Group)

glimpse(test)
```

OK, now that we have seen how this works with our toy dataset, we will apply our code to our `nyts_data`.

```{r}
nyts_data %<>%
             mutate(ecig_only_ever = case_when(ecig_ever == TRUE &
                                           non_ecig_ever == FALSE &
                                            ecig_current == FALSE &
                                        non_ecig_current == FALSE ~ TRUE,
                                                             TRUE ~ FALSE),
              ecig_only_current = case_when(ecig_current == TRUE &
                                           non_ecig_ever == FALSE &
                                        non_ecig_current == FALSE ~ TRUE,
                                                            TRUE ~ FALSE),
            non_ecig_only_ever = case_when(non_ecig_ever == TRUE &
                                               ecig_ever == FALSE &
                                            ecig_current == FALSE &
                                        non_ecig_current == FALSE ~ TRUE,
                                                            TRUE ~ FALSE),
      non_ecig_only_current = case_when(non_ecig_current == TRUE &
                                               ecig_ever == FALSE &
                                            ecig_current == FALSE ~ TRUE,
                                                            TRUE ~ FALSE),
                        no_use = case_when(non_ecig_ever == FALSE &
                                               ecig_ever == FALSE &
                                            ecig_current == FALSE &
                                        non_ecig_current == FALSE ~ TRUE,
                                                            TRUE ~ FALSE)) %>%
                 mutate(Group = case_when(ecig_only_ever == TRUE |
                                       ecig_only_current == TRUE ~ "Only e-cigarettes",
                                      non_ecig_only_ever == TRUE |
                                   non_ecig_only_current == TRUE ~ "Only other products",
                                                  no_use == TRUE ~ "Neither",
                                          ecig_only_ever == FALSE &
                                       ecig_only_current == FALSE &
                                      non_ecig_only_ever == FALSE &
                                   non_ecig_only_current == FALSE &
                                                  no_use == FALSE ~ "Combination of products"))
```


Lastly, it can be very helpful to have the total number of students surveyed each year. We can easily add a variable for this by using the `add_count()` function of the `dplyr` package. This will create a variable called `n` which will show the total number of survey responses for that year.

```{r}
nyts_data %<>% dplyr::add_count(year)
```

#### {.scrollable}
```{r}
glimpse(nyts_data)
```

Now let's save our wrangled data for future use. We will save it as an r compatible file (a .rda file), as well as csv files, as these are useful to give to collaborators. We will use the `write_csv()` function of the `readr` package to do this.

```{r}
save(nyts_data, file = here::here("data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
readr::write_csv(nyts_data, 
                 path = here::here("data", "wrangled", "nyts_data_for_plots.csv"))
```

### **Question 1**
***

***
<details> <summary> Click here if you skipped the previous sections and want to start here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data for plots using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-vaping-case-study", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-vaping-case-study/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-vaping-case-study/master/data/wrangled/wrangled_data_with_var_for_plots.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

</details>

***

Recall that we are interested in investigating how vaping product use has compared with other tobacco use over time. To answer this, we first want to get a sense of how tobacco use has changed in general since 2015. 

To create a visualization of how tobacco usage has changed over time, we will first convert the usage data to a percent value for each year, telling us what percent of student respondents fall into a particular usage category. To do this we will use the `group_by()` and `summarize()` functions of the `dplyr` package. This will create new variables which we will name `Ever` and `Current` based on the  percentages of `TRUE` values for `tobacco_ever` and `tobacco_current` for each year. In this case the `mean()` function is used to calculate the percentages based on an automatic conversion that R does where for TRUE/FALSE variables, `TRUE` is given a value of one and `FALSE` is given a value of zero. The mean of a 0-1 binary variable is just the percent of the time the value is 1. NA values do not contribute to the total count when we include the argument `na.rm = TRUE` to our function call. 

***

<details> <summary> Click here to see a toy example:</summary>
```{r}
# the test data has 3 TRUE values and 7 FALSE values
test <- tibble("var1" = c("TRUE", "TRUE", "TRUE", rep("FALSE", 7)))
test %<>% mutate(var1 = as.logical(var1))
test

test %>% summarize(Percentage = mean(var1) * 100)


# the test data has 3 TRUE values, 3 FALSE values, and 4 NA value
test <- tibble("var1" = c("TRUE", "TRUE", "TRUE", rep("FALSE", 3), rep("NA", 4)))
test %<>% mutate(var1 = as.logical(var1))
test

test %>% summarize(Percentage = mean(var1, na.rm = TRUE) * 100)
```
</details>

***

And now back to our data:

```{r}

nyts_data %>%
  dplyr::group_by(year) %>%
  dplyr::summarize(Ever = (mean(tobacco_ever, na.rm = TRUE) * 100),
                   Current = (mean(tobacco_current, na.rm = TRUE) * 100))
```

We will use the `pivot_longer` function from the `tidyr` package to take all columns except year (in this case the `Ever` and `Current` columns), to create a column called `User` that will contain the current column name information and a column called `Percentage of students` which will contain the mean percentage values that we just calculated. This converts our data into a format called "long" format.

```{r}

nyts_data %>%
  dplyr::group_by(year) %>%
  dplyr::summarize("Ever \n (any lifetime use)" = (mean(tobacco_ever, na.rm = TRUE) * 100),
                   "Current \n (any past-30-day use)" = (mean(tobacco_current, na.rm = TRUE) * 100)) %>%
  tidyr::pivot_longer(cols = -year,
                      names_to = "User",
                      values_to = "Percentage of students")
```
You may have noticed that our data is longer than it used to be! Hence the name of the function `pivot_longer()`. Data is often easier to plot when it is in this format.
  
Now we will use this data to create a plot using the `ggplot2` package. 

The first thing we do to create a plot is specify what data we are using for our x axis and y axis with the`aes()` argument of the `ggplot()` function. Then we add layers to our plot that specify what type of plot we would like to create. We can use the `geom_line()` function to create lines for each type of user. We specify that we want to use different line types for each user using `aes()`. We will also add points to our lines using the `geom_point()` function. We can add additional layers to specify colors and details about labels and legends etc.
  
```{r}  

plot1 <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever \n (any lifetime use)" = (mean(tobacco_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)" = (mean(tobacco_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") %>%
  ggplot(aes(x = year, y = `Percentage of students`)) +
  geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(1, 2), 
                        breaks = c("Ever \n (any lifetime use)", 
                                   "Current \n (any past-30-day use)")) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 70, by = 10),
                     labels = seq(0, 70, by = 10),
                     limits = c(0, 70)) +
  # this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has tobacco use varied over the years?",
         y = "% of students")

plot1 + theme(text = element_text(size = 15))
```

Nice! Now we can see how overall tobacco usage has changed since 2017. It appears that usage first decreased from 2015 to 2017 and then increased a bit since 2017, surpassing the levels in 2015.

What about e-cigarette use? How has the usage of e-cigarettes changed over time?

```{r}
plot1a <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever \n (any lifetime use)" = (mean(ecig_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)" = (mean(ecig_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = -year, names_to = "User", values_to = "Percentage of students") %>%
  ggplot(aes(x = year, y = `Percentage of students`)) +
  geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(1, 2), 
                        breaks = c("Ever \n (any lifetime use)", 
                                   "Current \n (any past-30-day use)")) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 60, by = 10),
                     labels = seq(0, 60, by = 10),
                     limits = c(0, 60)) +
  # this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has e-cigarette use varied over the years?",
         y = "% of students")

plot1a + theme(text = element_text(size = 15))
```
It looks like the shape of the plot is very similar to tobacco usage overall. We see a downward trend until 2017 when the rate of both current and ever users increased. Recall that this is in agreement with the articles that we referenced earlier. We can see that the slope looks steeper for e-cigarette usage as compared to all tobacco products (including e-cigarettes).


Now let's plot this data together on the same plot.

We will have four groups (e-cigarette ever users, e-cigarette current users, tobacco ever users, and tobacco current users) to plot, therefore, it would be useful to add color to our plot. Keep in mind that e-cigarette users are a subset of any tobacco product users.

One important thing to keep in mind when creating plots is that individuals with color blindness may have a difficult time distinguishing groups when certain color choices are used. 

One great option is to use the `viridis` package, which offers color palettes with colors that are still distinguishable by individuals with most forms of color blindness. 

We can choose which colors we want to use by using the `show_col()` function of the `scales` package.

Here are some color options:
```{r}

scales::show_col(viridis_pal()(6))
v_colors =  viridis(6)[c(1, 4)]
```

We will select the first and fourth colors for our plot. To add these specific colors we will use the `scale_color_manual()` function of the `ggplot2` package.

We will calculate the mean ever and current usage percentages for students who used e-cigarettes or any tobacco products (including e-cigarettes) for each year again using the `group_by()` and `summarize()` functions. We will again use the `pivot_longer` function to convert our data to long format. We will also use the `separate()` function of the `tidyr` package to create two variables from one of the variables. This is done by separating by, in this case, an underscore. 


```{r}
nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever \n (any lifetime use)_Any Tobacco Product \n (including e-cigarettes)" = 
                       (mean(tobacco_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_Any Tobacco Product \n (including e-cigarettes)" =   
                       (mean(tobacco_current, na.rm = TRUE) * 100),
                     "Ever \n (any lifetime use)_E-cigarettes" = 
                       (mean(ecig_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_E-cigarettes" = 
                       (mean(ecig_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = -year, 
           names_to = "User", 
          values_to = "Percentage of students") %>%
  separate(User, into = c("User", "Product"), sep = "_") %>%
  head()


plot1t <- nyts_data %>%
    group_by(year) %>%
    summarize("Ever \n (any lifetime use)_Any Tobacco Product \n (including e-cigarettes)" = 
                (mean(tobacco_ever, na.rm = TRUE) * 100),
              "Current \n (any past-30-day use)_Any Tobacco Product \n (including e-cigarettes)" = 
                (mean(tobacco_current, na.rm = TRUE) * 100),
              "Ever \n (any lifetime use)_E-cigarettes" = 
                (mean(ecig_ever, na.rm = TRUE) * 100),
              "Current \n (any past-30-day use)_E-cigarettes" = 
                (mean(ecig_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = -year, 
           names_to = "User", 
          values_to = "Percentage of students") %>%
  separate(User, 
           into = c("User", "Product"), 
            sep = "_") %>%
    ggplot(aes(x = year, 
               y = `Percentage of students`,
           color = Product)) +
    geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(1, 2), 
                        breaks = c("Ever \n (any lifetime use)", 
                                   "Current \n (any past-30-day use)")) +
  # we want purple associated with e-cigarettes to be consistent with later plots
  scale_color_manual(values = rev(v_colors)) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 60, by = 10),
                     labels = seq(0, 60, by = 10),
                     limits = c(0, 60)) +
  # this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has tobacco use varied over the years?",
         y = "% of students")

plot1t + theme(text = element_text(size = 15))
```

We see an increase in all categories starting in 2017, but the rate of increase is higher for students using only e-cigarettes (current or ever users), as shown by the higher slope of the e-cigarette lines.

In the above plots, the "Any tobacco product" groups include individuals in the "E-cigarette only" groups. Now let's plot students in two mutually exclusive groups on the same plot: those who reported either using only e-cigarettes or only other tobacco products (besides e-cigarettes), but not both. 

We will calculate the mean ever and current usage percentages for students in these two mutually exclusive groups, again using the `group_by()` function and the `summarize()` function. We will again use the `pivot_longer` function to convert our data to long format. We will also again use the `separate()` function of the `tidyr` package to create two variables from one variable. This is done by separating by, in this case, a space. 

```{r}

nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever \n (any lifetime use)_E-cigarette" = 
                       (mean(ecig_only_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_E-cigarette" = 
                       (mean(ecig_only_current, na.rm = TRUE) * 100),
                     "Ever \n (any lifetime use)_Non-e-cigarette" = 
                       (mean(non_ecig_only_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_Non-e-cigarette" = 
                       (mean(non_ecig_only_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = -year, 
           names_to = "User", 
          values_to = "Percentage of students") %>%
  tidyr::separate(User, into = c("User", "Product"), sep = "_") %>%
  head()

plot1c <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever \n (any lifetime use)_E-cigarette" = 
                       (mean(ecig_only_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_E-cigarette" = 
                       (mean(ecig_only_current, na.rm = TRUE) * 100),
                     "Ever \n (any lifetime use)_Non-e-cigarette" = 
                       (mean(non_ecig_only_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_Non-e-cigarette" = 
                       (mean(non_ecig_only_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = -year, 
           names_to = "User", 
          values_to = "Percentage of students") %>%
  separate(User, into = c("User", "Product"), sep = "_") %>%
    ggplot(aes(x = year, y = `Percentage of students`, color = Product)) +
    geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(1, 2), 
                        breaks = c("Ever \n (any lifetime use)", 
                                   "Current \n (any past-30-day use)")) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 30, by = 10),
                     labels = seq(0, 30, by = 10),
                     limits = c(0, 30)) +
  scale_color_manual(values = v_colors) +
  # this adjusts the background style of the plot
    theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How has use of only e-cigarettes and
only tobacco products besides e-cigarettes varied over time?",
         y = "% of students")

plot1c + theme(text = element_text(size = 15))
```

Very interesting! We can see from this plot that the percentage of students who had currently used (or ever tried) only e-cigarettes greatly increased starting in 2017, while in contrast the percentage of students who had ever tried only non-e-cigarette tobacco products actually diminished over time. In fact, we can see that in 2019 the percentage of students who were current e-cigarette users surpassed the percentage that had tried a non-e-cigarette product even just once. 


Recall that we made a variable called `Group` that identified students who used either just e-cigarette products, just other tobacco products (besides e-cigarettes), or students who used both e-cigarettes and some other type of tobacco product.

```{r}
nyts_data %>%
  count(Group)
```

We will now make a plot over time of each of these groups. Since we will have 4 total groups, we will use 4 of the viridis colors.
Notice, that in this case we are grouping by three variables by simply separating the variables that we want to group by with a comma in our `group_by()` function like this: `group_by(Group, year, n)`. 

```{r}

nyts_data %>%
  group_by(Group, year, n) %>%
  summarize(group_count = n()) %>%
  mutate("Percentage of students" = group_count / n * 100) %>%
  head()

v_colors =  viridis(5)[1:4]

nyts_data %>%
  group_by(Group, year, n) %>%
  summarize(group_count = n()) %>%
  mutate("Percentage of students" = group_count / n * 100) %>%
  ggplot(aes(x = year, y = `Percentage of students`, color = Group)) +
  geom_point(size = 2) +
  geom_line() +
  scale_color_manual(breaks = c("Neither", "Combination of products",
                                "Only e-cigarettes", "Only other products"),
                     values = v_colors) +
  theme_linedraw() +
  labs(x = "Year") +
  theme(text = element_text(size = 15))
```

We can see that the majority of students did not report using any tobacco products. Of the students that did report using tobacco products, the majority of the students used both e-cigarettes and some other tobacco product. Again, a much larger percentage reported using only e-cigarettes rather than only other tobacco products in 2019.

We will further explore the relationship between e-cigarette usage and other tobacco products a bit later in the case study.


### **Question 2**
***

Now we want to look how e-cigarette smoking rates compare between males and females across time. 


We will calculate the percent ever and current e-cigarette users for each year and sex category again using the `group_by()` function and the `summarize()` function.  We will again use the `pivot_longer` function to convert our data to long format. 

As discussed above, we acknowledge that while [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only contain information for groups of individuals who answered the survey questions as male or female. For individuals that have NA values, it is unclear if the question was not answered or if the individual identifies as non-binary. Because of this uncertainty, we will filter these individuals out. 

```{r}
# use different colors here for males and females to differentiate from the previous plots
v_colors =  viridis(6)[c(3, 5)]

nyts_data %>%
     filter(!is.na(Sex)) %>%
     group_by(year, Sex) %>%
     summarize("Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
               "Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)) %>%
     pivot_longer(cols = "Ever \n (any lifetime use)":"Current \n (any past-30-day use)",
                  names_to = "User",
                  values_to = "Percentage of students") %>%
     head()

plot2 <- nyts_data %>%
     filter(!is.na(Sex)) %>%
     group_by(year, Sex) %>%
     summarize("Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
               "Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)) %>%
     pivot_longer(cols = "Ever \n (any lifetime use)":"Current \n (any past-30-day use)",
                  names_to = "User",
                  values_to = "Percentage of students") %>%
    ggplot(aes(x = year, y = `Percentage of students`, color = Sex)) +
    geom_line(aes(linetype = User)) +
    geom_point(show.legend = FALSE, size = 2) +
    scale_linetype_manual(values = c(2, 1)) +
    scale_color_manual(values = v_colors) +
    theme_linedraw() +
    theme(legend.position = "bottom",
          axis.title.x = element_blank()) +
    labs(title = "How does e-cigarette usage compare between males and females?",
         subtitle = "Current and ever users by sex",
         y = "% of students")

plot2 + theme(text = element_text(size = 15))
```

It looks like the rates are fairly similar between the sexes, however the rate for males appears to be consistently higher across time.

```{r, echo=FALSE, include=FALSE}
ggsave(here::here("img", "plot2.png"))
```

### **Question 3**
***

We are also interested in what vaping brands and flavors appear to be used the most frequently. Only the 2019 dataset has this information. Therefore, we will filter for just this year using the `filter()` function of  the `dplyr` package. We will use the `summarize()` function slightly differently this time, to calculate the total number of students using each brand using the `n()` function and the `sum()` function to calculate the percent for each brand based on the counts. We will also reorder the factor levels for the brand names so that they are in descending order of percent use, using the `fct_reorder()` function from `dplyr`. This will make them appear in decreasing order of percent use on the plot.

We will make a bar plot this time by using `geom_bar`. Importantly we assign the `stat` argument to `identity`, so that we are using the percentages that we calculated not the counts which is what is used by default. When color in specified outside of the `aes()` argument, this determines the border color of the bars, which in this case will be black.

```{r}

nyts_data %>%
  filter(year == 2019) %>%
  group_by(brand_ecig) %>%
  filter(!is.na(brand_ecig)) %>%
  summarize(n = n()) %>%
  mutate(total = sum(n),
         Percent = n * 100 / total) %>%
  mutate(brand_ecig = fct_reorder(brand_ecig, desc(Percent)))


plot3 <- nyts_data %>%
  filter(year == 2019) %>%
  group_by(brand_ecig) %>%
  filter(!is.na(brand_ecig)) %>%
  summarize(n = n()) %>%
  mutate(total = sum(n),
         Percent = n * 100 / total) %>%
  mutate(brand_ecig = fct_reorder(brand_ecig, desc(Percent))) %>%
  ggplot(aes(x = brand_ecig, y = Percent, fill = brand_ecig)) +
  geom_bar(stat = "identity", color = "black") +
  theme_linedraw() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank()
  ) +
  labs(title = "What vaping brands appear to be used the most frequently?",
       subtitle = "Brand of e-cigarette most frequently used in the last 30 days (2019)",
       y = "% of e-cigarette users responding")

plot3 + theme(text = element_text(size = 15))
```

Juul appears to be the most widely used brand. This is in agreement with a recent [article](https://tobaccocontrol.bmj.com/content/tobaccocontrol/28/2/146.full.pdf), whose most recent data was from 2017:

```{r, echo=FALSE, fig.cap="Huang J, Duan Z, Kwok J, et al. Tob Control 2019;28:146–151.", out.width = '100%'}
knitr::include_graphics(here::here("img", "HuangJ_DuanZ_KwokJ_et_al_TobaccoControl_Figure1.png"))
```

We are also interested in how the usage of different flavors has changed over time. 


To evaluate this we will calculate the percentage of students using each flavor each year - this includes usage of any type of flavored tobacco product. We will exclude 2015 data, as no specific flavor questions were asked at that time.

Recall that `NA` values are not included in calculating the total count for our percentages. However all of these flavor questions had complete reporting and did not have `NA` values. Therefore, these values reflect the percentage of students reporting using a particular favor out of all students surveyed (including those that did not use any tobacco products). Also students were allowed to select more than one flavor. You can see whether these variables had complete reporting by checking the `NA` values using the base `summary` function. Alternatively you can create a visual representation using the `vis_miss()` function of the `naniar` package.

#### {.scrollable}
```{r}
# Scroll through the output!
nyts_data %>%
  filter(year != 2015) %>%
  summary()
```
####


```{r}
nyts_data %>%
  filter(year != 2015) %>%
  select(menthol:alcoholic_drink) %>%
  vis_miss()
```


The plot above confirms that these variables have no `NA` values (because all fields indicate 100% of data is present).

```{r}
plot4 <- nyts_data %>%
  filter(year != 2015) %>%
  group_by(year) %>%
      summarize(Menthol = (mean(menthol) * 100),
       `Clove or Spice` = (mean(clove_spice) * 100),
                  Fruit = (mean(fruit) * 100),
              Chocolate = (mean(chocolate) * 100),
      `Alcoholic Drink` = (mean(alcoholic_drink) * 100),
`Candy/Desserts/Sweets` = (mean(candy_dessert_sweets) * 100),
                  Other = (mean(other) * 100)) %>%
      pivot_longer(cols = -year, 
               names_to = "Flavor",
              values_to = "Percentage of students") %>%
  rename(Year = year) %>%

 ggplot(aes(y = `Percentage of students`,
            x = Year,
            fill = reorder(Flavor, `Percentage of students`))) +
  geom_bar(stat = "identity",
           position = "dodge",
           color = "black") +
  scale_fill_viridis(discrete = TRUE) +
  theme_linedraw() +
  guides(fill = guide_legend("Flavor")) +
  labs(title = "What flavors appear to be used the most frequently?",
       subtitle = "Flavors of tobacco products used in the past 30 days")

plot4 + theme(text = element_text(size = 15))
```

From this plot, we can see that fruit flavors are the most widely used products, followed by menthol or mint flavored products. We can also see that there was a general increase in the usage of flavored products over time.

We will now look specifically at the usage of flavored e-cigarette products vs other flavored tobacco products. 

Recall that we made a variable called `Group` that identified students who used either just e-cigarette/vaping products, just other tobacco products (besides e-cigarettes), or students who used both e-cigarettes and some other type of tobacco product. We will compare the usage of these flavors for these different groups. We also perform some data summaries to decide how to order the panels (flavors) for display.


```{r}

v_colors =  viridis(5)[1:4]

plot5 <- nyts_data %>%
  filter(year != 2015) %>%
  group_by(year, Group) %>%
        summarize(Menthol = (mean(menthol) * 100),
         `Clove or Spice` = (mean(clove_spice) * 100),
                    Fruit = (mean(fruit) * 100),
                Chocolate = (mean(chocolate) * 100),
        `Alcoholic Drink` = (mean(alcoholic_drink) * 100),
`Candy/Desserts/\nSweets` = (mean(candy_dessert_sweets) * 100),
                    Other = (mean(other) * 100),
              Respondents = n()) %>%
  # converting columns between and including Menthol and Other to one column called Flavor
  pivot_longer(cols = Menthol:Other, 
           names_to = "Flavor", 
          values_to = "Percentage of students") %>%
  group_by(Flavor) %>%
  # calculate the count of students in the year/group combination who used that flavor
  mutate(affirmative = (Respondents * `Percentage of students`) / 100) %>%
  # calculate the fraction of total respondents who used that flavor
  mutate(flavor_mean = sum(affirmative) / sum(Respondents)) %>%
  ungroup() %>%
  # reorder the levels of Flavor to be in increasing order of percent of students
  mutate(flavor_mean_rank = dense_rank(flavor_mean),
         Flavor = fct_reorder(Flavor, flavor_mean_rank)) %>%
  ggplot(aes(x = year, 
             y = `Percentage of students`, 
         color = Group)) +
  facet_grid(~Flavor) +
  geom_line() +
  geom_point(show.legend = FALSE, size = 2) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "bottom",
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90),
        strip.text.x = element_text(size = 10, face = "bold")) +
  labs(title = "Among different product users, what flavors are most frequently used?")

plot5 + theme(text = element_text(size = 15))
```


We can see from this plot that there has been an increase in the number of students reporting using flavored tobacco products. Users who use both e-cigarettes and other tobacco products appear to report using flavored products the most, followed by users who only use e-cigarettes.

### **Question 4**
***

Is there a relationship between e-cigarette use and tobacco use? Now we will investigate the usage of e-cigarettes compared to other tobacco products in greater depth. 

First let's take a look at how e-cigarette usage and cigarette usage compare. We will select the data that specifically has to do with these products.

```{r}


v_colors =  viridis(6)[c(1, 4)]

nyts_data %>%
    group_by(year) %>%
    summarize("Cigarettes, Ever \n (any lifetime use)" = (mean(ECIGT, na.rm = TRUE) * 100),
            "E-cigarettes, Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
           "Cigarettes, Current \n (any past-30-day use)" = (mean(CCIGT, na.rm = TRUE) * 100),
         "E-cigarettes, Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)) %>%
    pivot_longer(cols = - year, 
             names_to = "Category", 
            values_to = "Percentage of students") %>%
    separate(Category, into = c("Product", "User"), sep = ", ") %>%
    head()


plot6 <- nyts_data %>%
  group_by(year) %>%
  summarize(
    "Cigarettes, Ever \n (any lifetime use)" = (mean(ECIGT, na.rm = TRUE) * 100),
    "E-cigarettes, Ever \n (any lifetime use)" = (mean(EELCIGT, na.rm = TRUE) * 100),
    "Cigarettes, Current \n (any past-30-day use)" = (mean(CCIGT, na.rm = TRUE) * 100),
    "E-cigarettes, Current \n (any past-30-day use)" = (mean(CELCIGT, na.rm = TRUE) * 100)
  ) %>%
  pivot_longer(cols = -year,
               names_to = "Category",
               values_to = "Percentage of students") %>%
  separate(Category, into = c("Product", "User"), sep = ", ") %>%
  ggplot(aes(
    x = year,
    y = `Percentage of students`,
    color = Product,
    linetype = User
  )) +
  geom_line() +
  geom_point(show.legend = FALSE, size = 2) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "bottom",
        axis.title.x = element_blank()) +
  labs(title = "How does e-cigarette use compare to cigarette use?",
       subtitle = "Current and ever users of e-cigarettes and cigarettes",
       y = "% of students")

plot6 + theme(text = element_text(size = 15))
```

Interesting! we can see that in 2019 the percentage of students that reported currently using e-cigarettes had surpassed those that ever tried (even just once) a cigarette. Overall cigarette usage appears to be declining over time. This is not the case for e-cigarettes.


Now we will look at students who reported that they had ever tried e-cigarettes or non-cigarette products. In this case we will not separate out users who specifically only used one or the other. Therefore, the students included in this plot who reported as having ever tried e-cigarettes might also be  current users of non-e-cigarette products or may have at least tried non-e-cigarette products.


```{r}

v_colors =  viridis(6)[c(1, 4)]

plot7 <- nyts_data %>%
  group_by(year) %>%
  summarize(
    `e-cigarette_ever` = (mean(ecig_ever, na.rm = TRUE) * 100),
    `non-e-cigarette_ever` = (mean(non_ecig_ever, na.rm = TRUE) * 100)
  ) %>%
  pivot_longer(cols = -year,
               names_to = "Category",
               values_to = "Percentage of students") %>%
  separate(Category, into = c("Product", "User"), sep = "_") %>%
  ggplot(aes(x = year,
             y = `Percentage of students`,
             color = Product)) +
  geom_line() +
  geom_point(show.legend = FALSE, size = 2) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0, 60)) +
  theme_linedraw() +
  theme(legend.position = "bottom",
        axis.title.x = element_blank()) +
  labs(title = "How does the rate of ever trying e-cigarettes
compare to ever trying other products over time?",
y = "% of students")

plot7 + theme(text = element_text(size = 15))
```


Now we will do the same, but for students who reported currently using e-cigarettes or non-e-cigarette products. 


```{r}
v_colors =  viridis(6)[c(1, 4)]

plot8 <- nyts_data %>%
  group_by(year) %>%
  summarize(
    `e-cigarette_current` = (mean(ecig_current, na.rm = TRUE) * 100),
    `non-e-cigarette_current` = (mean(non_ecig_current, na.rm = TRUE) * 100)
  ) %>%
  pivot_longer(cols = -year,
               names_to = "Category",
               values_to = "Percentage of students") %>%
  separate(Category, into = c("Product", "User"), sep = "_") %>%
  ggplot(aes(x = year, y = `Percentage of students`, color = Product)) +
  geom_line(linetype = "dashed") +
  geom_point(show.legend = FALSE, size = 2) +
  scale_color_manual(values = v_colors) +
  scale_linetype_manual(values = c(1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0, 60)) +
  theme_linedraw() +
  theme(legend.position = "bottom",
        axis.title.x = element_blank()) +
  labs(title = "How does the rate of currently using e-cigarettes
compare to currently using other products over time?",
       y = "% of students")

plot8 + theme(text = element_text(size = 15))
```

### **Putting plots together**
***

Now we will put these plots together using the `plot_grid()` function of the `cowplot` package.  We will also modify the labels using the `ggdraw()` function, which is also part of the `cowplot` package. To learn more about the `cowplot` package, refer to this [case study](https://www.opencasestudies.org/ocs-bp-RTC-analysis/#Data_Visualization){target="_blank"}.

```{r, fig.height=10}
plotA_uw <- plot1 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
  labs(title = "Tobacco product users more prevalent after 2017",
       subtitle = NULL,
       y = "% of students")

plotB_uw <- plot7 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Ever trying e-cigarettes increases &
% Ever trying other products decreases",
         subtitle = NULL,
         y = "% of students")

plotC_uw <- plot8 +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
    labs(title = "% Currently using e-cigarettes increases &
% Currently using other products decreases",
         subtitle = NULL,
         y = "% of students")

title_uw <- ggdraw() +
  draw_label(
    "Is there a relationship between e-cigarette use and tobacco use?",
    fontface = 'bold',
    size = 14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_uw <- plot_grid(plotA_uw,
                     rel_widths = c(1, 1))
plotsBC_uw <- plot_grid(plotB_uw,
                        plotC_uw,
                        rel_widths = c(1, 1))

# this will take the legend from plot1c to use as the legend for the plot we are creating
legend_uw <- get_legend(plot1c +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_uw <- plot_grid(title_uw,
                       plotsA_uw,
                       plotsBC_uw,
                       legend_uw,
                       ncol = 1,
                       rel_heights = c(0.1,
                                       1,
                                       1,
                                       0.1),
                       scale = 1.0)

figure_uw
```


### **Survey Weighting**
***

***
<details> <summary> Click here if you skipped the previous sections and want to start here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data for plots using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-vaping-case-study", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-vaping-case-study/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-vaping-case-study/master/data/wrangled/wrangled_data_with_var_for_plots.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

</details>

***
  
It turns out that our analysis thus far has been brushing an important statistical concept under the rug, related to how our data were collected. Our data come from responses to a survey, which may have a particular sampling scheme to capture data about the population we are interested in. For example, the survey may be designed to capture a set of individuals who reflect the characteristics of the population that we are interested in drawing conclusions about. However, only a fraction of the individuals who were contacted about taking the survey may have completed it, and this fraction of individuals may no longer be representative of the population. Or the survey may be designed to over-sample a particular group of interest so that individuals from that group show up more often as survey respondents than are present in the population overall. In order to account for the fact that the survey respondents may not reflect the composition of the population we want to generalize to, we can employ a technique called [survey weighting](http://www.applied-survey-methods.com/weight.html){target="_blank"}.

Survey weighting is a common technique used in survey data analysis because often the individuals that take a survey are not necessarily representative of the population that we are trying to gather information about. For example, we may have more females that respond to the survey than males because perhaps female students were more willing to participate. In this case, the proportion of data values in our data will be smaller for the males than the proportion of actual male students and larger for the females than the true proportion of actual female students. To get a better estimate of overall e-cigarette smoking rates, the data from the males can be weighted based on the true proportion of male students to amplify the contribution of the responses from the males that did participate. Conversely, the female data can be weighted to diminish the contribution if their responses to the overall picture. We will see if using survey weighting changes the general trends that we see in our data. 

Calculating survey weights involves making a weight based on the ratio of the proportion of survey respondents from a particular group and the actual proportion of that group in the population. For example, let's say that females account for 50% of the population and males account for 50 % of the population. Let's also say that 75% of the respondents to the survey were female and only 25% were males. 

Then we could calculate survey weights using this formula:

$$ \frac{\text{actual proportion of group in the population}}{\text{ proportion of group in the respondents}}$$  

Thus the weight for the females would be calculated as:

$$ \frac{.5}{.75} = .67$$  

The weight for the males would be calculated as:

$$ \frac{.5}{.25} = 2$$

Therefore each male response value would be multiplied by a factor of 2 and would have twice the contribution, while the female response values would have only about 70% of the contribution that they would have had without weighting.

Note that survey weights are in reality corrected for other aspects - for example the response rate to individual questions.

We do not need to calculate survey weights for our data as they were already supplied in the dataset, as described in the codebooks. 

#### `srvyr` package and survey design 

***

We will now use the `srvyr` package to evaluate our data using survey weights that were provided in the data for each year, as described in the respective codebooks. This package contains functions that allow the user to easily perform calculations from the data that take the survey design into account, without having to work out the math by hand.

Within the data you will see that we have three variables related to the survey sampling scheme:  `psu`, `finwgt`, and `stratum`. Details about these variables are available, for example, in the [2019 Methodology Report](./docs/2019-nyts-dataset-and-codebook-microsoft-excel/2019-nyts-methodology-report-p.pdf){target="_blank"}.

In brief they represent: 

1) `psu`: Surveys like the one used to create the data we are using often sample people based on strata. This is done to ensure that the responses are representative of the population of interest. Thus, often people first think about ensuring that surveys are conducted in a variety of geographical areas. This is often called the **primary sampling unit** or **PSU**. In [this survey](https://web.sph.harvard.edu/mch-data-connect/results/national-youth-tobacco-survey-nyts/){target="_blank"}, the county where the student's school was located was used as the PSU. 

2) `stratum`: A categorical variable that indicates subsets of the data that include respondents from different *PSUs*. In our case, strata are determined by the predominant minority in the PSU (Non-Hispanic Black or Hispanic), whether the PSU is urban or non-urban, and what percent of the students in the PSU fall into the predominant minority group. PSUs are allocated across the 16 possible strata according to the sampling scheme. These strata values allow estimates based on the survey responses to be calculated using different strata allowing for improved precision of the response estimates.

3) `finwgt`: The survey weight which was calculated based on a variety of factors.

This [link](https://web.sph.harvard.edu/mch-data-connect/results/national-youth-tobacco-survey-nyts/){target="_blank"} and this [link](https://osf.io/n7r32){target="_blank"} have more information about the study design of the data that we are using.

For detailed information on such survey designs in general see [here](http://www.asasrms.org/Proceedings/y2008/Files/301835.pdf){target="_blank"} and [here](http://ocw.jhsph.edu/courses/StatMethodsForSampleSurveys/PDFs/Lecture5.pdf){target="_blank"}.

We will use the `as_survey_design()` function of the `srvyr`package to create a survey object with a specified survey design. This is a special R object that includes information about how the survey was conducted that can be taken into account in the analysis.

There are several arguments to pay attention to:

1) The `strata` argument is used to specify the variable(s) that defined strata in the data. In this case, we will use the `stratum` variable.
2) The `ids` argument is used to define cluster ids within the data. In this case we will use the `psu` variable.
3) The `weight` argument is the  used to define which variable(s) are the survey weights.
4) The `nest = TRUE` argument, forces cluster ids (in this case the PSU) to be nested within the strata.

We can then use the `survey_mean()` function to calculate percentages of students who report using tobacco for each year while accounting for the survey design and weights. We will specify that we want [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval){target="_blank"} estimates by using the `vartype = "ci"` argument. The confidence intervals in our case give a range of possible values for the true population mean based on the data observed in the survey. We will multiply these values by 100 to get percentages. (Note: We could also have calculated confidence intervals for the unweighted results above by computing them by hand; we leave this as a potential exercise.)

Since the survey weights are specific to a single year of the survey results, we need to create survey design objects for each year separately. We will use `group_by` and `group_modify`, which is also from the `dplyr` package, to do this. We first write the function that we want to call on each group.

This function takes an input called `currYear`, which will be one set of survey responses for a specific year, and then creates a survey design based on the `stratum` and `finwgt` values specific to that year. It then calculates the percent of student respondents who have ever tried any tobacco products or who are a current user of any tobacco products accounting for the survey design and weights using `survey_mean()` as was just described. The function then wrangles the data to convert the means to percentages and reformat the data in long form for plotting.

One technical note: since some years have strata with a single PSU, we need to tell the survey weighting package how to handle estimating within strata variances. The line `options(survey.lonely.psu = "adjust")` tells R to center the stratum with the single PSU on the sample grand mean, a conservative approach to solving the problem. See further information [here](https://r-survey.r-forge.r-project.org/survey/exmample-lonely.html){target="_blank"} and [here](http://r-survey.r-forge.r-project.org/survey/html/surveyoptions.html){target="_blank"}.

### **Weighted Sample**
***

First, we show the basic output of the `survey_mean` function by year. Since we include the argument `vartype = "ci"`, we get a mean and upper and lower confidence interval bounds for the mean.

```{r}
surveyMeanA <- function(currYear) {
  options(survey.lonely.psu = "adjust")
  currYear %>%
  as_survey_design(strata = stratum,
                      ids = psu,
                  weight  = finwgt,
                     nest = TRUE) %>%
   summarize(tobacco_ever = survey_mean(tobacco_ever,
                                        vartype = "ci",
                                        na.rm = TRUE),
          tobacco_current = survey_mean(tobacco_current,
                                        vartype = "ci",
                                        na.rm = TRUE)) }


nyts_data %>%
  group_by(year) %>%
  dplyr::group_modify(~ surveyMeanA(.x)) %>%
  head()
```

Now let's make the function wrangle the output in a more usable form too:
```{r}

surveyMeanA <- function(currYear) {
  options(survey.lonely.psu = "adjust")
  currYear %>%
  as_survey_design(strata = stratum,
                      ids = psu,
                  weight  = finwgt,
                     nest = TRUE) %>%
   summarize(tobacco_ever = survey_mean(tobacco_ever,
                                        vartype = "ci",
                                        na.rm = TRUE),
          tobacco_current = survey_mean(tobacco_current,
                                        vartype = "ci",
                                        na.rm = TRUE))  %>%
  mutate_all("*", 100) %>%
  pivot_longer(everything(),
               names_to = "Type",
               values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(str_detect(Type, "_low") ~ "Lower",
                              str_detect(Type, "_upp") ~ "Upper",
                          TRUE ~ "Mean"),
         User = case_when(str_detect(Type, "ever") ~ "Ever",
                          str_detect(Type, "current") ~ "Current",
                          TRUE ~ "Mean"))}

nyts_data %>%
  group_by(year) %>%
  group_modify(~ surveyMeanA(.x))
```

We will now make a plot using this data. The confidence intervals are included using the `geom_linerange()` function of the `ggplot2` package.

```{r}
plotA_w <- nyts_data %>%
  group_by(year) %>%
  group_modify(~ surveyMeanA(.x)) %>%
  dplyr::select(-Type) %>%
  pivot_wider(names_from = Estimate,
             values_from = `Percentage of students`) %>%
  ggplot(aes(x = year, y = Mean)) +
  geom_line(aes(linetype = User)) +
  geom_linerange(aes(ymin = Lower,
                     ymax = Upper), 
                     size = 1, 
              show.legend = FALSE) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_y_continuous(breaks = seq(0, 70, by = 10),
                     labels = seq(0, 70, by = 10),
                     limits = c(0, 70)) +
    theme_linedraw() +
    theme(legend.position = "none",
          axis.title.x = element_blank()) +
    labs(title = "Tobacco product users more prevalent after 2017",
         y = "% of students")
plotA_w + theme(text = element_text(size = 15))
```

Now we can see that we have confidence interval ranges plotted for each value. 

We will make a similar plot for students who reported ever trying or who currently use e-cigarettes as opposed to tobacco in general.

```{r}
v_colors =  viridis(6)[c(1, 4)]

surveyMeanB <- function(currYear) {
  options(survey.lonely.psu = "adjust")
  currYear %>%
  as_survey_design(strata = stratum,
                      ids = psu,
                  weight  = finwgt,
                     nest = TRUE) %>%
  summarize(ecig_ever_year = survey_mean(ecig_ever, 
                                         vartype = "ci", 
                                         na.rm = TRUE),
        non_ecig_ever_year = survey_mean(non_ecig_ever, 
                                         vartype = "ci", 
                                         na.rm = TRUE)) %>%
  mutate_all("*", 100) %>%
  pivot_longer(everything(),
           names_to = "Category",
          values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(str_detect(Category, "_low") ~ "Lower",
                              str_detect(Category, "_upp") ~ "Upper",
                                                      TRUE ~ "Mean"),
             User = case_when(str_detect(Category, "ever") ~ "Ever",
                           str_detect(Category, "current") ~ "Current"),
      Product = case_when(str_detect(Category, "non_ecig") ~ "Other products",
                                                      TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate,
              values_from = `Percentage of students`)}

nyts_data %>%
  group_by(year) %>%
  group_modify( ~ surveyMeanB(.x)) %>%
  head()


plotB_w <- nyts_data %>%
  group_by(year) %>%
  group_modify( ~ surveyMeanB(.x)) %>%
  ggplot(aes(x = year, y = Mean, color = Product)) +
  geom_line() +
  geom_linerange(aes(ymin = Lower, ymax = Upper),
                 size = 1,
                 show.legend = FALSE) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_color_manual(values = v_colors) +
  scale_y_continuous(
    breaks = seq(0, 60, by = 10),
    labels = seq(0, 60, by = 10),
    limits = c(0, 60)
  ) +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title.x = element_blank()) +
  labs(title = "% Ever trying e-cigarettes increases &
% Ever trying other products decreases",
       y = "% of students")

plotB_w + theme(text = element_text(size = 15))
```


Now we will do the same but for current users:

```{r}
surveyMeanC <- function(currYear) {
  options(survey.lonely.psu = "adjust")
  currYear %>%
  as_survey_design(strata = stratum,
                      ids = psu,
                  weight  = finwgt,
                     nest = TRUE) %>%
  summarize(ecig_current_year = survey_mean(ecig_current, 
                                            vartype = "ci", 
                                            na.rm = TRUE),
        non_ecig_current_year = survey_mean(non_ecig_current, 
                                            vartype = "ci", 
                                            na.rm = TRUE)) %>%
  mutate_all("*", 100) %>%
  pivot_longer(everything(),
           names_to = "Category",
          values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(str_detect(Category, "_low") ~ "Lower",
                              str_detect(Category, "_upp") ~ "Upper",
                                                      TRUE ~ "Mean"),
             User = case_when(str_detect(Category, "ever") ~ "Ever",
                           str_detect(Category, "current") ~ "Current"),
      Product = case_when(str_detect(Category, "non_ecig") ~ "Other products",
                                                      TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate,
              values_from = `Percentage of students`)}


plotC_w <- nyts_data %>%
  group_by(year) %>%
  group_modify( ~ surveyMeanC(.x)) %>%
  ggplot(aes(x = year, y = Mean, color = Product)) +
  geom_line(aes(linetype = "dashed")) +
  geom_linerange(aes(ymin = Lower, ymax = Upper),
                 size = 1,
                 show.legend = FALSE) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0, 60)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title.x = element_blank()) +
  labs(title = "% Currently using e-cigarettes increases &
% Currently using other products decreases",
        y = "% of students")
plotC_w + theme(text = element_text(size = 15))
```

```{r, echo = FALSE}
#code to create proper legend for instructors starting at survey weighting section

plot1c <- nyts_data %>%
    dplyr::group_by(year) %>%
    dplyr::summarize("Ever \n (any lifetime use)_E-cigarette" = 
                       (mean(ecig_only_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_E-cigarette" = 
                       (mean(ecig_only_current, na.rm = TRUE) * 100),
                     "Ever \n (any lifetime use)_Non-e-cigarette" = 
                       (mean(non_ecig_only_ever, na.rm = TRUE) * 100),
                     "Current \n (any past-30-day use)_Non-e-cigarette" = 
                       (mean(non_ecig_only_current, na.rm = TRUE) * 100)) %>%
  pivot_longer(cols = - year,
               names_to = "User",
               values_to = "Percentage of students") %>%
  separate(User, into = c("User", "Product"), sep = "_") %>%
  ggplot(aes(x = year,
             y = `Percentage of students`,
             color = Product)) +
  geom_line(aes(linetype = User)) +
  geom_point(show.legend = FALSE, size = 2) +
  # this allows us to choose what type of line we want for each line
  scale_linetype_manual(values = c(2, 1)) +
  # this allows us to specify how the y-axis should appear
  scale_y_continuous(breaks = seq(0, 30, by = 10),
                        labels = seq(0, 30, by = 10),
                        limits = c(0, 30)) +
  scale_color_manual(values = v_colors) +
  # this adjusts the background style of the plot
  theme_linedraw() +
  # this moves the legend to the bottom of the plot and removes the x axis title
  theme(legend.position = "bottom",
        axis.title.x = element_blank()) +
  labs(title = "How has use of only e-cigarettes and
only tobacco products besides e-cigarettes varied over time?",
       y = "% of students")

```


Now we will put these plots together again using the `cowplot` package:

```{r}
title_w <- ggdraw() +
  draw_label(
    expression("What is the relationship between e-cigarette use and tobacco use?"),
    fontface = 'bold',
    size = 14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w <- plot_grid(plotA_w,
                     rel_widths = c(1),
                     align = "v",
                     axis = "bt")
plotsBC_w <- plot_grid(plotB_w,
                     plotC_w,
                     rel_widths = c(1, 1),
                     align = "v",
                     axis = "bt")

legend_w <- get_legend(plot1c +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w <- plot_grid(title_w,
                      plotsA_w,
                      plotsBC_w,
                      legend_w,
                      ncol = 1,
                      rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                      scale = 1.0)

figure_w
```

We can see that these figures look quite similar to the ones generated without using the survey weights.

### **Artificial Cohort**
***

Although the survey design does not allow specific individuals to be followed over time, we will use certain subsets of the data from each year to construct an artificial cohort where we follow students of the same age group as they get older. This will allow us to look at how tobacco usage changed for students who were in 8th grade in 2015 as they aged. 

All of the data so far has included all 6th-12th graders every year. Now we will look at just the data for students expected to graduate in 2019. These are the students who were in 8th grade in 2015, most of whom were 9th graders in 2016, 10th graders in 2017 and so on. We will filter the data to just the students expected to be in the graduating class of 2019.


```{r}

surveyMeanCohort <- function(currYear) {
  options(survey.lonely.psu = "adjust")
  currYear %>%
  as_survey_design(strata = stratum,
                      ids = psu,
                  weight  = finwgt,
                     nest = TRUE) %>%
  summarize(ecig_ever_year = 
              survey_mean(ecig_ever, vartype = "ci", na.rm = TRUE),
            ecig_current_year = 
              survey_mean(ecig_current, vartype = "ci", na.rm = TRUE),
            non_ecig_ever_year = 
              survey_mean(non_ecig_ever, vartype = "ci", na.rm = TRUE),
            non_ecig_current_year = 
              survey_mean(non_ecig_current, vartype = "ci", na.rm = TRUE),
            tobacco_ever_year = 
              survey_mean(tobacco_ever, vartype = "ci", na.rm = TRUE),
            tobacco_current_year = 
              survey_mean(tobacco_current, vartype = "ci", na.rm = TRUE)) %>%
  mutate_all("*", 100) %>%
  pivot_longer(everything(),
               names_to = "Category",
               values_to = "Percentage of students") %>%
  mutate(Estimate = case_when(str_detect(Category, "_low") ~ "Lower",
                              str_detect(Category, "_upp") ~ "Upper",
                                                      TRUE ~ "Mean"),
             User = case_when(str_detect(Category, "ever") ~ "Ever",
                           str_detect(Category, "current") ~ "Current"),
      Product = case_when(str_detect(Category, "non_ecig") ~ "Other products",
                           str_detect(Category, "tobacco") ~ "Any tobacco product",
                                                      TRUE ~ "E-cigarettes")) %>%
  dplyr::select(-Category) %>%
  pivot_wider(names_from = Estimate,
              values_from = `Percentage of students`)}


Cohort_data <- nyts_data %>%
  filter((Grade == "8" & year == 2015) |
         (Grade == "9" & year == 2016) |
         (Grade == "10" & year == 2017) |
         (Grade == "11" & year == 2018) |
         (Grade == "12" & year == 2019)
         ) %>%
  group_by(year) %>%
  group_modify(~ surveyMeanCohort(.x))

head(Cohort_data)
```

We will now make similar plots to those above for this subset of the data:

```{r}
plotA_w_8 <- Cohort_data %>%
  filter(Product == "Any tobacco product") %>%
  ggplot(aes(x = year, y = Mean)) +
  geom_line(aes(linetype = User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_y_continuous(breaks = seq(0, 70, by = 10),
                     labels = seq(0, 70, by = 10),
                     limits = c(0, 70)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title.x = element_blank()) +
  labs(title = "Tobacco product use became increasingly prevalent",
       y = "% of students")

plotB_w_8 <- Cohort_data %>%
  filter(Product != "Any tobacco product", User == "Ever") %>%
  ggplot(aes(x = year, y = Mean, color = Product)) +
  geom_line(linetype = 1) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) +
  scale_y_continuous(breaks = seq(10, 60, by = 10), limits = c(10, 60)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title.x = element_blank()) +
  labs(title = "% ever trying tobacco products increases",
       y = "% of students")

plotC_w_8 <- Cohort_data %>%
  filter(Product != "Any tobacco product", User == "Current") %>%
  ggplot(aes(x = year, y = Mean, color = Product)) +
  geom_line(aes(linetype = User)) +
  geom_linerange(aes(ymin = Lower, ymax = Upper), size = 1) +
  scale_linetype_manual(values = c(2, 1)) +
  scale_y_continuous(breaks = seq(0, 60, by = 10), limits = c(0, 60)) +
  scale_color_manual(values = v_colors) +
  theme_linedraw() +
  theme(legend.position = "none",
        axis.title.x = element_blank()) +
  labs(title = "E-cigarette use surpasses use of other products",
       y = "% of students")

title_w_8 <- ggdraw() +
  draw_label(
  expression("For students in the 2019 graduating class, how are vaping and tobacco use related?"),
    fontface = 'bold',
    size = 14,
    x = 0,
    hjust = 0
  ) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_w_8 <- plot_grid(plotA_w_8,
                        rel_widths = c(1),
                        align = "v",
                        axis = "bt")

plotsBC_w_8 <- plot_grid(plotB_w_8,
                         plotC_w_8,
                         rel_widths = c(1, 1),
                         axis = "bt")

legend_w_8 <- get_legend(plot1c +
                       theme(legend.position = "bottom",
                             legend.direction = "horizontal"))

figure_w_8 <- plot_grid(title_w_8,
                        plotsA_w_8,
                        plotsBC_w_8,
                        legend_w_8,
                        ncol = 1,
                        rel_heights = c(0.1,
                                      1,
                                      1,
                                      0.1),
                        scale = 1.0
)

figure_w_8
```


## **Data Analysis**
*** 

If you have been following along but stopped, we could load the wrangled data from the "data" directory like so:

```{r}
load(here::here("data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

***
<details> <summary> If you skipped the previous sections click here. </summary>

First you need to install and load the `OCSdata` package:

```{r, eval=FALSE}
install.packages("OCSdata")
library(OCSdata)
```

Then, you may load the wrangled data for plots using the following code:

```{r, eval=FALSE}
wrangled_rda("ocs-bp-vaping-case-study", outpath = getwd())
load(here::here("OCSdata", "data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

If the package does not work for you, alternatively, an RDA file (stands for R data) of the data can be found in our [GitHub repository](https://github.com//opencasestudies/ocs-bp-vaping-case-study/tree/master/data/wrangled) or slightly more directly [here](https://raw.githubusercontent.com/opencasestudies/ocs-bp-vaping-case-study/master/data/wrangled/wrangled_data_with_var_for_plots.rda). Download this file and then place it in your current working directory within a subdirectory called "wrangled" within a subdirectory called "data" to copy and paste our code. We used an RStudio project and the [`here` package](https://github.com/jennybc/here_here) to navigate to the file more easily. 

```{r}
load(here::here("data", "wrangled", "wrangled_data_with_var_for_plots.rda"))
```

***
<details> <summary> Click here to see more about creating new projects in RStudio. </summary>

You can create a project by going to the File menu of RStudio like so:


```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "New_project.png"))
```

You can also do so by clicking the project button:

```{r, echo = FALSE, out.width="60%"}
knitr::include_graphics(here::here("img", "project_button.png"))
```

See [here](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) to learn more about using RStudio projects and [here](https://github.com/jennybc/here_here) to learn more about the `here` package.

</details>
***
</details>

***

As an extension, we will include some material here on [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression){target="_blank"} and survey-weighted logistic regression that would be appropriate for answering Question 2 ("How does e-cigarette use compare between males and females?") for a single year using statistical inference, rather than just data visualizations.

We can look at the final figure in the section on Question 2 and see that among both current and ever users of e-cigarettes, a higher percentage of males than females use or have used e-cigarettes.'

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "plot2.png"))
```

But what if we wanted to quantify this effect and assess whether this difference can be considered statistically significant? This is where the tool of logistic regression can come in handy.

### **Logistic regression motivation**
***

Here, we will approach the topic of [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression){target="_blank"} assuming some prior knowledge of simple and multiple linear regression. These have been covered in [another case study](https://opencasestudies.github.io/ocs-bp-diet/).

As a brief reminder, a [linear regression model](https://en.wikipedia.org/wiki/Linear_regression){target="_blank"} allows us to estimate the relationship between an outcome variable, call it $Y$, and a set of one or more input variables, $X_1, X_2, ..., X_n$. We can write a simple linear regression model as:

$$ E(Y) = \beta_0 + \beta_1 X_1$$

where the $E(Y)$ means the expected value of $Y$, i.e., our model gives us an estimate of the mean value of $Y$ given a particular input $X_1$. Here, $\beta_1$ quantifies the expected difference in $Y$ comparing two individuals who are one unit apart in $X_1$.

Similarly, we can include more than one predictor so that our equation might look like:

$$ E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

Here, $\beta_1$ quantifies the expected difference in $Y$ comparing two individuals who are one unit apart in $X_1$, holding $X_2$ at a fixed value. This material is covered in more detail [elsewhere](https://rafalab.github.io/dsbook/linear-models.html#linear-regression-in-the-tidyverse){target="_blank"} and in [another case study](https://opencasestudies.github.io/ocs-bp-diet/){target="_blank"}.

In the case of our question of interest for this case study, however, our outcome variable is of a particular type: it is a Yes-No or *binary* outcome, since each student respondent either is or is not a current user of e-cigarettes. This means in our setting $Y$ only takes on two values: TRUE or FALSE, which we can also think of as 1 and 0. For this kind of outcome variable, we need a special kind of regression, called *logistic regression*. And instead of using a linear model to estimate $Y$ itself for a given set of input variables, we will use a linear model to estimate the *log odds that Y=1* for a given set of input variables.

If we define $p=P(Y=1)=E(Y)$, the standard simple logistic regression equation can be written as:

$$logit(p)= log_e (\frac{p}{1-p})= \beta_0 + \beta_1 X$$  

In our case, we would define $p$ as the probability that a student respondent is a current e-cigarette user, since $Y$ is the binary variable that is 1 when a student respondent is a current e-cigarette user and 0 if not. The value $\frac{p}{1-p}$ is called the *odds* that $Y$ is equal to 1.

This *logit* function is short for *log odds*. 

While it may feel strange working with the log odds that our outcome variable is equal to 1, there are some intuitive reasons why it makes sense to do this from the point of view of fitting a line to our data and interpreting the results. The [log odds](https://en.wikipedia.org/wiki/Logit){target="_blank"} can take any value on the real number line, allowing us to estimate our model parameters with no constraints. If we instead tried to use say $p$ as the outcome variable, we would somehow need to constrain $\beta_0 + \beta_1 X$ to be between 0 and 1, since this is the only possible range of values for a [probability](https://en.wikipedia.org/wiki/Probability){target="_blank"}. A second, more technical reason is that working on the log odds scale gives us a nice formulation of our *likelihood*, i.e., a function of our unknown parameters that incorporates our observed data. We use this [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function){target="_blank"} to estimate our unknown parameters (here, $\beta_0$ and $\beta_1$) and this formulation gives us a nice way to calculate the [maximum likelihood estimates](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation){target="_blank"} of these parameters.

The intuitive explanation of logistic regression then is that we are fitting a line to the log odds of $Y$, as it varies with different values of $X$. We will work through an example below to illustrate and hopefully clarify this.


### **Logistic regression "by hand" and by model**
***

For simplicity, we will consider just the set of current users of e-cigarettes in 2015. How much more likely is a male student to be a current e-cigarette user than a female student?

We can get a first look at the answer by calculating the percent of females and percent of males who are current e-cigarette users or not: 

```{r}
nyts_data %>% 
  filter(year == 2015, !is.na(Sex)) %>%
  group_by(Sex, ecig_current) %>%
  summarize(n = n()) %>%
  mutate(pct = n / sum(n))
```

We can see that the percentage is lower for females than for males. Another way of organizing this data would be to make a [2x2 table](https://en.wikipedia.org/wiki/Contingency_table){target="_blank"}, a data summarization frequently used in public health settings.

|                                 | Male     | Female| Total|
|:--------------------------------|---------:|------:|-----:|
|Current e-cigarette user         |      1171|    772|  1943|
|Not current e-cigarette user     |      7787|   7850| 15637|
|Total                            |      8958|   8622| 17580|


As discussed above, one important ingredient in understanding the output of logistic regression is understanding the concept of an odds and an odds ratio. We can ask, among males who responded to the survey in 2015, what are the odds of being a current e-cigarette user? How about for females? How do these odds compare? The [odds ratio](https://en.wikipedia.org/wiki/Odds_ratio){target="_blank"} is a tool frequently used in public health to compare the odds between two groups. 

In this case:

* Odds of current e-cigarette use for males: 1171 / 7787 = 0.150
* Odds of current e-cigarette use for females: 772 / 7850 = 0.098
* Odds ratio of e-cigarette use for females as compared to males: $$OR = \frac{odds \ for \ females}{odds \ for \ males} = \frac{772 / 7850}{1171 / 7787} = 0.65$$
* Log odds ratio: $log_e(OR) = log(1.53) = -0.42$

We would interpret these values by saying that the odds of being a current e-cigarette user for women are around 0.65 times the odds of being a current e-cigarette user for men, or 35% lower for women. This matches what we can see in our data visualizations for Question 2.

```{r, echo = FALSE, out.width = "800 px"}
knitr::include_graphics(here::here("img", "plot2.png"))
```

We could also answer this question using logistic regression:

$$log(odds \ of \ current \ e-cigarette \ use) = \beta_0 + \beta_1 \cdot Sex$$

Here is how we would fit our logistic regression model, using the `glm` function from base R. We also use the `tidy` function from the `broom` package to create a tibble of the model output.

```{r}
dat2015 <- nyts_data %>% 
  filter(year == 2015, !is.na(Sex))

currEcigSex <- glm(ecig_current ~ Sex, data = dat2015, family = binomial(link = "logit"))
currEcigSexTidy <- broom::tidy(currEcigSex)
currEcigSexTidy
```

Looking at this output, our estimated logistic regression equation is:

$$log(odds \ of \ current \ e-cigarette \ use) = \beta_0 + \beta_1 \cdot Sex = -1.89 - 0.425 \cdot (Sex == female)$$

Our $\beta_1$ parameter tells us that the log odds of being a current e-cigarette user are 0.425 lower for females compared to males, i.e., the difference in log odds of being a current e-cigarette user for females compared to males is -0.425. And we can notice that this value matches the log odds ratio that we calculated by hand from the 2x2 table above. This is because a difference in log odds is the same as a log odds ratio -- remember your [rules of logs](https://www.rapidtables.com/math/algebra/Logarithm.html#log-rules){target="_blank"}!

We can interpret this output as follows:

* $-0.425 = \beta_1 = \log(OR)$
* The log odds of being a current e-cigarette user is 0.425 lower for females compared to males
* $0.65 = e^{-0.425} = e^{\beta_1} = OR$
* The odds of being a current e-cigarette user for females is 0.65 times the odds for males.
* The odds of being a current e-cigarette user for females is 35% lower than the odds for males.

We can also look at the other columns of the model output to assess whether our data indicate that $\beta_1$ is statistically significantly different from 0. The p-value for the `Sex` variable in our model is `r format(currEcigSexTidy %>% filter(term == "Sexfemale") %>% pull(p.value), digits = 3)`. Since this value is < 0.05, we would reject the null hypothesis that $\beta_1 = 0$ based on our model output. See this [case study](https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity/#Alpha){target="_blank"} for more information about alpha and p-values.

Simple logistic regression can be extended to include additional variables in the model, for example to adjust for potential confounding variables such as `Age` or `Grade`. For example, suppose we want to estimate the effect of `Sex` on current e-cigarette use, holding `Age` constant. We could fit the model:

$$log(odds \ of \ current \ e-cigarette \ use) = \beta_0 + \beta_1 \cdot Sex + \beta_2 \cdot Age$$


```{r}
currEcigSexAge <- glm(ecig_current ~ Sex + Age, 
                      data = dat2015, 
                      family = binomial(link = "logit"))
tidy(currEcigSexAge)
```

Now our $\beta_1$ parameter tells us that the log odds of being an current e-cigarette user are 0.385 lower for females compared to males, within an age group, or holding `Age` constant.

### **Survey weighted logistic regression**
***

As discussed elsewhere in this case study, our data come from a survey, where individuals were not necessarily sampled in direct proportion to their population representation, so it is necessary to incorporate survey weights into our analysis to perform inference about the population of interest. Luckily, there are implementations of survey-weighted logistic regression in R that can do this for us.

We first create our survey design object using the `as_survey_design` function from the `srvyr` package, and then use the `svyglm` function from the `survey` package to fit our logistic regression model.

```{r}
dat2015_survey_design <- dat2015 %>%
                          as_survey_design(strata = stratum,
                                            ids = psu,
                                            weight  = finwgt,
                                            nest = TRUE)


currEcigSex_svy <- survey::svyglm(ecig_current ~ Sex,
                                  family = quasibinomial(link = 'logit'),
                                  design = dat2015_survey_design)
tidy(currEcigSex_svy)
```

Note that in this case, we use the [quasi](https://en.wikipedia.org/wiki/Quasi-likelihood){target="_blank"}-binomial family rather than the binomial family, which allows the data to not necessarily look like a sample from a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution){target="_blank"}. This is because by incorporating our survey weights, it is as if each individual does not have a 0 or 1 as their outcome variable, so we get a warning if we do not use this value for family.

From this model output, we can see that the estimate incorporating survey weights is a little different. The interpretation is as follows:

* $-0.383 = \beta_1 = \log(OR)$
* The log odds of being a current e-cigarette user is 0.383 lower for females than for males, taking survey weights into account.
* $0.68 = e^{-0.383} = e^{\beta_1} = OR$
* The odds of being a current e-cigarette user for females is 0.68 times the odds for males, taking survey weights into account.
* The odds of being a current e-cigarette user for females is 32% lower than the odds for males, taking survey weights into account.


As above, we can also fit a more complicated model with additional covariates.

```{r}
currEcigSexAge_svy <- survey::svyglm(ecig_current ~ Sex + Age,
                      family = quasibinomial(link = 'logit'), 
                      design = dat2015_survey_design)
tidy(currEcigSexAge_svy)
```

In this case, we can see that our estimated difference in log odds for females compared to males, -0.380, is not much different whether we are holding `Age` constant or not.


## **Summary**
*** 


### **Summary Plot**
***

Finally we will put our plots together to create a plot that describes the relationship between e-cigarette usage and overall tobacco use, combining both our first set of unweighted results, and those calculated using the `srvyr` package. We will also make the background of the plot white.

```{r}
title_final <- ggdraw() +
  draw_label(
    expression("What is the relationship between e-cigarette use and tobacco use?"),
    fontface = 'bold',
    size = 16,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_uw_final <- ggdraw() +
  draw_label(
    expression(~ 6^th ~ "-" ~ 12^th ~ "graders, unweighted"),
    size = 12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_final <- ggdraw() +
  draw_label(
    expression(~ 6^th ~ "-" ~ 12^th ~ "graders, weighted"),
    fontface = 'bold',
    size = 12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_w_8_final <- ggdraw() +
  draw_label(
    expression("Approximate graduating \n class of 2019, weighted"),
    fontface = 'bold',
    size = 12,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

subtitle_final <- plot_grid(subtitle_uw_final,
                            subtitle_w_final,
                            subtitle_w_8_final,
                            ncol = 3)

plotsA_title_final <- ggdraw() +
  draw_label(
    expression("Prevalence of any tobacco product use by user type"),
    size = 14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsA_final <- plot_grid(plotA_uw + theme(plot.title = element_blank()),
                          plotA_w + theme(plot.title = element_blank()),
                          plotA_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsB_title_final <- ggdraw() +
  draw_label(
    expression("Prevalence of ever trying by product type"),
    size = 14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsB_final <- plot_grid(plotB_uw + theme(plot.title = element_blank()),
                          plotB_w + theme(plot.title = element_blank()),
                          plotB_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

plotsC_title_final <- ggdraw() +
  draw_label(
    expression("Prevalence of current use by product type"),
    size = 14,
    x = 0.5) +
  theme(
    plot.margin = margin(0, 0, 0, 0)
  )

plotsC_final <- plot_grid(plotC_uw + theme(plot.title = element_blank()),
                          plotC_w + theme(plot.title = element_blank()),
                          plotC_w_8 + theme(plot.title = element_blank()),
                          ncol = 3,
                          align = "v",
                          axis = "bt")

legend_final <- get_legend(plot1c +
                             theme(legend.position = "bottom",
                                   legend.direction = "horizontal"))

final_plot <- plot_grid(title_final,
          plotsA_title_final,
          subtitle_final,
          plotsA_final,
          plotsB_title_final,
          subtitle_final,
          plotsB_final,
          plotsC_title_final,
          subtitle_final,
          plotsC_final,
          legend_final,
          ncol = 1,
          rel_heights = c(0.2,
                          0.2,
                          0.15,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2,
                          0.1,
                          1,
                          0.2)) +
  theme(plot.background = element_rect(fill = "white"))

final_plot
```

```{r, echo=FALSE, include=FALSE}
ggsave(here::here("img", "final_plot.png"))
```

### **Synopsis**
***

In this case study, we used data from the [National Youth Tobacco Survey (NYTS)](https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/index.htm){target="_blank"}, an annual survey that asks students in high school and middle school (grades 6-12) about tobacco usage in the United States of America. We used data from **2015-2019** due to the fact that these years are the most recent that asked questions regarding e-cigarette usage. 

We used this data to answer these questions:

1) How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
2) How does e-cigarette use compare between males and females?
3) What vaping brands and flavors appear to be used the most frequently?  
We will base this on the following survey questions:   
> "During the past 30 days, what brand of e-cigarettes did you usually use?"   
> "What flavors of tobacco products have you used in the past
30 days?" 

4) Is there a relationship between e-cigarette/vaping use and other tobacco use?

We showed how to work with the data in the format provided (Excel), how to to use the codebooks to decide what variables to use to answer our questions and how to clean and recode the data from the survey for our visualizations and analysis. We made visualizations of our summary statistics over time, to illustrate the trends present in the data for different products and groups of student respondents.

In answer to our questions, we found that tobacco use has gone up slightly overall between 2015 and 2019, with little difference in rates of change comparing males to females. This slight increase is the result of a large increase in e-cigarette/vaping use, coupled with a decrease in use of other tobacco products. The most used brand of e-cigarette/vaping products is Juul, and fruit, menthol and candy/desserts/sweets are the most commonly used flavors.

We then introduced the statistical concept of survey weighting, illustrating how to calculate usage percentages using survey-weighted means, and compare the results in the weighted and unweighted cases. We also introduced the topic of logistic regression and we performed a survey-weighted logistic regression analysis to compare the vaping rates of male and female students. 

## **Suggested Homework**
*** 

<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
<div class = "blue">

+ Calculate confidence intervals for the unweighted estimates and add the appropriate error bars to the main figures.
+ Apply survey weights to one of the figures produced in this case study in which weighted estimates were not produced. Include error bars in the updated figure.
    + Does the figure change after the application of survey weights?
    + If so, describe how. 
+ Reproduce `final_plot` above for a different cohort of your choice.
+ Focusing on a single year of data, explore demographic factors that contribute to tobacco use of some kind. Compare results of unweighted and weighted analysis (for example, using the `svyglm` function to calculate survey-weighted logistic regression estimates).

</div>


## **Additional Information**
*** 

### **Helpful Links**
***

[Tidyverse](https://www.tidyverse.org/){target="_blank"}  
[Writing functions](https://r4ds.had.co.nz/functions.html){target="_blank"}   
[Codebooks](https://www.lib.ncsu.edu/data/icpsrfaq#whatare){target="_blank"}  
[Longitudinal studies](https://www.bmj.com/about-bmj/resources-readers/publications/epidemiology-uninitiated/7-longitudinal-studies){target="_blank"}   
[Panel data](https://en.wikipedia.org/wiki/Panel_data){target="_blank"}    
[Cross-sectional data](https://en.wikipedia.org/wiki/Cross-sectional_data){target="_blank"}    
[Survey weighting](http://www.applied-survey-methods.com/weight.html){target="_blank"}  
[Confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval){target="_blank"}   
[Introduction to Logarithms](https://www.mathsisfun.com/algebra/logarithms.html){target="_blank"}   
[Logarithm](https://en.wikipedia.org/wiki/Logarithm){target="_blank"} 
[Rules of logs](https://www.rapidtables.com/math/algebra/Logarithm.html#log-rules){target="_blank"} [Odds ratio](https://en.wikipedia.org/wiki/Odds_ratio){target="_blank"}    
[Log odds](https://en.wikipedia.org/wiki/Logit){target="_blank"}   
[2x2 table](https://en.wikipedia.org/wiki/Contingency_table){target="_blank"}  
[Probability](https://en.wikipedia.org/wiki/Probability){target="_blank"}   
[Likelihood function](https://en.wikipedia.org/wiki/Likelihood_function){target="_blank"}   
[Maximum likelihood estimates](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation){target="_blank"}   
[Linear regression model](https://en.wikipedia.org/wiki/Linear_regression){target="_blank"}   
[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression){target="_blank"}   
[Quasi-likelihood](https://en.wikipedia.org/wiki/Quasi-likelihood){target="_blank"}   
[Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution){target="_blank"}   

For more information on linear regression see this [book](https://rafalab.github.io/dsbook/linear-models.html#linear-regression-in-the-tidyverse){target="_blank"} and this [case study](https://opencasestudies.github.io/ocs-bp-diet/){target="_blank"}.

For more information on survey designs see [here](http://www.asasrms.org/Proceedings/y2008/Files/301835.pdf){target="_blank"} and [here](http://ocw.jhsph.edu/courses/StatMethodsForSampleSurveys/PDFs/Lecture5.pdf){target="_blank"}.  

For more information on survey analysis in R [here](https://r-survey.r-forge.r-project.org/survey/exmample-lonely.html){target="_blank"} and [here](http://r-survey.r-forge.r-project.org/survey/html/surveyoptions.html){target="_blank"}.   

If you are interested in an info-graphic summary of the 2019 findings, and links to many more resources about this topic and dataset, see the FDA's website [here](https://www.fda.gov/tobacco-products/youth-and-tobacco/youth-tobacco-use-results-national-youth-tobacco-survey){target="_blank"}.

<u>**Packages used in this case study:**</u>

Package   | Use in this case study                                                                       
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"}       | to easily load and save data  
[readxl](https://readxl.tidyverse.org/){target="_blank"}      | to import the data in the excel files 
[magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html){target="_blank"} | to use the compound assignment pipe operator `%<>%`
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"}    | to manipulate the character strings within the data  
[purrr](https://purrr.tidyverse.org/){target="_blank"}   | to import the data in all the different excel and csv files efficiently
[dplyr](https://dplyr.tidyverse.org/){target="_blank"}      | to arrange/filter/select/compare specific subsets of the data  
[readr](https://readr.tidyverse.org/){target="_blank"}      | to import the CSV file data
[tidyr](https://tidyr.tidyverse.org/){target="_blank"}      | to rearrange data in wide and long formats 
[ggplot2](https://ggplot2.tidyverse.org/){target="_blank"}    | to make visualizations with multiple layers
[scales](https://cran.r-project.org/web/packages/scales/scales.pdf){target="_blank"}    | to allow us to look at the colors within the viridis package
[viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html){target="_blank"}    | to make plots with a color palette that is compatible with color blindness
[forcats](https://forcats.tidyverse.org/){target="_blank"}    | to allow for reordering of factors in plots
[naniar](https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html){target="_blank"}  | to make a visualization of missing data
[syrvr](https://cran.r-project.org/web/packages/srvyr/srvyr.pdf){target="_blank"} | to use survey weights
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} | to allow plots to be combined 
[broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html){target="_blank"} | to create nicely formatted model output
[survey](http://r-survey.r-forge.r-project.org/survey/index.html){target="_blank"} | to fit survey-weighted logistic regression

### **Session info**
***

```{r}
sessionInfo()
```

**Estimate of RMarkdown Compilation Time: **

```{r, echo=FALSE}
rmarkdown:::perf_timer_stop("render")
pts = rmarkdown:::perf_timer_summary()
cat("About", round(pts$time[1]/1000 + 5), "-", round(pts$time[1]/1000 + 15),"seconds")
```

This compilation time was measured on a PC machine operating on Windows 10. This range should only be used as an estimate as compilation time will vary with different machines and operating systems. 

### **Acknowledgments**
***

We would like to acknowledge [Renee Johnson](https://www.jhsph.edu/faculty/directory/profile/2848/renee-m-johnson) for assisting in framing the major direction of the case study and for reviewing the case study for subject matter content.

We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and developing the `OCSdata` package. 

We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work. 

<script type='text/javascript' id='clustrmaps' src='//cdn.clustrmaps.com/map_v2.js?cl=080808&w=a&t=tt&d=tygDY_TbG7XE2HVUC6Eox6Z74i2E5MUBqCA83PgJ9SY&co=ffffff&cmo=3acc3a&cmn=ff5353&ct=808080'></script>