This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 8
/
maps.Rmd
1031 lines (795 loc) · 42.3 KB
/
maps.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output: html_document
editor_options:
chunk_output_type: console
---
```{r}
i <- 1
chapter_number <- 4
source("_common.R")
```
# Creating Maps {#maps-chapter}
When I first started learning R, I considered it a tool for working with numbers, not shapes, so I was surprised when I saw people using it to make maps. Abdoul Madjid, a developer, has been creating maps with R for several years. Recently, he used one to visualize rates of COVID-19 in the United States in 2021.
You might think you need specialized mapmaking software like ArcGIS to make maps, but this tool is expensive, and while Excel has added support for map-making in recent years, its features are limited (for example, you can’t use it to make maps based on street addresses). Even QGIS, an open source tool similar to ArcGIS, still requires learning new skills.
Using R for map-making has benefits. It’s way more flexible than Excel, way less expensive than ArcGIS, and is based on syntax you already know. It also lets you perform all of your data manipulation tasks with one tool and apply the principles of high-quality data visualization discussed in Chapter 2. For example, Madjid used R to obtain his data, analyze it, and make his COVID-19 map, which you can see in Figure \@ref(fig:madjid-covid-map).
```{r results='asis'}
print_nostarch_file_name(file_type_to_print = "png")
```
```{r madjid-covid-map, out.width="100%", fig.cap="Abdoul Madjid's map of COVID in the United States in 2021"}
knitr::include_graphics(here::here("assets/covid-map.png"))
```
```{r results='asis'}
save_image_for_nostarch(here::here("assets/covid-map.png"))
```
In this chapter, we’ll explore principles of working with simple features geospatial data, then walk through Madjid’s code to understand how he created this high-quality map. We’ll also discuss where to find geospatial data and how to use it to make your own maps.
## The Briefest of Primers on Geospatial Data {-}
You don’t need to be a GIS expert to make maps. But you do need to understand a few things about how geospatial data works, starting with its two main types: *vector* and *raster*. Vector data uses points, lines, and polygons to represent the world. Raster data, which often comes from digital photographs, ties each pixel in an image to a specific geographic location. Vector data tends to be easier to work with, and we’ll be using it exclusively in this chapter.
In the past, working with geospatial data meant mastering competing standards, each of which required learning a different approach. Today, though, most people use the *simple features* model for working with vector geospatial data (often abbreviated as *sf*), which is way easier to understand. For example, I can import simple features data about the state of Wyoming using this code:
```{r}
# This code chunk is just to create the data
library(tigris)
library(sf)
# wyoming <- states(progress_bar = FALSE) %>%
# st_transform("WGS84") %>%
# select(NAME) %>%
# filter(NAME == "Wyoming") %>%
# st_cast(to = "POLYGON")
#
# wyoming %>%
# st_write("data/wyoming.geojson")
wyoming <- read_sf("https://data.rwithoutstatistics.com/wyoming.geojson")
```
```{r}
library(sf)
wyoming <- read_sf("https://data.rwithoutstatistics.com/wyoming.geojson")
```
After doing this, I can now take a look at the data:
```{r echo = TRUE}
wyoming
```
You can see that it has two columns, one for the state name (`NAME`) and another called `geometry`. This data looks like the data frames you’re used to encountering, aside from two major differences: There is a bunch of metadata above the data frame, and our simple features data contains geographical data in a variable called `geometry`.
The metadata states that we have one feature and one field. The feature referenced here is the row, and the field is the `NAME` variable, which contains non-spatial data. Because the `geometry` column must be present for a data frame to be geospatial data, it is not counted as a field. Let’s look at each part of this simple features data.
### Geometry Type {-}
The geometry type represents the shape of the geospatial data we’re working with. These types are typically written in all caps. In this case, the relatively simple `POLYGON` type represents a single polygon. We can use ggplot to display this data by calling `geom_sf()`, a special geom designed to work with simple features data:
```{r wyoming-map-code, echo = TRUE, eval = FALSE}
library(tidyverse)
wyoming %>%
ggplot() +
geom_sf()
```
Figure \@ref(fig:wyoming-map-plot) shows the resulting map of Wyoming. It may not look like much, but, hey, I wasn’t the one who chose to make Wyoming a nearly perfect rectangle!
```{r results='asis'}
print_nostarch_file_name()
```
```{r wyoming-map-plot, ref.label = "wyoming-map-code", echo = FALSE, include = TRUE, fig.cap="A map of Wyoming"}
```
```{r results='asis'}
save_figure_for_nostarch()
```
```{r}
# This code is just to get the data the first time
# wy_ev_stations <- read_csv("data/ev-charging-stations.csv") %>%
# janitor::clean_names() %>%
# filter(state == "WY") %>%
# filter(fuel_type_code == "ELEC") %>%
# st_as_sf(
# coords = c("x", "y"),
# crs = "WGS84"
# ) %>%
# select(objectid)
#
# wy_ev_stations %>%
# st_write("data/wyoming-all-ev-stations.geojson")
#
# wy_ev_stations %>%
# slice(1) %>%
# st_write("data/wyoming-one-ev-station.geojson")
```
Other geometry types used in simple feature data include `POINT`, to display elements such as a pin on a map that represents a single location. Try importing data that shows a single electrical vehicle charging station in Wyoming, then placing this data on a map:
```{r}
# This is code just to import the data
# wy_roads <- primary_secondary_roads(
# state = "Wyoming",
# progress_bar = FALSE
# ) %>%
# janitor::clean_names()
#
# wy_roads %>%
# filter(linearid == 11010932011560) %>%
# st_write("data/wyoming-highway-30.geojson")
#
# wy_roads %>%
# st_write("data/wyoming-roads.geojson")
```
```{r eval = FALSE, echo = TRUE}
wyoming_one_ev_station <- read_sf("https://data.rwithoutstatistics.com/wyoming-one-ev-station.geojson")
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
data = wyoming_one_ev_station,
shape = 21,
fill = "#ff7400",
color = "white",
size = 3
)
```
Figure \@ref(fig:ev-stations-map) shows the resulting map.
```{r results='asis'}
print_nostarch_file_name()
```
```{r ev-stations-map, fig.cap = "A map of a single electric vehicle charging station in Wyoming"}
wyoming_one_ev_station <- read_sf("https://data.rwithoutstatistics.com/wyoming-one-ev-station.geojson")
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
data = wyoming_one_ev_station,
shape = 21,
fill = "#ff7400",
color = "white",
size = 3
)
```
```{r results='asis'}
save_figure_for_nostarch()
```
The `LINESTRING` geometry type is for a set of points that can be connected with lines, often used to represent roads. We can import and plot the following data to show a `LINESTRING`:
```{r echo = TRUE, eval = FALSE}
wyoming_highway_30 <- read_sf("data/wyoming-highway-30.geojson")
wyoming_highway_30 %>%
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
color = "#ff7400",
linewidth = 1
)
```
Figure \@ref(fig:wy-roads-map) shows us the resulting map, which is a section of US Highway 30 that runs through Wyoming.
```{r results='asis'}
print_nostarch_file_name()
```
```{r wy-roads-map, fig.cap = "A map of a section of U.S. Highway 30 running through Wyoming"}
wyoming_highway_30 <- read_sf("data/wyoming-highway-30.geojson")
wyoming_highway_30 %>%
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
color = "#ff7400",
linewidth = 1
)
```
```{r results='asis'}
save_figure_for_nostarch()
```
Each of these geometry types has a MULTI variation (`MULTIPOINT`, `MULTILINESTRING`, and `MULTIPOLYGON`) that combines multiple instances of the type in one row of data. We can import and plot `MULTIPOINT` data showing all electric vehicle charging stations in Wyoming using this code:
```{r eval = FALSE, echo = TRUE}
wyoming_all_ev_stations <- read_sf("https://data.rwithoutstatistics.com/wyoming-all-ev-stations.geojson")
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
data = wyoming_all_ev_stations,
fill = "#ff7400",
color = "white",
shape = 21,
size = 3
)
```
Figure \@ref(fig:wyoming-ev-stations-map) shows what the map made from this code looks like.
```{r results='asis'}
print_nostarch_file_name()
```
```{r wyoming-ev-stations-map, fig.cap = "A map of all electric vehicle charging stations in Wyoming"}
wyoming_all_ev_stations <- read_sf("data/wyoming-all-ev-stations.geojson")
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
data = wyoming_all_ev_stations,
fill = "#ff7400",
color = "white",
shape = 21,
size = 3
)
```
```{r results='asis'}
save_figure_for_nostarch()
```
Likewise, we can use MULTILINESTRING data to show not just one road, but all major roads in Wyoming:
```{r eval = FALSE, echo = TRUE}
wyoming_roads <- read_sf("https://data.rwithoutstatistics.com/wyoming-roads.geojson")
wyoming_roads %>%
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
color = "#ff7400",
linewidth = 1
)
```
Figure \@ref(fig:wyoming-roads-map) shows the resulting map.
```{r results='asis'}
print_nostarch_file_name()
```
```{r wyoming-roads-map, fig.cap = "A map of all major roads in Wyoming"}
wyoming_roads <- read_sf("https://data.rwithoutstatistics.com/wyoming-roads.geojson")
wyoming_roads %>%
ggplot() +
geom_sf(data = wyoming) +
geom_sf(
color = "#ff7400",
linewidth = 1
)
```
```{r results='asis'}
save_figure_for_nostarch()
```
```{r}
# This is just to bring the data the first time
# wy_counties <- counties(state = "Wyoming",
# progress_bar = FALSE) %>%
# st_transform("WGS84") %>%
# select(NAME)
#
# wy_counties %>%
# st_write("data/wyoming-counties.geojson")
```
Lastly, we could use `MULTIPOLYGON` data to, for example, depict a state made up of multiple polygons. To see what I mean, take a look at a map of Wyoming’s counties. We can import data to make this map with the following code:
```{r}
wyoming_counties <- read_sf("https://data.rwithoutstatistics.com/wyoming-counties.geojson")
```
If we look at the data, we can see how it represents the 23 counties in the state:
```{r}
wyoming_counties
```
The geometry type of this data is `MULTIPOLYGON`. In addition, the repeated `MULTIPOLYGON` text in the geometry column indicates that each row contains a shape of type `MULTIPOLYGON`.
```{r eval = FALSE, echo = TRUE}
wyoming_counties %>%
ggplot() +
geom_sf(data = wyoming) +
geom_sf()
```
Figure \@ref(fig:wyoming-counties-map) is a map made with this data.
```{r results='asis'}
print_nostarch_file_name()
```
```{r wyoming-counties-map, fig.cap = "A map of Wyoming counties"}
wyoming_counties %>%
ggplot() +
geom_sf()
```
```{r results='asis'}
save_figure_for_nostarch()
```
You can easily see the multiple polygons that make up the map.
### The Dimensions {-}
Next, the geospatial data frame contains the data’s *dimensions*, or the type of geospatial data we’re working with. In the Wyoming example, it looks like this: `Dimension: XY`, meaning the data is two-dimensional, as in the case of all the geospatial data used in this chapter. There are two other dimensions (`Z` and `M`) that you’ll see much more rarely. I’ll leave them for you to investigate further.
### Bounding Box {-}
The penultimate element in the metadata is the bounding box. A *bounding box* represents the smallest area in which we can fit all of our geospatial data. For our `wyoming` object, it looks like this:
`Bounding box: xmin: -111.0569 ymin: 40.99475 xmax: -104.0522 ymax: 45.0059`
The `ymin` value of 40.99475 and `ymax` value of 45.0059 represent the lowest and highest latitude, respectively, that the state’s polygon can fit into. The x values do the same for the longitude. Bounding boxes are calculated automatically, and you don’t typically have to worry about altering them.
### The Geodetic CRS {-}
The last piece of metadata specifies the *coordinate reference system* used to project our data when we plot it. The problem with representing any geospatial data is that we’re displaying information about the three-dimensional Earth on a two-dimensional map. Doing so requires us to choose a coordinate reference system that determines what type of correspondence, or *projection*, to use when making our map.
The data for the Wyoming counties map includes the line `Geodetic CRS: WGS 84`, indicating the use of a coordinate reference system known as *WGS84*. To see a different projection, check out the same map using what’s known as the *Albers equal-area conic convenience projection*:
```{r eval = FALSE, echo = TRUE}
wyoming_counties %>%
sf::st_transform(albersusa::us_laea_proj) %>%
ggplot() +
geom_sf()
```
To use this projection, we add a line that uses the `st_transform()` function from the `sf` package, along with data from the `albersusa` package, before plotting it using ggplot. While Wyoming looked perfectly horizontal in Figure \@ref(fig:wyoming-counties-map), the version in Figure \@ref(fig:wyoming-counties-map-wgs84) appears to be tilted.
```{r results='asis'}
print_nostarch_file_name()
```
```{r wyoming-counties-map-wgs84, fig.cap = "A map of Wyoming counties using the Albers equal-area conic convenience projection"}
wyoming_counties %>%
sf::st_transform(albersusa::us_laea_proj) %>%
ggplot() +
geom_sf()
```
```{r results='asis'}
save_figure_for_nostarch()
```
If you’re curious about how to change projections when making maps of your own, fear not. You’ll see how to do this when we look at Abdoul Madjid’s map in the next section. And if you want to know how to choose appropriate projections for your maps, check out the "Using Appropriate Projections" section below.
### The `geometry` Column {-}
In addition to the metadata, our simple features data differs from traditional data frames in another respect: its `geometry` column. As you probably guessed from the name, it holds the data needed to make our maps.
To understand how this works, consider the connect-the-dots drawings you probably completed as a kid. As you added lines to connect one point to the next, the subject of your drawing became clear. The geometry column is similar. It has a set of numbers, each of which corresponds to a point. If you’re using `LINESTRING/MULTILINESTRING` or `POLYGON/MULTIPOLYGON` simple features data, ggplot uses the numbers in the geometry column to draw each point and then adds lines to connect the points. If you’re using `POINT/MULTIPOINT` data, it draws the points but doesn’t connect them.
Once again, you never have to worry about these details or look in any depth at the `geometry` column.
## Recreating the COVID Map {-}
Now that you understand the basics of geospatial data, let’s walk through the code Madjid used to make his COVID-19 map, which makes use of the geometry types, dimensions, bounding boxes, projections, and the geometry column we just explored. (I’ve made some small modifications to the code to make the final map fit on the page.) Let’s begin by loading few packages:
```{r echo = TRUE}
library(tidyverse)
library(albersusa)
library(sf)
library(zoo)
library(colorspace)
```
The `albersusa` package will give us access to geospatial data, and you can install it as follows:
```{r echo = TRUE, eval = FALSE}
remotes::install_github("hrbrmstr/albersusa")
```
You can install all of the other packages using the standard `install.packages()` code. We’ll use `tidyverse` to import data, manipulate it, and plot it with ggplot. The `sf` package will enable us to change the coordinate reference system and use an appropriate projection for the data. The `zoo` package has functions for calculating rolling averages, and the `colorspace` package gives us a color scale that highlights the data well.
### Importing the Data {-}
Next, let’s import the data we need. We require three pieces of data: COVID rates by state over time, state populations, and geospatial information. Madjid imported each of these pieces of data separately and then merged them, and we’ll do the same.
First, we import COVID data. This data comes directly from *The New York Times*, which publishes daily case rates by state as a CSV file on its GitHub account. I’ve dropped the `fips` variable; Federal Information Processing Standards (FIPS) are numeric codes used to represent states, but we can reference states by their names instead:
```{r echo = TRUE}
covid_data <- read_csv("https://data.rwithoutstatistics.com/covid-us-states.csv") %>%
select(-fips)
```
If you take a look at this data, you can see the arrival of the first COVID cases in the United States in January 2020.
```{r}
covid_data
```
Madjid’s map shows per capita rates (rates per 100,000 people) rather than absolute rates (the rates without consideration for a state’s population). So, to recreate his maps, we need to obtain data on each state’s population. Madjid downloaded this data as a CSV:
```{r echo = TRUE}
usa_states <- read_csv("https://data.rwithoutstatistics.com/population-by-state.csv") %>%
select(State, Pop)
```
We import this data, keep the `State` and `Pop` (population) variables, and save it as an object called `usa_states`. Let’s see what `usa_states` looks like:
```{r}
usa_states
```
Finally, we’ll import the geospatial data and save it as an object called `usa_states_geom`:
```{r echo = TRUE}
usa_states_geom <- usa_sf() %>%
select(name) %>%
st_transform(us_laea_proj)
```
The `usa_sf()` function from the `albersusa` package gives us simple features data for all US states. Conveniently, it places Alaska and Hawaii in locations, and at a scale, that make them easy to see. This data includes multiple variables, but we need only the state names, so we keep the `name` variable only.
We then use the `st_transform()` function from the `sf` package to change the coordinate reference system. The one used here comes from the `us_laea_proj` object in the albersusa package. Remember the *Albers equal-area conic convenience* projection we used to change the appearance of our Wyoming counties map? This is the same projection.
### Calculating Daily COVID Cases {-}
Next, we need to calculate the number of daily COVID cases. We have to do this because the `covid_data` data frame gives us cumulative cases by state, but not the number of cases per day:
```{r echo = TRUE}
covid_cases <- covid_data %>%
group_by(state) %>%
mutate(
pd_cases = lag(cases)
) %>%
replace_na(list(pd_cases = 0)) %>%
mutate(
daily_cases = case_when(
cases > pd_cases ~ cases - pd_cases,
TRUE ~ 0
)
) %>%
ungroup() %>%
arrange(state, date)
```
We use the `group_by()` function to calculate totals for each state, then create a new variable called `pd_cases`, which represents the number of cases in the previous day (we can use the `lag()` function to assign data to this variable). Some days do not have cases counts for the previous day, so in these cases, we set the value to 0 using the `replace_na()` function.
Finally, we create a new variable called `daily_cases`. To set the value of this variable, we use the `case_when()` function to create a condition: If the cases variable (which holds the cases on that day) is greater than the `pd_cases` variable (which holds cases from one day prior), then `daily_cases` is equal to cases minus pd_cases. Otherwise, we set `daily_cases` to be equal to 0.
Because we grouped the data by state at the beginning, we must now remove this grouping using the `ungroup()` function before arranging our data by state and date. Now take a look at the `covid_cases` data frame we created:
```{r}
covid_cases
```
In the next step, we’ll make use of the `daily_cases` variable.
### Calculating Incidence Rates {-}
We’re not quite done calculating values. The data that Madjid used to make his map didn’t include daily case counts. Instead, it contained a five-day rolling average of cases per 100,000 people. A *rolling average* is the average case rate in a certain time period. Quirks of reporting (for example, not reporting on weekends but instead rolling Saturday and Sunday cases into Monday) can make the value for any single day less reliable. Using a rolling average smooths things out. Here is the code to generate this data:
```{r echo = TRUE, eval = FALSE}
covid_cases %>%
mutate(roll_cases = rollmean(
daily_cases,
k = 5,
fill = NA
))
```
We create a new data frame called `covid_cases_rm` (where *rm* stands for rolling mean). The first step in its creation is to use the `rollmean()` function from the `zoo` package to create a roll_cases variable, which holds the average number of cases in the five-day period surrounding a single date. The `k` argument is the number of days for which we want to calculate the rolling average (five, in our case), and the `fill` argument determines what happens in cases like the first day, where we can’t calculate a five-day rolling mean because there are no days prior to this day (Madjid set these values to be NA).
After calculating `roll_cases`, we need to calculate per capita case rates. To do this, we needed population data, so we join the population data from the `usa_states` data frame with the `covid_cases` data:
```{r echo = TRUE, eval = FALSE}
covid_cases_rm <- covid_cases %>%
mutate(roll_cases = rollmean(
daily_cases,
k = 5,
fill = NA
)) %>%
left_join(
usa_states,
by = c("state" = "State")
) %>%
drop_na(Pop)
```
We then drop rows with missing population data (the `Pop` variable). In practice, this means getting rid of several US territories (American Samoa, Guam, Northern Marianas Islands, and Virgin Islands).
Next, we created a per capita case rate variable called `incidence_rate` by multiplying the `roll_cases` variable by 100,000 and then dividing it by the population of each state:
```{r echo = TRUE, eval = FALSE}
covid_cases_rm <- covid_cases_rm %>%
mutate(incidence_rate = 10^5 * roll_cases / Pop) %>%
mutate(
incidence_rate = cut(
incidence_rate,
breaks = c(seq(0, 50, 5), Inf),
include.lowest = TRUE
) %>%
factor(labels = paste0(">", seq(0, 50, 5)))
)
```
Rather than keeping raw values (for example, on June 29, 2021, Florida had a rate of 57.77737 cases per 100,000 people), we use the `cut()` function to convert the values into categories: values of `>0` (greater than zero), values of `>5` (greater than five), and values of `>50` (greater than 50).
The last step is to filter the data so it includes only 2021 data (the only year depicted in Madjid’s map) and select only the variables (`state`, `date`, and `incidence_rate`) we’ll need to create our map. Here is the final `covid_cases_rm` data frame.
```{r}
# This is just to create the covid_cases_rm data frame
covid_cases_rm <- covid_cases %>%
mutate(roll_cases = rollmean(
daily_cases,
k = 5,
fill = NA
)) %>%
left_join(
usa_states,
by = c("state" = "State")
) %>%
drop_na(Pop) %>%
mutate(incidence_rate = 10^5 * roll_cases / Pop) %>%
mutate(
incidence_rate = cut(
incidence_rate,
breaks = c(seq(0, 50, 5), Inf),
include.lowest = TRUE
) %>%
factor(labels = paste0(">", seq(0, 50, 5)))
) %>%
filter(year(date) == 2021) %>%
select(state, date, incidence_rate)
```
```{r}
covid_cases_rm
```
We now have a data frame that we can combine with our geospatial data.
### Adding Geospatial Data {-}
We’ve now used two of our three data sources (COVID case data and state population data) to create the `covid_cases_rm` data frame we’ll need to make the map. Let’s now use the third data source: the geospatial data we saved as `usa_states_geom.` Simple features data allows us to merge regular data frames and geospatial data, another mark in its favor:
```{r echo = TRUE, eval = FALSE}
usa_states_geom %>%
left_join(covid_cases_rm, by = c("name" = "state"))
```
We merge our `covid_cases_rm` data frame into the geospatial data, matching the name variable from `usa_states_geom` to the state variable in `covid_cases_rm`. Next, we create a new variable called `fancy_date.` As the name implies, it’s a nicely formatted version of the date (for example, *Jan 01* instead of *2021-01-01*):
```{r echo = TRUE}
usa_states_geom_covid <- usa_states_geom %>%
left_join(covid_cases_rm, by = c("name" = "state")) %>%
mutate(fancy_date = fct_inorder(format(date, "%b. %d"))) %>%
relocate(fancy_date, .before = incidence_rate)
```
The `format()` function does the formatting while the `fct_inorder()` function makes the `fancy_date` variable sort data by date (rather than, say, alphabetically, which would put August before January). Last, we use the `relocate()` function to put the `fancy_date` column next to the date column. We save this data frame as `usa_states_geom_covid`. Take a look at it:
```{r}
usa_states_geom_covid
```
We can see the metadata and `geometry` columns we discussed.
### Making the Map {-}
It took a lot of work to end up with the surprisingly simple data frame `usa_states_geom_covid`. And while the data may be simple, the code Madjid used to make his map is quite complex. In this section, we walk through it in pieces.
The final map is actually multiple maps, one for each day in 2021. Combining 365 days makes for a large final product, so instead of showing the code for every single day, we’ll filter the `usa_states_geom_covid` to show just the first six days in January:
```{r echo = TRUE}
usa_states_geom_covid_six_days <- usa_states_geom_covid %>%
filter(date <= as.Date("2021-01-06"))
```
We save the result as a data frame called `usa_states_geom_covid_six_days`. Here’s what this data looks like:
```{r}
usa_states_geom_covid_six_days
```
Madjid’s map is giant, as it includes all 365 days. We’ll change the size of a few elements so they fit in this book.
#### Generating the Basic Map {-}
Now that we have six days of data, let’s make some maps. Abdoul Madjid’s map-making code has two main parts: generating the basic map, then tweaking its appearance. We’ll revisit the three lines of code used to make our Wyoming maps, with some adornments to improve the quality of the visualization:
```{r basic-map-code, echo = TRUE, eval = FALSE}
usa_states_geom_covid_six_days %>%
ggplot() +
geom_sf(
aes(fill = incidence_rate),
size = .05,
color = "grey55"
) +
facet_wrap(
vars(fancy_date),
strip.position = "bottom"
)
```
We use `geom_sf()` to plot the geospatial data, modifying a couple arguments. We use `size = .05` to make the state borders less prominent and `color = "grey55"` to set the borders to a medium gray color. Then, to make one map for each day, we use the `facet_wrap()` function. The `vars(fancy_date)` code specifies that the `fancy_date` variable should be used to do the faceting (in other words, make one map for each day) and `strip.position = "bottom"` moves the labels *Jan. 01*, *Jan. 02*, and so on to the bottom of the maps. You can see the resulting map in Figure \@ref(fig:basic-map).
```{r results='asis'}
print_nostarch_file_name()
```
```{r basic-map, ref.label = "basic-map-code", fig.cap = "A map showing the incidence rate of COVID for the first six days of 2021"}
```
```{r results='asis'}
save_figure_for_nostarch()
```
Having generated the basic map, let’s now make it look good.
#### Applying Data Visualization Principles to the Map {-}
From now on, all of the code that Abdoul Madjid uses is to improve the appearance of the maps. Many of the tweaks shown here will feel familiar if you’ve read Chapter \@ref(data-viz-chapter), highlighting a benefit of making maps with ggplot: You can apply the same data-visualization principles you learned about when making charts.
```{r final-map, echo = TRUE, eval = FALSE}
usa_states_geom_covid_six_days %>%
ggplot() +
geom_sf(
aes(fill = incidence_rate),
size = .05,
color = "transparent"
) +
facet_wrap(
vars(fancy_date),
strip.position = "bottom"
) +
scale_fill_discrete_sequential(
palette = "Rocket",
name = "COVID-19 INCIDENCE RATE",
guide = guide_legend(
title.position = "top",
title.hjust = .5,
title.theme = element_text(
family = "Times New Roman",
size = rel(9),
margin = margin(
b = .1,
unit = "cm"
)
),
nrow = 1,
keyheight = unit(.3, "cm"),
keywidth = unit(.3, "cm"),
label.theme = element_text(
family = "Times New Roman",
size = rel(6),
margin = margin(
r = 5,
unit = "pt"
)
)
)
) +
labs(
title = "2021 · A pandemic year",
caption = "Incidence rates are calculated for 100,000 people in each state.
Inspired from a graphic in the DIE ZEIT newspaper of November 18, 2021.
Data from NY Times · Tidytuesday Week-1 2022 · Abdoul ISSA BIDA."
) +
theme_minimal() +
theme(
text = element_text(
family = "Times New Roman",
color = "#111111"
),
plot.title = element_text(
size = rel(2.5),
face = "bold",
hjust = 0.5,
margin = margin(
t = .25,
b = .25,
unit = "cm"
)
),
plot.caption = element_text(
hjust = .5,
face = "bold",
margin = margin(
t = .25,
b = .25,
unit = "cm"
)
),
strip.text = element_text(
size = rel(0.75),
face = "bold"
),
legend.position = "top",
legend.box.spacing = unit(.25, "cm"),
panel.grid = element_blank(),
axis.text = element_blank(),
plot.margin = margin(
t = .25,
r = .25,
b = .25,
l = .25,
unit = "cm"
),
plot.background = element_rect(
fill = "#e5e4e2",
color = NA
)
)
```
The `scale_fill_discrete_sequential()` function from the `colorspace` package sets the color scale. Madjid uses the "rocket" palette (the same palette that that Cédric Scherer and Georgios Karamanis used in Chapter \@ref(data-viz-chapter)) and changes the legend title to "COVID-19 INCIDENCE RATE." Within the `guide_legend()` function, Madjid puts adjusts the position and alignment as well as text properties of the title. He also puts the colored squares in one row, adjusts their height and width, and tweaks the text properties of the labels (the `>0`, `>5`, and so on).
Next, he adds a title and caption using the `labs()` function. After this, he uses `theme_minimal()` before making tweaks using the `theme()` function. These tweaks include setting the font and text color, making the title and caption bold, and adjusting their size, alignment, and the margins around them. He also adjusts the size of the strip text (the *Jan 01*, *Jan 02*, and so on) and makes it bold, puts the legend at the top of the maps, and adds a bit of spacing around it. He removes grid lines and the longitude and latitude lines, then adds a bit of padding around the entire visualization and makes the background a light gray.
There we have it: Figure \@ref(fig:final-map-map) shows our recreation of Abdoul Madjid’s COVID-19 map.
```{r results='asis'}
print_nostarch_file_name()
```
```{r final-map-map, fig.cap = "Our recreation of Abdoul Madjid's map"}
usa_states_geom_covid_six_days %>%
ggplot() +
geom_sf(
aes(fill = incidence_rate),
size = .05,
color = "transparent"
) +
facet_wrap(
vars(fancy_date),
strip.position = "bottom"
) +
scale_fill_discrete_sequential(
palette = "Rocket",
name = "COVID-19 INCIDENCE RATE",
guide = guide_legend(
title.position = "top",
title.hjust = .5,
title.theme = element_text(
family = "Times New Roman",
size = rel(9),
margin = margin(
b = .1,
unit = "cm"
)
),
nrow = 1,
keyheight = unit(.3, "cm"),
keywidth = unit(.3, "cm"),
label.theme = element_text(
family = "Times New Roman",
size = rel(6),
margin = margin(
r = 5,
unit = "pt"
)
)
)
) +
labs(
title = "2021 · A pandemic year",
caption = "Incidence rates are calculated for 100,000 people in each state.
Inspired from a graphic in the DIE ZEIT newspaper of November 18, 2021.
Data from NY Times · Tidytuesday Week-1 2022 · Abdoul ISSA BIDA."
) +
theme_minimal() +
theme(
text = element_text(
family = "Times New Roman",
color = "#111111"
),
plot.title = element_text(
size = rel(2.5),
face = "bold",
hjust = 0.5,
margin = margin(
t = .25,
b = .25,
unit = "cm"
)
),
plot.caption = element_blank(),
# plot.caption = element_text(
# hjust = .5,
# face = "bold",
# margin = margin(t = .25,
# b = .25,
# unit = "cm")),
strip.text = element_text(
size = rel(0.75),
face = "bold"
),
legend.position = "top",
legend.box.spacing = unit(.25, "cm"),
panel.grid = element_blank(),
axis.text = element_blank(),
plot.margin = margin(
t = .25,
r = .25,
b = .25,
l = .25,
unit = "cm"
),
plot.background = element_rect(
fill = "#e5e4e2",
color = NA
)
)
```
```{r results='asis'}
save_figure_for_nostarch()
```
From data import and data cleaning to analysis and visualization, we’ve shown how Abdoul Madjid made a beautiful map in R.
## Making Your Own Maps {-}
You may now be wondering: Okay, great, but how do I actually make my own maps? Let’s talk about where you can find geospatial data, how to choose a projection, and how to wrangle geospatial data to get it ready for mapping.
### Importing Raw Data {-}
There are two ways to access simple features geospatial data. The first is to import raw data. Geospatial data can take a number of different formats. While ESRI shapefiles (with the *.shp* extension) are the most common, you might also encounter GeoJSON files (*.geojson*), KML files (*.kml*), and others. Chapter 8 of *Geocomputation with R* by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow discusses this range of formats.
The good news for us is that a single function can read pretty much any type of geospatial data: `read_sf()` from the `sf` package. Let’s show an example of how it works. Say you’ve downloaded geospatial data about United States state boundaries from the website *geojson.xyz* in GeoJSON format, then saved it in the *data* folder as *states.geojson*. You can then import this data using the `read_sf()` function:
```{r echo = TRUE}
us_states <- read_sf(dsn = "data/states.geojson")
```
The dsn argument (which stands for data source name) tells read_sf() where to find the file. We save the data as the object `us_states`.
### Accessing Geospatial Data Using R Functions {-}
You’ll sometimes have to work with raw data in this way, but not always. That’s because certain R packages provide functions for accessing geospatial data. Madjid used the `usa_sf()` function from the `albersusa` package to acquire his data. Another package for accessing geospatial data related to the United States, `tigris`, has a number of well-named functions for different types of data. For example, we can load the `tigris` package and run the `states()` function:
```{r echo = TRUE}
library(tigris)
states_tigris <- states(
cb = TRUE,
resolution = "20m",
progress_bar = FALSE
)
```
We use the `cb = TRUE` argument to opt us out of using the most detailed shapefile and set the resolution to a more manageable 20m (1:20 million). Without these changes, the shapefile we’d get would be large and slow to work with. We also set `progress_bar = FALSE` so we won’t see the messages that `tigris` shares while it loads data. We then save the result as `states_tigris`.
The `tigris` package has functions to get geospatial data about counties, census tracts, roads, and more. Kyle Walker, developer of the package, wrote a book, *Analyzing US Census Data: Methods, Maps, and Models in R*, if you’d like to learn more about how to use it.
If you’re looking for data outside of the United States, fear not! The `rnaturalearth` package provides functions for importing geospatial data from across the world. For example, `ne_countries()` can retrieve geospatial data about various countries:
```{r echo = TRUE}
library(rnaturalearth)
africa_countries <- ne_countries(
returnclass = "sf",
continent = "Africa"
)
```
This code uses two arguments: `returnclass = "sf"` to get data in simple features format, and `continent = "Africa"` to get only countries on the African continent. If we save the result to an object called `africa_countries`, we can plot the data on a map:
```{r africa-map-code, echo = TRUE, eval = FALSE}
africa_countries %>%
ggplot() +
geom_sf()
```
Figure \@ref(fig:africa-map) shows the resulting map.
```{r results='asis'}
print_nostarch_file_name()
```
```{r africa-map, ref.label = "africa-map-code", fig.cap = "A map of Africa made with data from the rnaturalearth package"}
```
```{r results='asis'}
save_figure_for_nostarch()
```
If you can’t find an appropriate package, you can always fall back on using `read_sf()` from the `sf` package.
### Using Appropriate Projections {-}
Once you have access to geospatial data, you need to decide which projection to use. If you’re looking for a simple answer to this question, you’ll be disappointed. As Robin Lovelace, Jakub Nowosad, and Jannes Muenchow put it in their book *Geocomputation with R*, "the question of *which* CRS [to use] is tricky, and there is rarely a ‘right’ answer."
If you feel overwhelmed by the task of choosing a projection, the `crsuggest` package, also by Kyle Walker, can give you ideas. Its `suggest_top_crs()` function returns a coordinate reference system that is well-suited for your data. Let’s load `crsuggest` and try it out on our `africa_countries` data:
```{r echo = TRUE, eval = FALSE}
library(crsuggest)
africa_countries %>%
suggest_top_crs()
```
The `suggest_top_crs()` function should return projection number 28232. We can now pass this value to the `st_transform()` function to change the projection before we plot:
```{r africa-map-different-projection-code, echo = TRUE, eval = FALSE}
africa_countries %>%
st_transform(28232) %>%
ggplot() +
geom_sf()
```
When run, this code generates the map in Figure \@ref(fig:africa-map-different-projection).
```{r results='asis'}
print_nostarch_file_name()
```
```{r africa-map-different-projection, ref.label = "africa-map-different-projection-code", fig.cap = "A map of Africa made with projection number 28232"}
```
```{r results='asis'}
save_figure_for_nostarch()
```
As you can see, we’ve mapped Africa with a different projection.
### Wrangling Your Geospatial Data {-}
The ability to merge traditional data frames with geospatial data is a huge benefit of working with simple features data. Remember that for his COVID map, Madjid analyzed traditional data frames before merging them with geospatial data. But because simple features data acts just like traditional data frames, we can just as easily apply the data-wrangling and analysis functions from `tidyverse` directly to a simple features object. To demonstrate this, let’s return to the `africa_countries` simple features data, selecting two variables (`name` and `pop_est`) to see the name and population of the countries:
```{r echo = TRUE, eval = FALSE}
africa_countries %>%
select(name, pop_est)
```
The output looks like the following:
```{r}
africa_countries %>%
select(name, pop_est)
```
Say we want to make a map showing which African countries have populations larger than 20 million. To do so, we’d need to first calculate this value for each country. Let’s do this using the `mutate()` and `if_else()` functions, which will return `TRUE` if a country’s population is over 20 million and `FALSE` otherwise, then store the result in a variable called `population_above_20_million`:
```{r echo = TRUE, eval = FALSE}
africa_countries %>%
select(name, pop_est) %>%
mutate(population_above_20_million = if_else(pop_est > 20000000, TRUE, FALSE))
```
We can then take this code and pipe it into ggplot, setting the `fill` aesthetic property to be equal to `population_above_20_million`:
```{r africa-map-20m-code, echo = TRUE, eval = FALSE}
africa_countries %>%
select(name, pop_est) %>%
mutate(population_above_20_million = if_else(pop_est > 20000000, TRUE, FALSE)) %>%
ggplot(aes(fill = population_above_20_million)) +
geom_sf()
```
This code generates the map shown in Figure \@ref(fig:africa-map-20m).
```{r results='asis'}
print_nostarch_file_name()