- Introduction
- qplot
- Mastering the Grammar
- Build a Plot Layer by Layer
- Toolbox
- Scales, Axes, and Legends
- Positioning
- Polishing Your Plots for Publication
- Manipulating Data
- Reducing Duplication
Note: To learn more about grammar of graphics, see Wilkinson 2005, Wickham 2009 (pdf)
What is a statistical graphic?
- Mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars)
Basic components of such a mapping:
- data: stuff we want to visualize, mapping describes how variables are mapped to aesthetic attributes
- geom: represents what you see on the plot: points, lines, polygons, etc.
- scale: map values in data space to values in aesthetic space (color, size, shape), draws legends, axes
- coord: data coordinates mapped to the plane of the graphic
- facet: how to break data into subsets
How ggplot fits into other R graphics:
- Base graphics:
- Pen and paper model: can only draw on top of the plot, cannot modify or delete existing content
- No user accessible representation of graphics apart from appearance on screen
- Includes tools for drawing primitives and entire plots
- Grid:
- Developed by Paul Murrell (1998 disseration)
- Grid gobs (graphical objects) can be represented independently of the plot and modified later
- System of viewports (each containing own coord. system) allows for complex graphics
- Draws primitives but no way to produce statistical graphics
- Lattice:
- Sarkar 2008a
- Implements trellis of Cleveland (1985, 1993)
- Produces conditioned plots
- Lacks formal model
Basic plotting function.
# Assume x and y are quant. variables
qplot(x,y, data=data)
qplot(x, y, # x or y can be a complex variable (a*b*c)
data=data,
color=variable, # you can also do color=I("red") which isn't same as mapping
shape=variable,
alpha= I(1/10), # alpha for overplotting
geom = c("point", "smooth")) # geom="path" for assessing relation w/ time etc., paths
"
Smoothers:
- method="loess", span =.2
- method = "gam", formula = y ~ s(x), formula = y ~ s(x, bs = "cs")
- method = "lm"
- method = "lm", formula = y ~ ns(x, 5)
- method = "rlm" # robust linear model
"
qplot(x, y,
data=data,
geom=c("smooth"),
method="loess",
span=.2)
# Scatter plot for quant vars.
# To prevent overplotting - jitter, and alpha
qplot(categorical, quantitative, data=data, geom="jitter", alpha = I(1/10))
# for boxplots, geom="boxplot"
# 1d geoms: geom=histogram, freqpoly, density, bar
qplot(x, data=data, geom="histogram", # or geom = "density"
bindwith=.1,
fill=color, # color is a variable to which you want to map color
xlim=c(a,b))
# weighted bar chart
qplot(x, data=data, geom="bar", weight=variable)
qplot(date, y, data=data, geom="line")
qplot(x, data=data, facets = category ~ ., geom="histogram", binwidth=.1)
qplot(x, y, data=data,
xlab = "x axis label",
ylab = "y axis label",
main = "plot title",
xlim = c(xmin, xmax),
log = "xy") # logs x and y variables
- Geoms: describe type of plot
- scatterplot: geom = point
- bubblechart: geom = point, size = mapped to a variable
- barchart: geom = bar etc.
- scaling: convert data units to pixels/spacing/colors
- coord: use coordinate system for position (coord)
- position adjustment
- faceting
- End product: new dataset that records this information (x, y, color, size, shape)
- layer = data (and mappings to it), stat, geom, position adjustment
- Plot can have multiple layers, with diff. datasets
- a list with data, mapping, layers, scales, coordinates, and facet. and options (to store plot-specific theme options)
- to render to screen: print()
- to render to disk: ggsave()
- to describe structure: summary()
- to save cached copy to disk: save() (which can be brought back up using load())
ggplot always takes data as a data.frame. Start with mapping basic aesthetics to data, and then add layers to it:
p <- ggplot(data, aes(x, y, color=z)) # nothing would be displayed as no layers yet
p + layer(geom = "point") # use + to add layers, this uses the mapping given and default values for stat and position adj.
layer(geom, geom_params, stat, stat_params, data, mapping, position)
layer(geom = "bar",
geom_params=list(fill= "red"),
stat="bin",
stat_params = list(bindwidth=2))
- Specifying layer is verbose so we use shortcuts exploiting the fact that
- every geom is associated w/ default stat
- every stat w/ default geom
- All shortcut functions start either with geom_ or stat_
geom_histogram(binwidth=X, fill="red")
geom_point()
geom_smooth()
Data
- Must be a data.frame
- If you want to produce the same plot for different data frames
- to see same plot with new data:
p <- ggplot(data, aes(x,y)) + geom_point()
p %+% new_data
Aesthetic Mappings
- aes function
- the mappings can be modified later by doing + aes()
- do summary(p) to see how the object changes
- you can change the y-axis by doing
p + geom_pont(aes(y=new_y))
- Setting Versus Mapping
- color = "red" versus color = var
Grouping
- break by a categorical variable for instance
Geoms
- each geom has a set of aesthetics it needs and a larger set it understands.
- For instance, point needs x and y and understands color, size, shape
- Bar geom needs height (ymax) and understands width, border color, fill color
- every geom has a default stat
- list of all geoms:
- abline, area, bar,
- blank (draws nothing),
- boxplot
- countour (3d contours in 2d), crossbar (hollow bar with middle indicated by a line)
- density, density_2d (2d density)
- errorbar
- histogram, hline
- interval, jitter, line (connect obs.)
- linerange
- path (connect obs. in original order)
- point, pointrange (vertical line, with point in the middle)
- polygon
- ribbon, rug
- segment, smooth, step (connect obs. in stairs)
- text, tile, vline
Stat
- stat takes dataset as input and outputs a dataset
- stat_bin produces: count, density, center of the bin (x)
- list of all stats:
- bin, boxplot, contour, density, density_2d, function, identity, qq,
- quantile, smooth, spoke (converts angle and radius to xend and yend),
- sum, summary, unique
Position Adjustments
- dodge, fill, identity, jitter, stack
Growth Trajectory tip
- Plots of longitudinal data are called spaghetti plots
- Plot estimated trajectories + actual data and can't see the issues
- Plot residuals
- Display the data
- Display stat summaries
- Display metadata
- geom_area(), geom_bar()
- geom_line() (also understands linetype), geom_path() (same as geom_line but lines connected in order)
- geom_point(), geom_polygon() (filled path, each vertex = separate row)
- geom_text() (also has hjust, vjust)
- geom_tile()
- geom_histogram()
- geom_freqpoly
- geom_boxplot()
- geom_density()
- geom_jitter() (crude way of looking at categorical data, prevents overplotting)
- Make points smaller (shape =".") or using hollow glyphs (shape =1)
- alpha
- jitter
p <- ggplot(data, aes(x, y))
j <- position_jitter(width=.5)
p + geom_jitter(position = j, alpha("color", 1/10))
-
binning
- but breaking in squares can produce artifacts so use hexagon (geom_hexagon)
- stat_bin2d
- stat_binhex(bins = 10)
- stat_binhex(bindiwth=c(.02, 200))
-
stat summaries
- add map borders using borders() function
library(maps)
data(us.cities)
ggplot(data, aes(long, lat)) + borders("state", size=.5)
"
choropleth maps:
start by merging map data with whatever else
reorder the merged
geom = polygon
"
Geoms that can do intervals:
- geom_ribbon -> geom_smooth
- geom_errorbar -> geom_crossbar
- geom_linerange -> geom_pointrange
library(effects)
a <- as.data.frame(effect(...))
- stat_summary()
- individual summary funcs like fun.y, fun.ymin, fun.ymax - take functions that take a vector and return single numeric
funs <- function(x) mean(x, trim=.5)
stat_summary(fun.y = funs, geom="point")
- more complex data func = fun.data (return named vector as output)
stat_summary(fun.data=exoticfunc, geom="ribbon")
- Key thing: Annotations are extra data
geom_vline(), geom_hline # use it to highlight certain points on say a timeline
geom_rect() to highlight interesting rect regions (takes xmin, xmax, ymin, ymax)
geom_text(aes(x, y, label="caption")) # x, y are coords, label can be a single label, list of labels
- Arrows
- geom_line, geom_path, and geom_segment all have an arrow param
- use arrow() function to tweak the arrow. arrow takes angle, length, ends, type
Scales control mapping from data to aesthetics. They also provide viewers a guide to map back from aesthetics to data.
- Input can be discrete or continuous.
- Process of mapping domain to range:
- transformation: only for continuous. log etc.
- training: domain is learned. complicated when graph has multiple layers
- mapping: function that translates domain to range (which we knew before)
plot <- qplot(quant, quant, data)
plot + aes(x=categorical) # doesn't work as scale of x is still cont.
plot + aes(x=categorical) + scale_x_discrete()
To modify default_scale:
- Must construct a new scale and add with +
- Scale constructors
- start with scale_
- followed by aesthetic (color_, shape_, x_)
- followed by name of scale (gradient, hue, manual)
- for e.g. scale_color_hue
- Scale for each aesthetic
- Colour and fill:
- discrete: brewer, grey, hue, identity, manual (e.g. scale_color_brewer())
- cont: gradient, gradient2, gradientn
- position:
- discrete
- continuous, date
- shape:
- discrete only: shape, identity, manual
- line type:
- discrete only: linetype, identity, manual
- size:
- discrete: identity, manual
- cont: size
- Colour and fill:
- Types of Scales
- Position Scales: for axes
- Color Scales: map to colors
- Manual Scales: map manually to size, shape, color, line type etc.
- Identity Scale
-
name
- label for axis/legend
- use \n for a new line
- math expressions using plotmath
- common shortcuts: xlab, ylab, labs()
-
limits
- fixes limits
-
breaks, labels
- breaks control which vals appear on axis/legend
- labels - labels to appear on those breakpoints
-
formatter
- if no labels specified, this is called.
- useful formatters: comma, percent, dollar, scientific, and abbreviate (for categorical scales)
- Shorcuts
- xlim(10, 20) # produces cont. scale
- xlim("a", "b", "c")
- If you don't want graph to extend beyond range of data: expand = c(0,0)
- Continuous Scales:
- scale_x_continuous
- takes trans argument to transform: pow10, probit, recip (x^-1), reverse, sqrt, log, exp, identity, etc.
- scale_x_log10() = scale_x_continuous(trans = "log10")
- Date and Time
- takes as.Date or as.POSIXct
- takes three things: major (breaks), minor (breaks), format (formatting of labels)
- e.g. major = "2 weeks", format = "%d/%m/%y" etc.
- Read more about colour at http://www.handprint.com/HP/WCL/color7.html
- hcl color space: hue (color), chroma (purity of color), luminance (lightness of color)
- continuous:
- scale_colour_gradient() # 2 color gradient
- scale_color_gradient2() # 3 color gradient
- scale_color_gradientn() # custom n gradient
- scale_fill_gradient() etc. also...
- RColorBrewer::display.brewer.all, can do fill = vore (where vore is from color brewer)
- discrete:
- scale_linetype
- scale_shape
- scale_size_discrete
- scale_shape_manual
ggplot(data, aes(x)) +
geom_line(aes(y = y + 5, color = "above")) +
geom_line(aes(y = y -5, color = "below")) +
scale_color_manual("Legend Label", c("below" = "blue", "above" ="red"))
- scale_identity: data are colours
- Axis: axis label, tick mark, tick label
- Legend: legend title, key, key label
- Scale name controls axis label, legend title
- breaks and labels (see above)
- for visual appearance, see axis., legend.
- internal grid: panel.grid.major/minor
- legend.position, legend.justification
facet_grid: 2d grid of panels
- facet_grid(. ~ .) means neither rows nor columns are faceted
- facet_grid(. ~ var) produces single row with multiple cols
facet_wrap: 1d ribbon of panels wrapped into 2d
** Controlling Scales**
- scales="fixed", x and y scales are fixed across all panels
- scales = "free", x and y vary across panels
- scales = "free_x" only x varies, y is fixed
** Dodging vs. Faceting**
- Can be pretty similar except for the labeling.
qplot(xaxis, data=data, geom="bar", fill=diff_color_bars_same_x, position ="dodge")
# similar to
qplot(diff_color_bars_same_x, data=data, geom = "bar", fill=diff_color_bars_same_x) +
facet_grid(. ~ xaxis)
- To facet by continuous vars, categorize it: cut_interval(x, n=10) etc.
** Coordinate Systems**
- Diff coordinate systems built in:
- coord_flip: x is y now
- coord_equal: equal scale coords
- coord_trans: transform/log etc.
- coord_map
- coord_polar
plot + coord_flip()
plot + coord_trans(x="pow10")
plot + coord_cartesian(xlim=c(0,20))
# coord_polar takes theta - position variable that is mapped to angle
plot + coord_polar(theta="y") # default is x (creates pie chart from a stacked bar chart)
Themes
- Themes control non data elements of plot
- Themese control: font (title, axis, tick labels, strips, legend labels, legend key labels), color (ticks, grid lines, backgrounds (panel, plot, strip, legend))
- Built in themes
- theme_gray() (Based on advice by Tufte, Brewer, Carr etc.)
- theme_bw()
previous_theme <- theme_set(theme_bw())
plot + previous_theme
theme_set(previous_theme) # permanently restores previous theme
- Theme elements
- theme_text() for labels, headings. Control font family, color, face, size, hjust, vjust, angle, lineheight
- theme_line(), theme_segment() take color, size, and linetype. e.g. opts(panel.grid.major = theme_line(color="red"))
- theme_rect() for background etc. and takes color, size, and linetype (changes the border)
- theme_blank() draws nothing. e.g. opts(panel.grid.major = theme_blank())
- you can alter old theme with theme_update
- Save
- ggsave()
pdf(); qplot(); dev.off()
- Subplots
viewport(x, y, width, height)
pdf(); vp <- viewport(); small_graph; print(big_graph, vp=vp); dev.off()
- Grids
- grid.layout()
- grid.newpage()
- pushViewport(viewport(layout = grid.layout(2,2)))
- ddply(.data, .variables, .fun, ...)
- .data: dataset to break up
- .variables: grouping var
- .fun: summary function to apply
- colwise() to apply a function to each column of data
- colwise(function)(data)
- last plot can be accessed via last_plot()
- you can save plot templates
xquiet <- scale_x_continuous("", breaks=NA)
yquiet <- scale_y_continuous("", breaks=NA)
quiet <- c(xquiet, yquiet)
plot + quiet