February 13, 2020

Introduction

Why Visualize Data in R with ggplot2

  • Quality
    • ggplot2 has a very good reputation
  • Ease (somewhat)
    • Can make complex graphs relatively quickly
    • Can correct mistakes easily
    • ggplot2 has a logic to it, albeit somewhat confusing
  • Cohesion
    • Keep data visualization with data manipulation and analyses

What is ggplot2?

  • An R package
  • Created by Hadley Wickham (with the help of others)
  • Part of the tidyverse
  • Based on the Grammar of Graphics (Wilkinson, 2005)

This presentation is based on ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham

Preparing our Environment

#install.packages("ggplot2")
library(ggplot2)
mpg <- mpg

mpg is a dataset with fuel economy data from 1999 and 2008 for 38 popular models of cars

Introduction to ggplot2

The Grammar of Graphics

All plots are composed of:

  • Aesthetic mappings describing how variables in the data are mapped to aesthetic attributes
  • Layers made up of:
    • Geometric objects, geoms for short, (e.g., points, lines)
    • Statistical transformations, stats for short, (e.g., binning and counting observations to create a histogram, summarising a 2d relationship with a linear model)
  • Scales which:
    • Map values in the data space to values in an aesthetic space, (e.g., color)
    • Draw a legend or axes that provide an inverse mapping to make it possible to read the original data values from the plot

  • A coordinate system, coord for short, which
    • Describes how data coordinates are mapped to the plane of the graphic (e.g., cartesian, polar, map)
    • Provides axes and gridlines to make it possible to read the graph
  • A theme which controls the finer points of display, (e.g., font size, background color)
  • (Optionally) A faceting specification, facet for short, describing how to break up the data into subsets and how to display those subsets (aka, conditioning/latticing/trellising)
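As a rough sketch of how these components appear in code, the plot below writes each one out explicitly. The variable name p_grammar and the legend title are illustrative choices, and most of these calls only restate what ggplot2 would do by default.

```r
library(ggplot2)

# One plot with each grammar component written out explicitly
p_grammar <- ggplot(mpg, aes(displ, hwy, color = drv)) +  # data + aesthetic mappings
  geom_point(stat = "identity") +               # layer: point geom with the identity stat
  scale_color_discrete(name = "Drive train") +  # scale: maps drv to colors, draws the legend
  coord_cartesian() +                           # coord: ordinary x-y plane
  facet_wrap(~year) +                           # facet: one panel per model year
  theme_grey()                                  # theme: ggplot2's default look
p_grammar
```

Dropping the stat, scale, coord, and theme lines should give the same picture apart from the legend title, since those are the defaults.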

Three Essential Elements of a Plot

You will always need to specify:

  • Data
  • Aesthetic mappings
  • At least one layer (with a geom function)

Data and aesthetic mappings are supplied in ggplot()

Then layers are added on with +

Three Essential Elements of a Plot

ggplot(mpg, aes(displ, hwy)) + 
  geom_point()

Almost every plot maps variables to x and y, so aes() treats its first two unnamed arguments as x and y and you don’t need to name them
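For comparison, here is the same plot with every argument named. The names p_short and p_named are only for illustration; both objects should draw the same scatterplot.

```r
library(ggplot2)

# Short form: data and aesthetics matched by position
p_short <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Long form: every argument named explicitly
p_named <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
p_named
```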

Layers

Remember: Layers are made up of:

  • Geometric objects, geoms for short, (e.g., points, lines)
  • Statistical transformations, stats for short, (e.g., binning and counting observations to create a histogram, summarising a 2d relationship with a linear model)

Each geom has a set of aesthetics and stats that it understands.

Popular Geoms

One Continuous Variable

Histogram

ggplot(mpg, aes(displ)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
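The message above is a prompt to choose the bins yourself. A minimal sketch, assuming a bin width of 0.5 litres is sensible for displ (an arbitrary illustrative value, not a recommendation):

```r
library(ggplot2)

# binwidth is given in the units of the x variable (litres of displacement here)
p_hist <- ggplot(mpg, aes(displ)) +
  geom_histogram(binwidth = 0.5)
p_hist
```

Setting binwidth (or bins) explicitly also silences the message.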

Frequency Polygon

ggplot(mpg, aes(displ)) + 
  geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Smoothed Density Estimate

ggplot(mpg, aes(displ)) + 
  geom_density()

One Categorical Variable

Bar Chart

ggplot(mpg, aes(drv)) + 
  geom_bar()

Two Continuous Variables

Scatterplot

ggplot(mpg, aes(displ, cty)) + 
  geom_point()

Smoothed Line Of Best Fit

ggplot(mpg, aes(displ, cty)) + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Two Continuous Variables Without Repeated Xs (e.g., Economic Data)

Area Plot

ggplot(economics[1:100,], aes(date, unemploy)) + 
  geom_area()

Line Plot

ggplot(economics[1:100,], aes(date, unemploy)) + 
  geom_line()

Step Plot

ggplot(economics[1:100,], aes(date, unemploy)) + 
  geom_step()

Categorical X, Continuous Y

Bar Plot

ggplot(mpg, aes(drv, cty)) + 
  geom_bar(stat = "identity")
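With stat = "identity", geom_bar() draws one rectangle per row and stacks them, so each bar above is the sum of cty within a drive type. If one bar per group at the group mean is what you want, one possible approach is to summarise first. This sketch uses base R's aggregate(); the name mean_cty is illustrative, and geom_col() is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Average city mpg for each drive type, computed before plotting
mean_cty <- aggregate(cty ~ drv, data = mpg, FUN = mean)

# One bar per drive type, with height equal to the mean
p_means <- ggplot(mean_cty, aes(drv, cty)) +
  geom_col()
p_means
```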

Box Plot

ggplot(mpg, aes(drv, cty)) + 
  geom_boxplot()

Violin Plot

ggplot(mpg, aes(drv, cty)) + 
  geom_violin()

Multiple Layers

ggplot(mpg, aes(displ, cty)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can add geoms on top of each other

This becomes helpful when adding information about error.

See geom_crossbar(), geom_errorbar(), geom_linerange(), and geom_pointrange()
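A minimal sketch of layering error information, assuming the mean plus or minus one standard error is the interval of interest. The summary table cty_summary and its columns avg and se are hypothetical names introduced for this example.

```r
library(ggplot2)

# Mean and standard error of city mpg for each drive type (base R only)
cty_by_drv <- split(mpg$cty, mpg$drv)
cty_summary <- data.frame(
  drv = names(cty_by_drv),
  avg = sapply(cty_by_drv, mean),
  se  = sapply(cty_by_drv, function(x) sd(x) / sqrt(length(x)))
)

# Points for the means, with an error bar layer on top
p_err <- ggplot(cty_summary, aes(drv, avg)) +
  geom_point() +
  geom_errorbar(aes(ymin = avg - se, ymax = avg + se), width = 0.2)
p_err
```

The other error geoms take the same ymin and ymax aesthetics, so swapping geom_errorbar() for geom_linerange() should work with no other changes.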

Stats

ggplot(mpg, aes(drv, cty)) + 
  geom_point() + 
  stat_summary(fun = "median", color = "red", size = 6, geom = "point")

Stats can be used when you need to do a statistical transformation of the data that a geom can’t already do

stat_summary() is the most common

Stat functions and geom functions both combine a stat with a geom to make a layer
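To illustrate the last point, the two calls below should build the same layer from opposite directions. The names p_geom and p_stat are illustrative; the stat and geom arguments are written out even though they match each function's default.

```r
library(ggplot2)

# Start from the geom and name the stat
p_geom <- ggplot(mpg, aes(drv)) +
  geom_bar(stat = "count")

# Start from the stat and name the geom
p_stat <- ggplot(mpg, aes(drv)) +
  stat_count(geom = "bar")

# Both layers compute the same counts per drive type
layer_data(p_geom)$count
layer_data(p_stat)$count
```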

Aesthetics

Remember: Aesthetic mappings describe how variables in the data are mapped to aesthetic attributes

Aesthetics as Constants Vs. Variables

An aesthetic can be mapped to a variable or set to a constant:

  • If you want appearance to be constant, set the value as an argument to the geom function, outside aes()
ggplot(mpg, aes(displ, cty)) + 
  geom_point(color = "blue")

Aesthetics as Constants Vs. Variables

  • If you want appearance to be governed by a variable, put the specification inside aes() in ggplot()
ggplot(mpg, aes(displ, cty, color = class)) + 
  geom_point()
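A common slip is to put a constant inside aes(). The sketch below contrasts the two; p_const and p_mapped are illustrative names. Inside aes(), "blue" is treated as a one-level categorical variable, so the points get the default palette color and a legend rather than blue points.

```r
library(ggplot2)

# Constant set outside aes(): every point is drawn in blue
p_const <- ggplot(mpg, aes(displ, cty)) +
  geom_point(color = "blue")

# Constant placed inside aes(): "blue" becomes a data value, not a color
p_mapped <- ggplot(mpg, aes(displ, cty)) +
  geom_point(aes(color = "blue"))

unique(layer_data(p_const)$colour)   # the color actually drawn for p_const
unique(layer_data(p_mapped)$colour)  # the color actually drawn for p_mapped
```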

Common Aesthetic Attributes For Variables

Color and Fill

ggplot(mpg, aes(displ, cty, color = hwy)) + 
  geom_point()

Good for continuous and categorical variables

The previous example mapped color to a categorical variable (class); this one maps it to a continuous variable (hwy)

Color and Fill

Fill example:

ggplot(mpg, aes(displ, cty, fill = drv)) + 
  geom_hex()

Shape

Good for categorical variables

ggplot(mpg, aes(displ, cty, shape = drv)) + 
  geom_point(size = 4)

Size

Good for continuous variables

ggplot(mpg, aes(displ, cty, size = hwy)) + 
  geom_point()

Linetype

Good for categorical variables

ggplot(mpg, aes(displ, cty, linetype = drv)) + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Label and Family

Good for categorical variables

Use with geom_text()

ggplot(mpg, aes(displ, cty, label = drv)) + 
  geom_text()

Common Constant Aesthetic Attributes

Alpha

The aesthetics commonly mapped to variables can also be set to constants; alpha (transparency) is another aesthetic that is usually set to a constant

Good for overlapping data

ggplot(mpg, aes(displ, fill = drv)) + 
  geom_density(alpha = 0.4)

Facetting

An alternative to using aesthetics to map properties of the data is to use facetting

Remember: facetting describes how to break up the data into subsets and how to display those subsets

  • facet_wrap(): “wraps” a 1d ribbon of panels into 2d
  • facet_grid(): produces a 2d grid of panels defined by variables which form the rows and columns

Wrapped Facetting

ggplot(mpg, aes(displ, cty)) + 
  geom_point() + 
  facet_wrap(~class)

Wrapped Facetting

You can control how the ribbon is wrapped into a grid with the following arguments:

  • ncol and nrow control how many columns and rows (you only need to set one)
  • as.table controls how the facets are laid out
    • TRUE: with highest values at the bottom-right
    • FALSE: with the highest values at the top-right
  • dir controls the direction of wrap: horizontal or vertical
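A short sketch of these arguments in use. The values (two rows, vertical wrapping) are arbitrary choices for illustration, and p_wrap is an illustrative name.

```r
library(ggplot2)

# Seven vehicle classes wrapped into two rows, filled column by column
p_wrap <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~class, nrow = 2, dir = "v")
p_wrap
```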

Grid Facetting

facet_grid() lays out plots in a 2d grid, as defined by a formula:

. ~ a spreads the values of a across the columns. This direction facilitates comparisons of y position, because the vertical scales are aligned.

b ~ . spreads the values of b down the rows. This direction facilitates comparison of x position because the horizontal scales are aligned. This makes it particularly useful for comparing distributions.

a ~ b spreads a down the rows and b across the columns.
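If the formula order is hard to remember, recent versions of ggplot2 (3.0.0 and later, as far as I know) also accept named rows and cols arguments wrapped in vars(). A sketch that should be equivalent to facet_grid(drv ~ cyl), with p_grid as an illustrative name:

```r
library(ggplot2)

# rows and cols named explicitly instead of using a formula
p_grid <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl))
p_grid
```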

Grid Facetting

ggplot(mpg, aes(displ, cty)) + 
  geom_point() + 
  facet_grid(. ~ cyl)

Grid Facetting

ggplot(mpg, aes(displ, cty)) + 
  geom_point() + 
  facet_grid(drv ~ .)

Grid Facetting

ggplot(mpg, aes(displ, cty)) + 
  geom_point() + 
  facet_grid(drv ~ cyl)

Making Your Plots Beautiful: Scales, Axes, Legends, and Themes

Scales and Axes

Remember: Scales:

  • Map values in the data space to values in an aesthetic space, (e.g., color)
  • Draw a legend or axes that provide an inverse mapping to make it possible to read the original data values from the plot

Scales and Axes

Use scale_*() functions to adjust:

  • Scale/axis name
  • Breaks and labels

See:

  • scale_x_continuous()
  • scale_x_discrete()
  • scale_fill_gradient()
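A minimal sketch of adjusting names and breaks. The labels, break positions, and colors are illustrative choices; scale_color_gradient() is used here as the color counterpart of scale_fill_gradient().

```r
library(ggplot2)

p_scaled <- ggplot(mpg, aes(displ, cty, color = hwy)) +
  geom_point() +
  # Axis title and tick positions for x
  scale_x_continuous(name = "Engine displacement (litres)", breaks = 2:7) +
  # Axis title for y
  scale_y_continuous(name = "City miles per gallon") +
  # Legend title and the two end colors of the gradient
  scale_color_gradient(name = "Highway mpg", low = "grey80", high = "darkblue")
p_scaled
```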

Legends and Themes

There are around 40 unique elements that control the appearance of the plot

They can be roughly grouped into five categories:

  • Plot
  • Axis
  • Legend
  • Panel
  • Facet

Plot Elements

Some elements affect the plot as a whole:

  • plot.background (set with element_rect())
  • plot.title (set with element_text())
  • plot.margin (set with margin())
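A sketch of setting these three elements through theme(). The title text, colors, and sizes are arbitrary, and plot_theme and p_plot are illustrative names.

```r
library(ggplot2)

plot_theme <- theme(
  plot.background = element_rect(fill = "grey95", color = NA),  # fill behind the whole plot
  plot.title      = element_text(face = "bold", size = 14),     # title font
  plot.margin     = margin(10, 10, 10, 10)                      # top, right, bottom, left (points)
)

p_plot <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  ggtitle("City mileage by engine size") +
  plot_theme
p_plot
```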

Axis Elements

  • axis.line and axis.ticks are set with element_line()
  • axis.ticks.length is set with unit()
  • axis.text, axis.text.x, axis.text.y, axis.title, axis.title.x, and axis.title.y, are set with element_text()

Legend Elements

The legend elements control the appearance of all legends. You can also modify the appearance of individual legends by modifying the same elements in guide_legend() or guide_colourbar()

  • legend.text.align and legend.title.align are set with a number from 0 to 1
  • legend.text and legend.title are set with element_text()
  • legend.background and legend.key are set with element_rect()
  • legend.key.size, legend.key.height, and legend.key.width are set with unit(); legend.margin is set with margin()

There are four other properties that control how legends are laid out in the context of the plot (legend.position, legend.direction, legend.justification, and legend.box).
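A sketch combining a few of the legend elements with the layout properties. The specific values are arbitrary, and legend_theme and p_legend are illustrative names.

```r
library(ggplot2)

legend_theme <- theme(
  legend.position   = "bottom",                   # where the legend sits relative to the panel
  legend.direction  = "horizontal",               # lay the keys out side by side
  legend.title      = element_text(face = "bold"),
  legend.background = element_rect(fill = "grey95"),
  legend.key        = element_rect(fill = NA)     # no fill behind each key
)

p_legend <- ggplot(mpg, aes(displ, cty, color = drv)) +
  geom_point() +
  legend_theme
p_legend
```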

Panel Elements

  • aspect.ratio is set with a numeric value
  • panel.background and panel.border are set with element_rect()
  • panel.grid.major, panel.grid.major.x, panel.grid.major.y, panel.grid.minor, panel.grid.minor.x, and panel.grid.minor.y are set with element_line()

The main difference between panel.background and panel.border is that the background is drawn underneath the data, and the border is drawn on top of it. For that reason, you’ll always need to assign fill = NA when overriding panel.border

Note that aspect ratio controls the aspect ratio of the panel, not the overall plot
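A sketch of the panel elements, including the fill = NA needed so the border does not cover the data. Colors and the aspect ratio are arbitrary; panel_theme and p_panel are illustrative names.

```r
library(ggplot2)

panel_theme <- theme(
  aspect.ratio     = 1,                                       # square panel
  panel.background = element_rect(fill = "white"),            # drawn underneath the data
  panel.border     = element_rect(color = "black", fill = NA), # drawn on top, so keep it unfilled
  panel.grid.major = element_line(color = "grey85"),
  panel.grid.minor = element_blank()                          # hide minor gridlines
)

p_panel <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  panel_theme
p_panel
```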

Facetting Elements

  • panel.spacing, panel.spacing.x, and panel.spacing.y are set with unit() (these were named panel.margin before ggplot2 2.2.0)
  • strip.background is set with element_rect()
  • strip.text, strip.text.x, and strip.text.y, are set with element_text()

strip.text.x affects both facet_wrap() and facet_grid(); strip.text.y only affects facet_grid()
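A sketch of styling a faceted plot. It uses panel.spacing, the current name for the gap between panels (older ggplot2 versions called it panel.margin). The values are arbitrary, and facet_theme and p_facet are illustrative names.

```r
library(ggplot2)

facet_theme <- theme(
  panel.spacing    = unit(4, "mm"),                  # gap between neighbouring panels
  strip.background = element_rect(fill = "grey20"),  # box behind each facet label
  strip.text       = element_text(color = "white"),  # all facet labels
  strip.text.y     = element_text(angle = 0)         # row labels read horizontally
)

p_facet <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_grid(drv ~ cyl) +
  facet_theme
p_facet
```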

What We Didn’t Cover

  • Coordinate systems (e.g., maps - see coord_map(), coord_polar(), and coord_trans())

  • Position adjustments (e.g., jittering points, bars on top of each other or side-by-side - see position argument of geom_bar() and geom_point())

  • Many smaller details - See ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham
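As a brief pointer on position adjustments, here is a sketch of the two cases mentioned above. The choice of variables is illustrative, as are the names p_dodge and p_jitter.

```r
library(ggplot2)

# Bars side by side instead of stacked
p_dodge <- ggplot(mpg, aes(drv, fill = factor(cyl))) +
  geom_bar(position = "dodge")

# Points nudged by a small random amount to reduce overplotting
p_jitter <- ggplot(mpg, aes(drv, cty)) +
  geom_point(position = "jitter")

p_dodge
p_jitter
```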