This ggplot2 tutorial was originally written for members of MitoCAMB and later expanded to a workshop for the MRC Mitochondrial Biology Unit. It reflects my own inexpert, somewhat aesthetically impaired opinions. Proceed at your own peril.


A key part of our job as research scientists is to communicate our findings to a broader audience, including through papers, presentations, and conference posters. This means we routinely make and re-make figures which illustrate our work and any uncertainties associated with it. Data visualisation is a part of data science that we all engage in, regardless of our field of expertise.

Making figures shouldn’t be painful and confusing. Instead, it should be something we approach systematically and with a level of confidence. With this tutorial, I want to introduce you to my favourite data visualisation tool, ggplot2, and also to the way I think about building and refining figures.

This tutorial has three main parts. First we discuss the layered grammar of graphics, which is a systematic way of thinking about the components of a figure. This part is code-free, but should hopefully help make sense of the code to come. In the second part, we go through the core features of ggplot2, and how we can leverage these to make versatile, complex data visualisations. Finally, we talk about figure refinement, from changing axis labels to setting image size and resolution. I should warn you that in this last part I’m at my least authoritative and nevertheless most opinionated. But before we jump into it, let’s set up.

Materials and set-up

This tutorial is for you if you have some experience with R, even if only a little. You don’t need any prior experience with ggplot2 specifically, or with the wider tidyverse ecosystem.

You need an up-to-date version of R installed, and I thoroughly recommend working in the RStudio IDE. This R Markdown notebook was originally written in R 4.1.3 and most recently updated in R 4.2.0.

In this tutorial we’ll use two R packages: tidyverse and cowplot. The former contains ggplot2 in addition to a number of other data science packages. The latter is an add-on to ggplot2, designed to make publication-ready figures. We will use it in Part III. You can install or update packages in the usual way if you need to.

install.packages("tidyverse")
install.packages("cowplot")

Once you have everything installed and updated, we need to load the packages.

library("tidyverse")
library("cowplot")

Finally, it’s hard to talk about data visualisation without any data. Throughout, we will use Fisher’s iris data set. This famous toy data set contains petal and sepal measurements obtained from 150 individual flowers across three iris species. It might be familiar to you from other workshops and tutorials, but don’t worry if not. We can load it and take a look.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Above, iris is a data frame containing 150 observations (rows) of 5 variables (columns). The first four variables correspond to the flower measurements and are numeric, and the last one, Species, is a factor, taking one of three values: "setosa", "versicolor" or "virginica". Data analysis and visualisation in R rely heavily on this data frame structure.
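If you would like to check this structure for yourself, here is a minimal sketch using base R functions only. Nothing in it is specific to ggplot2.

```r
# iris ships with base R (the datasets package), so no extra packages are needed
data(iris)

# Compact summary: one line per column, with its type and first few values
str(iris)

# Number of rows (observations) and columns (variables)
dim(iris)

# The three species, stored as factor levels
levels(iris$Species)
```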

Part I: What is the Grammar of Graphics?

The “gg” in ggplot2 stands for “grammar of graphics”. If the grammar of a language is a way of describing the components of a sentence, e.g. subject, verb, object, and how these all fit together, then a grammar of graphics is a way of describing the components of a plot, e.g. data, axes, shapes. Just like grammar differs from language to language, so it can differ from one data visualisation tool to another. However, when people talk about the grammar of graphics, they usually mean the grammar first described by Leland Wilkinson and later extended and implemented by Hadley Wickham.

The basic idea behind this grammar is that we can (semi-)independently specify plot features and combine them to create arbitrarily complex figures. The three features we must explicitly specify for every ggplot2 plot are data, aesthetic mapping(s) and geometric object(s). Other features, usually handled by default in the background, include statistical transformations, scales, faceting, and theme.

Let’s illustrate the grammar of graphics. I made the plot below using ggplot2, although in principle I could have used base R, Excel, Prism, or even drawn it by hand. However a figure has been generated, we can always talk about it in terms of the grammar of graphics, even if we don’t see the code used to make it, or if in fact there is no code.

The data you see plotted above is the same iris data set I showed you earlier. While you have seen that the full data set contains a number of different variables, only petal length and species are shown here.

The aesthetic mappings of the plot link these data variables to plot elements like axes, colours, shapes, and sizes. In the plot above, there are two aesthetic mappings: petal length is mapped to the x-axis, and species is mapped to colour (or technically, fill). What about the y-axis? It stands for count, which is not a variable recorded in the data, but is instead a statistical transformation calculated implicitly when making the plot. This is why it doesn’t count (pun intended) as an aesthetic mapping. If you are trying to figure out what the aesthetic mappings of a plot are, you need to ask yourself (1) what variables are present in the plot, and (2) how they are represented using axes, colours, and so on.

Finally, geometric objects describe the types of common statistical plots you see. They can be things like scatter plots, regression lines, boxplots, etc. One of the features which make ggplot2 so versatile is that it allows us to layer different geometric objects on top of each other to make complex plots. In the example above I used only one geometric object: a histogram. In order to identify the geometric objects in a plot, you have to ask yourself what different types of common plots you can see. Keep in mind there might be more than one!

Exercise 1: The Grammar of Graphics. Now that you have an idea of what data, aesthetic mappings, and geometric objects are in the context of the grammar of graphics, why don’t you give it a try? Can you identify the different features of the plot below? If you wish to spend longer on this exercise, try doing the same for the figures in the last research article you read. Alternatively, head over to your favourite news website and see if you can find some data visualisations there to practise on. Articles on the climate emergency, demographics, and elections are all good places to look.

Part II: Creating complex, versatile plots

Syntax overview

We have discussed the grammar of graphics in theory, but what does it look like in practice? In its simplest form, the code to make a ggplot2 figure looks something like this:

ggplot(<data>, aes(<aesthetic mappings>)) +
  geom_<object>()

The first line is a ggplot() function call, which contains the data and shared aesthetic mappings of the plot. We then add (literally, using a + sign) different plot elements, including geometric objects.

The first plot we discussed was a histogram of petal lengths, coloured by iris species. I added a title and used a few other features to customise my plot, but a simple version of the same figure can be generated using:

ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram()

Unlike plotting in base R, the data used in ggplot() must be in a data frame format. Like in iris, rows should always correspond to samples or observations, and columns should correspond to variables. Luckily for us, this is arguably the most standard way to record data, and you might find a lot of your spreadsheets already look like this.
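If your measurements currently live in separate vectors rather than a data frame, wrapping them up is usually a one-liner. The sketch below uses made-up values and hypothetical names (length_cm, species, my_flowers) purely for illustration.

```r
# Made-up measurements stored as loose vectors (hypothetical values)
length_cm <- c(1.4, 4.7, 6.0)
species <- c("setosa", "versicolor", "virginica")

# One row per flower, one column per variable
my_flowers <- data.frame(length_cm = length_cm, species = factor(species))

# Confirm we now have a data frame and inspect its columns
is.data.frame(my_flowers)
str(my_flowers)
```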

The reason to insist on this format is that we can then set aesthetic mappings by referring to the column names of the data frame. Above, x = Petal.Length means that the measurements recorded under Petal.Length should go on the x-axis. Similarly, fill = Species means that the fill (i.e. inside colour of a 2D object, as opposed to its boundary colour) is determined by values recorded under Species.

Finally, the geometric object here, affectionately known as the geom, is a histogram. Hence geom_histogram(). Good code is pretty self-explanatory like that.

There are several differences between the plot I showed you earlier and the “default” plot you will get running the two lines of code above. Let’s put them side by side to compare and contrast. The plot on the left is the one I showed you earlier, and the plot on the right is the result of the two lines of code above.

The data, aesthetic mappings, and geometric object in both plots are exactly the same. The differences you see are largely stylistic, and are controlled by some of the additional features of ggplot2. Some geoms have extra parameters you can tweak, including statistical transformations and position. I manually set the histogram bins in my plot to be different from the default, and also made the histograms corresponding to different species overlap instead of stacking on top of each other. I then edited the scales of the plot, in order to change the axis labels and histogram colours. I also used a different theme which is why the background colours, fonts, and legend positions in the two plots are different. I further added a title to my plot.

Throughout the rest of the tutorial, we will dig a little deeper into all these plot features. Some things we won’t have time to cover in great detail (e.g. using colour palettes and creating your own custom themes), but I’ll signpost you to other resources.

Aesthetics

The connection between aesthetic mappings and geometric objects isn’t entirely straightforward. They are not independent of each other: we usually specify both the x- and the y-axis for a scatter plot, but only the x-axis for a histogram. But neither are aesthetic mappings necessarily specific to a geom. When we make a complex figure including multiple geoms (e.g. a scatter plot together with a regression line), they might share some aesthetics, such as the x- and y-axis, but not others, such as colour.

As a general rule, every geom will have some required aesthetics, some optional ones, and some it complains about or ignores. You don’t need to know these by heart. Most of the time the obvious things work, and troubleshooting tends to be reasonably straightforward. However, it helps to be aware of the most commonly used aesthetics you can manipulate.

Axes

The two major aesthetics are the axes, x and y. Axes can map to both continuous variables, such as Petal.Length, or discrete ones, such as Species. Note also that aesthetic mappings don’t need to map to singular data variables, they can also map to expressions. We might, for example, be interested in the sum of petal length and width. Below is a boxplot which illustrates a discrete x-axis and a continuous y-axis using a more complex variable expression.

ggplot(iris, aes(x = Species, y = (Petal.Length + Petal.Width))) +
  geom_boxplot()

Colour and fill

The other most commonly used aesthetics are for colour and fill. Here colour means specifically the colour of one-dimensional “thin” shapes such as points and lines, as well as the border colour of 2D shapes like bars and tiles. Fill stands for the “inside” colour of bars and tiles, as well as the shading of areas more generally.

Exercise 2: Ignoring aesthetics. In the code below, change colour to fill. What happens and why do you think it happens? Can you add, remove, or change some of the aesthetic mappings to produce an error message?

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()

Two-dimensional geoms such as histograms and boxplots can have both colour and fill mappings, although usually we want to focus on fill. What happens if you change fill to colour in the plot below?

ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram()


Like axes, colours can map to both discrete variables, as above, or to continuous ones using colour gradients. By default low values start in dark blue and gradually turn lighter, but later we’ll see how to customise gradients.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Length)) +
  geom_point()

Shape and linetype

When making figures, especially for print, we must always take into account our audience. Many of them will be colour-blind. And even those of us who aren’t often print papers in black & white. It can be beneficial to complement colour with different point shapes and line types. In ggplot2 the point shape aesthetic is conveniently called shape, and the line type one is, well, linetype. Below is an example which uses different colour and shape simultaneously.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species, shape = Species)) +
  geom_point()
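The linetype aesthetic works the same way for geoms drawn with lines. As a sketch, here are density curves (geom_density(), a geom not otherwise covered in this tutorial) told apart by both colour and line type.

```r
library(ggplot2)

# One density curve per species, distinguished by colour and by line type
p_linetype <- ggplot(iris, aes(x = Petal.Length, colour = Species, linetype = Species)) +
  geom_density()
p_linetype
```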

Mappings and assignments

So far we’ve talked about aesthetic mappings, i.e. linking aesthetics to variables (or expressions of variables) in the data. But occasionally you may want to keep aesthetics fixed to a particular value. For example, you might want to make a scatter plot with cornflower blue points, instead of colouring points by species. We can set this using the same aesthetic colour, but moving it from the aes() call to the geom instead.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, shape = Species)) +
  geom_point(colour = "cornflowerblue")

This is no longer an aesthetic mapping, since we are not mapping any data variables. Instead, it is an aesthetic assignment, as we are assigning a fixed value. There are some aesthetics I assign more often than others: size, which controls point size and line width (from ggplot2 3.4.0 onwards, line width has its own aesthetic, linewidth), and alpha for colour transparency. Transparency is useful when parts of the plot overlap. Setting it to zero will turn the object invisible, and the default value of one makes objects fully opaque. I always experiment with setting alpha = 0.1 on plots with hundreds or thousands of points. Setting alpha = 0.5 or similar can be useful when you have overlapping shapes such as bars.
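As a quick sketch of assigning several aesthetics at once, the code below fixes colour, size, and alpha for every point. The values 3 and 0.5 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Fixed values go in the geom, outside aes(), so no legend is drawn for them
p_fixed <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(colour = "cornflowerblue", size = 3, alpha = 0.5)
p_fixed
```

Because several flowers share identical measurements, the semi-transparent points look darker wherever they overlap.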

This is just the tip of the iceberg, of course. If you want to learn more about different aesthetics and what values they can take, check out this official documentation page.

Geometric objects

So far we have seen three types of geometric objects: histograms geom_histogram(), scatter plots geom_point() and boxplots geom_boxplot(). Listing all geoms is well beyond the scope of this tutorial, and if I’m being honest beyond the scope of my own knowledge. Suffice it to say, there is a geom for just about any type of statistical plot out there. In this section we’ll see some more examples of commonly used geoms and how they can be combined.

Layering geoms

You might have noticed that right at the start I referred to the grammar of graphics as “layered”. This refers to our ability to add geoms together and layer them on top of each other. One of the things I frequently do is make violin plots and add small boxplots inside them, to visualise the median and upper and lower quartiles of each distribution.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_violin() +
  geom_boxplot(width = .15)

If you’ve ever used an image editing tool like Photoshop, Inkscape, or Gimp, you might be familiar with the concept of layering. It’s this idea that when you put opaque images on top of each other, bottom layers are hidden underneath top layers. We can control this by adjusting layer transparency and being careful about how we order things. In ggplot2, geoms are layered in the order you type them, i.e. the higher up the geom is in your code, the further back it is in the image. So in the picture above the larger violin plots are behind the smaller boxplots. However, if we swap the order of the geoms, the boxplots are no longer visible, as they are hidden behind the violins.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(width = .15) +
  geom_violin()

Exercise 3: Layering. Another way to make sure background layers are visible is by making foreground layers transparent. Can you edit just the last line of the code above (reordering geoms or adding new ones is not allowed) to make this picture?

Shared and private aesthetics

In the example above, the violin plots and the boxplot shared the same aesthetic mappings, namely species for the x-axis and petal length for the y-axis. Axes are usually shared by all geoms in a plot. However, aesthetics don’t have to be shared all the time. It’s quite common to share some aesthetics and keep others private to specific geoms.

For example, suppose that I wanted to colour the violin plots by species, but keep the boxplots white for contrast. If I add fill = Species in my usual aes() call, I end up colouring both, which isn’t what I set out to do.

ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_violin() +
  geom_boxplot(width = .15)

This is because anything from the aes() call inside the ggplot() function gets passed down to all geometric objects underneath. These are what we refer to as shared aesthetics. However, I can add a new aes() mapping that is private to one geom. It will affect that specific geom, but nothing else. If I set aes(fill = Species) to be private to geom_violin(), then it wouldn’t affect the boxplot, which would remain white by default.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_violin(aes(fill = Species)) +
  geom_boxplot(width = .15)

Exercise 4: Shared and private aesthetics. I’m not claiming it’s a good idea, but can you make the plot below?

Statistical transformations

Data visualisation is ultimately a branch of statistics and like all branches of statistics, it (unfortunately) requires calculations, done if not by us then by a computer. And different types of plots, i.e. different geoms, require different calculations.

A scatter plot geom_point() is easy: we take the data as it is, no maths necessary. A histogram geom_histogram() is, however, more involved: we need to decide how to bin the data, and then count how many points fall in each bin. We might want to set parameters like the bin width or the bin boundaries. To draw a boxplot geom_boxplot() with the right height and shape, we instead need to calculate the median and quartiles of the data. A violin plot geom_violin() may intuitively feel like a type of boxplot, but the maths behind it is different still: instead of quartiles, it estimates the shape (density) of the distribution. Different geoms, then, require different statistical transformations.

Stats

Things get a little abstract and esoteric in this section, apologies for which. Don’t worry if you find this next bit confusing: it concerns ggplot2 features that are perhaps good to know, but that you’re unlikely to need.

Every geom has a default corresponding stat. These are properly stat_*() functions that handle the necessary statistical transformations, but for the purposes of this tutorial we’ll think of them as parameters of the geom_*() function. Since in geom_point() we leave the data as-is, its stat is stat = "identity". In histograms, the default stat is stat = "bin", because we need to bin the data. Other examples include stat = "density" for density plots and the rather mysterious stat = "smooth" for fitted lines and curves.
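If you are curious which stat a geom uses by default, one way to peek is to inspect the layer object that a geom_*() call returns. This is only a sketch: it relies on the layer having a stat field, which is an implementation detail rather than something this tutorial depends on. The help page of each geom lists the default stat too.

```r
library(ggplot2)

# Each geom_*() call creates a layer object that remembers its stat.
# The first class name of that stat identifies it.
class(geom_point()$stat)[1]
class(geom_histogram()$stat)[1]
class(geom_boxplot()$stat)[1]
```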

What happens if we try to change the stat away from the default? Let’s try. Below, I’ve used geom_line() but with the same stat that histograms require, stat = "bin".

ggplot(iris, aes(x = Petal.Length)) +
  geom_line(stat = "bin")

What we end up with looks a bit like a histogram or a density plot, except it’s made of lines. The formal name for this type of plot is a frequency polygon and it has its own geom, geom_freqpoly(). Try it out! How come geom_line() produced an almost-histogram? It’s because in ggplot2, the stat controls the calculation (in this case binning data), and the geom controls the way it’s visualised (in this case using lines).

Is changing stats ever actually useful? Yes, albeit very rarely. In the example above, there was simply another, better geom. Most of the time, you’ll find there is already a geom best suited to the stat you want. Still, understanding stats is important! It makes navigating ggplot2 documentation easier, for one, especially as you look for other parameters you can adjust.

Exercise 5: Geoms and stats. You might be forgiven if you assume that geom_line() by default fits a regression line to the data. However, it does something rather different. What is its default stat and how does it explain the figure you see? When might you want to use geom_line()?

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_line()

Additional parameters

Geoms have additional parameters you can tweak, some of which they inherit from the underlying stat. We used one such parameter earlier: width can make things like boxplots and bars narrower. Try varying the value of width below to see what happens.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.15)

While width above changes how the boxes look, it doesn’t affect the underlying stat calculation. However, a lot of parameters do. Take histograms for example. Throughout this tutorial you might have seen the following message every time you’ve made a histogram: `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is because geom_histogram() quietly calls stat_bin() to do the necessary calculations, and by default stat_bin() creates 30 equally-sized bins for the data. We could override that default in several ways. We can change binwidth, like the message suggests. We could also set the number of bins with bins, or even manually fix the bin boundaries with breaks. Note that while technically these are all used by the stat_bin() function, we put them directly in geom_histogram() without explicitly referencing the stat.

ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(breaks = 0:8)
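For comparison, here is a sketch of the other two options mentioned above, binwidth and bins. The specific values (0.5 and 15) are arbitrary and only meant as starting points for experimenting.

```r
library(ggplot2)

# Fix the width of each bin, in the units of the x-axis
p_binwidth <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.5)
p_binwidth

# Or fix the total number of bins instead
p_bins <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(bins = 15)
p_bins
```

Setting any one of these also silences the message about the default `bins = 30`.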

Exercise 6: Stat parameters. We can also set stat parameters within alternative geoms. Can you make the figure below using geom_line()?


In addition to making histograms, the other useful geom (or equivalently stat!) to know how to tweak is geom_smooth(). By default, this fits a curve through the data.

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth()

The blue curve is the fitted model, and the grey band is its 95% confidence interval. The model in question is LOESS, a kind of local polynomial regression that you might not be familiar with, or that might not be relevant to the statistical analysis you actually do on your data. Much more often, we want to specify a linear model instead, which we can do by setting method = "lm".
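A minimal sketch of the linear version is the same plot as before with method = "lm" added. Everything else is left at its default.

```r
library(ggplot2)

# Straight-line fit with the default 95% confidence band
p_lm <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm")
p_lm
```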

By default, the linear regression plotted like this is fitted as lm(y ~ x). However, we can further specify the formula if we wanted to. Note the formula parameter uses the aesthetics we have mapped (in this case the axes x and y), not the data variable names. We can also hide the confidence interval by setting se = FALSE or make it narrower by setting another value, e.g. level = .90 for a 90% confidence interval. The code below changes the formula and the confidence interval width. What happens if you also add se = FALSE?

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ sqrt(x), level = .90)

Exercise 7: Regression lines. We are slowly working towards making the figures from the start of this tutorial. Can you make the plot below? The shade of grey used instead of the default blue for the fitted line is "grey70". As an additional challenge, can you also change the colour of the confidence interval band?

Position

In addition to having a stat, which handles relevant calculations, geoms also have a position, which ostensibly handles what happens when parts of the geom overlap. This usually happens when we have overlapping data points, and it often comes up in histograms.

Let’s put the iris data set to one side for a moment and illustrate with some fake data. The code below creates a data frame df containing 16 samples. Each sample has a value (either one or two) and a label (either A or B).

df <- data.frame(value = rep(1:2, 8),
                 label = rep(c("A", "B"), times = c(10, 6)))

The values are equally split between one and two, but there are more A’s than B’s. The data can be summarised like so:

table(df)
##      label
## value A B
##     1 5 3
##     2 5 3

Let’s make a histogram of the data, colouring it by label. The default position of histograms is position = "stack", meaning equal values appear in a single column on top of each other even if they have different labels. This is why we see two columns of height eight, one for each value. Each of these columns has a coral part for A, and a teal part for B. Note that I’ve set the transparency to alpha = 0.5 so you can convince yourself the coral and teal parts of each column don’t overlap.

ggplot(df, aes(x = value, fill = label)) +
  geom_histogram(alpha = 0.5)

However, sometimes when we plot histograms and colour them by a category (be that label or species), we do want them to overlap instead of stack. To achieve this, we set position = "identity". This way, instead of one column of size eight at each of the values, we have two overlapping columns: a coral column for A of size five, and a teal column for B of size three. I have again set the transparency to alpha = 0.5, and you can see the coral and teal overlap. What happens if you change alpha?

ggplot(df, aes(x = value, fill = label)) +
  geom_histogram(position = "identity", alpha = 0.5)

We might also want to plot the data side by side using position = "dodge". This is rather ill-advised for histograms, but it can be very useful in bar charts geom_bar() and column charts geom_col(). Note in the example below, I also made the continuous value discrete using factor(). Unlike histograms, (good) bar charts have discrete x-axes. See what happens if you change the position parameter, or remove the factor() function.

ggplot(df, aes(x = factor(value), fill = label)) +
  geom_bar(position = "dodge")

Exercise 8: Positions. Going back to our iris dataset, can you make the figure below? How do the column heights differ from earlier histograms we’ve made?

Faceting

So far we’ve discussed colour as a way of distinguishing between discrete groups in the data, e.g. iris species. However, sometimes it is clearer and preferable to create a separate plot for each group. This is particularly true when there are many groups, or when we can’t easily make colours suitable for black & white printing.

Splitting a plot by group is known as faceting. There are two ways of doing it in ggplot2: facet_wrap() and facet_grid(). As the name suggests, the former is designed for faceting by one variable, and will wrap the plots in a roughly rectangular shape. The latter creates a grid, and is particularly useful when we are faceting by two variables.

The syntax of facet_*() is a bit unusual: facet_wrap(~<var>) for one variable, and facet_grid(<var_rows> ~ <var_cols>) for two variables in a grid. So if we wanted to split the petal length histogram by species, we would write:

ggplot(iris, aes(x = Petal.Length)) +
  facet_wrap(~Species) +
  geom_histogram()
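facet_wrap() chooses the number of rows and columns for you, but you can override the layout. As a sketch, assuming you would like the three histograms stacked vertically so that positions along the shared x-axis are easier to compare, you can set ncol, a facet_wrap() argument not otherwise used in this tutorial.

```r
library(ggplot2)

# Force a single column of panels, one per species
p_stacked <- ggplot(iris, aes(x = Petal.Length)) +
  facet_wrap(~Species, ncol = 1) +
  geom_histogram(bins = 30)
p_stacked
```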

Facet variables can be column names of the data, but like aesthetics, they can also be expressions. Let’s say I want to split the histograms by species and by petal width, so flowers with petal width below 1.5cm are in one group, and above are in another. The expression (Petal.Width >= 1.5) will be TRUE for the wider flowers and FALSE for the narrower ones. We can use it as a facet.

ggplot(iris, aes(x = Petal.Length)) +
  facet_grid((Petal.Width >= 1.5) ~ Species) +
  geom_histogram()

We can also have an “empty” facet, denoted with a dot (.). What happens if you swap the positions of . and Species in the code below?

ggplot(iris, aes(x = Petal.Length)) +
  facet_grid(. ~ Species) +
  geom_histogram()

Exercise 9: Facets. In the examples above scales are fixed, meaning facets share the same x- and y-axis ranges. However, you can change this by setting the scales parameter of the facet to "free". Can you make the figure below? More importantly, should you make figures like it? The confidence intervals for I. setosa and I. virginica appear to be similar, and much wider than the confidence interval for I. versicolor. Is this true?


Between aesthetic mapping and assignment, layered geometric objects, their statistical transformations and positioning, and finally facets you now have the tools to make some really complex, interesting and hopefully useful plots. However, these won’t necessarily satisfy your inner graphic designer. Data visualisation is a science as much as an art (well, perhaps not in my hands…), and what’s left is for us to “make stuff pretty”, as a former colleague likes to put it.

Part III: Refining figures

In my mind, this bit of the work is split into three or sometimes four subparts. After I’m more or less done with the “functional” plot, I start with scales. These are used to fix things like axis labels and colours. I then sometimes add annotations: figure titles, or text, lines, and boxes to draw my audience’s attention to a particular part of the plot. I continue by adding a theme, which controls the overall design of the figure, including background colour and font size. A great way to make all your figures consistent, e.g. within a paper or in your thesis, is to use the same theme. You can also use different themes for different occasions! Poster figures need larger font sizes than print figures, for example. I round off by arranging my individual plots in larger figures, if I need to, and make sure I save them with the kind of resolution and aspect ratio that works for me.

Of course, this process is a lot less linear than I’m making it sound. Occasionally I discover that my font size needs changing right at the end, or that my colour choices aren’t printer-friendly and faceting is easier. The order in which I do things isn’t set in stone, but I find having a workflow in mind really helps.

Scales

With the exception of facet labels, most label printing, colour setting, etc., is managed using scales. Scales control aesthetic mappings and take the format of scale_<aesthetic>_<type>(). So if I want to format a discrete x-axis, I would add scale_x_discrete() to my plot. Every aesthetic has a scale and scale types can vary considerably. I tend to use discrete and continuous for axes, and manual for colour. The latter allows me to set specific colours.

Like geoms, scales have parameters you can set. A lot of these are common to most, if not all, scales. For example, the first argument of the scale is its name and serves as an overall aesthetic title. If the scale is numeric, limits sets the range of values shown and breaks sets the points where labels are printed. These labels are handled by a labels parameter. Finally, you can set specific values with, you guessed it, values. Since we often want to reuse labels and values, I prefer defining them separately as named vectors, like so:

species_labs <- c("setosa" = "I. setosa", 
                  "versicolor" = "I. versicolor", 
                  "virginica" = "I. virginica")
species_cols <- c("setosa" = "red", 
                  "versicolor" = "blue", 
                  "virginica" = "yellow")

It is important that you include the names of each element of these vectors! Without names, values are matched to factor levels by position, so if you reorder or drop levels in your data you might change colours between plots without realising. I also strongly encourage you to name these vectors consistently. Personally, I use both the relevant variable name (e.g. species) and the parameter name (e.g. labs for labels and cols for colours).
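
To see why the names matter, here is a minimal sketch that reverses the order of the Species levels and then checks which fill each species ends up with. The reversed data frame iris_rev and the helper fill_by_species() are made up for this illustration; layer_data() is the ggplot2 function that returns the data a layer actually draws.

```r
library(ggplot2)

species_cols <- c("setosa" = "red",
                  "versicolor" = "blue",
                  "virginica" = "yellow")

# Hypothetical copy of iris with the Species levels in reverse order
iris_rev <- iris
iris_rev$Species <- factor(iris_rev$Species, levels = rev(levels(iris$Species)))

# Hypothetical helper: build the boxplot and report the fill used for each species
fill_by_species <- function(data, values) {
  p <- ggplot(data, aes(x = Species, y = Petal.Length, fill = Species)) +
    geom_boxplot() +
    scale_fill_manual(values = values)
  boxes <- layer_data(p)  # one row per box
  # The group number of each box is the position of its species in the factor levels
  setNames(unname(boxes$fill), levels(data$Species)[as.integer(boxes$group)])
}

fill_by_species(iris_rev, species_cols)          # named: setosa is still red
fill_by_species(iris_rev, unname(species_cols))  # unnamed: colours follow level order
```

With the named vector, each species keeps its colour no matter how the factor is ordered. With the unnamed one, the first colour simply goes to whichever level comes first.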

Here is the boxplot from earlier, now with adjusted scales.

ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() +
  scale_x_discrete("", labels = species_labs) +
  scale_y_continuous("Petal Length", limits = c(0, 8), breaks = 1:7) +
  scale_fill_manual("", values = species_cols, labels = species_labs)

As you can see, I’ve used labels = species_labs in two different scales: once for the x-axis and once for fill. What happens if you remove one or both of these? You can also change the limits and breaks of the y-axis.

Exercise 10: Scales. Can you make the figure below? To get these exact colours, I used the Java palette from the MetBrewer package. For now you can just copy them from here:

species_cols <- c("setosa" = "#663171", "versicolor" = "#ea7428", "virginica" = "#0c7156")


Setting discrete colours is easy. What about colour gradients? Earlier in the section on aesthetic mappings, we made a scatter plot with colour = Sepal.Length. Let’s remake it here, but instead of plotting it, let’s save it as an object p.

p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Length)) +
  geom_point()

Since Sepal.Length is continuous, we can add a scale_colour_continuous() to the plot, and use the same breaks, labels and limits parameters that we would use on an axis. In addition, we can adjust the low and high colours of the gradient.

p +
  scale_colour_continuous("Sepal length", 
                          breaks = c(4.3, 6.1, 7.9), labels = c(4.3, 6.1, 7.9), limits = c(4, 8), 
                          low = "blue", high = "red")

A colour gradient doesn’t need to go just from low to high. To include a third colour in the middle, we change the scale type to gradient2. There is also a gradientn scale type, which takes an arbitrary number of colours. This page is very detailed both on colour theory and on integrating colours in your plots.

p +
  scale_colour_gradient2("Sepal length", 
                         breaks = c(4.3, 6.1, 7.9), labels = c(4.3, 6.1, 7.9), limits = c(4, 8), 
                         low = "blue", mid = "grey30", high = "red",
                         midpoint = 6.1)
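
For completeness, here is a minimal sketch of the gradientn scale type mentioned above. The four colours are arbitrary picks of mine, not a recommendation, and the name p_gradn is just for this example. By default, the colours are spread evenly across the limits.

```r
library(ggplot2)

# Same scatter plot as before, rebuilt so this chunk runs on its own
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Sepal.Length)) +
  geom_point()

# gradientn takes a vector of colours of any length (illustrative choices here)
p_gradn <- p +
  scale_colour_gradientn("Sepal length",
                         colours = c("blue", "grey30", "orange", "red"),
                         limits = c(4, 8))
p_gradn
```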


Since scales relate to aesthetics, one thing we can’t use them for is facet labels. Remember this questionable figure?

We can use scales to fix the axis labels, but not the facet ones. This is particularly problematic for the row facet, which had to do with petal width (Petal.Width >= 1.5). For these, we need to use labeller(), a special function reserved for facet labels.

To start, we need to decide on the labels. We already have species_labs, which we can reuse for column facets. We need to write a similar named vector of labels for (Petal.Width >= 1.5).

width_labs <- c("TRUE" = "Wide petals", "FALSE" = "Narrow petals")

The labeller is then a function linking facet variables to their respective labels. We can make that link either using variable names or by referring to .rows or .cols. The latter is useful when your facet is an expression instead of a data variable. Here is what it looks like in action.

ggplot(iris, aes(x = Petal.Length)) +
  facet_grid((Petal.Width >= 1.5) ~ Species, 
             labeller = labeller(.rows = width_labs, Species = species_labs)) +
  geom_histogram() +
  scale_x_continuous("Petal length") +
  scale_y_continuous("Count")
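
The same variable-name approach should carry over to facet_wrap(). Below is a minimal sketch under that assumption; it redefines species_labs so it runs on its own, sets bins explicitly only to silence the default-bins message, and the name p_wrap is just for this example.

```r
library(ggplot2)

species_labs <- c("setosa" = "I. setosa",
                  "versicolor" = "I. versicolor",
                  "virginica" = "I. virginica")

# Link the Species facet variable to its labels by name
p_wrap <- ggplot(iris, aes(x = Petal.Length)) +
  facet_wrap(~ Species, labeller = labeller(Species = species_labs)) +
  geom_histogram(bins = 30) +
  scale_x_continuous("Petal length") +
  scale_y_continuous("Count")
p_wrap
```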

Annotations

In presentations, more so than in publications, it helps to give figures titles and occasionally subtitles. This is done with ggtitle(). Sometimes you might also want to annotate a particular part of the plot with additional text or shapes. This is done using annotation layers, added with annotate(). For a full description, head over to the docs. Below is an example of how to use a title, rectangle highlight and some text. Feel free to change the parameters! Does reordering annotations make a difference?

p_scatter <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  ggtitle("Title", subtitle = "Optional descriptive subtitle") +
  geom_point() +
  annotate("rect", xmin = 0.8, xmax = 2.1, ymin = 0, ymax = .7, alpha = 0.2) +
  annotate("text", x = 1.5, y = 0.85, label = "smaller flowers")
p_scatter

You can also use geoms to annotate your plots. Perhaps the most useful geom here is geom_abline(), which works in much the same way as abline() in base R. You will need to set the slope and intercept of the line you want to draw.

p_scatter +
  geom_abline(slope = 0.42, intercept = -0.36, linetype = "dotted")
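
You don’t have to type the slope and intercept by hand. If the line you want is a least-squares fit, one option is to take the coefficients straight from lm(). This is a minimal sketch under that assumption; the names fit and fit_coefs are my own.

```r
library(ggplot2)

# Least-squares fit of petal width on petal length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
fit_coefs <- coef(fit)  # named vector with "(Intercept)" and "Petal.Length"

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  geom_abline(slope = fit_coefs[["Petal.Length"]],
              intercept = fit_coefs[["(Intercept)"]],
              linetype = "dotted")
```

For iris, these coefficients round to the slope and intercept used above.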

If you want to get really meta, you can even annotate plots with other plots. These are called insets, and my friend and former colleague Clare West has a nifty tutorial for making them.


We haven’t had an exercise in a while, but by now you have the tools to almost recreate the two figures from the start of the tutorial. What is left is setting the theme…

Exercise 11: Everything but the theme. Can you make the two figures below?

Themes

Scales control aesthetic mappings, and annotations help us manually direct our audience’s attention to specific aspects of a plot. However, there are certain design choices that go into making a figure that are static and unrelated to aesthetics, geoms, or annotations. Background colour, text font, and legend position all fall into this category. These features are all controlled by the plot’s theme.

There are some themes already built into ggplot2. Among them are the default theme_gray(), its black & white equivalent theme_bw(), and theme_classic(), which you might recognise especially from scientific publications. We can add themes to a plot like we add anything else.

p_scatter