DATA VISUALISATION
Erik Kusch
erik.kusch@i-solution.de
Section for Ecoinformatics & Biodiversity
Center for Biodiversity and Dynamics in a Changing World (BIOCHANGE)
Aarhus University
Aarhus University Biostatistics - Why? What? How? 1 / 52
1 Introduction
Overview
2 Tables
Using Tables
Table Types
3 Plots
Using Plots
How To Make A Plot In R
Plot Types
4 Exercise
R-internal data sets
Making plots
Introduction Overview
Means to Visualisation
Methods of data visualisation are manifold:
Tables:
Data Tables
Frequency Tables
Stem And Leaf Displays
Text-based descriptions of data:
Only applicable to minute data sets
Not used extensively
Plots:
Pie Charts
Scatter plots, Line Graphs
Bar Charts, Histograms, Frequency Polygons
Box plots
Contour Plots, 3-D Plots
...
We will not be covering text-based data summaries here.
Tables Using Tables
Table Etiquette
Tables are useful data summary and visualisation tools.
Etiquette in table making:
Vertical lines are used sparingly
Horizontal lines are used frequently
Table captions are placed above the table they belong to
Making tables directly in R can be difficult. Assuming you use LaTeX for writing
manuscripts (which you really should try if you haven’t yet):
LaTeX directly
The Excel2LaTeX add-in
(https://www.ctan.org/tex-archive/support/excel2latex/)
Various R packages (e.g. xtable)
R Markdown for writing manuscripts
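As a minimal sketch of the xtable route, assuming the xtable package is installed, the following produces LaTeX code for a table with its caption above the table and with horizontal rules only. The use of the built-in iris data, the caption text and the choice of the first six rows are illustrative assumptions. The booktabs option additionally requires \usepackage{booktabs} in the LaTeX preamble.

```r
library(xtable)  # assumed to be installed: install.packages("xtable")

# Illustrative data: the first six rows of the built-in iris data set
tab <- xtable(head(iris),
              caption = "First six rows of the iris data set.")

# caption.placement = "top" puts the caption above the table;
# booktabs = TRUE draws horizontal rules only (\toprule, \midrule, \bottomrule)
print(tab,
      caption.placement = "top",
      booktabs = TRUE,
      include.rownames = FALSE)
```

The printed LaTeX code can be pasted into a manuscript or written to a file via the file argument of print().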
Tables Table Types
Data Tables
Can accommodate all kinds of data.
For publications:
Great way to summarise and
present data
Can be used to present a list of
definitions
Newton, A. C. (2016) ’Biodiversity Risks of Adopting Resilience as a Policy
Goal’, Conservation Letters, 9(October), pp. 369-376. doi:
10.1111/conl.12227.
For behind-the-scenes work:
Still a great way to summarise
and present data
Data management, mining and
exploration relies on tables (more
on this in seminar 7)
Salzmann, U. et al. (2008) ’A new global biome reconstruction and
data-model comparison for the Middle Pliocene’, Global Ecology and
Biogeography, 17(3), pp. 432-447. doi: 10.1111/j.1466-8238.2008.00381.x.
Frequency Tables
Only accommodate frequency counts.
For publications:
Rarely ever used in publications
Applicable for publication of
appendices and manuscripts of
theses
For behind-the-scenes work:
Used extensively internally in R
Basis for many plotting
approaches
Lotsch, A. (1996) Biome level classification of land cover at continental scales
using decision trees. Free University of Berlin. Available at:
http://cliveg.bu.edu/download/thdis/alotsch.MA.pdf.
Note that the table caption is misplaced on this table.
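Frequency tables are quick to produce in base R. The sketch below uses the built-in mtcars data set (an illustrative choice, not taken from the slides) and counts how many cars have each number of cylinders.

```r
# Frequency table: number of cars per cylinder count (built-in mtcars data)
cyl_counts <- table(mtcars$cyl)
cyl_counts

# The same counts expressed as proportions of the total
cyl_props <- prop.table(cyl_counts)
round(cyl_props, 2)
```

Objects like cyl_counts are what many plotting functions, such as bar charts and histograms, build on internally.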
Stem And Leaf Displays
Accommodate frequency/count data.
For publications:
Pretty outdated
Usually only included in books
and course material
For behind-the-scenes work:
Of no particular use when
considering small or excessively
big data sets
Can be helpful in data exploration
of medium-sized data sets
Lane, D. M. (2009) Introduction To Statistics, Introduction to Statistics. doi:
10.1016/B978-0-12-370483-2.00006-0.
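For data exploration, base R offers the stem() function. The following sketch applies it to the built-in faithful data set, which is an illustrative choice of a medium-sized data set rather than one used in the slides.

```r
# Stem-and-leaf display of eruption durations (built-in faithful data)
eruptions <- faithful$eruptions
stem(eruptions)

# The scale argument stretches the display over more (or fewer) rows
stem(eruptions, scale = 2)
```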
Plots Using Plots
Plot Etiquette
Plots are extremely useful data summary and visualisation tools.
Etiquette in plot making:
Less is more (strive for simplicity)
Figure captions are placed below the figure they belong to
Making plots directly in R entails a learning curve and there is a heavy debate
about how to do the plotting:
Using base R:
Can be cumbersome
Relies on same commands as
basic R coding
Using ggplot:
Extremely powerful
Relies on ggplot specific
commands
The Good, The Bad, And The Ugly
Good plots:
Clearly legible labels
Clean look
Concise caption
Bad plots:
Convoluted display
Overlapping plotting symbols
Overly complicated caption
Ugly plots:
Photos (a no-go for publications, OK for presentations, and very good for
keeping track of complex set-ups for yourself and to aid memory when doing
field work)
Awkward legend
Awkward labelling (e.g. obvious R internal naming)
Excel figures
The Good
De Boeck, H. J. et al. (2017) ’Patterns and drivers of biodiversity-stability relationships under climate extremes’, Journal of Ecology, (October), pp. 1-13. doi:
10.1111/1365-2745.12897.
The Bad
De Boeck, H. J. et al. (2017) ’Patterns and drivers of biodiversity-stability relationships under climate extremes’, Journal of Ecology, (October), pp. 1-13. doi:
10.1111/1365-2745.12897.
The Ugly
Andrady, A. L. (2011) ’Microplastics in the marine environment’, Marine Pollution Bulletin. Pergamon, 62(8), pp. 1596-1605. doi:
10.1016/J.MARPOLBUL.2011.05.030.
The Ugly
Kanhai, L. D. K. et al. (2017) ’Microplastic abundance, distribution and composition along a latitudinal gradient in the Atlantic Ocean’, Marine Pollution
Bulletin, 115(1-2), pp. 307-314. doi: 10.1016/j.marpolbul.2016.12.025.
Plots How To Make A Plot In R
ggplot Overview
These seminars will focus on how to create plots with ggplot instead of
teaching you data visualisation using base R commands.
Why we use ggplot:
It is extremely powerful
It is becoming the norm
Even base graphics look good
Why ggplot can frustrate you:
You need to memorise specific
commands
Certain objects in R are not
compatible with ggplot yet
It may be unintuitive at first
If you need an introduction to base plot, go here:
https://biostats.w.uib.no/topics/r/r-7-making-plot-learn-the-basics/.
How To Make A Plot In R (using ggplot)
The ggplot() function considers three basic components to a plot:
Data set - where to get the data to be plotted from
ggplot(diamonds)
Aesthetics - what variables should be used
ggplot(diamonds, aes(x=carat, y=price))
Layers/Geometry - what kind of plot to produce
ggplot(diamonds,aes(x=carat, y=price)) + geom_point()
You can find a ggplot cheatsheet in the course repository
(https://github.com/ErikKusch/An-Introduction-to-Biostatistics-Using-R).
How To Make A Plot In R (Basic Scatterplot)
We start off by plotting data from the diamonds data set that comes with the
ggplot2 package. We will assess how the carat and price of individual
diamonds relate to each other.
library(ggplot2)
p <- ggplot(diamonds, # the data set
aes(x=carat, y=price) # aesthetics
) + geom_point() # geometry
p
[Figure: scatterplot of price (y-axis) against carat (x-axis)]
How To Make A Plot In R (Labelling Axes and Title) I
A good plot always includes a title and sports some fancy axis labels:
library(ggplot2)
p <- p + labs(title="Scatterplot", x="Carat", y="Price")
p
[Figure: the same scatterplot, now titled "Scatterplot", with axes labelled Carat and Price]
How To Make A Plot In R (Labelling Axes and Title) II
Sometimes, you may wish to customise axes even further
p <- p + theme(axis.text.x = element_text(face="bold", color="#993333",
size=14, angle=45),
axis.text.y = element_text(face="bold", color="#993333",
size=14, angle=45))
p
[Figure: the scatterplot with bold, coloured axis tick labels rotated by 45 degrees]
How To Make A Plot In R (Symbols and Colours I)
Colours are a great way of adding information to the plot. In this case, we want
to visualise the quality of the cut of each diamond:
p <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point(shape = 4) +
labs(title="Scatterplot", x="Carat", y="Price")
p
[Figure: scatterplot of Price against Carat, points coloured by cut (Fair, Good, Very Good, Premium, Ideal)]
How To Make A Plot In R (Symbols and Colours II)
Symbols: Colours:
Hex colour codes for most precise colour specifications
(https://www.color-hex.com/)
Name specification for easiest coding
(http://sape.inf.usi.ch/quick-reference/ggplot2/colour)
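A short sketch of both ways of specifying colour, combined with two plotting symbols. The particular shape numbers and colours are arbitrary illustrative choices. In R, shape 17 is a filled triangle and shape 1 an open circle.

```r
library(ggplot2)

# Colour given as a hex code; shape 17 is a filled triangle
p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(shape = 17, colour = "#993333")

# Colour given by name; shape 1 is an open circle
p_named <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(shape = 1, colour = "steelblue")

p_hex
p_named
```

Setting shape and colour outside of aes() applies them to all points. Placing them inside aes() maps them to a variable instead.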
How To Make A Plot In R (Themes)
ggplot provides you with a set of themes for easy and quick adjustment of
basic plotting components:
p <- p + theme_bw() + theme(plot.title=element_text(size=20, face="bold"),
axis.text.x=element_text(size=15), axis.text.y=element_text(size=15),
axis.title.x=element_text(size=15), axis.title.y=element_text(size=15))
p
[Figure: the coloured scatterplot drawn with theme_bw() and enlarged title and axis text]
How To Make A Plot In R (Legend)
Legends are added automatically when colours are used but may not satisfy
the user:
p <- p +
theme(legend.justification=c(1,0), legend.position=c(1,0)) + # legend inside
scale_color_discrete(name="Cut Quality") # Change legend title
p
[Figure: the scatterplot with the legend, now titled "Cut Quality", placed inside the plotting area at the bottom right]
How To Make A Plot In R (Complex Plots)
Sometimes, you may want to show complex information that still includes the
base data:
p <- p + geom_smooth()
p
[Figure: the scatterplot with a smoothed trend line added for each cut quality]
How To Make A Plot In R (Saving Graphs)
Graphs can be saved either via the ggsave() function, which saves the most
recently displayed plot unless a plot is specified:
ggsave(filename = "Savedplot.jpg",
width = 10, height = 10, units = "cm") # units must be a quoted string
or via the drop-down menu in the Files and Plots pane in RStudio.
Combining plots to appear in sets of any given number is done using the
grid.arrange() command contained within the gridExtra package. For example,
grid.arrange(plot1, plot2, ncol=2)
will result in a plotting environment in which the plots (plot1 and plot2)
are arranged side by side.
These can be saved to the hard drive as follows:
ggsave(filename = "Savedplot.jpg",
plot = arrangeGrob(plot1, plot2, ncol=2))
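The following self-contained sketch puts these steps together. It assumes the gridExtra package is installed. The two plots, the mtcars data, the temporary output folder and the PDF format (chosen here because it needs no additional graphics libraries) are illustrative assumptions.

```r
library(ggplot2)
library(gridExtra)   # assumed to be installed

# Two illustrative plots based on the built-in mtcars data
plot1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
plot2 <- ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 10)

# Arrange the plots side by side without drawing them
combined <- arrangeGrob(plot1, plot2, ncol = 2)

# Save to a temporary folder (hypothetical location; use your own path)
out_file <- file.path(tempdir(), "Savedplot.pdf")
ggsave(filename = out_file, plot = combined,
       width = 20, height = 10, units = "cm")
file.exists(out_file)
```

ggsave() picks the file format from the file extension, so changing the extension is enough to switch formats.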
Plots Plot Types
Creating Some Data
For some of the following plotting methods, we will need the following data:
set.seed(42) # making the code reproducible
data_vec <- rnorm(mean = 20, sd = 2, n = 54)
matrix(data_vec, nrow = 6)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 22.74 23.02 17.22 15.12 23.79 20.91 18.43 21.52 19.14
## [2,] 18.87 19.81 19.44 22.64 19.14 21.41 18.30 18.55 21.31
## [3,] 20.73 24.04 19.73 19.39 19.49 22.07 15.17 17.26 20.64
## [4,] 21.27 19.87 21.27 16.44 16.47 18.78 20.07 20.87 18.43
## [5,] 20.81 22.61 19.43 19.66 20.92 21.01 20.41 18.38 23.15
## [6,] 19.79 24.57 14.69 22.43 18.72 16.57 19.28 22.89 21.29
Pie Charts In Practice
Accommodates frequency/count data.
For publications:
Can have merit when used to
show proportions.
Used seldom.
For behind-the-scenes work:
Can give some initial insight on
data properties.
Other plot types are usually
preferable.
Only really useful for showing
proportions and even then line graphs
may be more useful.
Anderson, T. (2007) Biology of the Ubiquitous House Sparrow: From Genes to
Populations, Biology of the Ubiquitous House Sparrow: From Genes to
Populations. doi: 10.1093/acprof:oso/9780195304114.001.0001.
Pie Charts In R
df <- data.frame(slices = c(4,9,16),
names = c("A", "B", "C"))
ggplot(df,
aes(x="", y = slices,
fill = names)) +
geom_bar(width = 1,
stat = "identity") +
coord_polar("y", start=0) +
theme_void() +
labs(title = "Pie Chart")
[Figure: pie chart titled "Pie Chart" with slices A, B and C]
Scatterplots In Practice
Accommodates all kinds of data.
For publications:
Great way of presenting
unaltered data.
Used extremely often.
For behind-the-scenes work:
Perfect method for data
exploration and data mining.
Used in almost every analysis.
Unavoidable data visualisation tool.
Scheffer, M. et al. (2012) ’Thresholds for boreal biome transitions.’,
Proceedings of the National Academy of Sciences of the United States of
America, 109(52), pp. 21384-9. doi: 10.1073/pnas.1219844110.
Scatterplots In R
df <- data.frame(
Data = data_vec,
Sequence = 1:length(data_vec))
ggplot(df,
aes(x=Sequence,
y = Data)) +
geom_point() +
theme_classic() +
labs(title = "Scatterplot")
[Figure: scatterplot titled "Scatterplot" of Data (y-axis) against Sequence (x-axis)]
Line Graphs In Practice
Accommodates continuous data.
For publications:
Often used as a logical
conclusion to emerging trends in
scatter plots.
Used fairly often, especially
when showing relationships.
For behind-the-scenes work:
Scatter plots may suffice.
When causal links between
variables are the goal, then these
are the way to go.
Papagiannopoulou, C. et al. (2017) ’A non-linear Granger-causality framework
to investigate climate-vegetation dynamics’, Geoscientific Model
Development, 10(5), pp. 1945-1960. doi: 10.5194/gmd-10-1945-2017.
Remember to use line graphs only if continuity is actually implied.
Line Graphs In R
df <- data.frame(
Data = sort(data_vec),
Sequence = 1:length(data_vec))
ggplot(df,
aes(x=Sequence,
y = Data)) +
geom_line() +
theme_classic() +
labs(title = "Line Plot")
[Figure: line graph titled "Line Plot" of the sorted Data (y-axis) against Sequence (x-axis)]
Bar Charts In Practice
Accommodates count data.
For publications:
Mostly used when data can be
arranged into distinct groups.
Used seldom.
For behind-the-scenes work:
Can be helpful in data exploration
but usually falls short of other
methods.
Useful for classifications.
Harris, A., Carr, A. S. and Dash, J. (2014) ’Remote sensing of vegetation
cover dynamics and resilience across southern Africa’, International Journal
of Applied Earth Observation and Geoinformation. Elsevier B.V., 28(1), pp.
131-139. doi: 10.1016/j.jag.2013.11.014.
Bar Charts In R
df <- data.frame(slices = c(4,9,16),
names = c("A", "B", "C"))
ggplot(df,
aes(x=names,
y = slices)) +
geom_bar(width = .5,
stat = "identity") +
theme_void() +
labs(title = "Bar Chart")
[Figure: bar chart titled "Bar Chart"]
Histograms In Practice
Accommodates frequency count data.
For publications:
Great way of presenting data
distributions.
Used extensively.
For behind-the-scenes work:
Almost unavoidable in data
exploration and assumption
checking.
Used to assess and understand data
distributions.
Scheffer, M. et al. (2012) ’Thresholds for boreal biome transitions.’,
Proceedings of the National Academy of Sciences of the United States of
America, 109(52), pp. 21384-9. doi: 10.1073/pnas.1219844110.
Histograms In R
ggplot() + aes(data_vec)+
geom_histogram(binwidth=1,
colour="black",
fill="white")
[Figure: histogram of data_vec (x-axis) with count on the y-axis]
Frequency Polygon In Practice
Accommodates frequency count data.
For publications:
May be used as the logical
conclusion to histogram displays.
Used rather sparingly due to a
possible masking effect.
For behind-the-scenes work:
You may wish to use this to add
more information to the plot
besides the distribution.
Histograms usually suffice.
McGill, B. J. et al. (2006) ’Rebuilding community ecology from functional
traits’, Trends in Ecology and Evolution, 21(4), pp. 178-185. doi:
10.1016/j.tree.2006.02.002.
Used to assess and understand data distributions.
Frequency Polygon In R
ggplot() + aes(data_vec) +
geom_freqpoly()
[Figure: frequency polygon of data_vec (x-axis) with count on the y-axis]
Dendrograms In Practice
Accommodates classification data.
For publications:
Used almost exclusively to
portray phylogenetic relationships.
Applicable to all clustering
approaches.
For behind-the-scenes work:
Intuitive display of data groups.
Coloured scatter plots may
outperform dendrograms in
certain situations.
Great to visualise hierarchical
clustering approaches.
Anderson, T. (2007) Biology of the Ubiquitous House Sparrow: From Genes to
Populations, Biology of the Ubiquitous House Sparrow: From Genes to
Populations. doi: 10.1093/acprof:oso/9780195304114.001.0001.
Dendrograms In R
ggplot can’t handle certain objects (such as these hierarchical clusters):
library(vegan)
dist_mat <- vegdist(
matrix(data_vec[1:25],
nrow=5))
clust <- hclust(
d = dist_mat,
method="single")
plot(clust)
[Figure: cluster dendrogram of dist_mat (hclust, single linkage) with Height on the y-axis]
Boxplots In Practice
Accommodates numerical data.
For publications:
Immensely useful data
visualisation tool to represent
parameters of groups of data.
Used very frequently.
For behind-the-scenes work:
Always nice for data exploration.
Hard to avoid (not that you’d want
to).
Used to present basic parameters of
descriptive statistics.
Smith, A. P. et al. (2017) ’Shifts in pore connectivity from precipitation versus
groundwater rewetting increases soil carbon loss after drought’, Nature
Communications. Springer US, 8(1), p. 1335. doi:
10.1038/s41467-017-01320-x.
Boxplots In Theory
Box plots are less intuitive than other plotting displays:
Contained information:
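The whiskers, which extend to the
most extreme data points within
1.5 times the interquartile range
of the box edges (for normally
distributed data, this range covers
roughly 99.3% of the data).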
The cut-point for Quartile 1 and 3
(these are the outer edges of the
box, so 50% of the data fall inside
the box).
The Median, usually represented
by a bold line inside the box,
because its behaviour is robust
(more so than that of the mean).
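The components of a box plot can be computed by hand. The sketch below assumes the common convention, also the ggplot2 default, of whiskers reaching at most 1.5 times the interquartile range beyond the box. It uses simulated data with the same settings as elsewhere in these slides, and all variable names are illustrative. The last lines check the share of a normal distribution that falls between the whisker limits.

```r
set.seed(42)                           # reproducible example data
x <- rnorm(n = 54, mean = 20, sd = 2)

q1  <- unname(quantile(x, 0.25))       # lower edge of the box
med <- median(x)                       # line inside the box
q3  <- unname(quantile(x, 0.75))       # upper edge of the box
iqr <- q3 - q1                         # height of the box

# Fences: 1.5 x IQR beyond the box edges
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Observations outside the fences are drawn as individual points
outliers <- x[x < lower_fence | x > upper_fence]

# Share of a normal distribution that lies between the fences
z_fence  <- qnorm(0.75) + 1.5 * (qnorm(0.75) - qnorm(0.25))
coverage <- pnorm(z_fence) - pnorm(-z_fence)
round(coverage, 3)
```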
Boxplots In R
You can also use geom_violin() for violin plots, which depict the data in
much the same way as box plots.
location <- as.factor(
c(rep("A",27),
rep("B",27)))
data_df <- data.frame(
data_vec,location)
ggplot(data_df,
aes(x = location,
y = data_vec)) +
geom_boxplot()
[Figure: box plots of data_vec (y-axis) for locations A and B (x-axis)]
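A violin plot of the same kind of data could be sketched as follows. The data are re-created here so that the block runs on its own. Overlaying a narrow box plot is an optional illustrative addition, not part of the original slides.

```r
library(ggplot2)

set.seed(42)   # reproducible example data
data_df <- data.frame(
  data_vec = rnorm(n = 54, mean = 20, sd = 2),
  location = factor(rep(c("A", "B"), each = 27))
)

p_violin <- ggplot(data_df, aes(x = location, y = data_vec)) +
  geom_violin() +               # outline shows the density of the data
  geom_boxplot(width = 0.1)     # narrow box plot inside each violin
p_violin
```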
Contour Plots In Practice
Accommodates all kinds of data.
For publications:
More complicated to understand.
Used sparingly.
For behind-the-scenes work:
You might as well include it in
your final manuscript if you bother
coming up with one.
Used to understand the
relationship of variables in a
classification setting.
Brewer, M. J. et al. (2016) ’Plateau: A new method for ecologically plausible
climate envelopes for species distribution modelling’, Methods in Ecology and
Evolution, pp. 1489-1502. doi: 10.1111/2041-210X.12609.
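A contour plot can be sketched in ggplot with geom_contour(). The example uses the faithfuld data set that ships with ggplot2, which is an illustrative choice and not data used in these slides. It holds a gridded density estimate for the Old Faithful eruption data.

```r
library(ggplot2)

# faithfuld holds a density value for each combination of waiting and eruptions
p_contour <- ggplot(faithfuld,
                    aes(x = waiting, y = eruptions, z = density)) +
  geom_contour() +              # lines connect points of equal density
  theme_bw() +
  labs(title = "Contour Plot")
p_contour
```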
3-D Plots In Practice
Accommodates all kinds of data.
For publications:
More than 2 dimensions translate
badly to paper.
Used extremely rarely.
For behind-the-scenes work:
Good for data exploration.
Especially useful when inspecting
PCA (Principal Component
Analysis) results.
Used to understand the
relationship of variables in a
classification setting.
Díaz, S. et al. (2015) ’The global spectrum of plant form and function’, Nature.
Nature Publishing Group, 529(7585), pp. 167-171. doi: 10.1038/nature16489.
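ggplot itself does not draw 3-D surfaces, so the sketch below falls back on the base R function persp(). The built-in volcano elevation matrix, the viewing angles and the axis labels are illustrative assumptions.

```r
# volcano is a built-in 87 x 61 matrix of elevations
view_matrix <- persp(z = volcano,
                     theta = 30, phi = 30,   # viewing angles in degrees
                     expand = 0.5,           # shrink the vertical axis
                     xlab = "Row", ylab = "Column", zlab = "Height")
```

Changing theta and phi rotates the surface, which is often necessary before a 3-D display becomes readable.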
The Hat Goes Deeper!
There are way more plot types that you may want to use at some point.
Flow charts to illustrate your workflow, for example:
Seddon, A. W. R. et al. (2016) ’Sensitivity of global terrestrial ecosystems to climate variability.’, Nature, 531(7593), pp. 229-232. Available at:
http://dx.doi.org/10.1038/nature16986.
Exercise R-internal data sets
The data that comes with R
R is supplied with in-built data sets and more data sets will be added to your
local library when you install additional packages. These data sets are
immensely useful in creating minimal working examples (MWEs) which
show how something works (or doesn’t) with the least amount of code possible.
You can retrieve all available data sets in your library using the command
data() and load any of the given data sets by adding the name of the data
set as the argument to the data() function.
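A minimal sketch of both uses of data(). The faithful data set is picked here purely for illustration.

```r
# List the data sets available in the currently attached packages
available <- data()
head(available$results[, c("Package", "Item")])

# Load one data set by name and inspect it
data("faithful")
str(faithful)
head(faithful)
```

Assigning the result of data() to an object avoids opening the listing in a separate viewer and lets you search it like any other matrix.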
Exercise Making plots
Creating plots with R
Your ToDo-List for this exercise:
Load the R-internal iris data set (it is included in the datasets
package)
Inspect the data set
Produce a boxplot of Petal.Length by Species
Produce a scatterplot of Petal.Length and Petal.Width
Produce a scatterplot of Petal.Length and Petal.Width grouped by Species
Produce a plot of your choice to show the relationship of
Sepal.Length and Sepal.Width
Produce a plot of your choice to show the relationship of
Sepal.Length and Sepal.Width when grouped by Species
Play around with other combinations of variables and plotting types