Data Visualisation
Theory
These are the solutions to the exercises contained within the handout to Data Visualisation which walks you through the basics of data visualisation in R
using ggplot2
. The plots presented here are using data from the iris
data set supplied through the datasets
package. Keep in mind that there is probably a myriad of other ways to reach the same conclusions as presented in these solutions. I have prepared some slides for this session:
Data
This practical makes use of R-internal data so you don’t need to download anything extra today.
Packages
Recall the exercise that went along with the last seminar (Descriptive Statistics) where we learnt the difference between a basic and advanced preamble for package loading in R
. Here (and in future exercises) I will only supply you with the advanced version of the preamble.
Now let’s load the ggplot2
package into our R
session so we’ll be able to use its functionality for data visualisation as well as the datasets
package to get the iris
data set.
# function to load packages and install them if they haven't been installed yet
install.load.package <- function(x) {
if (!require(x, character.only = TRUE))
install.packages(x)
require(x, character.only = TRUE)
}
# packages to load/install if necessary
package_vec <- c("ggplot2", "datasets")
# applying function install.load.package to all packages specified in package_vec
sapply(package_vec, install.load.package)
## Loading required package: ggplot2
## ggplot2 datasets
## TRUE TRUE
Loading R
-internal data sets (iris
)
The iris
data set is included in the datasets
package in R
. An R
-internal data set is loaded through the command data()
. Take note that you do not have to assign this command’s output to a new object (via <-
). Instead, the dataset is loaded to your current environment by its name (iris, in this case). Keep in mind that this can override objects of the same name that are already present in your current session of R
.
data("iris")
Inspect the data set
Since we know that iris
is a dataset, we can be reasonably sure that this object will be complex enough to warrant using the str()
function for inspection:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The iris
dataset contains four measurements (Sepal.Length
, Sepal.Width
, Petal.Length
, Petal.Width
) for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica).
Boxplot of Petal.Length
by Species
ggplot(iris, # the data set
aes(x=Species, y=Petal.Length) # aesthetics
) + geom_boxplot() + # this is the end of the bare minimum plot
theme_bw() + labs(title="Petal Length of three different species of Iris")
THis boxplot shows us exactly how the distributions of petal length measurements of our three species of Iris are differing from one another. Despite the obvious trend in the data, be sure not to report results through figures alone! We will find out how to test whether the pattern we can observe here holds up to scrutiny at a later point in time of our seminars.
Scatterplot of Petal.Length
and Petal.Width
ggplot(iris, # the data set
aes(x=Petal.Width, y=Petal.Length) # aesthetics
) + geom_point() + # this is the end of the bare minimum plot
theme_bw() + labs(title="Petal Width and Petal Length of three different species of Iris")
Scatterplot of Petal.Length
and Petal.Width
grouped by Species
ggplot(iris, # the data set
aes(x=Petal.Width, y=Petal.Length, colour = Species) # aesthetics
) + geom_point() + # this is the end of the bare minimum plot
theme_bw() + labs(title="Petal Width and Petal Length of three different species of Iris") +
theme(legend.justification=c(1,0), legend.position=c(1,0)) + # legend inside
scale_color_discrete(name="Iris Species") # Change legend title
Relationship of Sepal.Length
and Sepal.Width
ggplot(iris, # the data set
aes(x=Sepal.Width, y=Sepal.Length) # aesthetics
) + geom_point() + geom_smooth() + # this is the end of the bare minimum plot
theme_bw() + labs(title="Petal Width and Petal Length of three different species of Iris")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Relationship of Sepal.Length
and Sepal.Width
(grouped by Species
)
ggplot(iris, # the data set
aes(x=Sepal.Width, y=Sepal.Length, colour = Species) # aesthetics
) + geom_point() + geom_smooth() + # this is the end of the bare minimum plot
theme_bw() + labs(title="Petal Width and Petal Length of three different species of Iris") +
theme(legend.justification=c(1,0), legend.position=c(1,0)) + # legend inside
scale_color_discrete(name="Iris Species") # Change legend title
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'