Descriptive Statistics
Theory
These are the solutions to the exercises in the Descriptive Statistics handout, which walks you through the basics of descriptive statistics and its parameters. The analyses presented here use data from the starwars
data set supplied through the dplyr
package, which has been saved as a .csv file. Keep in mind that there are probably many other ways to reach the same conclusions as presented in these solutions. I have prepared some slides for this session:
Data
Find the data for this exercise here. Do not worry about downloading it for now.
Packages
As you will remember from our lecture slides, the calculation of the mode in R
can either be achieved through some intense coding or simply by using the mlv(..., method="mfv")
function contained within the modeest
package (unfortunately, this package is out of date and can sometimes be challenging to install).
Consequently, it is now time for you to get familiar with how packages work in R
. Packages are the way by which R
is supplied with user-created and moderator-mediated functionality that exceeds the base applicability of R
. Many things you will want to accomplish in more advanced statistics are impossible without such packages, and even basic tasks such as data visualisation (dealt with in our next seminar) rely on R
packages.
If you want to get a package and its functions into R
there are two ways we will discuss in the following. In general, it pays to load all packages at the beginning of a coding document before any actual analyses happen (in the preamble) so you get a good overview of what the program is calling upon.
Basic Preamble
This is the most basic version of getting packages into R
and is widely practised and taught. Unsurprisingly, I am not a big fan of it.
First, you use function install.packages()
to download the desired package off dedicated servers (usually CRAN-mirrors) to your machine where it is then unpacked into a library (a folder that’s located in your documents section by default). Secondly, you need to invoke the library()
function to load the R
package you need into your active R
session. In our case of the package modeest
it would look something like this:
install.packages("modeest")
library(modeest)
The reason I am not overly fond of this procedure is that it is clunky, breaks easily through spelling mistakes, and quickly clutters your preamble if your analyses require a large number of packages. Additionally, install.packages() downloads a package again every time it is run, which you might want to avoid when your internet connection is bad and the package is already on your hard drive.
Advanced Preamble
There is a myriad of different preamble styles (just as there are tons of different, personalised coding styles). I present my preamble of choice here, but I do not claim that it is the most sophisticated one out there.
The way this preamble works is that it is structured around a user-defined function (something we will touch on later in our seminar series) which first checks whether a package is already downloaded and then installs (if necessary) and/or loads it into R
. This function is called install.load.package()
and you can see its specification down below (don’t worry if it doesn’t make sense to you yet - it is not supposed to at this point). Unfortunately, it can only ever be applied to one package at a time and so we need a workaround to make it work on multiple packages at once. This can be achieved by establishing a vector of all desired package names (package_vec
) and then applying (sapply()
) the install.load.package()
function to every item of the package name vector iteratively as follows:
# function to load packages and install them if they haven't been installed yet
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
  }
  require(x, character.only = TRUE)
}
# packages to load/install if necessary
package_vec <- c("modeest")
# applying function install.load.package to all packages specified in
# package_vec
sapply(package_vec, install.load.package)
## Loading required package: modeest
## modeest
## TRUE
Why do I prefer this? Firstly, it is much shorter than the basic method when dealing with many packages (which you will get into fast, I promise). Secondly, each package name is typed only once instead of twice, which halves the opportunities for typos. Thirdly, it does not download packages again that are already installed, which speeds up your processing time.
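The check at the heart of this preamble can also be looked at on its own. The following is a minimal sketch and not part of the preamble above: the helper name is_installed() is hypothetical, and it relies on requireNamespace(), which reports whether a package can be loaded without attaching it to the session or installing anything.

```r
# hypothetical helper (not part of the preamble above):
# returns TRUE/FALSE for each package name without installing or attaching anything
is_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

# "stats" and "utils" ship with R, so both should be reported as available
is_installed(c("stats", "utils"))
```

Running such a check first tells you which packages a call to install.load.package() would actually have to download.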
Loading the Excel data into R
Our data is located in the Data
folder and is called DescriptiveData.csv
. Since it is a .csv file, we can simply use the R
in-built function read.csv()
to load the data by combining the former two identifiers into one long string with a forward slash separating the two (the slash indicates a step down in the folder hierarchy). Given this argument, read.csv()
will produce an object of type data.frame
in R
which we want to keep in our environment and hence need to assign a name to. In our case, let that name be Data_df
(I recommend using endings to your data object names that indicate their type for easier coding without constant type checking):
# load the data file from the Data folder if you downloaded the data as a .csv:
# Data_df <- read.csv('Data/DescriptiveData.csv')
# alternatively, read the .csv directly from the URL:
Data_df <- read.csv("https://github.com/ErikKusch/Homepage/raw/master/content/courses/biostat101/Data/DescriptiveData.csv")
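To see what read.csv() returns without needing the download, the following sketch writes a tiny .csv file to R's temporary folder and reads it back. The file name toy.csv and the object names are assumptions made for illustration only; the two records are taken from the data shown further down.

```r
# build a path inside R's temporary folder (toy.csv is a made-up file name)
csv_path <- file.path(tempdir(), "toy.csv")

# write a two-row toy file; the second height is missing
writeLines(c("name,height", "Luke Skywalker,172", "Finn,NA"), csv_path)

# read.csv() returns a data.frame; the entry NA is read as a missing value
toy_df <- read.csv(csv_path)
class(toy_df)
toy_df
```

The same pattern applies to the real file: only the path (or URL) handed to read.csv() changes.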
What’s contained within our data?
Now that our data set is loaded into R
, we can finally try to make sense of it. Usually, this shouldn’t be something one has to do in R
but should be manageable through a project-/data-specific README file (we will cover this in our seminar on hypothesis testing and project planning). For now, however, we are stuck with pure exploration of our data set. Get your goggles on and let’s dive in!
Firstly, it always pays to assess the basic attributes of any data object (remember the Introduction to R
seminar):
- Name - we know the name (it is Data_df) since we named it that
- Type - we already know that it is a data.frame because we created it using the read.csv function
- Mode - this is an interesting one as it means having to subset our data frame
- Dimensions - crucial information about how many observations and variables are contained within our data set
Dimensions
Let’s start with the dimensions because these will tell us how many modes (these are object attribute modes and not descriptive parameter modes) to assess:
dim(Data_df)
## [1] 87 8
Using the dim()
function, we arrive at the conclusion that our Data_df
contains 87 rows and 8 columns. Since data frames are usually ordered as observations $\times$ variables, we can conclude that we have 87 observations and 8 variables at our hands.
You can arrive at the same point by using the function View()
in R
. I’m not showing this here because it does not translate well to paper and would make whoever chooses to print this waste paper.
Modes
Now it’s time to get the hang of the modes of the variable records within our data set. To do so, we have two choices: we can either subset the data frame by columns and apply the class()
function to each column subset or simply apply the str()
function to the data frame object. Here, str()
is preferable because it automatically breaks down the structure of R
-internal objects and hence saves us the sub-setting:
str(Data_df)
## 'data.frame': 87 obs. of 8 variables:
## $ name : chr "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int 172 167 96 202 150 178 165 97 183 182 ...
## $ mass : num 77 75 32 136 49 120 75 32 84 77 ...
## $ hair_color: chr "blond" "" "" "none" ...
## $ skin_color: chr "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr "blue" "yellow" "red" "yellow" ...
## $ birth_year: num 19 112 33 41.9 19 52 47 NA 24 57 ...
## $ gender : chr "male" "" "" "male" ...
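As it turns out, our data frame contains the 8 variables name, height, mass, hair_color, skin_color, eye_color, birth_year and gender, which are of integer
, numeric
and character
modes (in R versions before 4.0.0, read.csv() would by default have turned the character columns into factor columns).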
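The first option mentioned above, applying class() column by column, can be sketched with sapply(), which runs a function on every column of a data frame. The toy data frame below is an assumption for illustration and mimics three of the columns.

```r
# toy data frame mimicking three columns of the data
toy_df <- data.frame(
  name = c("Luke Skywalker", "C-3PO"),
  height = c(172L, 167L),
  mass = c(77, 75),
  stringsAsFactors = FALSE  # keep text as character, as in the str() output above
)

# apply class() to every column at once
col_classes <- sapply(toy_df, class)
col_classes
```

On the full data, sapply(Data_df, class) should report the same modes as the str() output, just without the preview of the records.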
Data Content
So what does our data actually tell us? Answering this question usually comes down to some analyses but for now we are only really interested in what kind of information our data frame is storing.
Again, this would be easiest to assess using a README file or the View()
function in R
. However, for the sake of brevity we can make do with the following two commands, which present the user with the first and last six rows of a data frame, respectively:
head(Data_df)
## name height mass hair_color skin_color eye_color birth_year
## 1 Luke Skywalker 172 77 blond fair blue 19.0
## 2 C-3PO 167 75 gold yellow 112.0
## 3 R2-D2 96 32 white, blue red 33.0
## 4 Darth Vader 202 136 none white yellow 41.9
## 5 Leia Organa 150 49 brown light brown 19.0
## 6 Owen Lars 178 120 brown, grey light blue 52.0
## gender
## 1 male
## 2
## 3
## 4 male
## 5 female
## 6 male
tail(Data_df)
## name height mass hair_color skin_color eye_color birth_year
## 82 Finn NA NA black dark dark NA
## 83 Rey NA NA brown light hazel NA
## 84 Poe Dameron NA NA brown light brown NA
## 85 BB8 NA NA none none black NA
## 86 Captain Phasma NA NA unknown unknown unknown NA
## 87 Padmé Amidala 165 45 brown light brown 46
## gender
## 82 male
## 83 female
## 84 male
## 85 none
## 86 female
## 87 female
The avid reader will surely have picked up on the fact that all the records in the name
column of Data_df
belong to characters from the Star Wars universe. In fact, this data set is a modified version of the starwars
data set supplied by the dplyr
package and contains information on many of the important characters of the Star Wars movie universe.
Parameters of descriptive statistics
Names
As it turns out (and should’ve been obvious from the outset if we’re honest), every major character in the cinematic Star Wars Universe has a unique name. Consequently, calculating parameters of descriptive statistics for the names of our characters makes no sense for the following two reasons:
- The name variable is of mode character/factor, so the only descriptive parameter it allows for is the mode
- Since every name only appears once, there is no mode
As far as the calculation of descriptive parameters of the name
variable of our data set is concerned, Admiral Ackbar said it best: “It’s a trap”.
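That every name occurs only once can be checked rather than assumed. The following sketch uses a made-up vector holding the first four names of the data; anyDuplicated() returns 0 when no value is repeated.

```r
# first four names of the data, used as a toy example
character_names <- c("Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader")

anyDuplicated(character_names)  # 0 means that no name is repeated
max(table(character_names))     # the highest frequency of any single name
```

Applied to Data_df$name, the same two calls would confirm (or refute) the uniqueness claim for the full data set.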
Height
Let’s get started on figuring out some parameters of descriptive statistics for the height
variable of our Star Wars characters.
Subsetting
First, we need to extract the data in question from our big data frame object. This can be achieved by indexing through the column name as follows:
Height <- Data_df$height
Location Parameters
Now, with our Height
vector being the numeric height records of the Star Wars characters in our data set, we are primed to calculate location parameters as follows:
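# prefixed object names avoid masking the base functions mean(), median(), min() etc.
Height_mean <- mean(Height, na.rm = TRUE)
Height_median <- median(Height, na.rm = TRUE)
Height_mode <- mlv(na.omit(Height), method = "mfv")
Height_min <- min(Height, na.rm = TRUE)
Height_max <- max(Height, na.rm = TRUE)
Height_range <- Height_max - Height_min
# Combining all location parameters into one vector for easier viewing
LocPars_vec <- c(Height_mean, Height_median, Height_mode, Height_min, Height_max, Height_range)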
names(LocPars_vec) <- c("mean", "median", "mode", "minimum", "maximum", "range")
LocPars_vec
## mean median mode minimum maximum range
## 174.358 180.000 183.000 66.000 264.000 198.000
As you can clearly see, there is a big range in the heights of our Star Wars characters, with mean and median lying noticeably apart due to outliers at either end of the height range.
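If modeest cannot be installed, the most frequent value can also be found with base R alone. This is a minimal sketch: the helper name most_frequent() is hypothetical, and it has not been compared against mlv() on the full data set.

```r
# hypothetical base R helper returning the most frequent value(s) of a vector
most_frequent <- function(x) {
  x <- x[!is.na(x)]                     # drop missing values
  values <- unique(x)                   # candidate values
  counts <- tabulate(match(x, values))  # how often each candidate occurs
  values[counts == max(counts)]         # all values sharing the highest count
}

most_frequent(c(183, 172, 183, NA, 96))  # a single mode
most_frequent(c("a", "b", "b", "a"))     # ties: both values are returned
```

Returning all tied values is a design choice; it makes it visible when a variable has more than one mode instead of silently picking one.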
Distribution Parameters
Now that we are aware of the location parameters of the Star Wars height records, we can move on to the distribution parameters/parameters of spread. Those can be calculated in R
as follows:
Height_var <- var(Height, na.rm = TRUE)
Height_sd <- sd(Height, na.rm = TRUE)
Height_quantiles <- quantile(Height, na.rm = TRUE)
# Combining all distribution parameters into one vector for easier viewing
DisPars_vec <- c(Height_var, Height_sd, Height_quantiles)
names(DisPars_vec) <- c("var", "sd", "0%", "25%", "50%", "75%", "100%")
DisPars_vec
## var sd 0% 25% 50% 75% 100%
## 1208.98272 34.77043 66.00000 167.00000 180.00000 191.00000 264.00000
Notice how some of the quantiles (actually quartiles in this case) link up with location parameters calculated earlier: the 0%, 50% and 100% quantiles are the minimum, median and maximum, respectively.
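These links between the parameters can be verified on a small example. The sketch below uses only the first ten height records shown in the str() output above (the object name heights is made up), so its numbers differ from those of the full data set.

```r
# first ten height records of the data
heights <- c(172, 167, 96, 202, 150, 178, 165, 97, 183, 182)

# the standard deviation is the square root of the variance
all.equal(sd(heights), sqrt(var(heights)))

# the 50% quantile is the median
all.equal(unname(quantile(heights, 0.5)), median(heights))

# the 0% and 100% quantiles are the minimum and the maximum
unname(quantile(heights, c(0, 1)))
range(heights)
```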
Summary
Just to round this off, have a look at what the summary()
function in R
supplies you with:
Height_summary <- summary(na.omit(Height))
Height_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 66.0 167.0 180.0 174.4 191.0 264.0
This is a nice assortment of location and dispersion parameters.
Mass
Now let’s focus on the weight of our Star Wars characters.
Subsetting
Again, we need to extract our data from the data frame. For the sake of brevity, I will only show the subsetting and then present the results of the analysis without repeating its code.
Mass <- Data_df$mass
Location Parameters
## mean median mode minimum maximum range
## 97.31186 79.00000 80.00000 15.00000 1358.00000 1343.00000
As you can see, there is a huge range in the weight records of Star Wars characters, and especially the outlier at the upper end (1358kg) pushes the mean upwards and away from the median. We’ve got Jabba Desilijic Tiure to thank for that.
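How strongly a single extreme record moves the mean while leaving the median almost untouched can be sketched with a handful of values. The vector below holds the first ten mass records from the str() output above, to which the 1358kg maximum is then added; the object names are made up for illustration.

```r
# first ten mass records of the data
mass_start <- c(77, 75, 32, 136, 49, 120, 75, 32, 84, 77)

# the same records plus the heaviest character in the data
mass_with_outlier <- c(mass_start, 1358)

c(mean = mean(mass_start), median = median(mass_start))
c(mean = mean(mass_with_outlier), median = median(mass_with_outlier))
```

The mean more than doubles once the outlier is included, whereas the median shifts by a single kilogram.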
Distribution Parameters
## var sd 0% 25% 50% 75% 100%
## 28715.7300 169.4572 15.0000 55.6000 79.0000 84.5000 1358.0000
Quite obviously, the wide range of weight records also prompts a large variance and standard deviation.
Hair Color
Hair colour is saved in column 4 of our data set, so when sub-setting the data frame to obtain information about a character’s hair colour, instead of calling on Data_df$hair_color
we can also index by column number as follows:
HCs <- Data_df[, 4]
Of course, hair colour is not a numeric
variable and is much better represented by being of mode factor
. Therefore, we are unable to obtain most parameters of descriptive statistics, but a frequency count as follows allows us to read off the mode:
table(HCs)
## HCs
## auburn auburn, grey auburn, white black
## 5 1 1 1 13
## blond blonde brown brown, grey grey
## 3 1 18 1 1
## none unknown white
## 37 1 4
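In the output above, the largest count (37) belongs to none, while the unlabelled first count of 5 belongs to records with an empty hair colour entry. Reading the mode off a table can also be left to R. The following sketch uses a short made-up vector (the first six records of the data plus two invented ones) and treats empty strings as missing values, which is an assumption about how one may want to handle them.

```r
# toy hair colour records; the last two entries are invented for illustration
hair <- c("blond", "", "", "none", "brown", "brown, grey", "none", "none")

hair[hair == ""] <- NA      # assumption: an empty entry means "not recorded"
hair_counts <- table(hair)  # table() leaves out NA by default

# the category (or categories) with the highest count is the mode
names(hair_counts)[hair_counts == max(hair_counts)]
```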
Eye Colour
Eye colour is another factor
mode variable:
ECs <- Data_df$eye_color
We can only calculate the mode by looking for the maximum in our table()
output:
table(ECs)
## ECs
## black blue blue-gray brown dark
## 10 19 1 21 1
## gold green, yellow hazel orange pink
## 1 1 3 8 1
## red red, blue unknown white yellow
## 5 1 3 1 11
Birth Year
Subsetting
As another numeric
variable, birth year allows for the calculation of the full range of parameters of descriptive statistics:
BY <- Data_df$birth_year
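Keep in mind that Star Wars operates on a different time reference scale than we do: birth years are given in years before the Battle of Yavin (BBY), so larger values mean older characters.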
Location Parameters
## mean median mode minimum maximum range
## 87.56512 52.00000 19.00000 8.00000 896.00000 888.00000
Again, there is a big disparity here between mean and median which stems from extreme values at both ends of the age spectrum (Yoda at the upper end and Wicket Systri Warrick at the lower end).
Distribution Parameters
## var sd 0% 25% 50% 75% 100%
## 23929.4414 154.6914 8.0000 35.0000 52.0000 72.0000 896.0000
Unsurprisingly, there is a big variance and standard deviation for the observed birth year/age records.
Gender
Gender is another factor
mode variable (obviously):
Gender <- Data_df$gender
We can, again, only judge the mode of our data from the output of the table()
function:
table(Gender)
## Gender
## female hermaphrodite male none
## 3 19 1 62 2