BASIC STATISTICS FOR BIOLOGISTS
Erik Kusch
erik.kusch@i-solution.de
Section for Ecoinformatics & Biodiversity
Center for Biodiversity and Dynamics in a Changing World (BIOCHANGE)
Aarhus University
Aarhus University Biostatistics - Why? What? How? 1 / 39
1 What To Expect
The Seminars
Course Resources and Reading
2 The Importance Of Proper Statistics
The Consequences Of Bad Statistics
What Are Bad Statistics?
Statistical Concern On The Rise
Further benefits of a statistical background
3 Terminology
Classifying Statistics
Basic Vocabulary
4 Introduction To R
Why Use R?
The R landscape
Layouts
Coding
What To Expect The Seminars
Course Dates & Outline I
Block I - Theory and Basics of R
Date Time Topic Location
I.) Introduction
Date Time (1) An Introduction to Basic Statistics for Biologists Location
Date Time (2) Introduction to R Location
II.) Basic statistical terminology
Date Time (3) A Primer for Statistical Tests Location
Date Time (4) Descriptive Statistics Location
Date Time (5) Data Visualisation Location
Date Time (6) Inferential Statistics, Hypotheses and our Research Project Location
Course Dates & Outline II
Block II - Basic Statistics in R
Date Time Topic Location
III.) Handling Data
Date Time (7) Data Handling and Data Mining Location
IV.) Non-parametric tests
Date Time (8) Nominal Tests Location
Date Time (9) Correlation Tests Location
Date Time (10) Ordinal and Metric Tests for two-sample situations Location
Date Time (11) Ordinal and Metric Tests for more than two-sample situations Location
V.) Parametric tests
Date Time (12) Simple Parametric Tests Location
VI.) Closing
Date Time (13) Summary and an Outlook on Advanced Statistics Location
Learning Goals
1 A solid grasp of basic biostatistics
Have an overview of available methods
Be able to judge the applicability of individual methods
2 Basic proficiency in using R
Know base commands and how they function
Be able to prepare biologically relevant data sets for further analysis
Be able to apply basic statistical methods to biologically relevant data sets
3 Research Design
Understand how to formulate testable hypotheses
Know the importance of proper statistical approaches in research
Be able to critically assess statistical methods in research publications
Learning Methods
We will:
Cover useful theory of biostatistics (lecture style)
Run biostatistical analyses in R (seminar style)
Work through basic biostatistical methods in a research project using simulated data
Fully reproducible analyses (https://github.com/ErikKusch/An-Introduction-to-Biostatistics-Using-R)
We will focus heavily on actually doing the statistics!
What To Expect Course Resources and Reading
Let Me Introduce Myself
Useful Reading
You are NOT required to read these!
ISBN: 978-1-118-94109-6
ISBN: 978-0-387-79053-4
ISBN: 978-1-4471-4883-8
But these books are seriously good.
The Importance Of Proper Statistics The Consequences Of Bad Statistics
When Mistakes Happen
Even the rigorous peer-review system might miss some minor flaws.
An example:
Birkenmeyer et al. published a flawed paper in 2016.
The mistake in the data set was spotted by Dr. B. M. Weiß in early 2017.
A corrigendum was put online.
A corrected version of the paper was uploaded.
None of the results of the paper changed.
No big deal so long as you offer corrections to your flawed work.
Fraudulent Practices - The Case Of Andrew Wakefield
Probably one of the most reviled doctors of the 21st century
Claimed to have found a link between vaccines and autism (paper from 1998)
Paper retracted by the publisher
General Medical Council of Britain revoked his medical license
His academic career is over despite his large community of followers in the U.S., Australia and Brazil.
Knowingly fraudulent practices can cost you your career.
Fraudulent Practices - The Case Of Diederik Stapel
Former star in academia, now a laughing stock
Manipulated data and completely fabricated entire studies
Fired from his position as professor at Tilburg University
58 retracted papers
Papers of other authors needed to be retracted as well
Knowingly fraudulent practices can cost you your career, discredit your
institution and your field of research, and even seriously impede the careers of
unknowing co-workers.
The Importance Of Proper Statistics What Are Bad Statistics?
Wrong/Misinformed Use
Lack of statistical knowledge
Applying statistical methods to data they aren’t meant for
Methods can “break”
Flawed understanding of the methodology
Incorrect conclusions
Pure biologists often lack knowledge of statistics.
Uninformed Use
Lack of biological knowledge
Delineation of nonsensical but statistically significant relationships
p-hacking
No sense of how to establish testable, feasible hypotheses
Waste of time
Pure statisticians often lack knowledge of biology.
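How nonsensical but statistically significant relationships arise can be shown with a small simulation. This is only a sketch under stated assumptions: the data are pure random noise, and the number of tests, the sample size and the 0.05 threshold are arbitrary illustrative choices.

```r
# Hypothetical simulation: every variable is random noise,
# so any "significant" correlation found here is spurious.
set.seed(42)                      # make the simulation reproducible
n_tests  <- 100                   # number of unrelated predictors we "try out"
response <- rnorm(50)             # a response with no real structure

# Correlate the response with 100 unrelated noise variables
p_values <- replicate(n_tests, cor.test(response, rnorm(50))$p.value)

# Number of tests that fall below the conventional 0.05 threshold
sum(p_values < 0.05)
```

With a threshold of 0.05, roughly one in twenty such tests is expected to come out "significant" by chance alone, which is why testing many relationships and reporting only the significant ones is misleading.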
Caveat
Biologists often have preconceived ideas of what to expect
Data-tweaking to match expectations?
Researchers also have a vested interest in uncovering extraordinary things
The more astounding a paper, the better?
ATTENTION!
Don’t let a personal bias inform your analysis!
The Importance Of Proper Statistics Statistical Concern On The Rise
The Recent Debate
p-values are a cause of concern
More on this in seminar 6 (Inferential Statistics and Hypotheses)
Pre-p-value statistics and data handling are increasingly subject to scrutiny
More on this in seminar 7 (Data Handling and Data Mining)
Practices in statistics are constantly subject to change.
Why Keep Up With It?
Journals might enact bans on studies containing p-values
Counter-productive according to Andrew Vickers (Memorial Sloan Kettering Cancer Center)
Statistically robust studies hold up to scrutiny much better
Statistical prowess enhances your research massively
Staying up-to-date can help advance one’s understanding and career
Advancing In Statistics
"Treat statistics as a science, and not a recipe!" (Andrew Vickers)
The Importance Of Proper Statistics Further benefits of a statistical background
The Lack Of Biostatisticians
Biological studies without rigorous statistical analyses are almost
unpublishable
Biostatisticians are rare
Almost every biological research group requires at least one capable
statistician
Biostatisticians are sought-after
Statistics As An Aphrodisiac
Terminology Classifying Statistics
Frequently Used Classifications
According to how they are done:
Theoretical Statistics
Applied Statistics
According to topic:
Biostatistics
Economic Statistics
Statistical Physics
...
According to what the goal is and what kind of data is available:
Regression
Classification
According to how the analysis makes use of the data:
Supervised
Unsupervised
According to the kind of information returned by the methods:
Descriptive Statistics
Inference/Inferential Statistics
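The last distinction, descriptive versus inferential statistics, can be sketched in R as follows. The two groups are simulated and purely hypothetical, and the t-test is only one possible inferential method, chosen here for brevity.

```r
# Hypothetical data: two simulated groups of 30 measurements each
set.seed(42)
group_a <- rnorm(30, mean = 5.0, sd = 1)
group_b <- rnorm(30, mean = 5.5, sd = 1)

# Descriptive statistics: summarise the data we have
mean(group_a)
sd(group_a)
mean(group_b)
sd(group_b)

# Inferential statistics: draw a conclusion beyond the data at hand,
# here whether the two underlying group means differ (Welch t-test)
result <- t.test(group_a, group_b)
result$p.value
```

Descriptive methods return summaries of the sample, whereas inferential methods return statements about the population the sample was drawn from.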
Unsupervised Approaches
Unsupervised methods are often used to select the most informative X input
variables for supervised approaches.
Pre-requisites:
Only input variables are observed.
No solution/feedback (output) is
given.
Aims:
Divide the observations into
relatively distinct groups.
Model the underlying structure or
distribution in the data.
"Pre-processing" before a supervised learning analysis and
exploratory analyses
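A minimal sketch of an unsupervised approach, assuming k-means clustering as the method and R's built-in iris data as the example. Neither is prescribed by the course material, and the choice of three clusters is an assumption made for illustration.

```r
# Only input variables are used; the Species column (a possible
# output/label) is deliberately left out of the analysis.
set.seed(42)                                          # k-means uses random starting points
inputs <- iris[, c("Petal.Length", "Petal.Width")]

# Divide the observations into three relatively distinct groups
clusters <- kmeans(inputs, centers = 3, nstart = 10)

# Number of observations assigned to each cluster
table(clusters$cluster)
```

No feedback about the "correct" grouping is given to the algorithm; the groups emerge from the structure of the input data alone.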
Supervised Approaches
Supervised methods are often informed by unsupervised approaches and used
to gain validated information about the data.
Pre-requisites:
Both predictors X and responses Y are observed (there is one y_i for each x_i).
Data is split into Training and Test Data Sets.
Aims:
Learn a mapping function f from X to Y.
Validate established function/model.
Further prediction and inference.
Mostly inferential analyses
Terminology Basic Vocabulary
Population vs. Sample
Population: describes the sum total of
all existing values of a variable given a
certain research question. This
includes non-measured data.
Sample: describes the sum total of all
available values of a variable for any
given analysis. This can only include
measured data.
An example:
In an experimental set-up, you rear an ant colony of exactly 10,000 individuals.
You are interested in the average mandible strength of ants within the colony.
The problem: You cannot possibly take measurements of all 10,000 individuals.
The solution: Taking measurements on a Sample (e.g. 1,000 individuals) from within the Population (10,000 individuals).
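The ant colony example can be sketched in R. All values are simulated: the mean of 5, the standard deviation of 1 and the normal distribution are assumptions made purely for illustration.

```r
# Hypothetical illustration: all numbers below are simulated, not measured.
set.seed(42)                                   # make the simulation reproducible

# Population: mandible strength of all 10,000 ants (arbitrary units)
colony <- rnorm(10000, mean = 5, sd = 1)

# Sample: the 1,000 ants we actually measure
measured <- sample(colony, size = 1000, replace = FALSE)

mean(colony)    # population mean (unknown in a real experiment)
mean(measured)  # sample mean, our estimate of the population mean
```

In a real experiment only the second value is available; the simulation merely shows that a random sample can approximate the population mean well.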
Training Data vs. Test Data
This differentiation is only applicable when concerned with modelling, which we
won’t cover in these seminars.
Training Data: describes the subset of the total data which is used to establish/train the model.
Test Data: describes the subset of the
total data which is used to test the
performance of the model.
The problem: You have identified a way to model how mandible strength and
ant size are interconnected but don’t know how to assess the quality of your
model (a model will always fit the data it was built on extremely well).
The solution: Split the available data into two non-overlapping subsets of data
(Training and Test Data) and use these separately to build your model and
assess its performance.
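A minimal sketch of such a split, assuming simulated ant data, a 70/30 split and a simple linear model. The variable names, the split ratio and the error measure (root mean squared error) are illustrative choices, not part of the course material.

```r
# Hypothetical data: ant size and mandible strength, simulated
set.seed(42)
n <- 200
ants <- data.frame(size = runif(n, min = 2, max = 10))
ants$strength <- 1.5 + 0.8 * ants$size + rnorm(n, sd = 0.5)

# Split into two non-overlapping subsets (70% training, 30% test)
train_rows <- sample(seq_len(n), size = round(0.7 * n))
train <- ants[train_rows, ]
test  <- ants[-train_rows, ]

# Build the model on the training data only
model <- lm(strength ~ size, data = train)

# Assess its performance on data it has never seen
pred      <- predict(model, newdata = test)
rmse_test <- sqrt(mean((test$strength - pred)^2))
rmse_test
```

Because the test data played no part in fitting the model, the error computed on them gives a more honest picture of model quality than the fit to the training data.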
What Makes Data Truly Random?
Randomisation is one of the most important practices in biological
studies.
A sampling procedure is random when any member of the population has an equal chance of being selected into the sample.
Training and Test Data Sets are established from the population with the same
sense of randomness although there may be exceptions depending on the
modelling procedure at hand.
Data collection: Number all units
contained within the set-up and sample
those units corresponding to random
numbers.
In R: Use the sample() function to create (pseudo-)random subsets.
Remember to use set.seed() to make this step reproducible!
Random Sampling in R
# Making it reproducible
set.seed(42)
# Establishing a population
pop <- c(1:15)
pop
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# Establishing a random sample
sam <- sample(pop, 5, replace = FALSE)
sam
## [1] 1 5 15 9 10
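Why set.seed() matters can be checked directly. The following sketch reuses the seed and population from above; the object names are hypothetical.

```r
# Same seed, same call: the two draws are identical
set.seed(42)
first_draw <- sample(1:15, 5, replace = FALSE)

set.seed(42)
second_draw <- sample(1:15, 5, replace = FALSE)

identical(first_draw, second_draw)
## [1] TRUE

# Without resetting the seed, the random number generator simply continues
# from its current state, so this draw is not tied to the seed in the same way
third_draw <- sample(1:15, 5, replace = FALSE)
third_draw
```

Setting the seed once at the top of a script is usually enough to make every random step in it reproducible, provided the script is always run from start to finish.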
Introduction To R Why Use R?
The Power Of R
1 R is a powerful statistical and graphical tool
2 Available for almost every platform (Windows,
Linux, Mac, FreeBSD, etc.)
3 It is completely free
4 Open source
It can be modified heavily to suit individual
demands
Constant, moderated user input to widen
functionality
Dedicated, heavily frequented forums online
Allows for reproducible coding
R is the rising star of statistical applications in biological sciences!
Introduction To R The R landscape
Obtaining R
R is a free statistical environment that is used by many researchers all around
the globe.
How to get it?
R is available at
https://www.r-project.org/
A host of editors is available freely
on the internet. I recommend
RStudio (available at
https://www.rstudio.com/).
What if I need help?
Multiple dedicated forums online:
https://stackoverflow.com/
https://stackexchange.com/
Introduction To R Layouts
Layouts - The Console
Running R through the console . . .
... is a bad idea.
But you will have access to it anyway as it comes with R (we will use version
3.4.2, available at https://cran.r-project.org/bin/windows/base/old/3.4.2/).
Layouts - The Editor
Running R through an editor. . .
... is a much better idea!
I recommend RStudio (https://www.rstudio.com/). If you use it a lot, I also
recommend changing the appearance to ‘Vibrant Ink’ (setting located in the
‘Global Options’ window nested within the ‘Tools’ tab).
Layouts - The Editor Explained
The Source is where you load scripts and write most of your coding document.
The Environment, History, Connections pane is where you will be able to quickly
access all objects of your current R session.
Files, Plots, Packages, Help and Viewer are especially useful for document
navigation, data visualisation and getting information on certain functions in R.
The Console is where you execute short commands and where warning and error
messages are displayed.
Introduction To R Coding
The Evolution Of Code
Your code and coding practices evolve
Comment every line of code
Elegant code makes an analysis
easier to reproduce
Avoid hard-coding!
"If it looks stupid but it works, it isn’t
stupid."
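What "Avoid hard-coding!" means in practice can be sketched with a small, hypothetical example. The data values and object names are invented for illustration.

```r
# Hypothetical measurements
measurements <- c(4.2, 5.1, 3.8, 6.0, 5.5)

# Hard-coded: the number of observations (5) is typed in by hand,
# so the result is silently wrong once the data gain or lose a value
mean_hard <- sum(measurements[1:5]) / 5

# Not hard-coded: the number of observations is derived from the data
n_obs     <- length(measurements)     # adapts automatically to the data
mean_soft <- sum(measurements) / n_obs

# Both agree for now, but only the second version survives a change in the data
all.equal(mean_hard, mean_soft)
```

Deriving values such as sample sizes, column positions or file names from the data or from a single, clearly named object at the top of the script makes an analysis much easier to reuse and reproduce.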