DATA HANDLING AND DATA MINING

Erik Kusch

erik.kusch@i-solution.de

Section for Ecoinformatics & Biodiversity

Center for Biodiversity and Dynamics in a Changing World (BIOCHANGE)

Aarhus University

Aarhus University Biostatistics - Why? What? How? 1 / 18

1 Collecting And Handling Data

Data Collection

Data Handling

Fixing Data

2 Mining Data

What To Mine For

How To Mine in R

3 Exercise


Collecting And Handling Data Data Collection

Why Care?

Biostatisticians often spend 70% of their time handling data and just 30% actually analysing it.

Why care?

Proper data collection and data

handling ensure accurate results

Proper data collection cuts down

on data handling time

Proper data handling will make

reproducing an analysis much

easier

What to consider?

Which data format to use

What kind of data to record

How data values are

recorded/stored

What kind of data values are

feasible


Recording Data

Guidelines for data recording:

When collecting categorical data, know what values the variables are

allowed to take

When collecting continuous data, know which range the variable values

can fall into

Make sure everyone involved in data collection is on the same page

Make regular back-ups of your data set
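The first two guidelines can be turned into simple checks once the data are in R. The following is a minimal sketch on made-up records; the column names, the allowed values, and the feasible range are assumptions chosen for illustration only.

```r
# Toy records; column names and limits are illustrative assumptions
records <- data.frame(
  Sex    = c("Male", "Female", "male", "Female"),
  Weight = c(31.2, 29.8, 30.5, 305)
)

allowed_sex  <- c("Male", "Female")  # values the categorical variable may take
weight_range <- c(10, 50)            # feasible range of the continuous variable

# Flag categorical records outside the allowed set
bad_sex <- !(records$Sex %in% allowed_sex)

# Flag continuous records outside the feasible range
bad_weight <- records$Weight < weight_range[1] | records$Weight > weight_range[2]

which(bad_sex)     # row 3: "male" is not an allowed spelling
which(bad_weight)  # row 4: 305 lies outside the feasible range
```

Writing the allowed values and ranges down before data collection starts means the same definitions can be shared with everyone who records data.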


Recording Data Collection - The README File

Documenting data recording is just as important as proper data collection!

To do so, one usually uses a README file containing the following:

Project Name and Summary

Primary contact information

Your name and title (if you aren’t the primary contact)

Other people working on the project

Location of data and supporting info

Organization and naming conventions used for the data

Any previous work on the project and where it is located

Funding information

This file is always saved in conjunction with the actual data set!


Collecting And Handling Data Data Handling

Data Structure

I recommend a structure like the one shown below with at least two hierarchy

levels.

The only files allowed in your first hierarchy level are:

R master file

Manuscript master file

Additionally, make sure to back-up your project folder frequently and use

version control on it.
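A two-level project folder of this kind could be set up from R as sketched below. The folder and file names are hypothetical and are not taken from the original figure; the sketch only illustrates keeping the two master files at the top level and everything else in sub-folders.

```r
# Hypothetical project layout; all names are assumptions for illustration
project_dir <- file.path(tempdir(), "MyProject")
sub_dirs    <- c("Data", "Code", "Figures")

# Second hierarchy level: one folder per kind of content
for (d in sub_dirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# First hierarchy level: only the two master files
file.create(file.path(project_dir, "Master.R"))
file.create(file.path(project_dir, "Manuscript.txt"))

list.files(project_dir)
```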


Which Format To Use

When storing your data, you have a plethora of file formats to choose from.

R works very well with:

Excel files (.xls, .xlsx) and comma-separated files (.csv)

text files (.txt)

Whilst both of these are accessible to all of your co-workers, Excel is easier to operate outside of R.

→ Make sure to provide co-workers with a master file before data collection to avoid cell formatting issues on different computers
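As a rough illustration, .csv and .txt files can be read with base R alone, whereas .xls and .xlsx files need an add-on package such as readxl (not shown here). The example table and the temporary files below are made up for this sketch.

```r
# Write a small example table to temporary files (contents are made up)
example  <- data.frame(Site = c("A", "B"), Count = c(12, 7))
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
write.csv(example, csv_file, row.names = FALSE)
write.table(example, txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both files back with base R
from_csv <- read.csv(csv_file)
from_txt <- read.table(txt_file, header = TRUE, sep = "\t")

from_csv
from_txt
```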


Collecting And Handling Data Fixing Data

Common Issues

The Decimals

Always use a dot to indicate decimals.

→ It is the standard in science.

To NA Or Not To NA?

Never enter NA values manually into your data.

→ They cause problems in R.

Redundancy Or Sparsity?

Don’t clutter data with unnecessary data records.

→ Leaner data sets take up less storage space and leave fewer chances for errors.
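If a file was nevertheless recorded with decimal commas, R can be told so when reading it, and cells left empty in a numeric column are read as missing values without anyone typing NA. The file contents below are made up, and the separator and decimal settings are assumptions about how such a file might have been written.

```r
# A small file using decimal commas and one empty cell (contents are made up)
raw_lines  <- c("ID;Weight", "1;30,5", "2;", "3;28,1")
comma_file <- tempfile(fileext = ".csv")
writeLines(raw_lines, comma_file)

# sep and dec describe how this particular file was written
sparse <- read.csv(comma_file, sep = ";", dec = ",")

class(sparse$Weight)   # numeric, because dec = "," was supplied
is.na(sparse$Weight)   # the empty cell arrives as a missing value
```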


Workflow I

No data set is ever perfect (except fabricated ones)!

The etiquette of fixing data:

Never overwrite/alter your original data file

Never apply fixes by hand (doing so completely breaks reproducibility)

R is an extremely powerful tool for fixing your data!
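One way to follow this etiquette is to read the original file, apply every fix in a script, and save the result under a new name. The sketch below uses a made-up stand-in file, a hypothetical "_fixed" naming scheme, and an assumed feasibility threshold.

```r
# Stand-in for an original data file (contents are made up)
raw_file <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, Height = c(1.2, 1.3, 13)),
          raw_file, row.names = FALSE)

# 1. Read the original file; it is never written to again
raw <- read.csv(raw_file)

# 2. Apply every fix in code so the steps can be re-run and reviewed
fixed <- raw
fixed$Height[fixed$Height > 5] <- NA   # assumed feasibility threshold

# 3. Save the result under a new name (hypothetical naming scheme)
fixed_file <- sub("\\.csv$", "_fixed.csv", raw_file)
write.csv(fixed, fixed_file, row.names = FALSE)
```

Because the original file stays untouched, re-running the script always reproduces the same fixed data set.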


Workflow II

Fixing a data set is usually a two-step process:

Column/Variable Class

Getting variable record classes right is paramount for specific analyses in R.

Before data are recorded, we should already know the desired record class for each variable.

Column/Variable Content/Values

Typos and the like can often lead to false data/variable records and need to be fixed or removed for dependable results to be obtained.

Since records are usually stored in one column for each variable, we may treat column classes as “variable record classes”.
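The two steps might look as follows on a toy data set. The column names, the typos, and the chosen corrections are assumptions made for illustration; real data will need their own decisions.

```r
# Toy data; column names and values are assumptions for illustration
birds <- data.frame(
  Weight = c("30.5", "29,8", "31.0"),   # stored as text because of one decimal comma
  Sex    = c("Male", "Female", "Femal"),
  stringsAsFactors = FALSE
)

# Step 1: column/variable class
# The decimal comma has to be replaced before the column can become numeric
birds$Weight <- as.numeric(sub(",", ".", birds$Weight, fixed = TRUE))
birds$Sex    <- as.factor(birds$Sex)

# Step 2: column/variable content
# Renaming the misspelled level merges it with the correct one
levels(birds$Sex)[levels(birds$Sex) == "Femal"] <- "Female"

str(birds)
```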


Useful Functions in R

dim(object) returns the dimensions of the object

summary(object) gives you a summary of values contained within the

object (see seminar 4)

View(object) opens almost any object in a new tab within RStudio for visual inspection

which(object == value) returns the positions at which the statement in brackets is TRUE (the statement itself yields a vector of TRUE and FALSE values)

sum(object) returns the sum of an object’s values (also works for TRUE/FALSE values)

vector[position] subsets elements of a vector

data.frame[Row,Column] subsets elements of a data.frame
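A short sketch of these functions on a made-up data frame is given here; the object and column names are hypothetical. View() is left out because it needs an interactive session.

```r
# Made-up data frame for illustration
df <- data.frame(Site = c("A", "B", "A", "C"), Count = c(4, 0, 7, 0))

dim(df)                # number of rows, then number of columns
summary(df$Count)      # summary of the values in one column

df$Count == 0          # logical vector: one TRUE/FALSE per record
which(df$Count == 0)   # positions of the records that are TRUE
sum(df$Count == 0)     # number of TRUE values

df$Count[3]            # third element of a vector
df[2, "Site"]          # row 2, column "Site" of a data.frame
```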


Mining Data What To Mine For

The README File Revisited

Using the README file, one can identify what information is contained within the data set and thus decide:

Which type/class each data record should have

Which variables may be redundant

Which data records exceed their variable-specific feasible thresholds

Where to get comparative data sets from

Data Mining should then focus on:

Identifying problems within the data records

Explorative data analyses


Mining Data How To Mine in R

Numbers or Visualisations?

For data mining, one may wish to enlist the methods covered in seminars 4 and 5 (Descriptive Statistics & Data Visualisation):

Descriptive Statistics:

As far as descriptive statistics go, the

summary() command in R is probably

the most useful tool for data mining.

Data Visualisations:

Histograms

Scatter plots

Holistic data mining is best achieved using a combination of data visualisation tools and parameters of descriptive statistics!
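The combination of numbers and plots can be sketched on made-up measurements as follows. The variable names and values are assumptions; one value has a deliberately misplaced decimal so that both the summary and the plots have something to reveal.

```r
# Made-up measurements; the sixth wing value has a misplaced decimal
wing <- c(7.1, 6.9, 7.4, 7.0, 7.2, 71.0, 6.8)
mass <- c(30.1, 29.5, 31.2, 30.0, 30.8, 30.4, 29.1)

# Descriptive statistics: the maximum sits far away from the median
wing_summary <- summary(wing)
wing_summary

# Data visualisations: the same record stands apart in both plots
hist(wing, main = "Wing length", xlab = "Wing length")
plot(wing, mass, xlab = "Wing length", ylab = "Mass")
```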


Exercise

What To Do I

Data Loading

The data is located at https://github.com/ErikKusch/An-Introduction-to-Biostatistics-Using-R

The file is called SparrowData.csv

Expectations

Browse the README file to get a feeling for the variables contained within our data set.

Write down your expectations of data

range and variable mode within R.

Inspection

Use common functions to get a feeling

for the data set.

Can you spot some data errors right

away?

Do not do this in Excel.
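A possible starting point for the inspection is sketched below. It assumes that SparrowData.csv has already been downloaded from the repository into the working directory, and it makes no assumption about the column names; the separator and decimal settings of read.csv() may need adjusting to how the file was written.

```r
# Assumes the file has been downloaded into the working directory
data_file <- "SparrowData.csv"

if (file.exists(data_file)) {
  # sep and dec may need adjusting to how the file was written
  sparrows <- read.csv(data_file)
  print(dim(sparrows))      # number of records and variables
  str(sparrows)             # class of each column
  print(summary(sparrows))  # value ranges per column
} else {
  message("SparrowData.csv not found in ", getwd())
}
```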


What To Do II

Column/Variable Class

Find columns whose record classes do

not match up with your educated

expectation

Transform these column classes

Column/Variable Content/Values

Identify columns which, after fixing their record class, still exhibit faulty/unreasonable values.

Decide what to do about them.

Redundancy

Remove columns from the data set which are not needed, either because their information is contained in another column or because they unnecessarily bloat the data set.

→ Do this variable by variable.
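Removing a redundant column could look like the sketch below. The data and column names are made up; the point is to confirm that the information really is contained in another column before dropping anything.

```r
# Toy data; Weight_g repeats the information in Weight_kg (made-up columns)
birds <- data.frame(
  ID        = 1:3,
  Weight_kg = c(0.030, 0.029, 0.031),
  Weight_g  = c(30, 29, 31)
)

# Confirm the redundancy before dropping anything
redundant <- isTRUE(all.equal(birds$Weight_kg * 1000, birds$Weight_g))

# Remove the column only if its information is fully contained elsewhere
if (redundant) {
  birds$Weight_g <- NULL
}

names(birds)
```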
