The content presented in this part comes from the Data Carpentry lesson Data Organization in Spreadsheets for Ecologists and is used under a CC-BY licence.
Image from The Turing Way: A Handbook for Reproducible Data Science (The Turing Way Community, 2021) used under a CC-BY licence.
From an altruistic point of view, reproducibility is beneficial for the research community, and arguably society. In many cases, it is a minimum requirement for verifying research; when research can be verified, and by extension corrected, science is built on solid ground. Being able to reuse and build on available data and code saves time and resources, with the potential to speed up innovation.
But reproducibility can also be beneficial for individual researchers!
There aren't only altruistic reasons to work reproducibly; it can be good for your career too!
Markowetz (2015) notes five selfish reasons to work reproducibly.
1 in 2 examined psychology papers had inconsistencies in the reported p-values (Nuijten et al., 2016)
1 in 8 examined papers had inconsistencies large enough to affect the conclusions (Nuijten et al., 2016)
About 1 in 5 genetics papers with supplementary Excel gene lists contains errors because Microsoft Excel's default settings convert gene names to dates (Ziemann et al., 2016)
To be continued...
with literate programming it's easy to combine text and code
meaning you can call your results into your document directly. No more copy-pasting = no more copy-pasting errors ✨
if you collect more data or change your data cleaning, you just need to rerun the analyses and all results will update
working reproducibly often involves automation and/or good documentation, making it more likely you'll actually know what you did when it's time to write up 💃🏼
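The idea of calling results into a document can be sketched in plain Python. This is only a minimal illustration, not a literate programming tool: the weights are those of the OT females in the example table used in this part, and the variable names and the sentence wording are made up for the example.

```python
from statistics import mean

# Weights (in grams) of the OT females in the example table.
weights_g = [23, 22, 29]

# Compute the values once...
n_animals = len(weights_g)
mean_weight = mean(weights_g)

# ...and let the text pull them in, instead of typing the numbers by hand.
sentence = (
    f"We weighed {n_animals} animals; "
    f"their mean weight was {mean_weight:.2f} g."
)
print(sentence)
```

If another animal is added to `weights_g`, rerunning the script updates the sentence without any copy-pasting.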
Sharing the underlying code and data makes it possible for the reviewers to:
understand exactly how you cleaned and analysed the data
try out different analyses themselves
spot mistakes in your code 👀
Good documentation of the data and code means:
you'll be able to go back to a project after a break (e.g., holidays, peer review)
it's easier for others to pick up a project (important in long-term projects)
Sharing and documenting your data and code well:
helps you build a reputation as an honest and careful researcher. This could be contributing to a citation advantage of about 25% for publications with linked data (Colavizza et al., 2020).
if there is ever a problem with one of your papers you can show you did everything in good faith
Working reproducibly doesn't mean that there will definitely not be any mistakes in your work. But it does make it easier to figure out what went wrong and, by extension, to fix it.
Working reproducibly can also make the paper writing and reviewing process easier and more productive.
And it can help others build on your work and build your reputation.
Looking at this matrix again, it's clear that for reproducibility to be possible, it is necessary to be able to get a copy of the data and code.
Additionally, code and data need to be Findable, Accessible, Interoperable, and Reusable, or in short FAIR (Wilkinson et al., 2016).
First people need to be able to find the data!
For data to be findable, they need to be described with rich metadata, or information about data. These metadata can be generic (e.g. title, author name, keywords) or discipline specific.
Data also need to be assigned a unique and persistent identifier. A commonly used identifier is the Digital Object Identifier (DOI). Such identifiers make it easy to find data, but also to link them with other relevant information (e.g. a publication).
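As a rough sketch of what generic metadata and a persistent identifier look like in practice, the snippet below stores a small metadata record as JSON and turns a DOI into a link. The title, authors, and DOI are those of the Portal Project Teaching Database cited in this part; the field names and keywords are illustrative assumptions and do not follow any particular repository's schema.

```python
import json

# Illustrative metadata record; field names and keywords are assumptions.
metadata = {
    "title": "Portal Project Teaching Database",
    "authors": ["Morgan Ernest", "James Brown", "Thomas Valone", "Ethan P. White"],
    "keywords": ["ecology", "teaching"],
    "doi": "10.6084/m9.figshare.1314459.v10",
}


def doi_to_url(doi: str) -> str:
    """Turn a DOI into a resolvable link via the doi.org resolver."""
    return f"https://doi.org/{doi}"


# JSON is a plain-text format, so the metadata stays machine-readable.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
print(doi_to_url(metadata["doi"]))
```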
After people have found the data they need to be able to access them!
This could mean that the data are publicly available in a data repository, like 4TU.ResearchData.
If you're working with personal data, classified data, or other sensitive information, you may need to keep the data restricted and require authentication or authorisation before others can access the data.
In those cases, metadata should still be accessible.
Data often need to be integrated with other data.
This integration is easier when data make reference to other relevant datasets.
It's also important to make data available in open file formats that anyone can open.
Using controlled vocabularies is also highly recommended, if these exist in your field.
Adapted from Silvester et al., 2015.
To avoid ambiguity, use the ISO 8601 standard: YYYY-MM-DD (or YYYYMMDD). RFC 3339, a stricter profile of ISO 8601, allows only the hyphenated form.
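A minimal sketch of why the year-month-day form helps, using only Python's standard `datetime` module. The example date is taken from the table used in this part; the slash-separated string is a made-up illustration of an ambiguous date.

```python
from datetime import date, datetime

collected = date(2015, 3, 11)

# isoformat() always writes year-month-day with leading zeros.
iso_text = collected.isoformat()
print(iso_text)

# Parsing an ISO-formatted string gives back the same date.
assert date.fromisoformat(iso_text) == collected

# The same digits mean different days under other conventions.
ambiguous = "03/11/2015"
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
print(as_month_first, as_day_first)
```

The two readings of the slash-separated string differ by almost eight months, whereas the ISO string has only one reading.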
This image was created by cmglee, Canuckguy and many others for Wikimedia Commons and is used under a CC-BY-SA licence
Let's say people were able to find, access, and open your data. Now for the last part!
To be able to reuse your work, people need to be able to understand it. This means you need to provide good documentation:
The Portal Project Teaching Database
Authors: Morgan Ernest, James Brown, Thomas Valone, Ethan P. White
General introduction
The Portal Project Teaching Database is a simplified version of the Portal Project Database designed for teaching. It provides a real world example of life-history, population, and ecological data, with sufficient complexity to teach many aspects of data analysis and management.
Purpose
This database is not designed for research as it intentionally removes some of the real-world complexities. The original database is published at Ecological Archives and that version should be used for research purposes.
Organisation
The Python code used for converting the original database to this teaching version is included as 'create_portal_teach_dataset.py'.
Date collected | Species | Sex | Weight |
---|---|---|---|
2015-01-08 | PF | M | 7 |
2015-02-18 | OT | M | 24 |
2015-02-19 | OT | F | 23 |
2015-03-11 | NA | M | 232 |
2015-03-11 | OT | F | 22 |
2015-03-11 | OT | M | 26 |
2015-03-11 | PF | M | 8 |
2015-04-08 | NA | F | NA |
2015-05-06 | NA | NA | NA |
2015-05-18 | NA | F | 182 |
2015-06-09 | OT | F | 29 |
2015-07-08 | NA | F | 115 |
2015-07-08 | NA | M | 190 |
What do the codes in Species stand for?
What unit is Weight measured in?
Date collected: the date the data were collected, in YYYY-MM-DD format
Species: a code for the species of the animal detected. See below for a table of what the codes stand for
Sex: the sex of the animal detected, M for male and F for female
Weight: the weight of the animal detected, measured in grams
Missing data are coded as NA.
species code | scientific name | common name |
---|---|---|
PF | Perognathus flavus | Silky pocket mouse |
OT | Onychomys torridus | Southern grasshopper mouse |
NA | Neotoma albigula | White-throated woodrat |
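The documentation above is what makes the data usable in code. Note that NA is both the missing-value code and the species code for Neotoma albigula, so a reader (human or software) needs the data dictionary to tell them apart. The sketch below applies it with Python's standard `csv` module; the snake_case column names and the `parse_row` helper are assumptions made for this example, not part of the original dataset.

```python
import csv
import io

# A few rows of the example table as CSV text (column names are assumed).
csv_text = """date_collected,species,sex,weight
2015-03-11,NA,M,232
2015-04-08,NA,F,NA
2015-05-06,NA,NA,NA
2015-06-09,OT,F,29
"""

# Species codes from the table above.
species_names = {
    "PF": "Perognathus flavus",
    "OT": "Onychomys torridus",
    "NA": "Neotoma albigula",
}


def parse_row(row: dict) -> dict:
    """Apply the data dictionary: NA means missing in sex and weight only."""
    return {
        "date_collected": row["date_collected"],
        # In the species column, NA is a real code, not a missing value.
        "scientific_name": species_names[row["species"]],
        "sex": None if row["sex"] == "NA" else row["sex"],
        "weight_g": None if row["weight"] == "NA" else float(row["weight"]),
    }


records = [parse_row(row) for row in csv.DictReader(io.StringIO(csv_text))]
for record in records:
    print(record)
```

Tools that guess missing values automatically may treat every NA as missing, which is one reason to state the coding explicitly in the documentation.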
Another crucial component of reusability is usage licences. Just because something is shared online, doesn't mean anyone can use it. That's because the creator* of the work holds the copyright to it.
*In fact, it is often your university or institute that holds the copyright to the work you created while working there.
So, you need to tell people what they are allowed to do with your data and code. You do this by providing a usage licence.
Note that usage licences are different for data and for code:
This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence. DOI: 10.5281/zenodo.3332807
Making your data and code Findable, Accessible, Interoperable and Reusable can take you a long way towards making your work reproducible.
There are many things you need to do to make your work FAIR, but:
So let's get started 🎉
Most researchers work with tabular data for their research, so knowing how to organise this type of data well is fundamental.
Tabular data refers to "rectangular tables made up of rows and columns" (Wickham, 2014).
Date collected | Species | Sex | Weight |
---|---|---|---|
2015-01-08 | PF | M | 7 |
2015-02-18 | OT | M | 24 |
2015-02-19 | OT | F | 23 |
2015-03-11 | NA | M | 232 |
2015-03-11 | OT | F | 22 |
2015-03-11 | OT | M | 26 |
2015-03-11 | PF | M | 8 |
2015-04-08 | NA | F | NA |
2015-05-06 | NA | NA | NA |
2015-05-18 | NA | F | 182 |
2015-06-09 | OT | F | 29 |
2015-07-08 | NA | F | 115 |
2015-07-08 | NA | M | 190 |
A good framework for working with tabular data is that of tidy data, in which each variable forms a column and each observation forms a row. The concept of tidy data is borrowed from R and the tidyverse packages within it (Wickham, 2014).
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst.
Using a consistent way to organise your data, e.g. following the tidy data structure, helps you be more efficient when working with your data. That's because you can create (and reuse) tools that expect that kind of structure instead of starting from scratch every time.
Like R, NumPy and pandas can use vectorised processes that make working with large datasets faster and more efficient.
Placing variables in columns makes using vectorised processes easier.
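A small sketch of what a vectorised process looks like with NumPy, assuming the Weight column of the example table has been loaded into an array with NaN for the missing values. The conversion to kilograms is a made-up task for illustration.

```python
import numpy as np

# The Weight column from the example table, in grams; NaN marks missing values.
weight_g = np.array(
    [7, 24, 23, 232, 22, 26, 8, np.nan, np.nan, 182, 29, 115, 190]
)

# Loop version: handle one value at a time.
weight_kg_loop = [w / 1000 for w in weight_g]

# Vectorised version: one expression applied to the whole column.
weight_kg = weight_g / 1000

# A NaN-aware summary of the whole column in a single call.
mean_weight_g = np.nanmean(weight_g)
print(weight_kg[:3], mean_weight_g)
```

Because the whole variable sits in one column, the operation is written once and applied to every observation.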
Date collected | Species | Sex | Weight |
---|---|---|---|
2015-01-08 | PF | M | 7 |
2015-02-18 | OT | M | 24 |
2015-02-19 | OT | F | 23 |
2015-03-11 | NA | M | 232 |
2015-03-11 | OT | F | 22 |
2015-03-11 | OT | M | 26 |
2015-03-11 | PF | M | 8 |
2015-04-08 | NA | F | NA |
2015-05-06 | NA | NA | NA |
2015-05-18 | NA | F | 182 |
2015-06-09 | OT | F | 29 |
2015-07-08 | NA | F | 115 |
2015-07-08 | NA | M | 190 |
Summary (long)
Species | Sex | Mean weight |
---|---|---|
NA | F | 148.50 |
NA | M | 211.00 |
OT | F | 24.67 |
OT | M | 25.00 |
PF | M | 7.50 |
Summary (wide)
Sex | NA | OT | PF |
---|---|---|---|
F | 148.5 | 24.67 | NA |
M | 211.0 | 25.00 | 7.5 |
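The two summaries above can be derived from the tidy table with a few lines of pandas. This is a sketch under stated assumptions: the column names are shortened to lowercase, the table is typed in directly rather than read from a file, and missing values are entered as `None` or NaN.

```python
import pandas as pd

nan = float("nan")

# The example table in tidy form: one column per variable, one row per animal.
# The string "NA" in the species column is the code for Neotoma albigula;
# None and nan mark the values coded as missing in the sex and weight columns.
df = pd.DataFrame({
    "species": ["PF", "OT", "OT", "NA", "OT", "OT", "PF",
                "NA", "NA", "NA", "OT", "NA", "NA"],
    "sex": ["M", "M", "F", "M", "F", "M", "M",
            "F", None, "F", "F", "F", "M"],
    "weight": [7, 24, 23, 232, 22, 26, 8,
               nan, nan, 182, 29, 115, 190],
})

# Long summary: one row per species-sex combination.
# groupby drops rows whose key is missing, and mean() skips missing weights.
summary_long = df.groupby(["species", "sex"], as_index=False)["weight"].mean()
summary_long["weight"] = summary_long["weight"].round(2)

# Wide summary: the same numbers with one column per species.
summary_wide = summary_long.pivot(index="sex", columns="species", values="weight")

print(summary_long)
print(summary_wide)
```

The long form stays tidy and is easier to process further; the wide form is often easier to read in a report.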
How you collect your data will vary depending on your field and specific project:
Researchers often use spreadsheet programmes like Microsoft Excel to work with their tabular data.
Although such programmes are good for some things, they aren't always appropriate...
For example, almost 16,000 COVID-19 cases went unreported in England when a spreadsheet in the old .xls file format, which holds at most 65,536 rows, ran out of rows (Kelion, 2020).
easy to make mistakes (accidental deletions, misapplying formulas)
not reproducible (including mistakes)
Never edit your raw data in Excel!!!
data importing and exporting problems
non machine-readable formatting
We've known since 2016 that many genetics papers (about 1 in 5 of those with supplementary Excel gene lists) contain errors because Microsoft Excel converts gene names to dates (Ziemann et al., 2016).
Also, for any language scientists out there working with a language that's not English, always use UTF-8 text encoding.
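A minimal sketch of what goes wrong when the encoding is left to chance, using only the Python standard library. The word list and the file name are made up for the example.

```python
import tempfile
from pathlib import Path

words = "smörgåsbord, café, jalapeño"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "words.txt"
    # State the encoding explicitly when writing...
    path.write_text(words, encoding="utf-8")
    # ...and when reading, rather than relying on the platform default.
    read_back = path.read_text(encoding="utf-8")
    # Reading the same bytes with a different encoding garbles the text.
    garbled = path.read_text(encoding="latin-1")

print(read_back)
print(garbled)
```

Passing `encoding="utf-8"` on both sides keeps the accented characters intact, whichever operating system the file is opened on.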
We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.
notes in the margin
spatial layout of data
field formatting
using special characters and spaces
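One way to deal with special characters and spaces in column names is to normalise them before analysis. The helper below is a hypothetical sketch, not a standard function, and the header "Weight (g)" is invented to show how a unit in brackets would be handled.

```python
import re


def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace runs of other characters with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# "Weight (g)" is a made-up header for illustration.
headers = ["Date collected", "Species", "Sex", "Weight (g)"]
clean_headers = [to_snake_case(header) for header in headers]
print(clean_headers)
```

Names without spaces or symbols can be used directly as identifiers in most analysis software.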
Null value | Problems | Compatibility | Recommendation |
---|---|---|---|
0 | Looks like true 0 | | Never use |
blank | Could be an oversight | R, Python, SQL | Best option |
999, -999 | Not recognised as null | | Avoid |
NA, na | Can be abbreviation | R | Good option |
N/A | Often incompatible | | Avoid |
NULL | Data type issues | SQL | Good option |
None | Data type issues | Python | Avoid |
No data | Data type issues | | Avoid |
Missing | Data type issues | | Avoid |
-, +, . | Data type issues | | Avoid |
Adapted slightly from White et al. (2013)
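To see why a numeric placeholder such as -999 is risky, compare it with an explicit missing value. This is a small sketch with the standard library; the three weights are those of the NA females in the example table, with the missing one entered once as -999 and once as `None`.

```python
from statistics import mean

weights_with_sentinel = [182, -999, 115]  # -999 was meant as "missing"
weights_with_missing = [182, None, 115]   # missing value made explicit

# The sentinel is a valid number, so it silently distorts the result.
naive_mean = mean(weights_with_sentinel)

# An explicit missing value has to be handled, and can be skipped on purpose.
clean_mean = mean(w for w in weights_with_missing if w is not None)

print(naive_mean, clean_mean)
```

The first calculation runs without any warning even though its result, a negative weight, is impossible.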
Accessible: usable by multiple types of software, including freely available ones (note that Microsoft Excel, for example, requires a paid licence)
Life expectancy: open file formats are more suitable for long-term preservation because they don't rely on a single software programme
Publishing: data repositories, journals and funding agencies may have requirements for open file formats
Universal format: interoperable format that produces the same results when it is imported by various software programmes, including plain text editors
comma-separated value files (.csv) are plain text files where the columns are separated by commas
👍🏼 commonly used
👎🏼 annoying when data itself contains commas
tab-separated value files (.tsv) are plain text files where the columns are separated by tabs (\t)
👍🏼 no confusion when data contains commas or semicolons
👎🏼 not very commonly used (at least not yet)
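The difference between the two formats shows up as soon as a field contains a comma. The sketch below uses Python's standard `csv` module; the `note` column and its content are made up for the example, and the `write_delimited` helper is hypothetical.

```python
import csv
import io

rows = [
    ["species_code", "common_name", "note"],
    ["OT", "Southern grasshopper mouse", "caught, weighed, released"],
]


def write_delimited(rows, delimiter):
    """Write rows to a string, quoting fields only where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = write_delimited(rows, ",")
tsv_text = write_delimited(rows, "\t")
print(csv_text)
print(tsv_text)

# Double-check that both versions read back to the original rows.
assert list(csv.reader(io.StringIO(csv_text))) == rows
assert list(csv.reader(io.StringIO(tsv_text), delimiter="\t")) == rows
```

In the CSV version the note has to be wrapped in double quotes to protect its commas; in the TSV version no quoting is needed.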
To export a spreadsheet as a .csv file from Excel:
Go to File and choose Save as
Select CSV UTF-8 (Comma delimited) (.csv) from the list
Click Save
Enclose the data fields with double quotes and double check that the file you are exporting can be read in correctly.
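To export a spreadsheet as a .tsv file from Excel:
Go to File and choose Save as
Select Text (Tab delimited) (.txt) from the list
Click Save
Change the file extension from .txt to .tsv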
Metadata:
Licences:
Colavizza, G., I. Hrynaszkiewicz, I. Staden, et al. (2020). The citation advantage of linking publications to research data. PLOS ONE. Vol. 15.4, p. e0230416. DOI: 10.1371/journal.pone.0230416.
Ernest, M., J. Brown, T. Valone, et al. (2020). Portal Project Teaching Database. DOI: 10.6084/m9.figshare.1314459.v10.
Kelion, L. (2020). Excel: Why using Microsoft's tool caused Covid-19 results to be lost. URL: https://www.bbc.com/news/technology-54423988.
Lowndes, J. and A. Horst (2020). Tidy data for efficiency, reproducibility and collaboration. URL: https://www.openscapes.org/blog/2020/10/12/tidy-data/.
Markowetz, F. (2015). Five selfish reasons to work reproducibly. Genome Biology. Vol. 16.1, p. 274. DOI: 10.1186/s13059-015-0850-7.
Nuijten, M. B., C. H. J. Hartgerink, M. A. L. M. van Assen, et al. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. Vol. 48.4, pp. 1205-1226. DOI: 10.3758/s13428-015-0664-2.
Silvester, N., B. Alako, C. Amid, et al. (2015). Content discovery and retrieval services at the European Nucleotide Archive. Nucleic Acids Research. Vol. 43, pp. D23-D29. DOI: 10.1093/nar/gku1129.
The Turing Way Community (2021). The Turing Way: A handbook for reproducible, ethical and collaborative research. Version 1.0.1. DOI: 10.5281/zenodo.5671094.
The Turing Way Community and Scriberia (2021). Illustrations from the Turing Way book dashes. DOI: 10.5281/zenodo.5706310.
Vincent, J. (2020). Scientists rename human genes to stop Microsoft Excel from misreading them as dates. URL: https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates.
White, E. P., E. Baldridge, Z. T. Brym, et al. (2013). Nine simple ways to make it easier to (re)use your data. PeerJ PrePrints. Vol. 1.e7v2. DOI: 10.7287/peerj.preprints.7v2.
Wickham, H. (2014). Tidy Data. Journal of Statistical Software. Vol. 59.10, pp. 1-23. DOI: 10.18637/jss.v059.i10.
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. Vol. 3.1, p. 160018. DOI: 10.1038/sdata.2016.18.
Ziemann, M., Y. Eren, and A. El-Osta (2016). Gene name errors are widespread in the scientific literature. Genome Biology. Vol. 17.1, p. 177. DOI: 10.1186/s13059-016-1044-7.