+ - 0:00:00
Notes for current slide
Notes for next slide

Reproducible RDM workflows for tabular data

Eirini Zormpa

28 April 2022 (Updated: 2022-04-28)

1

Hi, I'm Eirini 👋

👩‍🎓 PhD Candidate @Max Planck Institute for Psycholinguistics

  • worked with lots of tabular data with observations from experiments with human participants
  • got interested in open science and reproducibility
2

Hi, I'm Eirini 👋

👩‍🎓 PhD Candidate @Max Planck Institute for Psycholinguistics

  • worked with lots of tabular data with observations from experiments with human participants
  • got interested in open science and reproducibility

💼 Trainer on Research Data Management @TU Delft and 4TU.ResearchData

  • train researchers and data supporters on good research data management practice
3

Session outline

  • Reproducibility & the FAIR principles
    • What is reproducibility and why is it important?
    • What are the FAIR principles?
  • Data Carpentry: Data organisation for tabular data*
    • Good data entry practices
    • Formatting data tables in spreadsheets
    • Importing and exporting data

The content presented in this part comes from the Data Carpentry lesson Data Organization in Spreadsheets for Ecologists and is used under a CC-BY licence.

4

Reproducibility & the FAIR principles

5

What is reproducibility?

Image from The Turing Way: A Handbook for Reproducible Data Science (The Turing Way Community, 2021) used under a CC-BY licence.

6

Why is reproducibility important?

From an altruistic point of view, reproducibility is beneficial for the research community, and arguably society.

7

Why is reproducibility important?

From an altruistic point of view, reproducibility is beneficial for the research community, and arguably society. In many cases, it is a minimum requirement for verifying research; when research can be verified, and by extension corrected, science is built on solid ground.

8

Why is reproducibility important?

From an altruistic point of view, reproducibility is beneficial for the research community, and arguably society. In many cases, it is a minimum requirement for verifying research; when research can be verified, and by extension corrected, science is built on solid ground. Being able to reuse and build on available data and code saves time and resources, with the potential to speed up innovation.

9

Why is reproducibility important?

From an altruistic point of view, reproducibility is beneficial for the research community, and arguably society. In many cases, it is a minimum requirement for verifying research; when research can be verified, and by extension corrected, science is built on solid ground. Being able to reuse and build on available data and code saves time and resources, with the potential to speed up innovation.




But reproducibility can also be beneficial for individual researchers!

10

Reproducibility: benefits for individual researchers

There aren't only altruistic reasons to work reproducibly; it can be good for your career too!

Markowetz (2015) notes five selfish reasons to work reproducibly.

11

Reason 1: reproducibility helps to avoid disaster

12

Reason 1: reproducibility helps to avoid disaster

13

Reason 1: reproducibility helps to avoid disaster

  • 1 in 2 examined papers had inconsistencies in the reported p-values
14

Reason 1: reproducibility helps to avoid disaster

  • 1 in 2 examined papers had inconsistencies in the reported p-values

  • 1 in 8 examined papers had inconsistencies large enough that they would affect the conclusions

15

Reason 1: reproducibility helps to avoid disaster (cont'd)

16

Reason 1: reproducibility helps to avoid disaster (cont'd)

17

Reason 1: reproducibility helps to avoid disaster (cont'd)

  • 1 in 5 papers in genetics is thought to have an error because of the default settings in Microsoft Excel reading in gene names as dates
18

Reason 1: reproducibility helps to avoid disaster (cont'd)

  • 1 in 5 papers in genetics is thought to have an error because of the default settings in Microsoft Excel reading in gene names as dates

  • To be continued...

19

Reason 2: reproducibility makes it easier to write papers

  • with literate programming it's easy to combine text and code
20

Reason 2: reproducibility makes it easier to write papers

  • with literate programming it's easy to combine text and code

    • meaning you can call your results into your document directly. No more copy-pasting = no more-copy pasting errors
21

Reason 2: reproducibility makes it easier to write papers

  • with literate programming it's easy to combine text and code

    • meaning you can call your results into your document directly. No more copy-pasting = no more-copy pasting errors

    • if you collect more data or change your data cleaning, you just need to rerun the analyses and all results will update

22

Reason 2: reproducibility makes it easier to write papers

  • with literate programming it's easy to combine text and code

    • meaning you can call your results into your document directly. No more copy-pasting = no more-copy pasting errors

    • if you collect more data or change your data cleaning, you just need to rerun the analyses and all results will update

  • working reproducibly often involves automation and/or good documentation, making it more likely you'll actually know what you did when it's time to write up 💃🏼

23

Reason 3: reproducibility helps reviewers see it your way

Sharing the underlying code and data makes it possible for the reviewers to:

24

Reason 3: reproducibility helps reviewers see it your way

Sharing the underlying code and data makes it possible for the reviewers to:

  • understand exactly how you cleaned and analysed the data
25

Reason 3: reproducibility helps reviewers see it your way

Sharing the underlying code and data makes it possible for the reviewers to:

  • understand exactly how you cleaned and analysed the data

  • try out different analyses themselves

26

Reason 3: reproducibility helps reviewers see it your way

Sharing the underlying code and data makes it possible for the reviewers to:

  • understand exactly how you cleaned and analysed the data

  • try out different analyses themselves

  • spot mistakes in your code👀

27

Reason 4: reproducibility enables continuity of your work

Good documentation of the data and code means:

28

Reason 4: reproducibility enables continuity of your work

Good documentation of the data and code means:

  • you'll be able to go to a project after a break (e.g., holidays, peer review)
29

Reason 4: reproducibility enables continuity of your work

Good documentation of the data and code means:

  • you'll be able to go to a project after a break (e.g., holidays, peer review)

  • it's easier for others to pick up a project (important in long-term projects)

30

Reason 5: reproducibility helps to build your reputation

Sharing and documenting your data and code well:

31

Reason 5: reproducibility helps to build your reputation

Sharing and documenting your data and code well:

  • helps you build a reputation as an honest and careful researcher. This could be contributing to a citation advantage of about 25% for publications with linked data (Colavizza et al., 2020).
32

Reason 5: reproducibility helps to build your reputation

Sharing and documenting your data and code well:

  • helps you build a reputation as an honest and careful researcher. This could be contributing to a citation advantage of about 25% for publications with linked data (Colavizza et al., 2020).

  • if there is ever a problem with one of your papers you can show you did everything in good faith

33

Interim summary 1

Working reproducibly doesn't mean that there will definitely not be any mistakes in your work. But it does make it easier to figure out what went wrong and, by extension, to fix it.

34

Interim summary 1

Working reproducibly doesn't mean that there will definitely not be any mistakes in your work. But it does make it easier to figure out what went wrong and, by extension, to fix it.

Working reproducibly can also make the paper writing and reviewing process easier and more productive.

35

Interim summary 1

Working reproducibly doesn't mean that there will definitely not be any mistakes in your work. But it does make it easier to figure out what went wrong and, by extension, to fix it.

Working reproducibly can also make the paper writing and reviewing process easier and more productive.

And it can help others build on your work and build your reputation.

36

FAIR and open data and code

Looking at this matrix again, it's clear that for reproducibility to be possible, it is necessary to be able to get a copy of the data and code.

Additionally, code and data need to be Findable, Accessible, Interoperable, and Reusable, or in short FAIR (Wilkinson et al., 2016).

37

Findable 🔍

First people need to be able to find the data!

38

Findable 🔍

First people need to be able to find the data!

For data to be findable, they need to be described with rich metadata, or information about data. These metadata can be generic (e.g. title, author name, keywords) or discipline specific.

39

Findable 🔍

First people need to be able to find the data!

For data to be findable, they need to be described with rich metadata, or information about data. These metadata can be generic (e.g. title, author name, keywords) or discipline specific.

Data also need to be assigned a unique and persistent identifier. A commonly used identifier is the Digital Object Identifier (DOI). Such identifiers make it easy to find data, but also to link them with other relevant information (e.g. a publication).

40

Accessible 🔓

After people have found the data they need to be able to access them!

This could mean that the data are publicly available in a data repository, like 4TU.ResearchData.

41

Accessible 🔓

After people have found the data they need to be able to access them!

This could mean that the data are publicly available in a data repository, like 4TU.ResearchData.

⚠️ Personal data and other sensitive information

If you're working with personal data, classified data, or other sensitive information, you may need to keep the data restricted and require authentication or authorisation before others can access the data.

42

Accessible 🔓

After people have found the data they need to be able to access them!

This could mean that the data are publicly available in a data repository, like 4TU.ResearchData.

⚠️ Personal data and other sensitive information

If you're working with personal data, classified data, or other sensitive information, you may need to keep the data restricted and require authentication or authorisation before others can access the data.

In those cases, metadata should still be accessible.

43

Interoperable ⚙️

Data often need to be integrated with other data.

This integration is easier when data make reference to other relevant datasets.

44

Interoperable ⚙️

Data often need to be integrated with other data.

This integration is easier when data make reference to other relevant datasets.

It's also important to make data available in open file formats, that anyone can open.

45

Interoperable ⚙️

Data often need to be integrated with other data.

This integration is easier when data make reference to other relevant datasets.

It's also important to make data available in open file formats, that anyone can open.

Using controlled vocabularies is also highly recommended, if these exist in your field.

46

In how many different ways can you say "female"?*

18-day pregnant females
female (lactating)
individual female
worker caste (female)
2 yr old female
female (pregnant)
sex: female
400 yr. old female
female (outbred)
mare
female (other)
adult female
female parent
female (worker)
female plant
femal
castrate female
female with eggs
ovigerous female
3 female
cf.female
female worker
oviparous sexual females
female (phenotype)
cystocarpic female
gynoecious
thelytoky
dikaryon
female virgin
dioecious female
femlale
female (gynoecious)
remale
metafemale
f
femele
sterile female
famale
normal female
femail
sf
female
females
tetraploid female
strictly female
females only
worker
hexaploid female
healthy female
female (gynoecious)
probably female
female (note: this sample was originally provided as a "male" sample and labeled this way in the paper and original submission; however analyses carried out in the meantime clearly show that this sample stems from a female individual)

Adapted from Silvester et al., 2015.

47

Dates

To avoid ambiguity, use the RFC3339 standard: YYYYMMDD (or YYYY-MM-DD).

This image was created by cmglee, Canuckguy and many others for Wikimedia Commons and is used under a CC-BY-SA licence

48

Reusable ♻️

Let's say people were able to find, access, and open your data. Now for the last part!

49

Reusable ♻️

Let's say people were able to find, access, and open your data. Now for the last part!

To be able to reuse your work, people need to be able to understand it. This means you need to provide good documentation:

50

Reusable ♻️

Let's say people were able to find, access, and open your data. Now for the last part!

To be able to reuse your work, people need to be able to understand it. This means you need to provide good documentation:

  • at a minimum, you should provide a README file where you describe what the project is about, how the files are organised and how to reproduce the project
51

Example of a README file

The Portal Project Teaching Database

Authors: Morgan Ernest, James Brown, Thomas Valone, Ethan P. White

General introduction

The Portal Project Teaching Database is a simplified version of the Portal Project Database designed for teaching. It provides a real world example of life-history, population, and ecological data, with sufficient complexity to teach many aspects of data analysis and management.

  • Time coverage: Continuous data collection since 1977
  • Site: 20 hectares split in 24 experimental plots in the Chihuahuan Desert (USA)
  • Species: rodents

Purpose

This database is not designed for research as it intentionally removes some of the real-world complexities. The original database is published at Ecological Archives and that version should be used for research purposes.

Organisation

The Python code used for converting the original database to this teach version is included as 'create_portal_teach_dataset.py'.

52

Reusable ♻️

Let's say people were able to find, access, and open your data. Now for the last part!

To be able to reuse your work, people need to be able to understand it. This means you need to provide good documentation:

  • at a minimum, you should provide a README file where you describe what the project is about, how the files are organised and how to reproduce the project
  • when working with tabular data, you should also provide a data dictionary where you explain what the different variables mean, the measurement units, how missingness is encoded etc.
53

Some data

Date collected Species Sex Weight
2015-01-08 PF M 7
2015-02-18 OT M 24
2015-02-19 OT F 23
2015-03-11 NA M 232
2015-03-11 OT F 22
2015-03-11 OT M 26
2015-03-11 PF M 8
2015-04-08 NA F NA
2015-05-06 NA NA NA
2015-05-18 NA F 182
2015-06-09 OT F 29
2015-07-08 NA F 115
2015-07-08 NA M 190
54

Some data

Date collected Species Sex Weight
2015-01-08 PF M 7
2015-02-18 OT M 24
2015-02-19 OT F 23
2015-03-11 NA M 232
2015-03-11 OT F 22
2015-03-11 OT M 26
2015-03-11 PF M 8
2015-04-08 NA F NA
2015-05-06 NA NA NA
2015-05-18 NA F 182
2015-06-09 OT F 29
2015-07-08 NA F 115
2015-07-08 NA M 190

It's quite hard to make sense of data without context...

  • what do the acronyms in Species stand for?
  • what is the unit of measurement for Weight?
55

Example of a data dictionary

  • Date collected: the date the data were collected in YYYY-MM-DD format
  • Species: a code for the species of the animal detected. See below for a table of what the codes stand for
  • Sex: the sex of the animal detected, M for male and F for female
  • Weight: the weight of the animal detected measured in grams

Missing data are coded as NA.

species code scientific name common name
PF Perognathus flavus Silky pocket mouse
OT Onychomys torridus Southern grasshopper mouse
NA Neotoma albigula White-throated woodrat
56

Reusable (cont'd) ♻️

Another crucial component of reusability is usage licences. Just because something is shared online, doesn't mean anyone can use it. That's because the creator* of the work holds the copyright to it.

*In fact, it is often your university or institute that holds the copyright to the work you created while working there.

57

Reusable (cont'd) ♻️

Another crucial component of reusability is usage licences. Just because something is shared online, doesn't mean anyone can use it. That's because the creator* of the work holds the copyright to it.

*In fact, it is often your university or institute that holds the copyright to the work you created while working there.

So, you need to tell people what they are allowed to do with your data and code. You do this by providing a usage licence.

58

Reusable (cont'd) ♻️

Another crucial component of reusability is usage licences. Just because something is shared online, doesn't mean anyone can use it. That's because the creator* of the work holds the copyright to it.

*In fact, it is often your university or institute that holds the copyright to the work you created while working there.

So, you need to tell people what they are allowed to do with your data and code. You do this by providing a usage licence.

Note that usage licences are different for data and for code:

59

Don't panic!

This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence. DOI: 10.5281/zenodo.3332807

60

Interim summary 2

Making your data and code Findable, Accessible, Interoperable and Reusable can take you a long way towards making your work reproducible.

61

Interim summary 2

Making your data and code Findable, Accessible, Interoperable and Reusable can take you a long way towards making your work reproducible.

There are many things you need to do to make your work FAIR, but:

  1. these things will benefit you
  2. you don't have to start doing everything immediately
  3. we're here to help 🤝 ✨ In this workshop you'll learn good practices for working with tabular data, how to generate and add metadata to your data, how to validate your data, and how to publish your data in a data repository.
62

Interim summary 2

Making your data and code Findable, Accessible, Interoperable and Reusable can take you a long way towards making your work reproducible.

There are many things you need to do to make your work FAIR, but:

  1. these things will benefit you
  2. you don't have to start doing everything immediately
  3. we're here to help 🤝 ✨ In this workshop you'll learn good practices for working with tabular data, how to generate and add metadata to your data, how to validate your data, and how to publish your data in a data repository.

So let's get started 🎉

63

Data Carpentry: Data organisation for tabular data

64

Tabular data

Most researchers work with tabular data for their research, so knowing how to organise this type of data well is fundamental.


Tabular data refers to "rectangular tables made up of rows and columns" (Wickham, 2014).

65

Tabular data

Most researchers work with tabular data for their research, so knowing how to organise this type of data well is fundamental.


Tabular data refers to "rectangular tables made up of rows and columns" (Wickham, 2014).

Date collected Species Sex Weight
2015-01-08 PF M 7
2015-02-18 OT M 24
2015-02-19 OT F 23
2015-03-11 NA M 232
2015-03-11 OT F 22
2015-03-11 OT M 26
2015-03-11 PF M 8
2015-04-08 NA F NA
2015-05-06 NA NA NA
2015-05-18 NA F 182
2015-06-09 OT F 29
2015-07-08 NA F 115
2015-07-08 NA M 190
66

Tidy data

A good framework for working with tabular data is that of tidy data. The concept of tidy data is borrowed from R and the tidyverse packages within it (Wickham, 2014).

llustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst.

67

Tidy data

llustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst.

68

Why tidy data?

Using a consistent way to organise your data, e.g. following the tidy data structure, helps you be more efficient when working with your data. That's because you can create (and reuse) tools that expect that kind of structure instead of starting from scratch every time.




69

Why tidy data?

Using a consistent way to organise your data, e.g. following the tidy data structure, helps you be more efficient when working with your data. That's because you can create (and reuse) tools that expect that kind of structure instead of starting from scratch every time.




Like R, NumPy and pandas can use vectorised processes that make working with large datasets more effcient and faster. Placing variables in columns makes using vectorised processes easier.

70

Example

Date collected Species Sex Weight
2015-01-08 PF M 7
2015-02-18 OT M 24
2015-02-19 OT F 23
2015-03-11 NA M 232
2015-03-11 OT F 22
2015-03-11 OT M 26
2015-03-11 PF M 8
2015-04-08 NA F NA
2015-05-06 NA NA NA
2015-05-18 NA F 182
2015-06-09 OT F 29
2015-07-08 NA F 115
2015-07-08 NA M 190
71

Example

Date collected Species Sex Weight
2015-01-08 PF M 7
2015-02-18 OT M 24
2015-02-19 OT F 23
2015-03-11 NA M 232
2015-03-11 OT F 22
2015-03-11 OT M 26
2015-03-11 PF M 8
2015-04-08 NA F NA
2015-05-06 NA NA NA
2015-05-18 NA F 182
2015-06-09 OT F 29
2015-07-08 NA F 115
2015-07-08 NA M 190

Summary (long)

Species Sex Mean weight
NA F 148.50
NA M 211.00
OT F 24.67
OT M 25.00
PF M 7.50
72

Example

Date collected Species Sex Weight
2015-01-08 PF M 7
2015-02-18 OT M 24
2015-02-19 OT F 23
2015-03-11 NA M 232
2015-03-11 OT F 22
2015-03-11 OT M 26
2015-03-11 PF M 8
2015-04-08 NA F NA
2015-05-06 NA NA NA
2015-05-18 NA F 182
2015-06-09 OT F 29
2015-07-08 NA F 115
2015-07-08 NA M 190

Summary (long)

Species Sex Mean weight
NA F 148.50
NA M 211.00
OT F 24.67
OT M 25.00
PF M 7.50

Summary (wide)

Sex NA OT PF
F 148.5 24.67 NA
M 211.0 25.00 7.5
73

⚠️ Keep your data tidy from the start

How you collect your data will vary depending on your field and specific project:

  • If you can control how the data are created, make sure you use the tidy data structure from the beginning.
  • If you cannot control how the raw data are structured, wrangle them into the tidy format, ideally using a scripted programming language, like Python or R.
74

Spreadsheet programmes (e.g. Microsoft Excel)

Researchers often use spreadsheet programmes like Microsoft Excel to work with their tabular data.

Although such programmes are good for some things, they aren't always appropriate...

75

Spreadsheet programmes (e.g. Microsoft Excel)

Researchers often use spreadsheet programmes like Microsoft Excel to work with their tabular data.

Although such programmes are good for some things, they aren't always appropriate...

For example, when 16000 COVID cases went unreported in England when the old .xls file format ran out of rows.

76

Problems with Excel

⛔ Data wrangling, visualisation & analysis

77

Problems with Excel

⛔ Data wrangling, visualisation & analysis

  • easy to make mistakes (accidental deletions, misapplying formulas)
78

Problems with Excel

⛔ Data wrangling, visualisation & analysis

  • easy to make mistakes (accidental deletions, misapplying formulas)

  • not reproducible (including mistakes)

79

Problems with Excel

⛔ Data wrangling, visualisation & analysis

  • easy to make mistakes (accidental deletions, misapplying formulas)

  • not reproducible (including mistakes)

Never edit your raw data in Excel!!!

80

Problems with Excel

⛔ Data wrangling, visualisation & analysis

  • easy to make mistakes (accidental deletions, misapplying formulas)

  • not reproducible (including mistakes)

Never edit your raw data in Excel!!!

⚠️ Data entry

81

Problems with Excel

⛔ Data wrangling, visualisation & analysis

  • easy to make mistakes (accidental deletions, misapplying formulas)

  • not reproducible (including mistakes)

Never edit your raw data in Excel!!!

⚠️ Data entry

  • data importing and exporting problems
82

Problems with Excel

⛔ Data wrangling, visualisation & analysis

  • easy to make mistakes (accidental deletions, misapplying formulas)

  • not reproducible (including mistakes)

Never edit your raw data in Excel!!!

⚠️ Data entry

  • data importing and exporting problems

  • non machine-readable formatting

83

Data importing and exporting problems

We've known since 2016 that many genetics papers (1 in 5) have mistakes because Microsoft Excel interprets gene names as dates.

84

Data importing and exporting problems

We've known since 2016 that many genetics papers (1 in 5) have mistakes because Microsoft Excel interprets gene names as dates.

85

Data importing and exporting problems

We've known since 2016 that many genetics papers (1 in 5) have mistakes because Microsoft Excel interprets gene names as dates.

Also, for any language scientists out there working with a language that's not English, always use UTF-8 text encoding.

86

Non machine-readable formatting

We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.

87

Non machine-readable formatting

We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.

Non machine-readable formatting includes:

88

Non machine-readable formatting

We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.

Non machine-readable formatting includes:

  • notes in the margin
89

Non machine-readable formatting

We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.

Non machine-readable formatting includes:

  • notes in the margin

  • spatial layout of data

90

Non machine-readable formatting

We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.

Non machine-readable formatting includes:

  • notes in the margin

  • spatial layout of data

  • field formatting

91

Non machine-readable formatting

We often organise data the way we humans like to work, i.e. by relying on context. That doesn't work for software like Python or R that we use to analyse data. So we need to structure our data in a way computers can understand.

Non machine-readable formatting includes:

  • notes in the margin

  • spatial layout of data

  • field formatting

  • using special characters and spaces

92

How to code missing values*

Null value Problems Compatibility Recommendation
0 Looks like true 0 Never use
blank Could be an oversight R, Python, SQL Best option
999, -999 Not recognised as null Avoid
NA, na Can be abbreviation R Good option
N/A Often incompatible Avoid
NULL Data type issues SQL Good option
None Data type issues Python Avoid
No data Data type issues Avoid
Missing Data type issues Avoid
-, +, . Data type issues Avoid

Adapted slightly from White et al. (2013)

93

Summary: Tips for structuring tabular data

  • Leave the raw data alone!
  • Be consistent
  • Put variables in columns
  • Put observations in rows
  • Put as little information as possible in a single cell
  • Avoid spaces and special characters
  • Use controlled vocabularies and standards (e.g. for dates)
  • Create documentation (e.g. README and data dictionary)
  • Think about how to code missing vales (and add that in the documentation!)
  • Do not use formatting (colours, font, bolding)
  • Use data validation to avoid errors (Frictionless Data framework)
94

Exercise 📝

  • Download the messy data
  • Open up the data in a spreadsheet programme
  • Notice the two tabs: Two RAs conducted surveys, one in 2013 and one in 2014. They kept track of the data in their own way in the tabs 2013 and 2014 respectively. Now you're the person in charge of this project and you want to start analysing the data.
  • You have 7 minutes to clean up as much of the dataset as you can.
  • Once the 7 minutes are up:
    • take a screenshot of your cleaned dataset
    • add a comment with what confused or annoyed you the most
  • Look at the posts of the other participants and give them a +1 if you like their solution/comment
07:00
95

Using open file formats

  • Accessible: usable by multiple types of software that are freely available (e.g. note that Microsoft requires a paid licence)
96

Using open file formats

  • Accessible: usable by multiple types of software that are freely available (e.g. note that Microsoft requires a paid licence)

  • Life expectancy: open file formats are more suitable for long-terms preservation because they don't rely ona single software programme

97

Using open file formats

  • Accessible: usable by multiple types of software that are freely available (e.g. note that Microsoft requires a paid licence)

  • Life expectancy: open file formats are more suitable for long-terms preservation because they don't rely ona single software programme

  • Publishing: data repositories, journals and funding agencies may have requirements for open file formats

98

Using open file formats

  • Accessible: usable by multiple types of software that are freely available (e.g. note that Microsoft requires a paid licence)

  • Life expectancy: open file formats are more suitable for long-terms preservation because they don't rely ona single software programme

  • Publishing: data repositories, journals and funding agencies may have requirements for open file formats

  • Universal format: interoperable format that produces the same results when it is imported by various software programmes, including plain text editors

99

Which open file format? 🤔

100

Which open file format? 🤔

Comma delimited

comma-separated value files (.csv) are plain text files where the columns are separated by commas

👍🏼 commonly used

👎🏼 annoying when data itself contains commas

101

Which open file format? 🤔

Comma delimited

comma-separated value files (.csv) are plain text files where the columns are separated by commas

👍🏼 commonly used

👎🏼 annoying when data itself contains commas

Tab delimited

tab-separated value files (.tsv) are plain text files where the columns are separated by tabs (\t)

👍🏼 no confusion when data contains commas or semicolons

👎🏼 not very commonly used (at least not yet)

102

Exporting from Excel to .csv

  1. From the top menu, select File and Save as
  2. In the Format field select CSV UTF-8 (Comma delimited) (.csv) from the list
  3. Double check the file name and the location where you want to save the data and hit Save
103

Exporting from Excel to .csv

  1. From the top menu, select File and Save as
  2. In the Format field select CSV UTF-8 (Comma delimited) (.csv) from the list
  3. Double check the file name and the location where you want to save the data and hit Save

My data contains commas!

Enclose the data fields with double quotes and double check that the file you are exporting can be read in correctly.

104

Exporting from Excel to .tsv

  1. From the top menu, select File and Save as
  2. In the Format field select Text (Tab delimited) (.txt) from the list
  3. Double check the file name and the location where you want to save the data and hit Save
  4. Rename the file you just saved and change the extension from .txt to .tsv
105

References

Colavizza, G., I. Hrynaszkiewicz, I. Staden, et al. (2020). The citation advantage of linking publications to research data. Vol. 15. 4 , p. e0230416. DOI: 10.1371/journal.pone.0230416.

Ernest, M., J. Brown, T. Valone, et al. (2020). Portal Project Teaching Database. DOI: 10.6084/m9.figshare.1314459.v10.

Kelion, L. (2020). Excel: Why using the Microsoft tool cause Covid-19 results to be lost. URL: https://www.bbc.com/news/technology-54423988.

Lowndes, J. and A. Horst (2020). Tidy data for efficiency, reproducibility and collaboration. URL: https://www.openscapes.org/blog/2020/10/12/tidy-data/.

Markowetz, F. (2015). Five selfish reasons to work reproducibly. Vol. 16.1 , p. 274. DOI: 10.1186/s13059-015-0850-7.

Nuijten, M. B., C. H. J. Hartgerink, M. A. L. M. van Assen, et al. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Vol. 48. 4 , pp. 1205-1226. DOI: 10.3758/s13428-015-0664-2.

Silvester, N., B. Alako, C. Amid, et al. (2015). Content discovery and retrieval services at the European Nucleotide Archive. Vol. 43 , pp. D23-D29. DOI: 10.1093/nar/gku1129.

The Turing Way Community (2021). The Turing Way: A handbook for reproducible, ethical and collaborative research. Version 1.0.1. DOI: 10.5281/zenodo.5671094.

107

References

The Turing Way Community and Scriberia (2021). Illustrations from the Turing Way book dashes. DOI: 10.5281/zenodo.5706310.

Vincent, J. (2020). Scientists rename human genes to stop Microsoft Excel from misreading them as dates. URL: https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates.

White, E. P., E. Baldridge, Z. T. Brym, et al. (2013). Nine simple ways to make it easier to (re)use your data. Vol. 1.e7v2. DOI: 10.7287/peerj.preprints.7v2.

Wickham, H. (2014). Tidy Data. Vol. 59.10 , pp. 1-23. DOI: 10.18637/jss.v059.i10.

Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Vol. 3.1 , p. 160018. DOI: 10.1038/sdata.2016.18.

Ziemann, M., Y. Eren, and A. El-Osta (2016). Gene name errors are widespread in the scientific literature. Vol. 17. 1 , p. 177. DOI: 10.1186/s13059-016-1044-7.

108

Hi, I'm Eirini 👋

👩‍🎓 PhD Candidate @Max Planck Institute for Psycholinguistics

  • worked with lots of tabular data with observations from experiments with human participants
  • got interested in open science and reproducibility
2
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow