Documenting and Publishing your Data
Lesson 11 with Rachael Blake
Lesson Objectives
- Review key concepts of making your data tidy
- Discover the importance of metadata
- Learn what a data package is
- Prepare to publish your data from R
Specific Achievements
- Create metadata locally
- Build data package locally
- Learn how data versioning can help you collaborate more efficiently
- Practice uploading data package to repository (exercise outside lesson)
What is a data package?
A data package is a collection of files that describe your data.
There are two essential parts to a data package: data and metadata.
Data packages also frequently include 1) code scripts that clean, process, or perform statistical analyses on your data, and 2) visualizations that are direct products of coded analyses. The relationships between these files are described in the provenance information for a data package. We will get into more details on these components in this lesson.
Why package and publish your data?
- Data is a valuable asset - easily publish your data now or in the future
- Efficient collaboration - improve sharing data with collaborators now
- Credit for your work
- Funder requirement
- Publisher requirement
- Open, Reproducible Science
Integrating data documentation
It is important to incorporate documentation of your data throughout your reproducible research pipeline.
Preparing Data for Publication
Both raw data and derived data (data you’ve assembled or processed in some way) from a project need to be neat and tidy before publication. Following principles of tidy data from the beginning will make your workflows more efficient during a project, and will make it easier for others to understand and use your data once published. However, if your data isn’t yet tidy, you can make it tidier before you publish it.
Consistency is key when managing your data!
- At a minimum, use the same terms to refer to the same things throughout your data.
  - “origin”, not “starting place”, “beginning location”, and “source”
  - “Orcinus orca”, not “killer Whale”, “Killer whale”, “killer.whale”, and “orca”
- Consider using a discipline-wide published vocabulary if appropriate (ex: hydrological controlled vocab.)
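As a rough illustration of the first point, the sketch below maps several spellings of one term onto a single preferred label using a lookup table in base R. The input vector and the lookup table are made up for this example; a real project would build the lookup from its own data or from a published vocabulary.

```r
# Hypothetical species labels as they might appear in a messy column
species_raw <- c("killer Whale", "Killer whale", "killer.whale", "orca", "Orcinus orca")

# Lookup table: each known variant (lower case, separators removed) maps to one term
lookup <- c(
  "killer whale" = "Orcinus orca",
  "orca"         = "Orcinus orca",
  "orcinus orca" = "Orcinus orca"
)

# Normalize case and separators, then look up the preferred term
key <- tolower(gsub("[._]+", " ", trimws(species_raw)))
species_clean <- unname(lookup[key])

# Values missing from the lookup come back as NA, so unmapped terms are easy to spot
unique(species_clean)
```

Keeping the mapping in one table means the decision about preferred terms is recorded in code rather than made by hand in a spreadsheet.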
Naming data files
Bad: “CeNsus data*2ndTry 2/15/2017.csv”
Good: “census_data.csv”
library(tidyverse)
stm_dat <- read_csv("data/StormEvents.csv")
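One way to catch file names like the "Bad" example above is a small check run before files are saved or shared. The function below is a hypothetical helper, and the rule it encodes (letters, digits, underscores, and hyphens, followed by one extension) is an assumed convention rather than a standard.

```r
# Hypothetical helper: TRUE when a file name contains only letters, digits,
# underscores, or hyphens, followed by a single extension (assumed convention)
is_safe_filename <- function(x) {
  grepl("^[A-Za-z0-9_-]+\\.[A-Za-z0-9]+$", x)
}

is_safe_filename("census_data.csv")
is_safe_filename("StormEvents.csv")
is_safe_filename("CeNsus data*2ndTry 2/15/2017.csv")
```

Spaces, asterisks, and slashes all fail the check; slashes in particular are read as directory separators on most systems.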
Formatting data
- One table/file for each type of observation.
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
Downstream operations/analysis require tidy data.
These principles closely map to best practices for “normalization” in database design. R developer Hadley Wickham further describes the principles of tidy data in this paper (Wickham 2014).
You can work towards ready-to-analyze data incrementally, documenting the intermediate data and cleaning steps you took in your scripts. This can be a powerful accelerator for your analysis both now and in the future.
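To make the "each variable has its own column" rule concrete, here is a minimal sketch that reshapes a small wide table into tidy form with base R's `reshape()`. The table and its counts are invented for illustration; in a tidyverse workflow, `tidyr::pivot_longer()` does the same job.

```r
# Made-up wide table: one column per year, so "year" is hidden in the column names
wide <- data.frame(
  state = c("FL", "TX"),
  y2005 = c(3, 5),
  y2006 = c(4, 2)
)

# Reshape so that year and event count each get their own column
long <- reshape(
  wide,
  direction = "long",
  varying   = c("y2005", "y2006"),
  v.names   = "events",
  timevar   = "year",
  times     = c(2005, 2006),
  idvar     = "state"
)

long
```

Each row of `long` is now one observation: a state, a year, and the number of events.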
Let’s do a few checks to see if our data is tidy.
- make sure data meet criteria above
- make sure blanks are NA or some other standard
> head(stm_dat)
> tail(stm_dat)
- check date format
> str(stm_dat)
- check case, white space, etc.
> unique(stm_dat$EVENT_NARRATIVE)
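If the checks above turn up blanks, stray white space, or mixed case, one possible cleanup is sketched below in base R. The toy data frame stands in for a few columns of a storm events table; the values are made up, and the choice to upper-case every text column is an assumption that may not suit free-text fields.

```r
# Made-up rows standing in for a few columns of a storm events table
toy <- data.frame(
  STATE      = c("FLORIDA", " florida", "TEXAS ", ""),
  EVENT_TYPE = c("Hail", "hail", "", "Flood"),
  stringsAsFactors = FALSE
)

# Trim stray white space and standardize case in every character column
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(x) toupper(trimws(x)))

# Recode empty strings as NA so blanks are represented one way only
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[x == ""] <- NA
  x
})

# Count missing values per column and list the remaining distinct labels
colSums(is.na(toy))
unique(toy$STATE)
```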
Outputting derived data
Because we followed the principle of one table/file for each type of observation, we have one table to be written out to a file now.
dir.create('storm_project', showWarnings = FALSE)
write_csv(stm_dat, "storm_project/StormEvents_d2006.csv")
Versioning your derived data
Issue: Making sure you and your collaborators are using the same version of a dataset
One solution: version your data in a repository (more on this later)
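Short of a repository, a quick way to confirm that two copies of a file are the same version is to compare checksums. The sketch below uses `tools::md5sum()` from base R on small stand-in files written to a temporary directory; the file contents are invented for the example.

```r
# Write a small stand-in file; in practice this would be your derived data file
path_a <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), path_a, row.names = FALSE)

# A collaborator's copy: identical content, different location
path_b <- tempfile(fileext = ".csv")
file.copy(path_a, path_b)

# A second version with one changed value
path_c <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 9.9)), path_c, row.names = FALSE)

# Equal MD5 checksums mean the files are byte-for-byte identical
sums <- unname(tools::md5sum(c(path_a, path_b, path_c)))
sums[1] == sums[2]  # the copy matches the original
sums[1] == sums[3]  # the edited version does not
```

A checksum only tells you whether files differ, not which one is newer, which is why a versioned repository is still the more complete solution.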
What is Metadata?
Metadata is the information needed for someone else to understand and use your data. It is the Who, What, When, Where, Why, and How about your data.
This includes package-level and file-level metadata.
Common components of metadata
Package-level:
- Data creator
- Geographic and temporal extents of data (generally)
- Funding source
- Licensing information (Creative Commons, etc.)
- Publication date
File-level:
- Define variable names
- Describe variables (units, etc.)
- Define allowed values for a variable
- Describe file formats
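File-level metadata can be thought of as a small table with one row per variable. The sketch below builds such a table in base R for a made-up data frame. The column names `variableName`, `description`, and `unitText` are assumed here to resemble an attributes template, and `storageType` is a hypothetical extra column added for illustration.

```r
# Made-up data standing in for a few columns of a storm events table
toy <- data.frame(
  YEAR       = c(2006L, 2006L),
  BEGIN_LAT  = c(30.1, 31.4),
  EVENT_TYPE = c("Hail", "Flood"),
  stringsAsFactors = FALSE
)

# One row of metadata per variable: name, meaning, unit, and storage type
attributes_tbl <- data.frame(
  variableName = names(toy),
  description  = c("Year the event began",
                   "Latitude where the event began",
                   "Type of storm event"),
  unitText     = c("year", "decimal degrees", NA),
  storageType  = unname(vapply(toy, function(x) class(x)[1], character(1))),
  stringsAsFactors = FALSE
)

attributes_tbl
```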
Metadata Standards
The goal is to have a machine and human readable description of your data.
Metadata standards create a structural expression of metadata necessary to document a data set.
This can mean using a controlled set of descriptors for your data specified by the standard.
Without metadata standards, your digital data may be irretrievable, unidentifiable or unusable. Metadata standards contain definitions of the data elements and standardised ways of representing them in digital formats such as databases and XML (eXtensible Markup Language).
These standards ensure consistent structure that facilitates data sharing and searching, record provenance and technical processes, and manage access permissions. Recording metadata in digital formats such as XML ensures effective machine searches through consistent structured data entry, and thesauri using controlled vocabularies.
Metadata standards are often developed by user communities.
For example, EML was developed by ecologists to describe environmental and ecological data.
Therefore, the standards can vary between disciplines and types of data.
Some examples include:
- General: Dublin Core
- Ecological/Environmental/Biological: EML, Darwin Core
- Social science: DDI, EAD
- Geospatial/Meteorological/Oceanographic: ISO 19115, FGDC/CSDGM (no longer current)
Creating metadata
Employer-specific mandated methods (ex: USGS)
Repository-specific methods
- website for a repository
Some repository websites guide you through metadata creation during the process of uploading your data.
Stand-alone software
- Data Curator - still in beta
Coding
Example of coding up some metadata
R Package | What does it do? |
---|---|
dataspice | creates metadata files in json-ld format |
here | facilitates finding your files in R |
emld | aids conversion of metadata files between EML and json-ld |
EML | creates EML metadata files |
jsonlite | reads json and json-ld file formats in R |
We’ll use the dataspice package to create metadata in the EML metadata standard.
library(dataspice)
library(here)
Create data package templates.
create_spice(dir = "storm_project")
Look at your storm_project folder, and see the CSV template files.
The templates are empty, so now we need to populate them. We’ll start with package-level metadata.
Add extent, coverage, license, publication, funder, keywords, etc.
We can get the temporal and geographic extent information using the range() function.
> range(stm_dat$YEAR)
>
> range(stm_dat$BEGIN_LAT, na.rm=TRUE)
> range(stm_dat$BEGIN_LON, na.rm=TRUE)
This extent information can now be added to the biblio.csv metadata file.
edit_biblio(metadata_dir = here("storm_project", "metadata"))
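To avoid retyping numbers by hand, the extents can first be collected into a single one-row table. The sketch below does this in base R with a made-up data frame. The column names (`northBoundCoord`, `startDate`, and so on) are assumed to match the biblio.csv template, so check them against your own file before copying values across.

```r
# Made-up rows standing in for the year and coordinate columns of the storm data
toy <- data.frame(
  YEAR      = c(2006, 2006, 2006),
  BEGIN_LAT = c(30.1, NA, 41.7),
  BEGIN_LON = c(-97.5, -81.2, NA)
)

# Compute each extent once, ignoring missing coordinates
lat <- range(toy$BEGIN_LAT, na.rm = TRUE)
lon <- range(toy$BEGIN_LON, na.rm = TRUE)
yrs <- range(toy$YEAR)

# Collect the extents under the (assumed) biblio.csv column names
extent <- data.frame(
  southBoundCoord = lat[1],
  northBoundCoord = lat[2],
  westBoundCoord  = lon[1],
  eastBoundCoord  = lon[2],
  startDate       = yrs[1],
  endDate         = yrs[2]
)

extent
```

Taking the minimum longitude as the western bound assumes the data do not cross the antimeridian.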
Describe the creators of the data.
edit_creators(metadata_dir = here("storm_project", "metadata"))
Add information about where the data can be accessed.
The prep_access() function tries to discover the metadata for itself.
prep_access(data_path = here("storm_project"),
access_path = here("storm_project", "metadata", "access.csv"))
The edit_access() function can be used to edit the generated metadata, or used by itself to manually enter access metadata.
edit_access(metadata_dir = here("storm_project", "metadata"))
We’ll describe the file-level metadata now.
Add attributes of the data.
The prep_attributes() function also tries to discover the metadata for itself.
prep_attributes(data_path = here("storm_project"),
attributes_path = here("storm_project", "metadata", "attributes.csv"))
The edit_attributes() function can be used to further edit the file attribute metadata.
edit_attributes(metadata_dir = here("storm_project", "metadata"))
Now we can write our metadata to a json-ld file.
write_spice(path = here("storm_project", "metadata"))
Now convert the json-ld file into EML (Ecological Metadata Language) format.
library(emld)
library(EML)
library(jsonlite)
json <- read_json("storm_project/metadata/dataspice.json")
eml <- as_emld(json)
write_eml(eml, "storm_project/metadata/dataspice.xml")
Publishing your data
Choosing to publish your data in a long-term repository can:
- enable versioning of your data
- fulfill a journal requirement to publish data
- facilitate citing your dataset by assigning a permanent uniquely identifiable code (DOI)
- improve discovery of and access to your dataset for others
- enable re-use and greater visibility of your work
Why can’t I just put my data on Dropbox, Google Drive, my website, etc?
Key reasons to use a repository:
- preservation (permanence)
- stability (replication / backup)
- access
- standards
When to publish your data?
Near the beginning? At the very end?
A couple issues to think about:
1) How do you know you’re using the same version of a file as your collaborator?
- Publish your data privately and use the unique ids of datasets to manage versions.
2) How do you control access to your data until you’re ready to go public with it?
- Embargoing - controlling access for a certain period of time and then making your data public.
Get an ORCiD
ORCiDs identify you and link you to your publications and research products. They are used by journals, repositories, etc., and often serve as a login.
To obtain an ORCiD, register at https://orcid.org.
Creating a data package
Data packages can include metadata, data, and script files, as well as descriptions of the relationships between those files.
Currently there are a few ways to make a data package:
- Frictionless Data uses json-ld format, and has the R package datapackage.r, which creates metadata files using schema.org specifications and creates a data package.
- DataONE frequently uses EML format for metadata, and has related R packages datapack and rdataone that create data packages and upload data packages to a repository.
We’ll follow the DataONE way of creating a data package in this lesson.
R Package | What does it do? |
---|---|
datapack | creates data package including file relationships |
uuid | creates a unique identifier for your metadata |
We’ll create a local data package using datapack:
library(datapack)
library(uuid)
dp <- new("DataPackage") # create empty data package
Add the metadata file we created earlier to the blank data package.
emlFile <- "storm_project/metadata/dataspice.xml"
emlId <- paste("urn:uuid:", UUIDgenerate(), sep = "")
mdObj <- new("DataObject", id = emlId, format = "eml://ecoinformatics.org/eml-2.1.1", filename = emlFile)
dp <- addMember(dp, mdObj) # add metadata file to data package
Add the data file we saved earlier to the data package.
datafile <- "storm_project/StormEvents_d2006.csv"
dataId <- paste("urn:uuid:", UUIDgenerate(), sep = "")
dataObj <- new("DataObject", id = dataId, format = "text/csv", filename = datafile)
dp <- addMember(dp, dataObj) # add data file to data package
Define the relationship between the data and metadata.
dp <- insertRelationship(dp, subjectID = emlId, objectIDs = dataId)
You can also add scripts and derived data files to the data package.
Create a Resource Description Framework (RDF) of the relationships between data and metadata.
serializationId <- paste("resourceMap", UUIDgenerate(), sep = "")
filePath <- file.path(sprintf("%s/%s.rdf", tempdir(), serializationId))
status <- serializePackage(dp, filePath, id=serializationId, resolveURI = "")
Save the data package to a file, using the BagIt packaging format.
Right now this creates a zipped file in the tmp directory. We’ll have to move the file out of the temp directory after it is created. Hopefully this will be changed soon!
dp_bagit <- serializeToBagIt(dp)
file.copy(dp_bagit, "storm_project/Storm_dp.zip")
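Because `file.copy()` reports failure by returning `FALSE` rather than raising an error (for example, when the destination already exists and `overwrite` is left at its default of `FALSE`), it can be worth checking the result. The sketch below shows one way to do that in base R, using a stand-in file and a hypothetical destination folder under the temporary directory instead of the real data package.

```r
# Stand-in for the zipped data package written to the temporary directory
src <- tempfile(fileext = ".zip")
writeBin(as.raw(1:10), src)

# Hypothetical destination folder, used here in place of storm_project
dest_dir <- file.path(tempdir(), "storm_project_demo")
dir.create(dest_dir, showWarnings = FALSE)
dest <- file.path(dest_dir, "Storm_dp.zip")

# Copy, replacing any earlier copy, and stop loudly if the copy did not happen
copied <- file.copy(src, dest, overwrite = TRUE)
if (!copied) stop("Copy failed: ", dest)

# A cheap sanity check that the whole file arrived
same_size <- file.size(src) == file.size(dest)
same_size
```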
Picking a repository
There are many repositories out there, and it can seem overwhelming picking a suitable one for your data.
Repositories can be subject or domain specific. re3data lists repositories by subject and can help you pick an appropriate repository for your data.
DataONE is a federation of repositories housing many different types of data. Some of these repositories include Knowledge Network for Biocomplexity (KNB), Environmental Data Initiative (EDI), Dryad, and USGS Science Data Catalog.
For qualitative data, there are a few dedicated repositories: QDR and Data-PASS.
Though a bit different, Zenodo facilitates publishing (with a DOI) and archiving all research outputs from all research fields.
This can be used to publish releases of your code that lives in a GitHub repository. However, since GitHub is not designed for data storage and is not a persistent repository, this is not a recommended way to store or publish data.
Uploading to a repository
Uploading requirements can vary by repository and type of data. The minimum you usually need is basic metadata, and a data file.
If you have a small number of files, using a repository GUI will usually be simpler. For large numbers of files, automating uploads from R will save time. And it’s reproducible!
If you choose, you can upload the data package to a repository in the DataONE federation using the rdataone package.
1) Get authentication token for DataONE (follow steps here)
2) Upload your data package using R (from vignette for rdataone)
R Package | What does it do? |
---|---|
dataone | uploads data package to repository |
Load my API token
source("D1_token.R")
Set access rules
library(dataone)
dpAccessRules <- data.frame(subject="http://orcid.org/0000-0003-0847-9100",
permission="changePermission")
This gives this particular ORCiD (person) permission to read, write, and change permissions for others for this package. To grant different permissions to several people, list one subject and one permission per row:
dpAccessRules2 <- data.frame(subject = c("http://orcid.org/0000-0003-0847-9100",
"http://orcid.org/0000-0000-0000-0001"),
permission = c("changePermission", "read")
)
NOTE: When you upload the package, you also need to set public = FALSE if you don’t want your package public yet.
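Before uploading, it can help to sanity-check an access rules table like the ones above, since a typo in a subject or permission is easy to miss. The sketch below is a base R check written for this lesson, not part of the dataone package. It assumes every subject is an ORCiD URI, as in the examples here, and that permissions are limited to `read`, `write`, and `changePermission`.

```r
# Access rules in the same shape as the lesson's example
rules <- data.frame(
  subject    = c("http://orcid.org/0000-0003-0847-9100",
                 "http://orcid.org/0000-0000-0000-0001"),
  permission = c("changePermission", "read"),
  stringsAsFactors = FALSE
)

# Permissions assumed to be valid for this check
allowed <- c("read", "write", "changePermission")

# Assumed subject format: an ORCiD URI with four groups of four characters
orcid_pattern <- "^https?://orcid\\.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$"

# Flag rows whose permission or subject does not look right
bad_permission <- !(rules$permission %in% allowed)
bad_subject    <- !grepl(orcid_pattern, rules$subject)

rules[bad_permission | bad_subject, ]
```

An empty result means every row passed both checks.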
Upload data package
The first argument here is the environment - “PROD” is production where you publish your data. “STAGING” can be used if you’re not yet sure you have everything in order and want to test uploading your data package.
The second argument is the repository specification. A table of member node IDs is available in data/Nodes.csv:
> read.csv("data/Nodes.csv")
First set the environment and repository you’ll upload to:
d1c <- D1Client("STAGING2", "urn:node:mnTestKNB")
Now do the actual uploading of your data package:
packageId <- uploadDataPackage(d1c, dp, public = TRUE, accessRules = dpAccessRules,
quiet = FALSE)
Citation
Getting a Digital Object Identifier (DOI) for your data package can make it easier for others to find and cite your data.
You can assign a DOI to the metadata file for your data package.
First specify the environment and repository, then request a DOI.
cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:DRYAD")
doi <- generateIdentifier(mn, "DOI")
Now overwrite the previous metadata file with the new DOI-identified metadata file.
mdObj <- new("DataObject", id = doi, format = "eml://ecoinformatics.org/eml-2.1.1",
filename = emlFile)
Versioning Data
How do you know you’re using the same version of a data file as your collaborator?
- Do not edit raw data files! Make edits/changes via your code only, and then output a derived “clean” data file.
Potential ways forward:
- Upload successive versions of the data to a repo with versioning, but keep it private until you’ve reached the final version or the end of the project. (see below)
- Google Sheets has some support for viewing the edit history of cells, added or deleted columns and rows, changed formatting, etc.
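The rule about never editing raw data files can be sketched as a small script: read the raw file, make changes in code, write a separate derived file, and confirm the raw file is untouched. The file names, the toy contents, and the latitude filter below are all invented for illustration.

```r
# Stand-in raw file; the second row has an impossible latitude
raw_path <- file.path(tempdir(), "raw_demo.csv")
writeLines(c("id,lat", "1,30.1", "2,999", "3,41.7"), raw_path)
raw_checksum <- unname(tools::md5sum(raw_path))

# All changes happen in code, never in the raw file itself
raw   <- read.csv(raw_path)
clean <- raw[raw$lat >= -90 & raw$lat <= 90, ]

# Write the cleaned rows to a new, separately named derived file
derived_path <- file.path(tempdir(), "clean_demo_v1.csv")
write.csv(clean, derived_path, row.names = FALSE)

# The raw file's checksum is unchanged after the whole workflow
raw_unchanged <- identical(unname(tools::md5sum(raw_path)), raw_checksum)
raw_unchanged
```

Because the cleaning step lives in the script, anyone with the raw file can regenerate the same derived file.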
Updating your data package on the repository
To take advantage of the versioning of data built into many repositories, you can update the data package (replace files with new versions).
Please see the vignettes for rdataone, section on “Replace an object with a newer version”.
Provenance
Provenance (prov) tracks the inputs and outputs generated by the research process, from raw data to publication(s).
It is often very helpful to create a diagram that shows the relationships between files.
A couple examples include:
- simple path diagram
- a more descriptive diagram
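Before drawing a diagram, the same relationships can be written down as a small table with one row per processing step. The sketch below does this in base R. The two data file paths come from this lesson, while the script names and the figure file are hypothetical.

```r
# One row per step: which file went in, which script ran, which file came out
prov <- data.frame(
  input  = c("data/StormEvents.csv",
             "storm_project/StormEvents_d2006.csv"),
  script = c("clean_storms.R", "plot_storms.R"),   # hypothetical script names
  output = c("storm_project/StormEvents_d2006.csv",
             "storm_events_map.png"),              # hypothetical figure file
  stringsAsFactors = FALSE
)

# Raw inputs are used by a script but never produced by one
raw_inputs <- setdiff(prov$input, prov$output)

# Final products are produced by a script but never used by another
final_outputs <- setdiff(prov$output, prov$input)

raw_inputs
final_outputs
```

Each row of the table corresponds to one arrow-script-arrow segment in a path diagram.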
Adding provenance to your data package
Provenance can be added to a data package using datapack and the function below. See the vignette for more details.
dp <- describeWorkflow(dp, sources = doIn, program = progObj, derivations = doOut)
Closing thoughts
Integrating data documentation and publication into your research workflow will increase collaboration efficiency, reproducibility, and impact in the science community.
Exercises
Exercise 1
ORCiDs identify you (like a DOI identifies a paper) and link you to your publications and research products.
If you don’t have an ORCiD, register and obtain your number at https://orcid.org.
Exercise 2
The DataONE federation of repositories requires a unique authentication token for you to upload a data package.
Follow the steps here to obtain your token.
Exercise 3
There are many repositories you could upload your data package to within the DataONE federation.
Pick a repository from the list (located at data/Nodes.csv) and upload a test data package to the “STAGING” environment.
Use the example code from the lesson.
Exercise 4
DOIs (digital object identifiers) are almost universally used to identify papers. They also improve citation and discoverability for your published data packages.
Get a DOI for your published data package.
Use the example code from the lesson.
Exercise 5
If you’re using a data repository to manage versions of your derived data files, you will likely want to update your data file at some point. Use the vignette from the dataone package to “replace an older version of your derived data” with a newer version.
Exercise 6
Provenance is an important description of how your data have been obtained, cleaned, processed, and analyzed. It is being incorporated into descriptions of data packages in some repositories.
Use the describeWorkflow() function from the datapack package to add provenance to your test data package.
Use the example code from the lesson.