@Stephen-Gates
Last active September 26, 2017 16:50
Data Package Version

The Data Package version format follows the Semantic Versioning specification format: MAJOR.MINOR.PATCH

Semantic Versioning helps developers manage dependencies between software packages. The version numbers, and the way they change, convey meaning about the underlying code and what has been modified from one version to the next.

In Data Packages this concept is applied to data.

The version numbers, and the way they change, convey meaning about how the data package has been modified from one version to the next.

Given a Data Package version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible changes, e.g.

  • Change the table schema
  • Change field or data package names or data package identifiers
  • Add, remove or re-order fields

MINOR version when you add data in a backwards-compatible manner, e.g.

  • Add new data to existing data resource
  • Add a new data resource

PATCH version when you make backwards-compatible fixes, e.g.

  • Correct existing data
  • Change metadata

Scenarios

  • You are developing your data through public consultation. Start your initial data release at 0.1.0
  • You release your data for the first time. Use version 1.0.0
  • You append last month's data to an existing production release. Increment the MINOR version number
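
The rules and scenarios above can be sketched in a few lines of Python. This is only an illustration: the `Change` enum and the `bump` function are hypothetical names, not part of any Data Package tooling, and resetting the lower-order numbers to zero follows the Semantic Versioning specification rather than anything stated in this gist.

```python
from enum import Enum


class Change(Enum):
    """Hypothetical labels for the three kinds of change listed above."""
    MAJOR = "major"  # incompatible: table schema, names, identifiers, fields
    MINOR = "minor"  # data added in a backwards-compatible manner
    PATCH = "patch"  # corrections to existing data, changes to metadata


def bump(version: str, change: Change) -> str:
    """Return the next MAJOR.MINOR.PATCH string for the given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change is Change.MAJOR:
        return f"{major + 1}.0.0"        # lower-order numbers reset (SemVer)
    if change is Change.MINOR:
        return f"{major}.{minor + 1}.0"  # patch number resets (SemVer)
    return f"{major}.{minor}.{patch + 1}"


# Walking through the scenarios:
version = "0.1.0"                      # data developed through public consultation
version = "1.0.0"                      # first release, chosen by hand
version = bump(version, Change.MINOR)  # last month's data appended
print(version)  # 1.1.0
```

Deciding which kind of change applies stays a human judgement; the function only does the arithmetic once that decision is made.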
@ethanwhite

Overall this looks good, but it feels a little too softwarey to me. If it was me I'd mention the association with software versioning briefly toward the beginning and then try to focus on data specific language.

E.g.:

  • I wouldn't talk about "production". That won't mean a lot to many data folks (at least in the sciences).
  • When talking about "names or identifiers" I'd try to be specific (or give specific examples): "Change names or identifiers including file names, column headers, ..."
  • I'm also torn about "backwards-compatible" but I don't see an obvious improvement to suggest that doesn't get wordy

@Stephen-Gates
Author

Thanks @ethanwhite
I've struck out or included some text based on your suggestions.

@henrykironde

henrykironde commented Sep 24, 2017

Thanks @Stephen-Gates and everyone for the input. I have gone through the previous issue, and I am glad to say that we are converging on the same point.

To emphasize our current progress towards our goal of versioning for data, we have stated that a complete data package contains data and its resources. The resources include, but are not limited to, the metadata and the data specification description file, which should contain the version specification.
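
As a rough sketch of what "the description file should contain the version" could look like in practice, the snippet below reads a version from a JSON descriptor and checks that it has the MAJOR.MINOR.PATCH shape. The descriptor contents, the resource names, and the `read_version` helper are assumptions made for illustration; only the idea of a `version` entry in the description file comes from the discussion.

```python
import json
import re

# Plain MAJOR.MINOR.PATCH with no leading zeros (pre-release tags not handled).
VERSION_PATTERN = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")

# Hypothetical descriptor, inlined so the example is self-contained.
DESCRIPTOR_TEXT = """
{
  "name": "example-package",
  "version": "1.1.0",
  "resources": [{"name": "observations", "path": "observations.csv"}]
}
"""


def read_version(descriptor_text):
    """Return (major, minor, patch) from a descriptor, or raise ValueError."""
    descriptor = json.loads(descriptor_text)
    version = descriptor.get("version")
    if not isinstance(version, str) or not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"missing or malformed version: {version!r}")
    return tuple(int(part) for part in version.split("."))


print(read_version(DESCRIPTOR_TEXT))  # (1, 1, 0)
```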

I think @Stephen-Gates has put together a good categorization of some of these cases. It would help if we could add more detailed explanations to them. For example, "Change the data structure" may sound a bit ambiguous to users who are not very familiar with the terms as applied.

Protocol as written above

Given a Data Package version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible changes, e.g.

  • Change the data structure
  • Change field or data package names or data package identifiers
  • Add or remove fields

MINOR version when you add data in a backwards-compatible manner, e.g.

  • Add new data to existing data resource
  • Add a new data resource

PATCH version when you make backwards-compatible fixes, e.g.

  • corrections to existing data
  • changes to metadata

Additionally, we should think about users who take the time to package data that has not been packaged. These users are also going to use these protocols. On a good note, some of the described versioning protocols will hold for both packaged and non-packaged data.

Ethan mentioned,

Location is an interesting case. Certainly moving data breaks code that accesses it directly,
but the data itself doesn't change. 
Based on the software analogy I'm tempted to say that we ignore where the data is for data versioning, 
since I wouldn't bump a version moving software from GitHub to BitBucket.

My concern here is that much software is managed by package managers like apt-get, pip, and Conda, and a good number of programs are standalone applications.

Data products and packages, by contrast, are provided as services and are used or ingested based on their URL. If a URL changes, the user's tool will not be able to get the data. When software moves from GitHub to Bitbucket, either the download page is updated with a new link or the package manager is given the new location of the source. Let me know what you think.

I will compile some of the questions/concerns that I feel have not been answered and repost them here for more contributions.

@Stephen-Gates
Author

@henrykironde Thanks!

In the Gist:

  • I replaced data structure with table schema
  • I added re-order fields (based on a comment by @rufuspollock)

With this change, appending a column, which had been suggested as a MINOR change, is now a MAJOR change (as @ethanwhite originally suggested).
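
To make the consequence concrete, here is a minimal sketch that compares two ordered lists of field names and treats any difference as MAJOR, in line with the revised list. The function names and the example field names are hypothetical, and a rename simply shows up as one field added and one removed.

```python
def field_changes(old_fields, new_fields):
    """List the differences between two ordered lists of field names."""
    changes = []
    added = [name for name in new_fields if name not in old_fields]
    removed = [name for name in old_fields if name not in new_fields]
    if added:
        changes.append("added: " + ", ".join(added))
    if removed:
        changes.append("removed: " + ", ".join(removed))
    # Same names in a different order counts as a re-order.
    if not added and not removed and old_fields != new_fields:
        changes.append("re-ordered")
    return changes


def required_bump(old_fields, new_fields):
    """Return "MAJOR" for any field-level change, otherwise None."""
    return "MAJOR" if field_changes(old_fields, new_fields) else None


# Appending a column, the case discussed above:
print(required_bump(["site", "date"], ["site", "date", "count"]))  # MAJOR
```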

Based on previous discussion I thought we were ignoring location changes.

I look forward to your questions/concerns

@henrykironde

@Stephen-Gates, Thanks for the updates, I totally agree with you on the URL changes after discussing the same with @zhangcandrew. We should ignore the change in the URL.

About the notifications: I am not getting any notifications from the gist, but I do get all the notifications from the GitHub issues.
