Data Management

Michael Schramm

Texas Water Resources Institute

10/11/22

What is data management?

  • the process of providing the appropriate labeling, storage, and access for data at all stages of a research project (https://datamanagement.hms.harvard.edu/about/what-research-data-management)

Why data management?

digital data

[xkcd.com](https://xkcd.com/1683/)

Why data management?

  • Sponsor Requirements

  • Publication Requirements

  • Project Continuity

  • Facilitate Collaboration

How do we manage data?

  • Identify tools and formats that meet your needs

  • Plan additional time and effort to carry out management tasks

Making progress

[xkcd.com](https://xkcd.com/1906/)

How do we manage data?

  • Try to retain raw data in non-proprietary formats
  • 😀 good formats:
    • text files - .txt, .csv, .xml, .ascii, .html, etc.
    • containers - .tar, .gz, .zip
    • geospatial - .shp, .tiff, .gpkg, .cdf, etc.
    • images - .tiff, .png, .bmp, and other lossless formats.
    • databases - xml and csv for simple databases, MySQL and PostgreSQL for relational dbs.
    • lots of others available depending on project needs
  • Library of Congress maintains information about the sustainability of various digital formats: https://www.loc.gov/preservation/digital/formats/

How do we manage data?

  • 🤨 meh formats:
    • text files - various MS Office files. Be aware that Excel is notorious for auto-formatting data, which can silently transform or lose values.
    • geospatial - .gdb, ESRI's proprietary spatial database format. Standard in the industry but with limited support in QGIS, R, Python, etc.
  • 😬 generally not suitable for long-term storage or sharing:
    • Proprietary or software-specific formats - .rds, .rdata, .sas7bdat
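
The format categories above can be turned into a quick check of a project folder. The following is a minimal sketch, not a definitive tool: the `FORMAT_RATINGS` table and the `rate_format` and `audit` helpers are hypothetical names, and the exact extension spellings (for example `.xlsx` and `.docx` for MS Office files) are assumptions layered on the lists in these slides.

```python
from pathlib import Path

# Ratings adapted from the slides; the extension spellings are assumptions.
FORMAT_RATINGS = {
    "good": {".txt", ".csv", ".xml", ".html", ".tar", ".gz", ".zip",
             ".shp", ".tiff", ".gpkg", ".png", ".bmp"},
    "meh": {".xls", ".xlsx", ".doc", ".docx", ".gdb"},
    "avoid": {".rds", ".rdata", ".sas7bdat"},
}


def rate_format(path: Path) -> str:
    """Return 'good', 'meh', 'avoid', or 'unknown' based on the file extension."""
    suffix = path.suffix.lower()
    for rating, extensions in FORMAT_RATINGS.items():
        if suffix in extensions:
            return rating
    return "unknown"


def audit(folder: Path) -> dict[str, list[str]]:
    """Group the names of all files under `folder` by their format rating."""
    report: dict[str, list[str]] = {}
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            report.setdefault(rate_format(path), []).append(path.name)
    return report
```

Calling `audit(Path("my_project"))` (a hypothetical folder) returns a dictionary such as `{"good": [...], "avoid": [...]}`, which gives a starting list of files worth exporting to a more durable format.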

Don’t manipulate raw data

  • Once raw data is altered, you may or may not be able to restore it.

  • Read/copy data into an analysis spreadsheet, script, etc.
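
One way to follow this in a script is to treat the raw file as read-only input and write every cleaned or derived result to a separate file. The sketch below assumes the raw data is a CSV file; the function names and the folder layout in the usage note are hypothetical.

```python
import csv
import stat
from pathlib import Path


def protect_raw(path: Path) -> None:
    """Mark a raw data file read-only to guard against accidental overwrites."""
    path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def read_raw(path: Path) -> list[dict]:
    """Load the raw CSV into memory; the file on disk is not changed."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_derived(rows: list[dict], fieldnames: list[str], out_path: Path) -> None:
    """Write processed rows to a separate file, away from the raw data."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

A typical use would read from something like `raw/flow.csv`, filter or transform the rows in memory, and save the result to `derived/flow_clean.csv`, so the raw file can always be used to rerun the analysis from scratch.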

Metadata

  • Data about your data (variable names, units, how the data was collected, whether values are measured or model derived, etc.).

  • Several tools available to create metadata that meets ISO standards and federal guidelines.
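
Even without a dedicated metadata tool, a plain-text data dictionary stored next to the data file captures the basics. The sketch below is only an illustration: the `write_data_dictionary` helper, the `_metadata.json` naming, and the field names are assumptions, and the output is not an ISO-compliant metadata record.

```python
import json
from pathlib import Path


def write_data_dictionary(data_path: Path, description: str, variables: dict) -> Path:
    """Save a JSON sidecar describing the variables in a data file."""
    sidecar = data_path.with_name(data_path.stem + "_metadata.json")
    record = {
        "file": data_path.name,
        "description": description,
        # One entry per column, e.g. units and whether it was measured or modeled.
        "variables": variables,
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

For example, `write_data_dictionary(Path("flow.csv"), "Daily streamflow", {"flow_cfs": {"units": "cfs", "source": "measured"}})` writes `flow_metadata.json` alongside the data file.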

Sync/backup

  • Backing up and syncing take a snapshot of the current state of your data and store it somewhere else.

  • Allows restoration of prior versions of your data.
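
The snapshot idea can be sketched with the Python standard library: copy the data folder to a timestamped folder somewhere else, then confirm the copies match the originals. The `snapshot` and `sha256` helpers are hypothetical, and this manual approach is only an illustration of the concept, not a replacement for an automated backup or sync service.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = backup_root / f"{data_dir.name}-{stamp}"
    # copytree raises an error if dest already exists (two runs in one second).
    shutil.copytree(data_dir, dest)
    # Verify every copied file against its source.
    for src in data_dir.rglob("*"):
        if src.is_file():
            copy = dest / src.relative_to(data_dir)
            if sha256(src) != sha256(copy):
                raise OSError(f"Backup copy does not match source: {src}")
    return dest
```

Because each run creates a new folder, earlier snapshots stay available for restoring a prior version, at the cost of storing a full copy each time. `backup_root` should sit outside `data_dir`, ideally on a different drive.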

Sync/backup

| Method | Ease of implementation | Reliability | Collaboration | Notes |
|---|---|---|---|---|
| Local backup | ✅ | ❌ | ❌ | Automated backups require additional setup/software; no file versioning |
| OneDrive/Teams | ✅ | 🟡 | ✅ | Fully automated; limited file versioning |
| Distributed version control (GitHub, GitLab, etc.) | ❌ | ✅ | ✅ | Strong file versioning; difficult to learn |

Sharing

Sharing means making the current state of your data available to others (internal or external to the Institute). Why is this needed?

  • Collaboration on a data analysis, writing, etc.

  • Publication or delivery of datasets

Sharing

Sharing for collaboration:

  • Distributed version control is difficult but standard for code projects.
  • OneDrive/Google Sheets and other cloud services make sharing easy.

Sharing

Sharing for publication:

  • Long-term archiving available through numerous data repository services.
  • Examples: Texas Data Repository, CUAHSI Hydroshare, figshare
  • re3Data provides a comprehensive list and information on data repositories.
  • TAMU librarians are a good resource for selecting appropriate repositories.

Thanks!