Research Data Management

Data Types Guidance

Data are raw or semi-processed (cleaned-up) information collected for the purpose of analysis. A data set is inherently neutral, may be qualitative or quantitative, born-digital or digitized, and is not necessarily tabular or even digital. Data may come in many formats, including text, numerical, tabular, multimedia, models and software, as well as discipline- and instrument-specific data.

The word "data" means different things to different people in different contexts. What type of data are you creating?

Sciences

Observational data, like sensor readings and weather research

Experimental data, like gene sequences and clinical trials

Model or simulation data, like climate or economic models

Derived, compiled or computational data, like text or data mining

Social Sciences

Quantitative data, like statistical data collected from surveys, recorded in spreadsheets and manipulated with tools such as R, SPSS, SAS or Excel

Qualitative data, like audio or video interviews and coded transcripts

Humanities

If you are a humanist, anything could be data!

The digital humanities are an emerging field that combines methodologies from traditional humanities disciplines with tools provided by computing and digital publishing.

File Format Guidance

File formats

A file format is a standard way that information is encoded for storage in a computer file.

Experts have high confidence that relatively common, uncompressed (or at least losslessly compressed) and non-proprietary file formats will remain renderable in the future. File formats without these characteristics are more likely to become obsolete or corrupt. If you are trying out new software, file formats or compression techniques, you may want to keep two versions of your data: one experimental and one more stable.

At some point during your project, you may need to migrate your files to a new format. These types of transformations may affect the "look and feel" of a digital file, but should not affect its content. Extreme caution should be exercised when transforming "complex" digital objects like databases and websites. Embedded formulas in programs like Excel should be translated into values.

Data and database normalization are also important considerations, especially when conducting multi-institutional research. It is better to be explicit about how data will be collected at the outset of a new project than to try to clean things up after each institution has collected a large amount.

Text files made up of plain text are very basic, but if they work for you, they are probably your best bet for future-proofing your data.

Organizing Data

Research data files and folders should be named and structured in such a way that someone else could make sense of them without you around.

Unlike system-generated file names, which are not always intuitive (e.g., 000001.jpg, 000002.jpg, 000003.jpg, or Document1.docx), good file names are unique, descriptive and consistent, even for people outside of your project.

Good file names do not use symbols (e.g., ! ? @ # $ % ^ & ~ ` _ + - = . , ), brackets (e.g., ( ) { } [ ] ) or spaces, which can disrupt various systems and processes. Do not assume case sensitivity, and avoid excessively long file names, which may get truncated.

File versioning is an important consideration when creating a file naming convention, especially when research involves more than one person. You want to avoid accidental overwriting of files! Using Network Data Storage from GVSU Information Technology can help, but you can also keep track of versioning manually by adding a consistent suffix such as v01, v02 to your file names. You could also use version control software like Mercurial or TortoiseSVN.
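The naming and versioning advice above can be sketched in a few lines of Python. This is only an illustration, not a tool the library provides: the function names `check_file_name` and `add_version` and the 50-character limit are assumptions, and the dot before the file extension is treated as the one permitted symbol.

```python
import os

# Characters the guidance above asks you to avoid, plus the space.
DISCOURAGED = set("!?@#$%^&~`_+-=.,(){}[] ")
MAX_LENGTH = 50  # hypothetical limit; adjust to the systems you use


def check_file_name(name: str) -> list[str]:
    """Return a list of problems found in a file name (empty if none)."""
    # Assumption: the dot separating the extension is allowed, so only
    # the part before the extension is checked for discouraged characters.
    stem, _extension = os.path.splitext(name)
    problems = []
    bad = sorted({ch for ch in stem if ch in DISCOURAGED})
    if bad:
        problems.append("discouraged characters: " + " ".join(repr(ch) for ch in bad))
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    return problems


def add_version(name: str, version: int) -> str:
    """Insert a two-digit suffix such as v01 before the extension."""
    stem, extension = os.path.splitext(name)
    return f"{stem}v{version:02d}{extension}"


if __name__ == "__main__":
    print(check_file_name("survey results (final).csv"))
    print(add_version("SurveyResults2014.csv", 2))
```

Zero-padding the version number (v01, v02) keeps files sorted in order once a project passes nine versions.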

As long as the "raw" versions of files are kept and well-documented, intermediate working files can be deleted.

ReNamer is a very powerful and flexible file renaming tool we use at the library which offers all of the standard renaming procedures, including prefixes, suffixes, replacements, case changes, as well as removing contents of brackets, adding number sequences, changing file extensions, etc.

Storage & Security

Depending on your needs, Network Data Storage from GVSU Information Technology is likely the safest, most secure place for your research data during creation, processing and analysis.

Storage

Research data, like any other data, may be stored on networked drives, personal computers or laptops, external storage devices or the cloud. Wherever you store your data during data collection and analysis, remember to back up your files using at least two different types of storage media and to keep at least one copy of your data in a separate geographic location (i.e., original + external/local backup + external/remote backup).
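As a rough sketch of the "original + local backup + remote backup" pattern, the Python below copies a file into several backup folders and confirms that each copy matches the original. The checksum comparison is an addition of this example rather than part of the guidance above, and the function names and folder paths are hypothetical. Two folders on the same disk do not count as different storage media, so in practice each path should point to a different device or location.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy a file into each backup folder and verify every copy."""
    expected = sha256_of(original)
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        # copy2 also preserves timestamps where the platform allows it.
        copy = Path(shutil.copy2(original, directory))
        if sha256_of(copy) != expected:
            raise OSError(f"backup at {copy} does not match the original")
        copies.append(copy)
    return copies


if __name__ == "__main__":
    # Hypothetical paths: an external drive and a mounted remote share.
    back_up(Path("raw/readings.csv"),
            [Path("/mnt/external/backup"), Path("/mnt/remote/backup")])
```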

Personal computers or laptops are convenient, and removable media like external storage devices are convenient, cheap and portable. Neither of these types of devices, which are susceptible to physical damage, loss and theft, should be used for storing master versions of your research data. Never use CDs and DVDs; their media life is not reliable in the long-term.

Cloud Computing is an attractive option, but posting sensitive research data, particularly human subjects data, could violate FERPA, HIPAA and other privacy protection laws.

Security

Data Security means protecting data from unauthorized users and corruption. It can be divided into three categories: network security, physical security and computer systems & files.

Regarding network security, keep confidential information off of the Internet, and put sensitive materials on computers not connected to the Internet.

Regarding physical security, restrict access to buildings and rooms where computers or storage media are kept, and only let trusted individuals troubleshoot computer problems. Do not share your passwords, lock your office and store external hard drives with highly sensitive data in a locked safe.

Finally, regarding computer systems & files, keep anti-virus protection up to date, don't send confidential data via e-mail or FTP, and use passwords on files and computers. Data encryption should be considered when sensitive data needs to be stored on devices other than GVSU secure servers.

You may also need to consider a means for confidential disposal of research data. Simply deleting a file will not completely remove it from your computer! Use a tool like BCWipe to delete files forever.

Finally, this may not be all you ever need to know on storage & security. Your sponsor's data security requirements may be more stringent!

Have other data management needs, particularly those that arise during the data collection phase of a research project? See Information Technology & Security from University Compliance, or:

Contact the Director of Information Technology

Documentation & Metadata Guidance

Metadata

Metadata is a word that has only recently come into popular usage. It means "data about data."

Documentation

Documentation is a type of unstructured metadata, usually a supplementary file or document intended to accompany data. The most basic form of documentation is the README.txt file.

According to Data Dryad, a README.txt file for tabular data should include:

  • Definitions of column headings and row labels, data codes (including missing data), and measurement units
  • For each file name, a short description of what data it includes
  • Any processing steps, especially if not described in the publications, that may affect interpretation of results
  • A description of what associated datasets are stored elsewhere, if applicable
  • Whom to contact with questions
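One way to make sure none of the items listed above is forgotten is to generate the README.txt from a small script. The following Python is a minimal sketch under that assumption; the function name `build_readme`, the section labels and the example values are all hypothetical and are not a format prescribed by Dryad.

```python
def build_readme(title, files, columns, processing_steps, related_datasets, contact):
    """Assemble README.txt text covering the items listed above.

    files: dict mapping file name to a short description
    columns: list of (heading, definition, unit, missing-data code) tuples
    processing_steps, related_datasets: lists of short strings
    """
    lines = [title, ""]
    lines.append("FILES")
    lines += [f"  {name}: {description}" for name, description in files.items()]
    lines += ["", "COLUMNS (heading: definition; unit; missing-data code)"]
    lines += [f"  {h}: {d}; {u}; {m}" for h, d, u, m in columns]
    lines += ["", "PROCESSING STEPS"]
    lines += [f"  {i}. {step}" for i, step in enumerate(processing_steps, start=1)]
    lines += ["", "RELATED DATASETS STORED ELSEWHERE"]
    lines += [f"  {item}" for item in related_datasets] or ["  None"]
    lines += ["", "CONTACT", f"  {contact}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values are invented for illustration only.
    text = build_readme(
        title="Stream temperature readings",
        files={"readings.csv": "Hourly water temperatures"},
        columns=[("temp_c", "Water temperature", "degrees Celsius", "-999")],
        processing_steps=["Removed duplicate rows"],
        related_datasets=[],
        contact="data.owner@example.org",
    )
    with open("README.txt", "w", encoding="utf-8") as handle:
        handle.write(text)
```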

A codebook is another type of documentation commonly used to describe data. According to the ICPSR, a codebook should contain:

  • Column locations and widths for each variable
  • Definitions of different record types
  • Response codes for each variable
  • Codes used to indicate nonresponse and missing data
  • Other indications of the content and characteristics of each variable
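To show why column locations, widths and codes matter, here is a small Python sketch that uses codebook entries to read values out of a fixed-width record. The `Variable` class, the two example variables and their codes are invented for illustration and do not come from ICPSR.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """One codebook entry for a fixed-width data file."""
    name: str
    start: int   # 1-based column where the value begins
    width: int   # number of columns the value occupies
    label: str   # description of the variable's content
    codes: dict[str, str] = field(default_factory=dict)  # response codes
    missing: tuple[str, ...] = ()  # codes for nonresponse or missing data

    def read(self, record: str):
        """Extract and decode this variable from one line of data."""
        raw = record[self.start - 1 : self.start - 1 + self.width].strip()
        if raw in self.missing:
            return None
        # Fall back to the raw value when no response code is defined.
        return self.codes.get(raw, raw)


# Hypothetical codebook for a two-variable survey file.
AGE = Variable("AGE", start=1, width=3, label="Age in years", missing=("999",))
SEX = Variable("SEX", start=4, width=1, label="Sex of respondent",
               codes={"1": "Male", "2": "Female"}, missing=("9",))

if __name__ == "__main__":
    for line in ["0342", "9999"]:
        print(AGE.read(line), SEX.read(line))
```

Without the codebook, a record such as 0342 is just four digits; with it, the same record reads as an age of 034 and a response of "Female".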

Metadata

What is metadata? Simply put, metadata is "data about data."

In the digital world, it typically refers to structured information that exists alongside or is embedded in a digital file that describes its content, context and, as necessary, structure. Technical information about your data and descriptive information about your project should always be included in your metadata.

Metadata should make your data intelligible to someone outside of your project, and even outside of your discipline. Full-text search does not make sense for most data, so metadata is the primary means by which your data will be discovered over the web.

There are many metadata schemes for research data, ranging from the very simple (Dublin Core, DataCite), to the more complex (Content Standard for Digital Geospatial Metadata, Data Documentation Initiative).
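For a sense of what a very simple scheme looks like in practice, the Python sketch below writes a handful of Dublin Core elements as XML. The element names and namespace are those of the 15-element Dublin Core set; the `metadata` wrapper element, the function name and the example values are assumptions made for illustration, and a real repository may expect a different wrapper or profile.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize simple Dublin Core fields as an XML string."""
    root = ET.Element("metadata")  # assumed wrapper element
    for element, value in fields.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example values are invented for illustration only.
    print(dublin_core_record({
        "title": "Example survey data",
        "creator": "A. Researcher",
        "date": "2014",
        "format": "text/csv",
        "rights": "CC0",
    }))
```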

There are also accepted domain-local standards, ontologies and nomenclature.

See the Digital Curation Centre's Disciplinary Metadata for more information on these domain-local standards and to search by your discipline.

DataUp and Colectica are Excel plug-ins designed to help you clean up your data and create metadata for your tabular data.

Have questions about formal or informal metadata? Need to apply a particular metadata scheme to your dataset before depositing it in a data repository?

Contact the Metadata & Digital Curation Librarian

Copyright & Privacy/Confidentiality Guidance

Copyright

Copyright is the exclusive legal right, given to an originator or an assignee, to print, publish, perform, film, or record literary, artistic, or musical material, and to authorize others to do the same.

Do you have the right to make data available? Should the data be embargoed for a certain period of time? How will you protect privacy, security, confidentiality and intellectual property? Can you think of any privacy, ethical or confidentiality concerns, particularly for human subjects data? Do you need to anonymize or aggregate data before sharing it? Do any regulations, such as HIPAA, apply to your data? See MIT Libraries' Ethical and Legal Issues, or Information Privacy and Protection from University Compliance for more information.

Is your data covered by copyright? Copyright can be waived under a CC0 license.

Have other copyright or licensing questions for your data? Wondering how to apply a CC license to your data set? See the University Libraries' Copyright Basics site, or:

Contact the Scholarly Communications Outreach Coordinator

Archiving & Sharing Data Guidance

Data Sharing and Management Snafu in 3 Short Acts

A data management horror story by Karen Hanson, Alisa Surkis and Karen Yacobucci.

Archiving

Backups are great, but on their own are insufficient for long-term preservation. Digital preservation is much more comprehensive, and takes into account issues of metadata, security, documentation, auditing, weeding, sharing and discovery, format obsolescence, media corruption and failure, organizational risks, etc.

Many researchers have a lot of experience managing their data, but most are not in a position to manage their research data long-term. Similarly, most network data storage providers are not in the business of long-term preservation. Instead, they operate under the assumption that data loses value over time.

Fortunately, a good data repository or archive will take care of all of this for you. In fact, as recently as March 21, 2014, the Office of Science and Technology Policy (OSTP), an office in the Executive Office of the President (EOP), issued a letter encouraging granting agencies to include "a strategy for leveraging existing archives, where appropriate" in their plans.

Sharing Data

In general, sharing your data will allow others to do follow-up research, do new research and scrutinize your findings. It is also a provision of the NSF Award and Administration Guide, Section VI.D.4. Sharing your data using a disciplinary data repository will allow you to share your data more widely by making it findable through search engines, track data impact metrics and receive credit for re-use of published data via data citation. And it's more effective than e-mail.

Need a place to share your data? ScholarWorks@GVSU is the University Libraries' format-agnostic platform for dissemination of scholarly and creative content. While it is not meant for long-term curation of research data, ScholarWorks@GVSU is available to help meet basic data sharing requirements.

Don't know where to start? Your sponsor likely has terms and conditions for sharing, and may even suggest a particular data repository. If not, Databib or the Registry of Research Data Repositories are great resources that can help you find discipline-specific data repositories for archiving, sharing and/or re-use. figshare is a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner, and works well for individual researchers.

Have questions about which repositories to choose, which are "Trustworthy" or how to prepare your data for deposit?

Contact the Metadata & Digital Curation Librarian

Re-use, Re-distribution

Identify who will be allowed to use your data, how they will be allowed to use your data, and whether or not they will be allowed to disseminate your data. This may refer to local users on your team or in the larger research community. Ask yourself:

  • Will any permission restrictions need to be placed on the data?
  • Who is your audience? Which bodies/groups are likely to be interested in the data?
  • Who may be interested in your data in the future and what might it be used for?

The PI is responsible for sharing data without violating federal law or regulation (Export Controls) or compromising individual rights (FERPA, FOIA, HIPAA or Intellectual Property).

Contact the Office of Sponsored Programs

Contact the University Counsel Office

Budget

You are encouraged to put data sharing costs in your budget proposal.

NIH guidelines state that:

NIH recognizes that it takes time and money to prepare data for sharing. Thus, applicants can request funds for data sharing and archiving in their grant application. (See also the section on What to Include in an NIH Application.) Investigators who incorporate data sharing in the initial design of the study may more readily and economically establish adequate procedures for protecting the identities of participants and share a useful dataset with appropriate documentation.

And, according to the NSF Social Sciences Directorate:

Any costs should be explained in the Budget Justification pages.

Need More Data Management Help?

Best Practices for Research Data Management, Responsible Conduct of Research (RCR) Bootcamp (2014).

University Libraries

Contact the University Libraries for help sharing your data, writing a DMP, or for help with data management generally.

Max Eckard
Metadata & Digital Curation Librarian
Phone: (616) 331-5072

Information Technology

Contact Information Technology for IT related policies and guidelines, especially during data collection.

Sue Korzinek
Director of Information Technology
Phone: (616) 331-2035

Office of Sponsored Programs

Contact the OSP for help with externally sponsored agreements for scholarly research and creative activity.

Christine Chamberlain
Director of Office of Sponsored Programs
Phone: (616) 331-6868

Need a break? Take a look at "My Data Management Plan - a satire."

Page last modified August 27, 2014