HKDCWS

Introduction to Data Curation

Digital data have become a critical information object in our everyday lives: our phones, our social networks, and our cars all transmit and consume data, making it possible for services like GPS to help us find destinations and even tell us the most effective way to get there.

Digital data produced by research in the humanities, science, and engineering have, similarly, become an important aspect of scholarship. Just as our phones and cars depend on data to navigate effectively, scholars depend on access to and preservation of data in order to carry out their research effectively.

Archives, museums, and libraries have responded to the data needs of scholars by developing research data curation services.

The goal of this institute is to provide you with a working knowledge of the concepts, tools, and approaches used in data curation. By understanding the foundations of data curation - as it cuts across different disciplines and domains - you will be better prepared to make decisions about the costs and benefits of offering curation services in your own institutional settings.

An important first step in this process is to broadly define what it is we mean when we talk about data curation, and what it is that we do NOT mean.

To begin, we find it helpful to offer a broad view of both "data" and "curation" - before defining "data curation."

What do we mean by data?

In a scholarly context, data can mean different things to different disciplines. A humanist may treat a collection of digitized maps as a source of data, while an ecologist might be concerned with geo-location data in the form of latitude and longitude.

In this institute, rather than assuming that data is a distinct type of artifact (e.g., a map, a database, a digital document), we consider data to be a role that many different artifacts are capable of playing.

In the library, archives and museums setting, research data are information artifacts playing an evidential role - they are the resources that scholars draw upon to support their claims, and verify observations or experimental findings.

In short, we will rely upon the following definition:

Research data are information artifacts that scholars draw upon for evidence in supporting research claims, and producing new knowledge.

Note that this definition does not require data to be digital; research data could include any number of physical artifacts, such as a biological specimen, an architectural blueprint, or a paper map.

Although the focus of this institute will be on the curation of digital data, there are times when it is necessary to consider how digital data are, and should be, linked to physical data. (We will discuss these issues in sessions related to provenance and workflow technologies.)
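The idea of data as a role, rather than a type of artifact, can be illustrated with a small sketch. This is only one possible way to model it, not a schema used by the institute: the class names, fields, and sample artifacts below are all hypothetical. An artifact counts as research data here only because some research claim draws on it as evidence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Artifact:
    """Any information artifact, digital or physical (illustrative fields)."""
    identifier: str
    kind: str          # e.g. "digitized map", "database", "specimen"
    is_digital: bool


@dataclass
class ResearchClaim:
    """A scholarly claim and the artifacts cited as evidence for it."""
    statement: str
    evidence: List[Artifact] = field(default_factory=list)

    def cite(self, artifact: Artifact) -> None:
        self.evidence.append(artifact)


def is_research_data(artifact: Artifact, claims: List[ResearchClaim]) -> bool:
    """An artifact plays the data role if at least one claim uses it as evidence."""
    return any(artifact in claim.evidence for claim in claims)


# Hypothetical artifacts: one digital, one physical, one not used as evidence.
paper_map = Artifact("map-001", "paper map", is_digital=False)
coordinates = Artifact("geo-042", "latitude/longitude table", is_digital=True)
brochure = Artifact("doc-007", "digital document", is_digital=True)

claim = ResearchClaim("The settlement boundary shifted over time.")
claim.cite(paper_map)
claim.cite(coordinates)

for item in (paper_map, coordinates, brochure):
    print(item.identifier, is_research_data(item, [claim]))
```

Note that nothing about the artifact itself changes when it becomes research data; the same paper map or document moves in and out of the role depending on how scholars use it.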

What kinds of evidential roles do data play?

In research data curation it is also important to distinguish between the different evidential roles that data can play, namely scientific and humanistic.

Humanistic Data

It may initially seem awkward to assume that humanists have data. In the humanities, and the rapidly developing field of Digital Humanities, data support operations and interpretations that are ultimately assigned meaning by specific communities (Muñoz and Renear, 2011). Humanities data may take digital forms that range greatly in size and complexity - from single-authored TEI files to multi-institutional collections of digitized manuscripts, images, models, and music, like the HathiTrust.

Scientific Data

Scientific data is encoded and described with the aim of supporting retrieval, meaningful interpretation, use, and reuse (Wickett et al. 2012). This includes digitized physical materials (e.g. specimens, core samples, chemical analyses) as well as born-digital data (e.g. databases, sequenced DNA, streamed sensor data). In this domain the verifiability and validity of a research finding is often dependent upon access to underlying data. The curation and preservation of this research data is therefore a vital part of contemporary scientific practice.

While these are not the only two evidential roles that research data can take, they are the dominant ones and the most important for a foundation in data curation.

What do we mean by curation?

Curation has traditionally meant the stewardship, preservation, and meaningful arrangement of material collections. Over time the concept has been transferred from the context of physical specimens found in natural history museums to a digital environment where the curatorial roles of stewardship, preservation and meaningful arrangement have taken on new meanings (Weber et al., 2013).

In a scholarly environment curatorial roles include developing indexing systems, metadata standards, ontologies, and retrieval systems that will make it possible for research data to work in concert with existing digital libraries, archives, and repositories (Palmer, Renear and Cragin, 2007). Increasingly, this work is aimed at improving data quality, guaranteeing authentication of digital resources throughout a lifecycle of use and reuse, and capturing the provenance of scholarly activities using workflow technologies.

Data Curation

Bringing the different notions of data and curation together, we have developed a general definition for data curation at the University of Illinois' iSchool:

Data curation is the active and ongoing management of data throughout its entire lifecycle of interest and usefulness to scholarship, including its reuse in unanticipated contexts (Cragin et al., 2007).

Building on this definition, Muñoz and Flanders offer a careful explanation of some of these concepts:

Active and ongoing management:

Data curators intervene in the research process in order to translate or migrate data into new formats, to enhance it through additional layers of context or markup, to create connections between data sets, and to otherwise ensure that data is maintained in as highly-functional a form as possible.
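As a rough illustration of one such intervention, the sketch below migrates a small CSV file to JSON and records checksums of both versions so the derived file can later be traced back to its source. It is a minimal example under stated assumptions, not a recommended workflow: the function names, the provenance fields, the file name `sites.csv`, and the sample coordinates are all hypothetical.

```python
import csv
import hashlib
import io
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def migrate_csv_to_json(csv_bytes: bytes, source_name: str) -> dict:
    """Convert CSV content to JSON and describe the migration.

    The returned dict holds the new JSON bytes plus a small provenance
    record (illustrative fields only) linking the derived file to its source.
    """
    text = csv_bytes.decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(text)))
    json_bytes = json.dumps(rows, indent=2, sort_keys=True).encode("utf-8")
    return {
        "content": json_bytes,
        "provenance": {
            "source_name": source_name,
            "source_sha256": sha256_hex(csv_bytes),
            "derived_sha256": sha256_hex(json_bytes),
            "action": "csv-to-json migration",
            "record_count": len(rows),
        },
    }


# Hypothetical sample data: site names with latitude and longitude.
sample = b"site,latitude,longitude\nA,22.28,114.16\nB,22.30,114.17\n"
result = migrate_csv_to_json(sample, "sites.csv")
print(result["provenance"]["record_count"])
print(result["content"].decode("utf-8"))
```

Keeping the checksum of the original alongside the derived file is one simple way to support later questions about authenticity; a production repository would typically rely on established preservation tooling and metadata standards instead.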

Entire lifecycle:

As we enter the era of thoroughly digital research, the full lifecycle of digital research data is still not yet known to us. However, we can anticipate that some data (particularly data collected through destructive means, such as archaeological data) will have a very long horizon of usefulness (in addition to increased evidentiary value for historical analysis and stewardship of our cultural heritage). The uses of data will likely change over time and with different stages of research.

Interest and usefulness to scholarship:

The term "scholarship" should be construed broadly, especially since data creation, use, and curation are not limited to the academy. Data curation seeks to retain the interest and usefulness of any data that has a serious purpose to fulfill. Furthermore, even for data created and curated within the academy, it is worth remarking that there may be potential user communities outside the academy that could provide valuable motivation and even resources for the curation process. Especially given the recent and increasing emphasis on crowd-sourcing, community-driven data, and the permeability of the boundary between the academy and the public sphere, it is reasonable to anticipate that definitions of "scholarship" may continue to broaden rather than narrow over time.

Frequently Asked Questions

Differences between digital and data curation

The two terms are often used interchangeably, but there are important differences to keep in mind when writing and speaking about curation. Digital curation is, or has become, a term that accommodates a broader range of digital material than research data alone. It does not indicate what is being curated, nor does it necessarily imply which communities can be purposefully served by curatorial activities. Data curation, on the other hand, relates directly to data that is produced and used by scholarly communities, and it facilitates the reuse and repurposing of data to meet new research needs (Weber et al., 2013). As Muñoz and Renear aptly put it, “Data curation addresses the challenge of maintaining digital information that is produced in the course of research in a manner that preserves its meaning and usefulness as a potential input for further research” (2011).

Big Data

Big data has received an intense amount of public attention over the last five years. Like data curation, big data can mean many different things to many different communities. A helpful definition offered by the National Science Foundation (NSF) and the National Institutes of Health (NIH) in the USA is as follows:

‘Big data ... refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available.’

An equally popular and oft-cited explanation of big data is:

‘Big data’ is high volume, high velocity and high variety information assets that demand cost-effective, innovative forms of information processing[.]

As Dr. Hanna Wallach explains, "...volume refers to the amount of data in question, velocity refers to the speed with which that data can be obtained and/or processed, while variety refers to the range of different data types and sources." (2014)

For the most part, this institute will provide a foundation for data curation that is helpful for, but by no means unique to, big data curation.

Works Cited

Wickett, K. M., Thomer, A., Sacchi, S., Baker, K. S., & Dubin, D. (2012). What dataset descriptions actually describe: Using the systematic assertion model to connect theory and practice.
Muñoz, T., & Renear, A. H. (2011). Issues in humanities data curation.
Weber, N. M., Palmer, C. L., Muñoz, T., & Renear, A. H. (2013). Foundations of data curation: the pedagogy and practice of “purposeful work” with research data. Archives Journal, 3.
Cragin, M. H., Heidorn, P. B., Palmer, C. L., & Smith, L. C. (2007). An educational program on data curation. ASIS&T annual conference.
Dubin, D., Wickett, K., & Sacchi, S. (2011). Content, format, and interpretation. Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, 7. doi:10.4242/BalisageVol7.Dubin01.
Wallach, H. (2014). Big data, machine learning, and the social sciences.

Bibliography

The following articles both introduce and provide important background for the field of research data curation in libraries.

Gold, A. K. (2007). Cyberinfrastructure, data, and libraries, part 1: A cyberinfrastructure primer for librarians. Office of the Dean (Library), 16. http://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=1015&context=lib_dean
Gold, A. K. (2007). Cyberinfrastructure, data, and libraries, part 2: Libraries and the data challenge: Roles and actions for libraries. Office of the Dean (Library), 17.
Gray, J., Szalay, A. S., Thakar, A. R., & Stoughton, C. (2002, December). Online scientific data curation, publication, and archiving. In Astronomical Telescopes and Instrumentation (pp. 103-107). International Society for Optics and Photonics.
Weber, N. M., Palmer, C. L., Muñoz, T., & Renear, A. H. (2013). Foundations of data curation: the pedagogy and practice of “purposeful work” with research data. Archives Journal, 3.
Lord, P., Macdonald, A., Lyon, L., & Giaretta, D. (2004). From data deluge to data curation. In Proceedings of the UK e-science All Hands meeting (pp. 371-375).
Donnelly, M. (2014). Review: Research Data Management: Practical Strategies for Information Professionals. International Journal of Digital Curation, 9(2), 1-5.
Choudhury, S. (2010). Data curation: An ecological perspective. College & Research Libraries News, 71(4), 194-196.