HKDCWS

Data Curation with Open Refine

Vocabulary

Normalization: from relational database design, "the process of organizing the attributes and tables of a relational database to minimize data redundancy." More generally used to mean standardizing a dataset according to a controlled vocabulary or metadata schema.
Faceting: filtering your dataset based on a particular feature of the text, such as the distinct values in a column (link)
Clustering: using an algorithm to group similar values together so that they can be edited in bulk (link)
Reconciliation: "a semi-automated process of matching text names to database IDs (keys). This is semi-automated because in some cases, machine alone is not sufficient and human judgment is essential." link
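
To make the faceting and clustering definitions above more concrete, here is a minimal Python sketch. It is not Open Refine's own code: the fingerprint function is only loosely modeled on the idea of key-collision clustering (normalize each value into a key, then group values that share a key), and it replaces punctuation with spaces as a simplifying assumption. The function names and the sample values are hypothetical.

```python
import re
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value):
    """Build a normalized key for a value (a simplified, assumed scheme)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Trim surrounding whitespace and lowercase.
    text = text.strip().lower()
    # Replace ASCII punctuation with spaces (an assumption of this sketch).
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Split into tokens, drop duplicates, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def facet(values):
    """Count how often each distinct value occurs, like a text facet."""
    return Counter(values)


def cluster(values):
    """Group distinct values that share a fingerprint key."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one spelling are candidates for bulk edits.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical values, not taken from menus.csv.
    names = ["Waldorf Astoria", "Astoria, Waldorf", "waldorf astoria ", "The Plaza"]
    print(facet(names))
    print(cluster(names))
```

In this sketch the three spellings of the first name fall into one cluster, while "The Plaza" stays on its own. A human still decides which spelling the cluster should be merged into.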

Basics of Open Refine

General resources on Open Refine: http://openrefine.org/

Open Refine Wiki: https://github.com/OpenRefine/OpenRefine/wiki

Dataset for this section

Data from the New York Public Library "What's on the Menu?" Project: http://menus.nypl.org/ (Should be on your desktop as menus.csv)

Reconciliation

Biodiversity Reconciliation Services (from Rod Page's iPhylo Blog)

Dataset for this section

Custom dataset from GBIF.org. Should be on your desktop as biodiversity.txt
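
The "semi-automated" nature of reconciliation can be illustrated with a small Python sketch. This is not how any particular reconciliation service works: it assumes a tiny in-memory authority list with made-up IDs, uses the standard library's difflib for string similarity, and picks two arbitrary thresholds. Confident matches are accepted automatically, borderline ones are flagged for human review, and everything else is left unmatched.

```python
from difflib import SequenceMatcher


def reconcile(name, authority, auto_threshold=0.95, review_threshold=0.75):
    """Match a text name against an authority dict of {id: label}.

    Returns (status, candidates), where each candidate is a
    (score, id, label) tuple. Thresholds are illustrative assumptions.
    """
    scored = []
    for key, label in authority.items():
        # Similarity between 0.0 and 1.0, ignoring case.
        score = SequenceMatcher(None, name.lower(), label.lower()).ratio()
        scored.append((score, key, label))
    scored.sort(reverse=True)

    # A very close best match is accepted without human input.
    if scored and scored[0][0] >= auto_threshold:
        return "matched", [scored[0]]

    # Plausible matches are handed to a person to judge.
    candidates = [c for c in scored if c[0] >= review_threshold]
    if candidates:
        return "needs_review", candidates
    return "unmatched", []


if __name__ == "__main__":
    # Hypothetical authority records; the IDs are invented for illustration.
    authority = {"id:1": "Puma concolor", "id:2": "Panthera leo"}
    for query in ["puma concolor", "Puma concolar", "Quercus alba"]:
        print(query, reconcile(query, authority))
```

The middle status is the point of the exercise: a misspelled name scores too low to accept blindly but high enough that a person should look at it.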

Publishing XML

Using NYPL dataset again; template should be on your desktop as template.txt
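
The general idea behind template-based export (a prefix, a template filled in once per row, a separator between rows, and a suffix) can be sketched in Python as follows. This is an illustration only, not the contents of template.txt and not Open Refine's template syntax. The column names id and sponsor and the element names are assumptions made for the example.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical template pieces; the element and column names are assumed.
PREFIX = "<menus>\n"
ROW_TEMPLATE = '  <menu id="{id}">\n    <sponsor>{sponsor}</sponsor>\n  </menu>'
SEPARATOR = "\n"
SUFFIX = "\n</menus>\n"


def load_rows(handle):
    """Read CSV rows from an open file-like object as dictionaries."""
    return list(csv.DictReader(handle))


def render(rows, prefix, row_template, separator, suffix):
    """Fill the row template once per row and wrap the result."""
    rendered = []
    for row in rows:
        # Escape &, <, > and double quotes so the output stays well-formed.
        safe = {
            key: escape(value or "", {'"': "&quot;"})
            for key, value in row.items()
            if key is not None
        }
        rendered.append(row_template.format(**safe))
    return prefix + separator.join(rendered) + suffix


if __name__ == "__main__":
    # Inline sample data standing in for a real CSV file.
    sample = io.StringIO("id,sponsor\n1,Smith & Sons\n2,Hotel Eastman\n")
    print(render(load_rows(sample), PREFIX, ROW_TEMPLATE, SEPARATOR, SUFFIX))
```

To try it on a real file, one could pass an open file handle (for example, open("menus.csv", newline="", encoding="utf-8")) to load_rows and adjust the template to the columns that file actually contains.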

Further Resources

A quick note on the Freebase Reconciliation API: Many Refine services rely (or relied) on this API; however, Google has been gradually shutting Freebase down over the last year. Developers are currently working on integration with Wikidata as a replacement, but there is no estimate of when it will be available. That said, many existing tutorials use Freebase as a resource and may still be worth exploring.
- Freebase API services are now available from: http://reconcile.freebaseapps.com/reconcile
- Further information about Freebase integration with Open Refine: https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation
- Google on the Freebase shutdown: link
Rod Page's full tutorial on taxonomic name reconciliation: link
Cleaning Geo-Data in Open Refine (AKA georeferencing): link
Converting Spreadsheets into MODS XML through Open Refine: link
More tutorials: link

Advanced resources

Google Refine "Cheat Sheet": link
Google Refine Expression Language (GREL): link
Clustering in depth: link

Acknowledgements

The NYPL portion of this exercise was created at the Maryland Institute for Technology in the Humanities and developed through the Digital Humanities Data Curation Institute series, generously funded by the National Endowment for the Humanities.

Special thanks to Trevor Muñoz, Lydia Zvyagintseva, and Megan Senseney!