HKDCWS

Data Curation with Open Refine

Vocabulary

  • Normalization: from relational database design, "the process of organizing the attributes and tables of a relational database to minimize data redundancy." More generally used to mean standardizing a dataset according to a controlled vocabulary or metadata schema.

  • Faceting: grouping the rows of your dataset by a shared feature, such as the text value in a column, so that you can filter on it (link)

  • Clustering: using an algorithm to group similar values together so that they can be edited in bulk (link)

  • Reconciliation: "a semi-automated process of matching text names to database IDs (keys). This is semi-automated because in some cases, machine alone is not sufficient and human judgment is essential." (link)
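
To make the clustering definition concrete, here is a minimal Python sketch of a key-collision "fingerprint" approach, the general idea behind one of the clustering methods Open Refine offers. It only approximates that method, and the sample values are made up for illustration rather than taken from menus.csv.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a key-collision 'fingerprint' for a text value."""
    # Decompose accented characters and drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Trim, lowercase, and remove ASCII punctuation.
    cleaned = ascii_text.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, drop duplicate tokens, sort, and rejoin.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values whose fingerprints collide; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(members)
        for key, members in groups.items()
        if len(members) > 1
    }


# Hypothetical cell values, invented for this example.
sample = [
    "Hotel Astor",
    "hotel astor.",
    "Astor, Hotel",
    "Café Royal",
    "Cafe Royal",
    "Waldorf",
]
clusters = cluster(sample)

for key, members in clusters.items():
    print(key, "<-", members)
```

Each cluster is a set of spellings that a person would then review and merge into one preferred form, which is the "edit in bulk" step.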

Basics of Open Refine

General resources on Open Refine: http://openrefine.org/

Open Refine Wiki: https://github.com/OpenRefine/OpenRefine/wiki

Dataset for this section

Data from the New York Public Library "What's on the Menu?" Project: http://menus.nypl.org/ (Should be on your desktop as menus.csv)

Reconciliation

Biodiversity Reconciliation Services (from Rod Page's iPhylo Blog)

Dataset for this section

Custom dataset from GBIF.org. Should be on your desktop as biodiversity.txt
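
As a rough illustration of why reconciliation is described as semi-automated, the Python sketch below matches names against a small lookup table and sets uncertain matches aside for a person to review. It is not how any particular reconciliation service works: the reference table, its IDs, and both thresholds are hypothetical, and the string comparison is simply Python's standard-library difflib.

```python
from difflib import SequenceMatcher

# Hypothetical reference table. The IDs are placeholders, not real database keys.
REFERENCE = {
    "Puma concolor": "ID-001",
    "Panthera leo": "ID-002",
    "Ursus arctos": "ID-003",
}


def reconcile(name, reference=REFERENCE, auto_threshold=0.98, review_threshold=0.7):
    """Return (status, matched_name, matched_id, score) for one input name."""
    best_name, best_score = None, 0.0
    for candidate in reference:
        # Similarity ratio between 0.0 and 1.0, ignoring case.
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score

    if best_name is None or best_score < review_threshold:
        # Nothing close enough: leave the cell unmatched.
        return "no match", None, None, best_score
    if best_score >= auto_threshold:
        # Close enough to accept without human input.
        return "matched", best_name, reference[best_name], best_score
    # Plausible but uncertain: a person should confirm or reject it.
    return "needs review", best_name, reference[best_name], best_score


for raw in ["puma concolor", "Pantera leo", "xyz"]:
    print(raw, "->", reconcile(raw)[:3])
```

The two thresholds are the design choice to notice: raising `auto_threshold` sends more names to a human, while lowering it trades review effort for a higher risk of wrong matches.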

Publishing XML

Using NYPL dataset again; template should be on your desktop as template.txt
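
For readers who want to see the idea outside of Open Refine, the Python sketch below mimics a template-based export: a prefix, a row template filled in once per row, a row separator, and a suffix. It is only an analogy to what template.txt does. The column names and sample rows are assumptions made for illustration and may not match menus.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical stand-in for a few rows of menus.csv; column names are assumed.
SAMPLE_CSV = """id,sponsor,location
1,Hotel Astor,New York
2,Smith & Sons,Boston
"""

PREFIX = "<menus>\n"
ROW_TEMPLATE = (
    '  <menu id="{id}">\n'
    "    <sponsor>{sponsor}</sponsor>\n"
    "    <location>{location}</location>\n"
    "  </menu>"
)
ROW_SEPARATOR = "\n"
SUFFIX = "\n</menus>\n"


def rows_to_xml(csv_text: str) -> str:
    """Fill the row template once per CSV row and wrap the result."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rendered = []
    for row in reader:
        # Escape &, <, > and double quotes so the values are safe in XML.
        safe = {key: escape(value, {'"': "&quot;"}) for key, value in row.items()}
        rendered.append(ROW_TEMPLATE.format(**safe))
    return PREFIX + ROW_SEPARATOR.join(rendered) + SUFFIX


xml_text = rows_to_xml(SAMPLE_CSV)

# Parsing the output is a quick check that it is well-formed XML.
root = ET.fromstring(xml_text)
print(xml_text)
```

Escaping matters here because cell values such as "Smith & Sons" would otherwise produce XML that fails to parse.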

Further Resources

  • A quick note on the Freebase Reconciliation API: Many Refine services rely/relied on this; however, Google has been slowly shutting Freebase down over the last year. Developers are currently working on integration with Wikidata as a replacement, but there's no estimate on its availability. That said, there are many tutorials available that use Freebase as a resource which may be worth exploring.

  • Rod Page's full tutorial on taxonomic name reconciliation: link

  • Cleaning Geo-Data in Open Refine (AKA georeferencing): link

  • Converting Spreadsheets into MODS XML through Open Refine: link

  • More tutorials: link

Advanced resources

  • Google Refine "Cheat Sheet": link

  • Google Refine Expression Language (GREL): link

  • Clustering in depth: link

Acknowledgements

The NYPL portion of this exercise was created at the Maryland Institute for Technology in the Humanities and developed through the Digital Humanities Data Curation Institute series, generously funded by the National Endowment for the Humanities.

Special thanks to Trevor Muñoz, Lydia Zvyagintseva and Megan Senseney!