Normalization: from relational database design, "the process of organizing the attributes and tables of a relational database to minimize data redundancy." More generally, the term is used to mean standardizing a dataset according to a controlled vocabulary or metadata schema.
Faceting: filtering your dataset based on a particular text feature link
Clustering: the process of using an algorithm to group similar values together so that they can be edited in bulk link
Reconciliation: "a semi-automated process of matching text names to database IDs (keys). This is semi-automated because in some cases, machine alone is not sufficient and human judgment is essential." link
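To make the faceting and clustering definitions above concrete, here is a minimal Python sketch. It is not OpenRefine's own code: the `fingerprint` function is a simplified approximation of a key-collision approach (lowercase, strip accents and punctuation, sort the unique words), and the `dish` column and sample values are made up for illustration.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value):
    """Simplified clustering key (an approximation, not OpenRefine's exact method)."""
    # Lowercase and decompose accented characters so the accents can be dropped.
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation, then sort the unique words.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    return Counter(row[column] for row in rows)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample rows; the real menus.csv has different columns and values.
rows = [
    {"dish": "Consommé"},
    {"dish": "consomme"},
    {"dish": "Coffee, black"},
    {"dish": "Black coffee"},
    {"dish": "Tea"},
]

facet = text_facet(rows, "dish")   # value -> number of rows
clusters = cluster(facet)          # groups of likely duplicates
print(facet)
print(clusters)
```

Each cluster is a candidate for a bulk edit: a person picks the preferred spelling and applies it to every variant in the group, which is the kind of review step the clustering feature is meant to support.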
General resources on OpenRefine: http://openrefine.org/
OpenRefine Wiki: https://github.com/OpenRefine/OpenRefine/wiki
Dataset for this section
Data from the New York Public Library "What's on the Menu?" Project: http://menus.nypl.org/ (Should be on your desktop as menus.csv)
Biodiversity Reconciliation Services (from Rod Page's iPhylo Blog)
Dataset for this section
Custom dataset from GBIF.org. Should be on your desktop as biodiversity.txt
Uses the NYPL dataset again; the template should be on your desktop as template.txt
A quick note on the Freebase Reconciliation API: many Refine services rely (or relied) on it; however, Google has been slowly shutting Freebase down over the last year. Developers are currently working on integration with Wikidata as a replacement, but there's no estimate of when it will be available. That said, many existing tutorials use Freebase as a resource and may still be worth exploring.
Rod Page's full tutorial on taxonomic name reconciliation: link
Cleaning Geo-Data in OpenRefine (AKA georeferencing): link
Converting Spreadsheets into MODS XML through OpenRefine: link
More tutorials: link
Google Refine "Cheat Sheet": link
Google Refine Expression Language (GREL): link
Clustering in depth: link
The NYPL portion of this exercise was created at the Maryland Institute for Technology in the Humanities and developed through the Digital Humanities Data Curation Institute series, generously funded by the National Endowment for the Humanities.
Special thanks to Trevor Muñoz, Lydia Zvyagintseva and Megan Senseney!