Combining datasets and performing large aggregate analyses are powerful new ways to improve service across large populations. A critical step in this work is deduplicating identities across data sets that were rarely designed to work together. Inconsistent data entry, typographical errors, and real-world identity changes pose significant challenges to this process. To help, we have built a tool called pgdedupe.
With pgdedupe, you specify the database columns you’d like to use for matching. pgdedupe then displays those columns for pairs of records, and you say whether the two records describe the same entity (yes/no/unsure). pgdedupe keeps asking for your input until you tell it to stop, learning to match records better along the way.
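To make that label-and-learn loop concrete, here is a toy sketch of the idea in plain Python. This is not the pgdedupe or dedupe API; it is a hypothetical, self-contained illustration in which the "model" is just a string-similarity threshold fitted to a handful of hand-labeled pairs.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Average string similarity across the shared fields of two records.
    scores = [SequenceMatcher(None, a[k], b[k]).ratio() for k in a]
    return sum(scores) / len(scores)


def learn_threshold(labeled_pairs):
    # Pick the similarity cutoff that best separates the "yes" pairs
    # (same entity) from the "no" pairs on the labeled examples.
    scored = [(similarity(a, b), label) for a, b, label in labeled_pairs]
    best_cut, best_acc = 0.5, 0.0
    for cut in sorted(s for s, _ in scored):
        acc = sum((s >= cut) == label for s, label in scored) / len(scored)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut


# Hypothetical hand-labeled pairs: (record_a, record_b, same_entity?),
# standing in for the yes/no answers you give interactively.
labeled = [
    ({"name": "Jon Smith"}, {"name": "John Smith"}, True),
    ({"name": "Ana Gomez"}, {"name": "Anna Gomez"}, True),
    ({"name": "Jon Smith"}, {"name": "Mary Jones"}, False),
    ({"name": "Ana Gomez"}, {"name": "Bill Brown"}, False),
]

cut = learn_threshold(labeled)

# New candidate pairs are then scored against the learned cutoff.
is_match = similarity({"name": "Jon Smith"}, {"name": "Jon Smith"}) >= cut
```

The real dedupe library learns a far richer model over per-field comparators, and an "unsure" answer simply skips the pair, but the interaction is the same loop: label a pair, retrain, and let the tool choose the next most informative pair to show you.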
Learn more about pgdedupe here.
pgdedupe uses dedupe, a tool written by our friends at DataMade. We extend special thanks to Forrest Gregg for his help with this project.