System Architecture and Technical Summary

Pipeline, architecture, and software

Software stack

  • Django 4 (Python framework)
  • Python 3.10.7
  • PostgreSQL 15 (relational database)
  • Elasticsearch 8 (index)
  • Nginx, Gunicorn (web server)
  • Celery/Redis (task queueing)
  • Ubuntu 22.10 (operating system)
  • Front-end: JavaScript, MapLibre GL, Webpack, Leaflet, jQuery, Bootstrap 5, D3

Data workflow

  • WHG has two data stores: a relational database (db) and a high-speed index (idx).
  • Interfaces to this data include a graphical web application (GUI) and APIs.
  • Contributed data in Linked Places or LP-TSV format is uploaded by registered users to the database.
  • Once uploaded, datasets are managed in a private set of screens, where they can be browsed and reconciled against Wikidata.
  • Reconciliation entails initiating a task managed by Celery/Redis and reviewing the candidate matches returned.
  • Confirming matches to Wikidata records augments the contributed dataset by adding new place_link and, if desired, place_geom records. NOTE: The original contribution can always be retrieved exactly as it was uploaded.
  • Once an uploaded dataset has been reconciled and as many place_link records as possible have been generated for it, it can be flagged as "public"; at that point it becomes a browseable and searchable data publication.
  • As a further step, published datasets can be accessioned to the WHG index, a process that links individual records for the same (or "closely matched") places from multiple datasets.
  • The accessioning task will be initiated by WHG staff, but its results will be reviewed by the dataset owner and any designated collaborators.
  • Accessioning to the WHG index is another reconciliation process with two steps: initiating the task and reviewing results. Incoming records that share a link to an external gazetteer (e.g. Wikidata or GeoNames) with a record already in our index are queued separately and can be added automatically, each one being associated with that match and any other similarly linked "siblings."
  • Incoming records that share no links with existing index items become new "seed" records in the index, referred to internally as "parents."
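
The shared-link rule in the accessioning steps above can be illustrated with a minimal sketch. It is not the WHG implementation: the IndexEntry class, the accession function, the record ids, and the "wd:"/"gn:" identifier prefixes are all hypothetical, and the real process also involves a review step that is not modeled here. The sketch assumes that any single shared external identifier counts as a match, and that a matched record's links are merged into the entry so that later siblings can find it.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """Hypothetical index entry: one parent ("seed") plus linked children."""
    parent_id: str
    links: set[str]
    children: list[str] = field(default_factory=list)


def accession(index: list[IndexEntry], record_id: str, links: set[str]) -> str:
    """Add one incoming record to the index and report what happened."""
    for entry in index:
        # Assumption: one shared external identifier is enough to match.
        if entry.links & links:
            entry.children.append(record_id)
            # Merge links so similarly linked siblings match this entry too.
            entry.links |= links
            return f"child of {entry.parent_id}"
    # No shared link with any existing entry: the record seeds a new parent.
    index.append(IndexEntry(parent_id=record_id, links=set(links)))
    return "new parent"


# Illustrative records only; ids and link values are made up for the example.
index: list[IndexEntry] = []
results = [
    accession(index, "a1", {"wd:Q84"}),
    accession(index, "b7", {"wd:Q84", "gn:2643743"}),
    accession(index, "c3", {"gn:2643743"}),
    accession(index, "d9", {"wd:Q90"}),
]
print(results)
```

In this run, "b7" joins "a1" through the shared Wikidata link, "c3" then joins the same entry through the GeoNames link contributed by "b7", and "d9" shares nothing and becomes a second parent.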