← ~/logs LOG-014

>Digitising Historical Data with Django

A five-stage OCR + LLM pipeline for digitising 60 years of caving incident reports, with custom model fields and pluggable processing steps.

Andrew Northall’s talk at DjangoCon Europe 2026 on turning 60 years of scanned PDF caving-incident reports into a structured Django database. Of the 2,700+ incidents extracted, ~12 hours of compute replaced five years of volunteer effort.

The talk focuses on three Django-shaped patterns that made the pipeline tractable.

Pluggable processing steps

Every step is an Operation subclass with a tiny contract: should_run(incident) + run(incident). A class-level requires declares prerequisites. A @register decorator adds steps to the pipeline. New step = new file. No central list, no if-ladder.

_REGISTRY: list[type["Operation"]] = []

def register(op_cls):
    _REGISTRY.append(op_cls)
    return op_cls

class Operation:
    requires: list[type["Operation"]] = []
    def should_run(self, incident) -> bool: return True
    def run(self, incident) -> None: raise NotImplementedError

@register
class FakeLLMCritic(Operation):
    requires = [FakeLLMRewrite]
    def run(self, incident):
        if re.search(r"\?{2,}", incident.cleaned_text):
            raise ValueError("unresolved OCR glyph")  # recorded as FAILED

The runner iterates operations x incidents, records an OperationRun row per attempt, and respects dependencies. A second LLM pass grades the first — if output looks bad, record FAILED and move on.

Custom FuzzyDate field

Dates in the source: “1971”, “August 1985”, “Autumn 1996”, “15 March 2024”. No DateField captures all four. FuzzyDate serializes to a lexicographically-sortable string like "1996-09-01:season:autumn":

@dataclass(frozen=True)
class FuzzyDate:
    year: int
    precision: Precision  # year / season / month / day
    month: int | None = None
    day: int | None = None
    season: str | None = None

class FuzzyDateField(models.CharField):
    def from_db_value(self, value, expression, connection):
        return None if value is None else FuzzyDate.from_storage(value)
    def to_python(self, value):
        if value is None or isinstance(value, FuzzyDate): return value
        return FuzzyDate.from_storage(value)
    def get_prep_value(self, value):
        return None if value is None else value.to_storage()

Three lifecycle hooks — that’s the whole custom-field pattern in ~30 lines.

Tree-structured locations

Using django-treebeard’s MP_Node. Incidents attach at whatever depth the source provides — “Arizona” is a state-level node, “Carlsbad Caverns” is a region-level node under New Mexico. Rolling up “how many incidents in the USA?” is a single get_descendants() query:

class Location(MP_Node):
    name = models.CharField(max_length=100)
    level = models.CharField(max_length=20, choices=LocationLevel.choices)
    node_order_by = ["name"]

# All incidents anywhere in the USA:
Incident.objects.filter(location__in=usa.get_descendants())

Key takeaways

  • Operations as classes + a registry means new steps are drop-in with rerun/status tracking for free
  • Have the LLM grade its own output — a single pass always over-commits
  • Custom model fields are cheap: to_python / from_db_value / get_prep_value is the whole contract
  • If your depth varies, use a tree instead of nullable FK columns for country/state/region

Slides | Experiment code