§ 4 — Modules

Three research modules. One typed pipeline.

The DOCBOT Framework is decomposed into three cooperating research modules, each governed by an explicit type contract and each independently evaluable through the unit-evaluation harness described in §6. The decomposition is intentionally symmetric: every module exposes an interface boundary, a computational kernel, a provenance fabric and an emission boundary. This symmetry permits the same orchestration, observability and ablation infrastructure to be reused without modification across the three modules. Together they form the canonical configuration of the framework; in isolation each constitutes a falsifiable scientific contribution in its own right.
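The four-part shape shared by the modules could be captured as a structural interface, so that one driver serves all of them. The following is a minimal sketch only: the names `ResearchModule`, `accept`, `compute`, `trace`, `emit`, `run` and `UppercaseModule` are hypothetical and are not taken from the framework.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ResearchModule(Protocol):
    """Hypothetical four-part contract shared by all modules."""

    def accept(self, payload: Any) -> Any: ...           # interface boundary
    def compute(self, typed_input: Any) -> Any: ...      # computational kernel
    def trace(self) -> tuple[str, ...]: ...              # provenance fabric
    def emit(self, result: Any) -> dict[str, Any]: ...   # emission boundary


def run(module: ResearchModule, payload: Any) -> dict[str, Any]:
    """Drive any conforming module through the same four steps."""
    typed_input = module.accept(payload)
    result = module.compute(typed_input)
    emission = module.emit(result)
    emission["provenance"] = module.trace()
    return emission


@dataclass
class UppercaseModule:
    """Toy module, present only to exercise the shared driver."""

    log: list[str] = field(default_factory=list)

    def accept(self, payload: Any) -> str:
        self.log.append("accept")
        return str(payload)

    def compute(self, typed_input: str) -> str:
        self.log.append("compute")
        return typed_input.upper()

    def trace(self) -> tuple[str, ...]:
        return tuple(self.log)

    def emit(self, result: str) -> dict[str, Any]:
        self.log.append("emit")
        return {"value": result}


emission = run(UppercaseModule(), "abc")
```

Because the driver depends only on the protocol, the same `run` function would apply unchanged to any module that implements the four methods, which is the reuse argument made above.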

Module · M1

DOCBOT — Document Intelligence

DOCBOT is the document intelligence layer of the framework. It admits heterogeneous source documents — scanned PDFs, semi-structured forms, machine-readable XML and structurally noisy email bodies — through a single typed acquisition interface and emits schema-bound structured records. Extraction is decomposed into three composable subsystems: a layout analyser that recovers the physical structure of the page, a typed-entity extractor that projects the layout onto a domain-specific schema, and a confidence calibrator that attaches a per-field reliability score derived from agreement across redundant decoders. The module is deliberately agnostic to downstream interpretation: its sole responsibility is to convert unstructured evidence into a typed envelope amenable to formal reasoning further down the pipeline.

Typed acquisition across PDF, image, XML and email payloads

Layout analysis with topological reconstruction of reading order

Schema-bound entity and field extraction under explicit type contracts

Per-field confidence calibration through redundant-decoder agreement

Composable processing pipeline with deterministic replay against snapshots
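As an illustration of per-field confidence from redundant-decoder agreement, the sketch below scores each field by the share of decoders that produced the majority value. This is an assumption about what "agreement" could mean; the framework's actual calibrator is not specified here, and `agreement_confidence` is a hypothetical name.

```python
from collections import Counter
from typing import Hashable, Mapping, Sequence


def agreement_confidence(
    decodings: Sequence[Mapping[str, Hashable]],
) -> dict[str, tuple[Hashable, float]]:
    """For each field, return (majority value, share of decoders that produced it)."""
    if not decodings:
        raise ValueError("at least one decoder output is required")
    field_names = set().union(*(d.keys() for d in decodings))
    result: dict[str, tuple[Hashable, float]] = {}
    for name in sorted(field_names):
        votes = Counter(d[name] for d in decodings if name in d)
        value, count = votes.most_common(1)[0]
        # Dividing by the number of decoders (not the number of votes) means a
        # decoder that omitted the field counts as disagreement.
        result[name] = (value, count / len(decodings))
    return result


# Four hypothetical decoder outputs for the same document.
decoder_outputs = [
    {"invoice_no": "A-17", "total": "120.00"},
    {"invoice_no": "A-17", "total": "128.00"},
    {"invoice_no": "A-17", "total": "120.00"},
    {"invoice_no": "A-17"},
]
scores = agreement_confidence(decoder_outputs)
# scores["invoice_no"] is ("A-17", 1.0); scores["total"] is ("120.00", 0.5)
```

A raw agreement share is not yet a calibrated probability; a further step that maps these shares to observed accuracy on held-out data would be needed for that.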

Scientific Contribution

Formalises document intelligence as a typed, side-effect-free transformation from unstructured corpora to schema-bound representations admitting compositional reasoning, ablation and audit-grade replay.

python

# DOCBOT — Document Intelligence Kernel
from typing import Mapping
from docbot.types import (
    SourceRef, DocumentEnvelope, ExtractedRecord,
    LayoutTree, Typed, ProvenanceTree,
)

def acquire(source: SourceRef) -> DocumentEnvelope:
    """Typed, source-agnostic acquisition boundary."""
    raw  = source.fetch()
    meta = extract_metadata(raw)
    return DocumentEnvelope(
        payload     = raw,
        mime        = meta.mime,
        provenance  = meta.provenance,
    )

# Composable, pure transformations, applied in the order listed
pipeline = compose(
    parse_pdf,
    segment_layout,        # → LayoutTree
    project_to_schema,     # → Mapping[str, Typed]
    calibrate_confidence,  # redundant-decoder agreement
)

envelope: DocumentEnvelope = acquire(source)  # `source` is a caller-supplied SourceRef
record: ExtractedRecord = pipeline(envelope)
# record.fields      :: Mapping[str, Typed]
# record.confidence  :: Mapping[str, float ∈ [0,1]]
# record.layout      :: LayoutTree
# record.provenance  :: ProvenanceTree

Interface signature — DOCBOT (illustrative).
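The listing relies on a `compose` helper that is not shown. Since the stages are listed in execution order, one plausible reading is left-to-right function composition. The sketch below is written under that assumption; the three stand-in stages are hypothetical and only exist to make the example runnable.

```python
from functools import reduce
from typing import Any, Callable


def compose(*stages: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain stages left to right: compose(f, g)(x) == g(f(x))."""

    def run(value: Any) -> Any:
        # Feed the output of each stage into the next one.
        return reduce(lambda acc, stage: stage(acc), stages, value)

    return run


# Stand-in stages (hypothetical), each a pure function of its input.
def strip_whitespace(text: str) -> str:
    return text.strip()


def split_fields(text: str) -> list[str]:
    return text.split(";")


def to_record(parts: list[str]) -> dict[str, str]:
    return dict(part.split("=", 1) for part in parts)


toy_pipeline = compose(strip_whitespace, split_fields, to_record)
toy_record = toy_pipeline("  name=ACME;vat=123  ")
# toy_record is {"name": "ACME", "vat": "123"}
```

Keeping every stage pure is what makes deterministic replay possible: re-running the composed function on a stored snapshot yields the same record.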

Module · M2

SYSTEMBOT — Cross-Source Validation

SYSTEMBOT introduces validation as a first-class pipeline stage rather than a post-hoc quality check. Given an extracted record R and an indexed family of independent evidence sources E = {S₁, …, Sₙ}, it computes an agreement functional a(R, E) weighted by an empirically calibrated reliability prior over the sources and returns a confidence-weighted verdict together with a closed provenance subgraph. The module is designed for fan-in/fan-out topologies in which redundant sources attenuate the variance of any individual extractor; disagreement is resolved through a principled arbitration procedure rather than through ad-hoc heuristics, eliminating the silent failure modes characteristic of first-source-wins strategies.

Cross-source validation under heterogeneous reliability profiles

Government and registry data verification through typed adapters

Consistency analysis with closed-form replay against evidence snapshots

Principled disagreement resolution and arbitration auditing

Full provenance tracking materialised as an immutable DAG
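One way to obtain an immutable provenance DAG is to build it from frozen nodes that can only reference nodes that already exist, which rules out cycles by construction. The sketch below assumes this design; `ProvNode` and the node labels are hypothetical, and labels are assumed to be unique.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ProvNode:
    """Immutable provenance node; parents must exist before the child is built."""

    label: str
    parents: tuple["ProvNode", ...] = ()

    def ancestors(self) -> frozenset[str]:
        """Labels of every node this node was derived from."""
        seen: set[str] = set()
        stack = list(self.parents)
        while stack:
            node = stack.pop()
            if node.label not in seen:
                seen.add(node.label)
                stack.extend(node.parents)
        return frozenset(seen)


registry = ProvNode("source:registry")
tax_office = ProvNode("source:tax-office")
extracted = ProvNode("record")
verdict_node = ProvNode("verdict", (extracted, registry, tax_office))

# Any attempt to rewrite history after the fact is rejected.
try:
    verdict_node.label = "tampered"
except FrozenInstanceError:
    tamper_blocked = True
else:
    tamper_blocked = False
```

Here `verdict_node.ancestors()` returns the labels of the record and both sources, which is the closed subgraph an auditor would inspect.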

Scientific Contribution

Provides formal consistency guarantees across heterogeneous evidence sources and converts validation from an implicit assumption of the pipeline into an explicit, evaluable kernel.

python

# SYSTEMBOT — Cross-Source Validation Kernel
from systembot.types import (
    ExtractedRecord, Source, Verdict, Evidence,
)
from systembot.priors import RELIABILITY_PRIOR

THRESHOLD: float = 0.78  # empirically calibrated

def validate(
    record: ExtractedRecord,
    sources: list[Source],
) -> Verdict:
    """Agreement functional a(R, E) under heterogeneous reliability."""
    evidence: list[Evidence] = [
        s.lookup(record.key) for s in sources
    ]

    # Weighted agreement under reliability prior
    weights = [RELIABILITY_PRIOR[s.id] for s in sources]
    score   = weighted_agreement(record, evidence, weights)

    return Verdict(
        consistent  = score >= THRESHOLD,
        confidence  = score,
        evidence    = evidence,
        arbitration = arbitrate_disagreement(record, evidence),
        provenance  = build_dag(record, evidence, weights),
    )

verdict = validate(record, sources=[s1, s2, s3])
# verdict.consistent  :: bool
# verdict.confidence  :: float ∈ [0,1]
# verdict.provenance  :: DAG[Source, Transform]

Interface signature — SYSTEMBOT (illustrative).
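The helper `weighted_agreement` is not defined in the listing. One plausible form of the agreement functional a(R, E) is the reliability-weighted mean of per-source field agreement, sketched below. This is an assumption, not the framework's definition; records and evidence are modelled as plain mappings, and the weights in the example are invented.

```python
from typing import Mapping, Optional, Sequence


def weighted_agreement(
    record: Mapping[str, str],
    evidence: Sequence[Optional[Mapping[str, str]]],
    weights: Sequence[float],
) -> float:
    """Reliability-weighted mean of per-source field agreement, in [0, 1]."""
    if len(evidence) != len(weights):
        raise ValueError("one weight per evidence item is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for item, weight in zip(evidence, weights):
        if item is None:
            continue  # a silent source lends no support
        shared = [key for key in record if key in item]
        if not shared:
            continue  # nothing comparable, so no support either
        matches = sum(record[key] == item[key] for key in shared)
        score += weight * matches / len(shared)
    return score / total


toy_record = {"name": "ACME", "vat": "123"}
toy_evidence = [
    {"name": "ACME", "vat": "123"},   # full agreement
    {"name": "ACME", "vat": "999"},   # agrees on one of two fields
    None,                             # source returned nothing
]
toy_score = weighted_agreement(toy_record, toy_evidence, [0.5, 0.3, 0.2])
# toy_score is about 0.65, below the 0.78 threshold used in the listing
```

Under this form a silent or unreliable source pulls the score down in proportion to its weight, so no single source can decide the verdict on its own.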

Module · M3

RESTRICTIONBOT — Restriction Analysis

RESTRICTIONBOT encodes operational restrictions as a declarative constraint set C and evaluates it symbolically against the validated record R. The output is an auditable decision-support emission carrying the decision status, the set of violated constraints, and a machine-readable rationale paired with a natural-language justification suitable for downstream human review. The constraint system is constructively monotone in C: adding a restriction never converts a reject into an approve. Extending a constraint catalogue can therefore only make decisions stricter, so earlier rejections remain valid under the extended catalogue while earlier approvals may need re-evaluation against the added restrictions, a property required for stable longitudinal audit and regulatory compliance.

Declarative business rules expressed in a constrained logical fragment

Symbolic restriction evaluation with monotonicity guarantees

Operational validation against compliance and regulatory catalogues

Audit-ready justifications in machine- and human-readable form

Decision-support emission with explicit rationale and violation set

Scientific Contribution

Enables transparent, restriction-aware decision support with full traceability and longitudinal stability under constraint catalogue evolution.

python

# RESTRICTIONBOT — Symbolic Restriction Evaluator
from restrictionbot.types import (
    ExtractedRecord, Verdict, Constraint,
    Decision, Justification, Status,
)
from restrictionbot.catalogue import RESTRICTION_SET

def evaluate(
    record:   ExtractedRecord,
    verdict:  Verdict,
    rules:    frozenset[Constraint] = RESTRICTION_SET,
) -> Decision:
    """Monotone symbolic evaluation of C against (R, verdict)."""
    violations: set[Constraint] = {
        c for c in rules if not c.holds(record, verdict)
    }

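    # Approve only a consistent verdict with no violated constraint; otherwise
    # send to review when confidence is at least 0.6, and reject below that.
    status: Status = (
        Status.APPROVE if not violations and verdict.consistent
        else Status.REVIEW if verdict.confidence >= 0.6
        else Status.REJECT
    )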

    return Decision(
        status      = status,
        violations  = frozenset(violations),
        rationale   = Justification.from_violations(violations),
        provenance  = verdict.provenance.extend(rules),
    )

decision = evaluate(record, verdict, RESTRICTION_SET)
# decision.status     :: {approve, review, reject}
# decision.violations :: frozenset[Constraint]
# decision.rationale  :: Justification

Interface signature — RESTRICTIONBOT (illustrative).
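The monotonicity property can be checked mechanically by evaluating the same inputs under a catalogue and under a superset of it. The sketch below re-implements the status logic with plain Python stand-ins; `decide`, `max_amount`, `min_age` and the test cases are hypothetical, and the cut-offs 0.78 and 0.6 are copied from the listings above.

```python
from enum import IntEnum
from typing import Callable, Iterable, Mapping

Record = Mapping[str, float]
Rule = Callable[[Record], bool]  # returns True when the constraint holds


class Status(IntEnum):
    # Ordered from strictest to most permissive so decisions can be compared.
    REJECT = 0
    REVIEW = 1
    APPROVE = 2


def decide(
    record: Record,
    confidence: float,
    rules: Iterable[Rule],
    threshold: float = 0.78,
    review_cutoff: float = 0.6,
) -> Status:
    """Same branching as the listing, with `consistent` as confidence >= threshold."""
    violated = [rule for rule in rules if not rule(record)]
    if not violated and confidence >= threshold:
        return Status.APPROVE
    return Status.REVIEW if confidence >= review_cutoff else Status.REJECT


def max_amount(record: Record) -> bool:
    return record["amount"] <= 10_000


def min_age(record: Record) -> bool:
    return record["age"] >= 18


base = frozenset({max_amount})
extended = base | {min_age}  # catalogue grown by one restriction

cases = [
    ({"amount": 5_000, "age": 30}, 0.9),
    ({"amount": 5_000, "age": 16}, 0.9),
    ({"amount": 50_000, "age": 30}, 0.7),
    ({"amount": 5_000, "age": 30}, 0.4),
]
for case_record, case_confidence in cases:
    before = decide(case_record, case_confidence, base)
    after = decide(case_record, case_confidence, extended)
    # Growing the catalogue may tighten a decision but never loosens it.
    assert after <= before
```

In the second case the added restriction moves an approval to review, which illustrates the direction of the guarantee: rejections are stable under catalogue growth, approvals are not.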