Experimental evaluation of the DOCBOT Framework.
We evaluate the framework along nine orthogonal dimensions on a synthetic but representative enterprise corpus, instrumented to reproduce the modality heterogeneity, source-disagreement spectrum, and longitudinal drift observed in production deployments. Reported figures separate extraction error from validation error, expose tail-latency contributors at stage granularity, and quantify the marginal contribution of each module through a pre-registered ablation protocol with bootstrap-derived 99% confidence intervals. The quantitative figures shown below are illustrative of the evaluation harness; significance tests against the monolithic baseline are deferred to the upcoming technical report (TR-2026-01).
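To make the ablation statistic concrete, the following is a minimal sketch of a percentile bootstrap for the marginal contribution of one module, assuming paired per-document scores with the module enabled and disabled. The function name `bootstrap_ci`, its parameters, and the choice of the percentile method are illustrative assumptions; the pre-registered protocol itself is not specified in this section.

```python
import random
import statistics


def bootstrap_ci(full, ablated, n_resamples=2000, level=0.99, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (full - ablated).

    `full` and `ablated` are hypothetical per-document scores obtained with
    the module enabled and disabled on the same documents.
    """
    if len(full) != len(ablated) or not full:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [f - a for f, a in zip(full, ablated)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample documents with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    alpha = 1.0 - level
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return statistics.fmean(diffs), lo, hi


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the harness.
    with_module = [0.91, 0.88, 0.93, 0.90, 0.89]
    without_module = [0.85, 0.86, 0.88, 0.87, 0.84]
    point, lo, hi = bootstrap_ci(with_module, without_module)
    print(f"marginal contribution: {point:.3f} (99% CI {lo:.3f} to {hi:.3f})")
```

Under this sketch, a module would be judged to contribute when the interval excludes zero. Resampling at document level keeps the pairing between the full and ablated runs intact.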
Headline indicators
A compact summary of the four primary indicators tracked across the evaluation harness. Trend lines depict the most recent ten evaluation windows under the canonical configuration (DOCBOT · SYSTEMBOT · RESTRICTIONBOT enabled).
Evaluation Dimensions
The nine evaluation dimensions listed below define a multi-axis quality envelope. Each dimension is reported independently and aggregated into the quality profile in Fig. R3.
Token-, entity- and record-level F1 measured at every module interface, decomposing extraction error from validation error.
Compute footprint per pipeline stage, normalised by document complexity and reported as work-units per validated decision.
End-to-end p50 / p95 / p99 wall-clock, partitioned by stage to expose tail-latency contributors and queueing effects.
Sustained throughput as a function of corpus size and concurrency, fit against an Amdahl-bounded reference curve.
Performance degradation under synthetic noise injection, adversarial perturbation, and source-disagreement scenarios.
Provenance completeness and justification coverage measured as the fraction of decisions admitting full causal replay.
Human-hours saved per audit cycle and reduction in escalation rate relative to a monolithic baseline.
Bit-identical re-execution rate under fixed (seed, corpus, manifest) triples; any deviation is treated as a candidate determinism fault.
Aggregate compute, storage and human-review cost per validated decision, partitioned by stage and reported against a monolithic reference baseline.
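Several of the dimensions above reduce to short, standard formulas. The sketch below shows one plausible reading of four of them: F1 from match counts, nearest-rank latency percentiles, the Amdahl speedup bound used as a scalability reference, and the bit-identical re-execution rate. All function names are hypothetical, and the harness may define these quantities differently (for example, with an interpolated rather than nearest-rank percentile).

```python
import hashlib
import math


def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def percentile(samples, q):
    """Nearest-rank percentile of `samples` for q in (0, 100]."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def bit_identical_rate(reference, reruns):
    """Fraction of re-executions whose output digest matches the reference."""
    if not reruns:
        raise ValueError("reruns must be non-empty")
    ref = hashlib.sha256(reference).digest()
    matches = sum(hashlib.sha256(r).digest() == ref for r in reruns)
    return matches / len(reruns)


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not harness measurements.
    latencies_ms = [120, 135, 150, 180, 240, 310, 95, 101, 99, 400]
    print("F1:", f1_score(tp=8, fp=2, fn=2))
    print("p50/p95/p99:", [percentile(latencies_ms, q) for q in (50, 95, 99)])
    print("Amdahl bound, 90% parallel, 8 workers:", amdahl_speedup(0.9, 8))
    print("bit-identical rate:", bit_identical_rate(b"out", [b"out", b"out", b"0ut"]))
```

Comparing digests rather than raw outputs keeps the reproducibility check cheap when re-execution outputs are large, at the cost of not showing where two runs diverge.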
Preliminary figures
Figures R1–R5 illustrate the evaluation surface. Distributions are computed from a synthetic harness sized to match the operating envelope of the target deployment; statistical tests against the monolithic baseline are reported in the technical appendix.
