§ 7 — Results

Experimental evaluation of the DOCBOT Framework.

We evaluate the framework along nine dimensions (Section 7.2) on a synthetic but representative enterprise corpus, instrumented to reproduce the modality heterogeneity, source-disagreement spectrum, and longitudinal drift observed in production deployments. Reported figures separate extraction error from validation error, expose tail-latency contributors at stage granularity, and quantify the marginal contribution of each module through a pre-registered ablation protocol with bootstrap-derived 99% confidence intervals. The quantitative figures shown below are illustrative of the evaluation harness; significance tests against the monolithic baseline are deferred to the upcoming technical report (TR-2026-01).
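The bootstrap-derived intervals mentioned above can be illustrated with a minimal percentile-bootstrap sketch. It is not the harness's actual implementation: the function name `bootstrap_ci`, the resample count, and the sample values are assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_ci(samples, stat=statistics.mean, level=0.99, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for stat(samples)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(samples)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(samples, k=n)) for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical per-document error deltas (ablated minus full configuration).
deltas = [0.04, 0.07, 0.01, 0.09, 0.05, 0.03, 0.08, 0.06, 0.02, 0.05]
low, high = bootstrap_ci(deltas)
print(f"mean delta = {statistics.mean(deltas):.3f}, 99% CI = [{low:.3f}, {high:.3f}]")
```

Under this reading, a module's marginal contribution would be treated as distinguishable from zero when the interval excludes zero. The percentile method is only one of several bootstrap variants the protocol might use.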

7.1 Headline indicators

The four primary indicators tracked across the evaluation harness are summarised below. Trend lines depict the most recent ten evaluation windows under the canonical configuration (DOCBOT · SYSTEMBOT · RESTRICTIONBOT enabled).

Extraction F1: 0.912 (+0.18)
Latency p95: 184 ms (−42%)
Throughput: 2.4k/h (+3.1×)
Escalations: 4.8% (−61%)

7.2 Evaluation Dimensions

The nine evaluation dimensions listed below define a multi-axis quality envelope. Each dimension is reported independently and aggregated into the quality profile in Fig. R3.

Accuracy

Token-, entity-, and record-level F1 measured at every module interface, so that extraction error can be separated from validation error.
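As a rough illustration of measuring F1 at two module interfaces, the sketch below scores a set of (field, value) entities once after extraction and once after validation. The field names, values, and the `f1_score` helper are hypothetical and are not taken from the framework.

```python
def f1_score(predicted, gold):
    """Set-based F1: harmonic mean of precision and recall over exact matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations for one document.
gold = {("invoice_no", "A-17"), ("total", "120.00"), ("date", "2024-03-01")}

# Output at the extraction interface: one wrong date and one spurious field.
extracted = {("invoice_no", "A-17"), ("total", "120.00"),
             ("date", "2024-01-03"), ("vendor", "ACME")}

# Output at the validation interface: the two bad entities were rejected.
validated = {("invoice_no", "A-17"), ("total", "120.00")}

f1_extract = f1_score(extracted, gold)
f1_validate = f1_score(validated, gold)
print(f"extraction F1 = {f1_extract:.3f}, post-validation F1 = {f1_validate:.3f}")
```

Comparing the two scores on the same gold set gives one simple way to attribute residual error to a stage. Token- and record-level variants would differ only in what counts as an item.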

Performance

Compute footprint per pipeline stage, normalised by document complexity and reported as work-units per validated decision.

Latency

End-to-end p50 / p95 / p99 wall-clock, partitioned by stage to expose tail-latency contributors and queueing effects.
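A minimal sketch of per-stage percentile reporting follows, assuming a nearest-rank percentile definition (the harness may use an interpolating one). The stage names match Fig. R4; the timing samples and function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def stage_tail_report(timings_ms):
    """Map each stage to its p50 / p95 / p99 wall-clock time."""
    return {
        stage: {f"p{q}": percentile(samples, q) for q in (50, 95, 99)}
        for stage, samples in timings_ms.items()
    }


# Hypothetical timings: 'extract' has a heavy tail, 'emit' does not.
timings_ms = {
    "extract": [60] * 90 + [200] * 8 + [900] * 2,
    "emit": [9] * 100,
}
report = stage_tail_report(timings_ms)
for stage, stats in report.items():
    print(stage, stats)
```

A stage whose p99 is far above its p50, like the hypothetical extract stage here, is a candidate tail-latency contributor even when its median looks unremarkable.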

Scalability

Sustained throughput as a function of corpus size and concurrency, fit against an Amdahl-bounded reference curve.
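For reference, Amdahl's law bounds the speedup on n workers at S(n) = 1 / ((1 − p) + p/n), where p is the parallelisable fraction of the work. The sketch below evaluates that curve and recovers p from throughput measurements by inverting the formula and averaging. This averaging estimator is an assumption chosen for brevity; the harness may fit the curve differently, and the measurements shown are synthetic.

```python
def amdahl_speedup(workers, parallel_fraction):
    """Speedup predicted by Amdahl's law for a given worker count."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def estimate_parallel_fraction(throughput_by_workers):
    """Invert Amdahl's law per measurement and average the estimates.

    From 1/S = 1 - p * (1 - 1/n) it follows that p = (1 - 1/S) / (1 - 1/n).
    Requires a single-worker baseline under key 1.
    """
    baseline = throughput_by_workers[1]
    estimates = []
    for workers, throughput in throughput_by_workers.items():
        if workers == 1:
            continue
        speedup = throughput / baseline
        estimates.append((1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers))
    if not estimates:
        raise ValueError("need at least one measurement with workers > 1")
    return sum(estimates) / len(estimates)


# Synthetic measurements generated from p = 0.9 (docs/min at 1, 2, 4, 8 workers).
measured = {n: 100.0 * amdahl_speedup(n, 0.9) for n in (1, 2, 4, 8)}
p_hat = estimate_parallel_fraction(measured)
print(f"estimated parallel fraction = {p_hat:.3f}")
```

Measured throughput that falls well below the curve for the estimated p would point to overheads, such as queueing or contention, that Amdahl's law does not model.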

Robustness

Performance degradation under synthetic noise injection, adversarial perturbation, and source-disagreement scenarios.

Auditability

Provenance completeness and justification coverage measured as the fraction of decisions admitting full causal replay.

Operational Efficiency

Human-hours saved per audit cycle and reduction in escalation rate relative to a monolithic baseline.

Reproducibility

Bit-identical re-execution rate under fixed (seed, corpus, manifest) triples; deviations are themselves treated as candidate determinism faults.
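One way to measure a bit-identical re-execution rate is to hash the output bytes of repeated runs for each (seed, corpus, manifest) triple and count the triples whose digests all agree. The sketch below does this with toy pipelines; `toy_pipeline`, `flaky_pipeline`, and the triple contents are hypothetical stand-ins, not the framework's interfaces.

```python
import hashlib
import os
import random


def run_digest(run, seed, corpus, manifest):
    """SHA-256 digest of the bytes produced by one execution."""
    return hashlib.sha256(run(seed, corpus, manifest)).hexdigest()


def reexecution_rate(run, triples, repeats=3):
    """Return (fraction of triples with identical digests, list of faulting triples)."""
    if not triples:
        raise ValueError("triples must be non-empty")
    faults = []
    for seed, corpus, manifest in triples:
        digests = {run_digest(run, seed, corpus, manifest) for _ in range(repeats)}
        if len(digests) > 1:  # any divergence is a candidate determinism fault
            faults.append((seed, corpus, manifest))
    return 1.0 - len(faults) / len(triples), faults


def toy_pipeline(seed, corpus, manifest):
    """Deterministic stand-in: all randomness flows from the seed."""
    rng = random.Random(seed)
    order = rng.sample(corpus, k=len(corpus))
    return "|".join(order + [manifest]).encode("utf-8")


def flaky_pipeline(seed, corpus, manifest):
    """Non-deterministic stand-in: leaks unseeded entropy into the output."""
    return toy_pipeline(seed, corpus, manifest) + os.urandom(8)


triples = [
    (1, ["a.pdf", "b.eml", "c.csv"], "manifest-v1"),
    (2, ["a.pdf", "b.eml", "c.csv"], "manifest-v1"),
]
print(reexecution_rate(toy_pipeline, triples))
print(reexecution_rate(flaky_pipeline, triples)[0])
```

Returning the faulting triples alongside the rate keeps each deviation available for follow-up, in line with treating deviations as candidate determinism faults.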

Cost Profile

Aggregate compute, storage and human-review cost per validated decision, partitioned by stage and reported against a monolithic reference baseline.

7.3 Preliminary figures

Figures R1–R5 illustrate the evaluation surface. Distributions are computed from a synthetic harness sized to match the operating envelope of the target deployment; statistical tests against the monolithic baseline are reported in the technical appendix.

Fig. R1 — End-to-end throughput (sliding window)
Status: preliminary. Axis: docs/min over a 60 s rolling window (ticks 0, 25, 50, 75, 100).
Fig. R2 — Latency distribution
Status: preliminary. Axis: end-to-end latency (ms, log-binned); markers at p50 and p95.
Fig. R3 — Quality profile vs. baseline
Status: preliminary. Axes: Accuracy, Latency, Throughput, Robustness, Auditability, Cost.
Fig. R4 — Median stage cost
Status: preliminary. Median cost per stage (ms): acquire 22, extract 64, validate 41, restrict 18, emit 9.
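Reading the Fig. R4 values as median per-stage costs in milliseconds (acquire 22, extract 64, validate 41, restrict 18, emit 9), the relative weight of each stage can be worked out as below. This is a sketch only: the sum of per-stage medians is used as a convenient normaliser and is not, in general, equal to the median end-to-end latency.

```python
# Median per-stage cost in ms, as read from Fig. R4 (preliminary values).
MEDIAN_STAGE_COST_MS = {
    "acquire": 22,
    "extract": 64,
    "validate": 41,
    "restrict": 18,
    "emit": 9,
}


def stage_shares(costs):
    """Fraction of the summed per-stage cost attributable to each stage."""
    total = sum(costs.values())
    if total <= 0:
        raise ValueError("total cost must be positive")
    return {stage: cost / total for stage, cost in costs.items()}


shares = stage_shares(MEDIAN_STAGE_COST_MS)
dominant = max(shares, key=shares.get)
for stage, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{stage:<9}{share:6.1%}")
print("dominant stage:", dominant)
```

On these preliminary values the extract stage accounts for the largest share (64 of 154 ms), followed by validate.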
Fig. R5 — Error reduction by module ablation
Status: preliminary. Values (%, axis 0 to 110): Baseline (monolith) 100; − RESTRICTIONBOT 78; − SYSTEMBOT 56; − DOCBOT 28; Full framework 18.