MVP · RAG · Evals · Compliance

DOXA

The RAG engine that tunes itself

A smart search engine for your company documents that tests and improves its own answers. Built for industries where a wrong answer is expensive: healthcare, finance, and government.


18 architecture decision records

10 pipeline stages instrumented

85% test coverage gate (CI)

49% retrieval failure reduction (contextual chunking)

See the real product

Recorded auto-tune run — diagnose → Bayesian optimization → canary → promote

The problem

Eval vendors tell you your RAG pipeline is broken; none of them fix it. Regulated organizations (HIPAA, FedRAMP, FINRA) can't ship RAG that hallucinates — and they can't afford a research team to hand-tune chunking strategies, retrievers, and prompts forever.

What we built

Designed a ten-stage pipeline — ingest, parse, chunk, embed, index, retrieve, generate, verify, evaluate, auto-tune — where every stage is a swappable, benchmarked provider behind an abstraction layer.
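One way to picture "a swappable provider behind an abstraction layer" is a small structural interface per stage plus a registry keyed by a config string. The sketch below covers only the chunk stage and uses the standard library alone; `Chunker`, `FixedSizeChunker`, `ParagraphChunker`, `CHUNKERS`, and `run_chunk_stage` are hypothetical names chosen for illustration, not the project's actual classes.

```python
from dataclasses import dataclass
from typing import Protocol


class Chunker(Protocol):
    """Hypothetical stage interface: any object with this method qualifies."""

    def chunk(self, text: str) -> list[str]: ...


@dataclass
class FixedSizeChunker:
    size: int = 200

    def chunk(self, text: str) -> list[str]:
        # Split into consecutive windows of at most `size` characters.
        return [text[i:i + self.size] for i in range(0, len(text), self.size)]


class ParagraphChunker:
    def chunk(self, text: str) -> list[str]:
        # Split on blank lines and drop empty fragments.
        return [p.strip() for p in text.split("\n\n") if p.strip()]


# Registry keyed by a config string, so a tuning run can swap providers by name.
CHUNKERS: dict[str, Chunker] = {
    "fixed": FixedSizeChunker(size=200),
    "paragraph": ParagraphChunker(),
}


def run_chunk_stage(provider: str, text: str) -> list[str]:
    """Dispatch to whichever provider the current config names."""
    return CHUNKERS[provider].chunk(text)
```

Because the stage is selected by name, a benchmark harness or optimizer can treat the provider as just another categorical parameter.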

Built the auto-tune loop: SMAC3 Bayesian optimization over retrieval hyperparameters, bandit search over chunking strategies, and DSPy/MIPROv2 prompt optimization, all scored against golden sets with RAGAS-style metrics.
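The bandit part of that loop can be illustrated with a minimal UCB1 sketch over chunking strategies. UCB1 is an assumption here, since the text does not say which bandit algorithm is used, and the `evaluate` callback stands in for a golden-set scoring run that returns a reward in [0, 1]. The strategy names and reward values in the usage lines are made up for the example.

```python
import math
from collections.abc import Callable


def ucb1_select(counts: dict[str, int], totals: dict[str, float]) -> str:
    """Pick the next arm: untried arms first, then the highest UCB1 score."""
    for arm, pulls in counts.items():
        if pulls == 0:
            return arm
    total_pulls = sum(counts.values())

    def score(arm: str) -> float:
        mean = totals[arm] / counts[arm]
        # Exploration bonus shrinks as an arm accumulates evaluations.
        bonus = math.sqrt(2 * math.log(total_pulls) / counts[arm])
        return mean + bonus

    return max(counts, key=score)


def run_bandit(
    arms: list[str], evaluate: Callable[[str], float], rounds: int
) -> dict[str, int]:
    """Spend `rounds` evaluations across arms and return pulls per arm."""
    counts = {arm: 0 for arm in arms}
    totals = {arm: 0.0 for arm in arms}
    for _ in range(rounds):
        arm = ucb1_select(counts, totals)
        totals[arm] += evaluate(arm)  # hypothetical golden-set score in [0, 1]
        counts[arm] += 1
    return counts


# Usage with stubbed, deterministic rewards (illustrative values only).
stub_scores = {"fixed": 0.2, "paragraph": 0.3, "contextual": 0.9}
pulls = run_bandit(list(stub_scores), lambda arm: stub_scores[arm], rounds=300)
```

The appeal of a bandit for this job is budget: weak strategies stop consuming evaluation runs once their upper confidence bound falls below the leader's.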

Made promotion safe for regulated environments: candidates run in shadow traffic, advance through canary stages (5% → 25% → 100%) with SLO gates, and auto-roll back if recall drops more than 2% or faithfulness falls below 0.75.
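The promotion rule reads naturally as a small state machine. The sketch below uses the stage percentages and thresholds quoted above. It assumes the 2% recall limit is a relative drop against the baseline (the text does not say whether it is relative or absolute) and that baseline recall is non-zero. `Metrics`, `breaches_slo`, and `next_action` are hypothetical names.

```python
from dataclasses import dataclass

CANARY_STAGES = (5, 25, 100)  # percent of traffic per stage
MAX_RECALL_DROP = 0.02        # assumption: relative drop versus baseline
MIN_FAITHFULNESS = 0.75


@dataclass(frozen=True)
class Metrics:
    recall: float
    faithfulness: float


def breaches_slo(baseline: Metrics, candidate: Metrics) -> bool:
    """True if the candidate violates either rollback condition."""
    recall_drop = (baseline.recall - candidate.recall) / baseline.recall
    return recall_drop > MAX_RECALL_DROP or candidate.faithfulness < MIN_FAITHFULNESS


def next_action(stage_pct: int, baseline: Metrics, candidate: Metrics) -> str:
    """Decide what happens after observing metrics at the current stage."""
    if breaches_slo(baseline, candidate):
        return "rollback"
    idx = CANARY_STAGES.index(stage_pct)
    if idx == len(CANARY_STAGES) - 1:
        return "promoted"
    return f"advance to {CANARY_STAGES[idx + 1]}%"
```

A real gate would also need minimum sample sizes per stage before trusting the metrics; that is omitted here.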

Enforced trust at the output boundary: two-stage faithfulness verification (NLI entailment model + LLM-as-judge), structured per-claim citations, PII redaction at ingest, and a hash-chained audit log.
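A hash-chained audit log makes tampering detectable because each entry's hash covers the previous entry's hash. The following is a minimal sketch of that idea using SHA-256 over canonical JSON. The entry layout, the all-zero genesis value, and the function names are assumptions for illustration, not the project's storage format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Verification only proves internal consistency. Detecting truncation or a wholesale rewrite additionally requires anchoring the latest hash somewhere the writer cannot modify.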

Architecture

Retrieval: Hybrid dense + BM25 with Reciprocal Rank Fusion, BGE cross-encoder reranking, optional graph hops

Chunking: Anthropic-style contextual retrieval by default, with LLM-generated context prefixes per chunk

Auto-tune: SMAC3 Bayesian optimization + DSPy prompt tuning, Pareto selection on quality/latency/cost

Safety: Shadow eval → staged canary → automatic rollback on SLO breach; suggest-mode for regulated tenants

Governance: NLI + judge faithfulness ensemble, citation enforcement, Presidio PII redaction, audit hash-chain
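Reciprocal Rank Fusion, named in the retrieval row, merges ranked lists without needing comparable scores: each document earns 1 / (k + rank) from every list it appears in. The sketch below is a generic version of the formula. The constant k = 60 is the commonly used default rather than a value stated in this case study, and the document ids are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Usage: fuse a dense ranking with a BM25 ranking (placeholder ids).
dense_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Documents that rank well in both lists rise to the top, which is why the fused list is a reasonable input for a cross-encoder reranker.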

Outcomes

▸ 9,700+ lines of typed Python across 88 modules, 85% coverage gate, six GitHub Actions workflows including nightly security audits and a RAG regression gate
▸ 18 architecture decision records covering every load-bearing choice: vector store, multi-tenancy, secrets, residency
▸ Eval-driven by construction: no config change ships without beating the golden-set baseline
▸ Commissioned an independent 5-agent external architecture audit and published the findings internally, including the failures
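The "beats the golden-set baseline" rule from the outcomes above could be expressed as a small CI check. The sketch assumes one particular reading of "beating": no metric may regress beyond a tolerance and at least one must strictly improve. The metric names and the `beats_baseline` helper are hypothetical.

```python
def beats_baseline(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Return True only if the candidate config may ship under the assumed rule."""
    # Every baseline metric must hold up within the allowed tolerance.
    no_regression = all(
        candidate[name] >= value - tolerance for name, value in baseline.items()
    )
    # At least one metric must be strictly better than the baseline.
    some_gain = any(candidate[name] > value for name, value in baseline.items())
    return no_regression and some_gain


# Usage with illustrative metric names and values.
baseline_scores = {"faithfulness": 0.80, "context_recall": 0.70}
candidate_scores = {"faithfulness": 0.82, "context_recall": 0.70}
ship = beats_baseline(baseline_scores, candidate_scores)
```

A stricter gate might require a statistically significant gain on a held-out slice; the comparison here is deliberately the simplest form.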

Stack

Python, FastAPI, PostgreSQL, pgvector, Hatchet, SMAC3, DSPy, RAGAS, Claude API, Azure OpenAI, Voyage AI, Docker, Terraform
