Multi-Agent Verification
The core problem with LLM outputs is trust. A model can produce a confident, well-formatted response that is completely wrong. It can generate code that looks correct but has a subtle bug. It can produce a security analysis that misses a critical vulnerability. There is no way to know from the output alone whether it should be trusted.
The Nomos Verifier addresses this by using a different model to evaluate the output. The generator and the verifier are adversarial by design — the verifier’s job is to find problems, not confirm correctness.
Why Separate Models
Using the same model to verify its own output is circular. A model that hallucinates a fact will also hallucinate that the fact is correct when asked to verify it. The errors are correlated because the knowledge gaps are the same.
Using a different model breaks this correlation. If Claude generates a response and GPT-4o verifies it, they have different training data, different biases, and different failure modes. Agreement between independent models is stronger evidence of correctness than self-consistency.
This is the same principle behind multi-witness testimony, double-entry bookkeeping, and independent code review. The value comes from independence, not from the number of checks.
The Four Axes
The Verifier evaluates every output on four axes. Each axis tests a different dimension of quality, and each can independently pass or fail.
Faithfulness
Does the response actually answer the question that was asked?
This axis catches:
- Factual errors (wrong capital, wrong date, wrong formula)
- Hallucinated information presented as fact
- Responses that drift from the topic
- Incomplete answers that miss key parts of the request
For code tasks, faithfulness means the code does what the specification requires. For analysis tasks, it means the conclusions follow from the evidence presented.
Well-Formedness
Is the output structurally correct for its type?
This axis catches:
- Syntax errors in generated code
- Malformed JSON, XML, or other structured formats
- Logical inconsistencies within the response
- Missing sections or broken structure
A response can be faithful (it addresses the right question) but poorly formed (the code has a syntax error). These are orthogonal concerns.
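The structural part of well-formedness can be illustrated with plain parsers. This is a minimal sketch, not the Verifier's actual implementation (which is model-based); the function name and type labels are illustrative:

```python
import ast
import json

def check_well_formedness(output: str, output_type: str) -> bool:
    """Does the output parse as its declared type?

    A purely structural check: it catches syntax errors and malformed
    JSON, but not logical inconsistencies, which need a model to judge.
    """
    try:
        if output_type == "json":
            json.loads(output)
        elif output_type == "python":
            ast.parse(output)
        # Other types (XML, YAML, ...) would get their own parsers.
        return True
    except (json.JSONDecodeError, SyntaxError):
        return False
```

A response can pass this check and still fail faithfulness, which is exactly why the axes are scored independently.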
Security
Does the response contain anything that should not be in the output?
This axis catches:
- Leaked credentials, API keys, or internal paths
- Generated code with security vulnerabilities
- Instructions that could enable harm if followed
- Information that violates data handling policies
This is the output-side complement to the Security Gate. The gate catches dangerous inputs; the security axis catches dangerous outputs. Together they create a full perimeter.
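An output-side leak check can be sketched as a pattern scan. The patterns below are a tiny illustrative subset (real secret scanners use far richer rule sets), and the function name is an assumption, not part of the Nomos API:

```python
import re

# Illustrative patterns only -- a real scanner would cover many more
# credential formats and use entropy heuristics as well.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                     # AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),   # PEM private key
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),         # generic API key
]

def leaks_secrets(output: str) -> bool:
    """Return True if the output matches any known secret pattern."""
    return any(p.search(output) for p in SECRET_PATTERNS)
```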
Quality
Is the response actually useful?
This axis is a holistic assessment that asks whether the output serves the user’s needs. A response can be faithful, well-formed, and secure, but still unhelpful if it is too vague, too verbose, or misses the point.
Quality is the most subjective axis and naturally has the widest score variance. It is also the axis where verification channels are most valuable — multiple independent assessments of quality converge on a more reliable signal than a single assessment.
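The four-axis structure described above might be modeled like this. The field names and shapes are hypothetical; the real Verifier API may differ:

```python
from dataclasses import dataclass

@dataclass
class AxisResult:
    score: float      # 0.0-1.0
    passed: bool
    reasoning: str    # why the score was assigned

@dataclass
class VerificationResult:
    faithfulness: AxisResult
    well_formedness: AxisResult
    security: AxisResult
    quality: AxisResult

    @property
    def passed(self) -> bool:
        # Each axis passes or fails independently; an overall pass
        # requires all four.
        return all(
            axis.passed
            for axis in (self.faithfulness, self.well_formedness,
                         self.security, self.quality)
        )
```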
Verification Channels
The channels parameter controls how many independent verification passes run for each request. More channels mean higher confidence but also higher cost and latency.
Single Channel (channels=1)
One verification pass. Fast and cheap. Appropriate for:
- Low-stakes tasks (formatting, simple Q&A)
- High-throughput pipelines where latency matters
- Cases where the domain is well-constrained
Double Channel (channels=2)
Two independent passes. Scores are averaged, and an axis must pass in both channels to count as passed. Appropriate for:
- Code generation where correctness matters
- Analysis tasks where factual accuracy is important
- Any task where a false positive would be costly
Triple Channel (channels=3)
Three independent passes with majority voting. The highest confidence level. Appropriate for:
- Security-critical outputs
- Medical, legal, or financial content
- Any task where the output will be acted on without human review
- Cases where the cost of a wrong answer significantly exceeds the cost of verification
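The aggregation rules above can be sketched in a few lines. This assumes each channel reports a score and a pass/fail verdict per axis; the dict shape is an assumption for illustration:

```python
from statistics import mean

def aggregate_axis(channel_results: list[dict]) -> dict:
    """Combine per-channel results for one axis.

    Sketch of the rules described above:
      - 1 channel: the single verdict stands
      - 2 channels: scores averaged; passes only if both channels pass
      - 3 channels: scores averaged; passes on majority vote
    """
    n = len(channel_results)
    score = mean(r["score"] for r in channel_results)
    votes = sum(r["passed"] for r in channel_results)
    if n == 2:
        passed = votes == 2      # must pass in both channels
    else:
        passed = votes > n / 2   # majority vote (trivial for n=1)
    return {"score": score, "passed": passed}
```

Note that with two channels a single dissenting verdict fails the axis, which is what makes the double-channel setting stricter than simple averaging would suggest.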
Scaling Strategy
A practical deployment scales verification with task criticality:
| Task Type | Recommended Channels | Rationale |
|---|---|---|
| general_chat | 1 | Low stakes, high volume |
| code_generation | 2 | Correctness matters |
| code_review | 2 | Missed issues are costly |
| security_analysis | 3 | Highest stakes |
| document_analysis | 1-2 | Depends on use case |
| data_analysis | 2 | Accuracy matters |
| strategic_reasoning | 2-3 | Decisions have consequences |
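The table above could be encoded as a simple lookup. The mapping and function names are illustrative, not part of the Nomos API; where the table gives a range, one end is chosen here as a default:

```python
# Mirrors the scaling table; for ranged entries (1-2, 2-3) the more
# conservative end is used as the default.
CHANNELS_BY_TASK = {
    "general_chat": 1,
    "code_generation": 2,
    "code_review": 2,
    "security_analysis": 3,
    "document_analysis": 2,    # table says 1-2, depending on use case
    "data_analysis": 2,
    "strategic_reasoning": 3,  # table says 2-3, depending on stakes
}

def channels_for(task_type: str) -> int:
    # Fall back to the most conservative setting for unknown task types.
    return CHANNELS_BY_TASK.get(task_type, 3)
```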
Evolution from Krisis
The Verifier evolved from Krisis, a multi-agent judgment system built for the Nomos Harness. Krisis used a panel of agents (each with a different model) to evaluate the quality of model outputs against a scoring rubric.
The key insights from Krisis that shaped the Verifier:
Structured axes outperform holistic scoring. Early versions of Krisis asked a single question: “Is this response good?” The results were noisy and inconsistent. Breaking evaluation into specific axes (faithfulness, well-formedness, security, quality) produced much more reliable and actionable results.
Independence matters more than quantity. Adding more verification passes with the same model showed diminishing returns. Switching to a different model family for verification produced a larger improvement than doubling the number of same-model passes.
Reasoning improves scores. When the verifier is required to explain its scores (the reasoning field), the scores themselves become more accurate. This is consistent with chain-of-thought findings — models make better judgments when they show their work.
Adversarial framing reduces false positives. Instructing the verifier to “find problems” rather than “evaluate quality” produces more critical and ultimately more useful assessments. The verification prompt is tuned for skepticism.
The Verifier is a focused, productionized version of the Krisis evaluation pipeline — stripped down to the four axes that proved most reliable, with channels replacing the multi-agent panel, and a clean API instead of the internal harness interface.