Multi-Agent Verification
The core problem with LLM outputs is trust. A model can produce a confident, well-formatted response that is completely wrong. It can generate code that looks correct but has a subtle bug. It can produce a security analysis that misses a critical vulnerability. There is no way to know from the output alone whether it should be trusted.
The Nomos Verifier addresses this by using a different model to evaluate the output. The generator and the verifier are adversarial by design — the verifier’s job is to find problems, not confirm correctness.
Why Separate Models
Using the same model to verify its own output is circular. A model that hallucinates a fact will also hallucinate that the fact is correct when asked to verify it. The errors are correlated because the knowledge gaps are the same.
Using a different model breaks this correlation. If Claude generates a response and GPT-4o verifies it, they have different training data, different biases, and different failure modes. Agreement between independent models is stronger evidence of correctness than self-consistency.
This is the same principle behind multi-witness testimony, double-entry bookkeeping, and independent code review. The value comes from independence, not from the number of checks.
The Four Axes
The Verifier evaluates every output on four axes. Each axis tests a different dimension of quality, and each can independently pass or fail.
Faithfulness
Does the response actually answer the question that was asked?
This axis catches:
- Factual errors (wrong capital, wrong date, wrong formula)
- Hallucinated information presented as fact
- Responses that drift from the topic
- Incomplete answers that miss key parts of the request
For code tasks, faithfulness means the code does what the specification requires. For analysis tasks, it means the conclusions follow from the evidence presented.
Well-Formedness
Is the output structurally correct for its type?
This axis catches:
- Syntax errors in generated code
- Malformed JSON, XML, or other structured formats
- Logical inconsistencies within the response
- Missing sections or broken structure
A response can be faithful (it addresses the right question) but poorly formed (the code has a syntax error). These are orthogonal concerns.
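The structural part of well-formedness can be illustrated with plain parsers. This is a minimal sketch, not the Verifier's actual implementation (which is model-based); the function name and type labels are illustrative:

```python
import ast
import json

def check_well_formedness(output: str, output_type: str) -> bool:
    """Does the output parse as its declared type?

    A purely structural check: it catches syntax errors and malformed
    JSON, but not logical inconsistencies, which need a model to judge.
    """
    try:
        if output_type == "json":
            json.loads(output)
        elif output_type == "python":
            ast.parse(output)
        # Other types (XML, YAML, ...) would get their own parsers.
        return True
    except (json.JSONDecodeError, SyntaxError):
        return False
```

A response can pass this check and still fail faithfulness, which is exactly why the axes are scored independently.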
Security
Does the response contain anything that should not be in the output?
This axis catches:
- Leaked credentials, API keys, or internal paths
- Generated code with security vulnerabilities
- Instructions that could enable harm if followed
- Information that violates data handling policies
This is the output-side complement to the Security Gate. The gate catches dangerous inputs; the security axis catches dangerous outputs. Together they create a full perimeter.
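An output-side leak check can be sketched as a pattern scan. The patterns below are a tiny illustrative subset (real secret scanners use far richer rule sets), and the function name is an assumption, not part of the Nomos API:

```python
import re

# Illustrative patterns only -- a real scanner would cover many more
# credential formats and use entropy heuristics as well.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                     # AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),   # PEM private key
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),         # generic API key
]

def leaks_secrets(output: str) -> bool:
    """Return True if the output matches any known secret pattern."""
    return any(p.search(output) for p in SECRET_PATTERNS)
```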
Quality
Is the response actually useful?
This axis is a holistic assessment that asks whether the output serves the user’s needs. A response can be faithful, well-formed, and secure, but still unhelpful if it is too vague, too verbose, or misses the point.
Quality is the most subjective axis and naturally has the widest score variance. It is also the axis where verification channels are most valuable — multiple independent assessments of quality converge on a more reliable signal than a single assessment.
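The four-axis structure described above might be modeled like this. The field names and shapes are hypothetical; the real Verifier API may differ:

```python
from dataclasses import dataclass

@dataclass
class AxisResult:
    score: float      # 0.0-1.0
    passed: bool
    reasoning: str    # why the score was assigned

@dataclass
class VerificationResult:
    faithfulness: AxisResult
    well_formedness: AxisResult
    security: AxisResult
    quality: AxisResult

    @property
    def passed(self) -> bool:
        # Each axis passes or fails independently; an overall pass
        # requires all four.
        return all(
            axis.passed
            for axis in (self.faithfulness, self.well_formedness,
                         self.security, self.quality)
        )
```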
Verification Channels
The channels parameter controls how many independent verification passes run for each request. More channels mean higher confidence but also higher cost and latency.
Single Channel (channels=1)
One verification pass. Fast and cheap. Appropriate for:
- Low-stakes tasks (formatting, simple Q&A)
- High-throughput pipelines where latency matters
- Cases where the domain is well-constrained
Double Channel (channels=2)
Two independent passes. Scores are averaged, and an axis must pass in both channels to count as passed. Appropriate for:
- Code generation where correctness matters
- Analysis tasks where factual accuracy is important
- Any task where a false positive would be costly
Triple Channel (channels=3)
Three independent passes with majority voting. The highest confidence level. Appropriate for:
- Security-critical outputs
- Medical, legal, or financial content
- Any task where the output will be acted on without human review
- Cases where the cost of a wrong answer significantly exceeds the cost of verification
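The aggregation rules above can be sketched in a few lines. This assumes each channel reports a score and a pass/fail verdict per axis; the dict shape is an assumption for illustration:

```python
from statistics import mean

def aggregate_axis(channel_results: list[dict]) -> dict:
    """Combine per-channel results for one axis.

    Sketch of the rules described above:
      - 1 channel: the single verdict stands
      - 2 channels: scores averaged; passes only if both channels pass
      - 3 channels: scores averaged; passes on majority vote
    """
    n = len(channel_results)
    score = mean(r["score"] for r in channel_results)
    votes = sum(r["passed"] for r in channel_results)
    if n == 2:
        passed = votes == 2      # must pass in both channels
    else:
        passed = votes > n / 2   # majority vote (trivial for n=1)
    return {"score": score, "passed": passed}
```

Note that with two channels a single dissenting verdict fails the axis, which is what makes the double-channel setting stricter than simple averaging would suggest.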
Scaling Strategy
A practical deployment scales verification with task criticality:
| Task Type | Recommended Channels | Rationale |
|---|---|---|
| general_chat | 1 | Low stakes, high volume |
| code_generation | 2 | Correctness matters |
| code_review | 2 | Missed issues are costly |
| security_analysis | 3 | Highest stakes |
| document_analysis | 1-2 | Depends on use case |
| data_analysis | 2 | Accuracy matters |
| strategic_reasoning | 2-3 | Decisions have consequences |
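The table above could be encoded as a simple lookup. The mapping and function names are illustrative, not part of the Nomos API; where the table gives a range, one end is chosen here as a default:

```python
# Mirrors the scaling table; for ranged entries (1-2, 2-3) the more
# conservative end is used as the default.
CHANNELS_BY_TASK = {
    "general_chat": 1,
    "code_generation": 2,
    "code_review": 2,
    "security_analysis": 3,
    "document_analysis": 2,    # table says 1-2, depending on use case
    "data_analysis": 2,
    "strategic_reasoning": 3,  # table says 2-3, depending on stakes
}

def channels_for(task_type: str) -> int:
    # Fall back to the most conservative setting for unknown task types.
    return CHANNELS_BY_TASK.get(task_type, 3)
```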
Evolution from Krisis
The Verifier evolved from Krisis, a multi-agent judgment system built for the Nomos Harness. Krisis used a panel of agents (each with a different model) to evaluate the quality of model outputs against a scoring rubric.
The key insights from Krisis that shaped the Verifier:
Structured axes outperform holistic scoring. Early versions of Krisis asked a single question: “Is this response good?” The results were noisy and inconsistent. Breaking evaluation into specific axes (faithfulness, well-formedness, security, quality) produced much more reliable and actionable results.
Independence matters more than quantity. Adding more verification passes with the same model showed diminishing returns. Switching to a different model family for verification produced a larger improvement than doubling the number of same-model passes.
Reasoning improves scores. When the verifier is required to explain its scores (the reasoning field), the scores themselves become more accurate. This is consistent with chain-of-thought findings — models make better judgments when they show their work.
Adversarial framing reduces false positives. Instructing the verifier to “find problems” rather than “evaluate quality” produces more critical and ultimately more useful assessments. The verification prompt is tuned for skepticism.
The Verifier is a focused, productionized version of the Krisis evaluation pipeline — stripped down to the four axes that proved most reliable, with channels replacing the multi-agent panel, and a clean API instead of the internal harness interface.