Multi-Agent Verification

The core problem with LLM outputs is trust. A model can produce a confident, well-formatted response that is completely wrong. It can generate code that looks correct but has a subtle bug. It can produce a security analysis that misses a critical vulnerability. There is no way to know from the output alone whether it should be trusted.

The Nomos Verifier addresses this by using a different model to evaluate the output. The generator and the verifier are adversarial by design — the verifier’s job is to find problems, not confirm correctness.

Using the same model to verify its own output is circular. A model that hallucinates a fact will also hallucinate that the fact is correct when asked to verify it. The errors are correlated because the knowledge gaps are the same.

Using a different model breaks this correlation. If Claude generates a response and GPT-4o verifies it, they have different training data, different biases, and different failure modes. Agreement between independent models is stronger evidence of correctness than self-consistency.
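The generate-then-verify flow can be sketched as a minimal pipeline. The `generate()` and `verify()` helpers below are hypothetical stand-ins for calls to two independent model families, not a real API:

```python
# Minimal sketch of cross-model verification. generate() and verify() are
# hypothetical placeholders: in a real deployment each would call an LLM
# from a *different* model family so their failure modes are uncorrelated.

def generate(prompt: str) -> str:
    # Stand-in for the generator model's response.
    return f"answer to: {prompt}"

def verify(prompt: str, response: str) -> dict:
    # Stand-in for an independent verifier model. Its job is adversarial:
    # find problems with the response, not confirm it.
    return {"passed": response.startswith("answer"), "reasoning": "stub check"}

def answer_with_verification(prompt: str) -> dict:
    response = generate(prompt)
    verdict = verify(prompt, response)
    return {"response": response, "verdict": verdict}
```

The essential property is that the two functions are backed by models with different training data; agreement between them carries more evidence than self-consistency.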

This is the same principle behind multi-witness testimony, double-entry bookkeeping, and independent code review. The value comes from independence, not from the number of checks.

The Verifier evaluates every output on four axes. Each axis tests a different dimension of quality, and each can independently pass or fail.

Faithfulness

Does the response actually answer the question that was asked?

This axis catches:

  • Factual errors (wrong capital, wrong date, wrong formula)
  • Hallucinated information presented as fact
  • Responses that drift from the topic
  • Incomplete answers that miss key parts of the request

For code tasks, faithfulness means the code does what the specification requires. For analysis tasks, it means the conclusions follow from the evidence presented.

Well-Formedness

Is the output structurally correct for its type?

This axis catches:

  • Syntax errors in generated code
  • Malformed JSON, XML, or other structured formats
  • Logical inconsistencies within the response
  • Missing sections or broken structure

A response can be faithful (it addresses the right question) but poorly formed (the code has a syntax error). These are orthogonal concerns.
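Some well-formedness checks are purely mechanical and can run alongside the LLM verifier. A minimal sketch, using standard-library parsers; note that passing these checks says nothing about faithfulness, which is exactly the orthogonality described above:

```python
# Mechanical well-formedness checks: syntax for Python code, parseability
# for JSON. These are illustrative pre-checks, not the Verifier itself.
import ast
import json

def python_well_formed(code: str) -> bool:
    """True if the code parses as valid Python syntax."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def json_well_formed(text: str) -> bool:
    """True if the text parses as valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

A faithful answer can still fail these checks, and syntactically perfect code can still be wrong.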

Security

Does the response contain anything that should not be in the output?

This axis catches:

  • Leaked credentials, API keys, or internal paths
  • Generated code with security vulnerabilities
  • Instructions that could enable harm if followed
  • Information that violates data handling policies

This is the output-side complement to the Security Gate. The gate catches dangerous inputs; the security axis catches dangerous outputs. Together they create a full perimeter.

Quality

Is the response actually useful?

This axis is a holistic assessment that asks whether the output serves the user’s needs. A response can be faithful, well-formed, and secure, but still unhelpful if it is too vague, too verbose, or misses the point.

Quality is the most subjective axis and naturally has the widest score variance. It is also the axis where verification channels are most valuable — multiple independent assessments of quality converge on a more reliable signal than a single assessment.
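The four axes can be pictured as a per-pass result record in which each axis carries a score, a pass/fail flag, and the verifier's reasoning. The field names below are illustrative assumptions, not the actual Nomos schema:

```python
# Hypothetical shape of a single verification pass. Each of the four axes
# scores and passes independently; the overall pass requires all four.
from dataclasses import dataclass

@dataclass
class AxisResult:
    score: float     # illustrative 0.0-1.0 score
    passed: bool     # did this axis pass?
    reasoning: str   # verifier's explanation for the score

@dataclass
class VerificationResult:
    faithfulness: AxisResult
    well_formedness: AxisResult
    security: AxisResult
    quality: AxisResult

    @property
    def passed(self) -> bool:
        # Every axis must pass independently for the output to pass.
        return all(axis.passed for axis in (
            self.faithfulness, self.well_formedness,
            self.security, self.quality))
```

Because each axis passes or fails independently, a single failing axis (say, a leaked credential on the security axis) fails the whole output even when the other three are strong.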

The channels parameter controls how many independent verification passes run for each request. More channels mean higher confidence but also higher cost and latency.

channels=1

One verification pass. Fast and cheap. Appropriate for:

  • Low-stakes tasks (formatting, simple Q&A)
  • High-throughput pipelines where latency matters
  • Cases where the domain is well-constrained

channels=2

Two independent passes. Scores are averaged, and an axis must pass in both channels to count as passed. Appropriate for:

  • Code generation where correctness matters
  • Analysis tasks where factual accuracy is important
  • Any task where a false positive would be costly

channels=3

Three independent passes with majority voting. The highest confidence level. Appropriate for:

  • Security-critical outputs
  • Medical, legal, or financial content
  • Any task where the output will be acted on without human review
  • Cases where the cost of a wrong answer significantly exceeds the cost of verification
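The aggregation rules above can be sketched for a single axis: scores are averaged across channels; with one or two channels the axis must pass in every channel, and with three a majority vote decides. This is a sketch of the stated rules, not the production implementation:

```python
# Aggregate one axis across N verification channels.
# channels=1 or 2: axis must pass in every channel (unanimous).
# channels=3: majority vote. Scores are averaged in all cases.

def aggregate_axis(channel_results: list[dict]) -> dict:
    n = len(channel_results)
    passes = sum(1 for r in channel_results if r["passed"])
    if n <= 2:
        passed = passes == n       # unanimous pass required
    else:
        passed = passes > n // 2   # majority vote
    avg_score = sum(r["score"] for r in channel_results) / n
    return {"passed": passed, "score": avg_score}
```

With two channels a single dissent fails the axis; with three, one outlier channel can be outvoted, which is what buys the extra confidence.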

A practical deployment scales verification with task criticality:

Task Type             Recommended Channels   Rationale
general_chat          1                      Low stakes, high volume
code_generation       2                      Correctness matters
code_review           2                      Missed issues are costly
security_analysis     3                      Highest stakes
document_analysis     1-2                    Depends on use case
data_analysis         2                      Accuracy matters
strategic_reasoning   2-3                    Decisions have consequences
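The recommendations above can be encoded as a simple lookup. The `channels_for()` helper and its `high_stakes` flag are illustrative assumptions for resolving the ranged entries, not part of the Nomos API:

```python
# Recommended channel counts per task type, stored as (min, max) ranges.
# Ranged entries (e.g. document_analysis: 1-2) are resolved by stakes.
CHANNELS_BY_TASK = {
    "general_chat":        (1, 1),
    "code_generation":     (2, 2),
    "code_review":         (2, 2),
    "security_analysis":   (3, 3),
    "document_analysis":   (1, 2),
    "data_analysis":       (2, 2),
    "strategic_reasoning": (2, 3),
}

def channels_for(task_type: str, high_stakes: bool = True) -> int:
    # Unknown task types default to two channels as a middle ground.
    lo, hi = CHANNELS_BY_TASK.get(task_type, (2, 2))
    return hi if high_stakes else lo
```

Scaling channels to criticality keeps verification cost proportional to the cost of being wrong.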

The Verifier evolved from Krisis, a multi-agent judgment system built for the Nomos Harness. Krisis used a panel of agents (each with a different model) to evaluate the quality of model outputs against a scoring rubric.

The key insights from Krisis that shaped the Verifier:

Structured axes outperform holistic scoring. Early versions of Krisis asked a single question: “Is this response good?” The results were noisy and inconsistent. Breaking evaluation into specific axes (faithfulness, well-formedness, security, quality) produced much more reliable and actionable results.

Independence matters more than quantity. Adding more verification passes with the same model showed diminishing returns. Switching to a different model family for verification produced a larger improvement than doubling the number of same-model passes.

Reasoning improves scores. When the verifier is required to explain its scores (the reasoning field), the scores themselves become more accurate. This is consistent with chain-of-thought findings — models make better judgments when they show their work.

Adversarial framing reduces false positives. Instructing the verifier to “find problems” rather than “evaluate quality” produces more critical and ultimately more useful assessments. The verification prompt is tuned for skepticism.

The Verifier is a focused, productionized version of the Krisis evaluation pipeline — stripped down to the four axes that proved most reliable, with channels replacing the multi-agent panel, and a clean API instead of the internal harness interface.