Verifiers
Verifiers are programmatic, zero-cost checks that score a model response against a rule. Every verifier is pure Python — no model call, no token cost, no flakiness. One exception is field_judge, which calls an LLM and is the only verifier that incurs a per-evaluation cost.
The platform ships with 49 built-in verifier types grouped into nine families.
Verifier vs. judge:
- Verifier — the answer is mechanically checkable: “is the output valid JSON?”, “did the response contain the patient’s MRN?”, “is there a markdown table with at least three columns?”
- Judge (LLM rubric) — the answer needs language understanding: “was the response empathetic?”, “did the agent stay on-topic?”
Use both. Verifiers catch cheap, mechanical regressions on every call; judges sample the harder ones.
Two ways to use them
| Mode | Endpoint | When to use |
|---|---|---|
| Stateless validation | POST /api/v1/verify | You already know which verifiers to run for this response (e.g. IFEval-style instruction generation produces verifier specs alongside the prompt). One round trip → one report. No setup. |
| Continuous monitoring | POST /api/v1/logs/ingest + harness setup | You want every production log scored automatically by a fixed set of verifiers + LLM judges, with a dashboard. Verifiers are registered once and attached to a harness. |
The runners and config schemas are identical in both modes — pick the path that fits your workflow.
Stateless validation — POST /api/v1/verify
Send the model output plus an inline list of verifier specs; get back a per-verifier report and an aggregate score. No resources to register.
POST /api/v1/verify
X-API-Key: sk-…
Content-Type: application/json
{
"output": "Sure thing! So that's (555) 123-4567 — is that correct?",
"extracted_json": { "customer": { "phone": "5551234567" } },
"verifiers": [
{ "type": "no_emoji", "config": {} },
{ "type": "max_sentence_length", "config": { "max_words": 25 } },
{ "type": "value_echoed", "config": { "value": "5551234567" } },
{ "type": "contains_phrase", "config": { "phrase": "is that correct" } }
],
"external_id": "turn_42",
"extra": { "session_id": "abc123" }
}Response:
{
"record_id": "8f2c3a40-1d4b-4e3a-9f9c-2b8e7e4a6c11",
"passed": true,
"score": 1.0,
"latency_ms": 4,
"results": [
{ "type": "no_emoji", "passed": true, "score": 1.0, "flags": [], "details": {} },
{ "type": "max_sentence_length", "passed": true, "score": 1.0, "flags": [], "details": { "sentence_count": 2 } },
{ "type": "value_echoed", "passed": true, "score": 1.0, "flags": [], "details": { "value": "5551234567", "mode": "normalized" } },
{ "type": "contains_phrase", "passed": true, "score": 1.0, "flags": [], "details": { "phrase": "is that correct" } }
]
}Request fields
| Field | Type | Notes |
|---|---|---|
output | string | Required. The model response to verify. |
verifiers | array | Required. 1 to 25 entries. Each entry: { "type": "<key>", "config": { ... } }. Type keys come from GET /api/v1/models/verifier-types. |
extracted_json | object | Optional. Used by verifiers like field_judge and required_slots. |
external_id | string | Optional. Your own ID for this call — surfaced in dashboards and useful for joining back to your traces. Max 255 chars. |
project_id | uuid | Optional. Falls back to the API key’s project. |
extra | object | Optional. Arbitrary metadata stored on the persisted record. |
Response fields
| Field | Type | Notes |
|---|---|---|
record_id | uuid | The persisted log row. Use it to fetch later via GET /api/v1/logs/{id}. |
passed | bool | true iff every verifier passed. |
score | float | Mean of per-verifier scores, 0.0–1.0. |
latency_ms | int | End-to-end runner time. |
results[] | array | Per-verifier envelope. See result shape. |
Errors
- 400 —
unknown_verifier_typeif anytypekey isn’t in the registry. Response body lists the offending keys. - 401 — invalid or missing
X-API-Key. - 422 — payload validation (e.g. empty
verifiers, more than 25 entries).
Persistence
Every call writes a kind=verify row to your tenant’s log store with: the output, extracted_json, the verifier specs you sent, the per-verifier results, the aggregate score, and latency_ms. Use it for billing reconciliation, regression tracking, and dashboard queries — same row format as ingested production logs.
To fetch the row later:
GET /api/v1/logs/{record_id}
Authorization: Bearer <session-jwt>Python example
import os, requests
def verify(output, verifiers, extracted_json=None, **extra):
return requests.post(
"https://ai-gateway.bolder.services/api/v1/verify",
headers={"X-API-Key": os.environ["BEVAL_API_KEY"]},
json={
"output": output,
"extracted_json": extracted_json,
"verifiers": verifiers,
**extra,
},
timeout=10,
).json()
report = verify(
output=assistant_response,
extracted_json=tool_call_args,
verifiers=instruction.verifiers, # list you carry per instruction
external_id=f"turn_{turn_id}",
)
if not report["passed"]:
log.warning("validation failed", flags=[f for r in report["results"] for f in r["flags"]])TypeScript example
const report = await fetch("https://ai-gateway.bolder.services/api/v1/verify", {
method: "POST",
headers: {
"X-API-Key": process.env.BEVAL_API_KEY!,
"Content-Type": "application/json",
},
body: JSON.stringify({
output: assistantResponse,
extracted_json: toolCallArgs,
verifiers: instruction.verifiers,
external_id: `turn_${turnId}`,
}),
}).then((r) => r.json());
if (!report.passed) {
console.warn("validation failed", report.results.filter((r) => !r.passed));
}IFEval-style integration
If you generate prompts and constraints together (instruction text + a verifier spec per constraint), forward those specs unmodified:
instruction = {
"text": "Limit your response to 25 words and end with a postscript.",
"verifiers": [
{"type": "word_count", "config": {"relation": "at_most", "expected": 25}},
{"type": "postscript", "config": {"marker": "PS:"}},
],
}
response = call_my_llm(instruction["text"])
report = verify(response, instruction["verifiers"])Continuous monitoring — harness flow
For tenants who want every production log scored automatically — without the caller picking verifiers each time — register verifiers once and attach them to a harness. The harness runs on every log ingested via POST /api/v1/logs/ingest.
How verifiers run in this mode
log ingest → active harnesses for the tenant
→ for each harness:
→ for each attached verifier:
runner(output, config, extracted_json?)
returns { passed, score, flags, details }
→ aggregate: verifier_score = mean(scores)
verifier_passed = all(passed)Verifiers run synchronously in the ingest path — every runner is designed to finish in milliseconds. field_judge is the exception (1–2 s per call).
Harness API
The endpoints below are specific to the monitoring path — registering verifier resources and binding them to harnesses. The first two (list types, result shape) apply to both modes.
List all verifier types
GET /api/v1/models/verifier-typesReturns the registry — every type’s key, human-readable name, description, params schema, and tags.
[
{
"key": "json_schema",
"name": "JSON Schema",
"description": "Output JSON validates against a JSON Schema (Draft 2020-12).",
"params": [
{ "key": "schema", "label": "JSON Schema", "type": "json", "required": true }
],
"tags": []
},
{
"key": "no_emoji",
"name": "No Emoji",
"description": "Output contains no emoji (TTS reads them as 'smiley face').",
"params": [],
"tags": ["voice"]
}
]Param type values are: string, textarea, number, boolean, select, json, string_array, number_array.
Create a verifier
POST /api/v1/verifiers/tenant/{tenant_id}
Content-Type: application/json
{
"name": "Prescription schema valid",
"description": "Extracted prescription JSON must match the medication schema",
"verifier_type": "json_schema",
"config": {
"schema": {
"type": "object",
"required": ["medications"],
"properties": {
"medications": {
"type": "array",
"items": {
"type": "object",
"required": ["drug", "dose_mg", "frequency"],
"properties": {
"drug": { "type": "string" },
"dose_mg": { "type": "number" },
"frequency": { "type": "string" }
}
}
}
}
}
},
"is_active": true
}Update / delete a verifier
PUT /api/v1/verifiers/{id}
DELETE /api/v1/verifiers/{id}Attach verifiers to a harness
POST /api/v1/harnesses/tenant/{tenant_id}
Content-Type: application/json
{
"name": "Clinical extraction baseline",
"verifier_ids": ["<id1>", "<id2>", "<id3>"],
"judge_ids": [],
"run_on_ingest": true,
"sampling_rate": 1.0
}Every log ingested for this tenant will be evaluated by every attached verifier.
Run a harness on demand
POST /api/v1/logs/{log_id}/run-harness/{harness_id}Re-runs the harness against an existing log. Replaces any prior result for the same (log_id, harness_id) pair.
Result shape
Every verifier runner returns the same envelope:
{
"passed": true,
"score": 1.0,
"flags": [],
"details": { "...verifier-specific": "..." }
}passed— boolean. Aggregated asverifier_passed = all(passed).score—0.0to1.0. Aggregated asverifier_score = mean(scores).flags— short machine-readable failure tags, e.g."word_count:got_47_expected_at_least_50". Useful for grouping failures.details— structured payload. Counts, matched values, sample failing sentences, etc.
Recipes
Healthcare VLM — prescription extraction
Vision-language model reads a prescription image and emits structured JSON.
| Verifier | Why |
|---|---|
json_valid | Output must parse — no markdown wrappers. |
json_schema | Strict shape: medications[].drug, dose_mg, frequency. |
required_slots (patient.mrn, patient.dob) | Don’t write the row if patient identity is missing. |
numeric_inclusion (≥ 1) | A dosage must include digits — catches “as needed” leakage. |
forbidden_words (off-label brand list) | Off-formulary drugs flagged. |
field_judge (medications.0.dose_mg, “is the dose plausible for the patient’s age?”) | Hybrid sanity check. |
Legal drafting — formal correspondence
Agent drafts a demand letter or filing.
| Verifier | Why |
|---|---|
starts_with (“Dear”) | Mandatory salutation. |
ends_with (“Sincerely,“) | Mandatory sign-off. |
placeholder_count (= 0) | No [CLIENT NAME] template leakage in production output. |
forbidden_words (banned phrases, opposing-party slips) | Compliance. |
terminal_punctuation_allowlist (”.”) | No ! or ? in formal letters. |
regex_match (Case No\. \d{2}-CV-\d{4,6}) | Case number format. |
max_paragraph_length (≤ 600 chars) | Readability for the bench. |
Voice agent — TTS-safe replies
Voice agent on a phone call.
| Verifier | Why |
|---|---|
no_emoji | TTS pronounces them. |
no_markdown_formatting | TTS reads **bold** as asterisk-bold-asterisk. |
no_url_or_email | Long URLs read aloud are useless. |
contains_phrase (“is that correct”) | Confirmation step before commit actions. |
forbidden_phrases (e.g. “I am an AI”, “I cannot”) | Compliance + brand voice. |
value_echoed (phone / address / order ID) | Agent must repeat back critical fields. |
required_slots (customer.name, customer.phone) | Tool-call slots filled before booking. |
max_sentence_length (≤ 25 words) | Long sentences sound robotic. |
character_count (≤ 800 chars) | Hard cap on per-turn audio length. |
Reference
Every verifier returns the standard { passed, score, flags, details } envelope. Examples below show the parts that vary.
Family 1 — JSON / structure
json_valid
Output parses as valid JSON.
| Param | Type | Notes |
|---|
Use case. Tool-call agents that should emit pure JSON; VLM extraction pipelines.
config: {}
# pass
output: '{"diagnosis": "J45.901", "confidence": 0.92}'
# fail — leaked prose
output: 'sure thing — { diagnosis: J45.901 }'
flags: ["invalid_json"]json_schema
Output JSON validates against a Draft 2020-12 schema.
| Param | Type | Notes |
|---|---|---|
schema | json | Any valid JSON Schema. |
Use case. Strict enforcement of an extraction or tool-call signature.
# Healthcare VLM extracting an EOB line item
config:
schema:
type: object
required: [member_id, claim_id, paid_amount]
properties:
member_id: { type: string, pattern: "^[A-Z0-9]{9}$" }
claim_id: { type: string }
paid_amount: { type: number, minimum: 0 }Family 2 — Counts (use a relation + expected)
Every verifier in this family takes the same two parameters:
| Param | Type | Notes |
|---|---|---|
relation | select | One of at_least, at_most, equal_to, less_than, greater_than. |
expected | number | The integer to compare against. |
word_count
Total words in the output.
# Legal — keep summary briefs under 250 words
config: { relation: at_most, expected: 250 }character_count
Total characters.
# Per-turn audio length cap (voice)
config: { relation: at_most, expected: 800 }sentence_count
Sentences (split on ./!/?).
# Healthcare — patient-instruction summary, exactly 3 sentences
config: { relation: equal_to, expected: 3 }paragraph_count
Paragraphs (blank-line separated).
# Legal memo — minimum two paragraphs
config: { relation: at_least, expected: 2 }bullet_count
Bullet items (-, *, •, +).
# Discharge instructions — at least three action items
config: { relation: at_least, expected: 3 }numbered_list_count
Numbered list items (1., 2., …).
# Legal procedure — exactly five steps in a checklist
config: { relation: equal_to, expected: 5 }placeholder_count
[bracketed] placeholders.
# Legal — production output must have zero leftover template tokens
# like [CLIENT NAME], [DATE OF FILING]
config: { relation: equal_to, expected: 0 }numeric_inclusion
Total digits in the response.
# Healthcare — every dose must include digits (catches "as needed")
config: { relation: at_least, expected: 1 }question_exclaim_count
Total ? and ! marks.
# Formal legal writing — at most one question, no exclamations
config: { relation: at_most, expected: 1 }Family 3 — Affix and pattern
starts_with
Output begins with a string.
| Param | Type | Notes |
|---|---|---|
prefix | string | Required. |
case_sensitive | boolean | Default true. |
Use case. Mandatory salutation in legal correspondence.
config: { prefix: "Dear ", case_sensitive: true }ends_with
Output ends with a string.
| Param | Type | Notes |
|---|---|---|
suffix | string | Required. |
case_sensitive | boolean | Default true. |
Use case. Sign-off enforcement.
config: { suffix: "Sincerely,\nCounsel" }regex_match
Output (must / must-not) match a regular expression.
| Param | Type | Notes |
|---|---|---|
pattern | string | Python regex. |
must_match | boolean | Default true. Set false to invert. |
ignore_case | boolean | Default false. |
Use case. Custom format checks — case numbers, MRNs, ICD-10 codes.
# Legal — case number format like "Case No. 24-CV-001234"
config:
pattern: "Case No\\. \\d{2}-CV-\\d{4,6}"
must_match: true
# Healthcare — block free-text dosing in a structured field
config:
pattern: "as needed|prn|q\\.d\\."
must_match: false
ignore_case: truewrap_with
Output starts and ends with the same phrase.
| Param | Type | Notes |
|---|---|---|
wrap_phrase | string | Required. |
Use case. Markdown code fences; XML-style envelopes.
config: { wrap_phrase: "```" }quoted
Output is fully wrapped in "...". No params.
Use case. Legal — direct-quotation extraction tasks.
title_wrapped
First non-empty line is wrapped in << >>. No params.
Use case. Report-style outputs that must lead with a <<Title>> line for downstream parsing.
# pass: "<<Quarterly Compliance Review>>\n\nbody…"
# fail: "Quarterly Compliance Review\n\nbody…"Family 4 — Frequency
keyword_frequency
A keyword appears the required number of times (whole-word match).
| Param | Type | Notes |
|---|---|---|
keyword | string | Required. |
relation | select | See Family 2. |
expected | number | Required. |
case_sensitive | boolean | Default false. |
Use case. Healthcare — mention “follow-up” at least once in a discharge summary.
config: { keyword: "follow-up", relation: at_least, expected: 1 }char_frequency
A single character appears the required number of times.
| Param | Type | Notes |
|---|---|---|
char | string (1 char) | Required. |
relation / expected | — | Family 2 params. |
case_sensitive | boolean | Default true. |
Use case. Detect token overuse, or count delimiters in a structured response.
alliteration
Words starting with a target letter, count vs relation.
| Param | Type | Notes |
|---|---|---|
letter | string (1 char) | Required. |
relation / expected | — | Family 2 params. |
case_sensitive | boolean | Default false. |
Use case. Creative-writing harnesses; stylistic tasks.
config: { letter: "s", relation: at_least, expected: 5 }
# pass: "Sally sells silver seashells silently."Family 5 — Keyword presence and position
keywords_all_present
Every listed keyword must appear at least once.
| Param | Type | Notes |
|---|---|---|
keywords | string_array | Required. Comma- or newline-separated input. |
case_sensitive | boolean | Default false. |
Use case. Required talking points — patient-education content must mention every key term.
config:
keywords: ["dosage", "side effects", "when to call your doctor"]forbidden_words
None of the listed words appear (whole-word match).
| Param | Type | Notes |
|---|---|---|
words | string_array | Required. |
case_sensitive | boolean | Default false. |
Use case. Banned terms — opposing-counsel names in a draft, off-label drugs in a clinical summary.
# Legal — never reference the opposing party informally
config: { words: ["Bob", "the other side"] }keyword_position
The word at index position (0-based) equals keyword.
| Param | Type | Notes |
|---|---|---|
keyword | string | Required. |
position | number | 0-indexed word position. |
case_sensitive | boolean | Default false. |
Use case. Enforce a structural opening — first word must be "WHEREAS", etc.
palindrome_word
Output contains at least one palindrome word at or above the minimum length.
| Param | Type | Notes |
|---|---|---|
min_length | number | Default 3. |
Use case. Niche generative tasks; fun-fact validation.
Family 6 — Length constraints
max_sentence_length
No sentence exceeds the maximum word count.
| Param | Type | Notes |
|---|---|---|
max_words | number | Required, ≥ 1. |
Use case. Healthcare — patient-facing instructions, plain-language guidelines.
config: { max_words: 20 }max_word_repetition
No word repeats more than N times.
| Param | Type | Notes |
|---|---|---|
max_repeats | number | Required, ≥ 1. |
case_sensitive | boolean | Default false. |
Use case. Detect parroting / stuck loops.
unique_word_count
Distinct words count vs relation.
| Param | Type | Notes |
|---|---|---|
relation / expected | — | Family 2 params. |
case_sensitive | boolean | Default false. |
Use case. Vocabulary diversity floor for long-form writing.
word_length_range
Every word’s length is within [min_length, max_length].
| Param | Type | Notes |
|---|---|---|
min_length | number | Required, ≥ 1. |
max_length | number | Required, ≥ min_length. |
Use case. Plain-language requirements — flag any word longer than 12 characters in pediatric patient handouts.
avg_word_length
Average word length is within [min_ratio, max_ratio].
| Param | Type | Notes |
|---|---|---|
min_ratio | number | Required, ≥ 0. |
max_ratio | number | Required. |
Use case. Tone calibration — readable but not childish.
config: { min_ratio: 4.5, max_ratio: 6.0 }paragraph_word_count
Every paragraph’s word count satisfies the relation.
| Param | Type | Notes |
|---|---|---|
relation / expected | — | Family 2 params. |
Use case. Long-form content — legal memos, every paragraph at least 30 words.
Family 7 — Markdown structure
section_count
Number of ### <name> N markdown sections matches the relation.
| Param | Type | Notes |
|---|---|---|
section_name | string | Required. Must not contain # or *. |
relation / expected | — | Family 2 params. |
Use case. Templated reports with ### Finding 1, ### Finding 2, …
nested_list
A list at least min_depth deep with at least num_subitems items exists.
| Param | Type | Notes |
|---|---|---|
min_depth | number | Default 2. |
num_subitems | number | Default 2. |
Use case. Outline generation, structured how-to content.
config: { min_depth: 2, num_subitems: 2 }
# pass:
# - top
# - sub-1
# - sub-2markdown_table
Output contains a markdown table with at least N rows and N columns.
| Param | Type | Notes |
|---|---|---|
min_rows | number | Default 1. |
min_cols | number | Default 2. |
Use case. Healthcare — drug-interaction or differential-diagnosis comparison tables.
heading_depth
All required heading levels (1=#, 2=##, …) are present.
| Param | Type | Notes |
|---|---|---|
levels | number_array | Required. e.g. [1, 2] means at least one # and one ##. |
Use case. Document structure — legal briefs requiring # for sections and ## for subsections.
max_paragraph_length
No paragraph exceeds max_chars characters.
| Param | Type | Notes |
|---|---|---|
max_chars | number | Required, ≥ 1. |
Use case. Readability guardrail.
sentences_per_paragraph
Every paragraph’s sentence count satisfies the relation.
| Param | Type | Notes |
|---|---|---|
relation / expected | — | Family 2 params. |
Use case. Style constraint — at most three sentences per paragraph.
sentence_endings_variety
At least N distinct sentence-ending punctuation marks are used (. ! ?).
| Param | Type | Notes |
|---|---|---|
min_variants | number | Default 2. |
Use case. Detect monotone output (only .-terminated sentences).
terminal_punctuation_allowlist
Only the allowed end-of-sentence marks (subset of .!?) appear.
| Param | Type | Notes |
|---|---|---|
allowed | string | E.g. "." to forbid ! and ?. |
Use case. Formal legal writing — no exclamations.
no_comma
Output contains no commas. No params.
Use case. Simple-style writing exercises.
no_period
Output contains no periods. No params.
Use case. Stream-of-consciousness or single-sentence outputs.
postscript
Last non-empty line begins with a marker (e.g. PS:) and has content after it.
| Param | Type | Notes |
|---|---|---|
marker | string | Default "PS:". |
Use case. Letter-style outputs that must end with a postscript.
Family 8 — Voice-agent verifiers
All tagged voice in the registry. Designed for transcript and tool-call output from voice agents.
no_emoji
Output contains no emoji. No params.
Why it matters. TTS engines pronounce 😀 as “smiley face” or “grinning face emoji.” Hard fail for any voice surface.
no_markdown_formatting
Output is plain text — no **bold**, *italic*, __under__, headings, bullets, numbered lists, code fences, inline code, or pipe tables. No params.
Why it matters. TTS reads markdown literally. Detected formatting kinds appear in details.kinds for triage.
no_url_or_email
Output contains no URLs (https://..., www.x.com) or email addresses. No params.
Why it matters. “Visit aitch tee tee pee colon slash slash example dot com” is unusable. Send these via SMS or email after the call.
contains_phrase
Output contains a required substring.
| Param | Type | Notes |
|---|---|---|
phrase | string | Required. |
case_sensitive | boolean | Default false. |
Use case. Confirmation step (“is that correct”), legal disclaimer, closing phrase.
config: { phrase: "is that correct" }Unlike
keyword_frequency, this is a substring match — works for multi-word phrases.
forbidden_phrases
Output contains none of the listed phrases (substring match).
| Param | Type | Notes |
|---|---|---|
phrases | string_array | Required. |
case_sensitive | boolean | Default false. |
Use case. “I am an AI”, “I cannot help with that”, competitor names, internal model identifiers.
config:
phrases: ["I am an AI", "as a language model", "I cannot"]value_echoed
Agent repeats back a critical value (phone, address, order ID).
| Param | Type | Notes |
|---|---|---|
value | string | Required. The literal value to look for. |
normalize_digits | boolean | Default true. Strips non-alphanumerics for comparison so (555) 123-4567 matches 5551234567. |
case_sensitive | boolean | Default false. |
Use case. Confirmation patterns — “So that’s 555-123-4567 — is that right?”
config: { value: "5551234567", normalize_digits: true }
# pass: "(555) 123-4567"
# pass: "555 1234 567"
# fail: "5551234560"details.mode reports "normalized" or "literal" depending on which match path succeeded.
required_slots
Every listed JSON path resolves to a non-empty value in extracted_json (or, failing that, in the parsed output).
| Param | Type | Notes |
|---|---|---|
paths | string_array | Required. Dotted paths, e.g. customer.name, items.0.sku. Supports leading $.. |
Use case. Tool-call gating — don’t send a book_appointment action until name + phone + slot are filled.
config:
paths: ["customer.name", "customer.phone", "appointment.start_iso"]Failures land in details.missing (path absent) and details.empty (path resolved to null / "" / [] / {}).
Family 9 — LLM-backed
field_judge
LLM evaluates a single JSON field against a free-form criterion. The only verifier that calls an LLM.
| Param | Type | Notes |
|---|---|---|
path | string | Required. Dotted JSON path; leading $. accepted. |
criterion | textarea | Required. Plain-English criterion. |
min_score | number | Default 0.7. Pass threshold (0–1). |
model | string | Optional. Defaults to gpt-4o. |
Cost / latency. ~1–2 seconds per call, ~$0.001–0.005 per evaluation depending on field size. Reads extracted_json first; falls back to parsing output as JSON. Missing fields fail fast (field_not_found:<path>) with no LLM call.
Use case. Hybrid checks where deterministic rules can’t capture the requirement.
# Healthcare — clinical reasonableness
config:
path: medications.0.dose_mg
criterion: |
Given the patient's age and weight in the same JSON object,
is this dose within the standard pediatric range for this drug?
min_score: 0.7
# Legal — citation appropriateness
config:
path: arguments.0.citation
criterion: |
Is this citation a real, currently good-law case that supports
the proposition in arguments[0].claim?
min_score: 0.8details.reasoning contains the LLM’s 1–2 sentence justification.
Roadmap
Domain-specific verifiers that need new metadata fields or libraries — not yet shipped:
latency_ttfa— time-to-first-audio. Needs avoice_metrics.ttfa_msfield on logs.barge_in_ack_ms— agent stops speaking within N ms of user starting.audio_duration_ratio— agent speech / total call duration.language_consistency— agent stays in the language the user opened with.wer_against_reference— word error rate vs gold transcript. Eval-mode only.tool_call_sequence— tools fire in a required order.