Verifiers

Verifiers are programmatic, zero-cost checks that score a model response against a rule. Every verifier is pure Python — no model call, no token cost, no flakiness. One exception is field_judge, which calls an LLM and is the only verifier that incurs a per-evaluation cost.

The platform ships with 49 built-in verifier types grouped into nine families.

Verifier vs. judge:

Verifier — the answer is mechanically checkable: “is the output valid JSON?”, “did the response contain the patient’s MRN?”, “is there a markdown table with at least three columns?”

Judge (LLM rubric) — the answer needs language understanding: “was the response empathetic?”, “did the agent stay on-topic?”

Use both. Verifiers catch cheap, mechanical regressions on every call; judges sample the harder ones.

Two ways to use them

Mode	Endpoint	When to use
Stateless validation	`POST /api/v1/verify`	You already know which verifiers to run for this response (e.g. IFEval-style instruction generation produces verifier specs alongside the prompt). One round trip → one report. No setup.
Continuous monitoring	`POST /api/v1/logs/ingest` + harness setup	You want every production log scored automatically by a fixed set of verifiers + LLM judges, with a dashboard. Verifiers are registered once and attached to a harness.

The runners and config schemas are identical in both modes — pick the path that fits your workflow.

Stateless validation — `POST /api/v1/verify`

Send the model output plus an inline list of verifier specs; get back a per-verifier report and an aggregate score. No resources to register.


POST /api/v1/verify
X-API-Key: sk-…
Content-Type: application/json
 
{
  "output": "Sure thing! So that's (555) 123-4567 — is that correct?",
  "extracted_json": { "customer": { "phone": "5551234567" } },
  "verifiers": [
    { "type": "no_emoji",            "config": {} },
    { "type": "max_sentence_length", "config": { "max_words": 25 } },
    { "type": "value_echoed",        "config": { "value": "5551234567" } },
    { "type": "contains_phrase",     "config": { "phrase": "is that correct" } }
  ],
  "external_id": "turn_42",
  "extra": { "session_id": "abc123" }
}

Response:


{
  "record_id": "8f2c3a40-1d4b-4e3a-9f9c-2b8e7e4a6c11",
  "passed": true,
  "score": 1.0,
  "latency_ms": 4,
  "results": [
    { "type": "no_emoji",            "passed": true, "score": 1.0, "flags": [], "details": {} },
    { "type": "max_sentence_length", "passed": true, "score": 1.0, "flags": [], "details": { "sentence_count": 2 } },
    { "type": "value_echoed",        "passed": true, "score": 1.0, "flags": [], "details": { "value": "5551234567", "mode": "normalized" } },
    { "type": "contains_phrase",     "passed": true, "score": 1.0, "flags": [], "details": { "phrase": "is that correct" } }
  ]
}

Request fields

Field	Type	Notes
`output`	`string`	Required. The model response to verify.
`verifiers`	`array`	Required. 1 to 25 entries. Each entry: `{ "type": "<key>", "config": { ... } }`. Type keys come from `GET /api/v1/models/verifier-types`.
`extracted_json`	`object`	Optional. Used by verifiers like `field_judge` and `required_slots`.
`external_id`	`string`	Optional. Your own ID for this call — surfaced in dashboards and useful for joining back to your traces. Max 255 chars.
`project_id`	`uuid`	Optional. Falls back to the API key’s project.
`extra`	`object`	Optional. Arbitrary metadata stored on the persisted record.

Response fields

Field	Type	Notes
`record_id`	`uuid`	The persisted log row. Use it to fetch later via `GET /api/v1/logs/{id}`.
`passed`	`bool`	`true` iff every verifier passed.
`score`	`float`	Mean of per-verifier scores, 0.0–1.0.
`latency_ms`	`int`	End-to-end runner time.
`results[]`	`array`	Per-verifier envelope. See result shape.

Errors

400 — unknown_verifier_type if any type key isn’t in the registry. Response body lists the offending keys.
401 — invalid or missing X-API-Key.
422 — payload validation (e.g. empty verifiers, more than 25 entries).

Persistence

Every call writes a kind=verify row to your tenant’s log store with: the output, extracted_json, the verifier specs you sent, the per-verifier results, the aggregate score, and latency_ms. Use it for billing reconciliation, regression tracking, and dashboard queries — same row format as ingested production logs.

To fetch the row later:


GET /api/v1/logs/{record_id}
Authorization: Bearer <session-jwt>

Python example


import os, requests
 
def verify(output, verifiers, extracted_json=None, **extra):
    return requests.post(
        "https://ai-gateway.bolder.services/api/v1/verify",
        headers={"X-API-Key": os.environ["BEVAL_API_KEY"]},
        json={
            "output": output,
            "extracted_json": extracted_json,
            "verifiers": verifiers,
            **extra,
        },
        timeout=10,
    ).json()
 
report = verify(
    output=assistant_response,
    extracted_json=tool_call_args,
    verifiers=instruction.verifiers,   # list you carry per instruction
    external_id=f"turn_{turn_id}",
)
if not report["passed"]:
    log.warning("validation failed", flags=[f for r in report["results"] for f in r["flags"]])

TypeScript example


const report = await fetch("https://ai-gateway.bolder.services/api/v1/verify", {
  method: "POST",
  headers: {
    "X-API-Key": process.env.BEVAL_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    output: assistantResponse,
    extracted_json: toolCallArgs,
    verifiers: instruction.verifiers,
    external_id: `turn_${turnId}`,
  }),
}).then((r) => r.json());
 
if (!report.passed) {
  console.warn("validation failed", report.results.filter((r) => !r.passed));
}

IFEval-style integration

If you generate prompts and constraints together (instruction text + a verifier spec per constraint), forward those specs unmodified:


instruction = {
    "text": "Limit your response to 25 words and end with a postscript.",
    "verifiers": [
        {"type": "word_count", "config": {"relation": "at_most", "expected": 25}},
        {"type": "postscript", "config": {"marker": "PS:"}},
    ],
}
 
response = call_my_llm(instruction["text"])
report = verify(response, instruction["verifiers"])

Continuous monitoring — harness flow

For tenants who want every production log scored automatically — without the caller picking verifiers each time — register verifiers once and attach them to a harness. The harness runs on every log ingested via POST /api/v1/logs/ingest.

How verifiers run in this mode


log ingest  →  active harnesses for the tenant
            →  for each harness:
                 →  for each attached verifier:
                       runner(output, config, extracted_json?)
                       returns { passed, score, flags, details }
                 →  aggregate: verifier_score = mean(scores)
                              verifier_passed = all(passed)

Verifiers run synchronously in the ingest path — every runner is designed to finish in milliseconds. field_judge is the exception (1–2 s per call).

Harness API

The endpoints below are specific to the monitoring path — registering verifier resources and binding them to harnesses. The first two (list types, result shape) apply to both modes.

List all verifier types


GET /api/v1/models/verifier-types

Returns the registry — every type’s key, human-readable name, description, params schema, and tags.


[
  {
    "key": "json_schema",
    "name": "JSON Schema",
    "description": "Output JSON validates against a JSON Schema (Draft 2020-12).",
    "params": [
      { "key": "schema", "label": "JSON Schema", "type": "json", "required": true }
    ],
    "tags": []
  },
  {
    "key": "no_emoji",
    "name": "No Emoji",
    "description": "Output contains no emoji (TTS reads them as 'smiley face').",
    "params": [],
    "tags": ["voice"]
  }
]

Param type values are: string, textarea, number, boolean, select, json, string_array, number_array.

Create a verifier


POST /api/v1/verifiers/tenant/{tenant_id}
Content-Type: application/json
 
{
  "name": "Prescription schema valid",
  "description": "Extracted prescription JSON must match the medication schema",
  "verifier_type": "json_schema",
  "config": {
    "schema": {
      "type": "object",
      "required": ["medications"],
      "properties": {
        "medications": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["drug", "dose_mg", "frequency"],
            "properties": {
              "drug":      { "type": "string" },
              "dose_mg":   { "type": "number" },
              "frequency": { "type": "string" }
            }
          }
        }
      }
    }
  },
  "is_active": true
}

Update / delete a verifier


PUT    /api/v1/verifiers/{id}
DELETE /api/v1/verifiers/{id}

Attach verifiers to a harness


POST /api/v1/harnesses/tenant/{tenant_id}
Content-Type: application/json
 
{
  "name": "Clinical extraction baseline",
  "verifier_ids": ["<id1>", "<id2>", "<id3>"],
  "judge_ids": [],
  "run_on_ingest": true,
  "sampling_rate": 1.0
}

Every log ingested for this tenant will be evaluated by every attached verifier.

Run a harness on demand


POST /api/v1/logs/{log_id}/run-harness/{harness_id}

Re-runs the harness against an existing log. Replaces any prior result for the same (log_id, harness_id) pair.

Result shape

Every verifier runner returns the same envelope:


{
  "passed": true,
  "score": 1.0,
  "flags": [],
  "details": { "...verifier-specific": "..." }
}

passed — boolean. Aggregated as verifier_passed = all(passed).
score — 0.0 to 1.0. Aggregated as verifier_score = mean(scores).
flags — short machine-readable failure tags, e.g. "word_count:got_47_expected_at_least_50". Useful for grouping failures.
details — structured payload. Counts, matched values, sample failing sentences, etc.

Recipes

Healthcare VLM — prescription extraction

Vision-language model reads a prescription image and emits structured JSON.

Verifier	Why
`json_valid`	Output must parse — no markdown wrappers.
`json_schema`	Strict shape: `medications[].drug`, `dose_mg`, `frequency`.
`required_slots` (`patient.mrn`, `patient.dob`)	Don’t write the row if patient identity is missing.
`numeric_inclusion` (≥ 1)	A dosage must include digits — catches “as needed” leakage.
`forbidden_words` (off-label brand list)	Off-formulary drugs flagged.
`field_judge` (`medications.0.dose_mg`, “is the dose plausible for the patient’s age?”)	Hybrid sanity check.

Legal drafting — formal correspondence

Agent drafts a demand letter or filing.

Verifier	Why
`starts_with` (“Dear”)	Mandatory salutation.
`ends_with` (“Sincerely,“)	Mandatory sign-off.
`placeholder_count` (= 0)	No `[CLIENT NAME]` template leakage in production output.
`forbidden_words` (banned phrases, opposing-party slips)	Compliance.
`terminal_punctuation_allowlist` (”.”)	No `!` or `?` in formal letters.
`regex_match` (`Case No\. \d{2}-CV-\d{4,6}`)	Case number format.
`max_paragraph_length` (≤ 600 chars)	Readability for the bench.

Voice agent — TTS-safe replies

Voice agent on a phone call.

Verifier	Why
`no_emoji`	TTS pronounces them.
`no_markdown_formatting`	TTS reads `bold` as asterisk-bold-asterisk.
`no_url_or_email`	Long URLs read aloud are useless.
`contains_phrase` (“is that correct”)	Confirmation step before commit actions.
`forbidden_phrases` (e.g. “I am an AI”, “I cannot”)	Compliance + brand voice.
`value_echoed` (phone / address / order ID)	Agent must repeat back critical fields.
`required_slots` (`customer.name`, `customer.phone`)	Tool-call slots filled before booking.
`max_sentence_length` (≤ 25 words)	Long sentences sound robotic.
`character_count` (≤ 800 chars)	Hard cap on per-turn audio length.

Reference

Every verifier returns the standard { passed, score, flags, details } envelope. Examples below show the parts that vary.

Family 1 — JSON / structure

`json_valid`

Output parses as valid JSON.

Param	Type	Notes

Use case. Tool-call agents that should emit pure JSON; VLM extraction pipelines.


config: {}
 
# pass
output: '{"diagnosis": "J45.901", "confidence": 0.92}'
 
# fail — leaked prose
output: 'sure thing — { diagnosis: J45.901 }'
flags: ["invalid_json"]

`json_schema`

Output JSON validates against a Draft 2020-12 schema.

Param	Type	Notes
`schema`	`json`	Any valid JSON Schema.

Use case. Strict enforcement of an extraction or tool-call signature.


# Healthcare VLM extracting an EOB line item
config:
  schema:
    type: object
    required: [member_id, claim_id, paid_amount]
    properties:
      member_id:   { type: string, pattern: "^[A-Z0-9]{9}$" }
      claim_id:    { type: string }
      paid_amount: { type: number, minimum: 0 }

Family 2 — Counts (use a `relation` + `expected`)

Every verifier in this family takes the same two parameters:

Param	Type	Notes
`relation`	`select`	One of `at_least`, `at_most`, `equal_to`, `less_than`, `greater_than`.
`expected`	`number`	The integer to compare against.

`word_count`

Total words in the output.


# Legal — keep summary briefs under 250 words
config: { relation: at_most, expected: 250 }

`character_count`

Total characters.


# Per-turn audio length cap (voice)
config: { relation: at_most, expected: 800 }

`sentence_count`

Sentences (split on ./!/?).


# Healthcare — patient-instruction summary, exactly 3 sentences
config: { relation: equal_to, expected: 3 }

`paragraph_count`

Paragraphs (blank-line separated).


# Legal memo — minimum two paragraphs
config: { relation: at_least, expected: 2 }

`bullet_count`

Bullet items (-, *, •, +).


# Discharge instructions — at least three action items
config: { relation: at_least, expected: 3 }

`numbered_list_count`

Numbered list items (1., 2., …).


# Legal procedure — exactly five steps in a checklist
config: { relation: equal_to, expected: 5 }

`placeholder_count`

[bracketed] placeholders.


# Legal — production output must have zero leftover template tokens
# like [CLIENT NAME], [DATE OF FILING]
config: { relation: equal_to, expected: 0 }

`numeric_inclusion`

Total digits in the response.


# Healthcare — every dose must include digits (catches "as needed")
config: { relation: at_least, expected: 1 }

`question_exclaim_count`

Total ? and ! marks.


# Formal legal writing — at most one question, no exclamations
config: { relation: at_most, expected: 1 }

Family 3 — Affix and pattern

`starts_with`

Output begins with a string.

Param	Type	Notes
`prefix`	`string`	Required.
`case_sensitive`	`boolean`	Default `true`.

Use case. Mandatory salutation in legal correspondence.


config: { prefix: "Dear ", case_sensitive: true }

`ends_with`

Output ends with a string.

Param	Type	Notes
`suffix`	`string`	Required.
`case_sensitive`	`boolean`	Default `true`.

Use case. Sign-off enforcement.


config: { suffix: "Sincerely,\nCounsel" }

`regex_match`

Output (must / must-not) match a regular expression.

Param	Type	Notes
`pattern`	`string`	Python regex.
`must_match`	`boolean`	Default `true`. Set `false` to invert.
`ignore_case`	`boolean`	Default `false`.

Use case. Custom format checks — case numbers, MRNs, ICD-10 codes.


# Legal — case number format like "Case No. 24-CV-001234"
config:
  pattern: "Case No\\. \\d{2}-CV-\\d{4,6}"
  must_match: true
 
# Healthcare — block free-text dosing in a structured field
config:
  pattern: "as needed|prn|q\\.d\\."
  must_match: false
  ignore_case: true

`wrap_with`

Output starts and ends with the same phrase.

Param	Type	Notes
`wrap_phrase`	`string`	Required.

Use case. Markdown code fences; XML-style envelopes.


config: { wrap_phrase: "```" }

`quoted`

Output is fully wrapped in "...". No params.

Use case. Legal — direct-quotation extraction tasks.

`title_wrapped`

First non-empty line is wrapped in << >>. No params.

Use case. Report-style outputs that must lead with a <<Title>> line for downstream parsing.


# pass:  "<<Quarterly Compliance Review>>\n\nbody…"
# fail:  "Quarterly Compliance Review\n\nbody…"

Family 4 — Frequency

`keyword_frequency`

A keyword appears the required number of times (whole-word match).

Param	Type	Notes
`keyword`	`string`	Required.
`relation`	`select`	See Family 2.
`expected`	`number`	Required.
`case_sensitive`	`boolean`	Default `false`.

Use case. Healthcare — mention “follow-up” at least once in a discharge summary.


config: { keyword: "follow-up", relation: at_least, expected: 1 }

`char_frequency`

A single character appears the required number of times.

Param	Type	Notes
`char`	`string` (1 char)	Required.
`relation` / `expected`	—	Family 2 params.
`case_sensitive`	`boolean`	Default `true`.

Use case. Detect token overuse, or count delimiters in a structured response.

`alliteration`

Words starting with a target letter, count vs relation.

Param	Type	Notes
`letter`	`string` (1 char)	Required.
`relation` / `expected`	—	Family 2 params.
`case_sensitive`	`boolean`	Default `false`.

Use case. Creative-writing harnesses; stylistic tasks.


config: { letter: "s", relation: at_least, expected: 5 }
# pass: "Sally sells silver seashells silently."

Family 5 — Keyword presence and position

`keywords_all_present`

Every listed keyword must appear at least once.

Param	Type	Notes
`keywords`	`string_array`	Required. Comma- or newline-separated input.
`case_sensitive`	`boolean`	Default `false`.

Use case. Required talking points — patient-education content must mention every key term.


config:
  keywords: ["dosage", "side effects", "when to call your doctor"]

`forbidden_words`

None of the listed words appear (whole-word match).

Param	Type	Notes
`words`	`string_array`	Required.
`case_sensitive`	`boolean`	Default `false`.

Use case. Banned terms — opposing-counsel names in a draft, off-label drugs in a clinical summary.


# Legal — never reference the opposing party informally
config: { words: ["Bob", "the other side"] }

`keyword_position`

The word at index position (0-based) equals keyword.

Param	Type	Notes
`keyword`	`string`	Required.
`position`	`number`	0-indexed word position.
`case_sensitive`	`boolean`	Default `false`.

Use case. Enforce a structural opening — first word must be "WHEREAS", etc.

`palindrome_word`

Output contains at least one palindrome word at or above the minimum length.

Param	Type	Notes
`min_length`	`number`	Default `3`.

Use case. Niche generative tasks; fun-fact validation.

Family 6 — Length constraints

`max_sentence_length`

No sentence exceeds the maximum word count.

Param	Type	Notes
`max_words`	`number`	Required, ≥ 1.

Use case. Healthcare — patient-facing instructions, plain-language guidelines.


config: { max_words: 20 }

`max_word_repetition`

No word repeats more than N times.

Param	Type	Notes
`max_repeats`	`number`	Required, ≥ 1.
`case_sensitive`	`boolean`	Default `false`.

Use case. Detect parroting / stuck loops.

`unique_word_count`

Distinct words count vs relation.

Param	Type	Notes
`relation` / `expected`	—	Family 2 params.
`case_sensitive`	`boolean`	Default `false`.

Use case. Vocabulary diversity floor for long-form writing.

`word_length_range`

Every word’s length is within [min_length, max_length].

Param	Type	Notes
`min_length`	`number`	Required, ≥ 1.
`max_length`	`number`	Required, ≥ `min_length`.

Use case. Plain-language requirements — flag any word longer than 12 characters in pediatric patient handouts.

`avg_word_length`

Average word length is within [min_ratio, max_ratio].

Param	Type	Notes
`min_ratio`	`number`	Required, ≥ 0.
`max_ratio`	`number`	Required.

Use case. Tone calibration — readable but not childish.


config: { min_ratio: 4.5, max_ratio: 6.0 }

`paragraph_word_count`

Every paragraph’s word count satisfies the relation.

Param	Type	Notes
`relation` / `expected`	—	Family 2 params.

Use case. Long-form content — legal memos, every paragraph at least 30 words.

Family 7 — Markdown structure

`section_count`

Number of ### <name> N markdown sections matches the relation.

Param	Type	Notes
`section_name`	`string`	Required. Must not contain `#` or `*`.
`relation` / `expected`	—	Family 2 params.

Use case. Templated reports with ### Finding 1, ### Finding 2, …

`nested_list`

A list at least min_depth deep with at least num_subitems items exists.

Param	Type	Notes
`min_depth`	`number`	Default `2`.
`num_subitems`	`number`	Default `2`.

Use case. Outline generation, structured how-to content.


config: { min_depth: 2, num_subitems: 2 }
# pass:
# - top
#   - sub-1
#   - sub-2

`markdown_table`

Output contains a markdown table with at least N rows and N columns.

Param	Type	Notes
`min_rows`	`number`	Default `1`.
`min_cols`	`number`	Default `2`.

Use case. Healthcare — drug-interaction or differential-diagnosis comparison tables.

`heading_depth`

All required heading levels (1=#, 2=##, …) are present.

Param	Type	Notes
`levels`	`number_array`	Required. e.g. `[1, 2]` means at least one `#` and one `##`.

Use case. Document structure — legal briefs requiring # for sections and ## for subsections.

`max_paragraph_length`

No paragraph exceeds max_chars characters.

Param	Type	Notes
`max_chars`	`number`	Required, ≥ 1.

Use case. Readability guardrail.

`sentences_per_paragraph`

Every paragraph’s sentence count satisfies the relation.

Param	Type	Notes
`relation` / `expected`	—	Family 2 params.

Use case. Style constraint — at most three sentences per paragraph.

`sentence_endings_variety`

At least N distinct sentence-ending punctuation marks are used (. ! ?).

Param	Type	Notes
`min_variants`	`number`	Default `2`.

Use case. Detect monotone output (only .-terminated sentences).

`terminal_punctuation_allowlist`

Only the allowed end-of-sentence marks (subset of .!?) appear.

Param	Type	Notes
`allowed`	`string`	E.g. `"."` to forbid `!` and `?`.

Use case. Formal legal writing — no exclamations.

`no_comma`

Output contains no commas. No params.

Use case. Simple-style writing exercises.

`no_period`

Output contains no periods. No params.

Use case. Stream-of-consciousness or single-sentence outputs.

`postscript`

Last non-empty line begins with a marker (e.g. PS:) and has content after it.

Param	Type	Notes
`marker`	`string`	Default `"PS:"`.

Use case. Letter-style outputs that must end with a postscript.

Family 8 — Voice-agent verifiers

All tagged voice in the registry. Designed for transcript and tool-call output from voice agents.

`no_emoji`

Output contains no emoji. No params.

Why it matters. TTS engines pronounce 😀 as “smiley face” or “grinning face emoji.” Hard fail for any voice surface.

`no_markdown_formatting`

Output is plain text — no **bold**, *italic*, __under__, headings, bullets, numbered lists, code fences, inline code, or pipe tables. No params.

Why it matters. TTS reads markdown literally. Detected formatting kinds appear in details.kinds for triage.

`no_url_or_email`

Output contains no URLs (https://..., www.x.com) or email addresses. No params.

Why it matters. “Visit aitch tee tee pee colon slash slash example dot com” is unusable. Send these via SMS or email after the call.

`contains_phrase`

Output contains a required substring.

Param	Type	Notes
`phrase`	`string`	Required.
`case_sensitive`	`boolean`	Default `false`.

Use case. Confirmation step (“is that correct”), legal disclaimer, closing phrase.


config: { phrase: "is that correct" }

Unlike keyword_frequency, this is a substring match — works for multi-word phrases.

`forbidden_phrases`

Output contains none of the listed phrases (substring match).

Param	Type	Notes
`phrases`	`string_array`	Required.
`case_sensitive`	`boolean`	Default `false`.

Use case. “I am an AI”, “I cannot help with that”, competitor names, internal model identifiers.


config:
  phrases: ["I am an AI", "as a language model", "I cannot"]

`value_echoed`

Agent repeats back a critical value (phone, address, order ID).

Param	Type	Notes
`value`	`string`	Required. The literal value to look for.
`normalize_digits`	`boolean`	Default `true`. Strips non-alphanumerics for comparison so `(555) 123-4567` matches `5551234567`.
`case_sensitive`	`boolean`	Default `false`.

Use case. Confirmation patterns — “So that’s 555-123-4567 — is that right?”


config: { value: "5551234567", normalize_digits: true }
# pass: "(555) 123-4567"
# pass: "555 1234 567"
# fail: "5551234560"

details.mode reports "normalized" or "literal" depending on which match path succeeded.

`required_slots`

Every listed JSON path resolves to a non-empty value in extracted_json (or, failing that, in the parsed output).

Param	Type	Notes
`paths`	`string_array`	Required. Dotted paths, e.g. `customer.name`, `items.0.sku`. Supports leading `$.`.

Use case. Tool-call gating — don’t send a book_appointment action until name + phone + slot are filled.


config:
  paths: ["customer.name", "customer.phone", "appointment.start_iso"]

Failures land in details.missing (path absent) and details.empty (path resolved to null / "" / [] / {}).

Family 9 — LLM-backed

`field_judge`

LLM evaluates a single JSON field against a free-form criterion. The only verifier that calls an LLM.

Param	Type	Notes
`path`	`string`	Required. Dotted JSON path; leading `$.` accepted.
`criterion`	`textarea`	Required. Plain-English criterion.
`min_score`	`number`	Default `0.7`. Pass threshold (0–1).
`model`	`string`	Optional. Defaults to `gpt-4o`.

Cost / latency. ~1–2 seconds per call, ~$0.001–0.005 per evaluation depending on field size. Reads extracted_json first; falls back to parsing output as JSON. Missing fields fail fast (field_not_found:<path>) with no LLM call.

Use case. Hybrid checks where deterministic rules can’t capture the requirement.


# Healthcare — clinical reasonableness
config:
  path: medications.0.dose_mg
  criterion: |
    Given the patient's age and weight in the same JSON object,
    is this dose within the standard pediatric range for this drug?
  min_score: 0.7
 
# Legal — citation appropriateness
config:
  path: arguments.0.citation
  criterion: |
    Is this citation a real, currently good-law case that supports
    the proposition in arguments[0].claim?
  min_score: 0.8

details.reasoning contains the LLM’s 1–2 sentence justification.

Roadmap

Domain-specific verifiers that need new metadata fields or libraries — not yet shipped:

latency_ttfa — time-to-first-audio. Needs a voice_metrics.ttfa_ms field on logs.
barge_in_ack_ms — agent stops speaking within N ms of user starting.
audio_duration_ratio — agent speech / total call duration.
language_consistency — agent stays in the language the user opened with.
wer_against_reference — word error rate vs gold transcript. Eval-mode only.
tool_call_sequence — tools fire in a required order.

Verifiers

Two ways to use them

Stateless validation — POST /api/v1/verify

Request fields

Response fields

Errors

Persistence

Python example

TypeScript example

IFEval-style integration

Continuous monitoring — harness flow

How verifiers run in this mode

Harness API

List all verifier types

Create a verifier

Update / delete a verifier

Attach verifiers to a harness

Run a harness on demand

Result shape

Recipes

Healthcare VLM — prescription extraction

Legal drafting — formal correspondence

Voice agent — TTS-safe replies

Reference

Family 1 — JSON / structure

json_valid

json_schema

Family 2 — Counts (use a relation + expected)

word_count

character_count

sentence_count

paragraph_count

bullet_count

numbered_list_count

placeholder_count

numeric_inclusion

question_exclaim_count

Family 3 — Affix and pattern

starts_with

ends_with

regex_match

wrap_with

quoted

title_wrapped

Family 4 — Frequency

keyword_frequency

char_frequency

alliteration

Family 5 — Keyword presence and position

keywords_all_present

forbidden_words

keyword_position

palindrome_word

Family 6 — Length constraints

max_sentence_length

max_word_repetition

unique_word_count

word_length_range

avg_word_length

paragraph_word_count