
Verifiers

Verifiers are programmatic checks that score a model response against a rule. With one exception, every verifier is pure Python: no model call, no token cost, no flakiness. The exception is field_judge, which calls an LLM and is the only verifier that incurs a per-evaluation cost.

The platform ships with 49 built-in verifier types grouped into nine families.

Verifier vs. judge:

  • Verifier — the answer is mechanically checkable: “is the output valid JSON?”, “did the response contain the patient’s MRN?”, “is there a markdown table with at least three columns?”
  • Judge (LLM rubric) — the answer needs language understanding: “was the response empathetic?”, “did the agent stay on-topic?”

Use both. Verifiers catch cheap, mechanical regressions on every call; judges sample the harder ones.


Two ways to use them

| Mode | Endpoint | When to use |
| --- | --- | --- |
| Stateless validation | POST /api/v1/verify | You already know which verifiers to run for this response (e.g. IFEval-style instruction generation produces verifier specs alongside the prompt). One round trip → one report. No setup. |
| Continuous monitoring | POST /api/v1/logs/ingest + harness setup | You want every production log scored automatically by a fixed set of verifiers + LLM judges, with a dashboard. Verifiers are registered once and attached to a harness. |

The runners and config schemas are identical in both modes — pick the path that fits your workflow.


Stateless validation — POST /api/v1/verify

Send the model output plus an inline list of verifier specs; get back a per-verifier report and an aggregate score. No resources to register.

POST /api/v1/verify
X-API-Key: sk-…
Content-Type: application/json

{
  "output": "Sure thing! So that's (555) 123-4567 — is that correct?",
  "extracted_json": { "customer": { "phone": "5551234567" } },
  "verifiers": [
    { "type": "no_emoji", "config": {} },
    { "type": "max_sentence_length", "config": { "max_words": 25 } },
    { "type": "value_echoed", "config": { "value": "5551234567" } },
    { "type": "contains_phrase", "config": { "phrase": "is that correct" } }
  ],
  "external_id": "turn_42",
  "extra": { "session_id": "abc123" }
}

Response:

{
  "record_id": "8f2c3a40-1d4b-4e3a-9f9c-2b8e7e4a6c11",
  "passed": true,
  "score": 1.0,
  "latency_ms": 4,
  "results": [
    { "type": "no_emoji", "passed": true, "score": 1.0, "flags": [], "details": {} },
    { "type": "max_sentence_length", "passed": true, "score": 1.0, "flags": [], "details": { "sentence_count": 2 } },
    { "type": "value_echoed", "passed": true, "score": 1.0, "flags": [], "details": { "value": "5551234567", "mode": "normalized" } },
    { "type": "contains_phrase", "passed": true, "score": 1.0, "flags": [], "details": { "phrase": "is that correct" } }
  ]
}

Request fields

| Field | Type | Notes |
| --- | --- | --- |
| output | string | Required. The model response to verify. |
| verifiers | array | Required. 1 to 25 entries. Each entry: { "type": "<key>", "config": { ... } }. Type keys come from GET /api/v1/models/verifier-types. |
| extracted_json | object | Optional. Used by verifiers like field_judge and required_slots. |
| external_id | string | Optional. Your own ID for this call, surfaced in dashboards and useful for joining back to your traces. Max 255 chars. |
| project_id | uuid | Optional. Falls back to the API key’s project. |
| extra | object | Optional. Arbitrary metadata stored on the persisted record. |
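
The limits in the table above (1 to 25 verifier entries, external_id up to 255 characters) can be checked client-side before the request is sent. The following is a minimal sketch under that assumption; build_verify_payload is a hypothetical helper name, not part of any SDK described here.

```python
from typing import Any, Optional

MAX_VERIFIERS = 25          # documented upper bound on "verifiers"
MAX_EXTERNAL_ID_LEN = 255   # documented upper bound on "external_id"


def build_verify_payload(
    output: str,
    verifiers: list[dict[str, Any]],
    extracted_json: Optional[dict[str, Any]] = None,
    external_id: Optional[str] = None,
    extra: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate locally, then build the request body."""
    if not isinstance(output, str):
        raise TypeError("output must be a string")
    if not 1 <= len(verifiers) <= MAX_VERIFIERS:
        raise ValueError(f"verifiers must have 1 to {MAX_VERIFIERS} entries")
    for spec in verifiers:
        if "type" not in spec:
            raise ValueError(f"verifier spec is missing 'type': {spec!r}")
    if external_id is not None and len(external_id) > MAX_EXTERNAL_ID_LEN:
        raise ValueError("external_id is longer than 255 characters")

    payload: dict[str, Any] = {
        "output": output,
        # A missing config defaults to {} so every entry has the documented shape.
        "verifiers": [
            {"type": s["type"], "config": s.get("config", {})} for s in verifiers
        ],
    }
    # Optional fields are only included when the caller supplied them.
    if extracted_json is not None:
        payload["extracted_json"] = extracted_json
    if external_id is not None:
        payload["external_id"] = external_id
    if extra is not None:
        payload["extra"] = extra
    return payload
```

Failing locally avoids a round trip that would otherwise end in a 422.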

Response fields

| Field | Type | Notes |
| --- | --- | --- |
| record_id | uuid | The persisted log row. Use it to fetch later via GET /api/v1/logs/{id}. |
| passed | bool | true if and only if every verifier passed. |
| score | float | Mean of per-verifier scores, 0.0–1.0. |
| latency_ms | int | End-to-end runner time. |
| results[] | array | Per-verifier envelope. See result shape. |

Errors

  • 400: unknown_verifier_type if any type key isn’t in the registry. Response body lists the offending keys.
  • 401: invalid or missing X-API-Key.
  • 422: payload validation (e.g. empty verifiers, more than 25 entries).
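
A client can map these status codes to distinct exceptions so callers can react differently to a bad spec, a bad key, and a malformed payload. This is a sketch only: the exception classes and the function name are hypothetical, and because the exact error body shape is not specified above, the body is only echoed into the message.

```python
from typing import Any


class VerifyError(Exception):
    """Base class for the hypothetical client-side errors below."""


class UnknownVerifierType(VerifyError):
    """HTTP 400: a verifier type key is not in the registry."""


class AuthError(VerifyError):
    """HTTP 401: invalid or missing X-API-Key."""


class PayloadError(VerifyError):
    """HTTP 422: the request payload failed validation."""


def raise_for_verify_status(status_code: int, body: Any) -> None:
    """Raise a specific exception for the documented /verify error codes.

    `body` is the decoded response body; it is attached to the message
    for debugging and never inspected.
    """
    if status_code == 400:
        raise UnknownVerifierType(f"unknown verifier type(s): {body!r}")
    if status_code == 401:
        raise AuthError("invalid or missing X-API-Key")
    if status_code == 422:
        raise PayloadError(f"payload validation failed: {body!r}")
    if status_code >= 400:
        # Any other error status is outside the documented set.
        raise VerifyError(f"unexpected status {status_code}: {body!r}")
```

With the requests library, this would be called as raise_for_verify_status(resp.status_code, resp.text) before decoding the report.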

Persistence

Every call writes a kind=verify row to your tenant’s log store with: the output, extracted_json, the verifier specs you sent, the per-verifier results, the aggregate score, and latency_ms. Use it for billing reconciliation, regression tracking, and dashboard queries — same row format as ingested production logs.

To fetch the row later:

GET /api/v1/logs/{record_id}
Authorization: Bearer <session-jwt>

Python example

import os
import requests

def verify(output, verifiers, extracted_json=None, **extra):
    return requests.post(
        "https://ai-gateway.bolder.services/api/v1/verify",
        headers={"X-API-Key": os.environ["BEVAL_API_KEY"]},
        json={
            "output": output,
            "extracted_json": extracted_json,
            "verifiers": verifiers,
            **extra,
        },
        timeout=10,
    ).json()

report = verify(
    output=assistant_response,
    extracted_json=tool_call_args,
    verifiers=instruction.verifiers,  # list you carry per instruction
    external_id=f"turn_{turn_id}",
)
if not report["passed"]:
    log.warning("validation failed",
                flags=[f for r in report["results"] for f in r["flags"]])

TypeScript example

const report = await fetch("https://ai-gateway.bolder.services/api/v1/verify", {
  method: "POST",
  headers: {
    "X-API-Key": process.env.BEVAL_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    output: assistantResponse,
    extracted_json: toolCallArgs,
    verifiers: instruction.verifiers,
    external_id: `turn_${turnId}`,
  }),
}).then((r) => r.json());

if (!report.passed) {
  console.warn("validation failed", report.results.filter((r) => !r.passed));
}

IFEval-style integration

If you generate prompts and constraints together (instruction text + a verifier spec per constraint), forward those specs unmodified:

instruction = {
    "text": "Limit your response to 25 words and end with a postscript.",
    "verifiers": [
        {"type": "word_count", "config": {"relation": "at_most", "expected": 25}},
        {"type": "postscript", "config": {"marker": "PS:"}},
    ],
}

response = call_my_llm(instruction["text"])
report = verify(response, instruction["verifiers"])

Continuous monitoring — harness flow

For tenants who want every production log scored automatically — without the caller picking verifiers each time — register verifiers once and attach them to a harness. The harness runs on every log ingested via POST /api/v1/logs/ingest.

How verifiers run in this mode

log ingest
  → active harnesses for the tenant
    → for each harness:
        → for each attached verifier:
            runner(output, config, extracted_json?)
            returns { passed, score, flags, details }
        → aggregate:
            verifier_score  = mean(scores)
            verifier_passed = all(passed)

Verifiers run synchronously in the ingest path — every runner is designed to finish in milliseconds. field_judge is the exception (1–2 s per call).
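
The flow above can be sketched in a few lines. Everything here is illustrative: the two toy runners, the RUNNERS registry, the run_harness name, and the flag strings are assumptions, not the platform's implementation. Only the runner signature and the aggregation rule (mean of scores, all of passed) come from the description above.

```python
from statistics import mean
from typing import Any, Callable, Optional

Result = dict[str, Any]
Runner = Callable[[str, dict, Optional[dict]], Result]


def contains_phrase(output: str, config: dict, extracted_json: Optional[dict] = None) -> Result:
    """Toy runner: case-insensitive substring check."""
    phrase = config["phrase"]
    ok = phrase.lower() in output.lower()
    return {
        "passed": ok,
        "score": 1.0 if ok else 0.0,
        "flags": [] if ok else ["contains_phrase:missing"],  # made-up flag text
        "details": {"phrase": phrase},
    }


def no_comma(output: str, config: dict, extracted_json: Optional[dict] = None) -> Result:
    """Toy runner: fail if the output contains a comma."""
    ok = "," not in output
    return {
        "passed": ok,
        "score": 1.0 if ok else 0.0,
        "flags": [] if ok else ["no_comma:found"],  # made-up flag text
        "details": {},
    }


# Hypothetical registry mapping a verifier type key to its runner.
RUNNERS: dict[str, Runner] = {"contains_phrase": contains_phrase, "no_comma": no_comma}


def run_harness(output: str, attached: list[dict], extracted_json: Optional[dict] = None) -> dict:
    """Run every attached verifier, then aggregate as described above.

    Assumes at least one attached verifier (mean() rejects an empty input).
    """
    results = [
        RUNNERS[spec["type"]](output, spec.get("config", {}), extracted_json)
        for spec in attached
    ]
    return {
        "verifier_score": mean(r["score"] for r in results),
        "verifier_passed": all(r["passed"] for r in results),
        "results": results,
    }
```

Because verifier_passed uses all(), a single failing verifier fails the harness even when the mean score stays high.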

Harness API

The endpoints below are specific to the monitoring path: registering verifier resources and binding them to harnesses. Two items in this section, listing verifier types and the result shape, apply to both modes.

List all verifier types

GET /api/v1/models/verifier-types

Returns the registry — every type’s key, human-readable name, description, params schema, and tags.

[
  {
    "key": "json_schema",
    "name": "JSON Schema",
    "description": "Output JSON validates against a JSON Schema (Draft 2020-12).",
    "params": [
      { "key": "schema", "label": "JSON Schema", "type": "json", "required": true }
    ],
    "tags": []
  },
  {
    "key": "no_emoji",
    "name": "No Emoji",
    "description": "Output contains no emoji (TTS reads them as 'smiley face').",
    "params": [],
    "tags": ["voice"]
  }
]

Param type values are: string, textarea, number, boolean, select, json, string_array, number_array.

Create a verifier

POST /api/v1/verifiers/tenant/{tenant_id}
Content-Type: application/json

{
  "name": "Prescription schema valid",
  "description": "Extracted prescription JSON must match the medication schema",
  "verifier_type": "json_schema",
  "config": {
    "schema": {
      "type": "object",
      "required": ["medications"],
      "properties": {
        "medications": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["drug", "dose_mg", "frequency"],
            "properties": {
              "drug": { "type": "string" },
              "dose_mg": { "type": "number" },
              "frequency": { "type": "string" }
            }
          }
        }
      }
    }
  },
  "is_active": true
}

Update / delete a verifier

PUT /api/v1/verifiers/{id}
DELETE /api/v1/verifiers/{id}

Attach verifiers to a harness

POST /api/v1/harnesses/tenant/{tenant_id}
Content-Type: application/json

{
  "name": "Clinical extraction baseline",
  "verifier_ids": ["<id1>", "<id2>", "<id3>"],
  "judge_ids": [],
  "run_on_ingest": true,
  "sampling_rate": 1.0
}

Every log ingested for this tenant will be evaluated by every attached verifier.

Run a harness on demand

POST /api/v1/logs/{log_id}/run-harness/{harness_id}

Re-runs the harness against an existing log. Replaces any prior result for the same (log_id, harness_id) pair.

Result shape

Every verifier runner returns the same envelope:

{
  "passed": true,
  "score": 1.0,
  "flags": [],
  "details": { "...verifier-specific": "..." }
}

  • passed: boolean. Aggregated as verifier_passed = all(passed).
  • score: 0.0 to 1.0. Aggregated as verifier_score = mean(scores).
  • flags: short machine-readable failure tags, e.g. "word_count:got_47_expected_at_least_50". Useful for grouping failures.
  • details: structured payload. Counts, matched values, sample failing sentences, etc.
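
Since flags are short machine-readable tags, failures can be grouped with a counter. The sketch below assumes the text before the first ":" identifies the verifier, as in the word_count example flag; a flag without a colon is counted under its full text. The function name is hypothetical.

```python
from collections import Counter


def count_flags_by_prefix(reports: list[dict]) -> Counter:
    """Count failure flags across reports, keyed by the text before the first ':'."""
    counts: Counter = Counter()
    for report in reports:
        for result in report.get("results", []):
            for flag in result.get("flags", []):
                # "word_count:got_47_expected_at_least_50" -> "word_count"
                counts[flag.split(":", 1)[0]] += 1
    return counts
```

Calling counts.most_common(5) on the result then gives the five noisiest failure groups.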

Recipes

Healthcare VLM — prescription extraction

Vision-language model reads a prescription image and emits structured JSON.

| Verifier | Why |
| --- | --- |
| json_valid | Output must parse, with no markdown wrappers. |
| json_schema | Strict shape: medications[].drug, dose_mg, frequency. |
| required_slots (patient.mrn, patient.dob) | Don’t write the row if patient identity is missing. |
| numeric_inclusion (≥ 1) | A dosage must include digits; catches “as needed” leakage. |
| forbidden_words (off-label brand list) | Off-formulary drugs flagged. |
| field_judge (medications.0.dose_mg, “is the dose plausible for the patient’s age?”) | Hybrid sanity check. |

Legal drafting agent

Agent drafts a demand letter or filing.

| Verifier | Why |
| --- | --- |
| starts_with (“Dear”) | Mandatory salutation. |
| ends_with (“Sincerely,”) | Mandatory sign-off. |
| placeholder_count (= 0) | No [CLIENT NAME] template leakage in production output. |
| forbidden_words (banned phrases, opposing-party slips) | Compliance. |
| terminal_punctuation_allowlist (“.”) | No ! or ? in formal letters. |
| regex_match (Case No\. \d{2}-CV-\d{4,6}) | Case number format. |
| max_paragraph_length (≤ 600 chars) | Readability for the bench. |

Voice agent — TTS-safe replies

Voice agent on a phone call.

| Verifier | Why |
| --- | --- |
| no_emoji | TTS pronounces them. |
| no_markdown_formatting | TTS reads **bold** as asterisk-bold-asterisk. |
| no_url_or_email | Long URLs read aloud are useless. |
| contains_phrase (“is that correct”) | Confirmation step before commit actions. |
| forbidden_phrases (e.g. “I am an AI”, “I cannot”) | Compliance + brand voice. |
| value_echoed (phone / address / order ID) | Agent must repeat back critical fields. |
| required_slots (customer.name, customer.phone) | Tool-call slots filled before booking. |
| max_sentence_length (≤ 25 words) | Long sentences sound robotic. |
| character_count (≤ 800 chars) | Hard cap on per-turn audio length. |

Reference

Every verifier returns the standard { passed, score, flags, details } envelope. Examples below show the parts that vary.

Family 1 — JSON / structure

json_valid

Output parses as valid JSON.

No params.

Use case. Tool-call agents that should emit pure JSON; VLM extraction pipelines.

config: {}

# pass
output: '{"diagnosis": "J45.901", "confidence": 0.92}'

# fail: leaked prose
output: 'sure thing — { diagnosis: J45.901 }'
flags: ["invalid_json"]
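
A check like this can be approximated with the standard library. The sketch below is an assumption about how such a runner might look, not the platform's code: the invalid_json flag comes from the example above, while the details keys are made up for illustration. Note that json.loads accepts any JSON value (including a bare number or string), and whether the real runner requires an object is not stated.

```python
import json


def json_valid(output: str) -> dict:
    """Pass if the whole output parses as JSON; otherwise flag invalid_json."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return {
            "passed": False,
            "score": 0.0,
            "flags": ["invalid_json"],
            # Hypothetical details: parser message and character offset.
            "details": {"error": exc.msg, "position": exc.pos},
        }
    return {"passed": True, "score": 1.0, "flags": [], "details": {}}
```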

json_schema

Output JSON validates against a Draft 2020-12 schema.

| Param | Type | Notes |
| --- | --- | --- |
| schema | json | Any valid JSON Schema. |

Use case. Strict enforcement of an extraction or tool-call signature.

# Healthcare VLM extracting an EOB line item
config:
  schema:
    type: object
    required: [member_id, claim_id, paid_amount]
    properties:
      member_id: { type: string, pattern: "^[A-Z0-9]{9}$" }
      claim_id: { type: string }
      paid_amount: { type: number, minimum: 0 }

Family 2 — Counts (use a relation + expected)

Every verifier in this family takes the same two parameters:

| Param | Type | Notes |
| --- | --- | --- |
| relation | select | One of at_least, at_most, equal_to, less_than, greater_than. |
| expected | number | The integer to compare against. |
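
The five relation values map directly onto comparison operators. A minimal sketch, assuming the measured count is compared on the left and expected on the right; the function names are hypothetical, and the whitespace-based word split is an assumption about tokenization.

```python
import operator
from typing import Callable

# relation key -> comparison applied as compare(count, expected)
RELATIONS: dict[str, Callable[[float, float], bool]] = {
    "at_least": operator.ge,
    "at_most": operator.le,
    "equal_to": operator.eq,
    "less_than": operator.lt,
    "greater_than": operator.gt,
}


def satisfies(count: float, relation: str, expected: float) -> bool:
    """Return True when `count <relation> expected` holds."""
    try:
        compare = RELATIONS[relation]
    except KeyError:
        raise ValueError(f"unknown relation: {relation!r}") from None
    return compare(count, expected)


def word_count_check(text: str, relation: str, expected: int) -> bool:
    """Illustrative word_count: words are whitespace-separated tokens."""
    return satisfies(len(text.split()), relation, expected)
```

For example, satisfies(47, "at_least", 50) is False, which is the situation behind a flag such as word_count:got_47_expected_at_least_50.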

word_count

Total words in the output.

# Legal: keep summary briefs to at most 250 words
config: { relation: at_most, expected: 250 }

character_count

Total characters.

# Per-turn audio length cap (voice)
config: { relation: at_most, expected: 800 }

sentence_count

Sentences (split on ./!/?).

# Healthcare: patient-instruction summary, exactly 3 sentences
config: { relation: equal_to, expected: 3 }

paragraph_count

Paragraphs (blank-line separated).

# Legal memo: minimum two paragraphs
config: { relation: at_least, expected: 2 }

bullet_count

Bullet items (-, *, , +).

# Discharge instructions: at least three action items
config: { relation: at_least, expected: 3 }

numbered_list_count

Numbered list items (1., 2., …).

# Legal procedure: exactly five steps in a checklist
config: { relation: equal_to, expected: 5 }

placeholder_count

[bracketed] placeholders.

# Legal: production output must have zero leftover template tokens
# like [CLIENT NAME], [DATE OF FILING]
config: { relation: equal_to, expected: 0 }
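
Bracketed placeholders can be found with a single regular expression. This sketch assumes a placeholder is any non-empty run of characters between square brackets on one line; the platform's exact rule is not documented here, and under this assumption markdown link text such as [docs](url) would also be counted.

```python
import re

# One "[...]" group with no nested brackets and no line break inside.
PLACEHOLDER = re.compile(r"\[[^\[\]\n]+\]")


def find_placeholders(text: str) -> list[str]:
    """Return every bracketed placeholder in order of appearance."""
    return PLACEHOLDER.findall(text)
```

With relation equal_to and expected 0, the check passes only when find_placeholders returns an empty list.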

numeric_inclusion

Total digits in the response.

# Healthcare: every dose must include digits (catches "as needed")
config: { relation: at_least, expected: 1 }

question_exclaim_count

Total ? and ! marks.

# Formal legal writing: at most one ? or ! mark in total
config: { relation: at_most, expected: 1 }

Family 3 — Affix and pattern

starts_with

Output begins with a string.

| Param | Type | Notes |
| --- | --- | --- |
| prefix | string | Required. |
| case_sensitive | boolean | Default true. |

Use case. Mandatory salutation in legal correspondence.

config: { prefix: "Dear ", case_sensitive: true }

ends_with

Output ends with a string.

| Param | Type | Notes |
| --- | --- | --- |
| suffix | string | Required. |
| case_sensitive | boolean | Default true. |

Use case. Sign-off enforcement.

config: { suffix: "Sincerely,\nCounsel" }

regex_match

Output must match a regular expression, or must not match it when must_match is false.

| Param | Type | Notes |
| --- | --- | --- |
| pattern | string | Python regex. |
| must_match | boolean | Default true. Set false to invert. |
| ignore_case | boolean | Default false. |

Use case. Custom format checks — case numbers, MRNs, ICD-10 codes.

# Legal: case number format like "Case No. 24-CV-001234"
config:
  pattern: "Case No\\. \\d{2}-CV-\\d{4,6}"
  must_match: true

# Healthcare: block free-text dosing in a structured field
config:
  pattern: "as needed|prn|q\\.d\\."
  must_match: false
  ignore_case: true
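
The interaction of must_match and ignore_case can be sketched with Python's re module. One assumption is labeled in the code: the pattern may match anywhere in the output (re.search) rather than having to cover the whole output; the documentation above does not say which.

```python
import re


def regex_match(output: str, pattern: str,
                must_match: bool = True, ignore_case: bool = False) -> dict:
    """Sketch of the regex_match semantics described above."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a match anywhere in the output counts (search, not fullmatch).
    found = re.search(pattern, output, flags) is not None
    # must_match=False inverts the check: the pattern must be absent.
    passed = found if must_match else not found
    return {"passed": passed, "score": 1.0 if passed else 0.0,
            "details": {"found": found}}  # hypothetical details key
```

In Python source the patterns are best written as raw strings, e.g. r"Case No\. \d{2}-CV-\d{4,6}", which is the same pattern as the doubly escaped YAML form.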

wrap_with

Output starts and ends with the same phrase.

| Param | Type | Notes |
| --- | --- | --- |
| wrap_phrase | string | Required. |

Use case. Markdown code fences; XML-style envelopes.

config: { wrap_phrase: "```" }

quoted

Output is fully wrapped in "...". No params.

Use case. Legal — direct-quotation extraction tasks.

title_wrapped

First non-empty line is wrapped in << >>. No params.

Use case. Report-style outputs that must lead with a <<Title>> line for downstream parsing.

# pass: "<<Quarterly Compliance Review>>\n\nbody…"
# fail: "Quarterly Compliance Review\n\nbody…"

Family 4 — Frequency

keyword_frequency

A keyword appears the required number of times (whole-word match).

| Param | Type | Notes |
| --- | --- | --- |
| keyword | string | Required. |
| relation | select | See Family 2. |
| expected | number | Required. |
| case_sensitive | boolean | Default false. |

Use case. Healthcare — mention “follow-up” at least once in a discharge summary.

config: { keyword: "follow-up", relation: at_least, expected: 1 }
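
Whole-word matching matters for a keyword like "follow-up", which should not be counted inside "follow-ups". A minimal sketch of one way to count such matches, assuming a "word boundary" means the keyword is not directly preceded or followed by a letter, digit, or underscore; the function name is hypothetical.

```python
import re


def count_keyword(text: str, keyword: str, case_sensitive: bool = False) -> int:
    """Count whole-word occurrences of keyword in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Lookarounds reject matches glued to another word character, so
    # "follow-up" is not counted inside "follow-ups".
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text, flags))
```

The resulting count is then compared with expected using the chosen relation.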

char_frequency

A single character appears the required number of times.

| Param | Type | Notes |
| --- | --- | --- |
| char | string (1 char) | Required. |
| relation / expected | select / number | Family 2 params. |
| case_sensitive | boolean | Default true. |

Use case. Detect token overuse, or count delimiters in a structured response.

alliteration

The number of words starting with a target letter satisfies the relation.

| Param | Type | Notes |
| --- | --- | --- |
| letter | string (1 char) | Required. |
| relation / expected | select / number | Family 2 params. |
| case_sensitive | boolean | Default false. |

Use case. Creative-writing harnesses; stylistic tasks.

config: { letter: "s", relation: at_least, expected: 5 }

# pass: "Sally sells silver seashells silently."

Family 5 — Keyword presence and position

keywords_all_present

Every listed keyword must appear at least once.

| Param | Type | Notes |
| --- | --- | --- |
| keywords | string_array | Required. Comma- or newline-separated input. |
| case_sensitive | boolean | Default false. |

Use case. Required talking points — patient-education content must mention every key term.

config: keywords: ["dosage", "side effects", "when to call your doctor"]

forbidden_words

None of the listed words appear (whole-word match).

| Param | Type | Notes |
| --- | --- | --- |
| words | string_array | Required. |
| case_sensitive | boolean | Default false. |

Use case. Banned terms — opposing-counsel names in a draft, off-label drugs in a clinical summary.

# Legal: never reference the opposing party informally
config: { words: ["Bob", "the other side"] }

keyword_position

The word at index position (0-based) equals keyword.

| Param | Type | Notes |
| --- | --- | --- |
| keyword | string | Required. |
| position | number | 0-indexed word position. |
| case_sensitive | boolean | Default false. |

Use case. Enforce a structural opening — first word must be "WHEREAS", etc.

palindrome_word

Output contains at least one palindrome word at or above the minimum length.

| Param | Type | Notes |
| --- | --- | --- |
| min_length | number | Default 3. |

Use case. Niche generative tasks; fun-fact validation.


Family 6 — Length constraints

max_sentence_length

No sentence exceeds the maximum word count.

| Param | Type | Notes |
| --- | --- | --- |
| max_words | number | Required, ≥ 1. |

Use case. Healthcare — patient-facing instructions, plain-language guidelines.

config: { max_words: 20 }
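
A rough sketch of this check, assuming sentences are split on runs of ".", "!" and "?" (the rule given for sentence_count) and words are whitespace-separated. The longest_sentence_words key is hypothetical; sentence_count mirrors the details field shown in the /verify response example. A naive split like this also breaks on abbreviations and decimals such as "3.5 mg".

```python
import re


def sentence_word_counts(text: str) -> list[int]:
    """Word count of each sentence, splitting on runs of '.', '!' or '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def max_sentence_length(text: str, max_words: int) -> dict:
    """Pass when no sentence has more than max_words words."""
    counts = sentence_word_counts(text)
    longest = max(counts, default=0)  # an empty output has no sentences
    passed = longest <= max_words
    return {
        "passed": passed,
        "score": 1.0 if passed else 0.0,
        "details": {
            "sentence_count": len(counts),
            "longest_sentence_words": longest,  # hypothetical key
        },
    }
```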

max_word_repetition

No word repeats more than N times.

| Param | Type | Notes |
| --- | --- | --- |
| max_repeats | number | Required, ≥ 1. |
| case_sensitive | boolean | Default false. |

Use case. Detect parroting / stuck loops.

unique_word_count

The number of distinct words satisfies the relation.

| Param | Type | Notes |
| --- | --- | --- |
| relation / expected | select / number | Family 2 params. |
| case_sensitive | boolean | Default false. |

Use case. Vocabulary diversity floor for long-form writing.

word_length_range

Every word’s length is within [min_length, max_length].

| Param | Type | Notes |
| --- | --- | --- |
| min_length | number | Required, ≥ 1. |
| max_length | number | Required, ≥ min_length. |

Use case. Plain-language requirements — flag any word longer than 12 characters in pediatric patient handouts.

avg_word_length

Average word length is within [min_ratio, max_ratio].

| Param | Type | Notes |
| --- | --- | --- |
| min_ratio | number | Required, ≥ 0. |
| max_ratio | number | Required. |

Use case. Tone calibration — readable but not childish.

config: { min_ratio: 4.5, max_ratio: 6.0 }

paragraph_word_count

Every paragraph’s word count satisfies the relation.

| Param | Type | Notes |
| --- | --- | --- |
| relation / expected | select / number | Family 2 params. |

Use case. Long-form content — legal memos, every paragraph at least 30 words.


Family 7 — Markdown structure

section_count

Number of ### <name> N markdown sections matches the relation.

| Param | Type | Notes |
| --- | --- | --- |
| section_name | string | Required. Must not contain # or *. |
| relation / expected | select / number | Family 2 params. |

Use case. Templated reports with ### Finding 1, ### Finding 2, …

nested_list

A list at least min_depth deep with at least num_subitems items exists.

| Param | Type | Notes |
| --- | --- | --- |
| min_depth | number | Default 2. |
| num_subitems | number | Default 2. |

Use case. Outline generation, structured how-to content.

config: { min_depth: 2, num_subitems: 2 }

# pass:
# - top
#   - sub-1
#   - sub-2

markdown_table

Output contains a markdown table with at least min_rows rows and min_cols columns.

| Param | Type | Notes |
| --- | --- | --- |
| min_rows | number | Default 1. |
| min_cols | number | Default 2. |

Use case. Healthcare — drug-interaction or differential-diagnosis comparison tables.

heading_depth

All required heading levels (1=#, 2=##, …) are present.

| Param | Type | Notes |
| --- | --- | --- |
| levels | number_array | Required. e.g. [1, 2] means at least one # and one ##. |

Use case. Document structure — legal briefs requiring # for sections and ## for subsections.

max_paragraph_length

No paragraph exceeds max_chars characters.

| Param | Type | Notes |
| --- | --- | --- |
| max_chars | number | Required, ≥ 1. |

Use case. Readability guardrail.

sentences_per_paragraph

Every paragraph’s sentence count satisfies the relation.

| Param | Type | Notes |
| --- | --- | --- |
| relation / expected | select / number | Family 2 params. |

Use case. Style constraint — at most three sentences per paragraph.

sentence_endings_variety

At least N distinct sentence-ending punctuation marks are used (. ! ?).

| Param | Type | Notes |
| --- | --- | --- |
| min_variants | number | Default 2. |

Use case. Detect monotone output (only .-terminated sentences).

terminal_punctuation_allowlist

Only the allowed end-of-sentence marks (subset of .!?) appear.

| Param | Type | Notes |
| --- | --- | --- |
| allowed | string | E.g. "." to forbid ! and ?. |

Use case. Formal legal writing — no exclamations.

no_comma

Output contains no commas. No params.

Use case. Simple-style writing exercises.

no_period

Output contains no periods. No params.

Use case. Stream-of-consciousness or single-sentence outputs.

postscript

Last non-empty line begins with a marker (e.g. PS:) and has content after it.

| Param | Type | Notes |
| --- | --- | --- |
| marker | string | Default "PS:". |

Use case. Letter-style outputs that must end with a postscript.


Family 8 — Voice-agent verifiers

All tagged voice in the registry. Designed for transcript and tool-call output from voice agents.

no_emoji

Output contains no emoji. No params.

Why it matters. TTS engines pronounce 😀 as “smiley face” or “grinning face emoji.” Hard fail for any voice surface.

no_markdown_formatting

Output is plain text — no **bold**, *italic*, __under__, headings, bullets, numbered lists, code fences, inline code, or pipe tables. No params.

Why it matters. TTS reads markdown literally. Detected formatting kinds appear in details.kinds for triage.

no_url_or_email

Output contains no URLs (https://..., www.x.com) or email addresses. No params.

Why it matters. “Visit aitch tee tee pee colon slash slash example dot com” is unusable. Send these via SMS or email after the call.

contains_phrase

Output contains a required substring.

| Param | Type | Notes |
| --- | --- | --- |
| phrase | string | Required. |
| case_sensitive | boolean | Default false. |

Use case. Confirmation step (“is that correct”), legal disclaimer, closing phrase.

config: { phrase: "is that correct" }

Unlike keyword_frequency, this is a substring match — works for multi-word phrases.

forbidden_phrases

Output contains none of the listed phrases (substring match).

| Param | Type | Notes |
| --- | --- | --- |
| phrases | string_array | Required. |
| case_sensitive | boolean | Default false. |

Use case. “I am an AI”, “I cannot help with that”, competitor names, internal model identifiers.

config: phrases: ["I am an AI", "as a language model", "I cannot"]

value_echoed

Agent repeats back a critical value (phone, address, order ID).

| Param | Type | Notes |
| --- | --- | --- |
| value | string | Required. The literal value to look for. |
| normalize_digits | boolean | Default true. Strips non-alphanumerics for comparison so (555) 123-4567 matches 5551234567. |
| case_sensitive | boolean | Default false. |

Use case. Confirmation patterns — “So that’s 555-123-4567 — is that right?”

config: { value: "5551234567", normalize_digits: true }

# pass: "(555) 123-4567"
# pass: "555 1234 567"
# fail: "5551234560"

details.mode reports "normalized" or "literal" depending on which match path succeeded.
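
The two match paths can be sketched as follows. This is an illustration under assumptions, not the platform's code: it tries a literal substring match first and, if that fails and normalize_digits is on, strips non-alphanumerics from both the output and the value before comparing. The function returns a (passed, mode) pair rather than the full envelope to keep the example short.

```python
import re
from typing import Optional


def _strip_non_alnum(text: str) -> str:
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


def value_echoed(output: str, value: str, normalize_digits: bool = True,
                 case_sensitive: bool = False) -> tuple[bool, Optional[str]]:
    """Return (passed, mode), where mode is "literal", "normalized", or None."""
    haystack, needle = output, value
    if not case_sensitive:
        haystack, needle = haystack.lower(), needle.lower()

    # Path 1: the value appears verbatim.
    if needle in haystack:
        return True, "literal"

    # Path 2: compare with punctuation and spacing removed from both sides.
    if normalize_digits:
        stripped = _strip_non_alnum(needle)
        if stripped and stripped in _strip_non_alnum(haystack):
            return True, "normalized"

    return False, None
```

Because the whole output is stripped before comparison, digits from adjacent tokens can combine into a match; that is what lets "555 1234 567" pass, and it is also a source of possible false positives.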

required_slots

Every listed JSON path resolves to a non-empty value in extracted_json (or, failing that, in the parsed output).

| Param | Type | Notes |
| --- | --- | --- |
| paths | string_array | Required. Dotted paths, e.g. customer.name, items.0.sku. A leading $. is accepted. |

Use case. Tool-call gating — don’t send a book_appointment action until name + phone + slot are filled.

config: paths: ["customer.name", "customer.phone", "appointment.start_iso"]

Failures land in details.missing (path absent) and details.empty (path resolved to null / "" / [] / {}).


Family 9 — LLM-backed

field_judge

LLM evaluates a single JSON field against a free-form criterion. The only verifier that calls an LLM.

| Param | Type | Notes |
| --- | --- | --- |
| path | string | Required. Dotted JSON path; leading $. accepted. |
| criterion | textarea | Required. Plain-English criterion. |
| min_score | number | Default 0.7. Pass threshold (0–1). |
| model | string | Optional. Defaults to gpt-4o. |

Cost / latency. ~1–2 seconds per call, ~$0.001–0.005 per evaluation depending on field size. Reads extracted_json first; falls back to parsing output as JSON. Missing fields fail fast (field_not_found:<path>) with no LLM call.

Use case. Hybrid checks where deterministic rules can’t capture the requirement.

# Healthcare: clinical reasonableness
config:
  path: medications.0.dose_mg
  criterion: |
    Given the patient's age and weight in the same JSON object,
    is this dose within the standard pediatric range for this drug?
  min_score: 0.7

# Legal: citation appropriateness
config:
  path: arguments.0.citation
  criterion: |
    Is this citation a real, currently good-law case that supports
    the proposition in arguments[0].claim?
  min_score: 0.8

details.reasoning contains the LLM’s 1–2 sentence justification.
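
The fail-fast behavior keeps missing fields free of LLM cost. The sketch below shows that control flow with the LLM replaced by an injected judge callable (hypothetical, returning a score between 0 and 1). Two assumptions are labeled in the code: the fallback to parsing output applies whenever the path does not resolve in extracted_json, and a score equal to min_score passes.

```python
import json
from typing import Any, Callable, Optional

_MISSING = object()


def _lookup(data: Any, path: str) -> Any:
    """Resolve a dotted path (optional leading "$."); _MISSING if absent."""
    if path.startswith("$."):
        path = path[2:]
    for part in path.split("."):
        if isinstance(data, dict) and part in data:
            data = data[part]
        elif isinstance(data, list) and part.isdigit() and int(part) < len(data):
            data = data[int(part)]
        else:
            return _MISSING
    return data


def field_judge(output: str, config: dict, extracted_json: Optional[dict],
                judge: Callable[[Any, str], float]) -> dict:
    """Look the field up first; call the judge only if the field exists."""
    path = config["path"]
    value = _lookup(extracted_json, path) if extracted_json is not None else _MISSING
    if value is _MISSING:
        # Assumption: fall back to the output parsed as JSON.
        try:
            value = _lookup(json.loads(output), path)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            value = _MISSING
    if value is _MISSING:
        # Fail fast: no judge (LLM) call is made for a missing field.
        return {"passed": False, "score": 0.0,
                "flags": [f"field_not_found:{path}"], "details": {}}

    score = judge(value, config["criterion"])
    # Assumption: the threshold is inclusive.
    passed = score >= config.get("min_score", 0.7)
    return {"passed": passed, "score": score, "flags": [], "details": {}}
```

Injecting the judge also makes the lookup logic testable without any model call.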


Roadmap

Domain-specific verifiers that need new metadata fields or libraries — not yet shipped:

  • latency_ttfa — time-to-first-audio. Needs a voice_metrics.ttfa_ms field on logs.
  • barge_in_ack_ms — agent stops speaking within N ms of user starting.
  • audio_duration_ratio — agent speech / total call duration.
  • language_consistency — agent stays in the language the user opened with.
  • wer_against_reference — word error rate vs gold transcript. Eval-mode only.
  • tool_call_sequence — tools fire in a required order.