
Concepts

A quick mental model before the API reference.

A log is one AI interaction

The atomic unit in BEval is a log: one call to an LLM, one VLM completion, one agent turn, one retrieval step. Each log captures:

  • What went in (input)
  • What came out (output)
  • Metadata: model, tokens, latency, cost, status
  • Free-form extra for anything else you want to carry

The dashboard lists logs, scores them against evaluation harnesses, and groups them by project or time window.
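Concretely, one log might carry fields like the following. Only input, output, kind, and extra are names confirmed by this page; the metadata field names (model, tokens, latency_ms, cost, status) are illustrative, not the SDK's exact schema:

```python
# A sketch of one log's contents. Metadata field names here are
# illustrative only — check the API reference for the exact schema.
log = {
    "input": "What's our refund policy?",         # what went in
    "output": "Refunds are available within...",  # what came out
    "model": "gpt-4o",                            # metadata
    "tokens": 512,
    "latency_ms": 840,
    "cost": 0.0041,
    "status": "ok",
    "kind": "llm",
    "extra": {"user_id": "u_123"},                # free-form carry-all
}
```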

Kinds

Every log has a kind field. This drives how the dashboard renders it and which eval harnesses match.

  • llm — A text-in, text-out LLM call
  • vlm — A multimodal call with image input (automatically set when you pass image=)
  • agent — One agent turn (typically a function wrapped with @beval.trace)
  • embedding — An embedding generation call
  • ocr — An OCR pass
  • completion — A raw completion call (non-chat format)

The default is llm. The auto-wrappers pick the right kind for you.
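The selection rules above can be sketched as a small function. This is an illustration of the behavior described here (explicit kind wins, image= implies vlm, everything else defaults to llm), not the SDK's actual implementation:

```python
def resolve_kind(image=None, kind=None):
    """Illustrative sketch of the auto-kind behavior described in the docs.
    Not the SDK's actual code."""
    if kind is not None:     # an explicit kind always wins
        return kind
    if image is not None:    # passing image= marks the call as multimodal
        return "vlm"
    return "llm"             # the default

resolve_kind()                    # → "llm"
resolve_kind(image=b"\x89PNG")    # → "vlm"
resolve_kind(kind="ocr")          # → "ocr"
```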

Projects and tenants

Your API key is scoped to a tenant and, optionally, a project.

  • Tenant — your organization on BEval. Set automatically from the API key.
  • Project — a subdivision inside a tenant (e.g. prod-chat, staging-rag, experiment-xyz). Used for filtering in the dashboard.

If the API key has a project, logs use it by default. You can override per-call:

beval.log(
    input="...",
    project_id="123e4567-e89b-12d3-a456-426614174000",
)

Or set a default in init():

beval.init(project_id="...")

external_id

An optional client-supplied ID for the log. Useful for:

  • Idempotent retries — the SDK sets this automatically inside @beval.trace
  • Correlation — joining a BEval log back to a row in your own database
  • Debuggability — grepping logs for a specific interaction

If unset, BEval generates a UUID server-side.
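One way to get IDs that are both idempotent and correlatable is to derive external_id deterministically from your own primary key, e.g. with a UUIDv5. The helper and namespace string below are examples, not part of the SDK:

```python
import uuid

def stable_external_id(row_id: str) -> str:
    """Derive a deterministic external_id from your own database key,
    so retries of the same interaction map to the same BEval log.
    (Example helper — the namespace prefix is arbitrary but must stay stable.)"""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"myapp/logs/{row_id}"))

# Same input always yields the same ID — that's what makes retries idempotent.
stable_external_id("order_789") == stable_external_id("order_789")  # → True
```

You would then pass the result as the log's external_id, e.g. beval.log(..., external_id=stable_external_id(row_id)).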

extra

A free-form JSON object stored alongside every log. Use it for anything the core fields don’t cover:

beval.log(
    input="...",
    output="...",
    extra={
        "user_id": "u_123",
        "session_id": "s_456",
        "experiment": "rag-v2",
        "retrieval_chunks": 8,
    },
)

The extra object appears in the dashboard drawer and is queryable server-side. No schema is enforced.

Harnesses, rubrics, verifiers, judges

These live in the BEval dashboard, not in the SDK. The SDK only ships logs; evaluation runs on top of them.

  • Verifiers — deterministic checks on the log (JSON valid, word count, regex match, field-level LLM judge)
  • Rubrics — natural-language evaluation questions
  • Judges — LLM models that answer rubric questions
  • Harnesses — bundles of verifiers + judges that run on every log (or on demand)

See the BEval Studio docs for configuring these. The SDK’s job ends once the log is shipped.

Ingest flow

your code
  → beval.log(...)
  → SDK builds the payload and enqueues it (non-blocking, returns in μs)
  → background thread drains the queue
  → HTTPS POST /api/v1/logs/ingest
  → gateway writes the row, optionally runs harnesses

Your call returns before any HTTP happens. If the network is down, logs queue in memory (default capacity: 10,000). On queue overflow, new logs are dropped silently and a warning is emitted to the beval Python logger.
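Because overflow drops are silent apart from that logger warning, make sure the beval Python logger is actually visible in production. Standard library logging is enough:

```python
import logging

# Route the SDK's warnings (e.g. queue-overflow drops) to your log output.
# The logger name "beval" is the one the docs say the SDK emits to.
logging.basicConfig(level=logging.WARNING)
logging.getLogger("beval").setLevel(logging.WARNING)
```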

See Reliability & Performance for failure modes in detail.
