Concepts
A quick mental model before the API reference.
A log is one AI interaction
The atomic unit in BEval is a log: one call to an LLM, one VLM completion, one agent turn, one retrieval step. Each log captures:
- What went in (input)
- What came out (output)
- Metadata: model, tokens, latency, cost, status
- Free-form extra for anything else you want to carry
The dashboard lists logs, scores them against evaluation harnesses, and groups them by project or time window.
Kinds
Every log has a kind field. This drives how the dashboard renders it and which eval harnesses match.
| Kind | Meaning |
|---|---|
| llm | A text-in, text-out LLM call |
| vlm | A multimodal call with image input (automatically set when you pass image=) |
| agent | One agent turn (typically a function wrapped with @beval.trace) |
| embedding | An embedding generation call |
| ocr | An OCR pass |
| completion | A raw completion call (non-chat format) |
The default is llm. The auto-wrappers pick the right kind for you.
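The rules above (image= implies vlm, otherwise default to llm) can be sketched as a small helper. This is illustrative only: resolve_kind and VALID_KINDS are hypothetical names, not part of the SDK.

```python
# Hypothetical sketch of the documented kind rules; not the SDK's internals.
VALID_KINDS = {"llm", "vlm", "agent", "embedding", "ocr", "completion"}

def resolve_kind(kind=None, image=None):
    """Pick a log kind: image= implies vlm, otherwise default to llm."""
    if kind is None:
        return "vlm" if image is not None else "llm"
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    return kind
```

In practice you rarely call anything like this yourself; the auto-wrappers set the kind on the logs they emit.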
Projects and tenants
Your API key is scoped to a tenant and, optionally, a project.
- Tenant — your organization on BEval. Set automatically from the API key.
- Project — a subdivision inside a tenant (e.g. prod-chat, staging-rag, experiment-xyz). Used for filtering in the dashboard.
If the API key has a project, logs use it by default. You can override per-call:
    beval.log(
        input="...",
        project_id="123e4567-e89b-12d3-a456-426614174000",
    )

Or set a default in init():

    beval.init(project_id="...")

external_id
An optional client-supplied ID for the log. Useful for:
- Idempotent retries — the SDK sets this automatically inside @beval.trace
- Correlation — joining a BEval log back to a row in your own database
- Debuggability — grepping logs for a specific interaction
If unset, BEval generates a UUID server-side.
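When you set external_id yourself, deriving it deterministically from your own record ID keeps retries idempotent: the same record always maps to the same log. A minimal sketch, where external_id_for and the namespace string are assumptions, not SDK API:

```python
import uuid

def external_id_for(row_id: str) -> str:
    # Same row_id always yields the same ID, so a retried
    # log call carrying this external_id maps to the same BEval log.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"myapp/interactions/{row_id}"))
```

uuid.uuid5 is name-based, so no coordination or storage is needed to regenerate the ID later when joining BEval logs back to your own rows.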
extra
A free-form JSON object stored alongside every log. Use it for anything the core fields don’t cover:
    beval.log(
        input="...",
        output="...",
        extra={
            "user_id": "u_123",
            "session_id": "s_456",
            "experiment": "rag-v2",
            "retrieval_chunks": 8,
        },
    )

It appears in the dashboard drawer and is queryable server-side. No schema is enforced.
Harnesses, rubrics, verifiers, judges
These live in the BEval dashboard, not in the SDK. The SDK only ships logs; evaluation runs on top of them.
- Verifiers — deterministic checks on the log (JSON valid, word count, regex match, field-level LLM judge)
- Rubrics — natural-language evaluation questions
- Judges — LLM models that answer rubric questions
- Harnesses — bundles of verifiers + judges that run on every log (or on demand)
See the BEval Studio docs for configuring these. The SDK’s job ends once the log is shipped.
Ingest flow
    your code
      → beval.log(...)
      → SDK payload built, enqueued (non-blocking, returns in μs)
      → background thread drains queue
      → HTTPS POST /api/v1/logs/ingest
      → gateway writes row, optionally runs harnesses

Your call returns before any HTTP happens. If the network is down, logs queue in memory (default capacity: 10,000). On queue overflow, new logs are dropped silently and a warning is emitted to the beval Python logger.
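The queueing semantics above can be approximated with the standard library. This is a sketch of the documented behavior (non-blocking enqueue, background drain, drop on overflow), not the SDK's actual implementation; IngestQueue and send are hypothetical names.

```python
import queue
import threading

class IngestQueue:
    """Sketch: non-blocking enqueue, background drain, drop-on-overflow."""

    def __init__(self, capacity=10_000, send=print):
        self._q = queue.Queue(maxsize=capacity)
        self._send = send  # stands in for the HTTPS POST to /api/v1/logs/ingest
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, payload):
        try:
            self._q.put_nowait(payload)  # returns immediately, no network I/O
        except queue.Full:
            pass  # overflow: drop; the real SDK also warns via the beval logger

    def _drain(self):
        while True:
            self._send(self._q.get())  # blocks until a payload is available
```

The daemon thread means an exiting process does not wait on the queue, which is why a flush-on-shutdown hook matters in the real SDK.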
See Reliability & Performance for failure modes in detail.