Concepts
A quick mental model before the API reference.
A log is one AI interaction
The atomic unit in BEval is a log: one call to an LLM, one VLM completion, one agent turn, one retrieval step. Each log captures:
- What went in (input)
- What came out (output)
- Metadata: model, tokens, latency, cost, status
- Free-form extra for anything else you want to carry
The dashboard lists logs, scores them against evaluation harnesses, and groups them by project or time window.
Kinds
Every log has a kind field. This drives how the dashboard renders it and which eval harnesses match.
| Kind | Meaning |
|---|---|
| llm | A text-in, text-out LLM call |
| vlm | A multimodal call with image input (automatically set when you pass image=) |
| agent | One agent turn (typically a function wrapped with @beval.trace) |
| embedding | An embedding generation call |
| ocr | An OCR pass |
| completion | A raw completion call (non-chat format) |
The default is llm. The auto-wrappers pick the right kind for you.
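The rules above (image= implies vlm, otherwise default to llm) can be sketched as a small helper. This is illustrative only: resolve_kind and VALID_KINDS are hypothetical names, not part of the SDK.

```python
# Hypothetical sketch of the documented kind rules; not the SDK's internals.
VALID_KINDS = {"llm", "vlm", "agent", "embedding", "ocr", "completion"}

def resolve_kind(kind=None, image=None):
    """Pick a log kind: image= implies vlm, otherwise default to llm."""
    if kind is None:
        return "vlm" if image is not None else "llm"
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    return kind
```

In practice you rarely call anything like this yourself; the auto-wrappers set the kind on the logs they emit.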
Projects and tenants
Your API key is scoped to a tenant and, optionally, a project.
- Tenant — your organization on BEval. Set automatically from the API key.
- Project — a subdivision inside a tenant (e.g. prod-chat, staging-rag, experiment-xyz). Used for filtering in the dashboard.
If the API key has a project, logs use it by default. You can override per-call:
    beval.log(
        input="...",
        project_id="123e4567-e89b-12d3-a456-426614174000",
    )

Or set a default in init():

    beval.init(project_id="...")

external_id
An optional client-supplied ID for the log. Useful for:
- Idempotent retries — the SDK sets this automatically inside @beval.trace
- Correlation — joining a BEval log back to a row in your own database
- Debuggability — grepping logs for a specific interaction
If unset, BEval generates a UUID server-side.
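When you set external_id yourself, deriving it deterministically from your own record ID keeps retries idempotent: the same record always maps to the same log. A minimal sketch, where external_id_for and the namespace string are assumptions, not SDK API:

```python
import uuid

def external_id_for(row_id: str) -> str:
    # Same row_id always yields the same ID, so a retried
    # log call carrying this external_id maps to the same BEval log.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"myapp/interactions/{row_id}"))
```

uuid.uuid5 is name-based, so no coordination or storage is needed to regenerate the ID later when joining BEval logs back to your own rows.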
extra
A free-form JSON object stored alongside every log. Use it for anything the core fields don’t cover:
    beval.log(
        input="...",
        output="...",
        extra={
            "user_id": "u_123",
            "session_id": "s_456",
            "experiment": "rag-v2",
            "retrieval_chunks": 8,
        },
    )

It appears in the dashboard drawer and is queryable server-side. No schema is enforced.
Harnesses, rubrics, verifiers, judges
These live in the BEval dashboard, not in the SDK. The SDK only ships logs; evaluation runs on top of them.
- Verifiers — deterministic checks on the log (JSON valid, word count, regex match, field-level LLM judge)
- Rubrics — natural-language evaluation questions
- Judges — LLM models that answer rubric questions
- Harnesses — bundles of verifiers + judges that run on every log (or on demand)
See the BEval Studio docs for configuring these. The SDK’s job ends once the log is shipped.
Ingest flow
    your code
      → beval.log(...)
      → SDK payload built, enqueued (non-blocking, returns in μs)
      → background thread drains queue
      → HTTPS POST /api/v1/logs/ingest
      → gateway writes row, optionally runs harnesses

Your call returns before any HTTP happens. If the network is down, logs queue in memory (default capacity: 10,000). On queue overflow, new logs are dropped silently and a warning is emitted to the beval Python logger.
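The queueing semantics above can be approximated with the standard library. This is a sketch of the documented behavior (non-blocking enqueue, background drain, drop on overflow), not the SDK's actual implementation; IngestQueue and send are hypothetical names.

```python
import queue
import threading

class IngestQueue:
    """Sketch: non-blocking enqueue, background drain, drop-on-overflow."""

    def __init__(self, capacity=10_000, send=print):
        self._q = queue.Queue(maxsize=capacity)
        self._send = send  # stands in for the HTTPS POST to /api/v1/logs/ingest
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, payload):
        try:
            self._q.put_nowait(payload)  # returns immediately, no network I/O
        except queue.Full:
            pass  # overflow: drop; the real SDK also warns via the beval logger

    def _drain(self):
        while True:
            self._send(self._q.get())  # blocks until a payload is available
```

The daemon thread means an exiting process does not wait on the queue, which is why a flush-on-shutdown hook matters in the real SDK.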
See Reliability & Performance for failure modes in detail.