
Usage

Three ways to use the SDK. Pick the one that fits your code.

beval.log(...) — direct logging

The lowest-level API. Call it from anywhere.

import beval

beval.init()

beval.log(
    kind="llm",
    model_id="gpt-4o-mini",
    input="What is the capital of France?",
    output="Paris.",
    latency_ms=312,
    tokens_in=7,
    tokens_out=2,
)

All arguments are optional; include only the fields you want to see in the dashboard. The full list is in the API Reference.

Capturing latency and errors

from time import perf_counter

t0 = perf_counter()
try:
    output = call_my_llm(prompt)
    beval.log(
        kind="llm",
        input=prompt,
        output=output,
        model_id="my-model",
        latency_ms=int((perf_counter() - t0) * 1000),
        status="success",
    )
except Exception as e:
    beval.log(
        kind="llm",
        input=prompt,
        model_id="my-model",
        latency_ms=int((perf_counter() - t0) * 1000),
        status="failure",
        error_message=f"{type(e).__name__}: {e}",
    )
    raise

beval.wrap(client) — auto-instrument OpenAI or Anthropic

A one-line change. Every LLM call made through the wrapped client is logged.

OpenAI

import beval
from openai import OpenAI

beval.init()
client = beval.wrap(OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello."}],
)

Captured automatically:

  • Input messages (serialized as role: content lines)
  • Output text
  • Model name
  • Token usage (prompt_tokens, completion_tokens)
  • Latency
  • Exceptions (logged with status="failure", then re-raised)
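As a rough sketch of that message serialization (the exact format is internal to the wrapper; `serialize_messages` is a hypothetical helper, not part of the SDK), each message becomes one role: content line:

```python
def serialize_messages(messages: list[dict]) -> str:
    # Approximates the wrapper's input field: one "role: content" line per message.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

serialize_messages([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hello."},
])
# "system: Be brief.\nuser: Hello."
```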

Anthropic

import beval
from anthropic import Anthropic

beval.init()
client = beval.wrap(Anthropic())

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=64,
    messages=[{"role": "user", "content": "Hello."}],
)

Same fields captured, plus the system prompt is prepended to the input.

What’s not yet supported

  • Streaming responses (stream=True) — in 0.1 the wrappers pass streaming calls through without logging them. See Changelog for status.
  • Async clients — the same limitation; support is planned for a minor release.
  • Tool / function calling metadata — captured in extra for OpenAI, not yet for Anthropic.

For these cases, fall back to beval.log(...) directly.
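For example, a streamed OpenAI response can be logged manually once the stream is drained. This is a sketch, not SDK code: `stream_and_log` is a hypothetical helper, and the chunk handling assumes the standard chat-completions streaming shape.

```python
import beval
from time import perf_counter
from openai import OpenAI

beval.init()
client = OpenAI()  # unwrapped: streaming bypasses the wrapper in 0.1

def stream_and_log(prompt: str, model: str = "gpt-4o-mini") -> str:
    t0 = perf_counter()
    pieces = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Delta content can be None (e.g. role-only or final chunks); skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            pieces.append(chunk.choices[0].delta.content)
    output = "".join(pieces)
    beval.log(
        kind="llm",
        model_id=model,
        input=prompt,
        output=output,
        latency_ms=int((perf_counter() - t0) * 1000),
    )
    return output
```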

@beval.trace — decorate agent functions

Wraps any function (sync or async) as an agent turn.

import beval

beval.init()

@beval.trace
def run_agent(query: str) -> str:
    # ... your agent logic ...
    return answer

result = run_agent("Plan my week")

Captures:

  • Arguments (as JSON)
  • Return value (as JSON, truncated to 4 KB)
  • Latency
  • Exceptions (logged as status="failure", then re-raised)

With arguments

@beval.trace(name="tool:search", kind="agent")
async def search(q: str) -> list[dict]:
    ...
  • name — overrides the default (module.qualname). Good for grouping in the dashboard.
  • kind — defaults to "agent"; override to any log kind.
  • capture_args — set False to skip argument capture (e.g. for functions with huge inputs).
  • capture_return — set False to skip return-value capture.
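For instance, skipping argument capture for a function that takes large inputs (the names here are illustrative, not SDK conventions):

```python
@beval.trace(name="tool:summarize", capture_args=False)
def summarize(document: str) -> str:
    # `document` may be huge; skip argument capture but keep the return value.
    return document[:200]
```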

Async support

@beval.trace
async def my_async_tool(x):
    ...

Detected automatically — no separate API.

VLM — attaching images

Pass image= as bytes, base64 string, or a data: URL. The log is promoted to kind="vlm" and the image appears in the dashboard drawer.

with open("screenshot.png", "rb") as f:
    beval.log(
        input="What's in this image?",
        output="A login screen.",
        model_id="gpt-4o",
        image=f.read(),
        image_mime="image/png",
    )

For images larger than ~256 KB, consider logging the reference / URL in extra instead of inlining — base64 in JSON is expensive. Direct-to-S3 upload is on the roadmap.

Mixing approaches

All three APIs share the same background queue and config. You can use them together:

beval.init()
client = beval.wrap(OpenAI())  # wraps all client.chat.completions.create calls

@beval.trace  # logs as kind="agent"
def answer(q: str) -> str:
    # This LLM call is logged by the wrapper as kind="llm"
    resp = client.chat.completions.create(...)
    # And this manual log is recorded as kind="embedding"
    beval.log(kind="embedding", input=q, ...)
    return resp.choices[0].message.content

A single agent invocation produces multiple logs — one agent per @trace, one llm per wrapped call, one embedding per explicit log(). Nested trace support (one parent span with children) is on the roadmap — see Changelog.
