A Weave Eval Skill
name: weave-eval
description: Convert any kind of evaluation code into a Weave Evaluation. Use whenever the user has scoring code, an evaluation loop, a model-vs-ground-truth comparison, a list of predictions with metrics, a pandas DataFrame / CSV / JSONL of results to log a posteriori, a per-row loop with a black-box model (remote API, third-party agent), or any "I'm checking how well my model/LLM/agent performs" workflow and wants it logged to Weave. Covers both the declarative (`weave.Evaluation`) and imperative (`weave.EvaluationLogger`) APIs and picks between them. Triggers on phrases like "convert this to a weave evaluation", "wrap this in weave.Evaluation", "log this as a weave eval", "use EvaluationLogger", "log this batch to weave", "track this in weave", or any code resembling an eval loop where the user asks for Weave integration. Also use proactively when the user mentions evaluating LLMs/models and is already using `weave.init` — the right answer is almost always one of the two APIs in this skill.

Weave Evaluations: Conversion Skill

Weave has two evaluation APIs. Picking the right one is more than half the work.

The two APIs at a glance:

  • Declarative (weave.Evaluation): use when the user has (or can produce) a dataset + a model/predict function + scorers, and is happy to let Weave drive the loop. Best for fresh code, parallel runs, trials, auto-summary.
  • Imperative (weave.EvaluationLogger): use when the user already has an evaluation loop they don't want to refactor, scoring happens out-of-band, results already exist as a DataFrame/list, or they're logging from a notebook/script that owns its own iteration.

When in doubt: if the user shows you a for loop, default to EvaluationLogger (preserves their structure). If they show a dataset and a scoring function but no loop yet, default to Evaluation (Weave runs it for them).

Decision flow

  1. A posteriori logging — the eval already ran (results live in a DataFrame, CSV, JSONL, prior run output, spreadsheet, etc.) and we're just shipping it to Weave for visibility? → Imperative, almost always log_example per row. Do not rebuild a model; the prediction already exists.
  2. Black-box per-row output — there's a loop, and each row calls something Weave can't (or shouldn't) trace into: a remote API, a third-party agent, a subprocess, a service over HTTP. The "model" returns a value and that's all you have? → Imperative. Treat each call's return as the output. Don't try to wrap the black box in @weave.op.
  3. Existing working loop with traceable code that calls your own model and computes scores per row? → Imperative, with optional with blocks if the model itself is @weave.op-decorated and you want children traces.
  4. Clean separation: a dataset + a model(x) you control + scoring functions, and you want Weave to run the loop with parallelism / trials / auto-summary? → Declarative (weave.Evaluation).
  5. Comparing multiple models against the same dataset? → Declarative, run evaluate() once per model — Weave links them by the shared evaluation ref.

The big mental split: declarative is "Weave drives", imperative is "you drive, Weave records". Anything with already-computed results, opaque calls, or a loop you don't want to refactor is imperative.

Declarative API: weave.Evaluation

Source of truth: weave/evaluation/eval.py.

Minimal shape

import asyncio
import weave

weave.init("my-project")

dataset = [
    {"question": "Capital of France?", "expected": "Paris"},
    {"question": "Author of 1984?",     "expected": "George Orwell"},
]

@weave.op
def match_score(expected: str, output: str) -> dict:
    return {"match": expected.lower() in output.lower()}

@weave.op
def my_model(question: str) -> str:
    # your real model / LLM call here
    return "Paris" if "France" in question else "..."

evaluation = weave.Evaluation(dataset=dataset, scorers=[match_score])
asyncio.run(evaluation.evaluate(my_model))

Three things to get right

  1. Scorer parameter names must match dataset column names, plus output for the model's return value. So a dataset row {"question": ..., "expected": ...} pairs with a scorer whose signature is def scorer(expected, output). Extra dataset columns are passed if the scorer asks for them; missing ones raise.
  2. The model must be valid per is_valid_model: a weave.Model subclass instance, an @weave.op-decorated function, or a saved WeaveObject with a predict op. Plain functions or callables fail.
  3. evaluate() is async. From sync code use asyncio.run(...). Inside an existing event loop, await it directly.

Scorer styles

# Function scorer — simplest, must be @weave.op
@weave.op
def exact_match(expected: str, output: str) -> bool:
    return expected == output

# Class scorer — when you need state, configuration, or a custom summarize
class ToleranceScorer(weave.Scorer):
    tolerance: float = 0.01
    @weave.op
    def score(self, expected: float, output: float) -> dict:
        return {"close_enough": abs(expected - output) < self.tolerance}
    @weave.op
    def summarize(self, score_rows: list) -> dict:
        n = sum(1 for r in score_rows if r["close_enough"])
        return {"pass_rate": n / len(score_rows) if score_rows else 0.0}

Model styles

# Style 1: @weave.op function — quickest
@weave.op
def predict(question: str) -> str: ...

# Style 2: Model subclass — when you want config tracked & versioned
class MyModel(weave.Model):
    system_prompt: str
    temperature: float = 0.0
    @weave.op
    def predict(self, question: str) -> str:
        ...

evaluation = weave.Evaluation(dataset=ds, scorers=[s])
asyncio.run(evaluation.evaluate(MyModel(system_prompt="You are…")))

Useful options

  • trials=N — run every row N times (variance, non-determinism).
  • evaluation_name="my run" — persistent label on the Evaluation object; can also be a callable returning a string per call.
  • metadata={"git_sha": ..., "model_size": ...} — arbitrary metadata stored on the eval.
  • preprocess_model_input=fn — transforms each row before the model sees it. Scorers still receive the original row (this is a documented gotcha).
  • Per-run display name (different from persistent evaluation_name): pass __weave={"display_name": "..."} into evaluate() to label that one run distinctly in the UI.
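
A sketch combining several of these options, reusing dataset, match_score, and my_model from the minimal shape above; the option values themselves are illustrative:

import asyncio
import weave

weave.init("my-project")

def add_context(row: dict) -> dict:
    # Only the model sees the transformed row; scorers still receive the raw dataset row.
    return {"question": f"Answer briefly: {row['question']}"}

evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[match_score],
    trials=3,                                    # run every row 3 times
    evaluation_name="qa-eval-v1",                # persistent label on the Evaluation object
    metadata={"git_sha": "abc123", "model_size": "7b"},
    preprocess_model_input=add_context,
)

# Per-run display name, distinct from the persistent evaluation_name above
asyncio.run(evaluation.evaluate(my_model, __weave={"display_name": "run 2026-04-27"}))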

Imperative API: weave.EvaluationLogger

Source of truth: weave/evaluation/eval_imperative.py.

Minimal shape

import weave
from weave import EvaluationLogger

weave.init("my-project")

ev = EvaluationLogger(
    name="my-eval-2026-04",          # optional: display name for the run
    model="my-model-v1",             # str | dict | weave.Model
    dataset="qa-set",                # str | list[dict] | weave.Dataset
    scorers=["accuracy", "f1"],      # optional closed list of expected scorer names
    eval_attributes={"git_sha": "abc123", "notes": "..."},  # optional metadata
)

for row in rows:
    pred = ev.log_prediction(inputs={"question": row["q"]}, output=run_my_model(row["q"]))
    pred.log_score("accuracy", float(pred.output == row["expected"]))
    pred.finish()

ev.log_summary({"notes": "first run"})

Three things to get right

  1. Canonical import is from weave import EvaluationLogger. It's also reachable via weave.EvaluationLogger. Do not import from weave.flow.eval_imperative or weave.evaluation.eval_imperative — those are internal paths that may move.
  2. scorers=[...] is a closed list of expected scorer names. If you set it, list every scorer name you'll later log. Logging a name not in this list produces "Scorer 'X' is not in the predefined scorers list" warnings on every row. If you don't know the full list up front, omit scorers= entirely — Weave will accept any name you log.
  3. Always finish each prediction before log_summary(). Either call pred.finish() explicitly, or use the context-manager form which finishes automatically. log_summary() is terminal — after it runs, no more predictions or scores.

Context-manager form (captures nested ops)

When the prediction itself involves traced ops (LLM calls, tools), use the with block so those calls become children of the predict call in the trace tree:

with ev.log_prediction(inputs={"q": row["q"]}) as pred:
    response = my_traced_llm_call(row["q"])  # becomes a child of pred
    pred.output = response.text
    pred.log_score("correctness", grade(response.text, row["expected"]))
# finish() runs on __exit__

For scoring that itself involves traced computation, the same pattern works on log_score:

with pred.log_score("rubric") as s:
    s.value = run_rubric_grader(pred.output)  # rubric calls become children of the score

log_example shortcut

When inputs, output, and all scores are known up front (e.g. iterating a DataFrame of past predictions):

ev.log_example(
    inputs={"question": row["q"]},
    output=row["prediction"],
    scores={"accuracy": row["acc"], "bleu": row["bleu"]},
)

This is the right call for "I already ran the model offline, just log it."

Conversion patterns (the heart of this skill)

Match the user's input shape to the recipe.

Pattern 1: existing for loop with a model and per-row scoring

# Before
results = []
for row in rows:
    pred = my_model(row["question"])
    correct = pred == row["expected"]
    results.append(correct)
print("acc:", sum(results) / len(results))

Imperative, minimal change:

ev = EvaluationLogger(model="my-model", dataset=rows, scorers=["accuracy"])
for row in rows:
    pred = ev.log_prediction(inputs={"question": row["question"]}, output=my_model(row["question"]))
    pred.log_score("accuracy", pred.output == row["expected"])
    pred.finish()
ev.log_summary()

If they're willing to refactor, the declarative version is cleaner — wrap my_model in @weave.op, write @weave.op def accuracy(expected, output): ..., and call Evaluation(...).evaluate(...).
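
A sketch of that declarative refactor, using the same rows as the before snippet; the model body is a placeholder:

import asyncio
import weave

weave.init("my-project")

@weave.op
def my_model(question: str) -> str:
    ...  # same model as before, now traced

@weave.op
def accuracy(expected: str, output: str) -> dict:
    return {"accuracy": output == expected}

# rows keep their "question" and "expected" keys: "question" feeds the model, "expected" feeds the scorer
evaluation = weave.Evaluation(dataset=rows, scorers=[accuracy])
asyncio.run(evaluation.evaluate(my_model))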

Pattern 2: list of dicts (or Dataset) plus a scoring function

Declarative. Make sure scorer params match dataset keys + output.

ds = [{"q": "...", "gold": "..."}, ...]

@weave.op
def my_model(q: str) -> str: ...

@weave.op
def score(gold: str, output: str) -> dict:
    return {"em": gold == output}

asyncio.run(weave.Evaluation(dataset=ds, scorers=[score]).evaluate(my_model))

Pattern 3: a posteriori logging — predictions already exist

This is the catch-all for "I already ran the eval, just put it in Weave". The source can be a pandas DataFrame, a CSV, a JSONL file, a SQL query, last week's notebook output, anything. No model is re-run. Use log_example per row.

from weave import EvaluationLogger

ev = EvaluationLogger(
    model="offline-run-2026-04",   # any string identifying the run
    dataset="qa-eval-v3",
    # omit `scorers=` unless you want a closed list — see gotchas
)
for _, row in df.iterrows():       # or: for row in json.load(f), etc.
    ev.log_example(
        inputs={"question": row["question"]},
        output=row["prediction"],
        scores={"accuracy": row["acc"], "rouge": row["rouge"]},
    )
ev.log_summary({"source_csv": path})

The trick: log_example does log_prediction + log_scores + finish() in one call, so it's the cleanest fit for "I have all the data, just record it." Use log_prediction + log_score + finish() only when you want to attach extra trace context per row (rare in a-posteriori land).
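
For reference, a sketch of the longer form that log_example collapses (same DataFrame as above; only worth the extra lines when you need that per-row trace context):

for _, row in df.iterrows():
    pred = ev.log_prediction(
        inputs={"question": row["question"]},
        output=row["prediction"],
    )
    pred.log_score("accuracy", row["acc"])
    pred.log_score("rouge", row["rouge"])
    pred.finish()

ev.log_summary({"source_csv": path})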

Pattern 3b: black-box model in a per-row loop

The user has a loop, and each row calls something opaque to Weave: a remote LLM API, an internal HTTP service, a subprocess, a 3rd-party agent. They get a value back, score it, move on. They don't want (or can't) wrap the call in @weave.op.

Imperative. Treat the call's return as output; log it the same way as a-posteriori, just inline:

from weave import EvaluationLogger

ev = EvaluationLogger(model="my-remote-llm-v1", dataset="qa-set")

for row in dataset:
    output = call_black_box_api(row["question"])   # un-traced; Weave only sees the return
    score = grade(output, row["expected"])
    ev.log_example(
        inputs={"question": row["question"]},
        output=output,
        scores={"correctness": score},
    )

ev.log_summary()

Why not declarative here? weave.Evaluation requires a model that's an Op, a Model, or a saved object with predict. A bare network call doesn't fit, and trying to shoe-horn it via @weave.op only logs the wrapper — there's nothing more to capture inside. Imperative is honest about the boundary.

If you do want partial visibility (e.g. logging request/response payloads), wrap the black-box call in a thin @weave.op and use the context-manager log_prediction so the op call becomes a child of the predict call:

@weave.op  # records inputs/output of the API call as a traced op
def call_api(question: str) -> str:
    return call_black_box_api(question)

for row in dataset:
    with ev.log_prediction(inputs={"question": row["question"]}) as pred:
        pred.output = call_api(row["question"])  # nested under pred
        pred.log_score("correctness", grade(pred.output, row["expected"]))

ev.log_summary()

Pattern 4: a class with a .predict method

Declarative with weave.Model. Subclass it (so config is captured and versioned) and decorate predict with @weave.op.

class MyClassifier(weave.Model):
    threshold: float = 0.5
    @weave.op
    def predict(self, text: str) -> str:
        return "pos" if self._score(text) > self.threshold else "neg"

asyncio.run(weave.Evaluation(dataset=ds, scorers=[s]).evaluate(MyClassifier()))

If the class is third-party and you can't subclass cleanly, wrap it: @weave.op def predict(text): return clf.predict(text).

Pattern 5: a plain function that returns a prediction

Declarative. Decorate with @weave.op and pass directly. A bare function will be rejected by is_valid_model.
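
A minimal sketch, reusing ds and score from Pattern 2:

@weave.op            # without this decorator, is_valid_model rejects the bare function
def predict(q: str) -> str:
    ...              # your real prediction logic here

asyncio.run(weave.Evaluation(dataset=ds, scorers=[score]).evaluate(predict))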

Pattern 6: scoring function takes (expected, actual)

The Weave convention is (expected, output). Rename actual to output, or write an adapter:

@weave.op
def adapter(expected: str, output: str) -> dict:
    return your_scorer(expected, output)

Dataset key is expected; the model's return becomes output automatically.

Pattern 7: scores computed asynchronously / out-of-band (e.g. human review)

Imperative. Log predictions first and finish() them with no scores attached. Later, when ratings come in, start a fresh EvaluationLogger keyed off the same dataset, or attach scores to the original calls via the calls API. The point: imperative lets you decouple prediction time from scoring time.
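
A minimal sketch of the decoupled flow, assuming ratings arrive later from some review tool (load_human_ratings is a hypothetical helper mapping question to rating):

from weave import EvaluationLogger

# Phase 1: log predictions now, no scores yet
ev = EvaluationLogger(model="my-model-v1", dataset="qa-set")
predictions = {}
for row in rows:
    output = my_model(row["q"])
    predictions[row["q"]] = output
    ev.log_prediction(inputs={"question": row["q"]}, output=output).finish()
ev.log_summary({"status": "awaiting human review"})

# Phase 2 (later): start a fresh logger over the same dataset and log the ratings
ratings = load_human_ratings()   # hypothetical helper: {question: rating}
ev2 = EvaluationLogger(model="my-model-v1", dataset="qa-set")
for row in rows:
    ev2.log_example(
        inputs={"question": row["q"]},
        output=predictions[row["q"]],
        scores={"human_rating": ratings[row["q"]]},
    )
ev2.log_summary({"scored_by": "human review"})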

Pattern 8: comparing multiple models on the same dataset

Declarative, run once per model. Build the Evaluation once, then call evaluate(model_a) and evaluate(model_b). The Weave UI groups them automatically because they share the evaluation ref.

evaluation = weave.Evaluation(dataset=ds, scorers=[score])
for m in [model_a, model_b, model_c]:
    asyncio.run(evaluation.evaluate(m))

Common gotchas (read before debugging)

  • Token/cost tracking requires EvaluationLogger to exist BEFORE you call your LLM. Token usage and cost data are only captured for @weave.op-decorated calls that happen while a logger is alive. The wrong order silently drops token data:
    # ❌ wrong — logger created after the call, no token data
    for row in rows:
        output = my_llm(row["q"])         # call already logged with no parent
        ev = EvaluationLogger(...)         # too late
        ev.log_prediction(...)
    
    # ✅ right — logger exists when the call happens
    ev = EvaluationLogger(...)
    for row in rows:
        output = my_llm(row["q"])         # captured under the eval
        ev.log_example(...)
    This applies even more strongly with the context-manager form (with ev.log_prediction(...) as pred:), which makes the LLM call a child of the predict call so token attribution is exact.
  • "My scorer says it's missing output" — old Weave code may use model_output instead of output. Both work, but don't mix in one scorer; pick one and stay consistent. The codebase auto-detects legacy scorers and switches the output key.
  • "is_valid_model rejected my callable" — wrap with @weave.op or make it a weave.Model subclass. Plain functions and lambdas are not accepted.
  • "Scorer didn't see my preprocessed input" — that's intentional. preprocess_model_input only feeds the model. Scorers always see the raw dataset row. If you need preprocessing in the scorer too, do it inside the scorer or pre-bake the dataset.
  • "Trials are running serially / too fast / too slow" — parallelism is controlled by WEAVE_PARALLELISM (read by get_weave_parallelism()). Tune this if you need to throttle (rate limits) or speed up.
  • "My imperative eval shows no summary" — you forgot ev.log_summary(...) (or ev.finish()), or a prediction wasn't finish()ed before summary. Use with blocks to make this automatic.
  • "Can I add scores after log_summary?" — no, the eval is finalized. Start a new EvaluationLogger.
  • Async caveat: Evaluation.evaluate is async. From a Jupyter cell (which already has a running event loop), await evaluation.evaluate(model) directly. From a plain script, use asyncio.run(...).
  • Rich media works in inputs/outputs. Pass PIL.Image objects or wave.open(...) audio handles directly inside inputs={...} or as output=... — Weave stores and renders them in the UI. Useful for vision/speech evals (see the sketch after this list).
  • Comparison view — running multiple EvaluationLoggers with the same name= (or multiple Evaluation.evaluate runs over the same Evaluation object) groups them into Weave's compare view automatically.
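
A minimal sketch of the rich-media point, assuming Pillow is installed and example.jpg is a local file:

from PIL import Image
from weave import EvaluationLogger
import weave

weave.init("my-project")

ev = EvaluationLogger(model="vision-model-v1", dataset="image-qa")
ev.log_example(
    inputs={"image": Image.open("example.jpg"), "question": "What animal is shown?"},
    output="a cat",
    scores={"correct": 1.0},
)
ev.log_summary()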

Working with the user

When converting code, do this in order:

  1. Identify the shape of their input: dataset, model, scorers, existing loop, existing results. Quote back what you see.
  2. Pick declarative vs imperative using the decision flow above and tell them why.
  3. Show the smallest change that works, not a from-scratch rewrite. The goal is "your code, but logged to Weave."
  4. Surface the one or two gotchas that apply to their case (param names, async, model validity).
  5. Confirm weave.init("project-name") is called once at startup. Without it nothing is logged. If you don't see it, add it.

References

  • Declarative source: weave/evaluation/eval.py (look at Evaluation, is_valid_model, predict_and_score, summarize).
  • Imperative source: weave/evaluation/eval_imperative.py (look at EvaluationLogger, ScoreLogger, log_prediction, log_example).
  • Official docs: https://docs.wandb.ai/weave/guides/core-types/evaluations