| name | weave-eval |
|---|---|
| description | Convert any kind of evaluation code into a Weave Evaluation. Use whenever the user has scoring code, an evaluation loop, a model-vs-ground-truth comparison, a list of predictions with metrics, a pandas DataFrame / CSV / JSONL of results to log a posteriori, a per-row loop with a black-box model (remote API, third-party agent), or any "I'm checking how well my model/LLM/agent performs" workflow and wants it logged to Weave. Covers both the declarative (`weave.Evaluation`) and imperative (`weave.EvaluationLogger`) APIs and picks between them. Triggers on phrases like "convert this to a weave evaluation", "wrap this in weave.Evaluation", "log this as a weave eval", "use EvaluationLogger", "log this batch to weave", "track this in weave", or any code resembling an eval loop where the user asks for Weave integration. Also use proactively when the user mentions evaluating LLMs/models and is already using `weave.init` — the right answer is almost always one of the two APIs in this skill. |
Weave has two evaluation APIs. Picking the right one is more than half the work.
| API | Class | Use when… |
|---|---|---|
| Declarative | `weave.Evaluation` | The user has (or can produce) a dataset + a model/predict function + scorers, and is happy to let Weave drive the loop. Best for fresh code, parallel runs, trials, auto-summary. |
| Imperative | `weave.EvaluationLogger` | The user already has an evaluation loop they don't want to refactor, scoring happens out-of-band, results already exist as a DataFrame/list, or they're logging from a notebook/script that owns its own iteration. |
When in doubt: if the user shows you a for loop, default to EvaluationLogger (preserves their structure). If they show a dataset and a scoring function but no loop yet, default to Evaluation (Weave runs it for them).
- A posteriori logging — the eval already ran (results live in a DataFrame, CSV, JSONL, prior run output, spreadsheet, etc.) and we're just shipping it to Weave for visibility? → Imperative, almost always `log_example` per row. Do not rebuild a model; the prediction already exists.
- Black-box per-row output — there's a loop, and each row calls something Weave can't (or shouldn't) trace into: a remote API, a third-party agent, a subprocess, a service over HTTP. The "model" returns a value and that's all you have? → Imperative. Treat each call's return as the `output`. Don't try to wrap the black box in `@weave.op`.
- Existing working loop with traceable code that calls your own model and computes scores per row? → Imperative, with optional `with` blocks if the model itself is `@weave.op`-decorated and you want children traces.
- Clean separation: a dataset + a `model(x)` you control + scoring functions, and you want Weave to run the loop with parallelism / trials / auto-summary? → Declarative (`weave.Evaluation`).
- Comparing multiple models against the same dataset? → Declarative, run `evaluate()` once per model — Weave links them by the shared evaluation ref.
The big mental split: declarative is "Weave drives", imperative is "you drive, Weave records". Anything with already-computed results, opaque calls, or a loop you don't want to refactor is imperative.
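The split above can be restated as a tiny helper — purely illustrative, not a Weave API; the function and flag names are hypothetical:

```python
def choose_api(
    has_existing_loop: bool = False,
    results_already_exist: bool = False,
    model_is_black_box: bool = False,
) -> str:
    """Illustrative restatement of the decision list above: anything with
    already-computed results, opaque calls, or a loop you don't want to
    refactor goes imperative; otherwise let Weave drive."""
    if results_already_exist or model_is_black_box or has_existing_loop:
        return "imperative"  # you drive, Weave records
    return "declarative"     # Weave drives the loop

choose_api(results_already_exist=True)  # "imperative"
choose_api()                            # "declarative"
```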
Source of truth: `weave/evaluation/eval.py`.
```python
import asyncio
import weave

weave.init("my-project")

dataset = [
    {"question": "Capital of France?", "expected": "Paris"},
    {"question": "Author of 1984?", "expected": "George Orwell"},
]

@weave.op
def match_score(expected: str, output: str) -> dict:
    return {"match": expected.lower() in output.lower()}

@weave.op
def my_model(question: str) -> str:
    # your real model / LLM call here
    return "Paris" if "France" in question else "..."

evaluation = weave.Evaluation(dataset=dataset, scorers=[match_score])
asyncio.run(evaluation.evaluate(my_model))
```

- Scorer parameter names must match dataset column names, plus `output` for the model's return value. So a dataset row `{"question": ..., "expected": ...}` pairs with a scorer with the signature `def scorer(expected, output)`. Extra dataset columns are passed if the scorer asks for them; missing ones raise.
- The model must be valid per `is_valid_model`: a `weave.Model` subclass instance, an `@weave.op`-decorated function, or a saved `WeaveObject` with a `predict` op. Plain functions or callables fail.
- `evaluate()` is async. From sync code use `asyncio.run(...)`. Inside an existing event loop, `await` it directly.
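To make the column-matching rule concrete, here is a plain-Python illustration of how scorer arguments get filled from a dataset row plus the model output. This is a sketch of the documented behavior, not Weave's actual implementation — `bind_scorer_args` is a hypothetical helper:

```python
import inspect

def bind_scorer_args(scorer, row: dict, output):
    """Illustration of the documented rule: each scorer parameter is filled
    from the same-named dataset column, except `output`, which receives the
    model's return value. A parameter with no matching column raises."""
    kwargs = {}
    for name in inspect.signature(scorer).parameters:
        if name == "output":
            kwargs[name] = output
        elif name in row:
            kwargs[name] = row[name]
        else:
            raise KeyError(f"scorer wants {name!r} but the row has no such column")
    return scorer(**kwargs)

def match_score(expected: str, output: str) -> dict:
    return {"match": expected.lower() in output.lower()}

row = {"question": "Capital of France?", "expected": "Paris"}
result = bind_scorer_args(match_score, row, "Paris")
# result == {"match": True}
```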
```python
# Function scorer — simplest, must be @weave.op
@weave.op
def exact_match(expected: str, output: str) -> bool:
    return expected == output

# Class scorer — when you need state, configuration, or a custom summarize
class ToleranceScorer(weave.Scorer):
    tolerance: float = 0.01

    @weave.op
    def score(self, expected: float, output: float) -> dict:
        return {"close_enough": abs(expected - output) < self.tolerance}

    @weave.op
    def summarize(self, score_rows: list) -> dict:
        n = sum(1 for r in score_rows if r["close_enough"])
        return {"pass_rate": n / len(score_rows) if score_rows else 0.0}
```

```python
# Style 1: @weave.op function — quickest
@weave.op
def predict(question: str) -> str: ...

# Style 2: Model subclass — when you want config tracked & versioned
class MyModel(weave.Model):
    system_prompt: str
    temperature: float = 0.0

    @weave.op
    def predict(self, question: str) -> str:
        ...

evaluation = weave.Evaluation(dataset=ds, scorers=[s])
asyncio.run(evaluation.evaluate(MyModel(system_prompt="You are…")))
```

- `trials=N` — run every row N times (variance, non-determinism).
- `evaluation_name="my run"` — persistent label on the `Evaluation` object; can also be a callable returning a string per call.
- `metadata={"git_sha": ..., "model_size": ...}` — arbitrary metadata stored on the eval.
- `preprocess_model_input=fn` — transforms each row before the model sees it. Scorers still receive the original row (this is a documented gotcha).
- Per-run display name (different from the persistent `evaluation_name`): pass `__weave={"display_name": "..."}` into `evaluate()` to label that one run distinctly in the UI.
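Putting a few of those options together — a minimal sketch, where `lowercase_question`, `run_eval`, and every value are illustrative placeholders. The weave calls live inside a function because they need a project with `weave.init(...)` already run:

```python
def lowercase_question(row: dict) -> dict:
    """Hypothetical preprocess_model_input: the model sees the normalized
    question, but scorers still receive the ORIGINAL row."""
    return {"question": row["question"].lower()}

def run_eval(ds, score, model):
    """Sketch of an Evaluation wired with trials, names, metadata, and a
    preprocessor. Call from a script where weave.init(...) has run."""
    import asyncio
    import weave

    evaluation = weave.Evaluation(
        dataset=ds,
        scorers=[score],
        trials=3,                                   # each row runs 3 times
        evaluation_name="nightly-qa",               # persistent label
        metadata={"git_sha": "abc123"},             # arbitrary metadata
        preprocess_model_input=lowercase_question,
    )
    # Per-run display name, distinct from the persistent evaluation_name:
    asyncio.run(evaluation.evaluate(model, __weave={"display_name": "run-04"}))
```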
Source of truth: `weave/evaluation/eval_imperative.py`.
```python
import weave
from weave import EvaluationLogger

weave.init("my-project")

ev = EvaluationLogger(
    name="my-eval-2026-04",      # optional: display name for the run
    model="my-model-v1",         # str | dict | weave.Model
    dataset="qa-set",            # str | list[dict] | weave.Dataset
    scorers=["accuracy", "f1"],  # optional closed list of expected scorer names
    eval_attributes={"git_sha": "abc123", "notes": "..."},  # optional metadata
)

for row in rows:
    pred = ev.log_prediction(inputs={"question": row["q"]}, output=run_my_model(row["q"]))
    pred.log_score("accuracy", float(pred.output == row["expected"]))
    pred.finish()

ev.log_summary({"notes": "first run"})
```

- Canonical import is `from weave import EvaluationLogger`. It's also reachable via `weave.EvaluationLogger`. Do not import from `weave.flow.eval_imperative` or `weave.evaluation.eval_imperative` — those are internal paths that may move.
- `scorers=[...]` is a closed list of expected scorer names. If you set it, list every scorer name you'll later log. Logging a name not in this list produces `"Scorer 'X' is not in the predefined scorers list"` warnings on every row. If you don't know the full list up front, omit `scorers=` entirely — Weave will accept any name you log.
- Always finish each prediction before `log_summary()`. Either call `pred.finish()` explicitly, or use the context-manager form, which finishes automatically. `log_summary()` is terminal — after it runs, no more predictions or scores.
When the prediction itself involves traced ops (LLM calls, tools), use the `with` block so those calls become children of the predict call in the trace tree:
```python
with ev.log_prediction(inputs={"q": row["q"]}) as pred:
    response = my_traced_llm_call(row["q"])  # becomes a child of pred
    pred.output = response.text
    pred.log_score("correctness", grade(response.text, row["expected"]))
# finish() runs on __exit__
```

For scoring that itself involves traced computation, the same pattern works on `log_score`:

```python
with pred.log_score("rubric") as s:
    s.value = run_rubric_grader(pred.output)  # rubric calls become children of the score
```

When inputs, output, and all scores are known up front (e.g. iterating a DataFrame of past predictions):

```python
ev.log_example(
    inputs={"question": row["q"]},
    output=row["prediction"],
    scores={"accuracy": row["acc"], "bleu": row["bleu"]},
)
```

This is the right call for "I already ran the model offline, just log it."
Match the user's input shape to the recipe.
```python
# Before
for row in rows:
    pred = my_model(row["question"])
    correct = pred == row["expected"]
    results.append(correct)
print("acc:", sum(results) / len(results))
```

→ Imperative, minimal change:

```python
ev = EvaluationLogger(model="my-model", dataset=rows, scorers=["accuracy"])
for row in rows:
    pred = ev.log_prediction(inputs={"question": row["question"]}, output=my_model(row["question"]))
    pred.log_score("accuracy", pred.output == row["expected"])
    pred.finish()
ev.log_summary()
```

If they're willing to refactor, the declarative version is cleaner — wrap `my_model` in `@weave.op`, write `@weave.op def accuracy(expected, output): ...`, and call `Evaluation(...).evaluate(...)`.
→ Declarative. Make sure scorer params match dataset keys + `output`.

```python
ds = [{"q": "...", "gold": "..."}, ...]

@weave.op
def my_model(q: str) -> str: ...

@weave.op
def score(gold: str, output: str) -> dict:
    return {"em": gold == output}

asyncio.run(weave.Evaluation(dataset=ds, scorers=[score]).evaluate(my_model))
```

This is the catch-all for "I already ran the eval, just put it in Weave". The source can be a pandas DataFrame, a CSV, a JSONL file, a SQL query, last week's notebook output, anything. No model is re-run. Use `log_example` per row.
```python
from weave import EvaluationLogger

ev = EvaluationLogger(
    model="offline-run-2026-04",  # any string identifying the run
    dataset="qa-eval-v3",
    # omit `scorers=` unless you want a closed list — see gotchas
)

for _, row in df.iterrows():  # or: for row in json.load(f), etc.
    ev.log_example(
        inputs={"question": row["question"]},
        output=row["prediction"],
        scores={"accuracy": row["acc"], "rouge": row["rouge"]},
    )

ev.log_summary({"source_csv": path})
```

The trick: `log_example` does `log_prediction` + `log_scores` + `finish()` in one call, so it's the cleanest fit for "I have all the data, just record it." Use `log_prediction` + `log_score` + `finish()` only when you want to attach extra trace context per row (rare in a-posteriori land).
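Getting past results into that shape is usually the only real work. A stdlib-only sketch for the JSONL case — the field names `question`, `prediction`, and `acc` are hypothetical; match them to your file:

```python
import io
import json

# Stand-in for a JSONL file of past results — one prediction per line.
jsonl = io.StringIO(
    '{"question": "Capital of France?", "prediction": "Paris", "acc": 1.0}\n'
    '{"question": "Author of 1984?", "prediction": "Orwell", "acc": 0.5}\n'
)

def rows_from_jsonl(fp):
    """Parse each JSONL line into the (inputs, output, scores) triple
    that ev.log_example expects."""
    for line in fp:
        rec = json.loads(line)
        yield (
            {"question": rec["question"]},  # inputs
            rec["prediction"],              # output
            {"accuracy": rec["acc"]},       # scores
        )

triples = list(rows_from_jsonl(jsonl))
# Then, with a live logger:
# for inputs, output, scores in triples:
#     ev.log_example(inputs=inputs, output=output, scores=scores)
```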
The user has a loop, and each row calls something opaque to Weave: a remote LLM API, an internal HTTP service, a subprocess, a third-party agent. They get a value back, score it, move on. They can't (or don't want to) wrap the call in `@weave.op`.
→ Imperative. Treat the call's return as `output`; log it the same way as a-posteriori, just inline:

```python
from weave import EvaluationLogger

ev = EvaluationLogger(model="my-remote-llm-v1", dataset="qa-set")

for row in dataset:
    output = call_black_box_api(row["question"])  # un-traced; Weave only sees the return
    score = grade(output, row["expected"])
    ev.log_example(
        inputs={"question": row["question"]},
        output=output,
        scores={"correctness": score},
    )

ev.log_summary()
```

Why not declarative here? `weave.Evaluation` requires a model that's an Op, a Model, or a saved object with `predict`. A bare network call doesn't fit, and trying to shoehorn it via `@weave.op` only logs the wrapper — there's nothing more to capture inside. Imperative is honest about the boundary.
If you do want partial visibility (e.g. logging request/response payloads), wrap the black-box call in a thin `@weave.op` and use the context-manager `log_prediction` so the op call becomes a child of the predict call:
```python
@weave.op  # records inputs/output of the API call as a traced op
def call_api(question: str) -> str:
    return call_black_box_api(question)

for row in dataset:
    with ev.log_prediction(inputs={"question": row["question"]}) as pred:
        pred.output = call_api(row["question"])  # nested under pred
        pred.log_score("correctness", grade(pred.output, row["expected"]))

ev.log_summary()
```

→ Declarative with `weave.Model`. Subclass it (so config is captured and versioned) and decorate `predict` with `@weave.op`.

```python
class MyClassifier(weave.Model):
    threshold: float = 0.5

    @weave.op
    def predict(self, text: str) -> str:
        return "pos" if self._score(text) > self.threshold else "neg"

asyncio.run(weave.Evaluation(dataset=ds, scorers=[s]).evaluate(MyClassifier()))
```

If the class is third-party and you can't subclass cleanly, wrap it: `@weave.op def predict(text): return clf.predict(text)`.
→ Declarative. Decorate with `@weave.op` and pass directly. A bare function will be rejected by `is_valid_model`.

The Weave convention is `(expected, output)`. Rename `actual` to `output`, or write an adapter:

```python
@weave.op
def adapter(expected: str, output: str) -> dict:
    return your_scorer(expected, output)
```

The dataset key is `expected`; the model's return becomes `output` automatically.
→ Imperative. Log predictions first with `pred.finish()` (no scores yet). Later, when ratings come in, start a fresh `EvaluationLogger` keyed off the same dataset, or attach scores to the original calls via the calls API. The point is: imperative lets you decouple prediction time from scoring time.
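One way to structure the second phase — a sketch where `merge_ratings` and every field name are hypothetical, and the weave logging tail is shown as comments since it needs a live project:

```python
def merge_ratings(predictions: list, ratings: dict) -> list:
    """Join earlier predictions with human ratings that arrived later,
    keyed by example id. Unrated examples are skipped for now."""
    merged = []
    for p in predictions:
        if p["id"] in ratings:
            merged.append({**p, "human_score": ratings[p["id"]]})
    return merged

predictions = [
    {"id": 1, "question": "Capital of France?", "output": "Paris"},
    {"id": 2, "question": "Author of 1984?", "output": "Orwell"},
]
ratings = {1: 1.0}  # only example 1 has been rated so far
rated = merge_ratings(predictions, ratings)

# Once ratings exist, log a fresh eval:
# from weave import EvaluationLogger
# ev = EvaluationLogger(model="my-model", dataset="qa-set")
# for row in rated:
#     ev.log_example(inputs={"question": row["question"]},
#                    output=row["output"],
#                    scores={"human": row["human_score"]})
# ev.log_summary()
```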
→ Declarative, run once per model. Build the `Evaluation` once, then call `evaluate(model_a)` and `evaluate(model_b)`. The Weave UI groups them automatically because they share the evaluation ref.

```python
evaluation = weave.Evaluation(dataset=ds, scorers=[score])
for m in [model_a, model_b, model_c]:
    asyncio.run(evaluation.evaluate(m))
```

- Token/cost tracking requires the `EvaluationLogger` to exist BEFORE you call your LLM. Token usage and cost data are only captured for `@weave.op`-decorated calls that happen while a logger is alive. The wrong order silently drops token data:

  ```python
  # ❌ wrong — logger created after the call, no token data
  for row in rows:
      output = my_llm(row["q"])  # call already logged with no parent
  ev = EvaluationLogger(...)     # too late
  ev.log_prediction(...)

  # ✅ right — logger exists when the call happens
  ev = EvaluationLogger(...)
  for row in rows:
      output = my_llm(row["q"])  # captured under the eval
      ev.log_example(...)
  ```

  This applies even more strongly with the context-manager form (`with ev.log_prediction(...) as pred:`), which makes the LLM call a child of the predict call so token attribution is exact.
- "My scorer says it's missing `output`" — old Weave code may use `model_output` instead of `output`. Both work, but don't mix them in one scorer; pick one and stay consistent. The codebase auto-detects legacy scorers and switches the output key.
- "`is_valid_model` rejected my callable" — wrap it with `@weave.op` or make it a `weave.Model` subclass. Plain functions and lambdas are not accepted.
- "Scorer didn't see my preprocessed input" — that's intentional. `preprocess_model_input` only feeds the model. Scorers always see the raw dataset row. If you need preprocessing in the scorer too, do it inside the scorer or pre-bake the dataset.
- "Trials are running serially / too fast / too slow" — parallelism is controlled by `WEAVE_PARALLELISM` (read by `get_weave_parallelism()`). Tune this if you need to throttle (rate limits) or speed up.
- "My imperative eval shows no summary" — you forgot `ev.log_summary(...)` (or `ev.finish()`), or a prediction wasn't `finish()`ed before the summary. Use `with` blocks to make this automatic.
- "Can I add scores after `log_summary`?" — no, the eval is finalized. Start a new `EvaluationLogger`.
- Async caveat — `Evaluation.evaluate` is async. From a Jupyter cell that already has a loop, just `await evaluation.evaluate(model)`. From a plain script, `asyncio.run(...)`.
- Rich media works in inputs/outputs. Pass `PIL.Image` objects or `wave.open(...)` audio handles directly inside `inputs={...}` or as `output=...` — Weave stores and renders them in the UI. Useful for vision/speech evals.
- Comparison view — running multiple `EvaluationLogger`s with the same `name=` (or multiple `Evaluation.evaluate` runs over the same `Evaluation` object) groups them into Weave's compare view automatically.
When converting code, do this in order:

- Identify the shape of their input: dataset, model, scorers, existing loop, existing results. Quote back what you see.
- Pick declarative vs imperative using the decision flow above and tell them why.
- Show the smallest change that works, not a from-scratch rewrite. The goal is "your code, but logged to Weave."
- Surface the one or two gotchas that apply to their case (param names, async, model validity).
- Confirm `weave.init("project-name")` is called once at startup. Without it nothing is logged. If you don't see it, add it.
- Declarative source: `weave/evaluation/eval.py` (look at `Evaluation`, `is_valid_model`, `predict_and_score`, `summarize`).
- Imperative source: `weave/evaluation/eval_imperative.py` (look at `EvaluationLogger`, `ScoreLogger`, `log_prediction`, `log_example`).
- Official docs: https://docs.wandb.ai/weave/guides/core-types/evaluations