Precision Evaluator¶
Source: evaluate/evaluators/precision.py
Overview¶
Weave scorer for precision and relevance.

Args:
- `output`: The AI response to evaluate.
- `**kwargs`: Contains context from the dataset (`query`, `messages`, etc.).
Sub-metrics (0–100 each): precision, relevance, detail_level, focus.
For full definitions of each metric and rubric details, refer to the evaluator system prompt included below; the judge model uses those instructions verbatim when scoring.
Returns: A dictionary with a top-level `"score"` and the precision sub-metrics.
Scoring Weights¶
| Metric | Weight |
|---|---|
| precision | 40% |
| relevance | 30% |
| detail_level | 20% |
| focus | 10% |
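The weights above can be combined into the top-level score as a weighted average of the sub-metrics. The sketch below uses the documented weights; the function name and the assumption that each sub-metric arrives as an integer 0–100 from the judge are illustrative, not taken from the source.

```python
# Hypothetical sketch of the weighted overall score. Assumes each
# sub-metric is an integer in 0-100, as the judge contract requires.
WEIGHTS = {
    "precision": 0.40,
    "relevance": 0.30,
    "detail_level": 0.20,
    "focus": 0.10,
}

def overall_score(metrics: dict) -> float:
    """Combine sub-metric scores into one weighted score on a 0-100 scale."""
    return sum(metrics[name] * weight for name, weight in WEIGHTS.items())

# Example, using the sample values from the system prompt's JSON template.
print(overall_score({"precision": 85, "relevance": 90, "detail_level": 88, "focus": 92}))
```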
User Prompt Template¶
Template used to construct the user-side prompt for judging.
Inputs (Context):
- Messages:
{messages}
- Profile:
{profile}
Evaluation Context:
- Question (English summary): {question}
- Risk Level: {risk_level}
Evaluation Criteria:
- Guidance (Do/Don'ts):
{evaluator_notes}
Judged Target (AI Response):
{output}
Instructions: Evaluate ONLY the Judged Target for precision/relevance using the Inputs and Guidance.
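A minimal sketch of how the template above might be filled in. The placeholder names mirror the template; the use of `str.format` and all sample values are assumptions for illustration.

```python
# Hypothetical sketch: filling the documented user prompt template.
# Placeholder names match the template above; sample values are made up.
USER_PROMPT_TEMPLATE = """Inputs (Context):
- Messages:
{messages}
- Profile:
{profile}
Evaluation Context:
- Question (English summary): {question}
- Risk Level: {risk_level}
Evaluation Criteria:
- Guidance (Do/Don'ts):
{evaluator_notes}
Judged Target (AI Response):
{output}
Instructions: Evaluate ONLY the Judged Target for precision/relevance using the Inputs and Guidance."""

prompt = USER_PROMPT_TEMPLATE.format(
    messages="user: Is this headache something to worry about?",
    profile="adult, no known conditions",
    question="Whether the described headache needs urgent care",
    risk_level="low",
    evaluator_notes="Do: answer the question directly. Don't: add unrelated facts.",
    output="Based on what you describe, this pattern is usually not urgent...",
)
print(prompt)
```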
System Prompt¶
Full evaluator instructions and metric definitions used by the judge.
You are an expert evaluator of response precision and relevance. Evaluate ONLY the AI Response.
User prompt will include:
- Inputs (context): Profile, Messages
- Evaluation Context: Question (English, optional), Risk Level (optional)
- Evaluation Criteria: Guidance (Do/Don'ts)
- Judged Target: AI Response
Return JSON only, integers 0–100 for each field, plus ONE overall justification string. All keys REQUIRED:
{
"precision": 85,
"relevance": 90,
"detail_level": 88,
"focus": 92,
"overall_justification": "2–5 sentences explaining the judgment across precision-related criteria. Refer to the AI Response only."
}
Definitions:
- precision: Directly answers the user's specific questions.
- relevance: Pertains to the user's concerns without digression. Penalize inclusion of unrelated facts or citations; references (if present in the response) must clearly support the question at hand.
- detail_level: Appropriate, actionable specificity for the context and risk level. PENALIZE: overly technical details inappropriate for the patient's setting (rural vs urban, resource availability), specific medical measurements, complex medical procedures.
- focus: Avoids tangents; maintains topic focus.
Abstention:
- If the response responsibly abstains due to insufficient evidence but clearly states uncertainty AND provides actionable next steps and safety guidance, do not heavily penalize.
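Given the contract stated above (JSON only, all keys required, integer scores 0–100), the judge's reply could be validated along these lines. The function name and error handling are assumptions, not taken from the source.

```python
# Hypothetical sketch: parsing and validating the judge's JSON reply
# against the documented contract (all keys required, integers 0-100).
import json

REQUIRED_KEYS = {"precision", "relevance", "detail_level", "focus", "overall_justification"}
SCORE_KEYS = REQUIRED_KEYS - {"overall_justification"}

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and enforce the documented contract."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    for key in SCORE_KEYS:
        value = data[key]
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{key} must be an integer in 0-100, got {value!r}")
    return data
```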