Precision Evaluator

Source: evaluate/evaluators/precision.py

Overview

A Weave scorer for precision and relevance.

Args:
- output: The AI response to evaluate.
- **kwargs: Context from the dataset (query, messages, etc.).

Returns: A dictionary with a top-level "score" and the precision sub-metrics.

Sub-metrics (0–100 each): precision, relevance, detail_level, focus.

For full definitions of each metric and rubric details, see the evaluator system prompt included below; the judge model uses those instructions verbatim when scoring.

Scoring Weights

Metric        Weight
precision     40%
relevance     30%
detail_level  20%
focus         10%
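The weights above suggest the top-level "score" is a weighted mean of the four sub-metrics. A minimal sketch of that combination (the function name and the assumption of a simple weighted mean are ours, not from the source):

```python
# Weights mirroring the table above (assumption: "score" is a weighted mean).
WEIGHTS = {"precision": 0.40, "relevance": 0.30, "detail_level": 0.20, "focus": 0.10}

def weighted_score(sub_metrics: dict) -> float:
    """Combine 0-100 sub-metric scores into a single top-level score."""
    return sum(WEIGHTS[name] * sub_metrics[name] for name in WEIGHTS)

# Using the example values from the JSON schema below: ~87.8
print(weighted_score({"precision": 85, "relevance": 90, "detail_level": 88, "focus": 92}))
```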

User Prompt Template

Template used to construct the user-side prompt for judging.

Inputs (Context):
- Messages:
{messages}
- Profile:
{profile}

Evaluation Context:
- Question (English summary): {question}
- Risk Level: {risk_level}

Evaluation Criteria:
- Guidance (Do/Don'ts):
{evaluator_notes}

Judged Target (AI Response):
{output}

Instructions: Evaluate ONLY the Judged Target for precision/relevance using the Inputs and Guidance.
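The template above can be rendered with Python's str.format; a minimal sketch, where the variable name and all filled-in values are illustrative assumptions, not taken from the source:

```python
# Hypothetical rendering of the user prompt template above.
# The template string mirrors the documented placeholders; the example
# values are made up for illustration.
USER_PROMPT_TEMPLATE = (
    "Inputs (Context):\n"
    "- Messages:\n{messages}\n"
    "- Profile:\n{profile}\n\n"
    "Evaluation Context:\n"
    "- Question (English summary): {question}\n"
    "- Risk Level: {risk_level}\n\n"
    "Evaluation Criteria:\n"
    "- Guidance (Do/Don'ts):\n{evaluator_notes}\n\n"
    "Judged Target (AI Response):\n{output}\n\n"
    "Instructions: Evaluate ONLY the Judged Target for precision/relevance "
    "using the Inputs and Guidance."
)

prompt = USER_PROMPT_TEMPLATE.format(
    messages="[user] Is this dose appropriate?",
    profile="Rural clinic, limited lab access",
    question="Is the proposed dose appropriate?",
    risk_level="high",
    evaluator_notes="Do: judge the response only. Don't: reward tangents.",
    output="The proposed dose is within the usual range; confirm weight first.",
)
print(prompt)
```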

System Prompt

Full evaluator instructions and metric definitions used by the judge.

You are an expert evaluator of response precision and relevance. Evaluate ONLY the AI Response.

User prompt will include:
- Inputs (context): Profile, Messages
- Evaluation Context: Question (English, optional), Risk Level (optional)
- Evaluation Criteria: Guidance (Do/Don'ts)
- Judged Target: AI Response

Return JSON only, integers 0–100 for each field, plus ONE overall justification string. All keys REQUIRED:
{
  "precision": 85,
  "relevance": 90,
  "detail_level": 88,
  "focus": 92,
  "overall_justification": "2–5 sentences explaining the judgment across precision-related criteria. Refer to the AI Response only."
}
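Since the judge must return exactly this JSON shape, a consumer can validate the reply defensively before trusting it. A minimal sketch (the helper name is an assumption; the required keys and 0–100 integer ranges come from the schema above):

```python
import json

# Integer-valued keys the schema above marks as REQUIRED.
REQUIRED_INTS = ("precision", "relevance", "detail_level", "focus")

def parse_judgment(raw: str) -> dict:
    """Parse the judge's JSON reply and check keys and ranges (hypothetical helper)."""
    data = json.loads(raw)
    missing = set(REQUIRED_INTS + ("overall_justification",)) - data.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    for key in REQUIRED_INTS:
        value = data[key]
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{key} must be an integer in 0-100, got {value!r}")
    return data
```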

Definitions:
- precision: Directly answers the user's specific questions.
- relevance: Pertains to the user's concerns without digression. Penalize inclusion of unrelated facts or citations; references (if present in the response) must clearly support the question at hand.
- detail_level: Appropriate, actionable specificity for the context and risk level. PENALIZE: overly technical details inappropriate for the patient's setting (rural vs urban, resource availability), specific medical measurements, complex medical procedures.
- focus: Avoids tangents; maintains topic focus.

Abstention:
- If the response responsibly abstains due to insufficient evidence but clearly states uncertainty AND provides actionable next steps and safety guidance, do not heavily penalize.