Precision Evaluator¶
Source: evaluate/evaluators/precision.py
Overview¶
Weave scorer for precision and relevance.

Args:

- `output`: The AI response to evaluate.
- `**kwargs`: Context from the dataset (`query`, `messages`, etc.).

Sub-metrics (0–100 each): `precision`, `relevance`, `detail_level`, `focus`. For full definitions of each metric and rubric details, refer to the evaluator system prompt below; the judge model uses those instructions verbatim when scoring.

Returns: A dictionary with a top-level `"score"` and the precision sub-metrics.
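A minimal sketch of the scorer's contract, assuming a function-style Weave scorer; `evaluate_precision` and `_call_judge` are illustrative names, not necessarily what the source uses:

```python
from typing import Any, Dict

import weave


def _call_judge(output: str, **context: Any) -> Dict[str, int]:
    """Hypothetical stand-in for the LLM judge call: it would send the
    system and user prompts shown below and parse the JSON reply into
    the four 0-100 sub-metrics."""
    raise NotImplementedError


@weave.op()
def evaluate_precision(output: str, **kwargs: Any) -> Dict[str, Any]:
    """Judge the AI response; kwargs carries dataset context
    (query, messages, etc.)."""
    sub = _call_judge(output, **kwargs)
    # Weighted combination of the sub-metrics (see Scoring Weights below).
    score = (0.40 * sub["precision"] + 0.25 * sub["relevance"]
             + 0.15 * sub["detail_level"] + 0.20 * sub["focus"])
    return {"score": score, **sub}
```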
Scoring Weights¶
| Metric | Weight |
|---|---|
| precision | 40% |
| relevance | 25% |
| detail_level | 15% |
| focus | 20% |
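As a worked example of the weighting, the illustrative sub-scores from the JSON schema below (85, 90, 88, 92) combine to an overall score of 88.1:

```python
WEIGHTS = {"precision": 0.40, "relevance": 0.25, "detail_level": 0.15, "focus": 0.20}

def weighted_score(sub_metrics: dict) -> float:
    """Collapse the four 0-100 sub-metrics into one overall score."""
    return sum(weight * sub_metrics[name] for name, weight in WEIGHTS.items())

print(weighted_score({"precision": 85, "relevance": 90, "detail_level": 88, "focus": 92}))
# 0.40*85 + 0.25*90 + 0.15*88 + 0.20*92 = 88.1 (modulo float rounding)
```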
User Prompt Template¶
Template used to construct the user-side prompt for judging.
```text
Inputs (Context):
- Messages:
{messages}
- Profile:
{profile}

Evaluation Context:
- Question (English summary): {question}
- Risk Level: {risk_level}

Evaluation Criteria:
- Guidance (Do/Don'ts):
{evaluator_notes}

Judged Target (AI Response):
{output}

Instructions: Evaluate ONLY the Judged Target for precision/relevance using the Inputs and Guidance.
```
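Assuming the template is stored as a plain Python format string (an implementation detail not confirmed by the source), filling it is a single `str.format` call; the field values here are made-up placeholders:

```python
USER_PROMPT_TEMPLATE = (
    "Inputs (Context):\n"
    "- Messages:\n{messages}\n"
    "- Profile:\n{profile}\n"
    "Evaluation Context:\n"
    "- Question (English summary): {question}\n"
    "- Risk Level: {risk_level}\n"
    "Evaluation Criteria:\n"
    "- Guidance (Do/Don'ts):\n{evaluator_notes}\n"
    "Judged Target (AI Response):\n{output}\n"
    "Instructions: Evaluate ONLY the Judged Target for precision/relevance "
    "using the Inputs and Guidance."
)

user_prompt = USER_PROMPT_TEMPLATE.format(
    messages="user: My child has had a mild fever since last night.",
    profile="Caregiver in a rural area with limited clinic access.",
    question="How should a mild fever be managed at home?",
    risk_level="low",
    evaluator_notes="Do: give actionable home-care steps. Don't: add preamble.",
    output="Keep the child hydrated, dress them lightly, and recheck the fever in a few hours.",
)
```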
System Prompt¶
Full evaluator instructions and metric definitions used by the judge.
```text
You are an expert evaluator of response precision and relevance. Evaluate ONLY the AI Response.

User prompt will include:
- Inputs (context): Profile, Messages
- Evaluation Context: Question (English, optional), Risk Level (optional)
- Evaluation Criteria: Guidance (Do/Don'ts)
- Judged Target: AI Response

Return JSON only, integers 0–100 for each field, plus ONE overall justification string. All keys REQUIRED:
{
  "precision": 85,
  "relevance": 90,
  "detail_level": 88,
  "focus": 92,
  "overall_justification": "2–5 sentences explaining the judgment across precision-related criteria. Refer to the AI Response only."
}

Definitions:
- precision: Directly answers the user's specific questions without preamble or meta-commentary. **PENALIZE HEAVILY** (reduce by 30-50 points): Self-referential framing ("Here's a scoped question:", "Let me provide a short answer:", "*Short answer:*", "Simple care at home:"); announcing what you're about to do instead of just doing it. **REWARD HIGHLY** (give 90-100): Responses that immediately address the concern without preamble.
- relevance: Pertains to the user's concerns without digression. **PENALIZE**: Excessive source citations that interrupt reading flow (e.g., "(source below)" appearing 5+ times); unrelated facts. **REWARD**: Clean, focused answers where any references smoothly support the content.
- detail_level: Appropriate, actionable specificity for the context and risk level. PENALIZE: overly technical details inappropriate for the patient's setting (rural vs urban, resource availability), specific medical measurements, complex medical procedures.
- focus: Avoids tangents; maintains topic focus. **PENALIZE HEAVILY** (reduce by 30-50 points): Unnecessary framing or structural announcements ("I'll break this into sections:", "Here are the key points:", labeling each bullet with categories). **REWARD HIGHLY** (give 90-100): Direct, natural flow—jumps straight into helping; uses formatting for clarity but doesn't announce it.

Abstention:
- If the response responsibly abstains due to insufficient evidence but clearly states uncertainty AND provides actionable next steps and safety guidance, do not heavily penalize.
```
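On the consuming side, a small validator can enforce the JSON contract above. This is a sketch; the actual parsing code is not shown in the source, and `parse_judgment` is a hypothetical helper:

```python
import json

REQUIRED_INT_KEYS = ("precision", "relevance", "detail_level", "focus")

def parse_judgment(raw: str) -> dict:
    """Parse the judge's JSON-only reply and check the required keys."""
    data = json.loads(raw)
    for key in REQUIRED_INT_KEYS:
        value = data.get(key)
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{key!r} must be an integer in 0-100, got {value!r}")
    if not isinstance(data.get("overall_justification"), str):
        raise ValueError("missing or non-string 'overall_justification'")
    return data
```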