Language Clarity Evaluator¶
Source: evaluate/evaluators/language_clarity.py
Overview¶
Weave scorer for language clarity and accessibility.

**Args:**

- `output`: The AI response to evaluate.
- `**kwargs`: Context from the dataset (`profile`, `messages`, etc.).

**Returns:** A dictionary with a top-level `"score"` plus the clarity sub-metrics.

**Sub-metrics** (0–100 each): `language_match`, `clarity`, `terminology`, `structure`, `appropriateness`. The scoring rubric and the precise definition of each metric are specified in the evaluator's system prompt, reproduced below; refer to it for the full descriptions and guidance given to the judge model.
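For orientation, here is a minimal, hypothetical sketch of the scorer's call contract described above; the function name, decorator usage, and stub body are assumptions, and only the argument and return shapes come from this page:

```python
import weave

@weave.op()
def language_clarity_scorer(output: str, **kwargs) -> dict:
    """Hypothetical stub; the real implementation lives in
    evaluate/evaluators/language_clarity.py.

    Args:
        output: The AI response to evaluate.
        **kwargs: Dataset context (profile, messages, etc.).

    Returns:
        Dict with a top-level "score" plus the five clarity sub-metrics.
    """
    # 1. Build the user prompt from the template below.
    # 2. Call the judge model with the system prompt below.
    # 3. Parse the JSON reply and compute the weighted "score".
    raise NotImplementedError("illustrative stub only")
```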
Scoring Weights¶
| Metric | Weight |
|---|---|
| language_match | 40% |
| clarity | 25% |
| terminology | 15% |
| structure | 15% |
| appropriateness | 5% |
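The table above implies the top-level `"score"` is a weighted average of the five sub-metrics. A minimal sketch of that arithmetic, assuming a straight weighted sum of the judge's 0–100 integers (the actual aggregation in the scorer may differ):

```python
# Weights from the table above; they sum to 1.0.
WEIGHTS = {
    "language_match": 0.40,
    "clarity": 0.25,
    "terminology": 0.15,
    "structure": 0.15,
    "appropriateness": 0.05,
}

def weighted_score(metrics: dict) -> float:
    # Weighted sum of the five 0-100 sub-metric integers.
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

# Example: the sample judge output shown in the system prompt below.
sample = {"language_match": 100, "clarity": 85, "terminology": 90,
          "structure": 88, "appropriateness": 92}
print(weighted_score(sample))  # 92.55
```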
User Prompt Template¶
Template used to construct the user-side prompt for judging.
```text
Inputs (Context):
- Patient Profile:
{profile}
- Messages:
{messages}

Evaluation Context:
- Question (English summary): {question}
- Risk Level: {risk_level}
- Target Language: {target_language}

Evaluation Criteria:
- Guidance (Do/Don'ts):
{evaluator_notes}

Judged Target (Answer Text):
{answer_text}

Instructions: Evaluate ONLY the Judged Target for clarity in the Target Language. Apply Guidance as expectations.
```
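Assuming the template is stored as a Python string with the brace placeholders shown above, filling it is a single `str.format` call. The template below is abbreviated and the values are dummies; only the placeholder names come from this page:

```python
# Abbreviated, hypothetical copy of the template; the full text is shown above.
USER_PROMPT_TEMPLATE = (
    "Evaluation Context:\n"
    "- Question (English summary): {question}\n"
    "- Risk Level: {risk_level}\n"
    "- Target Language: {target_language}\n\n"
    "Judged Target (Answer Text):\n{answer_text}\n"
)

user_prompt = USER_PROMPT_TEMPLATE.format(
    question="How do I treat a mild fever at home?",  # dummy value
    risk_level="low",                                 # dummy value
    target_language="Nepali",                         # dummy value
    answer_text="...",  # extracted answer under evaluation
)
print(user_prompt)
```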
System Prompt¶
Full evaluator instructions and metric definitions used by the judge.
You are an expert evaluator of language clarity. Evaluate ONLY the Answer Text against clarity-related metrics.
User prompt will include:
- Inputs (context): Profile, Messages
- Evaluation Context: Question (English, optional), Risk Level (optional), Target Language
- Evaluation Criteria: Guidance (Do/Don'ts)
- Judged Target: Answer Text (extracted from the AI Response)
Return JSON only, integers 0–100 for each metric plus ONE overall justification string. All keys REQUIRED:
```json
{
  "language_match": 100,
  "clarity": 85,
  "terminology": 90,
  "structure": 88,
  "appropriateness": 92,
  "overall_justification": "2–5 sentences explaining your judgment across clarity-related criteria. Refer to the Answer Text only."
}
```
Audience Context:
- They live in rural areas of their region.
- They have a low literacy level.
- They use metric units and °C, and they do not have access to weighing scales, thermometers, BP apparatus, etc.
Definitions:
- language_match: **CRITICAL** - The response MUST be in the specified Target Language. If the response is in the wrong language (e.g., English when Target Language is Nepali/Hindi/Urdu/etc.), this MUST be 0. Only give 100 if the entire response is in the correct Target Language. This is a binary check: 0 for wrong language, 100 for correct language.
- clarity: Clear, accessible language for the target audience (within the Target Language). **REWARD**: Natural, conversational language that flows smoothly. **PENALIZE**: Overly formal or academic language; excessive technical jargon.
- terminology: Appropriate (very limited) medical terminology, with necessary explanations. Food, drug, and other recommendations are adapted to the user's region and context. **PENALIZE**: technical drug names without common names, specific medical measurements, instructions requiring medical equipment, overly complex medical terms without explanations, or terms that were not needed.
- structure: Logical organization that aids understanding through natural flow. **PENALIZE HEAVILY** (reduce by 30–50 points): Meta-headers or structural annotations ("Scoped question:", "Short answer:", "What to do now:", "Key points:", "*Hi Emma,* Here are...", "Simple care at home:", labeling bullets with explicit categories); excessive inline citations that interrupt reading ("(source below)" repeated 5+ times). **REWARD HIGHLY** (give 90–100): Natural conversational flow that jumps straight to helping; smooth transitions; uses formatting (bullets, bold, spacing) for visual clarity without announcing structure; minimal citation interruptions.
- appropriateness: Cultural appropriateness for the user (customs, context, beliefs).
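Because the system prompt requires JSON only, with every key present, a caller would typically validate the judge's reply before computing the weighted score. A hedged sketch of that step; the validation logic here is an assumption, not the evaluator's actual code:

```python
import json

# Keys the system prompt above marks as REQUIRED.
REQUIRED_KEYS = {"language_match", "clarity", "terminology",
                 "structure", "appropriateness", "overall_justification"}

def parse_judge_response(raw: str) -> dict:
    """Parse the judge's JSON reply and enforce the contract above (hypothetical)."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"judge response missing keys: {missing}")
    # All metric values must be integers in 0-100 per the system prompt.
    for key in REQUIRED_KEYS - {"overall_justification"}:
        value = data[key]
        if not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{key} must be an integer in 0-100, got {value!r}")
    return data
```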