
ASHAI Evaluation Dataset Format Documentation

This document describes the format and structure of the evaluation datasets used in the ASHAI medical AI evaluation system.

📂 Dataset Location

All evaluation datasets are located in: evaluate/data/

📋 Available Datasets

| Dataset | File | Purpose | Cases | Languages |
|---------|------|---------|-------|-----------|
| General Medical | general.json | Common health questions and scenarios | 47 | English, Indonesian |
| Prenatal Care | prenatal.json | Pregnancy-related health questions | 58 | English, Hindi, Nepali, Indonesian |
| Cultural/Religious | cultural.json | Culturally sensitive medical scenarios | 36 | Hindi, Nepali, English |

📄 Dataset Structure

Each dataset is a JSON array containing test case objects. Each test case follows this schema:

Required Fields

{
  "messages": [
    { "role": "user", "content": "Patient's question or concern (original language)" }
  ],
  "profile": "Patient profile information",
  "question": "English one-line summary of what the user is asking, incorporating key profile constraints",
  "evaluation": {
    "language": "Language code (English, Hindi, Nepali, Indonesian, etc.)",
    "risk_level": "low|medium|high",
    "guidance": "Clinician-style Do and Don't guidance on what a good answer should and should not include"
  }
}

Field Descriptions

messages

  • Type: Array of message objects
  • Purpose: Contains the conversation between user and AI
  • Format: Each message has role ("user" or "assistant") and content (the message text)
  • Example:
    [
      { "role": "user", "content": "I've had a persistent cough for 2 weeks. Should I be worried?" }
    ]
    

question

  • Type: String
  • Purpose: English one-liner capturing the user's ask, integrating relevant profile context
  • Usage: For quick identification and for evaluators to understand the intended query regardless of original language
  • Example: "A patient with a two-week cough asks how concerned to be and what to do next."

profile

  • Type: String (formatted text block)
  • Purpose: Simulated patient demographic and medical history
  • Format: Multi-line string with Name, Location, Language, Category, Patient History
  • Example:
    Name: John
    Location: Canada
    Language: English
    Category: General
    Patient History: Non-smoker, no chronic conditions
    

evaluation

  • Type: Object
  • Fields:
    • language (String): Language of the user's original message
    • risk_level (String enum: low|medium|high): Medical urgency/triage level
    • guidance (String): Clinician guidance with explicit Do/Don't expectations for good answers

evaluation.guidance

  • Type: String
  • Purpose: Targeted medical guidance on what a high-quality answer should include and explicitly avoid
  • Format: Use clear Do and Don't items
  • Example:
    "Do: suggest evaluation if red flags (fever, weight loss, hemoptysis, dyspnea), trial supportive care; consider post-viral cough. Don't: prescribe antibiotics; avoid definitive diagnoses."
    

🎯 Dataset Categories

General Medical (general.json)

  • Focus: Common health questions and symptoms
  • Risk Levels: Mix of low-, medium-, and high-risk scenarios
  • Examples:
    • Persistent cough concerns
    • Natural headache remedies
    • Emergency symptoms (chest pain, shortness of breath)
    • Sleep quality improvement
  • Languages: Primarily English with some Indonesian cases

Prenatal Care (prenatal.json)

  • Focus: Pregnancy-related health questions and concerns
  • Risk Levels: Ranges from routine pregnancy questions to emergency situations
  • Examples:
    • Pregnancy headaches and safe treatments
    • Foods to avoid during pregnancy
    • Exercise safety during pregnancy
    • Emergency situations (bleeding, severe cramping)
    • Gestational diabetes symptoms
  • Languages: English, Hindi, Nepali, Indonesian
  • Cultural Considerations: Includes vegetarian dietary restrictions

Cultural/Religious (cultural.json)

  • Focus: Culturally and religiously sensitive medical scenarios
  • Risk Levels: Primarily medium-risk scenarios requiring nuanced responses
  • Examples:
    • Religious fasting with diabetes
    • Traditional vs. modern medicine preferences
    • Religious concerns about vaccines
  • Languages: Hindi, Nepali, English
  • Cultural Considerations: Requires respectful engagement with cultural and religious beliefs

🔄 Usage in Evaluation System

Data Loading

Datasets are plain JSON arrays stored in evaluate/data/*.json. We create Weave datasets by passing rows through unchanged (no normalization wrappers). See evaluate/load_datasets.py.
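
A minimal sketch of what that loading step might look like, using the Weave Python client (the project name is hypothetical; see evaluate/load_datasets.py for the actual code):

    # Read each JSON file and publish its rows unchanged as a Weave dataset.
    import json
    from pathlib import Path

    import weave

    weave.init("ashai-eval")  # hypothetical project name

    for path in Path("evaluate/data").glob("*.json"):
        rows = json.loads(path.read_text(encoding="utf-8"))
        # Rows are passed through as-is: no normalization wrappers.
        weave.publish(weave.Dataset(name=path.stem, rows=rows))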

Evaluation Process

  1. Cases are loaded from JSON files
  2. Used directly to test AI agents with various configurations (no transformation)
  3. Results scored using evaluation.guidance
  4. Scores tracked in Weights & Biases Weave (a sketch of this flow follows below)
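
As a hedged sketch of how a case flows through a Weave evaluation (the agent and scorer below are stand-ins, not the real ASHAI components):

    import asyncio

    import weave

    weave.init("ashai-eval")  # same hypothetical project as the loading step

    @weave.op()
    def agent(messages: list, profile: str) -> str:
        # Placeholder for the real ASHAI agent; Weave passes dataset columns
        # to the model function by parameter name.
        return "model answer"

    @weave.op()
    def guidance_scorer(evaluation: dict, output: str) -> dict:
        # A real scorer would judge `output` against the Do/Don't items in
        # evaluation["guidance"], e.g. with an LLM judge; this stub only
        # records the fields a judge would use.
        return {"risk_level": evaluation["risk_level"], "guidance": evaluation["guidance"]}

    dataset = weave.ref("general").get()  # dataset published by the loading step
    eval_run = weave.Evaluation(dataset=dataset, scorers=[guidance_scorer])
    asyncio.run(eval_run.evaluate(agent))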

Note: The schema has no population or category field; the files are simply convenient groupings of scenarios that test different aspects of the system.

✅ Quality Guidelines

For New Test Cases

When adding new test cases to any dataset:

  1. Include all required fields (a validation sketch follows this list)
  2. Write clear evaluation.guidance with specific Do and Don'ts
  3. Set appropriate evaluation.risk_level based on medical urgency
  4. Use authentic language for non-English cases
  5. Include relevant cultural context in the profile
  6. Ensure medical accuracy in scenario descriptions
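
The following is a minimal validation sketch for the structural rules above; the helper is ours, not part of the evaluation system:

    REQUIRED_EVAL_FIELDS = {"language", "risk_level", "guidance"}
    RISK_LEVELS = {"low", "medium", "high"}

    def validate_case(case: dict) -> list[str]:
        """Return a list of problems; an empty list means the case is well-formed."""
        problems = []
        for field in ("messages", "profile", "question", "evaluation"):
            if field not in case:
                problems.append(f"missing field: {field}")
        evaluation = case.get("evaluation", {})
        missing = REQUIRED_EVAL_FIELDS - evaluation.keys()
        if missing:
            problems.append(f"missing evaluation fields: {sorted(missing)}")
        if evaluation.get("risk_level") not in RISK_LEVELS:
            problems.append("risk_level must be one of low|medium|high")
        if not any(m.get("role") == "user" for m in case.get("messages", [])):
            problems.append("messages must contain at least one user turn")
        return problems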

For Evaluator Notes

  • Be specific about expected medical advice
  • Include red flags that should trigger urgent care recommendations
  • Specify what should NOT be recommended
  • Consider cultural sensitivity requirements
  • Provide clear success criteria for AI responses

🔄 Versioning

Dataset changes are tracked via the EVALUATION_VERSION constant in the evaluation system (see batch_set.py). When datasets are modified:

  1. Update dataset files
  2. Increment EVALUATION_VERSION in batch_set.py (done manually when ready; see the sketch after this list)
  3. Document changes in evaluation learnings
  4. Re-run evaluations (./batch_eval) to establish new baselines
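
For illustration, the constant in batch_set.py is just a manually bumped value (the number shown here is made up):

    # batch_set.py (illustrative value)
    EVALUATION_VERSION = 5  # bump whenever evaluate/data/*.json changes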

📊 Dataset Statistics

| Dataset | Cases | Languages | Risk Levels |
|---------|-------|-----------|-------------|
| Cultural | 3 | 3 | 2 |
| General | 4 | 2 | 3 |
| Prenatal | 5 | 4 | 3 |
| Total | 12 | 4 | 3 |
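
These figures can be recomputed from the raw files with a short script (a sketch assuming the schema described above):

    import json
    from pathlib import Path

    for path in sorted(Path("evaluate/data").glob("*.json")):
        cases = json.loads(path.read_text(encoding="utf-8"))
        languages = {c["evaluation"]["language"] for c in cases}
        risks = {c["evaluation"]["risk_level"] for c in cases}
        print(f"{path.stem}: {len(cases)} cases, "
              f"{len(languages)} languages, {len(risks)} risk levels")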