# ASHAI Evaluation Dataset Format Documentation
This document describes the format and structure of the evaluation datasets used in the ASHAI medical AI evaluation system.
## 📂 Dataset Location

All evaluation datasets are located in: `evaluate/data/`
## 📋 Available Datasets

| Dataset | File | Purpose | Cases | Languages |
|---|---|---|---|---|
| General Medical | `general.json` | Common health questions and scenarios | 47 | English, Indonesian |
| Prenatal Care | `prenatal.json` | Pregnancy-related health questions | 58 | English, Hindi, Nepali, Indonesian |
| Cultural/Religious | `cultural.json` | Culturally sensitive medical scenarios | 36 | Hindi, Nepali, English |
## 📄 Dataset Structure

Each dataset is a JSON array containing test case objects. Each test case follows this schema:

### Required Fields
```json
{
  "messages": [
    { "role": "user", "content": "Patient's question or concern (original language)" }
  ],
  "profile": "Patient profile information",
  "question": "English one-line summary of what the user is asking, incorporating key profile constraints",
  "evaluation": {
    "language": "Language code (English, Hindi, Nepali, Indonesian, etc.)",
    "risk_level": "low|medium|high",
    "guidance": "Clinician-style Do and Don't guidance on what a good answer should and should not include"
  }
}
```
### Field Descriptions

#### messages

- Type: Array of message objects
- Purpose: Contains the conversation between user and AI
- Format: Each message has `role` ("user" or "assistant") and `content` (the message text)
- Example:
#### question

- Type: String
- Purpose: English one-liner capturing the user's ask, integrating relevant profile context
- Usage: For quick identification and for evaluators to understand the intended query regardless of original language
- Example: "A patient with a two-week cough asks how concerned to be and what to do next."
#### profile
- Type: String (formatted text block)
- Purpose: Simulated patient demographic and medical history
- Format: Multi-line string with Name, Location, Language, Category, Patient History
- Example:
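  An invented profile block for illustration; the field names follow the Format description above, but the person and details are placeholders, not dataset content:

```text
Name: Anjali Thapa
Location: Pokhara, Nepal
Language: Nepali
Category: Prenatal
Patient History: 24 years old, 30 weeks pregnant with her first child, mild anemia noted at the last antenatal visit, no other chronic conditions.
```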
#### evaluation

- Type: Object
- Fields:
  - `language` (String): Language of the user's original message
  - `risk_level` (String enum: `low|medium|high`): Medical urgency/triage level
  - `guidance` (String): Clinician guidance with explicit Do/Don't expectations for good answers
#### evaluation.guidance
- Type: String
- Purpose: Targeted medical guidance on what a high-quality answer should include and explicitly avoid
- Format: Use clear Do and Don't items
- Example:
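  An invented `guidance` string written in the Do/Don't format described above (not taken from the datasets):

```text
Do: recommend seeing a clinician since the cough has lasted two weeks; ask about red flags such as fever, blood in sputum, weight loss, and night sweats; suggest simple supportive care (fluids, rest) in the meantime.
Don't: name a specific diagnosis, recommend antibiotics without assessment, or dismiss the symptom as harmless.
```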
## 🎯 Dataset Categories

### General Medical (`general.json`)
- Focus: Common health questions and symptoms
- Risk Levels: Mix of low, medium, and high risk scenarios
- Examples:
- Persistent cough concerns
- Natural headache remedies
- Emergency symptoms (chest pain, shortness of breath)
- Sleep quality improvement
- Languages: Primarily English with some Indonesian cases
### Prenatal Care (`prenatal.json`)
- Focus: Pregnancy-related health questions and concerns
- Risk Levels: Covers routine pregnancy questions to emergency situations
- Examples:
- Pregnancy headaches and safe treatments
- Foods to avoid during pregnancy
- Exercise safety during pregnancy
- Emergency situations (bleeding, severe cramping)
- Gestational diabetes symptoms
- Languages: English, Hindi, Nepali, Indonesian
- Cultural Considerations: Includes vegetarian dietary considerations
### Cultural/Religious (`cultural.json`)
- Focus: Culturally and religiously sensitive medical scenarios
- Risk Levels: Primarily medium risk requiring nuanced responses
- Examples:
- Religious fasting with diabetes
- Traditional vs. modern medicine preferences
- Religious concerns about vaccines
- Languages: Hindi, Nepali, English
- Cultural Considerations: Requires respectful engagement with cultural and religious beliefs
## 🔄 Usage in Evaluation System

### Data Loading

Datasets are plain JSON arrays stored in `evaluate/data/*.json`. We create Weave datasets by passing rows through unchanged (no normalization wrappers). See `evaluate/load_datasets.py`.
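Because the rows are passed through unchanged, the loading step can be sketched in a few lines. The snippet below is illustrative only and is not the contents of `evaluate/load_datasets.py`; the project name and function names are placeholders:

```python
import json
from pathlib import Path

import weave

DATA_DIR = Path("evaluate/data")


def load_rows(name: str) -> list[dict]:
    """Read one dataset file (e.g. general.json) as a plain list of test-case dicts."""
    with open(DATA_DIR / f"{name}.json", encoding="utf-8") as f:
        return json.load(f)


def publish_dataset(name: str) -> weave.Dataset:
    """Wrap the raw rows in a Weave Dataset without normalizing or transforming them."""
    dataset = weave.Dataset(name=name, rows=load_rows(name))
    weave.publish(dataset)
    return dataset


if __name__ == "__main__":
    weave.init("ashai-eval")  # placeholder project name
    for name in ("general", "prenatal", "cultural"):
        publish_dataset(name)
```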
### Evaluation Process

- Cases are loaded from the JSON files
- Used directly to test AI agents with various configurations (no transformation)
- Results scored using `evaluation.guidance`
- Scores tracked in Weights & Biases Weave

Note: We do not use population/categories in the schema. Files are just convenient groupings of scenarios that test different aspects.
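For orientation only, here is a schematic of how these pieces could be wired together with Weave's `Evaluation` API. It is not the project's actual evaluation code: `agent_model` and `guidance_score` are toy placeholders, and the real judging of answers against `evaluation.guidance` is not reproduced here. `rows` is a list of test-case dicts as produced by the loader above.

```python
import asyncio

import weave


@weave.op()
def agent_model(messages: list[dict], profile: str) -> str:
    """Toy stand-in for the AI agent under test; the real agent is called here instead."""
    return "A cough lasting more than two weeks should be checked by a clinician."


@weave.op()
def guidance_score(evaluation: dict, output: str) -> dict:
    """Toy scorer: crude word overlap between the answer and evaluation.guidance.

    The real system scores answers against the guidance; that logic is not shown here.
    """
    guidance_words = set(evaluation["guidance"].lower().split())
    answer_words = set(output.lower().split())
    overlap = len(guidance_words & answer_words) / max(len(guidance_words), 1)
    return {"guidance_overlap": overlap}


def run_evaluation(rows: list[dict]) -> dict:
    """Run the toy agent over the loaded rows and score each answer in Weave."""
    weave.init("ashai-eval")  # placeholder project name
    evaluation = weave.Evaluation(dataset=rows, scorers=[guidance_score])
    return asyncio.run(evaluation.evaluate(agent_model))
```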
## ✅ Quality Guidelines

### For New Test Cases

When adding new test cases to any dataset:

- Include all required fields
- Write clear `evaluation.guidance` with specific Do and Don'ts
- Set appropriate `evaluation.risk_level` based on medical urgency
- Use authentic language for non-English cases
- Include relevant cultural context in the `profile`
- Ensure medical accuracy in scenario descriptions
### For Evaluator Notes
- Be specific about expected medical advice
- Include red flags that should trigger urgent care recommendations
- Specify what should NOT be recommended
- Consider cultural sensitivity requirements
- Provide clear success criteria for AI responses
## 🔄 Versioning

Dataset changes are tracked via the `EVALUATION_VERSION` constant in the evaluation system (see `batch_set.py`). When datasets are modified:

- Update dataset files
- Increment `EVALUATION_VERSION` in `batch_set.py` (done manually when ready)
- Document changes in evaluation learnings
- Re-run evaluations (`./batch_eval`) to establish new baselines
## 📊 Dataset Statistics
| Dataset | Cases | Languages | Risk Levels |
|---|---|---|---|
| Cultural | 3 | 3 | 2 |
| General | 4 | 2 | 3 |
| Prenatal | 5 | 4 | 3 |
| Total | 12 | 4 | 3 |