# ASHAI Evaluation Dataset Format Documentation
This document describes the format and structure of the evaluation datasets used in the ASHAI medical AI evaluation system.
## 📂 Dataset Location

All evaluation datasets are located in: `evaluate/data/`
## 📋 Available Datasets

| Dataset | File | Purpose | Cases | Languages |
|---|---|---|---|---|
| General Medical | `general.json` | Common health questions and scenarios | 47 | English, Indonesian |
| Prenatal Care | `prenatal.json` | Pregnancy-related health questions | 58 | English, Hindi, Nepali, Indonesian |
| Cultural/Religious | `cultural.json` | Culturally sensitive medical scenarios | 36 | Hindi, Nepali, English |
## 📄 Dataset Structure

Each dataset is a JSON array containing test case objects. Each test case follows this schema:

### Required Fields
```json
{
  "messages": [
    { "role": "user", "content": "Patient's question or concern (original language)" }
  ],
  "profile": "Patient profile information",
  "question": "English one-line summary of what the user is asking, incorporating key profile constraints",
  "evaluation": {
    "language": "Language code (English, Hindi, Nepali, Indonesian, etc.)",
    "risk_level": "low|medium|high",
    "guidance": "Clinician-style Do and Don't guidance on what a good answer should and should not include"
  }
}
```
### Field Descriptions

#### messages

- Type: Array of message objects
- Purpose: Contains the conversation between user and AI
- Format: Each message has `role` ("user" or "assistant") and `content` (the message text)
- Example:
#### question

- Type: String
- Purpose: English one-liner capturing the user's ask, integrating relevant profile context
- Usage: For quick identification and for evaluators to understand the intended query regardless of original language
- Example: "A patient with a two-week cough asks how concerned to be and what to do next."
#### profile
- Type: String (formatted text block)
- Purpose: Simulated patient demographic and medical history
- Format: Multi-line string with Name, Location, Language, Category, Patient History
- Example:
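  An invented profile block for illustration; the field names follow the Format description above, but the person and details are placeholders, not dataset content:

```text
Name: Anjali Thapa
Location: Pokhara, Nepal
Language: Nepali
Category: Prenatal
Patient History: 24 years old, 30 weeks pregnant with her first child, mild anemia noted at the last antenatal visit, no other chronic conditions.
```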
#### evaluation

- Type: Object
- Fields:
  - `language` (String): Language of the user's original message
  - `risk_level` (String enum: `low|medium|high`): Medical urgency/triage level
  - `guidance` (String): Clinician guidance with explicit Do/Don't expectations for good answers
#### evaluation.guidance
- Type: String
- Purpose: Targeted medical guidance on what a high-quality answer should include and explicitly avoid
- Format: Use clear Do and Don't items
- Example:
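  An invented `guidance` string written in the Do/Don't format described above (not taken from the datasets):

```text
Do: recommend seeing a clinician since the cough has lasted two weeks; ask about red flags such as fever, blood in sputum, weight loss, and night sweats; suggest simple supportive care (fluids, rest) in the meantime.
Don't: name a specific diagnosis, recommend antibiotics without assessment, or dismiss the symptom as harmless.
```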
## 🎯 Dataset Categories

### General Medical (`general.json`)
- Focus: Common health questions and symptoms
- Risk Levels: Mix of low, medium, and high risk scenarios
- Examples:
- Persistent cough concerns
- Natural headache remedies
- Emergency symptoms (chest pain, shortness of breath)
- Sleep quality improvement
- Languages: Primarily English with some Indonesian cases
### Prenatal Care (`prenatal.json`)
- Focus: Pregnancy-related health questions and concerns
- Risk Levels: Covers routine pregnancy questions to emergency situations
- Examples:
- Pregnancy headaches and safe treatments
- Foods to avoid during pregnancy
- Exercise safety during pregnancy
- Emergency situations (bleeding, severe cramping)
- Gestational diabetes symptoms
- Languages: English, Hindi, Nepali, Indonesian
- Cultural Considerations: Includes vegetarian dietary considerations
### Cultural/Religious (`cultural.json`)
- Focus: Culturally and religiously sensitive medical scenarios
- Risk Levels: Primarily medium risk requiring nuanced responses
- Examples:
- Religious fasting with diabetes
- Traditional vs. modern medicine preferences
- Religious concerns about vaccines
- Languages: Hindi, Nepali, English
- Cultural Considerations: Requires respectful engagement with cultural and religious beliefs
## 🔄 Usage in Evaluation System

### Data Loading

Datasets are plain JSON arrays stored in `evaluate/data/*.json`. We create Weave datasets by passing rows through unchanged (no normalization wrappers). See `evaluate/load_datasets.py`.
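Because the rows are passed through unchanged, the loading step can be sketched in a few lines. The snippet below is illustrative only and is not the contents of `evaluate/load_datasets.py`; the project name and function names are placeholders:

```python
import json
from pathlib import Path

import weave

DATA_DIR = Path("evaluate/data")


def load_rows(name: str) -> list[dict]:
    """Read one dataset file (e.g. general.json) as a plain list of test-case dicts."""
    with open(DATA_DIR / f"{name}.json", encoding="utf-8") as f:
        return json.load(f)


def publish_dataset(name: str) -> weave.Dataset:
    """Wrap the raw rows in a Weave Dataset without normalizing or transforming them."""
    dataset = weave.Dataset(name=name, rows=load_rows(name))
    weave.publish(dataset)
    return dataset


if __name__ == "__main__":
    weave.init("ashai-eval")  # placeholder project name
    for name in ("general", "prenatal", "cultural"):
        publish_dataset(name)
```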
### Evaluation Process

- Cases are loaded from the JSON files
- Used directly to test AI agents with various configurations (no transformation)
- Results scored using `evaluation.guidance`
- Scores tracked in Weights & Biases Weave

Note: We do not use population/categories in the schema. Files are just convenient groupings of scenarios that test different aspects.
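For orientation only, here is a schematic of how these pieces could be wired together with Weave's `Evaluation` API. It is not the project's actual evaluation code: `agent_model` and `guidance_score` are toy placeholders, and the real judging of answers against `evaluation.guidance` is not reproduced here. `rows` is a list of test-case dicts as produced by the loader above.

```python
import asyncio

import weave


@weave.op()
def agent_model(messages: list[dict], profile: str) -> str:
    """Toy stand-in for the AI agent under test; the real agent is called here instead."""
    return "A cough lasting more than two weeks should be checked by a clinician."


@weave.op()
def guidance_score(evaluation: dict, output: str) -> dict:
    """Toy scorer: crude word overlap between the answer and evaluation.guidance.

    The real system scores answers against the guidance; that logic is not shown here.
    """
    guidance_words = set(evaluation["guidance"].lower().split())
    answer_words = set(output.lower().split())
    overlap = len(guidance_words & answer_words) / max(len(guidance_words), 1)
    return {"guidance_overlap": overlap}


def run_evaluation(rows: list[dict]) -> dict:
    """Run the toy agent over the loaded rows and score each answer in Weave."""
    weave.init("ashai-eval")  # placeholder project name
    evaluation = weave.Evaluation(dataset=rows, scorers=[guidance_score])
    return asyncio.run(evaluation.evaluate(agent_model))
```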
## ✅ Quality Guidelines

### For New Test Cases

When adding new test cases to any dataset:

- Include all required fields
- Write clear `evaluation.guidance` with specific Do and Don'ts
- Set appropriate `evaluation.risk_level` based on medical urgency
- Use authentic language for non-English cases
- Include relevant cultural context in the `profile`
- Ensure medical accuracy in scenario descriptions
### For Evaluator Notes
- Be specific about expected medical advice
- Include red flags that should trigger urgent care recommendations
- Specify what should NOT be recommended
- Consider cultural sensitivity requirements
- Provide clear success criteria for AI responses
## 🔄 Versioning

Dataset changes are tracked via the `EVALUATION_VERSION` constant in the evaluation system (see `batch_set.py`). When datasets are modified:

- Update dataset files
- Increment `EVALUATION_VERSION` in `batch_set.py` (done manually when ready)
- Document changes in evaluation learnings
- Re-run evaluations (`./batch_eval`) to establish new baselines
## 📊 Dataset Statistics
| Dataset | Cases | Languages | Risk Levels |
|---|---|---|---|
| Cultural | 3 | 3 | 2 |
| General | 4 | 2 | 3 |
| Prenatal | 5 | 4 | 3 |
| Total | 12 | 4 | 3 |