ASHAI Evaluation System Documentation¶
🏆 View Live Evaluation Leaderboard → | 🚀 Try the Agents Being Evaluated →
Complete guide for running evaluations of ASHAI agents using the `./eval` and `batch_eval_launcher.sh` scripts with Weights & Biases Weave tracking.
📁 Evaluation System Overview¶
The evaluation system consists of:
- `./eval` - Main CLI tool for single agent evaluations with full parameter control
- `batch_eval_launcher.sh` - RECOMMENDED: Complete batch evaluation launcher that handles server startup, readiness checks, and cleanup
- `batch_eval_async.py` - Batch runner for comprehensive multi-agent/multi-model comparisons
- `evaluate/evaluators/` - Modular Weave-compatible scorers with medical evaluation criteria
- `evaluate/weave_datasets.py` - Test datasets (prenatal, general, cultural scenarios)
- `evaluate/agent_realtime_evaluator.py` - Real-time evaluator used by API endpoints
- `evaluate/evaluator_stats_service.py` - Weave data fetching utilities
- The Weave dashboard is the source of truth for the leaderboard and run details
📚 Evaluation Learnings Log - A critical resource documenting historical insights, improvements, and lessons learned from evaluation experiments. This log captures key findings about model performance, evaluation methodology improvements, and evolving best practices.
🚀 Quick Start¶
Prerequisites¶
- Environment Setup: Make sure you have:
  - `OPENAI_API_KEY` in your environment or `.env` file
  - `WEAVE_PROJECT` set (defaults to "ashai-medical-ai")
  - `WANDB_ENTITY` set (optional, for dashboard links)
- Virtual environment activated (`source venv/bin/activate`)
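For a concrete starting point, a minimal shell setup might look like the sketch below. The values are placeholders; only `OPENAI_API_KEY` is strictly required.

```bash
# Minimal environment setup sketch -- the values below are placeholders
cat >> .env << 'EOF'
OPENAI_API_KEY=sk-your-key-here
WEAVE_PROJECT=ashai-medical-ai
WANDB_ENTITY=my-team
EOF

# Activate the virtual environment before running any evals
source venv/bin/activate
```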
Configure Evals¶
- Set `EVALUATION_VERSION` and `RUNS` in `batch_set.py` to choose which agents/models/tools to evaluate and to manage evaluation versioning.
- For details and examples, see the section below: Running a Single Eval
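To double-check what is currently configured before launching a batch, a quick inspection from the shell works (a sketch, assuming `batch_set.py` sits at the repository root):

```bash
# Show the current evaluation version and run matrix (path assumed to be the repo root)
grep -nE "EVALUATION_VERSION|RUNS" batch_set.py
```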
Run Batch Evals (recommended)¶
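A minimal launch command (the same invocation covered in Comprehensive Testing below):

```bash
# Quick batch evaluation across all configured agents/models
./batch_eval_launcher.sh --quick
```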
- For quick mode, concurrency, and other options, see: Comprehensive Testing

📊 Evaluation Options¶
Available Agents¶
- `ashai` - Main medical AI agent with flexible tool support
- `september` - September Health Library specialized agent
- `strict-referenced` - Evidence-only agent requiring search validation
- `strict-referenced-after` - Updated strict-referenced agent
- `perplexity` - Perplexity Medical agent for real-time web sources
- `one-shot` - One-shot agent for comparison
- `ashai-experiment` - Experimental ASHAI variant
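If you want to compare all of these by hand rather than via the batch launcher, a quick shell loop over `./eval` is one option (a sketch; `batch_eval_launcher.sh` automates this for you):

```bash
# Run a quick evaluation of every agent in turn
for agent in ashai september strict-referenced strict-referenced-after perplexity one-shot ashai-experiment; do
  ./eval "$agent" --quick
done
```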
./eval Command Line Options¶
| Parameter | Type | Description | Example |
|---|---|---|---|
| `agent` | Required | Agent to evaluate | `./eval ashai` |
| `--model` | String | Override model (gpt-4o, gpt-4o-mini, gpt-5, etc.) | `--model gpt-5` |
| `--tools` | List | Override tools or specify "none" | `--tools search_perplexity`, `--tools none` |
| `--cases` | Integer | Limit number of test cases | `--cases 5` |
| `--quick` | Flag | Run only 3 cases for fast feedback | `--quick` |
| `--no-reasoning` | Flag | Disable `<thinking>` reasoning mode | `--no-reasoning` |
| `--eval-retry` | Flag | Enable self-evaluation and retry on low scores | `--eval-retry` |
| `--verbose` | Flag | Show full agent responses for debugging | `--verbose` |
| `--weave-project` | String | Override Weave project name | `--weave-project my-project` |
| `--wandb-entity` | String | Override W&B entity for dashboard links | `--wandb-entity my-team` |
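These flags compose. For example, a hypothetical debugging run that pins the model, restricts the tools, and limits the case count might look like:

```bash
# Illustrative combined run: specific model, single tool, small sample, full output
./eval ashai --model gpt-5 --tools search_perplexity --cases 5 --verbose
```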
batch_eval_launcher.sh Options¶
The `batch_eval_launcher.sh` script handles the complete evaluation lifecycle:
| Parameter | Description | Default |
|---|---|---|
| `--quick` | Run quick (3-case) evaluations for all configurations | false |
| `-j, --jobs N` | Set concurrency level | 4 |
| `--weave-project NAME` | Set Weave project name | ashai-medical-ai |
| `--keep-server` | Keep server running after evals complete | false |
| `-h, --help` | Show help message | - |
What `batch_eval_launcher.sh` tests:

- All agents (ashai, september, strict-referenced, strict-referenced-after, perplexity, one-shot, ashai-experiment)
- Multiple models (gpt-4o-mini, gpt-5)
- Reasoning vs no-reasoning comparisons
- Tool configuration variations (none, perplexity, default tools)
- Self-evaluation retry testing
Server Configuration (Automatic):
- Workers: 4 (configurable via `ASHAI_WORKERS`)
- Concurrency limit: 128 (configurable via `ASHAI_LIMIT_CONCURRENCY`)
- Backlog: 512 (configurable via `ASHAI_BACKLOG`)
- Direct mode: Enabled (`EVAL_DIRECT=1`)
- Direct tools: Enabled (`EVAL_DIRECT_TOOLS=1`)
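If the defaults don't fit your machine, the variables above can be overridden inline when launching (a sketch, assuming the launcher picks them up from the environment; the values are illustrative):

```bash
# Override the automatic server settings for a larger machine
ASHAI_WORKERS=8 ASHAI_LIMIT_CONCURRENCY=256 ASHAI_BACKLOG=1024 ./batch_eval_launcher.sh -j 8
```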
📈 Weights & Biases Weave Integration¶
How Tracking Works¶
The evaluation system uses Weights & Biases Weave for comprehensive tracking and analysis:
- Data Storage: All evaluation data is stored in Weave, not locally
- Dashboard Access: View detailed results at https://wandb.ai/
- Leaderboard Generation: The homepage leaderboard is pulled from Weave data
- Versioning: `EVALUATION_VERSION` ensures fair comparisons when datasets/evaluators change
Weave Project Structure¶
- Default Project: `ashai-medical-ai` (configurable via the `WEAVE_PROJECT` env var or `--weave-project`)
- Evaluation Naming: `{agent}_{model}_{version}_{cases}rows{suffix}`
- Raw Data: All agent responses, scores, and timing data are stored in Weave

Note: The local cached leaderboard (`evaluate/leaderboard.json`) has been removed.
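The default project can be overridden either way; for instance, pointing a one-off run at a scratch project could look like this (the project name is a placeholder):

```bash
# Same effect two ways: env var or CLI flag
WEAVE_PROJECT=my-project ./eval ashai --quick
./eval ashai --quick --weave-project my-project
```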
Accessing Your Data¶
View in Weave Dashboard:
# After running evaluation, check the console output for direct link
./eval ashai --quick
# Output includes: "View results: https://wandb.ai/your-entity/ashai-medical-ai/weave"
Homepage Leaderboard:

- Automatically updated with latest evaluation results
- Shows aggregated scores across all configurations
- Links to full Weave evaluation details
📊 Evaluation Metrics¶
Scoring System (Updated V8)¶
The medical evaluation uses AI-powered scoring across four weighted dimensions:
Dimension | Weight | Description |
---|---|---|
Medical Accuracy | 45% | Correctness of medical info, evidence-based responses, proper disclaimers |
Precision | 25% | Direct answers to specific questions, relevance to patient concerns |
Language Clarity | 20% | Clear communication, appropriate terminology, cultural sensitivity |
Empathy Score | 10% | Patient-centered care, emotional acknowledgment, professional tone |
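As a rough illustration (assuming the overall score is a weighted sum of the four dimensions, which the weights above suggest), a response scoring 0.90 on accuracy, 0.80 on precision, 0.85 on clarity, and 0.70 on empathy would come out to 0.45×0.90 + 0.25×0.80 + 0.20×0.85 + 0.10×0.70 = 0.845.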
📋 Detailed Dataset Documentation - Complete format specification, field descriptions, and usage guidelines
Sample Size:
- Quick mode: 3 cases (fast development feedback)
- Full mode: ~12 cases (comprehensive evaluation)
- Custom: `--cases N` for specific testing needs
🎯 Usage Patterns¶
Running a Single Eval¶
Use `./eval` for detailed control when testing a single agent. See examples below.
Development Iteration:
# Quick validation after code changes
./eval ashai --quick
# Test specific model performance
./eval ashai --model gpt-5 --quick
# Debug with full output
./eval ashai --verbose --cases 2
Model Comparison:
# Compare models on same agent
./eval ashai --model gpt-4o-mini
./eval ashai --model gpt-5
./eval ashai --model gpt-4o
# Test reasoning impact
./eval ashai --model gpt-5 --no-reasoning
./eval ashai --model gpt-5 # with reasoning (default)
Tool Configuration Testing:
# Test without tools
./eval ashai --tools none
# Test specific tool combinations
./eval ashai --tools search_september
./eval ashai --tools search_perplexity
./eval ashai --tools search_september search_perplexity
# Test all available tools
./eval ashai # uses default tools
Advanced Features:
# Self-evaluation and retry
./eval ashai --eval-retry --quick
# Custom evaluation run
# (Version management now handled via EVALUATION_VERSION in code)
Comprehensive Testing (batch_eval_launcher.sh)¶
Quick Development Feedback:
# Quick comprehensive check (faster, ~15-30 minutes)
./batch_eval_launcher.sh --quick
# Quick with higher concurrency
./batch_eval_launcher.sh --quick -j 8
Full System Evaluation:
# Complete evaluation suite (slow, ~1-2 hours)
./batch_eval_launcher.sh
# Full with maximum concurrency
./batch_eval_launcher.sh -j 8
Custom Configurations:
# Custom Weave project
./batch_eval_launcher.sh --quick -j 4 --weave-project my-project
# Keep server running for additional testing
./batch_eval_launcher.sh --quick -j 4 --keep-server
What `batch_eval_launcher.sh` covers:

- All 7 agents × multiple models
- Tool configuration variations
- Reasoning vs no-reasoning comparisons
- Self-evaluation retry testing
💡 Best Practices¶
Development Workflow¶
Daily Development:
# Quick validation after code changes
./eval ashai --quick
# Debug specific issues
./eval ashai --verbose --cases 2
Model Experimentation:
# Test new model quickly
./eval ashai --model gpt-5 --quick
# Full comparison when promising
./eval ashai --model gpt-5
Weekly Comprehensive Review:
# Quick comprehensive evaluation
./batch_eval_launcher.sh --quick
# Full system evaluation (when needed)
./batch_eval_launcher.sh -j 8
Performance Optimization¶
Concurrency Guidelines:
- Development: Use `-j 2-4` for quick feedback
- Testing: Use `-j 4-6` for balanced performance
- Production: Use `-j 6-8` for maximum throughput
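Mapped onto the launcher, those tiers translate roughly to the following invocations (illustrative; pick the `-j` value that matches your hardware):

```bash
./batch_eval_launcher.sh --quick -j 2   # development: fast feedback
./batch_eval_launcher.sh --quick -j 4   # testing: balanced performance
./batch_eval_launcher.sh -j 8           # production: maximum throughput
```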
Server Configuration:
The launcher automatically sets an optimized server configuration:

- `ASHAI_NO_RELOAD=1` - Stability during long runs
- `ASHAI_WORKERS=4` - Multiple worker processes
- `ASHAI_LIMIT_CONCURRENCY=128` - High concurrency limit
- `ASHAI_BACKLOG=512` - Large connection backlog
- `EVAL_DIRECT=1` - Direct agent calls (faster)
- `EVAL_DIRECT_TOOLS=1` - Direct tool calls (faster)
Evaluation Version Management¶
When to Update Version:

- Changed evaluation criteria or weights
- Updated test dataset
- Modified scoring logic
📈 Understanding Results¶
Accessing Your Data¶
- Homepage Leaderboard: Quick overview at your app's root URL
- Weave Dashboard: Detailed analysis at wandb.ai
- Console Output: Direct links printed after each evaluation
Key Performance Indicators¶
Look for these patterns:

- Model Performance: GPT-5 typically best quality, GPT-4o-mini best speed/cost ratio
- Tool Impact: Tools generally improve medical accuracy but increase latency
- Reasoning Effect: Reasoning typically improves quality but increases response time
- Agent Specialization: September excels at general health, strict-referenced at evidence requirements
Troubleshooting¶
Common Issues:
| Issue | Solution |
|---|---|
| Virtual environment not found | Run `python -m venv venv && source venv/bin/activate && pip install -r requirements.txt` |
| `OPENAI_API_KEY` not found | Set it in the `.env` file or environment |
| Weave connection failed | Check `WEAVE_PROJECT` and internet connection |
| Server failed to start | Check `server.log` for details |
| GPT-5 timeouts | Use the `--quick` flag or increase the timeout in code |
| Evaluation takes too long | Use `--quick` or adjust concurrency with `-j` |
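A quick preflight check before a long run can catch most of these issues up front (a sketch; file locations are assumed to match the defaults above):

```bash
# Preflight sanity checks (assumes default locations: venv/, .env, server.log)
[ -d venv ] && echo "venv: ok" || echo "venv: missing -- run python -m venv venv"
grep -q "OPENAI_API_KEY" .env 2>/dev/null && echo "OPENAI_API_KEY: set in .env" || echo "OPENAI_API_KEY: check your environment"
[ -f server.log ] && tail -n 20 server.log   # recent server output, if a previous run left a log
```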
Performance Tips:
- Use `--quick` for development iteration
- Use `./batch_eval_launcher.sh --quick` for comprehensive but faster evaluation
- Use `--verbose` only for debugging (it slows down evaluation)
- GPT-5 is slow but highest quality - use it sparingly
- Higher concurrency (`-j 8`) can speed up batch evaluations
Legacy Scripts¶
Previous Approaches (No Longer Recommended):
- `./batch_eval` - Old batch evaluation script (deprecated)
- `./eval_run.sh` - Legacy server launcher (superseded by `batch_eval_launcher.sh`)
- `./batch_eval_server.sh` - Legacy server configuration (functionality now in launcher)
Current Recommended Workflow:
# Single agent testing
./eval ashai --quick
# Comprehensive batch evaluation
./batch_eval_launcher.sh --quick -j 4
📚 Key Resources¶
Essential Documentation¶
- Evaluation Learnings Log - Critical Resource: Historical insights, methodology improvements, and lessons learned from evaluation experiments. Essential reading for understanding evaluation evolution and best practices.
- Dataset Format Documentation - Complete specification of evaluation dataset structure and format
- Versioning Guide - Managing evaluation versions and compatibility
Additional Resources¶
- Agent Documentation - Understanding each agent's capabilities
- Voice Interfaces - Voice interaction options
Start with the Learnings Log
The Evaluation Learnings Log is an invaluable resource that captures years of evaluation experience, including what works, what doesn't, and why. It's highly recommended reading before running large-scale evaluations or making changes to the evaluation system.