ASHAI Evaluation System Documentation

🏆 View Live Evaluation Leaderboard → | 🚀 Try the Agents Being Evaluated →

Complete guide for running evaluations of ASHAI agents using the ./eval and batch_eval_launcher.sh scripts with Weights & Biases Weave tracking.

📁 Evaluation System Overview

The evaluation system consists of:

  • ./eval - Main CLI tool for single agent evaluations with full parameter control
  • batch_eval_launcher.sh - RECOMMENDED: Complete batch evaluation launcher that handles server startup, readiness checks, and cleanup
  • batch_eval_async.py - Batch runner for comprehensive multi-agent/multi-model comparisons
  • evaluate/evaluators/ - Modular Weave-compatible scorers with medical evaluation criteria
  • evaluate/weave_datasets.py - Test datasets (prenatal, general, cultural scenarios)
  • evaluate/agent_realtime_evaluator.py - Real-time evaluator used by API endpoints
  • Weave dashboard is the source of truth for leaderboard and run details
  • evaluate/evaluator_stats_service.py - Weave data fetching utilities

📚 Evaluation Learnings Log - A critical resource documenting historical insights, improvements, and lessons learned from evaluation experiments. This log captures key findings about model performance, evaluation methodology improvements, and evolving best practices.

🚀 Quick Start

Prerequisites

Environment Setup: Make sure you have the following (see the example below):

  • OPENAI_API_KEY in your environment or .env file
  • WEAVE_PROJECT set (defaults to "ashai-medical-ai")
  • WANDB_ENTITY set (optional, for dashboard links)
  • Virtual environment activated (source venv/bin/activate)
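
For example, a minimal shell setup might look like this (the values below are placeholders; adjust them to your environment):

source venv/bin/activate
export OPENAI_API_KEY="sk-..."            # required
export WEAVE_PROJECT="ashai-medical-ai"   # optional; this is the default
export WANDB_ENTITY="my-team"             # optional; used for dashboard links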

Configure Evals

  • Set EVALUATION_VERSION and RUNS in batch_set.py to choose which agents/models/tools to evaluate and to manage evaluation versioning (see the sketch below).
  • For details and examples, see the section below: Running a Single Eval
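
A rough sketch of what this configuration in batch_set.py might look like; the RUNS field names below are illustrative assumptions, not the actual schema:

# batch_set.py (illustrative sketch; field names are assumptions)
EVALUATION_VERSION = "v2"  # bump when datasets, evaluators, or weights change

# One entry per evaluation run: which agent, model, and tools to use.
RUNS = [
    {"agent": "ashai", "model": "gpt-5", "tools": ["search_perplexity"], "reasoning": True},
    {"agent": "september", "model": "gpt-4o-mini", "tools": None, "reasoning": False},
]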

Run the Batch

./batch_eval_launcher.sh

For quick mode, concurrency, and other options, see the section below: Comprehensive Testing (batch_eval_launcher.sh).

📊 Evaluation Options

Available Agents

  • ashai - Main medical AI agent with flexible tool support
  • september - September Health Library specialized agent
  • strict-referenced - Evidence-only agent requiring search validation
  • strict-referenced-after - Updated strict-referenced agent
  • perplexity - Perplexity Medical agent for real-time web sources
  • one-shot - One-shot agent for comparison
  • ashai-experiment - Experimental ASHAI variant

./eval Command Line Options

Parameter | Type | Description | Example
agent | Required | Agent to evaluate | ./eval ashai
--model | String | Override model (gpt-4o, gpt-4o-mini, gpt-5, etc.) | --model gpt-5
--tools | List | Override tools or specify "none" | --tools search_perplexity none
--cases | Integer | Limit number of test cases | --cases 5
--quick | Flag | Run only 3 cases for fast feedback | --quick
--no-reasoning | Flag | Disable <thinking> reasoning mode | --no-reasoning
--eval-retry | Flag | Enable self-evaluation and retry on low scores | --eval-retry
--verbose | Flag | Show full agent responses for debugging | --verbose
--weave-project | String | Override Weave project name | --weave-project my-project
--wandb-entity | String | Override W&B entity for dashboard links | --wandb-entity my-team
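
These flags can be combined in a single invocation; for example (an illustrative run, not a prescribed configuration):

# Evaluate ashai on gpt-5 with only Perplexity search, 5 cases, full output for debugging
./eval ashai --model gpt-5 --tools search_perplexity --cases 5 --verbose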

batch_eval_launcher.sh Options

The batch_eval_launcher.sh script handles the complete evaluation lifecycle:

Parameter | Description | Default
--quick | Run quick (3-case) evaluations for all configurations | false
-j, --jobs N | Set concurrency level | 4
--weave-project NAME | Set Weave project name | ashai-medical-ai
--keep-server | Keep server running after evals complete | false
-h, --help | Show help message | -
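
The options can also be combined; for example (illustrative):

# Quick pass at concurrency 8, logged to a custom Weave project, keeping the server up afterwards
./batch_eval_launcher.sh --quick -j 8 --weave-project my-project --keep-server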

What batch_eval_launcher.sh tests:

  • All agents (ashai, september, strict-referenced, strict-referenced-after, perplexity, one-shot, ashai-experiment)
  • Multiple models (gpt-4o-mini, gpt-5)
  • Reasoning vs no-reasoning comparisons
  • Tool configuration variations (none, perplexity, default tools)
  • Self-evaluation retry testing

Server Configuration (Automatic):

  • Workers: 4 (configurable via ASHAI_WORKERS)
  • Concurrency limit: 128 (configurable via ASHAI_LIMIT_CONCURRENCY)
  • Backlog: 512 (configurable via ASHAI_BACKLOG)
  • Direct mode: Enabled (EVAL_DIRECT=1)
  • Direct tools: Enabled (EVAL_DIRECT_TOOLS=1)
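
Since each setting is configurable via its environment variable, the defaults can presumably be overridden when launching; for example (values are illustrative, not recommendations):

# Override the automatic server settings for this run
ASHAI_WORKERS=8 ASHAI_LIMIT_CONCURRENCY=256 ASHAI_BACKLOG=1024 ./batch_eval_launcher.sh --quick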

📈 Weights & Biases Weave Integration

How Tracking Works

The evaluation system uses Weights & Biases Weave for comprehensive tracking and analysis:

  1. Data Storage: All evaluation data is stored in Weave, not locally
  2. Dashboard Access: View detailed results at https://wandb.ai/
  3. Leaderboard Generation: Homepage leaderboard is pulled from Weave data
  4. Versioning: EVALUATION_VERSION ensures fair comparisons when datasets/evaluators change

Weave Project Structure

  • Default Project: ashai-medical-ai (configurable via WEAVE_PROJECT env var or --weave-project)
  • Evaluation Naming: {agent}_{model}_{version}_{cases}rows{suffix}
  • Raw Data: All agent responses, scores, timing data stored in Weave
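
As a reference for reading the dashboard, evaluation names follow the convention above; a small sketch (the suffix value is illustrative):

# Sketch of the {agent}_{model}_{version}_{cases}rows{suffix} naming convention
agent, model, version, cases, suffix = "ashai", "gpt-5", "v2", 3, "_quick"
evaluation_name = f"{agent}_{model}_{version}_{cases}rows{suffix}"
print(evaluation_name)  # ashai_gpt-5_v2_3rows_quick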

Note: Local cached leaderboard (evaluate/leaderboard.json) has been removed.

Accessing Your Data

View in Weave Dashboard:

# After running evaluation, check the console output for direct link
./eval ashai --quick
# Output includes: "View results: https://wandb.ai/your-entity/ashai-medical-ai/weave"

Homepage Leaderboard:

  • Automatically updated with latest evaluation results
  • Shows aggregated scores across all configurations
  • Links to full Weave evaluation details

📊 Evaluation Metrics

Scoring System (Updated V8)

The medical evaluation uses AI-powered scoring across four weighted dimensions:

Dimension | Weight | Description
Medical Accuracy | 45% | Correctness of medical info, evidence-based responses, proper disclaimers
Precision | 25% | Direct answers to specific questions, relevance to patient concerns
Language Clarity | 20% | Clear communication, appropriate terminology, cultural sensitivity
Empathy Score | 10% | Patient-centered care, emotional acknowledgment, professional tone
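
To make the weighting concrete, here is a minimal sketch of how the four dimension scores could be combined into an overall score (the real scorers live in evaluate/evaluators/ and may compute this differently):

# Illustrative weighted composite using the V8 weights above (scores assumed on a 0-1 scale)
WEIGHTS = {
    "medical_accuracy": 0.45,
    "precision": 0.25,
    "language_clarity": 0.20,
    "empathy": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimension scores."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

example = {"medical_accuracy": 0.9, "precision": 0.8, "language_clarity": 0.85, "empathy": 0.7}
print(round(composite_score(example), 3))  # 0.845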

📋 Detailed Dataset Documentation - Complete format specification, field descriptions, and usage guidelines

Sample Size:

  • Quick mode: 3 cases (fast development feedback)
  • Full mode: ~12 cases (comprehensive evaluation)
  • Custom: --cases N for specific testing needs

🎯 Usage Patterns

Running a Single Eval

Use ./eval for detailed control when testing a single agent. See examples below.

Development Iteration:

# Quick validation after code changes  
./eval ashai --quick

# Test specific model performance
./eval ashai --model gpt-5 --quick

# Debug with full output
./eval ashai --verbose --cases 2

Model Comparison:

# Compare models on same agent
./eval ashai --model gpt-4o-mini
./eval ashai --model gpt-5
./eval ashai --model gpt-4o

# Test reasoning impact
./eval ashai --model gpt-5 --no-reasoning
./eval ashai --model gpt-5  # with reasoning (default)

Tool Configuration Testing:

# Test without tools
./eval ashai --tools none

# Test specific tool combinations  
./eval ashai --tools search_september
./eval ashai --tools search_perplexity
./eval ashai --tools search_september search_perplexity

# Test all available tools
./eval ashai  # uses default tools

Advanced Features:

# Self-evaluation and retry
./eval ashai --eval-retry --quick

# Custom evaluation run
# (Version management now handled via EVALUATION_VERSION in code)

Comprehensive Testing (batch_eval_launcher.sh)

Quick Development Feedback:

# Quick comprehensive check (faster, ~15-30 minutes)  
./batch_eval_launcher.sh --quick

# Quick with higher concurrency
./batch_eval_launcher.sh --quick -j 8

Full System Evaluation:

# Complete evaluation suite (slow, ~1-2 hours)
./batch_eval_launcher.sh

# Full with maximum concurrency
./batch_eval_launcher.sh -j 8

Custom Configurations:

# Custom Weave project
./batch_eval_launcher.sh --quick -j 4 --weave-project my-project

# Keep server running for additional testing
./batch_eval_launcher.sh --quick -j 4 --keep-server

What batch_eval_launcher.sh covers:

  • All 7 agents × multiple models
  • Tool configuration variations
  • Reasoning vs no-reasoning comparisons
  • Self-evaluation retry testing

💡 Best Practices

Development Workflow

Daily Development:

# Quick validation after code changes
./eval ashai --quick

# Debug specific issues  
./eval ashai --verbose --cases 2

Model Experimentation:

# Test new model quickly
./eval ashai --model gpt-5 --quick

# Full comparison when promising
./eval ashai --model gpt-5

Weekly Comprehensive Review:

# Quick comprehensive evaluation
./batch_eval_launcher.sh --quick

# Full system evaluation (when needed)
./batch_eval_launcher.sh -j 8

Performance Optimization

Concurrency Guidelines:

  • Development: Use -j 2-4 for quick feedback
  • Testing: Use -j 4-6 for balanced performance
  • Production: Use -j 6-8 for maximum throughput

Server Configuration: The launcher automatically sets optimal server configuration:

  • ASHAI_NO_RELOAD=1 - Stability during long runs
  • ASHAI_WORKERS=4 - Multiple worker processes
  • ASHAI_LIMIT_CONCURRENCY=128 - High concurrency limit
  • ASHAI_BACKLOG=512 - Large connection backlog
  • EVAL_DIRECT=1 - Direct agent calls (faster)
  • EVAL_DIRECT_TOOLS=1 - Direct tool calls (faster)

Evaluation Version Management

When to Update Version:

  • Changed evaluation criteria or weights
  • Updated test dataset
  • Modified scoring logic

# In batch_set.py, increment EVALUATION_VERSION:
EVALUATION_VERSION = "v2"  # Was v1

📈 Understanding Results

Accessing Your Data

  1. Homepage Leaderboard: Quick overview at your app's root URL
  2. Weave Dashboard: Detailed analysis at wandb.ai
  3. Console Output: Direct links printed after each evaluation

Key Performance Indicators

Look for these patterns:

  • Model Performance: GPT-5 typically best quality, GPT-4o-mini best speed/cost ratio
  • Tool Impact: Tools generally improve medical accuracy but increase latency
  • Reasoning Effect: Reasoning typically improves quality but increases response time
  • Agent Specialization: September excels at general health, strict-referenced at evidence requirements

Troubleshooting

Common Issues:

Issue | Solution
Virtual environment not found | Run python -m venv venv && source venv/bin/activate && pip install -r requirements.txt
OPENAI_API_KEY not found | Set in .env file or environment
Weave connection failed | Check WEAVE_PROJECT and internet connection
Server failed to start | Check server.log for details
GPT-5 timeouts | Use --quick flag or increase timeout in code
Evaluation takes too long | Use --quick or adjust concurrency with -j

Performance Tips:

  • Use --quick for development iteration
  • Use ./batch_eval_launcher.sh --quick for comprehensive but faster evaluation
  • Use --verbose only for debugging (slows down evaluation)
  • GPT-5 is slow but highest quality - use sparingly
  • Higher concurrency (-j 8) can speed up batch evaluations

Legacy Scripts

Previous Approaches (No Longer Recommended):

  • ./batch_eval - Old batch evaluation script (deprecated)
  • ./eval_run.sh - Legacy server launcher (superseded by batch_eval_launcher.sh)
  • ./batch_eval_server.sh - Legacy server configuration (functionality now in launcher)

Current Recommended Workflow:

# Single agent testing
./eval ashai --quick

# Comprehensive batch evaluation
./batch_eval_launcher.sh --quick -j 4

📚 Key Resources

Essential Documentation

  • Evaluation Learnings Log - Critical Resource: Historical insights, methodology improvements, and lessons learned from evaluation experiments. Essential reading for understanding evaluation evolution and best practices.
  • Dataset Format Documentation - Complete specification of evaluation dataset structure and format
  • Versioning Guide - Managing evaluation versions and compatibility

Additional Resources

Start with the Learnings Log

The Evaluation Learnings Log is an invaluable resource that captures years of evaluation experience, including what works, what doesn't, and why. It's highly recommended reading before running large-scale evaluations or making changes to the evaluation system.