ASHAI Evaluation System Documentation

🏆 View Live Evaluation Leaderboard → | 🚀 Try the Agents Being Evaluated →

Complete guide for running evaluations of ASHAI agents using the ./eval and batch_eval_launcher.sh scripts with Weights & Biases Weave tracking.

📁 Evaluation System Overview

The evaluation system consists of:

  • ./eval - Main CLI tool for single agent evaluations with full parameter control
  • batch_eval_launcher.sh - RECOMMENDED: Complete batch evaluation launcher that handles server startup, readiness checks, and cleanup
  • batch_eval_async.py - Batch runner for comprehensive multi-agent/multi-model comparisons
  • evaluate/evaluators/ - Modular Weave-compatible scorers with medical evaluation criteria
  • evaluate/weave_datasets.py - Test datasets (prenatal, general, cultural scenarios)
  • evaluate/agent_realtime_evaluator.py - Real-time evaluator used by API endpoints
  • Weave dashboard is the source of truth for leaderboard and run details
  • evaluate/evaluator_stats_service.py - Weave data fetching utilities

📚 Evaluation Learnings Log - A critical resource documenting historical insights, improvements, and lessons learned from evaluation experiments. This log captures key findings about model performance, evaluation methodology improvements, and evolving best practices.

🚀 Quick Start

Prerequisites

Environment Setup: Make sure you have the following (see the example below):

  • OPENAI_API_KEY in your environment or .env file
  • WEAVE_PROJECT set (defaults to "ashai-medical-ai")
  • WANDB_ENTITY set (optional, for dashboard links)
  • Virtual environment activated (source venv/bin/activate)
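
For example, a minimal shell setup might look like this (the values below are placeholders; adjust them to your environment):

source venv/bin/activate
export OPENAI_API_KEY="sk-..."            # required
export WEAVE_PROJECT="ashai-medical-ai"   # optional; this is the default
export WANDB_ENTITY="my-team"             # optional; used for dashboard links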

Configure Evals

  • Set EVALUATION_VERSION and RUNS in batch_set.py to choose which agents/models/tools to evaluate and to manage evaluation versioning (see the sketch below).
  • For details and examples, see the section below: Running a Single Eval
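
A rough sketch of what this configuration in batch_set.py might look like; the RUNS field names below are illustrative assumptions, not the actual schema:

# batch_set.py (illustrative sketch; field names are assumptions)
EVALUATION_VERSION = "v2"  # bump when datasets, evaluators, or weights change

# One entry per evaluation run: which agent, model, and tools to use.
RUNS = [
    {"agent": "ashai", "model": "gpt-5", "tools": ["search_perplexity"], "reasoning": True},
    {"agent": "september", "model": "gpt-4o-mini", "tools": None, "reasoning": False},
]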

Run the Batch

./batch_eval_launcher.sh

For quick mode, concurrency, and other options, see the section below: Comprehensive Testing (batch_eval_launcher.sh).

📊 Evaluation Options

Available Agents

  • ashai - Main medical AI agent with flexible tool support
  • september - September Health Library specialized agent
  • strict-referenced - Evidence-only agent requiring search validation
  • strict-referenced-after - Updated strict-referenced agent
  • perplexity - Perplexity Medical agent for real-time web sources
  • one-shot - One-shot agent for comparison
  • ashai-experiment - Experimental ASHAI variant

./eval Command Line Options

Parameter | Type | Description | Example
agent | Required | Agent to evaluate | ./eval ashai
--model | String | Override model (gpt-4o, gpt-4o-mini, gpt-5, etc.) | --model gpt-5
--tools | List | Override tools or specify "none" | --tools search_perplexity none
--cases | Integer | Limit number of test cases | --cases 5
--quick | Flag | Run only 3 cases for fast feedback | --quick
--no-reasoning | Flag | Disable <thinking> reasoning mode | --no-reasoning
--eval-retry | Flag | Enable self-evaluation and retry on low scores | --eval-retry
--verbose | Flag | Show full agent responses for debugging | --verbose
--weave-project | String | Override Weave project name | --weave-project my-project
--wandb-entity | String | Override W&B entity for dashboard links | --wandb-entity my-team
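
These flags can be combined in a single invocation; for example (an illustrative run, not a prescribed configuration):

# Evaluate ashai on gpt-5 with only Perplexity search, 5 cases, full output for debugging
./eval ashai --model gpt-5 --tools search_perplexity --cases 5 --verbose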

batch_eval_launcher.sh Options

The batch_eval_launcher.sh script handles the complete evaluation lifecycle:

Parameter | Description | Default
--quick | Run quick (3-case) evaluations for all configurations | false
-j, --jobs N | Set concurrency level | 4
--weave-project NAME | Set Weave project name | ashai-medical-ai
--keep-server | Keep server running after evals complete | false
-h, --help | Show help message | -
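
The options can also be combined; for example (illustrative):

# Quick pass at concurrency 8, logged to a custom Weave project, keeping the server up afterwards
./batch_eval_launcher.sh --quick -j 8 --weave-project my-project --keep-server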

What batch_eval_launcher.sh tests:

  • All agents (ashai, september, strict-referenced, strict-referenced-after, perplexity, one-shot, ashai-experiment)
  • Multiple models (gpt-4o-mini, gpt-5)
  • Reasoning vs no-reasoning comparisons
  • Tool configuration variations (none, perplexity, default tools)
  • Self-evaluation retry testing

Server Configuration (Automatic):

  • Workers: 4 (configurable via ASHAI_WORKERS)
  • Concurrency limit: 128 (configurable via ASHAI_LIMIT_CONCURRENCY)
  • Backlog: 512 (configurable via ASHAI_BACKLOG)
  • Direct mode: Enabled (EVAL_DIRECT=1)
  • Direct tools: Enabled (EVAL_DIRECT_TOOLS=1)
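
Since each setting is configurable via its environment variable, the defaults can presumably be overridden when launching; for example (values are illustrative, not recommendations):

# Override the automatic server settings for this run
ASHAI_WORKERS=8 ASHAI_LIMIT_CONCURRENCY=256 ASHAI_BACKLOG=1024 ./batch_eval_launcher.sh --quick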

📈 Weights & Biases Weave Integration

How Tracking Works

The evaluation system uses Weights & Biases Weave for comprehensive tracking and analysis:

  1. Data Storage: All evaluation data is stored in Weave, not locally
  2. Dashboard Access: View detailed results at https://wandb.ai/
  3. Leaderboard Generation: Homepage leaderboard is pulled from Weave data
  4. Versioning: EVALUATION_VERSION ensures fair comparisons when datasets/evaluators change

Weave Project Structure

  • Default Project: ashai-medical-ai (configurable via WEAVE_PROJECT env var or --weave-project)
  • Evaluation Naming: {agent}_{model}_{version}_{cases}rows{suffix}
  • Raw Data: All agent responses, scores, timing data stored in Weave
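
As a reference for reading the dashboard, evaluation names follow the convention above; a small sketch (the suffix value is illustrative):

# Sketch of the {agent}_{model}_{version}_{cases}rows{suffix} naming convention
agent, model, version, cases, suffix = "ashai", "gpt-5", "v2", 3, "_quick"
evaluation_name = f"{agent}_{model}_{version}_{cases}rows{suffix}"
print(evaluation_name)  # ashai_gpt-5_v2_3rows_quick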

Note: Local cached leaderboard (evaluate/leaderboard.json) has been removed.

Accessing Your Data

View in Weave Dashboard:

# After running evaluation, check the console output for direct link
./eval ashai --quick
# Output includes: "View results: https://wandb.ai/your-entity/ashai-medical-ai/weave"

Homepage Leaderboard:

  • Automatically updated with latest evaluation results
  • Shows aggregated scores across all configurations
  • Links to full Weave evaluation details

📊 Evaluation Metrics

Scoring System (Updated V8)

The medical evaluation uses AI-powered scoring across four weighted dimensions:

Dimension | Weight | Description
Medical Accuracy | 45% | Correctness of medical info, evidence-based responses, proper disclaimers
Precision | 25% | Direct answers to specific questions, relevance to patient concerns
Language Clarity | 20% | Clear communication, appropriate terminology, cultural sensitivity
Empathy Score | 10% | Patient-centered care, emotional acknowledgment, professional tone
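
To make the weighting concrete, here is a minimal sketch of how the four dimension scores could be combined into an overall score (the real scorers live in evaluate/evaluators/ and may compute this differently):

# Illustrative weighted composite using the V8 weights above (scores assumed on a 0-1 scale)
WEIGHTS = {
    "medical_accuracy": 0.45,
    "precision": 0.25,
    "language_clarity": 0.20,
    "empathy": 0.10,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimension scores."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

example = {"medical_accuracy": 0.9, "precision": 0.8, "language_clarity": 0.85, "empathy": 0.7}
print(round(composite_score(example), 3))  # 0.845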

📋 Detailed Dataset Documentation - Complete format specification, field descriptions, and usage guidelines

Sample Size:

  • Quick mode: 3 cases (fast development feedback)
  • Full mode: ~12 cases (comprehensive evaluation)
  • Custom: --cases N for specific testing needs

🎯 Usage Patterns

Running a Single Eval

Use ./eval for detailed control when testing a single agent. See examples below.

Development Iteration:

# Quick validation after code changes  
./eval ashai --quick

# Test specific model performance
./eval ashai --model gpt-5 --quick

# Debug with full output
./eval ashai --verbose --cases 2

Model Comparison:

# Compare models on same agent
./eval ashai --model gpt-4o-mini
./eval ashai --model gpt-5
./eval ashai --model gpt-4o

# Test reasoning impact
./eval ashai --model gpt-5 --no-reasoning
./eval ashai --model gpt-5  # with reasoning (default)

Tool Configuration Testing:

# Test without tools
./eval ashai --tools none

# Test specific tool combinations  
./eval ashai --tools search_september
./eval ashai --tools search_perplexity
./eval ashai --tools search_september search_perplexity

# Test all available tools
./eval ashai  # uses default tools

Advanced Features:

# Self-evaluation and retry
./eval ashai --eval-retry --quick

# Custom evaluation run
# (Version management now handled via EVALUATION_VERSION in code)

Comprehensive Testing (batch_eval_launcher.sh)

Quick Development Feedback:

# Quick comprehensive check (faster, ~15-30 minutes)  
./batch_eval_launcher.sh --quick

# Quick with higher concurrency
./batch_eval_launcher.sh --quick -j 8

Full System Evaluation:

# Complete evaluation suite (slow, ~1-2 hours)
./batch_eval_launcher.sh

# Full with maximum concurrency
./batch_eval_launcher.sh -j 8

Custom Configurations:

# Custom Weave project
./batch_eval_launcher.sh --quick -j 4 --weave-project my-project

# Keep server running for additional testing
./batch_eval_launcher.sh --quick -j 4 --keep-server

What batch_eval_launcher.sh covers:

  • All 7 agents × multiple models
  • Tool configuration variations
  • Reasoning vs no-reasoning comparisons
  • Self-evaluation retry testing

💡 Best Practices

Development Workflow

Daily Development:

# Quick validation after code changes
./eval ashai --quick

# Debug specific issues  
./eval ashai --verbose --cases 2

Model Experimentation:

# Test new model quickly
./eval ashai --model gpt-5 --quick

# Full comparison when promising
./eval ashai --model gpt-5

Weekly Comprehensive Review:

# Quick comprehensive evaluation
./batch_eval_launcher.sh --quick

# Full system evaluation (when needed)
./batch_eval_launcher.sh -j 8

Performance Optimization

Concurrency Guidelines:

  • Development: Use -j 2-4 for quick feedback
  • Testing: Use -j 4-6 for balanced performance
  • Production: Use -j 6-8 for maximum throughput

Server Configuration: The launcher automatically sets optimal server configuration:

  • ASHAI_NO_RELOAD=1 - Stability during long runs
  • ASHAI_WORKERS=4 - Multiple worker processes
  • ASHAI_LIMIT_CONCURRENCY=128 - High concurrency limit
  • ASHAI_BACKLOG=512 - Large connection backlog
  • EVAL_DIRECT=1 - Direct agent calls (faster)
  • EVAL_DIRECT_TOOLS=1 - Direct tool calls (faster)

Evaluation Version Management

When to Update Version:

  • Changed evaluation criteria or weights
  • Updated test dataset
  • Modified scoring logic

# In batch_set.py, increment EVALUATION_VERSION:
EVALUATION_VERSION = "v2"  # Was v1

📈 Understanding Results

Accessing Your Data

  1. Homepage Leaderboard: Quick overview at your app's root URL
  2. Weave Dashboard: Detailed analysis at wandb.ai
  3. Console Output: Direct links printed after each evaluation

Key Performance Indicators

Look for these patterns:

  • Model Performance: GPT-5 typically best quality, GPT-4o-mini best speed/cost ratio
  • Tool Impact: Tools generally improve medical accuracy but increase latency
  • Reasoning Effect: Reasoning typically improves quality but increases response time
  • Agent Specialization: September excels at general health, strict-referenced at evidence requirements

Troubleshooting

Common Issues:

Issue | Solution
Virtual environment not found | Run python -m venv venv && source venv/bin/activate && pip install -r requirements.txt
OPENAI_API_KEY not found | Set in .env file or environment
Weave connection failed | Check WEAVE_PROJECT and internet connection
Server failed to start | Check server.log for details
GPT-5 timeouts | Use --quick flag or increase timeout in code
Evaluation takes too long | Use --quick or adjust concurrency with -j

Performance Tips:

  • Use --quick for development iteration
  • Use ./batch_eval_launcher.sh --quick for comprehensive but faster evaluation
  • Use --verbose only for debugging (slows down evaluation)
  • GPT-5 is slow but highest quality - use sparingly
  • Higher concurrency (-j 8) can speed up batch evaluations

Legacy Scripts

Previous Approaches (No Longer Recommended):

  • ./batch_eval - Old batch evaluation script (deprecated)
  • ./eval_run.sh - Legacy server launcher (superseded by batch_eval_launcher.sh)
  • ./batch_eval_server.sh - Legacy server configuration (functionality now in launcher)

Current Recommended Workflow:

# Single agent testing
./eval ashai --quick

# Comprehensive batch evaluation
./batch_eval_launcher.sh --quick -j 4

📚 Key Resources

Essential Documentation

  • Evaluation Learnings Log - Critical Resource: Historical insights, methodology improvements, and lessons learned from evaluation experiments. Essential reading for understanding evaluation evolution and best practices.
  • Dataset Format Documentation - Complete specification of evaluation dataset structure and format
  • Versioning Guide - Managing evaluation versions and compatibility

Additional Resources

Start with the Learnings Log

The Evaluation Learnings Log is an invaluable resource that captures years of evaluation experience, including what works, what doesn't, and why. It's highly recommended reading before running large-scale evaluations or making changes to the evaluation system.