
ASHAI Evaluation Versioning Guide

How to properly manage dataset and evaluator changes to maintain meaningful leaderboards.

🎯 The Versioning Challenge

When you modify datasets or evaluators, old evaluation results may no longer be comparable to new ones. This guide explains how to handle this properly.

📊 What Weave Tracks Automatically

  • Evaluation versions - Single version for datasets + evaluators (in evaluate/evaluators/base.py)
  • Model runs - Each evaluation gets a unique timestamped name (sketched below)
  • Parameters - Agent, model, tools, and dataset combinations
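
The run names used throughout this guide combine these pieces. A minimal sketch of how such a name can be assembled (the build_run_name helper is hypothetical, not part of the codebase):

# Hypothetical helper showing how a run name such as
# ashai_gpt-4o_evalv1_20241220_143022 is composed.
from datetime import datetime

def build_run_name(agent: str, model: str, evaluation_version: str) -> str:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{agent}_{model}_eval{evaluation_version}_{timestamp}"

print(build_run_name("ashai", "gpt-4o", "v1"))
# -> ashai_gpt-4o_evalv1_<current timestamp>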

🚨 When to Update Evaluation Version

Always Update Version For:

  • Modified scoring logic in evaluate/evaluators/ (individual scorer files)
  • Modified test cases in evaluate/data/ JSON files
  • Changed difficulty levels or evaluation criteria
# After any evaluator or dataset changes, update in batch_set.py:
EVALUATION_VERSION = "v2"  # Was v1

Why update? Results scored under changed criteria cannot be fairly compared with old results.

No Version Update Needed For (Weave tracks these automatically):

  • Minor additions (<20% new content)
  • Bug fixes that don't change intent
  • Experimental changes
  • Formatting/typo corrections

🛠 Version Management Commands

Standard Evaluation (No Changes)

./eval ashai --model gpt-4o
# Creates: ashai_gpt-4o_evalv1_20241220_143022

After Dataset or Evaluator Changes

# 1. Update EVALUATION_VERSION in batch_set.py:
# EVALUATION_VERSION = "v2"  # Was v1

# 2. Run evaluation with new version
./eval ashai --model gpt-4o
# Creates: ashai_gpt-4o_evalv2_20241220_143022

Experimental Testing

# For experimental changes, use descriptive naming in Weave dashboard
./eval ashai --model gpt-4o --quick
# Creates: ashai_gpt-4o_evalv2_quick_20241220_143022

📋 Step-by-Step: Making Changes

1. Modifying Datasets

Edit test cases (the # annotations below are illustrative; JSON files themselves cannot contain comments):

# In evaluate/data/prenatal.json, general.json, or cultural.json
[
    {
        "query": "Updated question with new criteria",  # Modified
        "profile": "Updated profile",
        # ... rest of case
    }
]
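
Before bumping the version, it can help to sanity-check the edited files, since a stray comma or missing key will otherwise only surface mid-run. A minimal sketch, assuming each dataset file is a JSON array of cases carrying at least "query" and "profile" keys (file names taken from above):

# Quick sanity check for edited dataset files (illustrative only).
import json
from pathlib import Path

for name in ["prenatal.json", "general.json", "cultural.json"]:
    cases = json.loads(Path("evaluate/data", name).read_text())
    assert isinstance(cases, list), f"{name}: expected a JSON array of cases"
    for i, case in enumerate(cases):
        missing = {"query", "profile"} - set(case)
        assert not missing, f"{name}[{i}] missing keys: {missing}"
    print(f"{name}: {len(cases)} cases OK")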

2. Modifying Evaluators

Edit scoring logic:

# In evaluate/evaluators/medical_accuracy.py
from typing import Dict

import weave

@weave.op()
def medical_accuracy_scorer(output: str, **kwargs) -> Dict[str, float]:
    # Modified evaluation criteria
    score = new_evaluation_logic(output)  # Changed
    return {"medical_accuracy": score}
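
Before running the full suite, a quick local call can confirm the modified scorer still returns the expected shape. A minimal smoke-test sketch, assuming the module path above is importable from the repo root and that scores fall in the 0-1 range (the sample output string is made up):

# Smoke-test the modified scorer on a single sample output.
from evaluate.evaluators.medical_accuracy import medical_accuracy_scorer

result = medical_accuracy_scorer(output="Folic acid is recommended before conception.")
assert set(result) == {"medical_accuracy"}
assert 0.0 <= result["medical_accuracy"] <= 1.0  # assumed 0-1 score range
print(result)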

3. After Either Change

Update version:

# In batch_set.py (top of file)
EVALUATION_VERSION = "v2"  # Was v1

Run evaluation:

./eval ashai --model gpt-4o

📈 Leaderboard Strategy

Development Phase

  • Use quick mode for experiments
  • Keep separate branches for major changes
  • Update version frequently as you iterate
# Experiment with different approaches
./eval ashai --quick
./eval ashai --model gpt-5 --quick

Production Phase

  • Formal version releases (v1, v2, v3)
  • Update version for major changes only
  • Maintain historical tracking
# Official evaluation after major changes
./eval ashai --model gpt-4o

🎯 Best Practices

1. Document Changes

# Good: Clear version progression
EVALUATION_VERSION = "v2"  # Improved medical accuracy scoring
EVALUATION_VERSION = "v3"  # Added cultural sensitivity criteria

2. Batch Version Updates

# After evaluator changes, test all agents
./eval ashai
./eval september  
./eval strict-referenced

3. Preserve Important Baselines

# Before major changes, run comprehensive baseline
./eval ashai --model gpt-4o

4. Use Weave Dashboard Filtering

  • Filter by evaluation versions to compare relevant runs
  • Use timestamps to track iteration speed
  • Group by evaluation versions for fair comparison

⚠️ Common Mistakes

❌ Don't do this:

# Comparing across different evaluation criteria
# Old results used v1 scoring, new results use v2 scoring
./eval ashai  # Without updating EVALUATION_VERSION after evaluator changes

✅ Do this instead:

# Update version after evaluator changes
# In batch_set.py: EVALUATION_VERSION = "v2"
./eval ashai

❌ Don't do this:

# Unclear version management
EVALUATION_VERSION = "test"
EVALUATION_VERSION = "test2"  

✅ Do this instead:

# Clear, descriptive version progression
EVALUATION_VERSION = "v1"  # Initial version
EVALUATION_VERSION = "v2"  # Enhanced medical accuracy
EVALUATION_VERSION = "v3"  # Added cultural context

🔍 Troubleshooting Versions

Check Current Version

# Look at EVALUATION_VERSION in batch settings
grep -n "EVALUATION_VERSION" batch_set.py

Compare Evaluation Names

# Recent evaluations show version info
# ashai_gpt-4o_evalv1_timestamp
# ashai_gpt-4o_evalv2_timestamp  <- Evaluation criteria changed
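
If you have a list of run names, grouping them by their evalvN component makes it obvious which results are comparable. A minimal sketch that relies only on the naming convention shown above, not on any Weave API (the sample names are illustrative):

# Group evaluation run names by version so only like-for-like runs are compared.
from collections import defaultdict

names = [
    "ashai_gpt-4o_evalv1_20241220_143022",
    "ashai_gpt-4o_evalv2_20241221_091500",
]

by_version = defaultdict(list)
for name in names:
    version = next(p for p in name.split("_") if p.startswith("evalv"))
    by_version[version].append(name)

for version, runs in sorted(by_version.items()):
    print(version, runs)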

Start Fresh

# Nuclear option: Start completely fresh
# Update EVALUATION_VERSION to new major version
./eval ashai

📚 Summary

✅ Always update version when:

  • Evaluator/scoring logic changes
  • Major dataset modifications
  • Starting a new evaluation phase

✅ Use Weave dashboard for:

  • Experimental tracking
  • Feature testing
  • Historical comparison

✅ Update version numbers when:

  • Making official releases
  • Significant methodology changes
  • You want clear historical separation

Your Weave dashboard will thank you for proper version management! 🎉