
ASHAI Evaluation Versioning Guide

How to properly manage dataset and evaluator changes to maintain meaningful leaderboards.

🎯 The Versioning Challenge

When you modify datasets or evaluators, old evaluation results may no longer be comparable to new ones. This guide explains how to handle this properly.

📊 What Weave Tracks Automatically

  • Evaluation versions - Single version for datasets + evaluators (in evaluate/evaluators/base.py)
  • Model runs - Each evaluation gets a unique timestamped name (sketched below)
  • Parameters - Agent, model, tools, and dataset combinations
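
The run names used throughout this guide combine these pieces. A minimal sketch of how such a name can be assembled (the build_run_name helper is hypothetical, not part of the codebase):

# Hypothetical helper showing how a run name such as
# ashai_gpt-4o_evalv1_20241220_143022 is composed.
from datetime import datetime

def build_run_name(agent: str, model: str, evaluation_version: str) -> str:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{agent}_{model}_eval{evaluation_version}_{timestamp}"

print(build_run_name("ashai", "gpt-4o", "v1"))
# -> ashai_gpt-4o_evalv1_<current timestamp>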

🚨 When to Update Evaluation Version

Always Update Version For:

  • Modified scoring logic in evaluate/evaluators/ (individual scorer files)
  • Modified test cases in evaluate/data/ JSON files
  • Changed difficulty levels or evaluation criteria
# After any evaluator or dataset changes, update in batch_set.py:
EVALUATION_VERSION = "v2"  # Was v1

Why update? Results scored under changed criteria cannot be fairly compared with old results.

No Version Update Needed For (Weave tracks these automatically):

  • Minor additions (<20% new content)
  • Bug fixes that don't change intent
  • Experimental changes
  • Formatting/typo corrections

🛠 Version Management Commands

Standard Evaluation (No Changes)

./eval ashai --model gpt-4o
# Creates: ashai_gpt-4o_evalv1_20241220_143022

After Dataset or Evaluator Changes

# 1. Update EVALUATION_VERSION in batch_set.py:
# EVALUATION_VERSION = "v2"  # Was v1

# 2. Run evaluation with new version
./eval ashai --model gpt-4o
# Creates: ashai_gpt-4o_evalv2_20241220_143022

Experimental Testing

# For experimental changes, use descriptive naming in Weave dashboard
./eval ashai --model gpt-4o --quick
# Creates: ashai_gpt-4o_evalv2_quick_20241220_143022

📋 Step-by-Step: Making Changes

1. Modifying Datasets

Edit test cases (the # annotations below are illustrative; JSON files themselves cannot contain comments):

# In evaluate/data/prenatal.json, general.json, or cultural.json
[
    {
        "query": "Updated question with new criteria",  # Modified
        "profile": "Updated profile",
        # ... rest of case
    }
]
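
Before bumping the version, it can help to sanity-check the edited files, since a stray comma or missing key will otherwise only surface mid-run. A minimal sketch, assuming each dataset file is a JSON array of cases carrying at least "query" and "profile" keys (file names taken from above):

# Quick sanity check for edited dataset files (illustrative only).
import json
from pathlib import Path

for name in ["prenatal.json", "general.json", "cultural.json"]:
    cases = json.loads(Path("evaluate/data", name).read_text())
    assert isinstance(cases, list), f"{name}: expected a JSON array of cases"
    for i, case in enumerate(cases):
        missing = {"query", "profile"} - set(case)
        assert not missing, f"{name}[{i}] missing keys: {missing}"
    print(f"{name}: {len(cases)} cases OK")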

2. Modifying Evaluators

Edit scoring logic:

# In evaluate/evaluators/medical_accuracy.py
from typing import Dict

import weave

@weave.op()
def medical_accuracy_scorer(output: str, **kwargs) -> Dict[str, float]:
    # Modified evaluation criteria
    score = new_evaluation_logic(output)  # Changed
    return {"medical_accuracy": score}
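
Before running the full suite, a quick local call can confirm the modified scorer still returns the expected shape. A minimal smoke-test sketch, assuming the module path above is importable from the repo root and that scores fall in the 0-1 range (the sample output string is made up):

# Smoke-test the modified scorer on a single sample output.
from evaluate.evaluators.medical_accuracy import medical_accuracy_scorer

result = medical_accuracy_scorer(output="Folic acid is recommended before conception.")
assert set(result) == {"medical_accuracy"}
assert 0.0 <= result["medical_accuracy"] <= 1.0  # assumed 0-1 score range
print(result)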

3. After Either Change

Update version:

# In batch_set.py (top of file)
EVALUATION_VERSION = "v2"  # Was v1

Run evaluation:

./eval ashai --model gpt-4o

📈 Leaderboard Strategy

Development Phase

  • Use quick mode for experiments
  • Keep separate branches for major changes
  • Update version frequently as you iterate
# Experiment with different approaches
./eval ashai --quick
./eval ashai --model gpt-5 --quick

Production Phase

  • Formal version releases (v1, v2, v3)
  • Update version for major changes only
  • Maintain historical tracking
# Official evaluation after major changes
./eval ashai --model gpt-4o

🎯 Best Practices

1. Document Changes

# Good: Clear version progression
EVALUATION_VERSION = "v2"  # Improved medical accuracy scoring
EVALUATION_VERSION = "v3"  # Added cultural sensitivity criteria

2. Batch Version Updates

# After evaluator changes, test all agents
./eval ashai
./eval september  
./eval strict-referenced

3. Preserve Important Baselines

# Before major changes, run comprehensive baseline
./eval ashai --model gpt-4o

4. Use Weave Dashboard Filtering

  • Filter by evaluation versions to compare relevant runs
  • Use timestamps to track iteration speed
  • Group by evaluation versions for fair comparison

⚠️ Common Mistakes

❌ Don't do this:

# Comparing across different evaluation criteria
# Old results used v1 scoring, new results use v2 scoring
./eval ashai  # Without updating EVALUATION_VERSION after evaluator changes

✅ Do this instead:

# Update version after evaluator changes
# In batch_set.py: EVALUATION_VERSION = "v2"
./eval ashai

❌ Don't do this:

# Unclear version management
EVALUATION_VERSION = "test"
EVALUATION_VERSION = "test2"  

✅ Do this instead:

# Clear, descriptive version progression
EVALUATION_VERSION = "v1"  # Initial version
EVALUATION_VERSION = "v2"  # Enhanced medical accuracy
EVALUATION_VERSION = "v3"  # Added cultural context

🔍 Troubleshooting Versions

Check Current Version

# Look at EVALUATION_VERSION in batch settings
grep -n "EVALUATION_VERSION" batch_set.py

Compare Evaluation Names

# Recent evaluations show version info
# ashai_gpt-4o_evalv1_timestamp
# ashai_gpt-4o_evalv2_timestamp  <- Evaluation criteria changed
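
If you have a list of run names, grouping them by their evalvN component makes it obvious which results are comparable. A minimal sketch that relies only on the naming convention shown above, not on any Weave API (the sample names are illustrative):

# Group evaluation run names by version so only like-for-like runs are compared.
from collections import defaultdict

names = [
    "ashai_gpt-4o_evalv1_20241220_143022",
    "ashai_gpt-4o_evalv2_20241221_091500",
]

by_version = defaultdict(list)
for name in names:
    version = next(p for p in name.split("_") if p.startswith("evalv"))
    by_version[version].append(name)

for version, runs in sorted(by_version.items()):
    print(version, runs)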

Start Fresh

# Nuclear option: Start completely fresh
# Update EVALUATION_VERSION to new major version
./eval ashai

📚 Summary

✅ Always update version when:

  • Evaluator/scoring logic changes
  • Major dataset modifications
  • Starting a new evaluation phase

✅ Use Weave dashboard for:

  • Experimental tracking
  • Feature testing
  • Historical comparison

✅ Update version numbers when:

  • Making official releases
  • Significant methodology changes
  • You want clear historical separation

Your Weave dashboard will thank you for proper version management! 🎉