# ASHAI Evaluation Versioning Guide

How to properly manage dataset and evaluator changes to maintain meaningful leaderboards.
## 🎯 The Versioning Challenge

When you modify datasets or evaluators, old evaluation results may no longer be comparable to new ones. This guide explains how to handle that properly.
## 📊 What Weave Tracks Automatically

- ✅ **Evaluation versions** - a single version covering datasets + evaluators (in `evaluate/evaluators/base.py`)
- ✅ **Model runs** - each evaluation gets a unique timestamp
- ✅ **Parameters** - agent, model, tools, and dataset combinations
## 🚨 When to Update the Evaluation Version

### Always Update the Version For:

- Modified scoring logic in `evaluate/evaluators/` (individual scorer files)
- Modified test cases in the `evaluate/data/` JSON files
- Changed difficulty levels or evaluation criteria

```python
# After any evaluator or dataset changes, update in batch_set.py:
EVALUATION_VERSION = "v2"  # Was v1
```

**Why update?** Changed evaluation criteria make comparisons against old results unfair.
### No Version Update Needed For:

- Minor additions (<20% new content)
- Bug fixes that don't change scoring intent
- Experimental changes
- Formatting/typo corrections
## 🛠 Version Management Commands
### Standard Evaluation (No Changes)
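When nothing has changed, just run the evaluation as usual and results are filed under the current `EVALUATION_VERSION` (using the `./eval` wrapper and naming pattern shown elsewhere in this guide):

```shell
# No dataset or evaluator changes: run against the current version
./eval ashai --model gpt-4o
# Creates: ashai_gpt-4o_evalv1_<timestamp> (if EVALUATION_VERSION is still "v1")
```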
### After Dataset or Evaluator Changes
```shell
# 1. Update EVALUATION_VERSION in batch_set.py:
#      EVALUATION_VERSION = "v2"  # Was v1

# 2. Run the evaluation with the new version
./eval ashai --model gpt-4o
# Creates: ashai_gpt-4o_evalv2_20241220_143022
```
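The run-name pattern above can be sketched as a small helper (hypothetical code, not part of the repo; only the `agent_model_eval<version>[_quick]_<timestamp>` pattern is taken from the examples in this guide):

```python
from datetime import datetime


def build_run_name(agent: str, model: str, version: str, quick: bool = False) -> str:
    # Mirrors the naming shown above: ashai_gpt-4o_evalv2_20241220_143022
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    parts = [agent, model, f"eval{version}"]
    if quick:
        parts.append("quick")
    parts.append(timestamp)
    return "_".join(parts)
```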
### Experimental Testing

```shell
# For experimental changes, use descriptive naming in the Weave dashboard
./eval ashai --model gpt-4o --quick
# Creates: ashai_gpt-4o_evalv2_quick_20241220_143022
```
## 📋 Step-by-Step: Making Changes

### 1. Modifying Datasets

Edit the test cases:

```
# In evaluate/data/prenatal.json, general.json, or cultural.json
[
  {
    "query": "Updated question with new criteria",  # Modified
    "profile": "Updated profile",
    # ... rest of case
  }
]
```
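After hand-editing the JSON files it is easy to break the schema. A quick sanity check can be sketched like this (hypothetical helper, assuming each case has at least the `query` and `profile` keys shown in the example above):

```python
import json
from typing import List


def validate_cases(cases: List[dict]) -> int:
    # Every test case must carry the fields the evaluators read (assumed schema)
    required = {"query", "profile"}
    for i, case in enumerate(cases):
        missing = required - case.keys()
        if missing:
            raise ValueError(f"case {i} is missing fields: {sorted(missing)}")
    return len(cases)


# Usage: validate_cases(json.load(open("evaluate/data/prenatal.json")))
```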
### 2. Modifying Evaluators

Edit the scoring logic:

```python
# In evaluate/evaluators/medical_accuracy.py
from typing import Dict

import weave


@weave.op()
def medical_accuracy_scorer(output: str, **kwargs) -> Dict[str, float]:
    # Modified evaluation criteria
    score = new_evaluation_logic(output)  # Changed
    return {"medical_accuracy": score}
```
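As a concrete sketch of what `new_evaluation_logic` could look like, here is a hypothetical keyword-coverage check (illustrative only, not the repo's actual criteria):

```python
from typing import Dict


def new_evaluation_logic(output: str) -> float:
    # Hypothetical criteria: fraction of expected safety terms mentioned
    expected_terms = {"consult", "doctor", "dosage"}
    found = sum(1 for term in expected_terms if term in output.lower())
    return found / len(expected_terms)


def medical_accuracy_scorer(output: str, **kwargs) -> Dict[str, float]:
    return {"medical_accuracy": new_evaluation_logic(output)}
```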
### 3. After Either Change

Update the version in `batch_set.py`, then re-run the evaluation so new results land under the new version.
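Put together, that looks like this (assuming the `./eval` wrapper and the `batch_set.py` location used throughout this guide):

```shell
# 1. In batch_set.py, bump the version:
#      EVALUATION_VERSION = "v2"  # Was v1
# 2. Re-run so results are recorded under the new version:
./eval ashai --model gpt-4o
```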
## 📈 Leaderboard Strategy

### Development Phase

- Use quick mode for experiments
- Keep separate branches for major changes
- Update the version frequently as you iterate

### Production Phase

- Make formal version releases (v1, v2, v3)
- Update the version for major changes only
- Maintain historical tracking
## 🎯 Best Practices

### 1. Document Changes

```python
# Good: clear version progression
EVALUATION_VERSION = "v2"  # Improved medical accuracy scoring
EVALUATION_VERSION = "v3"  # Added cultural sensitivity criteria
```
### 2. Batch Version Updates

Group related evaluator and dataset edits together, then bump `EVALUATION_VERSION` once, so each version covers one coherent set of changes.
### 3. Preserve Important Baselines

Before bumping the version, finish any runs you want to keep as a baseline so the old version retains a complete snapshot to compare against.
### 4. Use Weave Dashboard Filtering

- Filter by evaluation version to compare relevant runs
- Use timestamps to track iteration speed
- Group by evaluation version for fair comparisons
## ⚠️ Common Mistakes

❌ **Don't do this:**

```shell
# Comparing across different evaluation criteria:
# old results used v1 scoring, new results use v2 scoring
./eval ashai  # Without updating EVALUATION_VERSION after evaluator changes
```

✅ **Do this instead:** update `EVALUATION_VERSION` in `batch_set.py` before running the evaluation.

❌ **Don't do this:** use vague or reused version labels that make the history unreadable.

✅ **Do this instead:**

```python
# Clear, descriptive version progression
EVALUATION_VERSION = "v1"  # Initial version
EVALUATION_VERSION = "v2"  # Enhanced medical accuracy
EVALUATION_VERSION = "v3"  # Added cultural context
```
## 🔍 Troubleshooting Versions
### Check Current Version

Look up the current `EVALUATION_VERSION` constant in `batch_set.py`.
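If you want to check it programmatically, a small helper (hypothetical, not part of the repo) can parse the constant out of the file:

```python
import re


def read_version(source_text: str) -> str:
    # Extracts v1/v2/... from a line like: EVALUATION_VERSION = "v2"
    match = re.search(r'EVALUATION_VERSION\s*=\s*"([^"]+)"', source_text)
    if match is None:
        raise ValueError("EVALUATION_VERSION not found")
    return match.group(1)


# Usage: read_version(open("batch_set.py").read())
```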
### Compare Evaluation Names

```shell
# Recent evaluations show version info:
# ashai_gpt-4o_evalv1_timestamp
# ashai_gpt-4o_evalv2_timestamp  <- evaluation criteria changed
```
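To group runs by version outside the dashboard, the version segment can be pulled from a run name (hypothetical helper; only the naming pattern comes from this guide):

```python
def version_of(run_name: str) -> str:
    # ashai_gpt-4o_evalv2_20241220_143022 -> "v2"
    for part in run_name.split("_"):
        if part.startswith("eval"):
            return part[len("eval"):]
    raise ValueError("run name has no eval version segment")
```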
### Start Fresh

```shell
# Nuclear option: start completely fresh
# Update EVALUATION_VERSION to a new major version, then:
./eval ashai
```
## 📚 Summary

✅ **Always update the version when:**

- Evaluator/scoring logic changes
- Major dataset modifications are made
- Starting a new evaluation phase

✅ **Use the Weave dashboard for:**

- Experimental tracking
- Feature testing
- Historical comparison

✅ **Update version numbers when:**

- Making official releases
- Making significant methodology changes
- You want clear historical separation

Your Weave dashboard will thank you for proper version management! 🎉