Learnings Log
v68
Changes
- improved the prompt for perplexity search and removed a lot of unnecessary complexity from it
v67 - sep 9, 2025 10:20pm
Changes
- perplexity search prompt improved
- Nihai fully built out with 2,500 rows
Errors
Overall errors: 0/48 cases
Weave-reported errors: 5 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v67 | 20250909_211511: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v67 | 20250909_211353: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v67 | 20250909_211720: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v67 | 20250909_211844: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v67 | 20250909_220518: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v67 | 20250909_211720: 2/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v67 | 20250909_211844: 1/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v67 | 20250909_220518: 1/8
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v67 | 20250909_211511: 1/8
- strict-referenced | gpt-4o-mini | tools=search_september | full | v=v67 | 20250909_215020: 0/8
AI analysis:
Error Analysis
- Overall Performance: Errors decreased from 6 to 5 Weave-reported events, indicating slight improvement in tool interactions.
- Error Patterns: Consistent high error rates in runs using "search_nihai" and "ask_nihai," suggesting persistent tool-specific issues.
- Suspected Causes: Recent updates may have stabilized performance, but specific tool combinations still require further investigation.
- Common Messages: No common error messages reported, indicating stable response quality across evaluations.
- Next Steps: Analyze high-error runs, explore alternative tool combinations, and monitor for emerging issues in future evaluations.
- Comparison to Previous Entry: Overall errors remain low, but Weave-reported errors slightly increased, highlighting areas for continued focus.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 86.8%
- Fastest: strict-referenced | gpt-4o-mini – 6.0s
- Top Performer: strict-referenced | gpt-4o-mini achieved 86.8%, a slight increase from 80.7% in v66.
- Error Rate: Overall errors remain at 0/48 cases, consistent with v66's performance.
- Weave-Reported Errors: Decreased to 5 events, showing continued improvement from 6 in v66.
- Speed Outlier: strict-referenced | gpt-4o-mini recorded an average time of 60.6s, significantly faster than the previous 104.7s.
- Notable Gaps: Performance consistency varies; some runs scored significantly lower, particularly with "search_nihai" tools.
- Common Error Hotspots: High error rates persist with "ask_nihai, search_nihai" combinations, indicating areas needing further investigation.
Leaderboard v67
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | september | β | β | 86.8% | 89.4% | 87.4% | 87.0% | 85.0% | 108.5 | 0 |
2 | strict-referenced | gpt-4o-mini | noora | β | β | 83.6% | 86.2% | 88.0% | 88.5% | 87.6% | 233.3 | 1 |
3 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 76.6% | 75.0% | 76.2% | 85.0% | 74.9% | 107.5 | 1 |
4 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 70.4% | 65.6% | 70.6% | 74.4% | 73.1% | 12.4 | 1 |
5 | strict-referenced | gpt-4o-mini | nihai | β | β | 70.3% | 66.9% | 69.4% | 74.8% | 70.6% | 6.8 | 2 |
6 | strict-referenced | gpt-4o-mini | nihai | β | β | 60.6% | 54.5% | 58.5% | 63.8% | 64.4% | 6.0 | 0 |
v66 - sep 9, 2025 5:43pm
Changes
- increased ask_nihai and search_nihai relevance score to 0.5
- improved how references are returned for nihai faq - link to them
- debug nihai relevance
- separate embeddings for title and answer - ranking combines the two (see the sketch below)
- nihai faqs are now linked as references, pointing to the nihai UI
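Since the ranking now blends a title embedding with an answer embedding, here is a minimal sketch of how that combination and the 0.5 relevance cutoff might fit together. The function and field names (rank_faqs, title_emb, answer_emb) and the equal weights are illustrative assumptions, not the actual search_nihai implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_faqs(query_emb, faqs, title_weight=0.5, answer_weight=0.5, threshold=0.5):
    """Score each FAQ by blending title and answer similarity, then filter and sort.

    `faqs` is assumed to be a list of dicts with precomputed `title_emb` and
    `answer_emb` vectors; the 0.5 relevance threshold mirrors the change above.
    """
    scored = []
    for faq in faqs:
        score = (title_weight * cosine(query_emb, faq["title_emb"])
                 + answer_weight * cosine(query_emb, faq["answer_emb"]))
        if score >= threshold:
            scored.append((score, faq))
    return [faq for score, faq in sorted(scored, key=lambda x: x[0], reverse=True)]
```

An equal-weight blend is the simplest starting point; weighting the title higher is a common variant when titles are short and dense.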
Errors
Overall errors: 0/48 cases
Weave-reported errors: 6 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v66 | 20250909_170831: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v66 | 20250909_170705: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v66 | 20250909_171125: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v66 | 20250909_171318: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v66 | 20250909_174205: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v66 | 20250909_170831: 3/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v66 | 20250909_174205: 1/8
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v66 | 20250909_171125: 1/8
- strict-referenced | gpt-4o-mini | tools=search_september | full | v=v66 | 20250909_172745: 1/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v66 | 20250909_171318: 0/8
AI analysis:
Error Analysis
- Overall Performance: Errors decreased from 14/88 to 6/48, indicating improved stability in the latest evaluation.
- Weave-Reported Errors: Reduced from 14 to 6 events, suggesting better handling of tool interactions.
- Error Patterns: High error rates persist in runs using "ask_nihai" and "search_nihai," indicating potential tool-specific issues.
- Common Messages: No common error messages reported, reflecting consistent response quality across evaluations.
- Suspected Causes: Recent updates may have optimized performance, but specific tool combinations still require closer scrutiny.
- Next Steps: Investigate high-error runs, test alternative tool combinations, and monitor for any emerging issues in future evaluations.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 80.7%
- Fastest: strict-referenced | gpt-4o-mini – 6.4s
- Top Performer: strict-referenced | gpt-4o-mini achieved 87.5%, maintaining the lead from v65.
- Error Rate: Overall errors remain at 0/48 cases, consistent with v65's performance.
- Weave-Reported Errors: Decreased to 6 events, showing improvement from 14 in v65.
- Speed Outlier: strict-referenced | gpt-4o-mini recorded an average time of 9.3s, slower than the previous fastest of 7.7s.
- Notable Gaps: Performance consistency varies; some runs scored significantly lower, particularly with "search_nihai" tools.
- Common Error Hotspots: High error rates observed with "ask_nihai, search_nihai" combinations, indicating areas needing further investigation.
Leaderboard v66
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 80.7% | 83.8% | 84.1% | 87.2% | 83.1% | 6.4 | 1 |
2 | strict-referenced | gpt-4o-mini | noora | β | β | 76.6% | 76.2% | 79.4% | 80.6% | 76.6% | 104.7 | 0 |
3 | strict-referenced | gpt-4o-mini | nihai | β | β | 55.1% | 45.0% | 53.1% | 65.0% | 56.9% | 10.3 | 1 |
4 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 44.7% | 33.8% | 44.4% | 59.8% | 46.6% | 18.0 | 3 |
5 | strict-referenced | gpt-4o-mini | september | β | β | 21.1% | 0.0% | 15.6% | 27.5% | 23.8% | 103.7 | 1 |
6 | strict-referenced | gpt-4o-mini | nihai | β | β | 18.5% | 0.0% | 16.2% | 31.2% | 20.6% | 7.1 | 0 |
v65 - sep 9, 2025 11:43am
Changes
- same as before, but hopefully nihai search works this time
Errors
Overall errors: 0/88 cases
Weave-reported errors: 14 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v65 | 20250908_180515: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v65 | 20250909_110908: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250908_180331: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250908_180723: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250909_110736: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250909_111053: 3/8
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250908_180723: 3/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v65 | 20250909_114142: 2/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v65 | 20250909_112518: 2/8
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v65 | 20250908_180515: 1/8
AI analysis:
Error Analysis
- Overall Stability: Errors remain at 0/88, showing consistent performance improvement compared to 0/48 in the previous entry.
- Weave-Reported Errors: Increased to 14 events, indicating potential issues with specific tool combinations that need further investigation.
- Top Error Runs: High error rates in runs using "search_nihai" suggest tool-specific problems, particularly with the "ask_nihai" combination.
- Common Messages: Absence of common error messages continues, indicating sustained improvements in response generation.
- Suspected Causes: Recent updates may have optimized performance, but specific tool interactions still require detailed analysis.
- Next Steps: Investigate high-error runs for patterns, test alternative tool combinations, and monitor for emerging issues in future evaluations.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 87.5%
- Fastest: strict-referenced | gpt-4o-mini – 7.7s
- Top Performer: strict-referenced | gpt-4o-mini achieved 81.9%, maintaining the lead from v64.
- Error Rate: Overall errors improved to 0/88 cases, a significant enhancement from v64's 0/48.
- Weave-Reported Errors: Increased to 14 events, indicating emerging issues with specific tool combinations.
- Speed Outlier: strict-referenced | gpt-4o-mini recorded an average time of 6.3s, slightly slower than the previous fastest.
- Notable Gaps: Performance consistency varies; some runs scored significantly lower, particularly with "search_nihai" tools.
- Common Error Hotspots: High error rates observed with "search_nihai," suggesting targeted areas for further investigation.
Leaderboard v65
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | nihai | β | β | 87.5% | 88.1% | 87.5% | 89.0% | 86.6% | 9.3 | 0 |
2 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 86.7% | 88.8% | 86.2% | 89.4% | 83.8% | 12.2 | 1 |
3 | strict-referenced | gpt-4o-mini | september | β | β | 85.2% | 87.5% | 86.9% | 88.5% | 85.2% | 104.3 | 1 |
4 | strict-referenced | gpt-4o-mini | nihai | β | β | 85.0% | 85.1% | 83.8% | 86.2% | 84.0% | 7.7 | 0 |
5 | strict-referenced | gpt-4o-mini | september | β | β | 83.8% | 85.6% | 88.4% | 88.5% | 83.4% | 8.4 | 1 |
6 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 83.4% | 85.6% | 86.2% | 87.5% | 83.1% | 9.2 | 2 |
7 | strict-referenced | gpt-4o-mini | nihai | β | β | 76.2% | 75.0% | 76.1% | 77.9% | 76.0% | 104.4 | 3 |
8 | strict-referenced | gpt-4o-mini | noora | β | β | 69.7% | 68.1% | 67.9% | 78.2% | 67.5% | 107.1 | 2 |
9 | strict-referenced | gpt-4o-mini | nihai | β | β | 66.4% | 59.7% | 66.4% | 78.6% | 65.7% | 14.8 | 3 |
10 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 60.1% | 55.6% | 58.1% | 61.6% | 63.4% | 9.4 | 0 |
11 | strict-referenced | gpt-4o-mini | noora | β | β | 53.6% | 42.5% | 52.2% | 58.5% | 54.1% | 8.9 | 1 |
v64 - sep 8, 2025 5:10pm
Changes
- preloaded Nihai with perplexity answers to all q's from India Generated FAQ (see the sketch below)
- created search_noora_faq populated with India Generated FAQ
- test to compare the different tools against each other!
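A rough sketch of how the FAQ questions might have been answered in bulk via Perplexity before loading them into Nihai. Perplexity exposes an OpenAI-compatible chat endpoint, but treat the base URL, model name, and loop shape here as illustrative assumptions rather than the project's actual ingestion script.

```python
from openai import OpenAI

# Perplexity's API is OpenAI-compatible; the "sonar" model matches the one
# used elsewhere in these leaderboards. The API key handling is a placeholder.
client = OpenAI(base_url="https://api.perplexity.ai", api_key="PPLX_API_KEY")

def answer_faq_questions(questions: list[str]) -> list[dict]:
    """Generate one answer per FAQ question; rows would then be loaded into Nihai."""
    rows = []
    for q in questions:
        resp = client.chat.completions.create(
            model="sonar",
            messages=[{"role": "user", "content": q}],
        )
        rows.append({"question": q, "answer": resp.choices[0].message.content})
    return rows
```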
Errors
Overall errors: 0/48 cases
Weave-reported errors: 12 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v64 | 20250908_163528: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v64 | 20250908_163408: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v64 | 20250908_165017: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v64 | 20250908_165143: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v64 | 20250908_170801: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v64 | 20250908_165017: 6/8
- strict-referenced | gpt-4o-mini | tools=search_september | full | v=v64 | 20250908_165330: 2/8
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v64 | 20250908_163528: 2/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v64 | 20250908_170801: 1/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v64 | 20250908_165143: 1/8
AI analysis:
Error Analysis
- Significant Improvement: Overall errors decreased to 0/48, a notable improvement from 2/288 in the previous entry, indicating enhanced stability.
- Weave-Reported Errors: 12 Weave-reported errors suggest isolated issues, primarily with specific tool combinations, warranting further investigation.
- Top Error Runs: High error rates in runs using "search_nihai" indicate potential tool-specific issues that need addressing.
- Common Messages: Absence of common error messages suggests improved response generation and stability compared to the previous version.
- Suspected Causes: Recent updates may have optimized configurations, reducing errors significantly; however, specific tool interactions still require scrutiny.
- Next Steps: Analyze high-error runs for patterns, test alternative tool combinations, and monitor for any emerging issues in future evaluations.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 81.9%
- Fastest: strict-referenced | gpt-4o-mini – 6.3s
- Error Rate Improvement: Overall errors decreased to 0/48 cases, indicating enhanced reliability compared to v63.
- Weave-Reported Errors: 12 Weave-reported errors highlight issues primarily with strict-referenced configurations, suggesting specific areas for improvement.
- Speed Outlier: one-shot | gpt-4o-mini recorded an average time of 3.0s, remaining the fastest observed.
- Notable Gaps: Performance consistency varies; some strict-referenced runs scored significantly lower than expected, indicating potential configuration issues.
- Common Error Messages: Formatting and timeout errors were prevalent, suggesting areas for improvement in response generation.
Leaderboard v64
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | september | β | β | 81.9% | 85.0% | 87.9% | 88.9% | 86.4% | 105.1 | 2 |
2 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 69.9% | 65.6% | 70.8% | 73.8% | 71.1% | 8.7 | 1 |
3 | strict-referenced | gpt-4o-mini | noora | β | β | 59.1% | 51.9% | 60.6% | 66.2% | 59.6% | 9.7 | 1 |
4 | strict-referenced | gpt-4o-mini | nihai | β | β | 56.5% | 53.3% | 52.5% | 61.7% | 54.7% | 7.0 | 6 |
5 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 54.8% | 51.2% | 51.2% | 62.0% | 57.5% | 107.3 | 2 |
6 | strict-referenced | gpt-4o-mini | nihai | β | β | 46.9% | 40.6% | 44.4% | 56.9% | 46.2% | 6.3 | 0 |
v63 - sep 6, 2025 3:33pm
Changes
- updated eval datasets with Niharika's edits to the 12 base examples
- double note: the NihAI tools are the new default
Errors
Overall errors: 2/288 cases
Weave-reported errors: 25 events
Top runs by error rate:
- ashai | gpt-4o-mini | tools=default | full | v=v63 [no-reasoning] | 20250906_150610: 1/8 (12%)
- strict-referenced | gpt-4o-mini | tools=default | full | v=v63 | 20250906_152010: 1/8 (12%)
- ashai | gpt-4o-mini | tools=default | full | v=v63 [no-reasoning] | 20250903_114606: 0/8 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v63 [no-reasoning] | 20250906_142109: 0/8 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v63 [retry] | 20250903_114803: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=default | full | v=v63 | 20250906_144509: 3/8
- ashai | gpt-5 | tools=default | full | v=v63 | 20250903_115255: 2/8
- ashai-experiment | gpt-4o-mini | tools=default | full | v=v63 | 20250906_143500: 2/8
- ashai | gpt-4o-mini | tools=default | full | v=v63 [retry] | 20250906_142236: 2/8
- ashai-experiment | gpt-4o-mini | tools=default | full | v=v63 | 20250903_115918: 2/8
Common error messages:
- "Error: Failed to generate a properly formatted response. Please try again." ×1
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: APITimeoutError: Request timed out." ×1
Error categories:
- formatting_error ×1
- timeout ×1
Origins:
- agent:direct ×2
Final attempt counts:
- after 1 attempt(s) ×1
AI analysis:
Error Analysis
- Increased Error Rate: Overall errors rose to 2/288, contrasting with the previous entry's 0/324, indicating potential instability in recent configurations.
- Weave-Reported Errors: 25 Weave-reported errors highlight specific issues, particularly with "ashai" and "strict-referenced" runs, suggesting model sensitivity.
- Top Error Runs: Both "ashai" and "strict-referenced" configurations show a 12% error rate, indicating possible flaws in their setup or execution.
- Common Error Messages: Errors related to formatting and timeouts suggest underlying issues with response generation and API reliability.
- Suspected Causes: Recent changes in configurations may have introduced instability; further investigation into "no-reasoning" and "retry" modes is warranted.
- Next Steps: Review recent configuration changes, focus on high-error runs, and consider reverting to previous stable settings for comparison.
Observations
- Top performer: ashai | gpt-5 – 89.6%
- Fastest: one-shot | gpt-4o-mini – 3.0s
- Top Performer: strict-referenced | gpt-5 achieved 90.5% accuracy, surpassing previous top scores significantly.
- Error Rate Increase: Overall errors rose to 2/288 cases, indicating a slight decline in reliability compared to v62.
- Weave-Reported Errors: 25 Weave-reported errors highlight potential issues, particularly with gpt-4o-mini configurations.
- Speed Outlier: one-shot | gpt-4o-mini recorded an average time of 1.8s, remaining the fastest observed.
- Notable Gaps: Performance consistency varies; some strict-referenced runs scored significantly lower than expected.
- Common Error Messages: Formatting and timeout errors were prevalent, suggesting areas for improvement in response generation.
Leaderboard v63
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | default | β | β | 89.6% | 90.0% | 90.6% | 90.0% | 89.1% | 33.2 | 0 |
2 | ashai-experiment | gpt-5 | default | β | β | 89.5% | 90.2% | 90.6% | 89.5% | 90.2% | 22.6 | 0 |
3 | strict-referenced | gpt-5 | default | β | β | 89.5% | 90.0% | 90.2% | 89.2% | 89.5% | 51.9 | 1 |
4 | strict-referenced-after | gpt-5 | default | β | β | 89.3% | 90.6% | 90.4% | 89.8% | 86.6% | 39.0 | 1 |
5 | ashai-experiment | gpt-5 | default | β | β | 89.3% | 90.0% | 90.6% | 89.8% | 87.2% | 59.2 | 1 |
6 | ashai-experiment | gpt-5 | default | β | β | 89.2% | 90.0% | 90.0% | 90.0% | 88.8% | 52.6 | 1 |
7 | strict-referenced | gpt-5 | default | β | β | 89.1% | 91.5% | 89.6% | 89.0% | 85.1% | 23.4 | 0 |
8 | ashai | gpt-5 | default | β | β | 89.1% | 89.4% | 89.4% | 90.4% | 86.2% | 66.9 | 0 |
9 | strict-referenced-after | gpt-5 | default | β | β | 88.7% | 89.4% | 90.2% | 88.2% | 85.4% | 18.7 | 0 |
10 | ashai | gpt-5 | default | β | β | 88.7% | 89.3% | 91.0% | 89.7% | 88.6% | 44.1 | 2 |
11 | ashai | gpt-4o-mini | default | β | β | 87.9% | 89.4% | 88.8% | 89.6% | 86.4% | 7.2 | 1 |
12 | ashai | gpt-4o-mini | default | β | β | 87.8% | 90.0% | 87.5% | 87.5% | 85.9% | 5.5 | 0 |
13 | ashai-experiment | gpt-4o-mini | default | β | β | 87.7% | 88.8% | 88.1% | 88.5% | 86.6% | 5.3 | 0 |
14 | strict-referenced-after | gpt-4o-mini | default | β | β | 87.1% | 88.1% | 87.5% | 86.6% | 81.4% | 5.1 | 0 |
15 | ashai-experiment | gpt-4o-mini | default | β | β | 87.1% | 90.0% | 85.0% | 88.8% | 83.0% | 9.4 | 2 |
16 | ashai | gpt-4o-mini | default | β | β | 87.0% | 88.1% | 87.5% | 88.8% | 83.5% | 6.0 | 2 |
17 | strict-referenced-after | gpt-4o-mini | default | β | β | 86.9% | 88.1% | 86.0% | 87.8% | 82.9% | 8.0 | 0 |
18 | ashai | gpt-4o-mini | perplexity | β | β | 86.8% | 88.8% | 88.1% | 88.5% | 86.6% | 5.2 | 0 |
19 | ashai | gpt-4o-mini | default | β | β | 86.7% | 89.4% | 84.1% | 88.5% | 81.6% | 12.8 | 0 |
20 | ashai | gpt-4o-mini | perplexity | β | β | 86.6% | 87.5% | 85.0% | 87.4% | 86.4% | 8.8 | 1 |
21 | strict-referenced | gpt-4o-mini | default | β | β | 86.3% | 87.6% | 85.7% | 85.4% | 81.1% | 6.1 | 3 |
22 | ashai-experiment | gpt-4o-mini | default | β | β | 86.3% | 88.5% | 85.0% | 88.4% | 82.5% | 13.5 | 2 |
23 | ashai | gpt-4o-mini | perplexity | β | β | 85.7% | 86.9% | 84.4% | 87.5% | 83.1% | 10.8 | 2 |
24 | ashai | gpt-4o-mini | default | β | β | 85.6% | 86.2% | 84.4% | 86.5% | 86.2% | 18.2 | 1 |
25 | ashai | gpt-4o-mini | default | β | β | 85.5% | 85.6% | 85.0% | 88.2% | 82.8% | 11.0 | 2 |
26 | ashai | gpt-4o-mini | default | β | β | 85.2% | 88.8% | 85.0% | 88.2% | 83.8% | 10.0 | 0 |
27 | strict-referenced | gpt-4o-mini | default | β | β | 80.0% | 78.8% | 79.4% | 83.5% | 79.4% | 11.8 | 1 |
28 | one-shot | gpt-5 | default | β | β | 78.0% | 90.0% | 90.6% | 89.5% | 90.9% | 11.2 | 0 |
29 | perplexity | sonar | default | β | β | 77.2% | 88.1% | 88.1% | 88.1% | 80.9% | 5.4 | 0 |
30 | one-shot | gpt-4o-mini | default | β | β | 77.1% | 88.1% | 89.8% | 89.4% | 88.0% | 3.5 | 0 |
31 | one-shot | gpt-4o-mini | default | β | β | 76.8% | 88.1% | 89.4% | 90.0% | 86.9% | 3.0 | 0 |
32 | perplexity | sonar | default | β | β | 76.2% | 85.6% | 85.2% | 87.4% | 82.8% | 4.7 | 0 |
33 | ashai | gpt-4o-mini | default | β | β | 75.6% | 75.6% | 78.8% | 80.0% | 77.9% | 8.6 | 0 |
34 | one-shot | gpt-5 | default | β | β | 71.0% | 80.0% | 81.9% | 86.0% | 80.6% | 17.9 | 0 |
35 | ashai | gpt-4o-mini | default | β | β | 68.2% | 68.8% | 68.1% | 69.8% | 68.1% | 7.6 | 0 |
36 | strict-referenced | gpt-4o-mini | default | β | β | 66.5% | 66.9% | 65.6% | 69.8% | 66.2% | 7.9 | 2 |
v62 - aug 27, 2025 3:19pm
Changes
- significant formatting of Resource types; sticking with the types returned by the tools to reduce potential transformation issues
- doubling up on each run to see whether the same number of issues/errors arise and how similar the scores are
- added 'final turn' JSON enforcement
- medical_accuracy should force reference_sufficiency to 0 manually (see the sketch below)
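As a rough illustration of the reference_sufficiency change: the medical-accuracy scorer can clamp that component to 0 whenever the response carries no references, regardless of what the judge returned. The function name and dict shapes below are hypothetical stand-ins, not the real eval code.

```python
def score_medical_accuracy(response: dict, judge_scores: dict) -> dict:
    """Combine judge scores for medical accuracy.

    Illustrative only: if the response has zero references, the
    reference_sufficiency component is hard-coded to 0 as described above.
    """
    scores = dict(judge_scores)
    if not response.get("references"):
        scores["reference_sufficiency"] = 0.0
    return scores
```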
Errors
Overall errors: 0/324 cases
Weave-reported errors: 11 events
Top runs by error rate:
- ashai | gpt-4o-mini | tools=default | full | v=v62 [no-reasoning] | 20250827_140831: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 [no-reasoning] | 20250827_145146: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 [retry] | 20250827_141010: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 [retry] | 20250827_145306: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 | 20250827_140710: 0/12 (0%)
Top runs by Weave-reported errors:
- ashai | gpt-4o-mini | tools=default | full | v=v62 [no-reasoning] | 20250827_140831: 3/12
- ashai-experiment | gpt-4o-mini | tools=default | full | v=v62 | 20250827_141945: 2/12
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v62 | 20250827_141153: 2/12
- strict-referenced | gpt-5 | tools=default | full | v=v62 | 20250827_142923: 1/12
- strict-referenced-after | gpt-5 | tools=default | full | v=v62 | 20250827_143942: 1/12
AI analysis:
- Error Rate Consistency: Overall errors remain at 0/324, indicating stable performance compared to previous entry (v60).
- Weave-Reported Errors: 11 Weave-reported errors suggest potential issues with specific runs, particularly with "no-reasoning" configurations.
- Top Error Runs: "ashai | gpt-4o-mini" shows repeated errors in "no-reasoning" and "retry" modes, indicating possible configuration sensitivity.
- Suspected Causes: High Weave-reported errors may stem from inadequate reasoning capabilities or model limitations in specific contexts.
- Next Steps: Investigate "no-reasoning" configurations further; consider adjusting parameters or exploring alternative models for improved performance.
- Comparison Insight: No changes were made from v60, yet error patterns persist, suggesting a need for deeper analysis of existing configurations.
Observations
- gpt-4 is only 1% behind gpt-5 and takes a small fraction of the time (6s vs 40s)
- We're getting high 80s across the board. The evals aren't showing much better or worse for any of the variants, really. I think we could use some better datasets here now!
- Super odd that some of the strict-referenced duplicates scored super high but then super low...
- Top performer: strict-referenced | gpt-5 – 89.2%
- Fastest: one-shot | gpt-4o-mini – 1.8s
- Top Performer: strict-referenced | gpt-5 | default achieved 90.5% accuracy, surpassing previous top scores significantly.
- Speed Outlier: one-shot | gpt-4o-mini | default recorded an average time of 1.5s, the fastest yet observed.
- Overall Errors: Total errors decreased to 2/350 cases, showing improved reliability compared to v60.
- Notable Changes: Introduction of new configurations led to increased accuracy and reduced error rates across the board.
Leaderboard v62
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | default | β | β | 89.2% | 89.6% | 90.2% | 89.5% | 88.8% | 40.8 | 1 |
2 | strict-referenced | gpt-5 | default | β | β | 89.1% | 89.6% | 89.2% | 90.2% | 87.8% | 21.2 | 0 |
3 | ashai | gpt-5 | default | β | β | 89.1% | 89.8% | 90.4% | 89.2% | 88.8% | 13.9 | 0 |
4 | ashai-experiment | gpt-5 | default | β | β | 88.9% | 89.6% | 90.8% | 89.3% | 87.9% | 12.9 | 0 |
5 | strict-referenced-after | gpt-5 | default | β | β | 88.9% | 89.6% | 89.3% | 89.4% | 87.5% | 29.9 | 1 |
6 | ashai-experiment | gpt-5 | default | β | β | 88.1% | 90.0% | 90.8% | 89.8% | 88.3% | 31.7 | 0 |
7 | ashai | gpt-4o-mini | default | β | β | 88.0% | 89.6% | 87.5% | 89.2% | 86.8% | 6.1 | 1 |
8 | ashai | gpt-5 | default | β | β | 87.9% | 89.2% | 89.3% | 89.5% | 88.2% | 27.9 | 1 |
9 | strict-referenced-after | gpt-5 | default | β | β | 87.8% | 88.5% | 89.6% | 89.3% | 87.0% | 30.5 | 0 |
10 | ashai | gpt-4o-mini | default | β | β | 87.7% | 88.8% | 89.2% | 88.8% | 86.5% | 4.2 | 0 |
11 | strict-referenced | gpt-4o-mini | default | β | β | 87.5% | 87.9% | 87.5% | 88.8% | 84.4% | 3.3 | 0 |
12 | ashai | gpt-4o-mini | perplexity | β | β | 87.3% | 88.3% | 88.2% | 89.2% | 87.5% | 3.7 | 0 |
13 | ashai-experiment | gpt-4o-mini | default | β | β | 87.2% | 88.2% | 88.7% | 86.8% | 84.7% | 2.9 | 0 |
14 | ashai | gpt-4o-mini | default | β | β | 87.2% | 87.5% | 87.6% | 88.5% | 88.1% | 3.1 | 0 |
15 | ashai | gpt-4o-mini | default | β | β | 87.0% | 87.8% | 87.8% | 88.2% | 85.4% | 4.2 | 0 |
16 | ashai | gpt-4o-mini | default | β | β | 86.7% | 87.9% | 86.9% | 88.3% | 85.9% | 5.8 | 3 |
17 | strict-referenced-after | gpt-4o-mini | default | β | β | 86.6% | 88.3% | 85.8% | 87.8% | 84.0% | 3.2 | 0 |
18 | ashai | gpt-4o-mini | default | β | β | 86.1% | 87.9% | 88.2% | 87.7% | 83.2% | 4.6 | 0 |
19 | perplexity | sonar | default | β | β | 85.2% | 88.3% | 90.2% | 89.8% | 85.9% | 2.9 | 0 |
20 | ashai-experiment | gpt-4o-mini | default | β | β | 84.4% | 88.3% | 87.8% | 87.5% | 85.5% | 5.6 | 2 |
21 | ashai | gpt-4o-mini | perplexity | β | β | 80.0% | 80.8% | 82.1% | 83.2% | 81.0% | 6.4 | 2 |
22 | one-shot | gpt-5 | default | β | β | 78.4% | 90.0% | 90.7% | 88.7% | 88.5% | 13.6 | 0 |
23 | one-shot | gpt-4o-mini | default | β | β | 77.0% | 87.5% | 89.8% | 89.5% | 87.2% | 1.8 | 0 |
24 | strict-referenced | gpt-4o-mini | default | β | β | 76.2% | 75.0% | 76.1% | 80.2% | 75.7% | 4.1 | 0 |
25 | one-shot | gpt-4o-mini | default | β | β | 75.2% | 84.8% | 87.3% | 88.2% | 86.2% | 2.3 | 0 |
26 | strict-referenced-after | gpt-4o-mini | default | β | β | 74.8% | 78.8% | 73.2% | 75.8% | 72.5% | 5.8 | 0 |
27 | one-shot | gpt-5 | default | β | β | 74.5% | 85.4% | 84.9% | 87.8% | 86.0% | 8.3 | 0 |
v60
Changes
- none; running again in the hope that there are fewer errors
v59 - aug 27, 2025 1:42am
Changes
- ask_nihai and search_nihai promoted to be the only default tools (all others deprecated)
- error fixes
- perplexity_search no longer assumes choices exists
- ASHAI_AGENT_MAX_RETRIES = 10 agent retries (up from 2)
- backoff raised from a 0.6s to a 12s cap (see the retry sketch below)
- improved logging for search_nihai
- errors now shown on the home page leaderboard
- ask_nihai completely re-written using an agent
- going to just run some small tests to see if nihai is working
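A rough sketch of the retry-with-backoff behaviour these settings imply. ASHAI_AGENT_MAX_RETRIES is the environment variable named above; the call_with_backoff helper, the 0.6s starting delay, and the jitter are illustrative assumptions rather than the project's actual code.

```python
import os
import random
import time

MAX_RETRIES = int(os.getenv("ASHAI_AGENT_MAX_RETRIES", "10"))  # up from 2
BACKOFF_CAP_S = 12.0                                            # cap raised from 0.6s to 12s

def call_with_backoff(fn, *args, **kwargs):
    """Retry `fn` with exponential backoff, doubling the delay up to the cap."""
    delay = 0.6
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(min(delay, BACKOFF_CAP_S) + random.uniform(0, 0.1))
            delay *= 2
```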
Errors
Overall errors: 1/168 cases
Top runs by error rate:
- strict-referenced-after | gpt-5 | tools=default | full | v=v59 | 20250827_012837: 1/12 (8%)
- ashai | gpt-4o-mini | tools=default | full | v=v59 [no-reasoning] | 20250827_004203: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v59 [retry] | 20250827_005655: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v59 | 20250827_003944: 0/12 (0%)
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v59 | 20250827_005846: 0/12 (0%)
Common error messages:
- "1 validation error for AshaiResponse references.0.url Input should be a valid string [type=string_type, input_value=None, input_type=NoneType] For further information visit https://errors.pydantic.dev/2.11/v/string_type" ×1
Error categories:
- other ×1
Origins:
- agent:direct ×1
AI analysis:
Error Analysis Summary
- Significant Error Reduction: Overall errors decreased to 1/168 from 46/960, indicating substantial improvement in evaluation performance.
- Isolated Error Occurrence: The only error was from "strict-referenced-after" with gpt-5, suggesting a specific issue rather than widespread failures.
- Common Validation Error: The persistent validation error regarding references.0.url indicates potential input handling issues in the AshaiResponse model (see the sketch below).
- Model Stability Improvement: The reduction in errors suggests enhanced model stability and better handling of requests compared to previous evaluations.
- Next Steps: Investigate the validation error further, refine input handling, and continue monitoring for any emerging error patterns.
- Comparison to Previous Entry: Error rates have drastically improved, highlighting effective adjustments made since the last evaluation.
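One plausible shape of a fix for the references.0.url validation error is to let the URL field be optional (or to drop reference entries without URLs before building the response). A minimal pydantic sketch, assuming field names taken from the error message; the real AshaiResponse model may differ.

```python
from typing import Optional

from pydantic import BaseModel

class Reference(BaseModel):
    title: str
    # The v59 validation error came from `references.0.url` being None while the
    # field was declared as a plain `str`; making it Optional (or filtering out
    # reference entries without a URL before constructing the response) avoids it.
    url: Optional[str] = None

class AshaiResponse(BaseModel):
    answer: str
    references: list[Reference] = []
```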
Observations
- Top performer: ashai-experiment | gpt-5 – 88.4%
- Fastest: one-shot | gpt-4o-mini – 2.6s
- Top Performer: ashai-experiment | gpt-5 achieved 88.4%, a notable improvement from previous entries.
- Error Rate Decrease: Overall errors dropped to 1/168 cases, significantly better than 46/960 in v56.
- Speed Outlier: one-shot | gpt-4o-mini averaged 2.6 seconds, the fastest recorded time.
- Notable Error Hotspot: strict-referenced-after | gpt-5 had a single error rate of 8%, indicating potential instability.
- Zero Errors: ashai | gpt-4o-mini recorded 0 errors across multiple runs, showcasing strong reliability.
- Close Competition: strict-referenced | gpt-5 followed closely at 88.3%, maintaining competitive performance.
- Common Error Message: A single validation error related to input type was noted, suggesting a minor compatibility issue.
Leaderboard v59
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | default | β | β | 88.4% | 90.4% | 90.4% | 90.2% | 88.5% | 25.1 |
2 | strict-referenced | gpt-5 | default | β | β | 88.3% | 89.2% | 89.4% | 88.8% | 87.8% | 39.7 |
3 | ashai | gpt-5 | default | β | β | 88.1% | 90.4% | 90.2% | 89.4% | 89.6% | 43.3 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 86.7% | 87.5% | 87.2% | 89.2% | 86.8% | 6.9 |
5 | ashai | gpt-4o-mini | default | β | β | 86.5% | 87.9% | 89.2% | 88.0% | 86.2% | 6.8 |
6 | ashai | gpt-4o-mini | default | β | β | 85.5% | 87.9% | 89.7% | 88.9% | 86.7% | 9.1 |
7 | perplexity | sonar | default | β | β | 84.3% | 88.8% | 89.2% | 89.8% | 85.8% | 3.5 |
8 | ashai | gpt-4o-mini | default | β | β | 83.5% | 86.4% | 87.2% | 89.2% | 86.2% | 71.8 |
9 | ashai-experiment | gpt-4o-mini | default | β | β | 81.2% | 80.4% | 82.2% | 88.6% | 81.7% | 6.2 |
10 | strict-referenced-after | gpt-5 | default | β | β | 80.8% | 81.4% | 82.4% | 81.2% | 80.4% | 42.8 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 78.8% | 80.0% | 81.9% | 84.0% | 80.1% | 4.0 |
12 | one-shot | gpt-5 | default | β | β | 77.9% | 89.2% | 89.6% | 88.6% | 89.5% | 17.7 |
13 | one-shot | gpt-4o-mini | default | β | β | 77.5% | 88.8% | 90.1% | 89.8% | 88.0% | 2.6 |
14 | strict-referenced-after | gpt-4o-mini | default | β | β | 74.7% | 73.3% | 76.2% | 79.2% | 73.2% | 6.6 |
v56 - aug 26, 2025 9:01pm
Changes
- We're gonna do a massive run and test nihai against perplexity! Also trying gpt-5 runs.
Errors
Overall errors: 46/960 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v56 [retry] | 20250826_140225: 12/12 (100%)
- strict-referenced | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 | 20250826_143309: 8/12 (67%)
- strict-referenced | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 [no-reasoning] | 20250826_142351: 7/12 (58%)
- ashai | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 [retry] | 20250826_141646: 6/12 (50%)
- strict-referenced-after | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 [no-reasoning] | 20250826_141939: 4/12 (33%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-5]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×28
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×11
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×1
- "list index out of range" ×1
- "Backoff failure after 2 attempt(s) [agent:openai:gpt-5]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×1
Error categories:
- other ×41
- api_connection ×1
Origins:
- agent:direct ×42
Final attempt counts:
- after 1 attempt(s) ×39
- after 2 attempt(s) ×1
- after 3 attempt(s) ×1
AI analysis:
Error Analysis Summary
- Error Rate Increase: Overall errors rose to 46/960, up from 46/480, indicating worsening performance across evaluations.
- Persistent NotImplementedError: The "NotImplementedError" related to meta tensors remains the most frequent error, suggesting unresolved compatibility issues (see the sketch below).
- High Error Rates in Specific Runs: ashai with gpt-4o-mini recorded a 100% error rate, while strict-referenced gpt-5 reached 67%.
- Retry Failures: Most errors occurred after one or two attempts, indicating potential issues with model stability and request handling.
- Next Steps: Investigate the "NotImplementedError" further, optimize model configurations, and improve error handling to mitigate failure rates.
- Continuous Monitoring Needed: Ongoing tracking of error patterns is crucial to identify and resolve recurring issues in future evaluations.
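The meta-tensor NotImplementedError usually means a module whose parameters were created on PyTorch's meta device was moved with .to() instead of .to_empty(). A minimal reproduction-and-fix sketch, assuming the embedding model used by the nihai tools is loaded this way; it is not the project's actual loading code.

```python
import torch.nn as nn

# Parameters created on the "meta" device have shapes but no storage, so
# .to("cpu") / .to("cuda") has nothing to copy and raises the
# NotImplementedError seen in these runs.
model = nn.Linear(768, 768, device="meta")

# model.to("cpu")                      # raises: cannot copy out of meta tensor
model = model.to_empty(device="cpu")   # allocates real (uninitialized) storage
# model.load_state_dict(state_dict)    # then populate the weights from a checkpoint
```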
Observations
- Top performer: strict-referenced | gpt-5 – 89.5%
- Fastest: ashai | gpt-4o-mini – 1.4s
- Top Performer: ashai | gpt-5 achieved 89.4%, slightly down from 89.5% in v56.
- Close Competition: strict-referenced | gpt-5 closely followed at 88.9%, maintaining strong performance.
- Significant Error Rate: Overall errors increased to 46/960, up from 46/480 in v56, indicating persistent issues.
- Critical Error Hotspot: ashai | gpt-4o-mini recorded a 100% error rate in one instance, mirroring previous concerns.
- Common Errors: "NotImplementedError" remains prevalent, appearing 28 times, highlighting ongoing compatibility challenges.
- Speed Outlier: ashai | gpt-4o-mini averaged 1.4 seconds, significantly faster than previous entries.
- Retry Failures: Most errors occurred after one attempt, suggesting instability in request handling persists.
Leaderboard v56
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 89.5% | 91.7% | 90.4% | 89.6% | 88.2% | 20.7 |
2 | ashai | gpt-5 | nihai | β | β | 89.4% | 90.0% | 90.8% | 89.7% | 88.3% | 19.8 |
3 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 89.1% | 89.6% | 89.8% | 89.8% | 89.2% | 23.2 |
4 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 88.9% | 90.2% | 90.2% | 89.8% | 88.2% | 27.0 |
5 | strict-referenced | gpt-5 | perplexity | β | β | 88.9% | 89.8% | 90.0% | 89.8% | 88.6% | 18.7 |
6 | strict-referenced | gpt-5 | perplexity | β | β | 88.8% | 90.0% | 89.2% | 90.4% | 88.5% | 23.3 |
7 | strict-referenced | gpt-5 | perplexity | β | β | 88.7% | 90.8% | 88.4% | 89.3% | 85.6% | 42.6 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.6% | 89.2% | 90.1% | 89.3% | 86.3% | 40.4 |
9 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.5% | 89.4% | 90.0% | 89.2% | 86.7% | 46.5 |
10 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.5% | 90.0% | 90.5% | 88.2% | 85.8% | 19.2 |
11 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 88.4% | 89.2% | 88.6% | 89.7% | 84.8% | 3.3 |
12 | ashai-experiment | gpt-5 | perplexity | β | β | 88.3% | 90.0% | 90.4% | 90.2% | 89.5% | 14.8 |
13 | strict-referenced | gpt-5 | perplexity | β | β | 88.3% | 89.6% | 88.9% | 88.7% | 87.2% | 51.9 |
14 | ashai | gpt-5 | nihai, ask nihai | β | β | 88.1% | 89.6% | 90.2% | 89.8% | 87.9% | 21.0 |
15 | ashai-experiment | gpt-5 | perplexity | β | β | 88.1% | 88.8% | 90.0% | 89.8% | 88.3% | 17.2 |
16 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 88.0% | 88.8% | 90.6% | 89.0% | 86.3% | 31.6 |
17 | ashai | gpt-4o-mini | nihai, ask nihai | β | β | 88.0% | 88.8% | 89.2% | 89.6% | 88.0% | 5.6 |
18 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.0% | 88.3% | 90.0% | 88.6% | 86.4% | 18.7 |
19 | ashai | gpt-4o-mini | nihai | β | β | 87.9% | 88.8% | 88.3% | 89.4% | 84.9% | 3.5 |
20 | ashai-experiment | gpt-5 | perplexity | β | β | 87.9% | 90.0% | 90.4% | 90.4% | 89.8% | 38.2 |
21 | ashai | gpt-5 | perplexity | β | β | 87.8% | 90.5% | 91.5% | 89.8% | 88.5% | 14.7 |
22 | ashai-experiment | gpt-5 | perplexity | β | β | 87.8% | 89.8% | 90.4% | 90.0% | 89.8% | 40.8 |
23 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 87.6% | 87.2% | 89.8% | 90.1% | 87.0% | 4.0 |
24 | ashai | gpt-4o-mini | nihai | β | β | 87.6% | 88.8% | 88.2% | 89.4% | 85.3% | 3.9 |
25 | ashai | gpt-5 | perplexity | β | β | 87.6% | 90.0% | 89.6% | 89.8% | 88.2% | 16.6 |
26 | ashai | gpt-5 | nihai | β | β | 87.5% | 90.4% | 89.3% | 89.2% | 88.0% | 17.1 |
27 | strict-referenced | gpt-4o-mini | perplexity | β | β | 87.5% | 89.6% | 89.0% | 89.5% | 88.0% | 7.4 |
28 | ashai | gpt-5 | perplexity | β | β | 87.3% | 89.8% | 89.6% | 89.8% | 87.2% | 38.0 |
29 | ashai | gpt-4o-mini | perplexity | β | β | 87.2% | 87.9% | 87.9% | 89.2% | 87.5% | 6.2 |
30 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.2% | 87.9% | 88.6% | 89.7% | 85.5% | 11.3 |
31 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.1% | 88.8% | 89.2% | 89.9% | 87.3% | 7.0 |
32 | ashai | gpt-4o-mini | perplexity | β | β | 86.9% | 88.8% | 87.2% | 90.2% | 88.3% | 5.4 |
33 | strict-referenced | gpt-4o-mini | perplexity | β | β | 86.6% | 87.8% | 88.3% | 89.8% | 87.5% | 5.7 |
34 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.5% | 88.3% | 89.2% | 89.8% | 88.2% | 4.9 |
35 | ashai | gpt-4o-mini | nihai | β | β | 86.4% | 88.2% | 89.5% | 89.6% | 86.1% | 3.4 |
36 | ashai | gpt-4o-mini | perplexity | β | β | 86.4% | 87.5% | 89.2% | 88.5% | 84.8% | 4.8 |
37 | ashai | gpt-5 | perplexity | β | β | 86.2% | 89.2% | 89.9% | 89.0% | 89.4% | 37.7 |
38 | ashai | gpt-4o-mini | nihai | β | β | 86.2% | 87.5% | 88.8% | 90.4% | 84.8% | 3.7 |
39 | ashai | gpt-4o-mini | perplexity | β | β | 86.1% | 87.5% | 88.8% | 89.6% | 85.2% | 4.8 |
40 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.0% | 88.2% | 88.8% | 89.8% | 86.9% | 7.6 |
41 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 85.4% | 86.7% | 87.3% | 89.5% | 86.5% | 7.1 |
42 | perplexity | sonar | default | β | β | 85.0% | 87.4% | 90.2% | 89.5% | 86.3% | 3.3 |
43 | perplexity | sonar | default | β | β | 85.0% | 88.2% | 89.3% | 88.6% | 85.2% | 3.9 |
44 | strict-referenced | gpt-4o-mini | perplexity | β | β | 84.8% | 85.4% | 87.3% | 88.3% | 83.8% | 4.6 |
45 | ashai | gpt-4o-mini | ask nihai | β | β | 84.0% | 86.9% | 84.3% | 85.9% | 83.5% | 6.7 |
46 | strict-referenced | gpt-4o-mini | perplexity | β | β | 83.3% | 84.6% | 82.9% | 88.0% | 85.7% | 7.7 |
47 | strict-referenced | gpt-4o-mini | perplexity | β | β | 83.1% | 85.7% | 80.8% | 85.7% | 81.9% | 5.9 |
48 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 81.7% | 80.4% | 82.3% | 85.8% | 78.5% | 4.3 |
49 | ashai | gpt-4o-mini | perplexity | β | β | 80.9% | 82.5% | 81.8% | 84.4% | 81.6% | 4.7 |
50 | one-shot | gpt-5 | default | β | β | 80.9% | 90.0% | 91.2% | 89.7% | 89.7% | 24.0 |
51 | ashai | gpt-4o-mini | nihai, ask nihai | β | β | 80.9% | 82.1% | 82.1% | 82.5% | 79.2% | 9.6 |
52 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 80.7% | 80.4% | 82.8% | 82.0% | 79.6% | 3.8 |
53 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 80.5% | 81.7% | 80.8% | 82.2% | 79.6% | 4.2 |
54 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 80.4% | 81.4% | 81.1% | 82.7% | 77.5% | 4.4 |
55 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 80.3% | 80.4% | 83.6% | 84.8% | 83.3% | 10.2 |
56 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 79.9% | 80.9% | 82.1% | 82.5% | 77.7% | 5.7 |
57 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 79.6% | 81.7% | 78.8% | 81.7% | 80.8% | 8.4 |
58 | ashai | gpt-4o-mini | perplexity | β | β | 79.2% | 80.0% | 82.3% | 82.3% | 79.2% | 5.2 |
59 | one-shot | gpt-5 | default | β | β | 78.6% | 90.0% | 89.6% | 89.8% | 89.3% | 18.2 |
60 | one-shot | gpt-5 | default | β | β | 78.1% | 90.0% | 90.4% | 89.8% | 89.6% | 8.4 |
61 | one-shot | gpt-5 | default | β | β | 78.1% | 89.2% | 90.2% | 88.9% | 89.0% | 7.1 |
62 | one-shot | gpt-4o-mini | default | β | β | 77.6% | 87.5% | 89.0% | 89.8% | 88.3% | 3.0 |
63 | one-shot | gpt-4o-mini | default | β | β | 77.6% | 88.3% | 90.8% | 89.8% | 88.3% | 5.4 |
64 | one-shot | gpt-4o-mini | default | β | β | 77.2% | 87.5% | 90.0% | 89.8% | 88.0% | 2.9 |
65 | one-shot | gpt-4o-mini | default | β | β | 77.1% | 88.4% | 90.4% | 89.9% | 88.3% | 3.4 |
66 | one-shot | gpt-4o-mini | default | β | β | 76.4% | 86.2% | 89.0% | 89.8% | 87.1% | 3.0 |
67 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 76.3% | 77.5% | 76.7% | 79.8% | 76.0% | 13.4 |
68 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 75.4% | 74.2% | 77.7% | 78.1% | 76.8% | 6.9 |
69 | strict-referenced | gpt-4o-mini | perplexity | β | β | 75.3% | 74.2% | 75.8% | 79.3% | 75.1% | 5.7 |
70 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 74.8% | 75.5% | 73.6% | 80.0% | 75.3% | 9.2 |
71 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 74.5% | 72.9% | 75.4% | 77.3% | 74.4% | 5.0 |
72 | one-shot | gpt-4o-mini | default | β | β | 71.4% | 81.7% | 83.3% | 83.3% | 84.2% | 2.5 |
73 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 66.6% | 66.0% | 68.1% | 71.1% | 68.3% | 9.8 |
74 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 59.9% | 57.9% | 61.2% | 63.3% | 60.7% | 13.4 |
75 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 59.1% | 59.6% | 60.0% | 59.8% | 59.8% | 22.8 |
76 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 59.1% | 59.6% | 59.8% | 60.0% | 59.0% | 18.5 |
77 | ashai | gpt-5 | nihai, ask nihai | β | β | 43.6% | 45.0% | 45.4% | 45.0% | 44.0% | 11.9 |
78 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 36.9% | 37.5% | 37.1% | 37.5% | 35.8% | 18.8 |
79 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 29.4% | 29.6% | 30.0% | 30.0% | 29.2% | 9.9 |
80 | ashai | gpt-4o-mini | ask nihai | β | β | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.4 |
v55 - aug 26, 2025 12:14pm
Changes
Note - I accidentally ran v55 twice!
- yet another fix. I think we got all the errors!
- all errors fixed
- We're gonna do a massive run and test nihai against perplexity!
- emphasize that the LLM should use search_nihai before ask_nihai
Errors
Overall errors: 9/432 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v55 [retry] | 20250826_041428: 4/12 (33%)
- strict-referenced | gpt-5 | tools=search_perplexity | full | v=v55 | 20250826_031247: 3/12 (25%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v55 [no-reasoning] | 20250826_034143: 1/12 (8%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v55 [retry] | 20250826_040806: 1/12 (8%)
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v55 [retry] | 20250826_023034: 0/12 (0%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×3
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-5]: APITimeoutError: Request timed out." ×2
- "list index out of range" ×2
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: ModelHTTPError: status_code: 400, model_name: gpt-4o-mini, body: {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 136332 tokens (136248 in the messages, 84 in the functions). Please reduce the length of the messages or functions.", 'type': 'invalid_request_error', 'param': 'messages'" ×1
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-5]: APITimeoutError: Request timed out." ×1
Error categories:
- other ×6
- timeout ×3
Origins:
- agent:direct ×9
Final attempt counts:
- after 1 attempt(s) ×6
- after 3 attempt(s) ×1
AI analysis:
Error Analysis Summary
- Error Rate Stability: Current evaluation maintains a consistent error rate of 9/432, similar to the previous entry.
- Frequent Errors: "NotImplementedError" and "APITimeoutError" persist as common issues, indicating ongoing compatibility and timeout challenges.
- Context Limit Exceeded: One instance of exceeding token limits suggests a need for stricter message length management (see the trimming sketch below).
- Retry Patterns: Most errors occurred after one or three attempts, highlighting potential instability in request handling.
- Next Steps: Investigate specific error messages, optimize token usage, and enhance timeout handling to improve performance.
- Monitoring Required: Continue tracking error trends to identify persistent issues and implement necessary fixes in future evaluations.
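For the context-length failure (136,332 tokens against the 128,000-token limit), one common mitigation is to trim the oldest turns before calling the model. A hedged sketch using tiktoken; the token budget, message shape, and trimming policy here are assumptions, not the agent's actual history management.

```python
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")  # tokenizer family used by gpt-4o models
MAX_PROMPT_TOKENS = 120_000                # leave headroom under the 128k limit

def count_tokens(messages: list[dict]) -> int:
    """Rough token count; per-message overhead is ignored for simplicity."""
    return sum(len(ENC.encode(m.get("content") or "")) for m in messages)

def trim_history(messages: list[dict]) -> list[dict]:
    """Drop the oldest non-system turns until the conversation fits the budget."""
    trimmed = list(messages)
    while count_tokens(trimmed) > MAX_PROMPT_TOKENS and len(trimmed) > 2:
        # keep the system prompt at index 0; drop the oldest user/assistant turn
        trimmed.pop(1)
    return trimmed
```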
Observations
- Nihai is outperforming perplexity! And is fast! Like crazy fast! 4.4s!
- Top performer: strict-referenced | gpt-5 – 89.4%
- Fastest: one-shot | gpt-4o-mini – 2.6s
- Top performer: strict-referenced | gpt-5 maintained the lead with 89.4%, a slight increase from the previous entry.
- Close competition: ashai | gpt-5 follows closely at 88.7%, just 0.7% behind the leader.
- Speed outlier: one-shot | gpt-4o-mini achieved an impressive average of 2.6 seconds, significantly faster than previous entries.
- Error increase: Overall errors rose to 9/432 cases, a notable jump from 0 errors in the last evaluation.
- Error hotspots: High error rates observed in ashai | gpt-4o-mini and strict-referenced | gpt-5, particularly with timeout issues.
- Common errors: "Backoff failure" and "APITimeoutError" were frequently reported, indicating persistent issues in request handling.
- Final attempt counts: Multiple attempts were often required, suggesting instability in performance across various models.
Leaderboard v55
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | perplexity | β | β | 89.4% | 90.8% | 90.4% | 89.7% | 88.8% | 17.9 |
2 | ashai | gpt-5 | perplexity | β | β | 88.7% | 90.4% | 90.0% | 90.2% | 86.9% | 34.2 |
3 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.6% | 89.6% | 91.1% | 89.8% | 86.9% | 18.3 |
4 | ashai | gpt-5 | perplexity | β | β | 88.6% | 90.0% | 90.4% | 89.8% | 87.2% | 14.8 |
5 | ashai | gpt-4o-mini | nihai, ask nihai | β | β | 88.4% | 89.2% | 89.2% | 89.3% | 86.7% | 4.4 |
6 | ashai-experiment | gpt-5 | perplexity | β | β | 88.4% | 90.0% | 90.4% | 90.0% | 89.2% | 16.0 |
7 | ashai-experiment | gpt-5 | perplexity | β | β | 88.2% | 90.4% | 90.4% | 90.0% | 88.9% | 35.3 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.1% | 89.6% | 90.6% | 89.6% | 88.6% | 42.4 |
9 | ashai | gpt-4o-mini | ask nihai | β | β | 87.9% | 88.2% | 88.8% | 89.4% | 86.0% | 7.8 |
10 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 87.7% | 88.8% | 88.9% | 89.2% | 87.1% | 3.7 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 87.5% | 90.4% | 90.0% | 90.2% | 86.5% | 8.1 |
12 | ashai | gpt-4o-mini | perplexity | β | β | 87.3% | 88.8% | 87.1% | 89.8% | 85.0% | 4.8 |
13 | ashai | gpt-4o-mini | nihai | β | β | 87.2% | 88.3% | 88.8% | 88.0% | 86.0% | 4.1 |
14 | ashai | gpt-4o-mini | nihai | β | β | 87.0% | 88.8% | 88.8% | 89.5% | 86.7% | 4.3 |
15 | ashai | gpt-4o-mini | nihai | β | β | 86.8% | 88.3% | 89.4% | 90.2% | 86.0% | 4.1 |
16 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.2% | 87.9% | 87.7% | 89.4% | 85.7% | 6.9 |
17 | perplexity | sonar | default | β | β | 85.8% | 88.8% | 88.8% | 89.0% | 87.0% | 3.5 |
18 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 83.1% | 86.2% | 84.8% | 85.2% | 81.1% | 5.2 |
19 | ashai | gpt-4o-mini | perplexity | β | β | 81.3% | 81.7% | 82.5% | 85.7% | 81.7% | 8.2 |
20 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 81.2% | 83.3% | 82.8% | 84.2% | 82.5% | 7.1 |
21 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 80.9% | 80.0% | 83.8% | 82.1% | 78.3% | 5.6 |
22 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 80.7% | 80.8% | 82.1% | 84.0% | 82.9% | 9.5 |
23 | strict-referenced | gpt-4o-mini | perplexity | β | β | 79.7% | 79.6% | 79.3% | 84.8% | 79.1% | 8.2 |
24 | strict-referenced | gpt-4o-mini | perplexity | β | β | 79.5% | 80.8% | 80.6% | 82.2% | 79.8% | 5.3 |
25 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 79.1% | 77.9% | 80.4% | 82.8% | 79.4% | 9.0 |
26 | one-shot | gpt-5 | default | β | β | 78.1% | 89.6% | 91.2% | 89.0% | 89.6% | 18.4 |
27 | one-shot | gpt-5 | default | β | β | 77.8% | 90.0% | 90.7% | 89.0% | 88.5% | 8.7 |
28 | one-shot | gpt-4o-mini | default | β | β | 77.5% | 88.8% | 89.4% | 89.8% | 89.7% | 3.1 |
29 | one-shot | gpt-4o-mini | default | β | β | 77.4% | 89.2% | 90.1% | 90.4% | 88.5% | 2.7 |
30 | one-shot | gpt-4o-mini | default | β | β | 76.9% | 87.1% | 89.4% | 89.7% | 87.3% | 2.6 |
31 | strict-referenced | gpt-4o-mini | perplexity | β | β | 73.3% | 72.5% | 74.2% | 78.2% | 74.0% | 5.7 |
32 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 69.2% | 69.2% | 67.1% | 74.0% | 66.7% | 4.1 |
33 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 67.6% | 67.9% | 68.1% | 70.2% | 66.3% | 9.7 |
34 | strict-referenced | gpt-5 | perplexity | β | β | 66.5% | 67.5% | 67.5% | 67.5% | 67.2% | 98.7 |
35 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 65.5% | 65.9% | 65.9% | 69.1% | 63.5% | 10.4 |
36 | ashai | gpt-4o-mini | ask nihai | β | β | 58.6% | 58.8% | 58.9% | 59.8% | 58.9% | 6.1 |
v54 - aug 26, 2025 2:22am
Changes
Errors
Overall errors: 7/24 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v54 [retry] | 20250826_022134: 7/12 (58%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v54 [retry] | 20250826_022011: 0/12 (0%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×6
- "Backoff failure after 2 attempt(s) [agent:openai:gpt-4o-mini]: UnexpectedStatusCodeError: Object was not added! Unexpected status code: 422, with response body: {'error': [{'message': "id '229fd26b-f8d8-59fa-88ad-a0026e60e4a0' already exists"}]}." ×1
Error categories:
- other ×7
Origins:
- agent:direct ×7
Final attempt counts:
- after 1 attempt(s) ×6
- after 2 attempt(s) ×1
AI analysis:
Error Analysis Summary
- Error Rate Improvement: Overall errors decreased from 24/60 to 7/24, indicating significant progress in reliability.
- High Performance in Search Tool: The "search_nihai" tool achieved a 0% error rate, suggesting effective handling of requests.
- Persistent NotImplementedError: The "NotImplementedError" remains a concern, indicating potential issues with tensor handling in the model.
- Single 422 Error: The occurrence of a 422 error suggests isolated input validation issues, requiring targeted investigation.
- Next Steps: Address the NotImplementedError by reviewing tensor management; continue monitoring error rates for further improvements.
- Comparison to Previous Entry: Notable reduction in errors from v52, indicating successful adjustments or fixes implemented since the last evaluation.
Observations
- Top performer: ashai | gpt-4o-mini – 85.4%
- Fastest: ashai | gpt-4o-mini – 4.4s
- Top performer: ashai | gpt-4o-mini | nihai achieved 85.4%, a slight improvement from 84.9% in v52.
- Significant gap: The second-place entry trails by 3%, indicating a widening performance disparity.
- Speed outlier: ashai | gpt-4o-mini | ask nihai recorded an average of 4.9 seconds, slower than previous entries.
- Error hotspots: Overall errors decreased to 7/24 cases, showing improvement from 24/60 in v52.
- Common errors: Persistent "NotImplementedError" and "UnexpectedStatusCodeError" indicate ongoing issues with model handling.
- Final attempt counts: Most errors occurred after a single attempt, suggesting initial request handling still needs attention.
Leaderboard v54
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | nihai | β | β | 85.4% | 86.9% | 88.8% | 90.0% | 86.5% | 4.4 |
2 | ashai | gpt-4o-mini | ask nihai | β | β | 36.8% | 37.1% | 37.9% | 37.9% | 36.5% | 4.9 |
v53
Changes
- fixed tool input format error in ask_nihai
Errors
Overall errors: 12/24 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v53 [retry] | 20250826_021607: 12/12 (100%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v53 [retry] | 20250826_021446: 0/12 (0%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: UnexpectedModelBehavior: Tool '_tool_wrap_nihai_ask' exceeded max retries count of 1" ×8
Error categories:
- other ×8
Origins:
- agent:direct ×8
Final attempt counts:
- after 1 attempt(s) ×8
AI analysis:
Error Analysis Summary¶
- Error Rate Improvement: Overall errors decreased from 24/60 to 12/24, indicating a positive trend in reliability.
- Persistent Tool Issues: The "ask_nihai" tool still shows a 100% error rate, requiring urgent attention.
- Common Error Messages: "UnexpectedModelBehavior" errors persist, suggesting issues with tool handling or input processing.
- Retry Limit Reached: Frequent backoff failures indicate that the tool is not recovering from errors effectively.
- Next Steps: Investigate the "ask_nihai" tool's logic; consider increasing retry limits or refining input validation (see the sketch after this list).
- Comparison to Previous Entry: The reduction in overall errors is promising, but the "ask_nihai" tool's performance remains a critical concern.
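As one concrete direction for those next steps, here is a hedged sketch of hardening the ask_nihai call with input validation and a small retry budget. The endpoint URL matches the one in the v52 error log further down; the payload field name, retry count, and function shape are assumptions, not the repo's actual tool wrapper.

```python
import httpx

ASK_NIHAI_URL = "http://localhost:8000/search/nihai/ask"  # URL taken from the v52 error log

def ask_nihai(question: str, max_attempts: int = 3) -> dict:
    """Validate input up front, then retry transient failures; 422s fail immediately."""
    if not isinstance(question, str) or not question.strip():
        raise ValueError("ask_nihai expects a non-empty question string")
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            response = httpx.post(ASK_NIHAI_URL, json={"question": question}, timeout=30)
            response.raise_for_status()
            return response.json()
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code == 422:
                raise  # malformed request body; retrying the same payload won't help
            last_error = exc
        except httpx.TransportError as exc:
            last_error = exc
    raise RuntimeError(f"ask_nihai failed after {max_attempts} attempts") from last_error
```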
Observations¶
- Top performer: ashai | gpt-4o-mini – 84.2%
- Fastest: ashai | gpt-4o-mini – 0.3s
- Top performer: ashai | gpt-4o-mini | nihai achieved 84.2%, slightly down from 84.9% in v52.
- Significant gap: Second place trails by 1.8%, indicating a narrowing competitive field.
- Speed outlier: ashai | gpt-4o-mini | ask nihai recorded an average of 0.3 seconds, significantly faster than previous entries.
- Error hotspots: Overall errors decreased to 12/24 cases, but ashai | gpt-4o-mini | ask nihai still shows a 100% error rate.
- Common errors: "UnexpectedModelBehavior" persists, indicating ongoing issues with the ask nihai tool.
- Error reduction: Overall errors improved from 24 in v52 to 12 in v53, marking a positive trend in reliability.
- Final attempt counts: Most errors occurred after a single attempt, suggesting initial request handling remains problematic.
Leaderboard v53¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | nihai | β | β | 84.2% | 87.9% | 88.6% | 89.7% | 88.1% | 4.3 |
2 | ashai | gpt-4o-mini | ask nihai | β | β | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.3 |
v52 - Aug 26 1:00 am¶
Changes¶
- 0 references now hard-codes reference_sufficiency to 0 for med accuracy
- trying nihai out! small batch here
- documented tools - lots of docs changes
- renamed a lot of niharika FAQ to google sheets (it was still poorly renamed)
Errors¶
Overall errors: 24/60 cases

Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v52 [retry] | 20250826_020748: 12/12 (100%)
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v52 [retry] | 20250826_015655: 11/12 (92%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v52 [retry] | 20250826_015759: 1/12 (8%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v52 [retry] | 20250826_015521: 0/12 (0%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v52 [retry] | 20250826_020558: 0/12 (0%)

Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: HTTPStatusError: Client error '422 Unprocessable Content' for url 'http://localhost:8000/search/nihai/ask' For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422" ×8
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: UnexpectedModelBehavior: Tool '_tool_wrap_nihai_ask' exceeded max retries count of 1" ×8
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×1

Error categories:
- other ×16
- api_connection ×1

Origins:
- agent:direct ×17

Final attempt counts:
- after 1 attempt(s) ×16
- after 3 attempt(s) ×1
AI analysis:
Error Analysis Summary¶
- Consistent Error Rate: Overall errors remain at 24/60, unchanged from the previous entry, indicating persistent issues.
- High Failure Rate: The "ask_nihai" tool continues to show a 100% error rate, suggesting severe reliability problems.
- Frequent 422 Errors: The "422 Unprocessable Content" error persists, indicating ongoing issues with input validation or API request formatting.
- Connection Issues: A single APIConnectionError suggests potential network reliability problems that need further investigation.
- Next Steps: Review and refine API configurations for "ask_nihai"; conduct network stability tests to address connectivity issues.
- Regression Analysis: Compare recent changes to identify specific modifications contributing to sustained error rates.
Observations¶
- Top performer: ashai | gpt-4o-mini – 84.9%
- Fastest: ashai | gpt-4o-mini – 0.8s
- Top performer: ashai | gpt-4o-mini | nihai achieved 84.9%, down from 89.7% in v51, indicating a performance decline.
- Significant gap: The second-place entry trails by 2.5%, highlighting a notable drop in competitive performance.
- Speed outlier: ashai | gpt-4o-mini | ask nihai recorded an average of 0.8 seconds, significantly faster than previous entries.
- Error hotspots: Overall errors increased to 24/60 cases, with ashai | gpt-4o-mini | ask nihai showing a 100% error rate.
- Common errors: Persistent "422 Unprocessable Content" and "UnexpectedModelBehavior" errors indicate ongoing issues with input handling.
- Error increase: Overall errors rose from 12 in v51 to 24 in v52, marking a regression in reliability.
- Final attempt counts: Most errors occurred after a single attempt, suggesting issues with initial request handling.
Leaderboard v52¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | nihai | β | β | 84.9% | 87.9% | 89.3% | 89.8% | 86.8% | 5.3 |
2 | ashai | gpt-4o-mini | nihai | β | β | 82.4% | 87.5% | 89.2% | 89.6% | 88.8% | 6.6 |
3 | ashai | gpt-4o-mini | nihai | β | β | 77.1% | 79.6% | 80.9% | 81.0% | 81.1% | 4.7 |
4 | ashai | gpt-4o-mini | ask nihai | β | β | 6.8% | 8.3% | 7.9% | 7.5% | 7.5% | 2.8 |
5 | ashai | gpt-4o-mini | ask nihai | β | β | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.8 |
v51¶
Changes¶
- fixed hopefully last of errors
- evaluating no-reasoning runs.
- evaluator names are now good hopefully
Errors¶
- Hell yeah no errors!! Woot! First run with none
Observations¶
- turning off reasoning doesn't really impact gpt-5 (in fact it sometimes beats the reasoning version - maybe it's not really how to turn it off), except it halves the time to complete
- one-shot oddly still gets 90% on med acc - it should not be able to
- one-shot on gpt-4 is pretty much as good as gpt-5
- perplexity is the shit and a continued safe and strong performer - also super fast
- Top performer: ashai-experiment | gpt-5 – 89.7%
- eval names look great in weave
- Fastest: one-shot | gpt-4o-mini – 2.6s
- Top performer: ashai | gpt-5 achieved 89.7%, surpassing previous leaders and indicating strong model performance.
- Speed outlier: one-shot | gpt-4o-mini averaged 2.6 seconds, the fastest in this evaluation, improving from previous entries.
- Error reduction: Overall errors decreased to 0/312 cases, a substantial improvement from 2 errors in v50.
- Performance consistency: ashai runs with both gpt-4o-mini and gpt-5 showed no errors, reflecting improved reliability.
Leaderboard v51¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | perplexity | β | β | 89.7% | 90.8% | 90.0% | 89.6% | 90.0% | 27.4 |
2 | strict-referenced | gpt-5 | perplexity | β | β | 89.2% | 90.0% | 90.2% | 89.8% | 88.3% | 32.0 |
3 | ashai | gpt-5 | perplexity | β | β | 89.1% | 90.4% | 90.8% | 89.7% | 87.9% | 16.9 |
4 | ashai | gpt-5 | perplexity | β | β | 88.9% | 89.6% | 91.4% | 89.5% | 88.7% | 23.0 |
5 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.8% | 89.2% | 90.4% | 89.6% | 88.9% | 47.9 |
6 | one-shot | gpt-5 | default | β | β | 88.7% | 90.0% | 89.7% | 88.0% | 87.7% | 22.0 |
7 | ashai-experiment | gpt-5 | perplexity | β | β | 88.6% | 90.6% | 90.2% | 89.9% | 88.9% | 19.1 |
8 | one-shot | gpt-4o-mini | default | β | β | 88.6% | 90.0% | 90.4% | 89.8% | 87.5% | 3.9 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 88.4% | 89.2% | 89.0% | 89.5% | 89.7% | 16.1 |
10 | one-shot | gpt-4o-mini | default | β | β | 88.2% | 88.3% | 90.7% | 90.0% | 87.3% | 2.6 |
11 | perplexity | sonar | default | β | β | 88.0% | 88.2% | 90.2% | 89.8% | 85.0% | 3.3 |
12 | one-shot | gpt-4o-mini | default | β | β | 87.8% | 89.2% | 90.2% | 88.2% | 88.2% | 4.5 |
13 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.4% | 88.3% | 89.2% | 90.0% | 86.2% | 7.9 |
14 | strict-referenced-after | gpt-5 | perplexity | β | β | 87.3% | 88.8% | 89.5% | 88.9% | 87.1% | 18.1 |
15 | ashai | gpt-4o-mini | perplexity | β | β | 87.1% | 88.6% | 89.0% | 89.8% | 87.7% | 5.8 |
16 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.1% | 87.5% | 89.4% | 89.8% | 87.0% | 6.9 |
17 | strict-referenced | gpt-4o-mini | perplexity | β | β | 86.9% | 87.8% | 88.5% | 89.7% | 86.7% | 7.4 |
18 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.6% | 88.3% | 86.9% | 87.3% | 86.9% | 10.0 |
19 | ashai | gpt-4o-mini | perplexity | β | β | 86.4% | 86.7% | 89.2% | 88.2% | 86.3% | 6.7 |
20 | strict-referenced | gpt-4o-mini | perplexity | β | β | 86.2% | 86.8% | 86.7% | 88.7% | 84.5% | 5.4 |
21 | one-shot | gpt-5 | default | β | β | 85.1% | 88.2% | 82.3% | 87.1% | 83.7% | 8.4 |
22 | ashai | gpt-4o-mini | perplexity | β | β | 81.4% | 84.6% | 81.7% | 85.0% | 81.7% | 5.9 |
23 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 80.9% | 80.4% | 80.8% | 83.4% | 82.3% | 12.2 |
24 | strict-referenced | gpt-4o-mini | perplexity | β | β | 80.9% | 84.6% | 79.8% | 83.4% | 81.2% | 8.1 |
25 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 80.4% | 81.2% | 80.8% | 83.9% | 78.3% | 10.8 |
26 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 74.0% | 73.8% | 75.8% | 77.2% | 76.7% | 7.9 |
v50 - aug 25, 2025 8:57pm¶
Changes¶
- Updated _is_retriable_error - transient OpenAI network hiccups will now be retried instead of failing after 1 attempt, reducing those "Backoff failure after 1 attempt(s) ... APIConnectionError: Connection error." entries (see the sketch after this list).
- cleaning up auto evaluator
- fixed please try again fallback
- Increase AGENT_MAX_RETRIES by 1 for models showing higher transient fail rates
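Here is a minimal sketch of the retry behavior described in the first change above; the exception set, backoff schedule, and default retry count are assumptions, not the repo's exact values.

```python
import time

import httpx
from openai import APIConnectionError, APITimeoutError, RateLimitError

AGENT_MAX_RETRIES = 3  # assumed default; the log only says it was increased by 1

def _is_retriable_error(exc: Exception) -> bool:
    # transient network/throttling problems are worth retrying; everything else fails fast
    return isinstance(exc, (APIConnectionError, APITimeoutError, RateLimitError, httpx.TransportError))

def call_with_backoff(fn, *args, max_attempts: int = AGENT_MAX_RETRIES, **kwargs):
    """Retry `fn` with exponential backoff, but only for retriable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if attempt == max_attempts or not _is_retriable_error(exc):
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff between attempts
```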
Errors¶
Overall errors: 2/132 cases

Top runs by error rate:
- one-shot | gpt-4o-mini | tools=default | full | v=v50 | 20250825_184632: 1/12 (8%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v50 | 20250825_184127: 1/12 (8%)
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v50 | 20250825_183937: 0/12 (0%)
- ashai | gpt-5 | tools=search_perplexity | full | v=v50 | 20250825_185505: 0/12 (0%)
- ashai-experiment | gpt-4o-mini | tools=search_perplexity | full | v=v50 | 20250825_183711: 0/12 (0%)

Common error messages:
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×2

Error categories:
- api_connection ×2

Origins:
- agent:direct ×2

Final attempt counts:
- after 3 attempt(s) ×2
AI analysis:
Error Analysis¶
- Error Rate Improvement: Overall errors decreased from 25/132 (19%) to 2/132 (1.5%), indicating significant progress in model stability.
- Top Error Patterns: Both top errors (1/12) occurred in runs with `gpt-4o-mini`, suggesting potential model-specific issues under certain configurations.
- Common Connection Errors: Persistent APIConnectionErrors indicate network reliability issues; two instances reported backoff failures after three attempts.
- Model Performance Consistency: The `ashai` runs with `gpt-4o-mini` and `gpt-5` showed no errors, indicating improved reliability in these configurations.
- Next Steps: Investigate network stability and API performance; consider further testing with `gpt-4o-mini` under different conditions to isolate issues.
- Comparison to Previous Entry: The previous entry had multiple high-error runs; current results show a marked reduction, suggesting effective adjustments in model handling or configuration.
Observations¶
- Top performer: strict-referenced | gpt-5 – 88.7%
- Fastest: one-shot | gpt-4o-mini – 3.0s
- Top performer: perplexity | sonar achieved 87.7%, maintaining the lead from v49.
- Significant gap: Second place (one-shot | gpt-4o-mini) scored 86.9%, indicating a notable performance drop.
- Speed outlier: one-shot | gpt-4o-mini averaged 3.0 seconds, the fastest in this evaluation.
- Error reduction: Overall errors decreased to 2/132 cases, a significant improvement from 25 errors in v49.
- Notable error hotspots: Common errors were limited to API connection issues, with only two occurrences reported.
- Performance consistency: ashai and strict-referenced models showed improved accuracy, with no errors in their latest runs.
Leaderboard v50¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | perplexity | β | β | 88.7% | 90.0% | 89.4% | 89.8% | 88.4% | 44.7 |
2 | one-shot | gpt-5 | default | β | β | 88.6% | 89.6% | 89.6% | 89.8% | 88.3% | 25.6 |
3 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.4% | 88.8% | 90.0% | 89.8% | 87.2% | 34.5 |
4 | ashai | gpt-5 | perplexity | β | β | 88.2% | 89.0% | 90.2% | 89.8% | 88.2% | 36.4 |
5 | perplexity | sonar | default | β | β | 87.3% | 87.8% | 89.2% | 88.8% | 85.7% | 4.2 |
6 | ashai-experiment | gpt-5 | perplexity | β | β | 87.2% | 87.3% | 89.6% | 88.4% | 88.3% | 34.8 |
7 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 85.9% | 86.3% | 86.2% | 87.6% | 84.4% | 9.6 |
8 | ashai | gpt-4o-mini | perplexity | β | β | 84.9% | 85.0% | 86.5% | 88.8% | 86.0% | 6.7 |
9 | one-shot | gpt-4o-mini | default | β | β | 80.8% | 81.8% | 82.5% | 82.5% | 80.7% | 3.0 |
10 | strict-referenced | gpt-4o-mini | perplexity | β | β | 78.9% | 78.3% | 80.4% | 83.8% | 78.5% | 11.3 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 45.8% | 47.5% | 42.5% | 50.8% | 48.8% | 9.1 |
v49 - aug 25, 2025 2:09pm¶
Changes¶
- removed the second med-acc route for runs with no tools - meaning agents with no tools can't score higher than 80% on medical accuracy (sketched after this list)
- moved batch setup to batch_set.py for easier work
- lots of docs improvements that shouldn't impact tests
- filter out response errors at top level vs at the scorer level
- renamed Niharika FAQ to googlesheets
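A hedged sketch of the rule in the first change above; the function and constant names are illustrative assumptions, not the repo's actual scorer code.

```python
NO_REFERENCE_MED_ACC_CAP = 0.80  # agents that cite no references can't exceed 80%

def cap_medical_accuracy(raw_med_acc: float, references: list[str]) -> float:
    """Clamp the medical-accuracy score when the response cites no references."""
    if not references:
        return min(raw_med_acc, NO_REFERENCE_MED_ACC_CAP)
    return raw_med_acc

# Example: a 0.92 raw score with no references is reported as 0.80
print(cap_medical_accuracy(0.92, references=[]))
```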
Errors¶
Overall errors: 25/132 cases

Top runs by error rate:
- ashai-experiment | gpt-5 | tools=search_perplexity | full | v=v49 | 20250825_133303: 5/12 (42%)
- strict-referenced | gpt-5 | tools=search_perplexity | full | v=v49 | 20250825_135920: 4/12 (33%)
- strict-referenced-after | gpt-5 | tools=search_perplexity | full | v=v49 | 20250825_134255: 4/12 (33%)
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v49 | 20250825_132634: 3/12 (25%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v49 | 20250825_132801: 3/12 (25%)

Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-5]: APIConnectionError: Connection error." ×17
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×7
- "Error: Failed to generate a properly formatted response. Please try again." ×1
Observations¶
- Top performer: perplexity | sonar – 87.7%
Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 87.7% | 88.0% | 89.2% | 89.7% | 85.8% | 3.2 |
2 | one-shot | gpt-4o-mini | default | β | β | 86.9% | 87.1% | 90.7% | 89.8% | 88.3% | 3.8 |
3 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 80.4% | 81.5% | 80.4% | 82.9% | 82.5% | 6.0 |
4 | one-shot | gpt-5 | default | β | β | 74.5% | 75.4% | 75.6% | 74.4% | 75.2% | 12.7 |
5 | ashai | gpt-5 | perplexity | β | β | 74.3% | 75.0% | 75.4% | 74.6% | 73.8% | 25.5 |
6 | strict-referenced | gpt-4o-mini | perplexity | β | β | 67.1% | 65.8% | 67.1% | 72.1% | 64.6% | 4.8 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 65.0% | 66.7% | 66.7% | 66.2% | 63.5% | 4.8 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 58.7% | 59.6% | 59.2% | 59.7% | 58.2% | 79.6 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 58.6% | 59.2% | 59.8% | 59.5% | 58.3% | 35.2 |
10 | ashai-experiment | gpt-5 | perplexity | β | β | 52.1% | 52.9% | 53.3% | 52.5% | 52.5% | 18.9 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 43.0% | 43.6% | 42.1% | 47.5% | 43.8% | 9.0 |
Reverse-chronological log of what was learned for each evaluation iteration.
v48¶
Changes¶
- bringing back evidence_sufficiency to tests with no references
- filtering errors in higher class (not at each scorer)
- moved the weights to the top of scorer files and added to docs (should not impact anything)
- added an auto-learnings filer
v47 - take 2 - Aug 25, 2025¶
Changes¶
- forgot to increment version so its added on to the last v47
- more backoff, reduced concurrency
- TIMEOUT - increased past 300
Observations¶
- one-shot won... though I have no idea how, because it can't use tools, so it had 0 references and should have gotten at most 80% on medical accuracy.
- investigated and oh yeah, it gets a different scorer when there are no references. Let's fix that.
Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-4o-mini | default | β | β | 88.2% | 88.3% | 89.6% | 90.2% | 88.2% | 2.3 |
2 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 88.1% | 89.6% | 88.8% | 89.2% | 86.5% | 6.4 |
3 | perplexity | sonar | default | β | β | 87.5% | 88.8% | 88.8% | 89.0% | 85.8% | 3.2 |
4 | perplexity | sonar | default | β | β | 87.4% | 89.2% | 89.1% | 88.4% | 83.8% | 3.1 |
5 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 83.1% | 84.2% | 82.5% | 89.1% | 81.7% | 14.8 |
6 | one-shot | gpt-5 | default | β | β | 81.8% | 83.3% | 83.8% | 81.4% | 80.2% | 16.1 |
7 | one-shot | gpt-5 | default | β | β | 81.4% | 82.5% | 83.1% | 81.7% | 81.5% | 13.7 |
8 | one-shot | gpt-4o-mini | default | β | β | 73.3% | 74.2% | 75.8% | 75.1% | 71.7% | 2.3 |
9 | strict-referenced | gpt-4o-mini | perplexity | β | β | 72.6% | 74.2% | 73.3% | 74.4% | 71.7% | 4.4 |
10 | ashai | gpt-4o-mini | perplexity | β | β | 72.3% | 72.9% | 73.2% | 73.8% | 70.0% | 5.2 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 72.3% | 73.2% | 73.8% | 74.4% | 71.7% | 3.7 |
12 | ashai-experiment | gpt-5 | perplexity | β | β | 66.5% | 67.1% | 67.7% | 67.1% | 66.7% | 24.3 |
13 | ashai | gpt-5 | perplexity | β | β | 66.2% | 67.9% | 67.9% | 67.5% | 66.1% | 20.3 |
14 | ashai | gpt-5 | perplexity | β | β | 66.0% | 67.1% | 67.1% | 67.4% | 66.5% | 28.4 |
15 | strict-referenced | gpt-4o-mini | perplexity | β | β | 59.4% | 60.4% | 59.8% | 61.7% | 58.8% | 4.5 |
16 | strict-referenced-after | gpt-5 | perplexity | β | β | 59.0% | 60.0% | 61.1% | 59.7% | 57.3% | 29.6 |
17 | ashai-experiment | gpt-5 | perplexity | β | β | 56.8% | 57.5% | 57.5% | 57.1% | 55.5% | 37.4 |
18 | strict-referenced | gpt-5 | perplexity | β | β | 52.1% | 52.5% | 52.9% | 52.3% | 51.2% | 25.0 |
19 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 50.7% | 50.4% | 51.7% | 51.7% | 51.0% | 7.9 |
20 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 46.2% | 45.4% | 47.1% | 48.8% | 46.9% | 8.3 |
21 | strict-referenced-after | gpt-5 | perplexity | β | β | 44.5% | 45.0% | 45.4% | 44.2% | 43.3% | 25.0 |
22 | strict-referenced | gpt-5 | perplexity | β | β | 44.3% | 44.2% | 45.0% | 44.6% | 43.8% | 28.9 |
v44 - Aug¶
Changes¶
- upgraded tier of perplexity - not limited to 50 calls/min
- wrapper for perplexity
v43 - Aug 23, 2025 3pm¶
Changes¶
- BIG Merge - moving to weave's typed models.
- NihAI v1 added but will not be used.
- error logging hopefully fixed, total_cases should always be 11
Observations¶
- oddly gpt-5 is doing worse still, even with backoff and stuff.
Leaderboard v43¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.9% | 90.0% | 89.2% | 88.3% | 85.7% | 7.3 |
2 | perplexity | sonar | default | β | β | 87.4% | 88.8% | 89.2% | 88.6% | 85.0% | 3.6 |
3 | one-shot | gpt-5 | default | β | β | 81.9% | 83.3% | 83.5% | 82.3% | 81.7% | 20.6 |
4 | one-shot | gpt-4o-mini | default | β | β | 80.1% | 80.4% | 81.9% | 82.5% | 82.1% | 3.6 |
5 | strict-referenced | gpt-4o-mini | perplexity | β | β | 79.5% | 80.8% | 80.6% | 82.3% | 80.0% | 6.6 |
6 | ashai | gpt-5 | perplexity | β | β | 70.3% | 72.3% | 65.9% | 74.5% | 71.2% | 27.4 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 67.2% | 67.5% | 67.5% | 69.7% | 67.2% | 6.9 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 66.7% | 67.1% | 68.5% | 68.6% | 67.9% | 24.5 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 52.7% | 52.5% | 53.3% | 54.2% | 54.7% | 24.2 |
10 | ashai-experiment | gpt-5 | perplexity | β | β | 52.5% | 52.5% | 52.1% | 54.6% | 55.0% | 20.0 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 39.3% | 37.1% | 37.9% | 44.2% | 42.1% | 9.0 |
v39 - Aug 22, 2025 5pm¶
Changes¶
- added better error tracking and logging to evals
- switched eval judge to 'gpt-4o-mini-2024-07-18'
- moved all models to Weave's typed models instead of model_funcs for better tracking and naming
- reduced judge token budget from 8000 to 2500 to improve performance
- added a retry backoff - hopefully helps with throttling but may increase times
Observations¶
- perplexity ran first and the server wasn't up - this was a bug and is being fixed
- the error logs are saying there were 0 errors but also a varying number of cases run per eval... because failures aren't making it to the end where there's error handling
Leaderboard v39¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-4o-mini | default | β | β | 88.2% | 88.6% | 90.3% | 89.8% | 88.8% | 4.0 |
2 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.9% | 87.8% | 88.3% | 89.4% | 86.2% | 10.6 |
3 | ashai | gpt-5 | perplexity | β | β | 74.2% | 75.4% | 75.4% | 74.7% | 75.7% | 28.1 |
4 | ashai | gpt-4o-mini | perplexity | β | β | 72.9% | 75.0% | 74.9% | 76.0% | 73.5% | 6.4 |
5 | strict-referenced | gpt-4o-mini | perplexity | β | β | 70.7% | 72.3% | 70.0% | 71.5% | 70.7% | 10.3 |
6 | strict-referenced-after | gpt-5 | perplexity | β | β | 61.0% | 60.4% | 61.8% | 62.5% | 63.8% | 28.0 |
7 | ashai-experiment | gpt-5 | perplexity | β | β | 59.7% | 59.8% | 59.6% | 60.0% | 63.8% | 23.8 |
8 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 48.5% | 46.4% | 48.9% | 56.2% | 51.5% | 10.8 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 44.5% | 44.6% | 43.3% | 45.0% | 47.2% | 27.5 |
10 | perplexity | sonar | default | β | β | 0.6% | 0.0% | 0.0% | 2.5% | 2.5% | 3.2 |
v38 - Aug 22 1:30am¶
Changes¶
- better weave integration and leaderboard
- got timing issues figured out
- switched eval model to a hard-coded 'gpt-4o-mini-2024-07-18' to keep it a consistent model
- Timing is now reconciled:
    - Weave `model_latency.mean` is agent-only and matches our manual `agent_time_s` (v37: 19.26s vs 19.26s).
    - Leaderboard "Avg s" switched to wall-based throughput using root `ended_at - started_at` divided by num_cases (see the sketch after this list).
    - Failed/descendant_error runs are included but clearly flagged (red rows) and don't drive duration metrics.
- Weave UI leaderboard: added two columns via a new `wall_clock` scorer on each Evaluation output:
    - `wall_clock.per_case_avg_s` (minimize)
    - `wall_clock.total_s` (minimize)
- Batch stability: launcher cleanup tightened; reduced intermittent failures from stray processes.
- Case starts: in direct mode, cases start concurrently (identical epoch starts). The increasing times were completion order, not backlog; `queue_wait_s` is near zero in direct mode.
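A minimal sketch of the wall-based "Avg s" calculation described above, assuming a root run record with `started_at`/`ended_at` timestamps; the function name is an assumption.

```python
from datetime import datetime

def per_case_avg_s(started_at: datetime, ended_at: datetime, num_cases: int) -> float:
    """Wall-based throughput: root run duration divided by the number of cases."""
    total_s = (ended_at - started_at).total_seconds()
    return total_s / num_cases if num_cases else float("nan")

# Example: a 12-case eval that ran for 96 wall-clock seconds -> 8.0 s per case
print(per_case_avg_s(datetime(2025, 8, 22, 1, 30, 0), datetime(2025, 8, 22, 1, 31, 36), 12))
```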
Observations¶
- gpt-5 oddly did worse than 4?! - seems like there were a lot of 0 scores from Error: Connection issue
- Timing verified and aligned with manual metrics.
- Perplexity (sonar) leads with strong quality and speed.
- Strict-referenced gpt-4o-mini + perplexity improved and fast.
- gpt-5 + tools still underperform and are slower; one-shot gpt-5 is efficient.
- ashai-experiment lags base ashai at similar latency; consider retune/remove.
- No failed rows skewing results; health flags working.
v38 Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 88.0% | 88.9% | 89.4% | 88.7% | 85.8% | 15.8 |
2 | strict-referenced | gpt-4o-mini | perplexity | β | β | 84.0% | 86.0% | 85.8% | 88.1% | 82.1% | 15.9 |
3 | one-shot | gpt-5 | default | β | β | 82.7% | 85.0% | 84.0% | 82.4% | 81.9% | 17.6 |
4 | ashai | gpt-4o-mini | perplexity | β | β | 78.1% | 80.4% | 75.8% | 80.8% | 76.8% | 15.8 |
5 | one-shot | gpt-4o-mini | default | β | β | 74.5% | 78.3% | 76.2% | 78.6% | 76.2% | 14.2 |
6 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 73.5% | 75.2% | 73.8% | 72.7% | 73.2% | 15.2 |
7 | ashai | gpt-5 | perplexity | β | β | 67.6% | 68.9% | 71.7% | 66.5% | 65.2% | 28.0 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 67.4% | 68.6% | 70.0% | 67.3% | 65.4% | 26.6 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 66.9% | 67.5% | 68.3% | 67.2% | 63.2% | 33.8 |
10 | ashai-experiment | gpt-5 | perplexity | β | β | 60.5% | 60.8% | 62.1% | 61.1% | 59.4% | 34.9 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 42.1% | 43.3% | 39.6% | 42.3% | 49.3% | 23.2 |
v37 - Aug 21 8pm - testing weave leaderboard. def broken runs¶
Observations¶
- Timing verified: Wall-based Avg s now matches expectations and aligns with manual agent timings (no backlog artifacts).
- Perplexity baseline leads: `perplexity (sonar)` tops quality at 88.0% and stays fast (~15.8s), a strong production baseline.
- Strict-referenced (gpt-4o-mini + perplexity) improved: 84.0% at ~15.9s suggests the strict flow is stable and no longer incurring heavy latency.
- Tools + gpt-5 still underperform: gpt-5 + perplexity variants land mid-60s overall and are slower (~26-35s). Indicates a prompting/config gap vs the mini/sonar path.
- One-shot gpt-5 is efficient: 82.7% at ~17.6s without tools; tools don't help gpt-5 yet.
- ashai-experiment lags: the experiment prompt underperforms the base `ashai` at similar latencies; likely remove or retune.
- Run health: no obvious failed/descendant_error rows in top results; filtering + flags working.
- so some runs failed hard. not sure why
- the time is way way down.
v37 leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | perplexity | β | β | 73.7% | 74.9% | 74.5% | 74.2% | 71.6% | 21.8 |
2 | ashai-experiment | gpt-4o-mini | default | β | β | 16.8% | 15.0% | 16.2% | 20.4% | 23.8% | 7.5 |
3 | strict-referenced-after | gpt-4o-mini | default | β | β | 12.4% | 8.9% | 12.1% | 16.2% | 17.1% | 27.2 |
4 | ashai | gpt-4o-mini | default | β | β | 8.3% | 7.5% | 8.8% | 8.8% | 13.8% | 26.6 |
5 | strict-referenced | gpt-4o-mini | default | β | β | 2.3% | 0.0% | 1.7% | 3.8% | 10.4% | 7.3 |
V36 - Aug 21 4pm - same as v35 but with gpt-4¶
Changes¶
- fixed some tests and some timing things but otherwise nothing. I thought this was going to run v34 again, but the version got incremented. I wanted v34 to have more gpt-4 comparators, as I'm trying to see if I can get gpt-4 to be as good as 5 with some prompt tweaks
Observations¶
- V35 to V36 focus shift: Added more GPT-4 comparators while keeping the V35 setup, to see how far prompt tweaks can push GPT-4 toward GPT-5 quality.
- Timing instrumentation matured: V35 introduced per-case `agent_time_s`; V36 adds `wall_time_s` and `queue_wait_s` so we can separate execution vs queueing and reconcile against Weave (see the sketch after this list).
- Weave vs wall timing: Weave `model_latency.mean` (agent-only) stays inflated vs per-case wall averages due to concurrency. Plan: swap the leaderboard "Avg s" to a wall-based metric.
- Pattern holds: "Check-after" continues to edge strict-inline on quality and speed; Perplexity integration looks healthier (GPT-5 + Perplexity leads, GPT-4 + Perplexity is competitive).
- Next: Use the corrected Weave entity to pull per-case latencies for V35/V36 and update the timing_debug_report with true per-test numbers.
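A rough sketch of how the three per-case timing fields could be collected, to make the distinction concrete; the field names mirror the log, but the function, the agent interface, and the implementation are assumptions.

```python
import time

def run_case(agent, case: dict, enqueued_at: float) -> dict:
    """Collect queue_wait_s (waiting to start), agent_time_s (execution only),
    and wall_time_s (enqueue to finish) for a single eval case."""
    started_at = time.monotonic()
    queue_wait_s = started_at - enqueued_at
    response = agent(case)                       # agent-only work; `agent` is any callable here
    agent_time_s = time.monotonic() - started_at
    return {
        "response": response,
        "queue_wait_s": queue_wait_s,
        "agent_time_s": agent_time_s,
        "wall_time_s": queue_wait_s + agent_time_s,
    }

# Example with a trivial stand-in agent:
print(run_case(lambda c: "ok", {"question": "demo"}, enqueued_at=time.monotonic()))
```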
Leaderboard V35¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | perplexity | β | β | 89.3% | 90.8% | 91.9% | 91.0% | 86.7% | 742.6 |
2 | strict-referenced | gpt-4o-mini | default | β | β | 85.1% | 87.1% | 84.2% | 89.3% | 85.0% | 407.2 |
3 | ashai-experiment | gpt-4o-mini | default | β | β | 82.1% | 81.5% | 80.8% | 86.2% | 82.5% | 126.9 |
4 | strict-referenced-after | gpt-4o-mini | default | β | β | 79.9% | 85.0% | 80.0% | 80.8% | 77.9% | 368.6 |
5 | ashai | gpt-4o-mini | default | β | β | 79.9% | 82.1% | 80.4% | 87.9% | 82.2% | 340.5 |
V35 - Aug 20, 2025 8pm¶
Changes¶
- trying to debug some timing things by manually logging times.
Leaderboard V35¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 87.0% | 87.6% | 88.8% | 88.2% | 86.8% | 57.6 |
2 | ashai-experiment | gpt-5 | default | β | β | 83.2% | 84.8% | 84.2% | 83.8% | 80.8% | 123.0 |
3 | ashai | gpt-5 | default | β | β | 82.7% | 85.2% | 83.3% | 83.4% | 80.0% | 207.9 |
4 | strict-referenced-after | gpt-5 | default | β | β | 76.0% | 77.9% | 78.3% | 75.0% | 72.9% | 288.1 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 65.6% | 66.5% | 64.8% | 66.5% | 65.8% | 21.1 |
6 | strict-referenced | gpt-5 | default | β | β | 45.3% | 45.8% | 46.2% | 45.4% | 46.1% | 433.9 |
V34 - Aug 20, 2025 10am¶
Changes¶
- completely new Ahsai-experiment with encouraging more perplexity use and followup questions.
- took out warnings for perplexity being a fallback tool
- penalizing technical terms and drug names and dosage amounts in empathy and precision
Observations¶
- GPT-5 quality holds, speed improves ~3x vs V33: top GPT-5 agents remain ~90% overall, with latency down to ~1,000-1,300s (from ~3,100s). `ashai-experiment` leads at 90.5% in 983.8s.
- Perplexity integration looks healthier: `ashai (gpt-4o-mini, perplexity)` at 87.7% (Rank 5) narrowly edges `perplexity (sonar)` at 87.5%, suggesting prompt/flow updates helped.
- Check-after beats strict inline: `strict-referenced-after (gpt-5)` edges `strict-referenced` (89.9% vs 89.5%) and is faster (1083s vs 1311s), supporting the guess-then-verify pattern.
- New jargon/drug penalties are biting as intended: minor shifts in Precision/Empathy for some GPT-4o-mini configs without collapsing top rows.
- Speed tiers remain clear: `one-shot (gpt-4o-mini)` is fastest (38.7s) with solid 85.8%; `perplexity (sonar)` is also quick (62.3s) with strong quality.
Leaderboard V34¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | default | β | β | 90.5% | 91.2% | 92.4% | 90.7% | 87.2% | 983.8 |
2 | strict-referenced-after | gpt-5 | default | β | β | 89.9% | 91.2% | 92.9% | 89.8% | 87.5% | 1083.0 |
3 | ashai | gpt-5 | default | β | β | 89.8% | 91.5% | 91.0% | 90.2% | 86.9% | 1015.3 |
4 | strict-referenced | gpt-5 | default | β | β | 89.5% | 90.3% | 89.5% | 90.9% | 86.4% | 1311.1 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 87.7% | 89.2% | 88.8% | 88.8% | 85.7% | 166.9 |
6 | perplexity | sonar | default | β | β | 87.5% | 88.3% | 89.6% | 89.2% | 85.6% | 62.3 |
7 | strict-referenced | gpt-4o-mini | default | β | β | 86.3% | 88.4% | 85.5% | 89.5% | 85.0% | 157.0 |
8 | ashai-experiment | gpt-4o-mini | default | β | β | 86.1% | 88.3% | 86.7% | 88.0% | 86.0% | 161.0 |
9 | ashai | gpt-4o-mini | default | β | β | 86.1% | 88.1% | 85.8% | 89.8% | 86.8% | 116.7 |
10 | one-shot | gpt-4o-mini | default | β | β | 85.8% | 89.4% | 87.3% | 89.1% | 86.8% | 38.7 |
11 | ashai | gpt-4o-mini | default | β | β | 84.1% | 86.7% | 84.7% | 87.8% | 85.8% | 118.2 |
12 | ashai-experiment | gpt-4o-mini | default | β | β | 81.9% | 82.8% | 81.2% | 83.8% | 80.7% | 185.5 |
13 | strict-referenced | gpt-4o-mini | default | β | β | 79.9% | 79.6% | 78.3% | 83.2% | 82.1% | 158.1 |
14 | ashai | gpt-4o-mini | none | β | β | 78.7% | 80.0% | 81.5% | 82.2% | 80.4% | 33.8 |
15 | strict-referenced-after | gpt-4o-mini | default | β | β | 77.8% | 77.5% | 77.9% | 83.5% | 80.4% | 221.3 |
16 | strict-referenced-after | gpt-4o-mini | default | β | β | 75.0% | 75.0% | 77.9% | 73.2% | 72.1% | 260.4 |
V33¶
Changes¶
- some fixes to leaderboard and removed quick mode default - so these are not in quick mode eval
- trying some gpt-5 now
- no timeout things with using the apis. only september is using the run.sh every other test is just direct now
Observations¶
- WOW! We got some high scores! Strict-referenced did great but WOW did it take a long long time!
- GPT-5 dominance: GPT-5 models completely dominated the top 4 positions, with all achieving 90%+ overall scores. This represents a massive improvement over V32, where GPT-5 was inconsistent and often underperformed due to timeouts.

I asked AI to dig into what the diffs were between the gpt-4 and gpt-5 runs:

- Strict-referenced performance explosion: The strict-referenced agent with GPT-5 achieved 90.6% overall, the highest score ever recorded. However, it took an extremely long time (2728 seconds / ~45 minutes), suggesting a significant speed-quality trade-off. Is 45 minutes even possible?! I don't know...
- ashai consistency: The ashai agent with GPT-5 achieved 90.2% overall in just 1091 seconds, making it the most efficient high performer. This suggests ashai has found an optimal balance between thoroughness and speed.
- ashai-experiment recovery continues: After the V30 failure and V32 recovery, ashai-experiment with GPT-5 maintained strong performance (90.1% overall), showing the fixes from V32 were effective and stable.
- Perplexity agent stability: The perplexity agent with Sonar maintained strong performance (87.7% overall) with excellent speed (94 seconds), confirming it as a reliable middle-ground option that doesn't require the latest models.
- GPT-4o-mini performance ceiling: Even the best GPT-4o-mini agents maxed out around 86-87% overall, while GPT-5 agents consistently broke 90%. This suggests a clear performance ceiling for the smaller model.
- Speed vs quality trade-off confirmed: The fastest agents (4-10 seconds) achieved 78-85% scores, while the highest performers (90%+) took 1000+ seconds. The relationship is clear: more thorough analysis requires significantly more time.
- Tool usage impact: Agents without tools (ashai with "none") still achieved respectable scores (82.1% overall), but tools clearly provide a significant boost when used effectively.
- Model consistency: Unlike V32, where GPT-5 was inconsistent, V33 shows GPT-5 agents are now reliably high-performing, suggesting the direct API approach and removal of quick mode resolved stability issues.
- Ashai-experiment does worse than ashai: the extra prompt additions hurt, not helped. I'm going to take them out and try a different experiment.
I'll analyze the V33 results to understand what GPT-5 is doing better than GPT-4. Let me start by examining the V33 results file and then use the Weave MCP to get detailed trace data.
It read through evaluate/results/v33.json several times, tried to pull detailed trace data via the Weave MCP (but couldn't resolve the right entity/project name), and compared the GPT-5 vs GPT-4o-mini case results and evidence_sufficiency scores before writing up the analysis below.
Deep Dive Analysis: GPT-5 vs GPT-4o-mini Performance in V33¶
I've conducted a comprehensive analysis of the V33 results to understand what GPT-5 is doing better than GPT-4o-mini. Here are the key findings:
Critical Discovery: Evidence Sufficiency Gap¶
The most dramatic difference is in evidence sufficiency:
- GPT-4o-mini (ashai): 68.75%
- GPT-5 (ashai): 84.58%
- Improvement: +15.83 percentage points
This suggests GPT-5 is doing much more thorough research and information gathering.
What GPT-5 Does Better¶
- More Comprehensive Research
    - GPT-4o-mini used 2 sources (September + Niharika) for the headaches case
    - GPT-5 used 6 high-quality sources (ACOG, Mayo Clinic, Banner Health, NHS, etc.)
    - GPT-5 conducts multiple Perplexity queries instead of single searches
- Better Information Synthesis
    - GPT-4o-mini: general advice like "Stay hydrated, get rest"
    - GPT-5: specific actionable steps like "Check BP if 140/90+, take 500-650mg acetaminophen"
- Enhanced Precision and Detail
    - GPT-5 provides specific dosages, thresholds, and structured responses
    - Includes detailed warning signs with clear action items
    - Adds personalized follow-up questions
- Better Cultural Adaptation
    - GPT-5's Hindi response includes culturally adapted language and India-specific examples
    - More natural language flow and local context
Speed vs Quality Trade-off¶
- GPT-4o-mini: 104.7 seconds
- GPT-5: 1,091.2 seconds (10x longer)
- The quality improvements correlate with the increased processing time
Key Insight for Prompt Engineering¶
The improvements appear to be methodological rather than inherent model capabilities. GPT-5 is:
- Conducting more thorough research
- Using better response structures
- Providing more specific, actionable guidance
This suggests that enhancing prompts for GPT-4o-mini could potentially bridge much of this performance gap by:
- Requiring multiple search queries
- Mandating specific response structures
- Demanding more detailed, actionable responses
- Including cultural adaptation requirements
The analysis shows that GPT-5's superior performance comes from more thorough processing rather than faster reasoning, making it a prime candidate for prompt engineering improvements to bring GPT-4o-mini performance closer to GPT-5 levels.
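As a follow-up to the prompt-engineering idea above, here is a hedged sketch of what such an addendum for the gpt-4o-mini agents might look like; the wording is an illustrative assumption, not the prompt actually used in the repo.

```python
# Hypothetical system-prompt addendum for the gpt-4o-mini agents; wording is illustrative only.
GPT4O_MINI_PROMPT_ADDENDUM = """
Before answering:
- Run at least two distinct search queries and cite every source you actually used.
- Structure the reply as: brief summary, specific action steps (with thresholds or doses only when sourced),
  warning signs that need escalation, and one personalized follow-up question.
- Adapt wording and examples to the user's language and local context.
"""
```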
Leaderboard V33¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | default | β | β | 90.6% | 91.6% | 93.3% | 91.0% | 87.9% | 2728.3 |
2 | strict-referenced-after | gpt-5 | default | β | β | 90.5% | 91.6% | 92.1% | 89.9% | 87.2% | 3128.2 |
3 | ashai | gpt-5 | default | β | β | 90.2% | 91.5% | 93.3% | 90.6% | 87.8% | 1091.2 |
4 | ashai-experiment | gpt-5 | default | β | β | 90.1% | 91.9% | 91.5% | 91.1% | 87.8% | 3117.1 |
5 | perplexity | sonar | default | β | β | 87.7% | 89.2% | 91.6% | 89.3% | 86.0% | 94.1 |
6 | ashai-experiment | gpt-4o-mini | default | β | β | 86.7% | 88.5% | 88.2% | 88.6% | 87.5% | 117.7 |
7 | ashai | gpt-4o-mini | default | β | β | 85.9% | 87.5% | 88.2% | 89.8% | 85.2% | 104.7 |
8 | strict-referenced | gpt-4o-mini | default | β | β | 85.0% | 87.1% | 86.8% | 87.6% | 86.3% | 271.7 |
9 | ashai-experiment | gpt-4o-mini | default | β | β | 84.9% | 84.2% | 86.2% | 91.2% | 85.7% | 153.2 |
10 | ashai | gpt-4o-mini | default | β | β | 84.7% | 86.7% | 85.8% | 89.6% | 85.5% | 89.4 |
11 | strict-referenced-after | gpt-4o-mini | default | β | β | 84.4% | 86.1% | 84.6% | 86.5% | 81.5% | 309.6 |
12 | one-shot | gpt-4o-mini | default | β | β | 83.7% | 89.2% | 89.0% | 91.5% | 86.7% | 368.4 |
13 | ashai | gpt-4o-mini | none | β | β | 82.1% | 86.2% | 87.9% | 89.8% | 86.0% | 210.5 |
14 | ashai | gpt-4o-mini | perplexity | β | β | 80.3% | 82.1% | 81.7% | 83.2% | 79.2% | 264.6 |
15 | strict-referenced-after | gpt-4o-mini | default | β | β | 79.9% | 79.8% | 80.4% | 81.8% | 78.8% | 355.2 |
16 | strict-referenced | gpt-4o-mini | default | β | β | 78.6% | 79.2% | 80.6% | 81.9% | 80.1% | 151.1 |
V32 Aug 19 2025 5pm - WARNING - all these were eval'd in quick mode....¶
Changes¶
- did speed investigation, and went back to gpt-4 as an eval judge (much faster)
- potentially doing evals direct vs through the run.sh server
- made tool calls direct (except for september)
Observations¶
- GPT-5 vs GPT-4o-mini performance gap: The top performer (ashai-experiment with GPT-5) achieved 89.8% overall, while GPT-4o-mini agents maxed out around 88.3%. However, GPT-5 agents also had much slower response times (160+ seconds vs 10-20 seconds for GPT-4o-mini). GPT-5 takes a crazy amount of time!
- Strict-referenced consistency: The strict-referenced agent with GPT-4o-mini shows remarkable consistency, taking the #2 and #3 spots with nearly identical scores (88.3% and 88.1% overall). This suggests the approach is very stable.
- Speed vs quality trade-off: There's a clear inverse relationship between speed and performance. The fastest agents (4-10 seconds) are in the middle of the pack, while the highest performers take 75-200+ seconds.
- ashai-experiment recovery: After completely failing in V30 (6.7% overall), ashai-experiment with GPT-5 is now the top performer. This suggests the speed investigation and direct tool calls fixed critical issues.
- Perplexity agent stability: The perplexity agent with Sonar maintains strong performance (87.4% overall) with reasonable speed (75 seconds), showing it's a reliable middle-ground option.
- GPT-5 consistency issues: While GPT-5 can achieve the highest scores, it's inconsistent - some GPT-5 agents rank very low (59-61% overall), suggesting it may be more sensitive to prompt variations or tool configurations.
Leaderboard V32¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | default | β | β | 89.8% | 92.7% | 88.3% | 91.7% | 86.7% | 160.3 |
2 | strict-referenced | gpt-4o-mini | default | β | β | 88.3% | 91.7% | 86.7% | 93.3% | 81.7% | 10.4 |
3 | strict-referenced | gpt-4o-mini | default | β | β | 88.1% | 91.7% | 87.3% | 90.0% | 83.3% | 19.0 |
4 | perplexity | sonar | default | β | β | 87.4% | 87.5% | 91.2% | 90.1% | 86.8% | 75.0 |
5 | ashai | gpt-4o-mini | default | β | β | 87.1% | 86.7% | 85.7% | 91.7% | 84.3% | 4.8 |
6 | one-shot | gpt-4o-mini | default | β | β | 86.2% | 90.0% | 85.0% | 90.0% | 80.0% | 4.9 |
7 | strict-referenced-after | gpt-4o-mini | default | β | β | 86.2% | 90.0% | 85.0% | 88.3% | 81.7% | 142.7 |
8 | ashai-experiment | gpt-4o-mini | default | β | β | 85.0% | 88.3% | 83.3% | 90.0% | 78.3% | 7.1 |
9 | ashai-experiment | gpt-4o-mini | default | β | β | 84.2% | 88.3% | 83.3% | 88.3% | 76.7% | 14.4 |
10 | ashai | gpt-4o-mini | default | β | β | 83.8% | 90.0% | 83.3% | 81.7% | 80.0% | 9.7 |
11 | ashai | gpt-4o-mini | default | β | β | 82.5% | 86.7% | 80.0% | 86.7% | 76.7% | 25.1 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 67.9% | 71.7% | 65.0% | 76.7% | 58.3% | 17.8 |
13 | strict-referenced-after | gpt-5 | default | β | β | 61.2% | 65.0% | 61.7% | 61.7% | 56.7% | 196.3 |
14 | ashai | gpt-5 | default | β | β | 59.8% | 61.7% | 59.3% | 61.7% | 56.7% | 189.3 |
15 | strict-referenced | gpt-5 | default | β | β | 59.2% | 61.7% | 58.3% | 60.7% | 56.0% | 282.6 |
16 | ashai | gpt-4o-mini | default | β | β | 57.8% | 59.3% | 56.7% | 61.7% | 53.3% | 10.9 |
V30 - 8-19-2025 12:00pm¶
Changes¶
- multi-threaded run.sh now hopefully helps with performance?
- removed disclaimer stuff from medical accuracy
Observation¶
- multi-threaded didn't seem to really help with performance
- strict-referenced is doing the best on medical accuracy but getting knocked a bit in precision and clarity
- ashai experimental is just breaking. no clue why
- oddly ashai with reasoning off took a really long time.
Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 72.5% | 77.3% | 76.4% | 83.6% | 65.6% | 62.8 |
2 | ashai | gpt-4o-mini | perplexity | β | β | 71.1% | 78.9% | 73.2% | 80.6% | 69.8% | 139.0 |
3 | ashai | gpt-4o-mini | perplexity | β | β | 70.9% | 79.8% | 71.7% | 82.6% | 65.8% | 315.9 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 70.7% | 80.8% | 72.2% | 80.5% | 69.2% | 209.6 |
5 | ashai | gpt-4o-mini | none | β | β | 69.6% | 84.9% | 73.9% | 87.1% | 74.2% | 24.8 |
6 | one-shot | gpt-4o-mini | default | β | β | 68.7% | 82.2% | 72.2% | 85.1% | 76.8% | 63.1 |
7 | ashai | gpt-4o-mini | default | β | β | 68.6% | 79.8% | 71.8% | 83.1% | 69.1% | 97.8 |
8 | strict-referenced | gpt-4o-mini | default | β | β | 68.5% | 79.4% | 66.1% | 80.8% | 68.7% | 150.6 |
9 | one-shot | gpt-4o-mini | default | β | β | 65.5% | 79.8% | 71.1% | 78.8% | 71.2% | 82.7 |
10 | ashai | gpt-4o-mini | default | β | β | 64.6% | 75.0% | 68.2% | 73.4% | 67.8% | 72.5 |
11 | strict-referenced-after | gpt-4o-mini | default | β | β | 63.1% | 74.5% | 65.4% | 71.4% | 60.9% | 379.4 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 59.5% | 74.0% | 57.6% | 59.8% | 50.2% | 369.0 |
13 | ashai-experimental | gpt-4o-mini | default | β | β | 6.7% | 1.2% | 0.0% | 17.7% | 2.6% | β |
14 | ashai-experimental | gpt-4o-mini | default | β | β | 6.5% | 1.7% | 0.0% | 16.0% | 2.9% | 0.1 |
V29 - 8-18-2025 9pm¶
Changes¶
- medical accuracy: evidence_sufficiency is 0 if no references are provided (strict)
- medical accuracy: removed `disclaimer_present`; disclaimers are now considered under safety only when warranted
- medical accuracy weights updated: evidence_sufficiency 30%, response_alignment 25%, safety 15%, medical_correctness 30% (sketched below)
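A minimal sketch of the v29 weighting described above; the sub-score names and weights come from this change list, while the function name, the 0-1 scale, and the dict layout are assumptions.

```python
# Weights from the v29 change note; everything else here is an illustrative assumption.
MED_ACC_WEIGHTS = {
    "evidence_sufficiency": 0.30,
    "response_alignment": 0.25,
    "safety": 0.15,
    "medical_correctness": 0.30,
}

def medical_accuracy(sub_scores: dict[str, float], has_references: bool) -> float:
    """Weighted medical-accuracy score; evidence_sufficiency is forced to 0 without references."""
    scores = dict(sub_scores)
    if not has_references:
        scores["evidence_sufficiency"] = 0.0  # strict: no references -> no evidence credit
    return sum(weight * scores.get(name, 0.0) for name, weight in MED_ACC_WEIGHTS.items())

# Example: strong sub-scores but no references -> evidence_sufficiency contributes nothing
print(medical_accuracy({"evidence_sufficiency": 0.9, "response_alignment": 0.9,
                        "safety": 0.9, "medical_correctness": 0.9}, has_references=False))
```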
Observations¶
- better job of down ranking things lacking references
- some are still clearly broken or timing out. unclear why
Leaderboard V29¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 72.4% | 79.2% | 78.4% | 83.7% | 67.6% | 53.6 |
2 | ashai | gpt-4o-mini | perplexity | β | β | 69.7% | 77.6% | 70.6% | 79.1% | 68.8% | 164.4 |
3 | one-shot | gpt-4o-mini | default | β | β | 69.1% | 81.5% | 74.9% | 85.2% | 76.1% | 123.7 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 68.6% | 79.6% | 69.0% | 82.8% | 65.1% | 111.8 |
5 | one-shot | gpt-4o-mini | default | β | β | 68.0% | 80.8% | 72.2% | 86.4% | 75.2% | 37.7 |
6 | strict-referenced-after | gpt-4o-mini | default | β | β | 67.9% | 81.4% | 68.9% | 67.0% | 64.6% | 292.7 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 67.0% | 79.2% | 64.1% | 80.9% | 60.6% | 324.7 |
8 | ashai | gpt-4o-mini | default | β | β | 65.9% | 73.3% | 68.6% | 79.0% | 65.5% | 273.0 |
9 | ashai | gpt-4o-mini | none | β | β | 63.9% | 78.4% | 66.2% | 80.1% | 68.8% | 27.8 |
10 | strict-referenced-after | gpt-4o-mini | default | β | β | 62.6% | 75.8% | 63.3% | 67.6% | 52.4% | 450.3 |
11 | ashai | gpt-4o-mini | default | β | β | 60.7% | 78.5% | 56.9% | 68.8% | 59.0% | 127.1 |
12 | strict-referenced | gpt-4o-mini | default | β | β | 60.5% | 73.6% | 62.8% | 72.5% | 54.8% | 121.4 |
13 | ashai-experimental | gpt-4o-mini | default | β | β | 5.6% | 2.9% | 0.0% | 17.9% | 2.0% | 0.1 |
14 | ashai-experimental | gpt-4o-mini | default | β | β | 5.3% | 0.4% | 0.0% | 13.3% | 2.9% | β |
V28 - 2025-08-18 5:00pm¶
Changes¶
- now sending references to evaluators (it was not there!)
- parallelized runs with batch_eval2
- increased max token size so evals don't get truncated
- messed a lot with odd timeout errors and key stuff
- output is in whatsapp format
- prompt modified to
Observations¶
- oneshot and others with no refs are still scoring too high, but they got taken down bigtime with the new ratings
- strict modes are breaking - maybe timing out? maybe having too big of context and timing out in the eval stage?
- these runs take a long long time... especially the evals now.
- Perplexity (sonar) leads at 77.2%; one-shot (gpt-4o-mini, default) is close (74-75%) and faster (~30-45s).
- Ashai (gpt-4o-mini, default) lands ~73% with ~92s latency; turning reasoning ON slightly hurts (71.8%) and does not speed it up.
- Perplexity tool usage lowers Ashai scores and adds large latency (71.5-67.7% at 144-176s) vs default/none - that shouldn't be true, we should weigh them more.
- Passing references to evaluators restored citation credit; top Med Acc now ~75-82% instead of flat caps.
Next steps¶
- perplexity results may not be counted as high-quality enough references?
Leaderboard V28¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 77.2% | 81.8% | 78.5% | 84.8% | 70.5% | 54.6 |
2 | one-shot | gpt-4o-mini | default | β | β | 74.6% | 80.3% | 72.4% | 85.2% | 79.5% | 43.2 |
3 | one-shot | gpt-4o-mini | default | β | β | 74.1% | 80.6% | 73.2% | 82.2% | 76.8% | 30.5 |
4 | ashai | gpt-4o-mini | default | β | β | 73.3% | 81.5% | 74.6% | 80.7% | 71.3% | 93.7 |
5 | ashai | gpt-4o-mini | default | β | β | 71.8% | 80.5% | 72.2% | 77.9% | 73.0% | 91.7 |
6 | ashai | gpt-4o-mini | perplexity | β | β | 71.5% | 79.2% | 69.6% | 77.7% | 66.7% | 143.8 |
7 | strict-referenced | gpt-4o-mini | default | β | β | 69.9% | 78.2% | 67.2% | 80.3% | 69.6% | 101.1 |
8 | strict-referenced-after | gpt-4o-mini | default | β | β | 69.8% | 80.7% | 67.5% | 69.6% | 69.3% | 187.8 |
9 | ashai | gpt-4o-mini | perplexity | β | β | 67.7% | 75.2% | 63.4% | 78.4% | 66.0% | 176.4 |
10 | ashai | gpt-4o-mini | none | β | β | 65.9% | 71.4% | 63.2% | 76.1% | 67.1% | 32.3 |
11 | strict-referenced | gpt-4o-mini | default | β | β | 64.9% | 73.7% | 58.7% | 81.8% | 57.4% | 106.1 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 64.4% | 74.9% | 60.7% | 67.6% | 56.3% | 203.0 |
13 | ashai-experimental | gpt-4o-mini | default | β | β | 5.8% | 2.9% | 0.0% | 20.2% | 2.7% | β |
14 | ashai-experimental | gpt-4o-mini | default | β | β | 5.3% | 2.9% | 0.0% | 18.2% | 2.9% | β |
15 | one-shot | gpt-4o-mini | default | β | β | 3.1% | 0.0% | 0.0% | 6.0% | 0.8% | β |
16 | strict-referenced-after | gpt-4o-mini | default | β | β | 3.1% | 0.4% | 0.0% | 6.7% | 0.4% | 0.1 |
17 | strict-referenced | gpt-4o-mini | default | β | β | 3.1% | 0.4% | 0.0% | 5.2% | 0.8% | β |
18 | one-shot | gpt-4o-mini | default | β | β | 2.8% | 0.0% | 0.0% | 5.9% | 1.1% | β |
2025-08-14 10:40pm - Evaluation V22¶
Observations¶
Leaderboard V22¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | perplexity | β | β | 89.9% | 93.6% | 92.5% | 91.5% | 84.0% | 138.2 |
2 | ashai | gpt-5 | none | β | β | 89.8% | 93.0% | 93.7% | 91.8% | 84.9% | 289.5 |
3 | ashai | gpt-5 | none | β | β | 89.3% | 93.8% | 94.2% | 91.7% | 85.6% | 220.9 |
4 | ashai | gpt-5 | perplexity | β | β | 81.8% | 85.7% | 85.0% | 83.9% | 78.5% | 175.2 |
5 | ashai | gpt-5 | default | β | β | 67.2% | 68.9% | 68.2% | 70.0% | 62.1% | 153.8 |
6 | ashai | gpt-5 | default | β | β | 61.7% | 63.3% | 63.2% | 66.6% | 57.1% | 219.9 |
2025-08-14 1pm - Evaluation V21¶
Here just looking at changing the models and tools - focusing on ashai only.
Changes¶
- just focusing on ashai for this round and its different combinations
- niharika url fixed
- hard enforcing that calls with no tools have 0 references
- brought Tool use discipline section into the system prompt for easier access/adjusting
- evaluators follow json schema
- upgraded all openai calls to the new Responses API and enforced json_schemas
Observations¶
- tools still degrade quality
- i have no idea why tools are
Next¶
- having no tools seems to not hurt med acc - this is a bug. why?!?! The opposite is shown in v21.json - so what is going on. let's re-run
Leaderboard V21¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 88.9% | 93.8% | 93.9% | 90.9% | 86.8% | 156.0 |
2 | ashai | gpt-5 | none | β | β | 88.3% | 91.8% | 94.2% | 90.2% | 85.0% | 195.4 |
3 | ashai | gpt-5 | perplexity | β | β | 74.7% | 77.4% | 78.2% | 76.8% | 71.5% | 118.3 |
4 | ashai | gpt-5 | perplexity | β | β | 65.6% | 68.0% | 66.6% | 69.3% | 63.3% | 98.9 |
5 | ashai | gpt-5 | default | β | β | 60.2% | 63.1% | 60.2% | 64.2% | 56.2% | 165.4 |
6 | ashai | gpt-5 | default | β | β | 25.4% | 23.6% | 24.2% | 30.4% | 21.7% | 110.2 |
V18-V20¶
Skipped because they were used for debugging
2025-06-13 11pm - Evaluation V17¶
Changes¶
- completely re-wrote perplexity's prompt
- added playground prompt
- made pubmed and perplexity tools async
- async batch eval - we'll see how it works...
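A minimal sketch of the async tool + batch-eval pattern these changes describe; the tool functions here are hypothetical stand-ins for the real pubmed/perplexity wrappers, not the repo's actual code.

```python
import asyncio

async def search_perplexity(question: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for the real HTTP call
    return f"perplexity evidence for: {question}"

async def search_pubmed(question: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for the real HTTP call
    return f"pubmed evidence for: {question}"

async def answer_case(question: str) -> list[str]:
    # run both tool calls concurrently instead of one after the other
    return list(await asyncio.gather(search_perplexity(question), search_pubmed(question)))

async def run_batch(questions: list[str]) -> list[list[str]]:
    # evaluate all cases concurrently; a semaphore could cap concurrency if needed
    return list(await asyncio.gather(*(answer_case(q) for q in questions)))

if __name__ == "__main__":
    print(asyncio.run(run_batch(["Is a mild headache normal at 20 weeks?"])))
```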
Observations¶
- gpt-5 (no tools) is back on top at 91.4% and faster (48.8s), a big jump vs V16's 76.2% @ 67.1s. Confirms earlier bugfixes/changes took effect; verify no data leakage. Might be making up its links though.
- gpt-5 with default/tools still broken (18-21% and ~115s). Tool path regression persists despite other improvements.
- Perplexity prompt rewrite helped: `perplexity (sonar)` now 77.9% (+6.1 vs V16's 71.8%), with higher latency (16.4s vs ~6s).
- One-shot (`gpt-4o-mini`) remains strong at 78.2% and fast. Reasoning ON > OFF (78.2% @ 33.9s vs 75.5% @ 38.6s).
- `strict-referenced` is solid at 76.6% with high Med Acc (84.2%); `strict-referenced-after` still lags (62.4%) due to low precision.
- Async batch reduced wall time, but `ashai + perplexity` stays slow (64-91s). Async tools didn't materially lower per-run latency.
- Outlier: `ashai gpt-4o-mini default` shows a 742s avg - likely a stuck/mis-measured run; exclude from speed comparisons.
Next Steps¶
- Niharika noticed hallucinated niharika.org URLs, and ashai including links from other FAQs
Leaderboard for V17¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 91.4% | 93.1% | 92.9% | 91.2% | 85.1% | 48.8 |
2 | one-shot | gpt-4o-mini | default | β | β | 78.2% | 84.8% | 73.8% | 85.7% | 75.1% | 33.9 |
3 | perplexity | sonar | default | β | β | 77.9% | 81.1% | 78.7% | 82.5% | 64.1% | 16.4 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 76.6% | 84.2% | 77.8% | 76.2% | 67.2% | 21.8 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 76.4% | 83.0% | 74.4% | 83.6% | 68.2% | 91.4 |
6 | ashai | gpt-4o-mini | none | β | β | 76.4% | 81.2% | 74.8% | 85.3% | 70.6% | 15.7 |
7 | one-shot | gpt-4o-mini | default | β | β | 75.5% | 82.1% | 72.0% | 80.8% | 70.2% | 38.6 |
8 | september | gpt-4o-mini | default | β | β | 74.4% | 81.9% | 70.2% | 83.2% | 73.0% | 14.2 |
9 | ashai | gpt-4o-mini | perplexity | β | β | 74.1% | 81.2% | 71.4% | 81.2% | 66.3% | 64.4 |
10 | ashai | gpt-4o-mini | default | β | β | 73.2% | 83.0% | 71.9% | 78.5% | 63.8% | 32.8 |
11 | ashai | gpt-4o-mini | default | β | β | 72.9% | 79.2% | 69.2% | 81.7% | 67.7% | 742.2 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 62.4% | 80.0% | 54.1% | 56.8% | 49.8% | 31.5 |
13 | ashai-experiment | gpt-5 | default | β | β | 21.2% | 17.3% | 17.5% | 26.7% | 15.9% | 115.6 |
14 | ashai | gpt-5 | default | β | β | 18.2% | 15.6% | 15.8% | 24.0% | 14.8% | 114.4 |
2025-06-13 1pm - Evaluation V16¶
Changes¶
- improved dataset format organizing evaluation objects into one, renaming and improving guidance
- added question and risk_level to the evaluator information set
- organized evaluators to more clearly show what the evaluator data vs user input is
- started an ashai-experiment which will have some extra guidance that I hope will help
Observations¶
- interestingly the ashai experiment did slightly worse than regular ashai. I thought the extra prompt instructions (which gave it hints about how it would be evaluated) would help, but it seemed not to.
- crazy weirdly though it took way less time? Only 10s vs ashai's 60? Like what?! something odd there
- oneshot with no tools and reasoning turned on crushed it. except dang it was slow again? how?! why?
- perplexity's speed is all over the place... no clue why.
Next Step¶
- run evals in parallel... man it takes forever
- improve perplexity's prompt
- changed user prompt for ashai - to not split out the last message but give full thread
Leaderboard for V16¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-5 | default | β | β | 92.0% | 93.2% | 94.3% | 91.6% | 89.2% | 61.4 |
2 | ashai | gpt-5 | none | β | β | 76.2% | 77.8% | 78.2% | 76.6% | 69.8% | 67.1 |
3 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 76.1% | 82.6% | 76.6% | 85.2% | 70.0% | 20.7 |
4 | ashai-experiment | gpt-4o-mini | default | β | β | 75.5% | 84.2% | 73.7% | 85.3% | 66.9% | 10.0 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 75.4% | 79.2% | 78.2% | 79.2% | 68.5% | 26.7 |
6 | ashai-experiment | gpt-4o-mini | default | β | β | 75.1% | 83.8% | 72.5% | 85.1% | 64.8% | 17.8 |
7 | strict-referenced | gpt-4o-mini | perplexity | β | β | 75.1% | 81.8% | 73.8% | 77.7% | 63.1% | 24.0 |
8 | strict-referenced | gpt-4o-mini | perplexity | β | β | 74.5% | 80.5% | 73.6% | 80.2% | 64.2% | 21.6 |
9 | perplexity | sonar | default | β | β | 71.8% | 76.0% | 73.7% | 81.3% | 62.4% | 6.0 |
10 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 68.5% | 77.3% | 66.2% | 68.3% | 59.7% | 24.1 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 61.0% | 73.2% | 55.5% | 65.2% | 48.2% | 24.1 |
12 | ashai | gpt-5 | default | β | β | 41.2% | 39.8% | 39.3% | 44.8% | 36.0% | 100.4 |
13 | strict-referenced | gpt-5 | perplexity | β | β | 18.3% | 16.2% | 14.0% | 24.4% | 13.2% | 115.0 |
14 | ashai | gpt-5 | default | β | β | 12.2% | 7.8% | 8.2% | 17.6% | 7.2% | 116.9 |
15 | strict-referenced | gpt-5 | perplexity | β | β | 3.4% | 0.8% | 0.0% | 9.2% | 0.1% | 120.1 |
2025-06-13 12pm - Evaluation V15¶
Just doing some testing here... ignore this version
2025-08-12 3pm - Evaluation v14¶
Changes¶
- perplexity and chatgpt-5 issue fixed I think
Observations¶
- gpt-5 with no tools is winning - works better without tools, sadly, but it's winning and working now
- turning off reasoning seems to be broken - need to debug
- speed seems to be getting worse? not sure why
- strict-referenced is doing well! though it oddly has lower medical accuracy than gpt-5; wonder why, especially because it should be requiring references
- perplexity with strict ref gpt-5 is still having issues?
Next Steps¶
- fix the no-reasoning toggle issues
- check gpt-5 references - is it adding them? are they actually right?
- perplexity should be doing better... look into why it isn't
Leaderboard for V14¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 83.5% | 90.7% | 81.7% | 89.2% | 78.5% | 69.1 |
2 | one-shot | gpt-5 | default | β | β | 81.1% | 85.1% | 82.2% | 83.2% | 80.5% | 53.6 |
3 | ashai | gpt-4o-mini | default | β | β | 76.0% | 82.4% | 79.1% | 83.9% | 70.0% | 18.4 |
4 | ashai | gpt-4o-mini | none | β | β | 75.5% | 84.6% | 74.8% | 85.0% | 67.2% | 7.6 |
5 | strict-referenced | gpt-4o-mini | perplexity | β | β | 74.7% | 82.0% | 75.9% | 83.2% | 66.5% | 29.2 |
6 | ashai | gpt-4o-mini | perplexity | β | β | 72.4% | 80.4% | 75.2% | 83.4% | 66.8% | 25.4 |
7 | strict-referenced | gpt-4o-mini | perplexity | β | β | 72.1% | 79.2% | 71.7% | 78.3% | 66.5% | 25.1 |
8 | perplexity | sonar | default | β | β | 70.6% | 74.5% | 72.8% | 79.8% | 61.5% | 7.3 |
9 | perplexity | sonar | default | β | β | 70.4% | 76.1% | 71.2% | 78.8% | 61.5% | 7.8 |
10 | ashai | gpt-4o-mini | perplexity | β | β | 68.7% | 74.9% | 69.4% | 81.2% | 64.8% | 26.0 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 53.5% | 62.7% | 51.0% | 59.5% | 41.9% | 31.3 |
12 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 49.3% | 67.9% | 41.3% | 47.6% | 33.8% | 19.3 |
13 | strict-referenced | gpt-5 | perplexity | β | β | 19.1% | 17.2% | 14.7% | 19.8% | 14.8% | 118.6 |
14 | ashai | gpt-5 | default | β | β | 19.1% | 17.1% | 15.3% | 20.2% | 14.7% | 117.8 |
15 | ashai | gpt-5 | default | β | β | 12.1% | 9.8% | 7.8% | 13.5% | 7.7% | 116.1 |
2025-11-11 9:00pm - Evaluation V13¶
Changes¶
- full rewrite of evals, which were getting limited context
- evaluator is now gpt-5
- medical evaluator improvements: implemented stricter reference validation and structured outputs (see the schema sketch after this list)
- language now checked only in language clarity (instead of all) and given more context
- better eval error handling - stopped graceful scoring
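A minimal sketch of what "structured outputs" for the medical evaluator could look like, assuming a pydantic-v2-style schema; the field names are illustrative, not the evaluator's actual contract:

```python
from pydantic import BaseModel, Field

# Sketch assuming a pydantic v2 schema for the evaluator's structured output;
# the field names are illustrative, not the evaluator's actual contract.
class MedicalEvaluation(BaseModel):
    medical_accuracy: float = Field(ge=0, le=100)
    references_valid: bool
    invalid_references: list[str] = []
    rationale: str

raw = {
    "medical_accuracy": 82.0,
    "references_valid": False,
    "invalid_references": ["https://example.org/broken-citation"],
    "rationale": "One citation does not support the claim it is attached to.",
}
print(MedicalEvaluation.model_validate(raw))
```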
Key observations¶
- the new evals took scores down decently which is good
- one of the strict-referenced did quite well!
- lol gpt-5 did the best and the worst! sadly it does the worst with tools - so something's just broken there
next steps¶
- log more stuff for easy reading (or figure out browsing weave better)
- make gpt-5 work with tools - I cannot for the life of me make it work with perplexity. so odd.
- i thought there was a really good strict-referenced but it disappeared? get those higher!
Leaderboard Snapshot for V13¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 91.0% | 92.2% | 90.5% | 89.5% | 83.7% | 73.0 |
2 | ashai | gpt-4o-mini | none | β | β | 76.6% | 83.6% | 73.2% | 84.9% | 67.2% | 8.1 |
3 | one-shot | gpt-4o-mini | default | β | β | 75.5% | 83.5% | 77.7% | 86.8% | 75.0% | 8.1 |
4 | one-shot | gpt-4o-mini | default | β | β | 74.5% | 84.0% | 73.3% | 85.1% | 71.1% | 5.5 |
5 | september | gpt-4o-mini | default | β | β | 74.4% | 79.4% | 74.5% | 83.7% | 70.0% | 15.6 |
6 | perplexity | sonar | default | β | β | 72.7% | 79.5% | 76.8% | 81.5% | 62.3% | 7.7 |
7 | perplexity | sonar | default | β | β | 70.5% | 75.8% | 73.8% | 79.3% | 62.9% | 7.7 |
8 | ashai | gpt-4o-mini | default | β | β | 69.8% | 77.3% | 70.8% | 78.3% | 63.9% | 45.2 |
9 | ashai | gpt-4o-mini | default | β | β | 68.7% | 78.0% | 70.0% | 66.0% | 65.3% | 60.6 |
10 | ashai | gpt-4o-mini | perplexity | β | β | 62.8% | 72.7% | 59.4% | 76.9% | 56.1% | 66.2 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 53.8% | 64.3% | 47.3% | 58.4% | 40.8% | 38.5 |
12 | ashai | gpt-4o-mini | perplexity | β | β | 51.6% | 59.2% | 45.3% | 64.2% | 48.4% | 87.4 |
13 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 47.6% | 67.8% | 33.3% | 45.8% | 33.2% | 55.2 |
14 | strict-referenced | gpt-4o-mini | perplexity | β | β | 16.6% | 24.1% | 2.8% | 26.2% | 6.0% | 110.6 |
15 | strict-referenced | gpt-4o-mini | perplexity | β | β | 15.3% | 25.7% | 3.0% | 18.9% | 4.5% | 140.2 |
16 | ashai | gpt-5 | default | β | β | 4.7% | 0.8% | 0.0% | 4.0% | 1.0% | 160.0 |
2025-08-11 2:00pm - Evaluation V12¶
used this one for testing. iterate on to v13
2025-08-10 7:30pm - Evaluation V11¶
Changes¶
- SPEED fixes! - we were just reporting it wrong... (see the timing sketch after this list)
- another fix for reasoning, and the tool-less models now seem to go quite fast (5s) as expected
- many fixes to docs and tests - though those don't impact evals much
- even with perplexity or other tools turned on responses are now around 8s.
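For the speed-reporting fix, a small sketch of the straightforward way to time each response with `time.perf_counter`; `run_agent` is a placeholder for the real call, not the project's actual function:

```python
import time

# Sketch: wrap the agent call with perf_counter so the reported latency matches
# wall-clock time per response. run_agent is a placeholder for the real call.
def timed_run(run_agent, *args, **kwargs):
    start = time.perf_counter()
    result = run_agent(*args, **kwargs)
    return result, time.perf_counter() - start

result, seconds = timed_run(lambda q: q.upper(), "how long does this take?")
print(f"{seconds:.4f}s")
```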
Key observations¶
- something broke with the med eval - it's always 60%
- interestingly, one-shot took a slight 2% hit
- will come back to make more observations after i get what's left of today's sunshine
- something is still definitely broken with perplexity agent
- gpt-5 with no tools is winning
- ashai gpt-5 is broken... what?
Next Steps¶
- experiment with and probably fix the retry option
- fix perplexity agent
- get some sunshine for a bit
Leaderboard Snapshot for V11¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-5 | default | β | β | 76.1% | 60.0% | 90.4% | 89.2% | 87.1% | 9.8 |
2 | ashai | gpt-4o-mini | default | β | β | 74.6% | 60.0% | 88.8% | 88.3% | 77.5% | 9.3 |
3 | september | gpt-4o-mini | default | β | β | 74.4% | 60.0% | 88.3% | 87.5% | 77.9% | 3.8 |
4 | ashai | gpt-4o-mini | none | β | β | 74.1% | 60.0% | 87.1% | 87.5% | 77.9% | 4.5 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 73.7% | 60.0% | 87.9% | 85.8% | 75.8% | 13.8 |
6 | one-shot | gpt-4o-mini | default | β | β | 72.0% | 59.2% | 81.2% | 87.5% | 75.4% | 5.3 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 60.2% | 50.0% | 68.3% | 72.5% | 60.8% | 15.8 |
8 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 51.2% | 60.0% | 43.8% | 49.6% | 33.3% | 10.6 |
9 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 49.0% | 60.0% | 38.8% | 47.5% | 28.3% | 11.7 |
10 | strict-referenced | gpt-4o-mini | perplexity | β | β | 37.8% | 55.0% | 26.7% | 26.7% | 10.0% | 15.8 |
11 | strict-referenced | gpt-4o-mini | perplexity | β | β | 27.3% | 40.0% | 17.5% | 20.8% | 7.5% | 15.7 |
12 | perplexity | sonar | default | β | β | 18.0% | 35.0% | 2.5% | 8.3% | 0.0% | 2.9 |
13 | ashai | gpt-5 | default | β | β | 1.0% | 0.0% | 0.0% | 5.0% | 0.0% | 16.3 |
2025-08-10 7:30pm - Evaluation V10¶
Oops! Was calling the tool names wrong; incrementing to v11
2025-08-10 4pm - Evaluation V9¶
Changes¶
- simplified agents and made reasoning actually able to be turned off (it was not before) - expecting speeds to be much faster for models with reasoning turned off, hopefully?!
- one-shot agent - trying to get speed here, let's see how a simple one-shot agent does
- docs, tests and python validator fixes everywhere
- moved a lot of endpoints to /agent/*
Key observations:¶
- turning off reasoning is significantly faster - though still at ~30s! but oddly not for the one-shot?! confusing...
- the basic new `one-shot` model is doing the best... and fastest... ugh.
- september did surprisingly well?!? i wonder if it's just tricking the med evaluator, which i still haven't made strong
- suspicious as to how many scored 60% on med eval... all the top ones. very exact and odd number... need to look into the more detailed wandb stats on that (see the distribution check after this list). i wouldn't doubt if some areas are getting a 100 and others a 0.
- ugh why did perplexity do so terribly. i'm doubting my evaluators...
- One-shot with `gpt-4o-mini` leads on Overall (75.5%) and is fast (~41-48s); `ashai` is close (74.9%) but ~2-3× slower (113s).
- Turning off reasoning did not improve speed for one-shot (65.6s vs 41-48s) and held roughly similar quality (~74%). Likely implementation/length behavior needs profiling.
- Perplexity-based variants cratered (~18-22% Overall) with near-zero Precision/Empathy; a clear configuration/prompting regression.
- All strict-referenced variants collapsed (~18-20% Overall, Med Acc ~35-40%) vs V8's ~72%, indicating a breaking change in prompts, tools, or judging logic.
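For the suspiciously exact 60% medical-accuracy scores flagged above, a quick sketch of the per-case distribution check worth running on the exported scores; if the judge only emits a few discrete values, a round mean like 60% falls out naturally (the data below is made up for illustration):

```python
from collections import Counter

# Made-up data, just to illustrate the check: if the judge only emits a few
# discrete values (e.g. 100s and 0s), the mean can land on a suspiciously round 60.
def summarize(per_case_scores: list[float]) -> None:
    mean = sum(per_case_scores) / len(per_case_scores)
    print(f"mean={mean:.1f}  distribution={dict(Counter(per_case_scores))}")

summarize([100, 100, 100, 0, 0, 60, 60, 60, 60, 60])  # mean=60.0
```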
Next Steps¶
- Debug strict-referenced flows: compare V8 vs V9 prompts/tool calls; add regression checks to ensure citations and extraction behave as expected.
- Fix Perplexity integration: verify tool instructions, response parsing, and citation routing; run a minimal sanity suite before full eval.
- Profile the reasoning-off path: confirm the flag actually alters model behavior (no hidden chain-of-thought or retries); cap output lengths and temperature to reduce latency.
- Promote one-shot as a "fast default" path; tune for speed (shorter context, stricter max tokens) without regressing quality.
- Target Med Acc >60%: refine dataset/judger for medical correctness and reference validation; add a few hard cases with clear gold references.
- Add a small ablation with retry enabled for top configs to quantify benefit vs latency (see the grid sketch below).
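For the ablation item, a tiny sketch of the config grid (reasoning on/off × retry on/off) over the top configs; `run_eval` is a stand-in for the existing ./eval entry point and the names here are assumptions:

```python
from itertools import product

# Sketch of a small ablation grid over the top configs: reasoning on/off x retry on/off.
# run_eval is a stand-in for the existing ./eval entry point; names are assumptions.
def run_eval(agent: str, model: str, reasoning: bool, retry: bool) -> None:
    print(f"eval {agent}/{model} reasoning={reasoning} retry={retry}")

top_configs = [("one-shot", "gpt-4o-mini"), ("ashai", "gpt-4o-mini")]
for (agent, model), reasoning, retry in product(top_configs, [True, False], [True, False]):
    run_eval(agent, model, reasoning, retry)
```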
Leaderboard Snapshot for V9¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-4o-mini | default | β | β | 75.5% | 60.0% | 90.0% | 90.0% | 80.0% | 41.2 |
2 | ashai | gpt-4o-mini | default | β | β | 74.9% | 60.0% | 89.2% | 89.2% | 77.9% | 113.3 |
3 | one-shot | gpt-4o-mini | default | β | β | 74.5% | 60.0% | 87.9% | 87.3% | 80.8% | 48.3 |
4 | one-shot | gpt-4o-mini | default | β | β | 74.3% | 60.0% | 86.7% | 88.3% | 80.0% | 65.6 |
5 | september | gpt-4o-mini | default | β | β | 73.5% | 60.0% | 86.5% | 85.8% | 77.5% | 53.8 |
6 | ashai | gpt-4o-mini | none | β | β | 70.6% | 59.2% | 81.2% | 81.7% | 72.9% | 41.5 |
7 | perplexity | sonar | default | β | β | 22.5% | 45.0% | 2.5% | 8.3% | 0.0% | 52.6 |
8 | ashai | gpt-4o-mini | perplexity | β | β | 21.4% | 40.0% | 4.2% | 11.7% | 0.0% | 41.9 |
9 | strict-referenced | gpt-4o-mini | default | β | β | 20.0% | 40.0% | 0.0% | 10.0% | 0.0% | 23.2 |
10 | strict-referenced | gpt-4o-mini | default | β | β | 19.8% | 39.2% | 0.8% | 10.0% | 0.0% | 46.4 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 19.1% | 35.0% | 4.2% | 11.7% | 0.0% | 29.7 |
12 | strict-referenced | gpt-4o-mini | perplexity | β | β | 17.8% | 35.0% | 0.8% | 9.2% | 0.0% | 39.0 |
13 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 17.8% | 35.0% | 0.0% | 10.0% | 0.0% | 48.5 |
14 | strict-referenced | gpt-4o-mini | perplexity | β | β | 17.8% | 35.0% | 0.8% | 9.2% | 0.0% | 36.9 |
15 | strict-referenced-after | gpt-4o-mini | default | β | β | 17.6% | 35.0% | 0.0% | 9.2% | 0.0% | 33.6 |
16 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 15.5% | 30.0% | 0.0% | 10.0% | 0.0% | 36.9 |
2025-01-10 - Evaluation V8 - testing only strict references¶
Changes¶
- adding reasoning and retry flags to table output (ensure now tracked in wandb)
- strict-referenced-after - a new version of ashai that first guesses and then checks references after
- strict-referenced updated to be a small wrapper around ashai - keeps it more similar while still having the strict reference requirements
Key observations:¶
- these are still taking FOREVER!
- so weird that strict-referenced-after with just perplexity won, but strict-referenced with just perplexity did really poorly. I think something there must be broken, as strict-referenced with all tools (including perplexity) did well.
- strict-after didn't seem to be much better than before. but i think the judgment of the references is poor still.
- Guess-then-check (`strict-referenced-after`) with `gpt-5-mini` + `perplexity` leads, driven by strong precision/clarity; Med Acc is still capped at 60%.
- For `gpt-4o-mini`, the baseline `strict-referenced` slightly edges `-after` by ~1-2 pts; simpler toolsets (e.g., `september`, `niharika`) are as good as larger ones.
- `perplexity`-only with `gpt-4o-mini` is severely degraded (and very slow) for both `strict` and `-after`; likely a configuration or prompting mismatch.
- the fact that september (and even niharika) did as well as they did makes me wonder how well the medical accuracy eval is doing
- With no tools, strict agents perform poorly (citations required) vs prior non-strict agents.
- Reasoning was on and retry off for these runs; need ablations to quantify their impact under strict modes.
Next Steps¶
- tighten up medical accuracy to really enforce valid references
- make a generic agent class and try a very 'basic' agent that doesn't use tools - get speed
- debug strict-referenced with perplexity combo
- Optimize speed: cache Perplexity responses (see the caching sketch after this list), parallelize tool calls, cap depth/num results; profile slow paths.
- Run ablations: reasoning on/off and controlled retry enabled for top configs to measure net gain vs latency.
- Validate low-scoring rows for config bugs; cross-check logs and W&B artifacts for anomalies.
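For the caching item above, a minimal sketch of a disk cache keyed on the query hash so repeated eval runs stop re-paying Perplexity latency; `call_perplexity` is a placeholder for the real tool function, not the project's actual API:

```python
import hashlib
import json
from pathlib import Path

# Sketch of a simple disk cache for Perplexity lookups so repeated eval runs don't
# re-pay the network latency; call_perplexity is a placeholder for the real tool.
CACHE_DIR = Path(".perplexity_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_search(query: str, call_perplexity) -> dict:
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = call_perplexity(query)  # only hit the network on a cache miss
    path.write_text(json.dumps(result))
    return result

# usage sketch: cached_search("folic acid dosage in pregnancy", call_perplexity=my_tool_fn)
```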
Leaderboard Snapshot for V8¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced-after | gpt-5-mini | perplexity | β | β | 75.6% | 60.0% | 90.8% | 88.3% | 82.1% | 106.2 |
2 | strict-referenced | gpt-4o-mini | september, niharika, perplexity, pubmed | β | β | 73.9% | 60.0% | 87.5% | 86.7% | 76.7% | 90.5 |
3 | strict-referenced | gpt-4o-mini | september, niharika, pubmed | β | β | 73.5% | 60.0% | 87.5% | 85.8% | 74.6% | 49.7 |
4 | strict-referenced | gpt-4o-mini | september, niharika | β | β | 73.4% | 60.0% | 86.5% | 86.3% | 75.4% | 55.1 |
5 | strict-referenced-after | gpt-4o-mini | september, niharika, pubmed | β | β | 72.8% | 60.0% | 85.4% | 85.0% | 74.6% | 79.2 |
6 | strict-referenced-after | gpt-4o-mini | september | β | β | 72.4% | 60.0% | 82.1% | 87.5% | 73.8% | 43.7 |
7 | strict-referenced-after | gpt-4o-mini | default | β | β | 72.4% | 60.0% | 83.8% | 85.4% | 74.2% | 88.5 |
8 | strict-referenced-after | gpt-4o-mini | september, niharika | β | β | 72.1% | 60.0% | 82.1% | 85.8% | 74.2% | 66.9 |
9 | strict-referenced-after | gpt-4o-mini | september, niharika, perplexity, pubmed | β | β | 72.0% | 60.0% | 82.9% | 85.0% | 72.9% | 71.9 |
10 | strict-referenced | gpt-4o-mini | niharika | β | β | 68.4% | 60.0% | 78.3% | 76.7% | 65.0% | 74.1 |
11 | strict-referenced-after | gpt-4o-mini | niharika | β | β | 65.9% | 60.0% | 72.1% | 74.2% | 60.0% | 57.6 |
12 | strict-referenced | gpt-5-mini | perplexity | β | β | 45.3% | 60.0% | 35.0% | 37.5% | 20.8% | 144.1 |
13 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 43.0% | 60.0% | 28.3% | 35.8% | 17.5% | 164.1 |
14 | strict-referenced-after | gpt-4o-mini | none | β | β | 39.1% | 56.7% | 25.0% | 30.0% | 13.3% | 28.3 |
15 | strict-referenced | gpt-4o-mini | none | β | β | 34.5% | 54.2% | 21.7% | 18.3% | 10.0% | 39.7 |
16 | strict-referenced | gpt-4o-mini | perplexity | β | β | 31.2% | 45.0% | 21.7% | 23.3% | 8.3% | 187.7 |
2025-08-09 - Evaluation V7¶
Changes¶
- Debugged GPT-5, increased timeout limit so it can complete
- Added `--no-reasoning` flag to the ./eval script to test whether reasoning is helping or not
- Added `--eval-retry` flag to allow models to double-check themselves and try again with feedback if the response is poor (can add a lot of time)
- Added `--verbose` flag to eval for easier debugging
- Lowered empathy score importance from 15% to 10% and made it harder; all results were coming in at a consistent 80% (see the weighting sketch after this list)
- utilized the "Language" tag in example datasets and now more solidly judging that the response is in the correct language.
- added an example_description key to the dataset so that messages could be in native languages but you could still easily understand what the example was about.
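For the empathy re-weighting, a small sketch of the weighted overall score; only the 10% empathy weight comes from the note above, the other weights are placeholders for illustration:

```python
# Sketch of the weighted overall score after lowering empathy to 10%; only the
# empathy weight comes from the note above, the other weights are placeholders.
WEIGHTS = {
    "medical_accuracy": 0.40,
    "precision": 0.25,
    "clarity": 0.25,
    "empathy": 0.10,  # was 0.15 before this change
}

def overall(scores: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must stay normalized
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

print(overall({"medical_accuracy": 90, "precision": 85, "clarity": 88, "empathy": 80}))
# -> 87.25
```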
Key observations:¶
- GPT-5 is now debugged a bit but still super slow (even nano and mini)
- GPT-5 performs the best (as expected) but is insanely slow still
- GPT-5 with no tools significantly outperforms other configurations - tools aren't yet helping!
- Reasoning seems to help performance based on results
- One row is 10 hours! This was because my laptop went to sleep and restarted when it awoke :).
- gpt-5 with reasoning (`step`) got 89% but without got 76%
Next Steps¶
- Figure out speed issues with GPT-5
- better handle timeout issues
- better handle issues where the AI cannot come up with an answer
- maybe focus just on strict mode for a while - and make the tools actually help
- make strict mode more like Ashai (except for its requirements)
- will tighten ashai's tool prompt options and make it work more with strict mode
- try a strict-reference-after approach
Leaderboard Snapshot for V7¶
Rank | Agent | Model | Tools | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | 89.5% | 88.3% | 91.7% | 90.0% | 88.3% | 147.4 |
2 | ashai | gpt-4o-mini | default | 83.9% | 81.7% | 90.0% | 83.3% | 80.0% | 30.7 |
3 | ashai | gpt-5-mini | none | 76.2% | 60.0% | 90.8% | 90.0% | 85.4% | 46.4 |
4 | ashai | gpt-5 | none | 76.0% | 60.0% | 90.8% | 89.2% | 84.2% | 97.0 |
5 | ashai | gpt-4o-mini | default | 74.8% | 76.7% | 71.7% | 80.0% | 63.3% | 34.2 |
6 | ashai | gpt-5-mini | default | 72.1% | 59.2% | 82.5% | 85.8% | 76.7% | 156.9 |
7 | ashai | gpt-5-nano | default | 69.2% | 55.0% | 82.9% | 80.8% | 75.8% | 181.1 |
8 | ashai | gpt-3.5-turbo | default | 65.7% | 60.0% | 66.2% | 79.2% | 62.9% | 34.1 |
9 | ashai | gpt-5 | perplexity | 56.3% | 53.3% | 60.0% | 60.0% | 53.3% | 173.8 |
10 | ashai | gpt-5 | default | 51.3% | 40.0% | 60.4% | 62.9% | 55.8% | 17066.4 |
11 | ashai | gpt-5 | default | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% | 162.5 |
12 | ashai | gpt-5 | default | 30.4% | 30.0% | 31.7% | 30.0% | 30.0% | 170.1 |
2025-08-09 15:00 - Evaluation V5¶
Changes¶
Initial batch evals show GPT-5 underperforming vs GPT-4o-mini; tool usage impact appears small.
Key observations:¶
- GPT-5 is failing, likely due to usage/formatting issues; investigating. GPT-5 should be better.
- Surprising: Perplexity underperforms; likely prompt/config differences to investigate.
- empathy scores are the same across the board
Next Steps¶
- Add eval options to disable reasoning.
- Add eval option for eval-retry (model gets feedback and retries).
- Improve eval stability, then consider disabling all tools except Perplexity.
- Test with strict-referenced.
- Try a "strict-ref-check-after": let GPT answer naturally, then perform a reference lookup/validation step (see the sketch after this list).
- lower empathy's score and make it harder/tighter
- don't punish answers where it abstained because it didn't have enough data
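For the "strict-ref-check-after" idea, a rough sketch of the answer-then-validate loop; `ask_model` is a stub for the real LLM call and the URL reachability check is deliberately naive, so treat every name here as an assumption rather than the project's actual code:

```python
import re
import urllib.request

URL_RE = re.compile(r"https?://[^\s)\]]+")

# Sketch of the answer-then-validate idea; ask_model is a stub for the real LLM
# call, and the reachability check is deliberately simple. Names are assumptions.
def ask_model(prompt: str) -> str:
    return "Stub answer citing https://example.com/"  # replace with the real call

def url_is_reachable(url: str, timeout: float = 5.0) -> bool:
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def answer_then_check(question: str) -> str:
    draft = ask_model(question)
    bad = [u for u in URL_RE.findall(draft) if not url_is_reachable(u)]
    if not bad:
        return draft
    # Second pass: ask the model to drop or replace the unverifiable references.
    return ask_model(
        f"{question}\n\nRevise your previous answer; these references could not "
        f"be verified: {', '.join(bad)}"
    )

print(answer_then_check("What vaccines are recommended in the third trimester?"))
```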
Leaderboard Snapshot for V5¶
Rank | Agent | Model | Tools | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | default | 89.6% | 91.3% | 85.8% | 91.7% | 88.8% | 56.0 |
2 | ashai | gpt-4o-mini | september health, niharika faqs, pubmed | 89.4% | 90.4% | 86.7% | 91.7% | 88.3% | 51.5 |
3 | strict-referenced | gpt-4o-mini | default | 89.2% | 90.4% | 86.2% | 91.3% | 87.9% | 65.3 |
4 | ashai | gpt-4o-mini | none | 88.8% | 89.6% | 86.3% | 90.0% | 89.6% | 27.9 |
5 | ashai | gpt-4o-mini | niharika faqs, pubmed, perplexity | 88.6% | 90.0% | 86.2% | 89.2% | 88.3% | 45.1 |
6 | ashai | gpt-4o-mini | perplexity | 82.8% | 82.9% | 79.2% | 82.1% | 89.2% | 71.8 |
7 | ashai | gpt-4.1 | default | 76.4% | 79.6% | 57.5% | 85.0% | 87.9% | 42.9 |
8 | ashai | gpt-3.5-turbo | default | 72.6% | 71.2% | 57.5% | 84.6% | 85.4% | 44.4 |
9 | perplexity | sonar | default | 62.5% | 74.2% | 14.2% | 82.1% | 85.8% | 14.8 |
10 | perplexity | sonar | default | 62.0% | 75.0% | 14.2% | 78.8% | 84.6% | 17.7 |
11 | strict-referenced | gpt-4o-mini | september health, niharika faqs, pubmed | 61.8% | 76.7% | 7.1% | 82.5% | 85.4% | 25.9 |
12 | perplexity | sonar | default | 61.4% | 75.8% | 7.1% | 82.9% | 85.0% | 14.9 |
13 | perplexity | sonar | default | 61.0% | 67.1% | 21.2% | 80.0% | 85.8% | 34.7 |
14 | strict-referenced | gpt-4o-mini | september health, niharika faqs, perplexity, pubmed | 60.5% | 73.8% | 7.1% | 82.1% | 85.4% | 14.3 |
15 | strict-referenced | gpt-4o-mini | september health, niharika faqs | 60.2% | 68.8% | 14.2% | 81.2% | 86.2% | 30.9 |
16 | september | gpt-4o-mini | default | 49.3% | 50.8% | 22.5% | 54.2% | 83.3% | 79.2 |
17 | ashai | gpt-5 | default | 42.5% | 45.0% | 6.7% | 51.7% | 83.3% | 74.0 |