Learnings Log
v68
Changes
- improved the prompt for perplexity search and removed a lot of unnecessary complexity from it
v67 - sep 9, 2025 10:20pm
Changes
- perplexity search prompt improved
- Nihai fully built out with 2,500 rows
Errors
Overall errors: 0/48 cases
Weave-reported errors: 5 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v67 | 20250909_211511: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v67 | 20250909_211353: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v67 | 20250909_211720: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v67 | 20250909_211844: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v67 | 20250909_220518: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v67 | 20250909_211720: 2/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v67 | 20250909_211844: 1/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v67 | 20250909_220518: 1/8
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v67 | 20250909_211511: 1/8
- strict-referenced | gpt-4o-mini | tools=search_september | full | v=v67 | 20250909_215020: 0/8
AI analysis:
Error Analysis
- Overall Performance: Errors decreased from 6 to 5 Weave-reported events, indicating slight improvement in tool interactions.
- Error Patterns: Consistent high error rates in runs using "search_nihai" and "ask_nihai," suggesting persistent tool-specific issues.
- Suspected Causes: Recent updates may have stabilized performance, but specific tool combinations still require further investigation.
- Common Messages: No common error messages reported, indicating stable response quality across evaluations.
- Next Steps: Analyze high-error runs, explore alternative tool combinations, and monitor for emerging issues in future evaluations.
- Comparison to Previous Entry: Overall errors remain low, but Weave-reported errors slightly increased, highlighting areas for continued focus.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 86.8%
- Fastest: strict-referenced | gpt-4o-mini – 6.0s
- Top Performer: strict-referenced | gpt-4o-mini achieved 86.8%, a slight increase from 80.7% in v66.
- Error Rate: Overall errors remain at 0/48 cases, consistent with v66's performance.
- Weave-Reported Errors: Decreased to 5 events, showing continued improvement from 6 in v66.
- Speed Outlier: strict-referenced | gpt-4o-mini recorded an average time of 60.6s, significantly faster than the previous 104.7s.
- Notable Gaps: Performance consistency varies; some runs scored significantly lower, particularly with "search_nihai" tools.
- Common Error Hotspots: High error rates persist with "ask_nihai, search_nihai" combinations, indicating areas needing further investigation.
Leaderboard v67
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | september | β | β | 86.8% | 89.4% | 87.4% | 87.0% | 85.0% | 108.5 | 0 |
2 | strict-referenced | gpt-4o-mini | noora | β | β | 83.6% | 86.2% | 88.0% | 88.5% | 87.6% | 233.3 | 1 |
3 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 76.6% | 75.0% | 76.2% | 85.0% | 74.9% | 107.5 | 1 |
4 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 70.4% | 65.6% | 70.6% | 74.4% | 73.1% | 12.4 | 1 |
5 | strict-referenced | gpt-4o-mini | nihai | β | β | 70.3% | 66.9% | 69.4% | 74.8% | 70.6% | 6.8 | 2 |
6 | strict-referenced | gpt-4o-mini | nihai | β | β | 60.6% | 54.5% | 58.5% | 63.8% | 64.4% | 6.0 | 0 |
v66 - sep 9, 2025 5:43pm
Changes
- increased ask_nihai and search_nihai relevance score to 0.5
- improved how references are returned for nihai faq - link to them
- debug nihai relevance
- separate embeddings for title and answer - ranking combines the two (see the sketch below)
- nihai faqs are now linked as references, pointing to the nihai UI
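Since the ranking now blends a title embedding with an answer embedding, here is a minimal sketch of how that combination and the 0.5 relevance cutoff might fit together. The function and field names (rank_faqs, title_emb, answer_emb) and the equal weights are illustrative assumptions, not the actual search_nihai implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_faqs(query_emb, faqs, title_weight=0.5, answer_weight=0.5, threshold=0.5):
    """Score each FAQ by blending title and answer similarity, then filter and sort.

    `faqs` is assumed to be a list of dicts with precomputed `title_emb` and
    `answer_emb` vectors; the 0.5 relevance threshold mirrors the change above.
    """
    scored = []
    for faq in faqs:
        score = (title_weight * cosine(query_emb, faq["title_emb"])
                 + answer_weight * cosine(query_emb, faq["answer_emb"]))
        if score >= threshold:
            scored.append((score, faq))
    return [faq for score, faq in sorted(scored, key=lambda x: x[0], reverse=True)]
```

An equal-weight blend is the simplest starting point; weighting the title higher is a common variant when titles are short and dense.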
Errors
Overall errors: 0/48 cases
Weave-reported errors: 6 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v66 | 20250909_170831: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v66 | 20250909_170705: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v66 | 20250909_171125: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v66 | 20250909_171318: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v66 | 20250909_174205: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v66 | 20250909_170831: 3/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v66 | 20250909_174205: 1/8
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v66 | 20250909_171125: 1/8
- strict-referenced | gpt-4o-mini | tools=search_september | full | v=v66 | 20250909_172745: 1/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v66 | 20250909_171318: 0/8
AI analysis:
Error Analysis
- Overall Performance: Errors decreased from 14/88 to 6/48, indicating improved stability in the latest evaluation.
- Weave-Reported Errors: Reduced from 14 to 6 events, suggesting better handling of tool interactions.
- Error Patterns: High error rates persist in runs using "ask_nihai" and "search_nihai," indicating potential tool-specific issues.
- Common Messages: No common error messages reported, reflecting consistent response quality across evaluations.
- Suspected Causes: Recent updates may have optimized performance, but specific tool combinations still require closer scrutiny.
- Next Steps: Investigate high-error runs, test alternative tool combinations, and monitor for any emerging issues in future evaluations.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 80.7%
- Fastest: strict-referenced | gpt-4o-mini – 6.4s
- Top Performer: strict-referenced | gpt-4o-mini achieved 87.5%, maintaining the lead from v65.
- Error Rate: Overall errors remain at 0/48 cases, consistent with v65's performance.
- Weave-Reported Errors: Decreased to 6 events, showing improvement from 14 in v65.
- Speed Outlier: strict-referenced | gpt-4o-mini recorded an average time of 9.3s, slower than the previous fastest of 7.7s.
- Notable Gaps: Performance consistency varies; some runs scored significantly lower, particularly with "search_nihai" tools.
- Common Error Hotspots: High error rates observed with "ask_nihai, search_nihai" combinations, indicating areas needing further investigation.
Leaderboard v66
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 80.7% | 83.8% | 84.1% | 87.2% | 83.1% | 6.4 | 1 |
2 | strict-referenced | gpt-4o-mini | noora | β | β | 76.6% | 76.2% | 79.4% | 80.6% | 76.6% | 104.7 | 0 |
3 | strict-referenced | gpt-4o-mini | nihai | β | β | 55.1% | 45.0% | 53.1% | 65.0% | 56.9% | 10.3 | 1 |
4 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 44.7% | 33.8% | 44.4% | 59.8% | 46.6% | 18.0 | 3 |
5 | strict-referenced | gpt-4o-mini | september | β | β | 21.1% | 0.0% | 15.6% | 27.5% | 23.8% | 103.7 | 1 |
6 | strict-referenced | gpt-4o-mini | nihai | β | β | 18.5% | 0.0% | 16.2% | 31.2% | 20.6% | 7.1 | 0 |
v65 - sep 9, 2025 11:43am
Changes
- same as before, but hopefully nihai search works this time
Errors
Overall errors: 0/88 cases
Weave-reported errors: 14 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v65 | 20250908_180515: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v65 | 20250909_110908: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250908_180331: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250908_180723: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250909_110736: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250909_111053: 3/8
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v65 | 20250908_180723: 3/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v65 | 20250909_114142: 2/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v65 | 20250909_112518: 2/8
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v65 | 20250908_180515: 1/8
AI analysis:
Error Analysis
- Overall Stability: Errors remain at 0/88, showing consistent performance improvement compared to 0/48 in the previous entry.
- Weave-Reported Errors: Increased to 14 events, indicating potential issues with specific tool combinations that need further investigation.
- Top Error Runs: High error rates in runs using "search_nihai" suggest tool-specific problems, particularly with the "ask_nihai" combination.
- Common Messages: Absence of common error messages continues, indicating sustained improvements in response generation.
- Suspected Causes: Recent updates may have optimized performance, but specific tool interactions still require detailed analysis.
- Next Steps: Investigate high-error runs for patterns, test alternative tool combinations, and monitor for emerging issues in future evaluations.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 87.5%
- Fastest: strict-referenced | gpt-4o-mini – 7.7s
- Top Performer: strict-referenced | gpt-4o-mini achieved 81.9%, maintaining the lead from v64.
- Error Rate: Overall errors improved to 0/88 cases, a significant enhancement from v64's 0/48.
- Weave-Reported Errors: Increased to 14 events, indicating emerging issues with specific tool combinations.
- Speed Outlier: strict-referenced | gpt-4o-mini recorded an average time of 6.3s, slightly slower than the previous fastest.
- Notable Gaps: Performance consistency varies; some runs scored significantly lower, particularly with "search_nihai" tools.
- Common Error Hotspots: High error rates observed with "search_nihai," suggesting targeted areas for further investigation.
Leaderboard v65
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | nihai | β | β | 87.5% | 88.1% | 87.5% | 89.0% | 86.6% | 9.3 | 0 |
2 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 86.7% | 88.8% | 86.2% | 89.4% | 83.8% | 12.2 | 1 |
3 | strict-referenced | gpt-4o-mini | september | β | β | 85.2% | 87.5% | 86.9% | 88.5% | 85.2% | 104.3 | 1 |
4 | strict-referenced | gpt-4o-mini | nihai | β | β | 85.0% | 85.1% | 83.8% | 86.2% | 84.0% | 7.7 | 0 |
5 | strict-referenced | gpt-4o-mini | september | β | β | 83.8% | 85.6% | 88.4% | 88.5% | 83.4% | 8.4 | 1 |
6 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 83.4% | 85.6% | 86.2% | 87.5% | 83.1% | 9.2 | 2 |
7 | strict-referenced | gpt-4o-mini | nihai | β | β | 76.2% | 75.0% | 76.1% | 77.9% | 76.0% | 104.4 | 3 |
8 | strict-referenced | gpt-4o-mini | noora | β | β | 69.7% | 68.1% | 67.9% | 78.2% | 67.5% | 107.1 | 2 |
9 | strict-referenced | gpt-4o-mini | nihai | β | β | 66.4% | 59.7% | 66.4% | 78.6% | 65.7% | 14.8 | 3 |
10 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 60.1% | 55.6% | 58.1% | 61.6% | 63.4% | 9.4 | 0 |
11 | strict-referenced | gpt-4o-mini | noora | β | β | 53.6% | 42.5% | 52.2% | 58.5% | 54.1% | 8.9 | 1 |
v64 - sep 8, 2025 5:10pm
Changes
- preloaded Nihai with perplexity answers to all q's from India Generated FAQ (see the sketch below)
- created search_noora_faq populated with India Generated FAQ
- test to compare the different tools against each other!
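A rough sketch of how the FAQ questions might have been answered in bulk via Perplexity before loading them into Nihai. Perplexity exposes an OpenAI-compatible chat endpoint, but treat the base URL, model name, and loop shape here as illustrative assumptions rather than the project's actual ingestion script.

```python
from openai import OpenAI

# Perplexity's API is OpenAI-compatible; the "sonar" model matches the one
# used elsewhere in these leaderboards. The API key handling is a placeholder.
client = OpenAI(base_url="https://api.perplexity.ai", api_key="PPLX_API_KEY")

def answer_faq_questions(questions: list[str]) -> list[dict]:
    """Generate one answer per FAQ question; rows would then be loaded into Nihai."""
    rows = []
    for q in questions:
        resp = client.chat.completions.create(
            model="sonar",
            messages=[{"role": "user", "content": q}],
        )
        rows.append({"question": q, "answer": resp.choices[0].message.content})
    return rows
```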
Errors
Overall errors: 0/48 cases
Weave-reported errors: 12 events
Top runs by error rate:
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v64 | 20250908_163528: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v64 | 20250908_163408: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v64 | 20250908_165017: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v64 | 20250908_165143: 0/8 (0%)
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v64 | 20250908_170801: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=search_nihai | full | v=v64 | 20250908_165017: 6/8
- strict-referenced | gpt-4o-mini | tools=search_september | full | v=v64 | 20250908_165330: 2/8
- strict-referenced | gpt-4o-mini | tools=ask_nihai,search_nihai | full | v=v64 | 20250908_163528: 2/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq,ask_nihai,search_nihai | full | v=v64 | 20250908_170801: 1/8
- strict-referenced | gpt-4o-mini | tools=search_noora_faq | full | v=v64 | 20250908_165143: 1/8
AI analysis:
Error Analysis
- Significant Improvement: Overall errors decreased to 0/48, a notable improvement from 2/288 in the previous entry, indicating enhanced stability.
- Weave-Reported Errors: 12 Weave-reported errors suggest isolated issues, primarily with specific tool combinations, warranting further investigation.
- Top Error Runs: High error rates in runs using "search_nihai" indicate potential tool-specific issues that need addressing.
- Common Messages: Absence of common error messages suggests improved response generation and stability compared to the previous version.
- Suspected Causes: Recent updates may have optimized configurations, reducing errors significantly; however, specific tool interactions still require scrutiny.
- Next Steps: Analyze high-error runs for patterns, test alternative tool combinations, and monitor for any emerging issues in future evaluations.
Observations
- Top performer: strict-referenced | gpt-4o-mini – 81.9%
- Fastest: strict-referenced | gpt-4o-mini – 6.3s
- Error Rate Improvement: Overall errors decreased to 0/48 cases, indicating enhanced reliability compared to v63.
- Weave-Reported Errors: 12 Weave-reported errors highlight issues primarily with strict-referenced configurations, suggesting specific areas for improvement.
- Speed Outlier: one-shot | gpt-4o-mini recorded an average time of 3.0s, remaining the fastest observed.
- Notable Gaps: Performance consistency varies; some strict-referenced runs scored significantly lower than expected, indicating potential configuration issues.
- Common Error Messages: Formatting and timeout errors were prevalent, suggesting areas for improvement in response generation.
Leaderboard v64
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-4o-mini | september | β | β | 81.9% | 85.0% | 87.9% | 88.9% | 86.4% | 105.1 | 2 |
2 | strict-referenced | gpt-4o-mini | noora, ask nihai, nihai | β | β | 69.9% | 65.6% | 70.8% | 73.8% | 71.1% | 8.7 | 1 |
3 | strict-referenced | gpt-4o-mini | noora | β | β | 59.1% | 51.9% | 60.6% | 66.2% | 59.6% | 9.7 | 1 |
4 | strict-referenced | gpt-4o-mini | nihai | β | β | 56.5% | 53.3% | 52.5% | 61.7% | 54.7% | 7.0 | 6 |
5 | strict-referenced | gpt-4o-mini | ask nihai, nihai | β | β | 54.8% | 51.2% | 51.2% | 62.0% | 57.5% | 107.3 | 2 |
6 | strict-referenced | gpt-4o-mini | nihai | β | β | 46.9% | 40.6% | 44.4% | 56.9% | 46.2% | 6.3 | 0 |
v63 - sep 6, 2025 3:33pm
Changes
- updated eval datasets with Niharika's edits to the 12 base examples
- double note: the NihAI tools are the new default
Errors
Overall errors: 2/288 cases
Weave-reported errors: 25 events
Top runs by error rate:
- ashai | gpt-4o-mini | tools=default | full | v=v63 [no-reasoning] | 20250906_150610: 1/8 (12%)
- strict-referenced | gpt-4o-mini | tools=default | full | v=v63 | 20250906_152010: 1/8 (12%)
- ashai | gpt-4o-mini | tools=default | full | v=v63 [no-reasoning] | 20250903_114606: 0/8 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v63 [no-reasoning] | 20250906_142109: 0/8 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v63 [retry] | 20250903_114803: 0/8 (0%)
Top runs by Weave-reported errors:
- strict-referenced | gpt-4o-mini | tools=default | full | v=v63 | 20250906_144509: 3/8
- ashai | gpt-5 | tools=default | full | v=v63 | 20250903_115255: 2/8
- ashai-experiment | gpt-4o-mini | tools=default | full | v=v63 | 20250906_143500: 2/8
- ashai | gpt-4o-mini | tools=default | full | v=v63 [retry] | 20250906_142236: 2/8
- ashai-experiment | gpt-4o-mini | tools=default | full | v=v63 | 20250903_115918: 2/8
Common error messages:
- "Error: Failed to generate a properly formatted response. Please try again." ×1
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: APITimeoutError: Request timed out." ×1
Error categories:
- formatting_error ×1
- timeout ×1
Origins:
- agent:direct ×2
Final attempt counts:
- after 1 attempt(s) ×1
AI analysis:
Error Analysis
- Increased Error Rate: Overall errors rose to 2/288, contrasting with the previous entry's 0/324, indicating potential instability in recent configurations.
- Weave-Reported Errors: 25 Weave-reported errors highlight specific issues, particularly with "ashai" and "strict-referenced" runs, suggesting model sensitivity.
- Top Error Runs: Both "ashai" and "strict-referenced" configurations show a 12% error rate, indicating possible flaws in their setup or execution.
- Common Error Messages: Errors related to formatting and timeouts suggest underlying issues with response generation and API reliability.
- Suspected Causes: Recent changes in configurations may have introduced instability; further investigation into "no-reasoning" and "retry" modes is warranted.
- Next Steps: Review recent configuration changes, focus on high-error runs, and consider reverting to previous stable settings for comparison.
Observations
- Top performer: ashai | gpt-5 – 89.6%
- Fastest: one-shot | gpt-4o-mini – 3.0s
- Top Performer: strict-referenced | gpt-5 achieved 90.5% accuracy, surpassing previous top scores significantly.
- Error Rate Increase: Overall errors rose to 2/288 cases, indicating a slight decline in reliability compared to v62.
- Weave-Reported Errors: 25 Weave-reported errors highlight potential issues, particularly with gpt-4o-mini configurations.
- Speed Outlier: one-shot | gpt-4o-mini recorded an average time of 1.8s, remaining the fastest observed.
- Notable Gaps: Performance consistency varies; some strict-referenced runs scored significantly lower than expected.
- Common Error Messages: Formatting and timeout errors were prevalent, suggesting areas for improvement in response generation.
Leaderboard v63
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | default | β | β | 89.6% | 90.0% | 90.6% | 90.0% | 89.1% | 33.2 | 0 |
2 | ashai-experiment | gpt-5 | default | β | β | 89.5% | 90.2% | 90.6% | 89.5% | 90.2% | 22.6 | 0 |
3 | strict-referenced | gpt-5 | default | β | β | 89.5% | 90.0% | 90.2% | 89.2% | 89.5% | 51.9 | 1 |
4 | strict-referenced-after | gpt-5 | default | β | β | 89.3% | 90.6% | 90.4% | 89.8% | 86.6% | 39.0 | 1 |
5 | ashai-experiment | gpt-5 | default | β | β | 89.3% | 90.0% | 90.6% | 89.8% | 87.2% | 59.2 | 1 |
6 | ashai-experiment | gpt-5 | default | β | β | 89.2% | 90.0% | 90.0% | 90.0% | 88.8% | 52.6 | 1 |
7 | strict-referenced | gpt-5 | default | β | β | 89.1% | 91.5% | 89.6% | 89.0% | 85.1% | 23.4 | 0 |
8 | ashai | gpt-5 | default | β | β | 89.1% | 89.4% | 89.4% | 90.4% | 86.2% | 66.9 | 0 |
9 | strict-referenced-after | gpt-5 | default | β | β | 88.7% | 89.4% | 90.2% | 88.2% | 85.4% | 18.7 | 0 |
10 | ashai | gpt-5 | default | β | β | 88.7% | 89.3% | 91.0% | 89.7% | 88.6% | 44.1 | 2 |
11 | ashai | gpt-4o-mini | default | β | β | 87.9% | 89.4% | 88.8% | 89.6% | 86.4% | 7.2 | 1 |
12 | ashai | gpt-4o-mini | default | β | β | 87.8% | 90.0% | 87.5% | 87.5% | 85.9% | 5.5 | 0 |
13 | ashai-experiment | gpt-4o-mini | default | β | β | 87.7% | 88.8% | 88.1% | 88.5% | 86.6% | 5.3 | 0 |
14 | strict-referenced-after | gpt-4o-mini | default | β | β | 87.1% | 88.1% | 87.5% | 86.6% | 81.4% | 5.1 | 0 |
15 | ashai-experiment | gpt-4o-mini | default | β | β | 87.1% | 90.0% | 85.0% | 88.8% | 83.0% | 9.4 | 2 |
16 | ashai | gpt-4o-mini | default | β | β | 87.0% | 88.1% | 87.5% | 88.8% | 83.5% | 6.0 | 2 |
17 | strict-referenced-after | gpt-4o-mini | default | β | β | 86.9% | 88.1% | 86.0% | 87.8% | 82.9% | 8.0 | 0 |
18 | ashai | gpt-4o-mini | perplexity | β | β | 86.8% | 88.8% | 88.1% | 88.5% | 86.6% | 5.2 | 0 |
19 | ashai | gpt-4o-mini | default | β | β | 86.7% | 89.4% | 84.1% | 88.5% | 81.6% | 12.8 | 0 |
20 | ashai | gpt-4o-mini | perplexity | β | β | 86.6% | 87.5% | 85.0% | 87.4% | 86.4% | 8.8 | 1 |
21 | strict-referenced | gpt-4o-mini | default | β | β | 86.3% | 87.6% | 85.7% | 85.4% | 81.1% | 6.1 | 3 |
22 | ashai-experiment | gpt-4o-mini | default | β | β | 86.3% | 88.5% | 85.0% | 88.4% | 82.5% | 13.5 | 2 |
23 | ashai | gpt-4o-mini | perplexity | β | β | 85.7% | 86.9% | 84.4% | 87.5% | 83.1% | 10.8 | 2 |
24 | ashai | gpt-4o-mini | default | β | β | 85.6% | 86.2% | 84.4% | 86.5% | 86.2% | 18.2 | 1 |
25 | ashai | gpt-4o-mini | default | β | β | 85.5% | 85.6% | 85.0% | 88.2% | 82.8% | 11.0 | 2 |
26 | ashai | gpt-4o-mini | default | β | β | 85.2% | 88.8% | 85.0% | 88.2% | 83.8% | 10.0 | 0 |
27 | strict-referenced | gpt-4o-mini | default | β | β | 80.0% | 78.8% | 79.4% | 83.5% | 79.4% | 11.8 | 1 |
28 | one-shot | gpt-5 | default | β | β | 78.0% | 90.0% | 90.6% | 89.5% | 90.9% | 11.2 | 0 |
29 | perplexity | sonar | default | β | β | 77.2% | 88.1% | 88.1% | 88.1% | 80.9% | 5.4 | 0 |
30 | one-shot | gpt-4o-mini | default | β | β | 77.1% | 88.1% | 89.8% | 89.4% | 88.0% | 3.5 | 0 |
31 | one-shot | gpt-4o-mini | default | β | β | 76.8% | 88.1% | 89.4% | 90.0% | 86.9% | 3.0 | 0 |
32 | perplexity | sonar | default | β | β | 76.2% | 85.6% | 85.2% | 87.4% | 82.8% | 4.7 | 0 |
33 | ashai | gpt-4o-mini | default | β | β | 75.6% | 75.6% | 78.8% | 80.0% | 77.9% | 8.6 | 0 |
34 | one-shot | gpt-5 | default | β | β | 71.0% | 80.0% | 81.9% | 86.0% | 80.6% | 17.9 | 0 |
35 | ashai | gpt-4o-mini | default | β | β | 68.2% | 68.8% | 68.1% | 69.8% | 68.1% | 7.6 | 0 |
36 | strict-referenced | gpt-4o-mini | default | β | β | 66.5% | 66.9% | 65.6% | 69.8% | 66.2% | 7.9 | 2 |
v62 - aug 27, 2025 3:19pm
Changes
- significant formatting of Resource types; sticking with the types returned by the tools to reduce potential transformation issues
- doubling up on each run to see whether the same number of issues/errors arise and how similar the scores are
- added 'final turn' JSON enforcement
- medical_accuracy should force reference_sufficiency to 0 manually (see the sketch below)
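As a rough illustration of the reference_sufficiency change: the medical-accuracy scorer can clamp that component to 0 whenever the response carries no references, regardless of what the judge returned. The function name and dict shapes below are hypothetical stand-ins, not the real eval code.

```python
def score_medical_accuracy(response: dict, judge_scores: dict) -> dict:
    """Combine judge scores for medical accuracy.

    Illustrative only: if the response has zero references, the
    reference_sufficiency component is hard-coded to 0 as described above.
    """
    scores = dict(judge_scores)
    if not response.get("references"):
        scores["reference_sufficiency"] = 0.0
    return scores
```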
Errors
Overall errors: 0/324 cases
Weave-reported errors: 11 events
Top runs by error rate:
- ashai | gpt-4o-mini | tools=default | full | v=v62 [no-reasoning] | 20250827_140831: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 [no-reasoning] | 20250827_145146: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 [retry] | 20250827_141010: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 [retry] | 20250827_145306: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v62 | 20250827_140710: 0/12 (0%)
Top runs by Weave-reported errors:
- ashai | gpt-4o-mini | tools=default | full | v=v62 [no-reasoning] | 20250827_140831: 3/12
- ashai-experiment | gpt-4o-mini | tools=default | full | v=v62 | 20250827_141945: 2/12
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v62 | 20250827_141153: 2/12
- strict-referenced | gpt-5 | tools=default | full | v=v62 | 20250827_142923: 1/12
- strict-referenced-after | gpt-5 | tools=default | full | v=v62 | 20250827_143942: 1/12
AI analysis:
- Error Rate Consistency: Overall errors remain at 0/324, indicating stable performance compared to previous entry (v60).
- Weave-Reported Errors: 11 Weave-reported errors suggest potential issues with specific runs, particularly with "no-reasoning" configurations.
- Top Error Runs: "ashai | gpt-4o-mini" shows repeated errors in "no-reasoning" and "retry" modes, indicating possible configuration sensitivity.
- Suspected Causes: High Weave-reported errors may stem from inadequate reasoning capabilities or model limitations in specific contexts.
- Next Steps: Investigate "no-reasoning" configurations further; consider adjusting parameters or exploring alternative models for improved performance.
- Comparison Insight: No changes were made from v60, yet error patterns persist, suggesting a need for deeper analysis of existing configurations.
Observations
- gpt-4 is only 1% behind gpt-5 and takes a small fraction of the time (6s vs 40s)
- We're getting high 80s across the board. The evals aren't showing much better or worse for any of the variants, really. I think we could use some better datasets here now!
- Super odd that some of the strict-referenced duplicates scored super high but then super low...
- Top performer: strict-referenced | gpt-5 – 89.2%
- Fastest: one-shot | gpt-4o-mini – 1.8s
- Top Performer: strict-referenced | gpt-5 | default achieved 90.5% accuracy, surpassing previous top scores significantly.
- Speed Outlier: one-shot | gpt-4o-mini | default recorded an average time of 1.5s, the fastest yet observed.
- Overall Errors: Total errors decreased to 2/350 cases, showing improved reliability compared to v60.
- Notable Changes: Introduction of new configurations led to increased accuracy and reduced error rates across the board.
Leaderboard v62
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s | # err |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | default | β | β | 89.2% | 89.6% | 90.2% | 89.5% | 88.8% | 40.8 | 1 |
2 | strict-referenced | gpt-5 | default | β | β | 89.1% | 89.6% | 89.2% | 90.2% | 87.8% | 21.2 | 0 |
3 | ashai | gpt-5 | default | β | β | 89.1% | 89.8% | 90.4% | 89.2% | 88.8% | 13.9 | 0 |
4 | ashai-experiment | gpt-5 | default | β | β | 88.9% | 89.6% | 90.8% | 89.3% | 87.9% | 12.9 | 0 |
5 | strict-referenced-after | gpt-5 | default | β | β | 88.9% | 89.6% | 89.3% | 89.4% | 87.5% | 29.9 | 1 |
6 | ashai-experiment | gpt-5 | default | β | β | 88.1% | 90.0% | 90.8% | 89.8% | 88.3% | 31.7 | 0 |
7 | ashai | gpt-4o-mini | default | β | β | 88.0% | 89.6% | 87.5% | 89.2% | 86.8% | 6.1 | 1 |
8 | ashai | gpt-5 | default | β | β | 87.9% | 89.2% | 89.3% | 89.5% | 88.2% | 27.9 | 1 |
9 | strict-referenced-after | gpt-5 | default | β | β | 87.8% | 88.5% | 89.6% | 89.3% | 87.0% | 30.5 | 0 |
10 | ashai | gpt-4o-mini | default | β | β | 87.7% | 88.8% | 89.2% | 88.8% | 86.5% | 4.2 | 0 |
11 | strict-referenced | gpt-4o-mini | default | β | β | 87.5% | 87.9% | 87.5% | 88.8% | 84.4% | 3.3 | 0 |
12 | ashai | gpt-4o-mini | perplexity | β | β | 87.3% | 88.3% | 88.2% | 89.2% | 87.5% | 3.7 | 0 |
13 | ashai-experiment | gpt-4o-mini | default | β | β | 87.2% | 88.2% | 88.7% | 86.8% | 84.7% | 2.9 | 0 |
14 | ashai | gpt-4o-mini | default | β | β | 87.2% | 87.5% | 87.6% | 88.5% | 88.1% | 3.1 | 0 |
15 | ashai | gpt-4o-mini | default | β | β | 87.0% | 87.8% | 87.8% | 88.2% | 85.4% | 4.2 | 0 |
16 | ashai | gpt-4o-mini | default | β | β | 86.7% | 87.9% | 86.9% | 88.3% | 85.9% | 5.8 | 3 |
17 | strict-referenced-after | gpt-4o-mini | default | β | β | 86.6% | 88.3% | 85.8% | 87.8% | 84.0% | 3.2 | 0 |
18 | ashai | gpt-4o-mini | default | β | β | 86.1% | 87.9% | 88.2% | 87.7% | 83.2% | 4.6 | 0 |
19 | perplexity | sonar | default | β | β | 85.2% | 88.3% | 90.2% | 89.8% | 85.9% | 2.9 | 0 |
20 | ashai-experiment | gpt-4o-mini | default | β | β | 84.4% | 88.3% | 87.8% | 87.5% | 85.5% | 5.6 | 2 |
21 | ashai | gpt-4o-mini | perplexity | β | β | 80.0% | 80.8% | 82.1% | 83.2% | 81.0% | 6.4 | 2 |
22 | one-shot | gpt-5 | default | β | β | 78.4% | 90.0% | 90.7% | 88.7% | 88.5% | 13.6 | 0 |
23 | one-shot | gpt-4o-mini | default | β | β | 77.0% | 87.5% | 89.8% | 89.5% | 87.2% | 1.8 | 0 |
24 | strict-referenced | gpt-4o-mini | default | β | β | 76.2% | 75.0% | 76.1% | 80.2% | 75.7% | 4.1 | 0 |
25 | one-shot | gpt-4o-mini | default | β | β | 75.2% | 84.8% | 87.3% | 88.2% | 86.2% | 2.3 | 0 |
26 | strict-referenced-after | gpt-4o-mini | default | β | β | 74.8% | 78.8% | 73.2% | 75.8% | 72.5% | 5.8 | 0 |
27 | one-shot | gpt-5 | default | β | β | 74.5% | 85.4% | 84.9% | 87.8% | 86.0% | 8.3 | 0 |
v60
Changes
- none; running again in the hope that there are fewer errors
v59 - aug 27, 2025 1:42am
Changes
- ask_nihai and search_nihai promoted to be the only default tools (all others deprecated)
- error fixes
- perplexity_search no longer assumes choices exists
- ASHAI_AGENT_MAX_RETRIES = 10 agent retries (up from 2)
- backoff raised from a 0.6s to a 12s cap (see the retry sketch below)
- improved logging for search_nihai
- errors now shown on the home page leaderboard
- ask_nihai completely re-written using an agent
- going to just run some small tests to see if nihai is working
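A rough sketch of the retry-with-backoff behaviour these settings imply. ASHAI_AGENT_MAX_RETRIES is the environment variable named above; the call_with_backoff helper, the 0.6s starting delay, and the jitter are illustrative assumptions rather than the project's actual code.

```python
import os
import random
import time

MAX_RETRIES = int(os.getenv("ASHAI_AGENT_MAX_RETRIES", "10"))  # up from 2
BACKOFF_CAP_S = 12.0                                            # cap raised from 0.6s to 12s

def call_with_backoff(fn, *args, **kwargs):
    """Retry `fn` with exponential backoff, doubling the delay up to the cap."""
    delay = 0.6
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == MAX_RETRIES:
                raise
            time.sleep(min(delay, BACKOFF_CAP_S) + random.uniform(0, 0.1))
            delay *= 2
```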
Errors
Overall errors: 1/168 cases
Top runs by error rate:
- strict-referenced-after | gpt-5 | tools=default | full | v=v59 | 20250827_012837: 1/12 (8%)
- ashai | gpt-4o-mini | tools=default | full | v=v59 [no-reasoning] | 20250827_004203: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v59 [retry] | 20250827_005655: 0/12 (0%)
- ashai | gpt-4o-mini | tools=default | full | v=v59 | 20250827_003944: 0/12 (0%)
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v59 | 20250827_005846: 0/12 (0%)
Common error messages:
- "1 validation error for AshaiResponse references.0.url Input should be a valid string [type=string_type, input_value=None, input_type=NoneType] For further information visit https://errors.pydantic.dev/2.11/v/string_type" ×1
Error categories:
- other ×1
Origins:
- agent:direct ×1
AI analysis:
Error Analysis Summary
- Significant Error Reduction: Overall errors decreased to 1/168 from 46/960, indicating substantial improvement in evaluation performance.
- Isolated Error Occurrence: The only error was from "strict-referenced-after" with gpt-5, suggesting a specific issue rather than widespread failures.
- Common Validation Error: The persistent validation error regarding references.0.url indicates potential input handling issues in the AshaiResponse model (see the sketch below).
- Model Stability Improvement: The reduction in errors suggests enhanced model stability and better handling of requests compared to previous evaluations.
- Next Steps: Investigate the validation error further, refine input handling, and continue monitoring for any emerging error patterns.
- Comparison to Previous Entry: Error rates have drastically improved, highlighting effective adjustments made since the last evaluation.
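One plausible shape of a fix for the references.0.url validation error is to let the URL field be optional (or to drop reference entries without URLs before building the response). A minimal pydantic sketch, assuming field names taken from the error message; the real AshaiResponse model may differ.

```python
from typing import Optional

from pydantic import BaseModel

class Reference(BaseModel):
    title: str
    # The v59 validation error came from `references.0.url` being None while the
    # field was declared as a plain `str`; making it Optional (or filtering out
    # reference entries without a URL before constructing the response) avoids it.
    url: Optional[str] = None

class AshaiResponse(BaseModel):
    answer: str
    references: list[Reference] = []
```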
Observations
- Top performer: ashai-experiment | gpt-5 – 88.4%
- Fastest: one-shot | gpt-4o-mini – 2.6s
- Top Performer: ashai-experiment | gpt-5 achieved 88.4%, a notable improvement from previous entries.
- Error Rate Decrease: Overall errors dropped to 1/168 cases, significantly better than 46/960 in v56.
- Speed Outlier: one-shot | gpt-4o-mini averaged 2.6 seconds, the fastest recorded time.
- Notable Error Hotspot: strict-referenced-after | gpt-5 had a single error rate of 8%, indicating potential instability.
- Zero Errors: ashai | gpt-4o-mini recorded 0 errors across multiple runs, showcasing strong reliability.
- Close Competition: strict-referenced | gpt-5 followed closely at 88.3%, maintaining competitive performance.
- Common Error Message: A single validation error related to input type was noted, suggesting a minor compatibility issue.
Leaderboard v59
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | default | β | β | 88.4% | 90.4% | 90.4% | 90.2% | 88.5% | 25.1 |
2 | strict-referenced | gpt-5 | default | β | β | 88.3% | 89.2% | 89.4% | 88.8% | 87.8% | 39.7 |
3 | ashai | gpt-5 | default | β | β | 88.1% | 90.4% | 90.2% | 89.4% | 89.6% | 43.3 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 86.7% | 87.5% | 87.2% | 89.2% | 86.8% | 6.9 |
5 | ashai | gpt-4o-mini | default | β | β | 86.5% | 87.9% | 89.2% | 88.0% | 86.2% | 6.8 |
6 | ashai | gpt-4o-mini | default | β | β | 85.5% | 87.9% | 89.7% | 88.9% | 86.7% | 9.1 |
7 | perplexity | sonar | default | β | β | 84.3% | 88.8% | 89.2% | 89.8% | 85.8% | 3.5 |
8 | ashai | gpt-4o-mini | default | β | β | 83.5% | 86.4% | 87.2% | 89.2% | 86.2% | 71.8 |
9 | ashai-experiment | gpt-4o-mini | default | β | β | 81.2% | 80.4% | 82.2% | 88.6% | 81.7% | 6.2 |
10 | strict-referenced-after | gpt-5 | default | β | β | 80.8% | 81.4% | 82.4% | 81.2% | 80.4% | 42.8 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 78.8% | 80.0% | 81.9% | 84.0% | 80.1% | 4.0 |
12 | one-shot | gpt-5 | default | β | β | 77.9% | 89.2% | 89.6% | 88.6% | 89.5% | 17.7 |
13 | one-shot | gpt-4o-mini | default | β | β | 77.5% | 88.8% | 90.1% | 89.8% | 88.0% | 2.6 |
14 | strict-referenced-after | gpt-4o-mini | default | β | β | 74.7% | 73.3% | 76.2% | 79.2% | 73.2% | 6.6 |
v56 - aug 26, 2025 9:01pm
Changes
- We're gonna do a massive run and test nihai against perplexity! Also trying gpt-5 runs.
Errors
Overall errors: 46/960 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v56 [retry] | 20250826_140225: 12/12 (100%)
- strict-referenced | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 | 20250826_143309: 8/12 (67%)
- strict-referenced | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 [no-reasoning] | 20250826_142351: 7/12 (58%)
- ashai | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 [retry] | 20250826_141646: 6/12 (50%)
- strict-referenced-after | gpt-5 | tools=search_nihai,ask_nihai | full | v=v56 [no-reasoning] | 20250826_141939: 4/12 (33%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-5]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×28
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×11
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×1
- "list index out of range" ×1
- "Backoff failure after 2 attempt(s) [agent:openai:gpt-5]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×1
Error categories:
- other ×41
- api_connection ×1
Origins:
- agent:direct ×42
Final attempt counts:
- after 1 attempt(s) ×39
- after 2 attempt(s) ×1
- after 3 attempt(s) ×1
AI analysis:
Error Analysis Summary
- Error Rate Increase: Overall errors rose to 46/960, up from 46/480, indicating worsening performance across evaluations.
- Persistent NotImplementedError: The "NotImplementedError" related to meta tensors remains the most frequent error, suggesting unresolved compatibility issues (see the sketch below).
- High Error Rates in Specific Runs: ashai with gpt-4o-mini recorded a 100% error rate, while strict-referenced gpt-5 reached 67%.
- Retry Failures: Most errors occurred after one or two attempts, indicating potential issues with model stability and request handling.
- Next Steps: Investigate the "NotImplementedError" further, optimize model configurations, and improve error handling to mitigate failure rates.
- Continuous Monitoring Needed: Ongoing tracking of error patterns is crucial to identify and resolve recurring issues in future evaluations.
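The meta-tensor NotImplementedError usually means a module whose parameters were created on PyTorch's meta device was moved with .to() instead of .to_empty(). A minimal reproduction-and-fix sketch, assuming the embedding model used by the nihai tools is loaded this way; it is not the project's actual loading code.

```python
import torch.nn as nn

# Parameters created on the "meta" device have shapes but no storage, so
# .to("cpu") / .to("cuda") has nothing to copy and raises the
# NotImplementedError seen in these runs.
model = nn.Linear(768, 768, device="meta")

# model.to("cpu")                      # raises: cannot copy out of meta tensor
model = model.to_empty(device="cpu")   # allocates real (uninitialized) storage
# model.load_state_dict(state_dict)    # then populate the weights from a checkpoint
```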
Observations
- Top performer: strict-referenced | gpt-5 – 89.5%
- Fastest: ashai | gpt-4o-mini – 1.4s
- Top Performer: ashai | gpt-5 achieved 89.4%, slightly down from 89.5% in v56.
- Close Competition: strict-referenced | gpt-5 closely followed at 88.9%, maintaining strong performance.
- Significant Error Rate: Overall errors increased to 46/960, up from 46/480 in v56, indicating persistent issues.
- Critical Error Hotspot: ashai | gpt-4o-mini recorded a 100% error rate in one instance, mirroring previous concerns.
- Common Errors: "NotImplementedError" remains prevalent, appearing 28 times, highlighting ongoing compatibility challenges.
- Speed Outlier: ashai | gpt-4o-mini averaged 1.4 seconds, significantly faster than previous entries.
- Retry Failures: Most errors occurred after one attempt, suggesting instability in request handling persists.
Leaderboard v56
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 89.5% | 91.7% | 90.4% | 89.6% | 88.2% | 20.7 |
2 | ashai | gpt-5 | nihai | β | β | 89.4% | 90.0% | 90.8% | 89.7% | 88.3% | 19.8 |
3 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 89.1% | 89.6% | 89.8% | 89.8% | 89.2% | 23.2 |
4 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 88.9% | 90.2% | 90.2% | 89.8% | 88.2% | 27.0 |
5 | strict-referenced | gpt-5 | perplexity | β | β | 88.9% | 89.8% | 90.0% | 89.8% | 88.6% | 18.7 |
6 | strict-referenced | gpt-5 | perplexity | β | β | 88.8% | 90.0% | 89.2% | 90.4% | 88.5% | 23.3 |
7 | strict-referenced | gpt-5 | perplexity | β | β | 88.7% | 90.8% | 88.4% | 89.3% | 85.6% | 42.6 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.6% | 89.2% | 90.1% | 89.3% | 86.3% | 40.4 |
9 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.5% | 89.4% | 90.0% | 89.2% | 86.7% | 46.5 |
10 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.5% | 90.0% | 90.5% | 88.2% | 85.8% | 19.2 |
11 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 88.4% | 89.2% | 88.6% | 89.7% | 84.8% | 3.3 |
12 | ashai-experiment | gpt-5 | perplexity | β | β | 88.3% | 90.0% | 90.4% | 90.2% | 89.5% | 14.8 |
13 | strict-referenced | gpt-5 | perplexity | β | β | 88.3% | 89.6% | 88.9% | 88.7% | 87.2% | 51.9 |
14 | ashai | gpt-5 | nihai, ask nihai | β | β | 88.1% | 89.6% | 90.2% | 89.8% | 87.9% | 21.0 |
15 | ashai-experiment | gpt-5 | perplexity | β | β | 88.1% | 88.8% | 90.0% | 89.8% | 88.3% | 17.2 |
16 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 88.0% | 88.8% | 90.6% | 89.0% | 86.3% | 31.6 |
17 | ashai | gpt-4o-mini | nihai, ask nihai | β | β | 88.0% | 88.8% | 89.2% | 89.6% | 88.0% | 5.6 |
18 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.0% | 88.3% | 90.0% | 88.6% | 86.4% | 18.7 |
19 | ashai | gpt-4o-mini | nihai | β | β | 87.9% | 88.8% | 88.3% | 89.4% | 84.9% | 3.5 |
20 | ashai-experiment | gpt-5 | perplexity | β | β | 87.9% | 90.0% | 90.4% | 90.4% | 89.8% | 38.2 |
21 | ashai | gpt-5 | perplexity | β | β | 87.8% | 90.5% | 91.5% | 89.8% | 88.5% | 14.7 |
22 | ashai-experiment | gpt-5 | perplexity | β | β | 87.8% | 89.8% | 90.4% | 90.0% | 89.8% | 40.8 |
23 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 87.6% | 87.2% | 89.8% | 90.1% | 87.0% | 4.0 |
24 | ashai | gpt-4o-mini | nihai | β | β | 87.6% | 88.8% | 88.2% | 89.4% | 85.3% | 3.9 |
25 | ashai | gpt-5 | perplexity | β | β | 87.6% | 90.0% | 89.6% | 89.8% | 88.2% | 16.6 |
26 | ashai | gpt-5 | nihai | β | β | 87.5% | 90.4% | 89.3% | 89.2% | 88.0% | 17.1 |
27 | strict-referenced | gpt-4o-mini | perplexity | β | β | 87.5% | 89.6% | 89.0% | 89.5% | 88.0% | 7.4 |
28 | ashai | gpt-5 | perplexity | β | β | 87.3% | 89.8% | 89.6% | 89.8% | 87.2% | 38.0 |
29 | ashai | gpt-4o-mini | perplexity | β | β | 87.2% | 87.9% | 87.9% | 89.2% | 87.5% | 6.2 |
30 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.2% | 87.9% | 88.6% | 89.7% | 85.5% | 11.3 |
31 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.1% | 88.8% | 89.2% | 89.9% | 87.3% | 7.0 |
32 | ashai | gpt-4o-mini | perplexity | β | β | 86.9% | 88.8% | 87.2% | 90.2% | 88.3% | 5.4 |
33 | strict-referenced | gpt-4o-mini | perplexity | β | β | 86.6% | 87.8% | 88.3% | 89.8% | 87.5% | 5.7 |
34 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.5% | 88.3% | 89.2% | 89.8% | 88.2% | 4.9 |
35 | ashai | gpt-4o-mini | nihai | β | β | 86.4% | 88.2% | 89.5% | 89.6% | 86.1% | 3.4 |
36 | ashai | gpt-4o-mini | perplexity | β | β | 86.4% | 87.5% | 89.2% | 88.5% | 84.8% | 4.8 |
37 | ashai | gpt-5 | perplexity | β | β | 86.2% | 89.2% | 89.9% | 89.0% | 89.4% | 37.7 |
38 | ashai | gpt-4o-mini | nihai | β | β | 86.2% | 87.5% | 88.8% | 90.4% | 84.8% | 3.7 |
39 | ashai | gpt-4o-mini | perplexity | β | β | 86.1% | 87.5% | 88.8% | 89.6% | 85.2% | 4.8 |
40 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.0% | 88.2% | 88.8% | 89.8% | 86.9% | 7.6 |
41 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 85.4% | 86.7% | 87.3% | 89.5% | 86.5% | 7.1 |
42 | perplexity | sonar | default | β | β | 85.0% | 87.4% | 90.2% | 89.5% | 86.3% | 3.3 |
43 | perplexity | sonar | default | β | β | 85.0% | 88.2% | 89.3% | 88.6% | 85.2% | 3.9 |
44 | strict-referenced | gpt-4o-mini | perplexity | β | β | 84.8% | 85.4% | 87.3% | 88.3% | 83.8% | 4.6 |
45 | ashai | gpt-4o-mini | ask nihai | β | β | 84.0% | 86.9% | 84.3% | 85.9% | 83.5% | 6.7 |
46 | strict-referenced | gpt-4o-mini | perplexity | β | β | 83.3% | 84.6% | 82.9% | 88.0% | 85.7% | 7.7 |
47 | strict-referenced | gpt-4o-mini | perplexity | β | β | 83.1% | 85.7% | 80.8% | 85.7% | 81.9% | 5.9 |
48 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 81.7% | 80.4% | 82.3% | 85.8% | 78.5% | 4.3 |
49 | ashai | gpt-4o-mini | perplexity | β | β | 80.9% | 82.5% | 81.8% | 84.4% | 81.6% | 4.7 |
50 | one-shot | gpt-5 | default | β | β | 80.9% | 90.0% | 91.2% | 89.7% | 89.7% | 24.0 |
51 | ashai | gpt-4o-mini | nihai, ask nihai | β | β | 80.9% | 82.1% | 82.1% | 82.5% | 79.2% | 9.6 |
52 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 80.7% | 80.4% | 82.8% | 82.0% | 79.6% | 3.8 |
53 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 80.5% | 81.7% | 80.8% | 82.2% | 79.6% | 4.2 |
54 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 80.4% | 81.4% | 81.1% | 82.7% | 77.5% | 4.4 |
55 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 80.3% | 80.4% | 83.6% | 84.8% | 83.3% | 10.2 |
56 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 79.9% | 80.9% | 82.1% | 82.5% | 77.7% | 5.7 |
57 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 79.6% | 81.7% | 78.8% | 81.7% | 80.8% | 8.4 |
58 | ashai | gpt-4o-mini | perplexity | β | β | 79.2% | 80.0% | 82.3% | 82.3% | 79.2% | 5.2 |
59 | one-shot | gpt-5 | default | β | β | 78.6% | 90.0% | 89.6% | 89.8% | 89.3% | 18.2 |
60 | one-shot | gpt-5 | default | β | β | 78.1% | 90.0% | 90.4% | 89.8% | 89.6% | 8.4 |
61 | one-shot | gpt-5 | default | β | β | 78.1% | 89.2% | 90.2% | 88.9% | 89.0% | 7.1 |
62 | one-shot | gpt-4o-mini | default | β | β | 77.6% | 87.5% | 89.0% | 89.8% | 88.3% | 3.0 |
63 | one-shot | gpt-4o-mini | default | β | β | 77.6% | 88.3% | 90.8% | 89.8% | 88.3% | 5.4 |
64 | one-shot | gpt-4o-mini | default | β | β | 77.2% | 87.5% | 90.0% | 89.8% | 88.0% | 2.9 |
65 | one-shot | gpt-4o-mini | default | β | β | 77.1% | 88.4% | 90.4% | 89.9% | 88.3% | 3.4 |
66 | one-shot | gpt-4o-mini | default | β | β | 76.4% | 86.2% | 89.0% | 89.8% | 87.1% | 3.0 |
67 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 76.3% | 77.5% | 76.7% | 79.8% | 76.0% | 13.4 |
68 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 75.4% | 74.2% | 77.7% | 78.1% | 76.8% | 6.9 |
69 | strict-referenced | gpt-4o-mini | perplexity | β | β | 75.3% | 74.2% | 75.8% | 79.3% | 75.1% | 5.7 |
70 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 74.8% | 75.5% | 73.6% | 80.0% | 75.3% | 9.2 |
71 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 74.5% | 72.9% | 75.4% | 77.3% | 74.4% | 5.0 |
72 | one-shot | gpt-4o-mini | default | β | β | 71.4% | 81.7% | 83.3% | 83.3% | 84.2% | 2.5 |
73 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 66.6% | 66.0% | 68.1% | 71.1% | 68.3% | 9.8 |
74 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 59.9% | 57.9% | 61.2% | 63.3% | 60.7% | 13.4 |
75 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 59.1% | 59.6% | 60.0% | 59.8% | 59.8% | 22.8 |
76 | strict-referenced-after | gpt-5 | nihai, ask nihai | β | β | 59.1% | 59.6% | 59.8% | 60.0% | 59.0% | 18.5 |
77 | ashai | gpt-5 | nihai, ask nihai | β | β | 43.6% | 45.0% | 45.4% | 45.0% | 44.0% | 11.9 |
78 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 36.9% | 37.5% | 37.1% | 37.5% | 35.8% | 18.8 |
79 | strict-referenced | gpt-5 | nihai, ask nihai | β | β | 29.4% | 29.6% | 30.0% | 30.0% | 29.2% | 9.9 |
80 | ashai | gpt-4o-mini | ask nihai | β | β | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.4 |
v55 - aug 26, 2025 12:14pm
Changes
Note - I accidentally ran v55 twice!
- yet another fix. I think we got all the errors!
- all errors fixed
- We're gonna do a massive run and test nihai against perplexity!
- emphasize that the LLM should use search_nihai before ask_nihai
Errors
Overall errors: 9/432 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v55 [retry] | 20250826_041428: 4/12 (33%)
- strict-referenced | gpt-5 | tools=search_perplexity | full | v=v55 | 20250826_031247: 3/12 (25%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v55 [no-reasoning] | 20250826_034143: 1/12 (8%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v55 [retry] | 20250826_040806: 1/12 (8%)
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v55 [retry] | 20250826_023034: 0/12 (0%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×3
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-5]: APITimeoutError: Request timed out." ×2
- "list index out of range" ×2
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: ModelHTTPError: status_code: 400, model_name: gpt-4o-mini, body: {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 136332 tokens (136248 in the messages, 84 in the functions). Please reduce the length of the messages or functions.", 'type': 'invalid_request_error', 'param': 'messages'" ×1
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-5]: APITimeoutError: Request timed out." ×1
Error categories:
- other ×6
- timeout ×3
Origins:
- agent:direct ×9
Final attempt counts:
- after 1 attempt(s) ×6
- after 3 attempt(s) ×1
AI analysis:
Error Analysis Summary
- Error Rate Stability: Current evaluation maintains a consistent error rate of 9/432, similar to the previous entry.
- Frequent Errors: "NotImplementedError" and "APITimeoutError" persist as common issues, indicating ongoing compatibility and timeout challenges.
- Context Limit Exceeded: One instance of exceeding token limits suggests a need for stricter message length management (see the trimming sketch below).
- Retry Patterns: Most errors occurred after one or three attempts, highlighting potential instability in request handling.
- Next Steps: Investigate specific error messages, optimize token usage, and enhance timeout handling to improve performance.
- Monitoring Required: Continue tracking error trends to identify persistent issues and implement necessary fixes in future evaluations.
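For the context-length failure (136,332 tokens against the 128,000-token limit), one common mitigation is to trim the oldest turns before calling the model. A hedged sketch using tiktoken; the token budget, message shape, and trimming policy here are assumptions, not the agent's actual history management.

```python
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")  # tokenizer family used by gpt-4o models
MAX_PROMPT_TOKENS = 120_000                # leave headroom under the 128k limit

def count_tokens(messages: list[dict]) -> int:
    """Rough token count; per-message overhead is ignored for simplicity."""
    return sum(len(ENC.encode(m.get("content") or "")) for m in messages)

def trim_history(messages: list[dict]) -> list[dict]:
    """Drop the oldest non-system turns until the conversation fits the budget."""
    trimmed = list(messages)
    while count_tokens(trimmed) > MAX_PROMPT_TOKENS and len(trimmed) > 2:
        # keep the system prompt at index 0; drop the oldest user/assistant turn
        trimmed.pop(1)
    return trimmed
```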
Observations
- Nihai is outperforming perplexity! And is fast! Like crazy fast! 4.4s!
- Top performer: strict-referenced | gpt-5 – 89.4%
- Fastest: one-shot | gpt-4o-mini – 2.6s
- Top performer: strict-referenced | gpt-5 maintained the lead with 89.4%, a slight increase from the previous entry.
- Close competition: ashai | gpt-5 follows closely at 88.7%, just 0.7% behind the leader.
- Speed outlier: one-shot | gpt-4o-mini achieved an impressive average of 2.6 seconds, significantly faster than previous entries.
- Error increase: Overall errors rose to 9/432 cases, a notable jump from 0 errors in the last evaluation.
- Error hotspots: High error rates observed in ashai | gpt-4o-mini and strict-referenced | gpt-5, particularly with timeout issues.
- Common errors: "Backoff failure" and "APITimeoutError" were frequently reported, indicating persistent issues in request handling.
- Final attempt counts: Multiple attempts were often required, suggesting instability in performance across various models.
Leaderboard v55
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | perplexity | β | β | 89.4% | 90.8% | 90.4% | 89.7% | 88.8% | 17.9 |
2 | ashai | gpt-5 | perplexity | β | β | 88.7% | 90.4% | 90.0% | 90.2% | 86.9% | 34.2 |
3 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.6% | 89.6% | 91.1% | 89.8% | 86.9% | 18.3 |
4 | ashai | gpt-5 | perplexity | β | β | 88.6% | 90.0% | 90.4% | 89.8% | 87.2% | 14.8 |
5 | ashai | gpt-4o-mini | nihai, ask nihai | β | β | 88.4% | 89.2% | 89.2% | 89.3% | 86.7% | 4.4 |
6 | ashai-experiment | gpt-5 | perplexity | β | β | 88.4% | 90.0% | 90.4% | 90.0% | 89.2% | 16.0 |
7 | ashai-experiment | gpt-5 | perplexity | β | β | 88.2% | 90.4% | 90.4% | 90.0% | 88.9% | 35.3 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.1% | 89.6% | 90.6% | 89.6% | 88.6% | 42.4 |
9 | ashai | gpt-4o-mini | ask nihai | β | β | 87.9% | 88.2% | 88.8% | 89.4% | 86.0% | 7.8 |
10 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 87.7% | 88.8% | 88.9% | 89.2% | 87.1% | 3.7 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 87.5% | 90.4% | 90.0% | 90.2% | 86.5% | 8.1 |
12 | ashai | gpt-4o-mini | perplexity | β | β | 87.3% | 88.8% | 87.1% | 89.8% | 85.0% | 4.8 |
13 | ashai | gpt-4o-mini | nihai | β | β | 87.2% | 88.3% | 88.8% | 88.0% | 86.0% | 4.1 |
14 | ashai | gpt-4o-mini | nihai | β | β | 87.0% | 88.8% | 88.8% | 89.5% | 86.7% | 4.3 |
15 | ashai | gpt-4o-mini | nihai | β | β | 86.8% | 88.3% | 89.4% | 90.2% | 86.0% | 4.1 |
16 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.2% | 87.9% | 87.7% | 89.4% | 85.7% | 6.9 |
17 | perplexity | sonar | default | β | β | 85.8% | 88.8% | 88.8% | 89.0% | 87.0% | 3.5 |
18 | strict-referenced | gpt-4o-mini | nihai, ask nihai | β | β | 83.1% | 86.2% | 84.8% | 85.2% | 81.1% | 5.2 |
19 | ashai | gpt-4o-mini | perplexity | β | β | 81.3% | 81.7% | 82.5% | 85.7% | 81.7% | 8.2 |
20 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 81.2% | 83.3% | 82.8% | 84.2% | 82.5% | 7.1 |
21 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 80.9% | 80.0% | 83.8% | 82.1% | 78.3% | 5.6 |
22 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 80.7% | 80.8% | 82.1% | 84.0% | 82.9% | 9.5 |
23 | strict-referenced | gpt-4o-mini | perplexity | β | β | 79.7% | 79.6% | 79.3% | 84.8% | 79.1% | 8.2 |
24 | strict-referenced | gpt-4o-mini | perplexity | β | β | 79.5% | 80.8% | 80.6% | 82.2% | 79.8% | 5.3 |
25 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 79.1% | 77.9% | 80.4% | 82.8% | 79.4% | 9.0 |
26 | one-shot | gpt-5 | default | β | β | 78.1% | 89.6% | 91.2% | 89.0% | 89.6% | 18.4 |
27 | one-shot | gpt-5 | default | β | β | 77.8% | 90.0% | 90.7% | 89.0% | 88.5% | 8.7 |
28 | one-shot | gpt-4o-mini | default | β | β | 77.5% | 88.8% | 89.4% | 89.8% | 89.7% | 3.1 |
29 | one-shot | gpt-4o-mini | default | β | β | 77.4% | 89.2% | 90.1% | 90.4% | 88.5% | 2.7 |
30 | one-shot | gpt-4o-mini | default | β | β | 76.9% | 87.1% | 89.4% | 89.7% | 87.3% | 2.6 |
31 | strict-referenced | gpt-4o-mini | perplexity | β | β | 73.3% | 72.5% | 74.2% | 78.2% | 74.0% | 5.7 |
32 | strict-referenced-after | gpt-4o-mini | nihai, ask nihai | β | β | 69.2% | 69.2% | 67.1% | 74.0% | 66.7% | 4.1 |
33 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 67.6% | 67.9% | 68.1% | 70.2% | 66.3% | 9.7 |
34 | strict-referenced | gpt-5 | perplexity | β | β | 66.5% | 67.5% | 67.5% | 67.5% | 67.2% | 98.7 |
35 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 65.5% | 65.9% | 65.9% | 69.1% | 63.5% | 10.4 |
36 | ashai | gpt-4o-mini | ask nihai | β | β | 58.6% | 58.8% | 58.9% | 59.8% | 58.9% | 6.1 |
v54 - aug 26, 2025 2:22am
Changes
Errors
Overall errors: 7/24 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v54 [retry] | 20250826_022134: 7/12 (58%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v54 [retry] | 20250826_022011: 0/12 (0%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device." ×6
- "Backoff failure after 2 attempt(s) [agent:openai:gpt-4o-mini]: UnexpectedStatusCodeError: Object was not added! Unexpected status code: 422, with response body: {'error': [{'message': "id '229fd26b-f8d8-59fa-88ad-a0026e60e4a0' already exists"}]}." ×1
Error categories:
- other ×7
Origins:
- agent:direct ×7
Final attempt counts:
- after 1 attempt(s) ×6
- after 2 attempt(s) ×1
AI analysis:
Error Analysis Summary
- Error Rate Improvement: Overall errors decreased from 24/60 to 7/24, indicating significant progress in reliability.
- High Performance in Search Tool: The "search_nihai" tool achieved a 0% error rate, suggesting effective handling of requests.
- Persistent NotImplementedError: The "NotImplementedError" remains a concern, indicating potential issues with tensor handling in the model.
- Single 422 Error: The occurrence of a 422 error suggests isolated input validation issues, requiring targeted investigation.
- Next Steps: Address the NotImplementedError by reviewing tensor management; continue monitoring error rates for further improvements.
- Comparison to Previous Entry: Notable reduction in errors from v52, indicating successful adjustments or fixes implemented since the last evaluation.
Observations
- Top performer: ashai | gpt-4o-mini – 85.4%
- Fastest: ashai | gpt-4o-mini – 4.4s
- Top performer: ashai | gpt-4o-mini | nihai achieved 85.4%, a slight improvement from 84.9% in v52.
- Significant gap: The second-place entry trails by 3%, indicating a widening performance disparity.
- Speed outlier: ashai | gpt-4o-mini | ask nihai recorded an average of 4.9 seconds, slower than previous entries.
- Error hotspots: Overall errors decreased to 7/24 cases, showing improvement from 24/60 in v52.
- Common errors: Persistent "NotImplementedError" and "UnexpectedStatusCodeError" indicate ongoing issues with model handling.
- Final attempt counts: Most errors occurred after a single attempt, suggesting initial request handling still needs attention.
Leaderboard v54
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | nihai | β | β | 85.4% | 86.9% | 88.8% | 90.0% | 86.5% | 4.4 |
2 | ashai | gpt-4o-mini | ask nihai | β | β | 36.8% | 37.1% | 37.9% | 37.9% | 36.5% | 4.9 |
v53
Changes
- fixed tool input format error in ask_nihai
Errors
Overall errors: 12/24 cases
Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v53 [retry] | 20250826_021607: 12/12 (100%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v53 [retry] | 20250826_021446: 0/12 (0%)
Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: UnexpectedModelBehavior: Tool '_tool_wrap_nihai_ask' exceeded max retries count of 1" ×8
Error categories:
- other ×8
Origins:
- agent:direct ×8
Final attempt counts:
- after 1 attempt(s) ×8
AI analysis:
Error Analysis Summary¶
- Error Rate Improvement: Overall errors decreased from 24/60 to 12/24, indicating a positive trend in reliability.
- Persistent Tool Issues: The "ask_nihai" tool still shows a 100% error rate, requiring urgent attention.
- Common Error Messages: "UnexpectedModelBehavior" errors persist, suggesting issues with tool handling or input processing.
- Retry Limit Reached: Frequent backoff failures indicate that the tool is not recovering from errors effectively.
- Next Steps: Investigate the "ask_nihai" tool's logic; consider increasing retry limits or refining input validation (see the sketch after this list).
- Comparison to Previous Entry: The reduction in overall errors is promising, but the "ask_nihai" tool's performance remains a critical concern.
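As one concrete direction for those next steps, here is a hedged sketch of hardening the ask_nihai call with input validation and a small retry budget. The endpoint URL matches the one in the v52 error log further down; the payload field name, retry count, and function shape are assumptions, not the repo's actual tool wrapper.

```python
import httpx

ASK_NIHAI_URL = "http://localhost:8000/search/nihai/ask"  # URL taken from the v52 error log

def ask_nihai(question: str, max_attempts: int = 3) -> dict:
    """Validate input up front, then retry transient failures; 422s fail immediately."""
    if not isinstance(question, str) or not question.strip():
        raise ValueError("ask_nihai expects a non-empty question string")
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            response = httpx.post(ASK_NIHAI_URL, json={"question": question}, timeout=30)
            response.raise_for_status()
            return response.json()
        except httpx.HTTPStatusError as exc:
            if exc.response.status_code == 422:
                raise  # malformed request body; retrying the same payload won't help
            last_error = exc
        except httpx.TransportError as exc:
            last_error = exc
    raise RuntimeError(f"ask_nihai failed after {max_attempts} attempts") from last_error
```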
Observations¶
- Top performer: ashai | gpt-4o-mini – 84.2%
- Fastest: ashai | gpt-4o-mini – 0.3s
- Top performer: ashai | gpt-4o-mini | nihai achieved 84.2%, slightly down from 84.9% in v52.
- Significant gap: Second place trails by 1.8%, indicating a narrowing competitive field.
- Speed outlier: ashai | gpt-4o-mini | ask nihai recorded an average of 0.3 seconds, significantly faster than previous entries.
- Error hotspots: Overall errors decreased to 12/24 cases, but ashai | gpt-4o-mini | ask nihai still shows a 100% error rate.
- Common errors: "UnexpectedModelBehavior" persists, indicating ongoing issues with the ask nihai tool.
- Error reduction: Overall errors improved from 24 in v52 to 12 in v53, marking a positive trend in reliability.
- Final attempt counts: Most errors occurred after a single attempt, suggesting initial request handling remains problematic.
Leaderboard v53¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | nihai | β | β | 84.2% | 87.9% | 88.6% | 89.7% | 88.1% | 4.3 |
2 | ashai | gpt-4o-mini | ask nihai | β | β | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.3 |
v52 - Aug 26 1:00 am¶
Changes¶
- 0 references now hard-codes reference_sufficiency to 0 for med accuracy
- trying nihai out! small batch here
- documented tools - lots of docs changes
- renamed a lot of niharika FAQ to google sheets (it was still poorly renamed)
Errors¶
Overall errors: 24/60 cases

Top runs by error rate:
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v52 [retry] | 20250826_020748: 12/12 (100%)
- ashai | gpt-4o-mini | tools=ask_nihai | full | v=v52 [retry] | 20250826_015655: 11/12 (92%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v52 [retry] | 20250826_015759: 1/12 (8%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v52 [retry] | 20250826_015521: 0/12 (0%)
- ashai | gpt-4o-mini | tools=search_nihai | full | v=v52 [retry] | 20250826_020558: 0/12 (0%)

Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: HTTPStatusError: Client error '422 Unprocessable Content' for url 'http://localhost:8000/search/nihai/ask' For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422" ×8
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: UnexpectedModelBehavior: Tool '_tool_wrap_nihai_ask' exceeded max retries count of 1" ×8
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×1

Error categories:
- other ×16
- api_connection ×1

Origins:
- agent:direct ×17

Final attempt counts:
- after 1 attempt(s) ×16
- after 3 attempt(s) ×1
AI analysis:
Error Analysis Summary¶
- Consistent Error Rate: Overall errors remain at 24/60, unchanged from the previous entry, indicating persistent issues.
- High Failure Rate: The "ask_nihai" tool continues to show a 100% error rate, suggesting severe reliability problems.
- Frequent 422 Errors: The "422 Unprocessable Content" error persists, indicating ongoing issues with input validation or API request formatting.
- Connection Issues: A single APIConnectionError suggests potential network reliability problems that need further investigation.
- Next Steps: Review and refine API configurations for "ask_nihai"; conduct network stability tests to address connectivity issues.
- Regression Analysis: Compare recent changes to identify specific modifications contributing to sustained error rates.
Observations¶
- Top performer: ashai | gpt-4o-mini – 84.9%
- Fastest: ashai | gpt-4o-mini – 0.8s
- Top performer: ashai | gpt-4o-mini | nihai achieved 84.9%, down from 89.7% in v51, indicating a performance decline.
- Significant gap: The second-place entry trails by 2.5%, highlighting a notable drop in competitive performance.
- Speed outlier: ashai | gpt-4o-mini | ask nihai recorded an average of 0.8 seconds, significantly faster than previous entries.
- Error hotspots: Overall errors increased to 24/60 cases, with ashai | gpt-4o-mini | ask nihai showing a 100% error rate.
- Common errors: Persistent "422 Unprocessable Content" and "UnexpectedModelBehavior" errors indicate ongoing issues with input handling.
- Error increase: Overall errors rose from 12 in v51 to 24 in v52, marking a regression in reliability.
- Final attempt counts: Most errors occurred after a single attempt, suggesting issues with initial request handling.
Leaderboard v52¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | nihai | β | β | 84.9% | 87.9% | 89.3% | 89.8% | 86.8% | 5.3 |
2 | ashai | gpt-4o-mini | nihai | β | β | 82.4% | 87.5% | 89.2% | 89.6% | 88.8% | 6.6 |
3 | ashai | gpt-4o-mini | nihai | β | β | 77.1% | 79.6% | 80.9% | 81.0% | 81.1% | 4.7 |
4 | ashai | gpt-4o-mini | ask nihai | β | β | 6.8% | 8.3% | 7.9% | 7.5% | 7.5% | 2.8 |
5 | ashai | gpt-4o-mini | ask nihai | β | β | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.8 |
v51¶
Changes¶
- fixed hopefully last of errors
- evaluating no-reasoning runs.
- evaluator names are now good hopefully
Errors¶
- Hell yeah no errors!! Woot! First run with none
Observations¶
- turning off reasoning doesn't really impact gpt-5 (in fact it sometimes beats the reasoning version - maybe it's not really how to turn it off), except it halves the time to complete
- one-shot oddly still gets 90% on med acc - it should not be able to
- one-shot on gpt-4 is pretty much as good as gpt-5
- perplexity is the shit and a continued safe and strong performer - also super fast
- Top performer: ashai-experiment | gpt-5 – 89.7%
- eval names look great in weave
- Fastest: one-shot | gpt-4o-mini – 2.6s
- Top performer: ashai | gpt-5 achieved 89.7%, surpassing previous leaders and indicating strong model performance.
- Speed outlier: one-shot | gpt-4o-mini averaged 2.6 seconds, the fastest in this evaluation, improving from previous entries.
- Error reduction: Overall errors decreased to 0/312 cases, a substantial improvement from 2 errors in v50.
- Performance consistency: ashai runs with both gpt-4o-mini and gpt-5 showed no errors, reflecting improved reliability.
Leaderboard v51¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | perplexity | β | β | 89.7% | 90.8% | 90.0% | 89.6% | 90.0% | 27.4 |
2 | strict-referenced | gpt-5 | perplexity | β | β | 89.2% | 90.0% | 90.2% | 89.8% | 88.3% | 32.0 |
3 | ashai | gpt-5 | perplexity | β | β | 89.1% | 90.4% | 90.8% | 89.7% | 87.9% | 16.9 |
4 | ashai | gpt-5 | perplexity | β | β | 88.9% | 89.6% | 91.4% | 89.5% | 88.7% | 23.0 |
5 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.8% | 89.2% | 90.4% | 89.6% | 88.9% | 47.9 |
6 | one-shot | gpt-5 | default | β | β | 88.7% | 90.0% | 89.7% | 88.0% | 87.7% | 22.0 |
7 | ashai-experiment | gpt-5 | perplexity | β | β | 88.6% | 90.6% | 90.2% | 89.9% | 88.9% | 19.1 |
8 | one-shot | gpt-4o-mini | default | β | β | 88.6% | 90.0% | 90.4% | 89.8% | 87.5% | 3.9 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 88.4% | 89.2% | 89.0% | 89.5% | 89.7% | 16.1 |
10 | one-shot | gpt-4o-mini | default | β | β | 88.2% | 88.3% | 90.7% | 90.0% | 87.3% | 2.6 |
11 | perplexity | sonar | default | β | β | 88.0% | 88.2% | 90.2% | 89.8% | 85.0% | 3.3 |
12 | one-shot | gpt-4o-mini | default | β | β | 87.8% | 89.2% | 90.2% | 88.2% | 88.2% | 4.5 |
13 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.4% | 88.3% | 89.2% | 90.0% | 86.2% | 7.9 |
14 | strict-referenced-after | gpt-5 | perplexity | β | β | 87.3% | 88.8% | 89.5% | 88.9% | 87.1% | 18.1 |
15 | ashai | gpt-4o-mini | perplexity | β | β | 87.1% | 88.6% | 89.0% | 89.8% | 87.7% | 5.8 |
16 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.1% | 87.5% | 89.4% | 89.8% | 87.0% | 6.9 |
17 | strict-referenced | gpt-4o-mini | perplexity | β | β | 86.9% | 87.8% | 88.5% | 89.7% | 86.7% | 7.4 |
18 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.6% | 88.3% | 86.9% | 87.3% | 86.9% | 10.0 |
19 | ashai | gpt-4o-mini | perplexity | β | β | 86.4% | 86.7% | 89.2% | 88.2% | 86.3% | 6.7 |
20 | strict-referenced | gpt-4o-mini | perplexity | β | β | 86.2% | 86.8% | 86.7% | 88.7% | 84.5% | 5.4 |
21 | one-shot | gpt-5 | default | β | β | 85.1% | 88.2% | 82.3% | 87.1% | 83.7% | 8.4 |
22 | ashai | gpt-4o-mini | perplexity | β | β | 81.4% | 84.6% | 81.7% | 85.0% | 81.7% | 5.9 |
23 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 80.9% | 80.4% | 80.8% | 83.4% | 82.3% | 12.2 |
24 | strict-referenced | gpt-4o-mini | perplexity | β | β | 80.9% | 84.6% | 79.8% | 83.4% | 81.2% | 8.1 |
25 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 80.4% | 81.2% | 80.8% | 83.9% | 78.3% | 10.8 |
26 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 74.0% | 73.8% | 75.8% | 77.2% | 76.7% | 7.9 |
v50 - aug 25, 2025 8:57pm¶
Changes¶
- Updated _is_retriable_error - transient OpenAI network hiccups will now be retried instead of failing after 1 attempt, reducing those "Backoff failure after 1 attempt(s) ... APIConnectionError: Connection error." entries (see the sketch after this list).
- cleaning up auto evaluator
- fixed please try again fallback
- Increase AGENT_MAX_RETRIES by 1 for models showing higher transient fail rates
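Here is a minimal sketch of the retry behavior described in the first change above; the exception set, backoff schedule, and default retry count are assumptions, not the repo's exact values.

```python
import time

import httpx
from openai import APIConnectionError, APITimeoutError, RateLimitError

AGENT_MAX_RETRIES = 3  # assumed default; the log only says it was increased by 1

def _is_retriable_error(exc: Exception) -> bool:
    # transient network/throttling problems are worth retrying; everything else fails fast
    return isinstance(exc, (APIConnectionError, APITimeoutError, RateLimitError, httpx.TransportError))

def call_with_backoff(fn, *args, max_attempts: int = AGENT_MAX_RETRIES, **kwargs):
    """Retry `fn` with exponential backoff, but only for retriable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if attempt == max_attempts or not _is_retriable_error(exc):
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff between attempts
```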
Errors¶
Overall errors: 2/132 cases

Top runs by error rate:
- one-shot | gpt-4o-mini | tools=default | full | v=v50 | 20250825_184632: 1/12 (8%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v50 | 20250825_184127: 1/12 (8%)
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v50 | 20250825_183937: 0/12 (0%)
- ashai | gpt-5 | tools=search_perplexity | full | v=v50 | 20250825_185505: 0/12 (0%)
- ashai-experiment | gpt-4o-mini | tools=search_perplexity | full | v=v50 | 20250825_183711: 0/12 (0%)

Common error messages:
- "Backoff failure after 3 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×2

Error categories:
- api_connection ×2

Origins:
- agent:direct ×2

Final attempt counts:
- after 3 attempt(s) ×2
AI analysis:
Error Analysis¶
- Error Rate Improvement: Overall errors decreased from 25/132 (19%) to 2/132 (1.5%), indicating significant progress in model stability.
- Top Error Patterns: Both top errors (1/12) occurred in runs with `gpt-4o-mini`, suggesting potential model-specific issues under certain configurations.
- Common Connection Errors: Persistent APIConnectionErrors indicate network reliability issues; two instances reported backoff failures after three attempts.
- Model Performance Consistency: The `ashai` runs with `gpt-4o-mini` and `gpt-5` showed no errors, indicating improved reliability in these configurations.
- Next Steps: Investigate network stability and API performance; consider further testing with `gpt-4o-mini` under different conditions to isolate issues.
- Comparison to Previous Entry: The previous entry had multiple high-error runs; current results show a marked reduction, suggesting effective adjustments in model handling or configuration.
Observations¶
- Top performer: strict-referenced | gpt-5 – 88.7%
- Fastest: one-shot | gpt-4o-mini – 3.0s
- Top performer: perplexity | sonar achieved 87.7%, maintaining the lead from v49.
- Significant gap: Second place (one-shot | gpt-4o-mini) scored 86.9%, indicating a notable performance drop.
- Speed outlier: one-shot | gpt-4o-mini averaged 3.0 seconds, the fastest in this evaluation.
- Error reduction: Overall errors decreased to 2/132 cases, a significant improvement from 25 errors in v49.
- Notable error hotspots: Common errors were limited to API connection issues, with only two occurrences reported.
- Performance consistency: ashai and strict-referenced models showed improved accuracy, with no errors in their latest runs.
Leaderboard v50¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | perplexity | β | β | 88.7% | 90.0% | 89.4% | 89.8% | 88.4% | 44.7 |
2 | one-shot | gpt-5 | default | β | β | 88.6% | 89.6% | 89.6% | 89.8% | 88.3% | 25.6 |
3 | strict-referenced-after | gpt-5 | perplexity | β | β | 88.4% | 88.8% | 90.0% | 89.8% | 87.2% | 34.5 |
4 | ashai | gpt-5 | perplexity | β | β | 88.2% | 89.0% | 90.2% | 89.8% | 88.2% | 36.4 |
5 | perplexity | sonar | default | β | β | 87.3% | 87.8% | 89.2% | 88.8% | 85.7% | 4.2 |
6 | ashai-experiment | gpt-5 | perplexity | β | β | 87.2% | 87.3% | 89.6% | 88.4% | 88.3% | 34.8 |
7 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 85.9% | 86.3% | 86.2% | 87.6% | 84.4% | 9.6 |
8 | ashai | gpt-4o-mini | perplexity | β | β | 84.9% | 85.0% | 86.5% | 88.8% | 86.0% | 6.7 |
9 | one-shot | gpt-4o-mini | default | β | β | 80.8% | 81.8% | 82.5% | 82.5% | 80.7% | 3.0 |
10 | strict-referenced | gpt-4o-mini | perplexity | β | β | 78.9% | 78.3% | 80.4% | 83.8% | 78.5% | 11.3 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 45.8% | 47.5% | 42.5% | 50.8% | 48.8% | 9.1 |
v49 - aug 25, 2025 2:09pm¶
Changes¶
- removed the second med-acc route for runs with no tools - meaning agents with no tools can't score higher than 80% on medical accuracy (sketched after this list)
- moved batch setup to batch_set.py for easier work
- lots of docs improvements that shouldn't impact tests
- filter out response errors at top level vs at the scorer level
- renamed Niharika FAQ to googlesheets
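A hedged sketch of the rule in the first change above; the function and constant names are illustrative assumptions, not the repo's actual scorer code.

```python
NO_REFERENCE_MED_ACC_CAP = 0.80  # agents that cite no references can't exceed 80%

def cap_medical_accuracy(raw_med_acc: float, references: list[str]) -> float:
    """Clamp the medical-accuracy score when the response cites no references."""
    if not references:
        return min(raw_med_acc, NO_REFERENCE_MED_ACC_CAP)
    return raw_med_acc

# Example: a 0.92 raw score with no references is reported as 0.80
print(cap_medical_accuracy(0.92, references=[]))
```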
Errors¶
Overall errors: 25/132 cases

Top runs by error rate:
- ashai-experiment | gpt-5 | tools=search_perplexity | full | v=v49 | 20250825_133303: 5/12 (42%)
- strict-referenced | gpt-5 | tools=search_perplexity | full | v=v49 | 20250825_135920: 4/12 (33%)
- strict-referenced-after | gpt-5 | tools=search_perplexity | full | v=v49 | 20250825_134255: 4/12 (33%)
- ashai | gpt-4o-mini | tools=search_perplexity | full | v=v49 | 20250825_132634: 3/12 (25%)
- strict-referenced-after | gpt-4o-mini | tools=search_perplexity | full | v=v49 | 20250825_132801: 3/12 (25%)

Common error messages:
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-5]: APIConnectionError: Connection error." ×17
- "Backoff failure after 1 attempt(s) [agent:openai:gpt-4o-mini]: APIConnectionError: Connection error." ×7
- "Error: Failed to generate a properly formatted response. Please try again." ×1
Observations¶
- Top performer: perplexity | sonar – 87.7%
Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 87.7% | 88.0% | 89.2% | 89.7% | 85.8% | 3.2 |
2 | one-shot | gpt-4o-mini | default | β | β | 86.9% | 87.1% | 90.7% | 89.8% | 88.3% | 3.8 |
3 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 80.4% | 81.5% | 80.4% | 82.9% | 82.5% | 6.0 |
4 | one-shot | gpt-5 | default | β | β | 74.5% | 75.4% | 75.6% | 74.4% | 75.2% | 12.7 |
5 | ashai | gpt-5 | perplexity | β | β | 74.3% | 75.0% | 75.4% | 74.6% | 73.8% | 25.5 |
6 | strict-referenced | gpt-4o-mini | perplexity | β | β | 67.1% | 65.8% | 67.1% | 72.1% | 64.6% | 4.8 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 65.0% | 66.7% | 66.7% | 66.2% | 63.5% | 4.8 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 58.7% | 59.6% | 59.2% | 59.7% | 58.2% | 79.6 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 58.6% | 59.2% | 59.8% | 59.5% | 58.3% | 35.2 |
10 | ashai-experiment | gpt-5 | perplexity | β | β | 52.1% | 52.9% | 53.3% | 52.5% | 52.5% | 18.9 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 43.0% | 43.6% | 42.1% | 47.5% | 43.8% | 9.0 |
Reverse-chronological log of what was learned for each evaluation iteration.
v48¶
Changes¶
- bringing back evidence_sufficiency to tests with no references
- filtering errors in higher class (not at each scorer)
- moved the weights to the top of scorer files and added to docs (should not impact anything)
- added an auto-learnings filer
v47 - take 2 - Aug 25, 2025¶
Changes¶
- forgot to increment version so its added on to the last v47
- more backoff, reduced concurrency
- TIMEOUT - increased past 300
Observations¶
- one-shot won... though I have no idea how, because it can't use tools, so it had 0 references and should have gotten at most 80% on medical accuracy.
- investigated and oh yeah, it gets a different scorer when there are no references. Let's fix that.
Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-4o-mini | default | β | β | 88.2% | 88.3% | 89.6% | 90.2% | 88.2% | 2.3 |
2 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 88.1% | 89.6% | 88.8% | 89.2% | 86.5% | 6.4 |
3 | perplexity | sonar | default | β | β | 87.5% | 88.8% | 88.8% | 89.0% | 85.8% | 3.2 |
4 | perplexity | sonar | default | β | β | 87.4% | 89.2% | 89.1% | 88.4% | 83.8% | 3.1 |
5 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 83.1% | 84.2% | 82.5% | 89.1% | 81.7% | 14.8 |
6 | one-shot | gpt-5 | default | β | β | 81.8% | 83.3% | 83.8% | 81.4% | 80.2% | 16.1 |
7 | one-shot | gpt-5 | default | β | β | 81.4% | 82.5% | 83.1% | 81.7% | 81.5% | 13.7 |
8 | one-shot | gpt-4o-mini | default | β | β | 73.3% | 74.2% | 75.8% | 75.1% | 71.7% | 2.3 |
9 | strict-referenced | gpt-4o-mini | perplexity | β | β | 72.6% | 74.2% | 73.3% | 74.4% | 71.7% | 4.4 |
10 | ashai | gpt-4o-mini | perplexity | β | β | 72.3% | 72.9% | 73.2% | 73.8% | 70.0% | 5.2 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 72.3% | 73.2% | 73.8% | 74.4% | 71.7% | 3.7 |
12 | ashai-experiment | gpt-5 | perplexity | β | β | 66.5% | 67.1% | 67.7% | 67.1% | 66.7% | 24.3 |
13 | ashai | gpt-5 | perplexity | β | β | 66.2% | 67.9% | 67.9% | 67.5% | 66.1% | 20.3 |
14 | ashai | gpt-5 | perplexity | β | β | 66.0% | 67.1% | 67.1% | 67.4% | 66.5% | 28.4 |
15 | strict-referenced | gpt-4o-mini | perplexity | β | β | 59.4% | 60.4% | 59.8% | 61.7% | 58.8% | 4.5 |
16 | strict-referenced-after | gpt-5 | perplexity | β | β | 59.0% | 60.0% | 61.1% | 59.7% | 57.3% | 29.6 |
17 | ashai-experiment | gpt-5 | perplexity | β | β | 56.8% | 57.5% | 57.5% | 57.1% | 55.5% | 37.4 |
18 | strict-referenced | gpt-5 | perplexity | β | β | 52.1% | 52.5% | 52.9% | 52.3% | 51.2% | 25.0 |
19 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 50.7% | 50.4% | 51.7% | 51.7% | 51.0% | 7.9 |
20 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 46.2% | 45.4% | 47.1% | 48.8% | 46.9% | 8.3 |
21 | strict-referenced-after | gpt-5 | perplexity | β | β | 44.5% | 45.0% | 45.4% | 44.2% | 43.3% | 25.0 |
22 | strict-referenced | gpt-5 | perplexity | β | β | 44.3% | 44.2% | 45.0% | 44.6% | 43.8% | 28.9 |
v44 - Aug¶
Changes¶
- upgraded tier of perplexity - not limited to 50 calls/min
- wrapper for perplexity
v43 - Aug 23, 2025 3pm¶
Changes¶
- BIG Merge - moving to weave's typed models.
- NihAI v1 added but will not be used.
- error logging hopefully fixed, total_cases should always be 11
Observations¶
- oddly gpt-5 is doing worse still, even with backoff and stuff.
Leaderboard v43¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 87.9% | 90.0% | 89.2% | 88.3% | 85.7% | 7.3 |
2 | perplexity | sonar | default | β | β | 87.4% | 88.8% | 89.2% | 88.6% | 85.0% | 3.6 |
3 | one-shot | gpt-5 | default | β | β | 81.9% | 83.3% | 83.5% | 82.3% | 81.7% | 20.6 |
4 | one-shot | gpt-4o-mini | default | β | β | 80.1% | 80.4% | 81.9% | 82.5% | 82.1% | 3.6 |
5 | strict-referenced | gpt-4o-mini | perplexity | β | β | 79.5% | 80.8% | 80.6% | 82.3% | 80.0% | 6.6 |
6 | ashai | gpt-5 | perplexity | β | β | 70.3% | 72.3% | 65.9% | 74.5% | 71.2% | 27.4 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 67.2% | 67.5% | 67.5% | 69.7% | 67.2% | 6.9 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 66.7% | 67.1% | 68.5% | 68.6% | 67.9% | 24.5 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 52.7% | 52.5% | 53.3% | 54.2% | 54.7% | 24.2 |
10 | ashai-experiment | gpt-5 | perplexity | β | β | 52.5% | 52.5% | 52.1% | 54.6% | 55.0% | 20.0 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 39.3% | 37.1% | 37.9% | 44.2% | 42.1% | 9.0 |
v39 - Aug 22, 2025 5pm¶
Changes¶
- added better error tracking and logging to evals
- switched eval judge to 'gpt-4o-mini-2024-07-18'
- moved all models to Weave's typed models instead of model_funcs for better tracking and naming
- reduced judge token budget from 8000 to 2500 to improve performance
- added a retry backoff - hopefully helps with throttling but may increase times
Observations¶
- perplexity ran first and the server wasn't up - this was a bug and is being fixed
- the error logs are saying there were 0 errors but also a varying number of cases run per eval... because failures aren't making it to the end where there's error handling
Leaderboard v39¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-4o-mini | default | β | β | 88.2% | 88.6% | 90.3% | 89.8% | 88.8% | 4.0 |
2 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 86.9% | 87.8% | 88.3% | 89.4% | 86.2% | 10.6 |
3 | ashai | gpt-5 | perplexity | β | β | 74.2% | 75.4% | 75.4% | 74.7% | 75.7% | 28.1 |
4 | ashai | gpt-4o-mini | perplexity | β | β | 72.9% | 75.0% | 74.9% | 76.0% | 73.5% | 6.4 |
5 | strict-referenced | gpt-4o-mini | perplexity | β | β | 70.7% | 72.3% | 70.0% | 71.5% | 70.7% | 10.3 |
6 | strict-referenced-after | gpt-5 | perplexity | β | β | 61.0% | 60.4% | 61.8% | 62.5% | 63.8% | 28.0 |
7 | ashai-experiment | gpt-5 | perplexity | β | β | 59.7% | 59.8% | 59.6% | 60.0% | 63.8% | 23.8 |
8 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 48.5% | 46.4% | 48.9% | 56.2% | 51.5% | 10.8 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 44.5% | 44.6% | 43.3% | 45.0% | 47.2% | 27.5 |
10 | perplexity | sonar | default | β | β | 0.6% | 0.0% | 0.0% | 2.5% | 2.5% | 3.2 |
v38 - Aug 22 1:30am¶
Changes¶
- better weave integration and leaderboard
- got timing issues figured out
- switched eval model to a hard-coded 'gpt-4o-mini-2024-07-18' to keep it a consistent model
- Timing is now reconciled:
    - Weave `model_latency.mean` is agent-only and matches our manual `agent_time_s` (v37: 19.26s vs 19.26s).
    - Leaderboard "Avg s" switched to wall-based throughput using root `ended_at - started_at` divided by num_cases (see the sketch after this list).
    - Failed/descendant_error runs are included but clearly flagged (red rows) and don't drive duration metrics.
- Weave UI leaderboard: added two columns via a new `wall_clock` scorer on each Evaluation output:
    - `wall_clock.per_case_avg_s` (minimize)
    - `wall_clock.total_s` (minimize)
- Batch stability: launcher cleanup tightened; reduced intermittent failures from stray processes.
- Case starts: in direct mode, cases start concurrently (identical epoch starts). The increasing times were completion order, not backlog; `queue_wait_s` is near zero in direct mode.
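A minimal sketch of the wall-based "Avg s" calculation described above, assuming a root run record with `started_at`/`ended_at` timestamps; the function name is an assumption.

```python
from datetime import datetime

def per_case_avg_s(started_at: datetime, ended_at: datetime, num_cases: int) -> float:
    """Wall-based throughput: root run duration divided by the number of cases."""
    total_s = (ended_at - started_at).total_seconds()
    return total_s / num_cases if num_cases else float("nan")

# Example: a 12-case eval that ran for 96 wall-clock seconds -> 8.0 s per case
print(per_case_avg_s(datetime(2025, 8, 22, 1, 30, 0), datetime(2025, 8, 22, 1, 31, 36), 12))
```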
Observations¶
- gpt-5 oddly did worse than 4?! - seems like there were a lot of 0 scores from Error: Connection issue
- Timing verified and aligned with manual metrics.
- Perplexity (sonar) leads with strong quality and speed.
- Strict-referenced gpt-4o-mini + perplexity improved and fast.
- gpt-5 + tools still underperform and are slower; one-shot gpt-5 is efficient.
- ashai-experiment lags base ashai at similar latency; consider retune/remove.
- No failed rows skewing results; health flags working.
v38 Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 88.0% | 88.9% | 89.4% | 88.7% | 85.8% | 15.8 |
2 | strict-referenced | gpt-4o-mini | perplexity | β | β | 84.0% | 86.0% | 85.8% | 88.1% | 82.1% | 15.9 |
3 | one-shot | gpt-5 | default | β | β | 82.7% | 85.0% | 84.0% | 82.4% | 81.9% | 17.6 |
4 | ashai | gpt-4o-mini | perplexity | β | β | 78.1% | 80.4% | 75.8% | 80.8% | 76.8% | 15.8 |
5 | one-shot | gpt-4o-mini | default | β | β | 74.5% | 78.3% | 76.2% | 78.6% | 76.2% | 14.2 |
6 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 73.5% | 75.2% | 73.8% | 72.7% | 73.2% | 15.2 |
7 | ashai | gpt-5 | perplexity | β | β | 67.6% | 68.9% | 71.7% | 66.5% | 65.2% | 28.0 |
8 | strict-referenced-after | gpt-5 | perplexity | β | β | 67.4% | 68.6% | 70.0% | 67.3% | 65.4% | 26.6 |
9 | strict-referenced | gpt-5 | perplexity | β | β | 66.9% | 67.5% | 68.3% | 67.2% | 63.2% | 33.8 |
10 | ashai-experiment | gpt-5 | perplexity | β | β | 60.5% | 60.8% | 62.1% | 61.1% | 59.4% | 34.9 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 42.1% | 43.3% | 39.6% | 42.3% | 49.3% | 23.2 |
v37 - Aug 21 8pm - testing weave leaderboard. def broken runs¶
Observations¶
- Timing verified: Wall-based Avg s now matches expectations and aligns with manual agent timings (no backlog artifacts).
- Perplexity baseline leads: `perplexity (sonar)` tops quality at 88.0% and stays fast (~15.8s), a strong production baseline.
- Strict-referenced (gpt-4o-mini + perplexity) improved: 84.0% at ~15.9s suggests the strict flow is stable and no longer incurring heavy latency.
- Tools + gpt-5 still underperform: gpt-5 + perplexity variants land mid-60s overall and are slower (~26-35s). Indicates a prompting/config gap vs the mini/sonar path.
- One-shot gpt-5 is efficient: 82.7% at ~17.6s without tools; tools don't help gpt-5 yet.
- ashai-experiment lags: the experiment prompt underperforms the base `ashai` at similar latencies; likely remove or retune.
- Run health: no obvious failed/descendant_error rows in top results; filtering + flags working.
- so some runs failed hard. not sure why
- the time is way way down.
v37 leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | perplexity | β | β | 73.7% | 74.9% | 74.5% | 74.2% | 71.6% | 21.8 |
2 | ashai-experiment | gpt-4o-mini | default | β | β | 16.8% | 15.0% | 16.2% | 20.4% | 23.8% | 7.5 |
3 | strict-referenced-after | gpt-4o-mini | default | β | β | 12.4% | 8.9% | 12.1% | 16.2% | 17.1% | 27.2 |
4 | ashai | gpt-4o-mini | default | β | β | 8.3% | 7.5% | 8.8% | 8.8% | 13.8% | 26.6 |
5 | strict-referenced | gpt-4o-mini | default | β | β | 2.3% | 0.0% | 1.7% | 3.8% | 10.4% | 7.3 |
V36 - Aug 21 4pm - same as v35 but with gpt-4¶
Changes¶
- fixed some tests and some timing things but otherwise nothing. I thought this was going to run v34 again, but the version got incremented. I wanted v34 to have more gpt-4 comparators, as I'm trying to see if I can get gpt-4 to be as good as 5 with some prompt tweaks
Observations¶
- V35 to V36 focus shift: Added more GPT-4 comparators while keeping the V35 setup, to see how far prompt tweaks can push GPT-4 toward GPT-5 quality.
- Timing instrumentation matured: V35 introduced per-case `agent_time_s`; V36 adds `wall_time_s` and `queue_wait_s` so we can separate execution vs queueing and reconcile against Weave (see the sketch after this list).
- Weave vs wall timing: Weave `model_latency.mean` (agent-only) stays inflated vs per-case wall averages due to concurrency. Plan: swap the leaderboard "Avg s" to a wall-based metric.
- Pattern holds: "Check-after" continues to edge strict-inline on quality and speed; Perplexity integration looks healthier (GPT-5 + Perplexity leads, GPT-4 + Perplexity is competitive).
- Next: Use the corrected Weave entity to pull per-case latencies for V35/V36 and update the timing_debug_report with true per-test numbers.
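A rough sketch of how the three per-case timing fields could be collected, to make the distinction concrete; the field names mirror the log, but the function, the agent interface, and the implementation are assumptions.

```python
import time

def run_case(agent, case: dict, enqueued_at: float) -> dict:
    """Collect queue_wait_s (waiting to start), agent_time_s (execution only),
    and wall_time_s (enqueue to finish) for a single eval case."""
    started_at = time.monotonic()
    queue_wait_s = started_at - enqueued_at
    response = agent(case)                       # agent-only work; `agent` is any callable here
    agent_time_s = time.monotonic() - started_at
    return {
        "response": response,
        "queue_wait_s": queue_wait_s,
        "agent_time_s": agent_time_s,
        "wall_time_s": queue_wait_s + agent_time_s,
    }

# Example with a trivial stand-in agent:
print(run_case(lambda c: "ok", {"question": "demo"}, enqueued_at=time.monotonic()))
```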
Leaderboard V35¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | perplexity | β | β | 89.3% | 90.8% | 91.9% | 91.0% | 86.7% | 742.6 |
2 | strict-referenced | gpt-4o-mini | default | β | β | 85.1% | 87.1% | 84.2% | 89.3% | 85.0% | 407.2 |
3 | ashai-experiment | gpt-4o-mini | default | β | β | 82.1% | 81.5% | 80.8% | 86.2% | 82.5% | 126.9 |
4 | strict-referenced-after | gpt-4o-mini | default | β | β | 79.9% | 85.0% | 80.0% | 80.8% | 77.9% | 368.6 |
5 | ashai | gpt-4o-mini | default | β | β | 79.9% | 82.1% | 80.4% | 87.9% | 82.2% | 340.5 |
V35 - Aug 20, 2025 8pm¶
Changes¶
- trying to debug some timing things by manually logging times.
Leaderboard V35¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 87.0% | 87.6% | 88.8% | 88.2% | 86.8% | 57.6 |
2 | ashai-experiment | gpt-5 | default | β | β | 83.2% | 84.8% | 84.2% | 83.8% | 80.8% | 123.0 |
3 | ashai | gpt-5 | default | β | β | 82.7% | 85.2% | 83.3% | 83.4% | 80.0% | 207.9 |
4 | strict-referenced-after | gpt-5 | default | β | β | 76.0% | 77.9% | 78.3% | 75.0% | 72.9% | 288.1 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 65.6% | 66.5% | 64.8% | 66.5% | 65.8% | 21.1 |
6 | strict-referenced | gpt-5 | default | β | β | 45.3% | 45.8% | 46.2% | 45.4% | 46.1% | 433.9 |
V34 - Aug 20, 2025 10am¶
Changes¶
- completely new Ahsai-experiment with encouraging more perplexity use and followup questions.
- took out warnings for perplexity being a fallback tool
- penalizing technical terms and drug names and dosage amounts in empathy and precision
Observations¶
- GPT-5 quality holds, speed improves ~3x vs V33: top GPT-5 agents remain ~90% overall, with latency down to ~1,000-1,300s (from ~3,100s). `ashai-experiment` leads at 90.5% in 983.8s.
- Perplexity integration looks healthier: `ashai (gpt-4o-mini, perplexity)` at 87.7% (Rank 5) narrowly edges `perplexity (sonar)` at 87.5%, suggesting prompt/flow updates helped.
- Check-after beats strict inline: `strict-referenced-after (gpt-5)` edges `strict-referenced` (89.9% vs 89.5%) and is faster (1083s vs 1311s), supporting the guess-then-verify pattern.
- New jargon/drug penalties are biting as intended: minor shifts in Precision/Empathy for some GPT-4o-mini configs without collapsing top rows.
- Speed tiers remain clear: `one-shot (gpt-4o-mini)` is fastest (38.7s) with solid 85.8%; `perplexity (sonar)` is also quick (62.3s) with strong quality.
Leaderboard V34¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | default | β | β | 90.5% | 91.2% | 92.4% | 90.7% | 87.2% | 983.8 |
2 | strict-referenced-after | gpt-5 | default | β | β | 89.9% | 91.2% | 92.9% | 89.8% | 87.5% | 1083.0 |
3 | ashai | gpt-5 | default | β | β | 89.8% | 91.5% | 91.0% | 90.2% | 86.9% | 1015.3 |
4 | strict-referenced | gpt-5 | default | β | β | 89.5% | 90.3% | 89.5% | 90.9% | 86.4% | 1311.1 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 87.7% | 89.2% | 88.8% | 88.8% | 85.7% | 166.9 |
6 | perplexity | sonar | default | β | β | 87.5% | 88.3% | 89.6% | 89.2% | 85.6% | 62.3 |
7 | strict-referenced | gpt-4o-mini | default | β | β | 86.3% | 88.4% | 85.5% | 89.5% | 85.0% | 157.0 |
8 | ashai-experiment | gpt-4o-mini | default | β | β | 86.1% | 88.3% | 86.7% | 88.0% | 86.0% | 161.0 |
9 | ashai | gpt-4o-mini | default | β | β | 86.1% | 88.1% | 85.8% | 89.8% | 86.8% | 116.7 |
10 | one-shot | gpt-4o-mini | default | β | β | 85.8% | 89.4% | 87.3% | 89.1% | 86.8% | 38.7 |
11 | ashai | gpt-4o-mini | default | β | β | 84.1% | 86.7% | 84.7% | 87.8% | 85.8% | 118.2 |
12 | ashai-experiment | gpt-4o-mini | default | β | β | 81.9% | 82.8% | 81.2% | 83.8% | 80.7% | 185.5 |
13 | strict-referenced | gpt-4o-mini | default | β | β | 79.9% | 79.6% | 78.3% | 83.2% | 82.1% | 158.1 |
14 | ashai | gpt-4o-mini | none | β | β | 78.7% | 80.0% | 81.5% | 82.2% | 80.4% | 33.8 |
15 | strict-referenced-after | gpt-4o-mini | default | β | β | 77.8% | 77.5% | 77.9% | 83.5% | 80.4% | 221.3 |
16 | strict-referenced-after | gpt-4o-mini | default | β | β | 75.0% | 75.0% | 77.9% | 73.2% | 72.1% | 260.4 |
V33¶
Changes¶
- some fixes to leaderboard and removed quick mode default - so these are not in quick mode eval
- trying some gpt-5 now
- no timeout things with using the apis. only september is using the run.sh every other test is just direct now
Observations¶
- WOW! We got some high scores! Strict-referenced did great but WOW did it take a long long time!
- GPT-5 dominance: GPT-5 models completely dominated the top 4 positions, with all achieving 90%+ overall scores. This represents a massive improvement over V32, where GPT-5 was inconsistent and often underperformed due to timeouts.

I asked AI to dig into what the diffs were between the gpt-4 and gpt-5 runs:

- Strict-referenced performance explosion: The strict-referenced agent with GPT-5 achieved 90.6% overall, the highest score ever recorded. However, it took an extremely long time (2728 seconds / ~45 minutes), suggesting a significant speed-quality trade-off. Is 45 minutes even possible?! I don't know...
- ashai consistency: The ashai agent with GPT-5 achieved 90.2% overall in just 1091 seconds, making it the most efficient high performer. This suggests ashai has found an optimal balance between thoroughness and speed.
- ashai-experiment recovery continues: After the V30 failure and V32 recovery, ashai-experiment with GPT-5 maintained strong performance (90.1% overall), showing the fixes from V32 were effective and stable.
- Perplexity agent stability: The perplexity agent with Sonar maintained strong performance (87.7% overall) with excellent speed (94 seconds), confirming it as a reliable middle-ground option that doesn't require the latest models.
- GPT-4o-mini performance ceiling: Even the best GPT-4o-mini agents maxed out around 86-87% overall, while GPT-5 agents consistently broke 90%. This suggests a clear performance ceiling for the smaller model.
- Speed vs quality trade-off confirmed: The fastest agents (4-10 seconds) achieved 78-85% scores, while the highest performers (90%+) took 1000+ seconds. The relationship is clear: more thorough analysis requires significantly more time.
- Tool usage impact: Agents without tools (ashai with "none") still achieved respectable scores (82.1% overall), but tools clearly provide a significant boost when used effectively.
- Model consistency: Unlike V32, where GPT-5 was inconsistent, V33 shows GPT-5 agents are now reliably high-performing, suggesting the direct API approach and removal of quick mode resolved stability issues.
- Ashai-experiment does worse than ashai: the extra prompt additions hurt, not helped. I'm going to take them out and try a different experiment.
I'll analyze the V33 results to understand what GPT-5 is doing better than GPT-4. Let me start by examining the V33 results file and then use the Weave MCP to get detailed trace data.
It read through evaluate/results/v33.json several times, tried to pull detailed trace data via the Weave MCP (but couldn't resolve the right entity/project name), and compared the GPT-5 vs GPT-4o-mini case results and evidence_sufficiency scores before writing up the analysis below.
Deep Dive Analysis: GPT-5 vs GPT-4o-mini Performance in V33¶
I've conducted a comprehensive analysis of the V33 results to understand what GPT-5 is doing better than GPT-4o-mini. Here are the key findings:
Critical Discovery: Evidence Sufficiency Gap¶
The most dramatic difference is in evidence sufficiency:
- GPT-4o-mini (ashai): 68.75%
- GPT-5 (ashai): 84.58%
- Improvement: +15.83 percentage points
This suggests GPT-5 is doing much more thorough research and information gathering.
What GPT-5 Does Better¶
- More Comprehensive Research
    - GPT-4o-mini used 2 sources (September + Niharika) for the headaches case
    - GPT-5 used 6 high-quality sources (ACOG, Mayo Clinic, Banner Health, NHS, etc.)
    - GPT-5 conducts multiple Perplexity queries instead of single searches
- Better Information Synthesis
    - GPT-4o-mini: general advice like "Stay hydrated, get rest"
    - GPT-5: specific actionable steps like "Check BP if 140/90+, take 500-650mg acetaminophen"
- Enhanced Precision and Detail
    - GPT-5 provides specific dosages, thresholds, and structured responses
    - Includes detailed warning signs with clear action items
    - Adds personalized follow-up questions
- Better Cultural Adaptation
    - GPT-5's Hindi response includes culturally adapted language and India-specific examples
    - More natural language flow and local context
Speed vs Quality Trade-off¶
- GPT-4o-mini: 104.7 seconds
- GPT-5: 1,091.2 seconds (10x longer)
- The quality improvements correlate with the increased processing time
Key Insight for Prompt Engineering¶
The improvements appear to be methodological rather than inherent model capabilities. GPT-5 is:
- Conducting more thorough research
- Using better response structures
- Providing more specific, actionable guidance
This suggests that enhancing prompts for GPT-4o-mini could potentially bridge much of this performance gap by:
- Requiring multiple search queries
- Mandating specific response structures
- Demanding more detailed, actionable responses
- Including cultural adaptation requirements
The analysis shows that GPT-5's superior performance comes from more thorough processing rather than faster reasoning, making it a prime candidate for prompt engineering improvements to bring GPT-4o-mini performance closer to GPT-5 levels.
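As a follow-up to the prompt-engineering idea above, here is a hedged sketch of what such an addendum for the gpt-4o-mini agents might look like; the wording is an illustrative assumption, not the prompt actually used in the repo.

```python
# Hypothetical system-prompt addendum for the gpt-4o-mini agents; wording is illustrative only.
GPT4O_MINI_PROMPT_ADDENDUM = """
Before answering:
- Run at least two distinct search queries and cite every source you actually used.
- Structure the reply as: brief summary, specific action steps (with thresholds or doses only when sourced),
  warning signs that need escalation, and one personalized follow-up question.
- Adapt wording and examples to the user's language and local context.
"""
```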
Leaderboard V33¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced | gpt-5 | default | β | β | 90.6% | 91.6% | 93.3% | 91.0% | 87.9% | 2728.3 |
2 | strict-referenced-after | gpt-5 | default | β | β | 90.5% | 91.6% | 92.1% | 89.9% | 87.2% | 3128.2 |
3 | ashai | gpt-5 | default | β | β | 90.2% | 91.5% | 93.3% | 90.6% | 87.8% | 1091.2 |
4 | ashai-experiment | gpt-5 | default | β | β | 90.1% | 91.9% | 91.5% | 91.1% | 87.8% | 3117.1 |
5 | perplexity | sonar | default | β | β | 87.7% | 89.2% | 91.6% | 89.3% | 86.0% | 94.1 |
6 | ashai-experiment | gpt-4o-mini | default | β | β | 86.7% | 88.5% | 88.2% | 88.6% | 87.5% | 117.7 |
7 | ashai | gpt-4o-mini | default | β | β | 85.9% | 87.5% | 88.2% | 89.8% | 85.2% | 104.7 |
8 | strict-referenced | gpt-4o-mini | default | β | β | 85.0% | 87.1% | 86.8% | 87.6% | 86.3% | 271.7 |
9 | ashai-experiment | gpt-4o-mini | default | β | β | 84.9% | 84.2% | 86.2% | 91.2% | 85.7% | 153.2 |
10 | ashai | gpt-4o-mini | default | β | β | 84.7% | 86.7% | 85.8% | 89.6% | 85.5% | 89.4 |
11 | strict-referenced-after | gpt-4o-mini | default | β | β | 84.4% | 86.1% | 84.6% | 86.5% | 81.5% | 309.6 |
12 | one-shot | gpt-4o-mini | default | β | β | 83.7% | 89.2% | 89.0% | 91.5% | 86.7% | 368.4 |
13 | ashai | gpt-4o-mini | none | β | β | 82.1% | 86.2% | 87.9% | 89.8% | 86.0% | 210.5 |
14 | ashai | gpt-4o-mini | perplexity | β | β | 80.3% | 82.1% | 81.7% | 83.2% | 79.2% | 264.6 |
15 | strict-referenced-after | gpt-4o-mini | default | β | β | 79.9% | 79.8% | 80.4% | 81.8% | 78.8% | 355.2 |
16 | strict-referenced | gpt-4o-mini | default | β | β | 78.6% | 79.2% | 80.6% | 81.9% | 80.1% | 151.1 |
V32 Aug 19 2025 5pm - WARNING - all these were eval'd in quick mode....¶
Changes¶
- did speed investigation, and went back to gpt-4 as an eval judge (much faster)
- potentially doing evals direct vs through the run.sh server
- made tool calls direct (except for september)
Observations¶
- GPT-5 vs GPT-4o-mini performance gap: The top performer (ashai-experiment with GPT-5) achieved 89.8% overall, while GPT-4o-mini agents maxed out around 88.3%. However, GPT-5 agents also had much slower response times (160+ seconds vs 10-20 seconds for GPT-4o-mini). GPT-5 takes a crazy amount of time!
- Strict-referenced consistency: The strict-referenced agent with GPT-4o-mini shows remarkable consistency, taking the #2 and #3 spots with nearly identical scores (88.3% and 88.1% overall). This suggests the approach is very stable.
- Speed vs quality trade-off: There's a clear inverse relationship between speed and performance. The fastest agents (4-10 seconds) are in the middle of the pack, while the highest performers take 75-200+ seconds.
- ashai-experiment recovery: After completely failing in V30 (6.7% overall), ashai-experiment with GPT-5 is now the top performer. This suggests the speed investigation and direct tool calls fixed critical issues.
- Perplexity agent stability: The perplexity agent with Sonar maintains strong performance (87.4% overall) with reasonable speed (75 seconds), showing it's a reliable middle-ground option.
- GPT-5 consistency issues: While GPT-5 can achieve the highest scores, it's inconsistent - some GPT-5 agents rank very low (59-61% overall), suggesting it may be more sensitive to prompt variations or tool configurations.
Leaderboard V32¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai-experiment | gpt-5 | default | β | β | 89.8% | 92.7% | 88.3% | 91.7% | 86.7% | 160.3 |
2 | strict-referenced | gpt-4o-mini | default | β | β | 88.3% | 91.7% | 86.7% | 93.3% | 81.7% | 10.4 |
3 | strict-referenced | gpt-4o-mini | default | β | β | 88.1% | 91.7% | 87.3% | 90.0% | 83.3% | 19.0 |
4 | perplexity | sonar | default | β | β | 87.4% | 87.5% | 91.2% | 90.1% | 86.8% | 75.0 |
5 | ashai | gpt-4o-mini | default | β | β | 87.1% | 86.7% | 85.7% | 91.7% | 84.3% | 4.8 |
6 | one-shot | gpt-4o-mini | default | β | β | 86.2% | 90.0% | 85.0% | 90.0% | 80.0% | 4.9 |
7 | strict-referenced-after | gpt-4o-mini | default | β | β | 86.2% | 90.0% | 85.0% | 88.3% | 81.7% | 142.7 |
8 | ashai-experiment | gpt-4o-mini | default | β | β | 85.0% | 88.3% | 83.3% | 90.0% | 78.3% | 7.1 |
9 | ashai-experiment | gpt-4o-mini | default | β | β | 84.2% | 88.3% | 83.3% | 88.3% | 76.7% | 14.4 |
10 | ashai | gpt-4o-mini | default | β | β | 83.8% | 90.0% | 83.3% | 81.7% | 80.0% | 9.7 |
11 | ashai | gpt-4o-mini | default | β | β | 82.5% | 86.7% | 80.0% | 86.7% | 76.7% | 25.1 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 67.9% | 71.7% | 65.0% | 76.7% | 58.3% | 17.8 |
13 | strict-referenced-after | gpt-5 | default | β | β | 61.2% | 65.0% | 61.7% | 61.7% | 56.7% | 196.3 |
14 | ashai | gpt-5 | default | β | β | 59.8% | 61.7% | 59.3% | 61.7% | 56.7% | 189.3 |
15 | strict-referenced | gpt-5 | default | β | β | 59.2% | 61.7% | 58.3% | 60.7% | 56.0% | 282.6 |
16 | ashai | gpt-4o-mini | default | β | β | 57.8% | 59.3% | 56.7% | 61.7% | 53.3% | 10.9 |
V30 - 8-19-2025 12:00pm¶
Changes¶
- multi-threaded run.sh now hopefully helps with performance?
- removed disclaimer stuff from medical accuracy
Observation¶
- multi-threaded didn't seem to really help with performance
- strict-referenced is doing the best on medical accuracy but getting knocked a bit in precision and clarity
- ashai experimental is just breaking. no clue why
- oddly ashai with reasoning off took a really long time.
Leaderboard¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 72.5% | 77.3% | 76.4% | 83.6% | 65.6% | 62.8 |
2 | ashai | gpt-4o-mini | perplexity | β | β | 71.1% | 78.9% | 73.2% | 80.6% | 69.8% | 139.0 |
3 | ashai | gpt-4o-mini | perplexity | β | β | 70.9% | 79.8% | 71.7% | 82.6% | 65.8% | 315.9 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 70.7% | 80.8% | 72.2% | 80.5% | 69.2% | 209.6 |
5 | ashai | gpt-4o-mini | none | β | β | 69.6% | 84.9% | 73.9% | 87.1% | 74.2% | 24.8 |
6 | one-shot | gpt-4o-mini | default | β | β | 68.7% | 82.2% | 72.2% | 85.1% | 76.8% | 63.1 |
7 | ashai | gpt-4o-mini | default | β | β | 68.6% | 79.8% | 71.8% | 83.1% | 69.1% | 97.8 |
8 | strict-referenced | gpt-4o-mini | default | β | β | 68.5% | 79.4% | 66.1% | 80.8% | 68.7% | 150.6 |
9 | one-shot | gpt-4o-mini | default | β | β | 65.5% | 79.8% | 71.1% | 78.8% | 71.2% | 82.7 |
10 | ashai | gpt-4o-mini | default | β | β | 64.6% | 75.0% | 68.2% | 73.4% | 67.8% | 72.5 |
11 | strict-referenced-after | gpt-4o-mini | default | β | β | 63.1% | 74.5% | 65.4% | 71.4% | 60.9% | 379.4 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 59.5% | 74.0% | 57.6% | 59.8% | 50.2% | 369.0 |
13 | ashai-experimental | gpt-4o-mini | default | β | β | 6.7% | 1.2% | 0.0% | 17.7% | 2.6% | β |
14 | ashai-experimental | gpt-4o-mini | default | β | β | 6.5% | 1.7% | 0.0% | 16.0% | 2.9% | 0.1 |
V29 - 8-18-2025 9pm¶
Changes¶
- medical accuracy: evidence_sufficiency is 0 if no references are provided (strict)
- medical accuracy: removed `disclaimer_present`; disclaimers are now considered under safety only when warranted
- medical accuracy weights updated: evidence_sufficiency 30%, response_alignment 25%, safety 15%, medical_correctness 30% (sketched below)
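A minimal sketch of the v29 weighting described above; the sub-score names and weights come from this change list, while the function name, the 0-1 scale, and the dict layout are assumptions.

```python
# Weights from the v29 change note; everything else here is an illustrative assumption.
MED_ACC_WEIGHTS = {
    "evidence_sufficiency": 0.30,
    "response_alignment": 0.25,
    "safety": 0.15,
    "medical_correctness": 0.30,
}

def medical_accuracy(sub_scores: dict[str, float], has_references: bool) -> float:
    """Weighted medical-accuracy score; evidence_sufficiency is forced to 0 without references."""
    scores = dict(sub_scores)
    if not has_references:
        scores["evidence_sufficiency"] = 0.0  # strict: no references -> no evidence credit
    return sum(weight * scores.get(name, 0.0) for name, weight in MED_ACC_WEIGHTS.items())

# Example: strong sub-scores but no references -> evidence_sufficiency contributes nothing
print(medical_accuracy({"evidence_sufficiency": 0.9, "response_alignment": 0.9,
                        "safety": 0.9, "medical_correctness": 0.9}, has_references=False))
```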
Observations¶
- better job of down ranking things lacking references
- some are still clearly broken or timing out. unclear why
Leaderboard V29¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 72.4% | 79.2% | 78.4% | 83.7% | 67.6% | 53.6 |
2 | ashai | gpt-4o-mini | perplexity | β | β | 69.7% | 77.6% | 70.6% | 79.1% | 68.8% | 164.4 |
3 | one-shot | gpt-4o-mini | default | β | β | 69.1% | 81.5% | 74.9% | 85.2% | 76.1% | 123.7 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 68.6% | 79.6% | 69.0% | 82.8% | 65.1% | 111.8 |
5 | one-shot | gpt-4o-mini | default | β | β | 68.0% | 80.8% | 72.2% | 86.4% | 75.2% | 37.7 |
6 | strict-referenced-after | gpt-4o-mini | default | β | β | 67.9% | 81.4% | 68.9% | 67.0% | 64.6% | 292.7 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 67.0% | 79.2% | 64.1% | 80.9% | 60.6% | 324.7 |
8 | ashai | gpt-4o-mini | default | β | β | 65.9% | 73.3% | 68.6% | 79.0% | 65.5% | 273.0 |
9 | ashai | gpt-4o-mini | none | β | β | 63.9% | 78.4% | 66.2% | 80.1% | 68.8% | 27.8 |
10 | strict-referenced-after | gpt-4o-mini | default | β | β | 62.6% | 75.8% | 63.3% | 67.6% | 52.4% | 450.3 |
11 | ashai | gpt-4o-mini | default | β | β | 60.7% | 78.5% | 56.9% | 68.8% | 59.0% | 127.1 |
12 | strict-referenced | gpt-4o-mini | default | β | β | 60.5% | 73.6% | 62.8% | 72.5% | 54.8% | 121.4 |
13 | ashai-experimental | gpt-4o-mini | default | β | β | 5.6% | 2.9% | 0.0% | 17.9% | 2.0% | 0.1 |
14 | ashai-experimental | gpt-4o-mini | default | β | β | 5.3% | 0.4% | 0.0% | 13.3% | 2.9% | β |
V28 - 2025-08-18 5:00pm¶
Changes¶
- now sending references to evaluators (it was not there!)
- parallelized runs with batch_eval2
- increased max token size so evals don't get truncated
- messed a lot with odd timeout errors and key stuff
- output is in whatsapp format
- prompt modified to
Observations¶
- oneshot and others with no refs are still scoring too high, but they got taken down bigtime with the new ratings
- strict modes are breaking - maybe timing out? maybe having too big of context and timing out in the eval stage?
- these runs take a long long time... especially the evals now.
- Perplexity (sonar) leads at 77.2%; one-shot (gpt-4o-mini, default) is close (74-75%) and faster (~30-45s).
- Ashai (gpt-4o-mini, default) lands ~73% with ~92s latency; turning reasoning ON slightly hurts (71.8%) and does not speed it up.
- Perplexity tool usage lowers Ashai scores and adds large latency (71.5-67.7% at 144-176s) vs default/none - that shouldn't be true, we should weigh them more.
- Passing references to evaluators restored citation credit; top Med Acc now ~75-82% instead of flat caps.
Next steps¶
- perplexity results may not be counted as high-quality enough references?
Leaderboard V28¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | perplexity | sonar | default | β | β | 77.2% | 81.8% | 78.5% | 84.8% | 70.5% | 54.6 |
2 | one-shot | gpt-4o-mini | default | β | β | 74.6% | 80.3% | 72.4% | 85.2% | 79.5% | 43.2 |
3 | one-shot | gpt-4o-mini | default | β | β | 74.1% | 80.6% | 73.2% | 82.2% | 76.8% | 30.5 |
4 | ashai | gpt-4o-mini | default | β | β | 73.3% | 81.5% | 74.6% | 80.7% | 71.3% | 93.7 |
5 | ashai | gpt-4o-mini | default | β | β | 71.8% | 80.5% | 72.2% | 77.9% | 73.0% | 91.7 |
6 | ashai | gpt-4o-mini | perplexity | β | β | 71.5% | 79.2% | 69.6% | 77.7% | 66.7% | 143.8 |
7 | strict-referenced | gpt-4o-mini | default | β | β | 69.9% | 78.2% | 67.2% | 80.3% | 69.6% | 101.1 |
8 | strict-referenced-after | gpt-4o-mini | default | β | β | 69.8% | 80.7% | 67.5% | 69.6% | 69.3% | 187.8 |
9 | ashai | gpt-4o-mini | perplexity | β | β | 67.7% | 75.2% | 63.4% | 78.4% | 66.0% | 176.4 |
10 | ashai | gpt-4o-mini | none | β | β | 65.9% | 71.4% | 63.2% | 76.1% | 67.1% | 32.3 |
11 | strict-referenced | gpt-4o-mini | default | β | β | 64.9% | 73.7% | 58.7% | 81.8% | 57.4% | 106.1 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 64.4% | 74.9% | 60.7% | 67.6% | 56.3% | 203.0 |
13 | ashai-experimental | gpt-4o-mini | default | β | β | 5.8% | 2.9% | 0.0% | 20.2% | 2.7% | β |
14 | ashai-experimental | gpt-4o-mini | default | β | β | 5.3% | 2.9% | 0.0% | 18.2% | 2.9% | β |
15 | one-shot | gpt-4o-mini | default | β | β | 3.1% | 0.0% | 0.0% | 6.0% | 0.8% | β |
16 | strict-referenced-after | gpt-4o-mini | default | β | β | 3.1% | 0.4% | 0.0% | 6.7% | 0.4% | 0.1 |
17 | strict-referenced | gpt-4o-mini | default | β | β | 3.1% | 0.4% | 0.0% | 5.2% | 0.8% | β |
18 | one-shot | gpt-4o-mini | default | β | β | 2.8% | 0.0% | 0.0% | 5.9% | 1.1% | β |
2025-08-14 10:40pm - Evaluation V22¶
Observations¶
Leaderboard V22¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | perplexity | β | β | 89.9% | 93.6% | 92.5% | 91.5% | 84.0% | 138.2 |
2 | ashai | gpt-5 | none | β | β | 89.8% | 93.0% | 93.7% | 91.8% | 84.9% | 289.5 |
3 | ashai | gpt-5 | none | β | β | 89.3% | 93.8% | 94.2% | 91.7% | 85.6% | 220.9 |
4 | ashai | gpt-5 | perplexity | β | β | 81.8% | 85.7% | 85.0% | 83.9% | 78.5% | 175.2 |
5 | ashai | gpt-5 | default | β | β | 67.2% | 68.9% | 68.2% | 70.0% | 62.1% | 153.8 |
6 | ashai | gpt-5 | default | β | β | 61.7% | 63.3% | 63.2% | 66.6% | 57.1% | 219.9 |
2025-08-14 1pm - Evaluation V21¶
Here just looking at changing the models and tools - focusing on ashai only.
Changes¶
- just focusing on ashai for this round and its different combinations
- niharika url fixed
- hard enforcing that calls with no tools have 0 references
- brought Tool use discipline section into the system prompt for easier access/adjusting
- evaluators follow json schema
- upgraded all openai calls to the new Responses API and enforced json_schemas
Observations¶
- tools still degrade quality
- i have no idea why tools are
Next¶
- having no tools seems to not hurt med acc - this is a bug. why?!?! The opposite is shown in v21.json - so what is going on. let's re-run
Leaderboard V21¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 88.9% | 93.8% | 93.9% | 90.9% | 86.8% | 156.0 |
2 | ashai | gpt-5 | none | β | β | 88.3% | 91.8% | 94.2% | 90.2% | 85.0% | 195.4 |
3 | ashai | gpt-5 | perplexity | β | β | 74.7% | 77.4% | 78.2% | 76.8% | 71.5% | 118.3 |
4 | ashai | gpt-5 | perplexity | β | β | 65.6% | 68.0% | 66.6% | 69.3% | 63.3% | 98.9 |
5 | ashai | gpt-5 | default | β | β | 60.2% | 63.1% | 60.2% | 64.2% | 56.2% | 165.4 |
6 | ashai | gpt-5 | default | β | β | 25.4% | 23.6% | 24.2% | 30.4% | 21.7% | 110.2 |
V18-V20¶
Skipped because they were used for debugging
2025-06-13 11pm - Evaluation V17¶
Changes¶
- completely re-wrote perplexity's prompt
- added playground prompt
- made pubmed and perplexity tools async
- async batch eval - we'll see how it works...
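A minimal sketch of the async tool + batch-eval pattern these changes describe; the tool functions here are hypothetical stand-ins for the real pubmed/perplexity wrappers, not the repo's actual code.

```python
import asyncio

async def search_perplexity(question: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for the real HTTP call
    return f"perplexity evidence for: {question}"

async def search_pubmed(question: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for the real HTTP call
    return f"pubmed evidence for: {question}"

async def answer_case(question: str) -> list[str]:
    # run both tool calls concurrently instead of one after the other
    return list(await asyncio.gather(search_perplexity(question), search_pubmed(question)))

async def run_batch(questions: list[str]) -> list[list[str]]:
    # evaluate all cases concurrently; a semaphore could cap concurrency if needed
    return list(await asyncio.gather(*(answer_case(q) for q in questions)))

if __name__ == "__main__":
    print(asyncio.run(run_batch(["Is a mild headache normal at 20 weeks?"])))
```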
Observations¶
- gpt-5 (no tools) is back on top at 91.4% and faster (48.8s), a big jump vs V16's 76.2% @ 67.1s. Confirms earlier bugfixes/changes took effect; verify no data leakage. Might be making up its links though.
- gpt-5 with default/tools still broken (18-21% and ~115s). Tool path regression persists despite other improvements.
- Perplexity prompt rewrite helped: `perplexity (sonar)` now 77.9% (+6.1 vs V16's 71.8%), with higher latency (16.4s vs ~6s).
- One-shot (`gpt-4o-mini`) remains strong at 78.2% and fast. Reasoning ON > OFF (78.2% @ 33.9s vs 75.5% @ 38.6s).
- `strict-referenced` is solid at 76.6% with high Med Acc (84.2%); `strict-referenced-after` still lags (62.4%) due to low precision.
- Async batch reduced wall time, but `ashai + perplexity` stays slow (64-91s). Async tools didn't materially lower per-run latency.
- Outlier: `ashai gpt-4o-mini default` shows a 742s avg - likely a stuck/mis-measured run; exclude from speed comparisons.
Next Steps¶
- Niharika noticed hallucinated niharika.org URLs, and ashai including links from other FAQs
Leaderboard for V17¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 91.4% | 93.1% | 92.9% | 91.2% | 85.1% | 48.8 |
2 | one-shot | gpt-4o-mini | default | β | β | 78.2% | 84.8% | 73.8% | 85.7% | 75.1% | 33.9 |
3 | perplexity | sonar | default | β | β | 77.9% | 81.1% | 78.7% | 82.5% | 64.1% | 16.4 |
4 | strict-referenced | gpt-4o-mini | default | β | β | 76.6% | 84.2% | 77.8% | 76.2% | 67.2% | 21.8 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 76.4% | 83.0% | 74.4% | 83.6% | 68.2% | 91.4 |
6 | ashai | gpt-4o-mini | none | β | β | 76.4% | 81.2% | 74.8% | 85.3% | 70.6% | 15.7 |
7 | one-shot | gpt-4o-mini | default | β | β | 75.5% | 82.1% | 72.0% | 80.8% | 70.2% | 38.6 |
8 | september | gpt-4o-mini | default | β | β | 74.4% | 81.9% | 70.2% | 83.2% | 73.0% | 14.2 |
9 | ashai | gpt-4o-mini | perplexity | β | β | 74.1% | 81.2% | 71.4% | 81.2% | 66.3% | 64.4 |
10 | ashai | gpt-4o-mini | default | β | β | 73.2% | 83.0% | 71.9% | 78.5% | 63.8% | 32.8 |
11 | ashai | gpt-4o-mini | default | β | β | 72.9% | 79.2% | 69.2% | 81.7% | 67.7% | 742.2 |
12 | strict-referenced-after | gpt-4o-mini | default | β | β | 62.4% | 80.0% | 54.1% | 56.8% | 49.8% | 31.5 |
13 | ashai-experiment | gpt-5 | default | β | β | 21.2% | 17.3% | 17.5% | 26.7% | 15.9% | 115.6 |
14 | ashai | gpt-5 | default | β | β | 18.2% | 15.6% | 15.8% | 24.0% | 14.8% | 114.4 |
2025-06-13 1pm - Evaluation V16¶
Changes¶
- improved dataset format organizing evaluation objects into one, renaming and improving guidance
- added question and risk_level to the evaluator information set
- organized evaluators to more clearly show what the evaluator data vs user input is
- started an ashai-experiment which will have some extra guidance that I hope will help
Observations¶
- interestingly the ashai experiment did slightly worse than regular ashai. I thought the extra prompt instructions (which gave it hints about how it would be evaluated) would help, but it seemed not to.
- crazy weirdly though it took way less time? Only 10s vs ashai's 60? Like what?! something odd there
- oneshot with no tools and reasoning turned on crushed it. except dang it was slow again? how?! why?
- perplexity's speed is all over the place... no clue why.
Next Step¶
- run evals in parallel... man it takes forever
- improve perplexity's prompt
- changed user prompt for ashai - to not split out the last message but give full thread
Leaderboard for V16¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-5 | default | β | β | 92.0% | 93.2% | 94.3% | 91.6% | 89.2% | 61.4 |
2 | ashai | gpt-5 | none | β | β | 76.2% | 77.8% | 78.2% | 76.6% | 69.8% | 67.1 |
3 | ashai-experiment | gpt-4o-mini | perplexity | β | β | 76.1% | 82.6% | 76.6% | 85.2% | 70.0% | 20.7 |
4 | ashai-experiment | gpt-4o-mini | default | β | β | 75.5% | 84.2% | 73.7% | 85.3% | 66.9% | 10.0 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 75.4% | 79.2% | 78.2% | 79.2% | 68.5% | 26.7 |
6 | ashai-experiment | gpt-4o-mini | default | β | β | 75.1% | 83.8% | 72.5% | 85.1% | 64.8% | 17.8 |
7 | strict-referenced | gpt-4o-mini | perplexity | β | β | 75.1% | 81.8% | 73.8% | 77.7% | 63.1% | 24.0 |
8 | strict-referenced | gpt-4o-mini | perplexity | β | β | 74.5% | 80.5% | 73.6% | 80.2% | 64.2% | 21.6 |
9 | perplexity | sonar | default | β | β | 71.8% | 76.0% | 73.7% | 81.3% | 62.4% | 6.0 |
10 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 68.5% | 77.3% | 66.2% | 68.3% | 59.7% | 24.1 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 61.0% | 73.2% | 55.5% | 65.2% | 48.2% | 24.1 |
12 | ashai | gpt-5 | default | β | β | 41.2% | 39.8% | 39.3% | 44.8% | 36.0% | 100.4 |
13 | strict-referenced | gpt-5 | perplexity | β | β | 18.3% | 16.2% | 14.0% | 24.4% | 13.2% | 115.0 |
14 | ashai | gpt-5 | default | β | β | 12.2% | 7.8% | 8.2% | 17.6% | 7.2% | 116.9 |
15 | strict-referenced | gpt-5 | perplexity | β | β | 3.4% | 0.8% | 0.0% | 9.2% | 0.1% | 120.1 |
2025-06-13 12pm - Evaluation V15¶
Just doing some testing here... ignore this version
2025-08-12 3pm - Evaluation v14¶
Changes¶
- perplexity and chatgpt-5 issue fixed I think
Observations¶
- gpt-5 with no tools is winning - works better without tools, sadly, but it's winning and working now
- turning off reasoning seems to be broken - need to debug
- speed seems to be getting worse? not sure why
- strict-referenced is doing well! though it oddly has lower medical accuracy than gpt-5; wonder why, especially because it should be requiring references
- perplexity with strict ref gpt-5 is still having issues?
Next Steps¶
- fix the no-reasoning toggle issues
- check gpt-5 references - is it adding them? are they actually right?
- perplexity should be doing better... look into why it isn't
Leaderboard for V14¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 83.5% | 90.7% | 81.7% | 89.2% | 78.5% | 69.1 |
2 | one-shot | gpt-5 | default | β | β | 81.1% | 85.1% | 82.2% | 83.2% | 80.5% | 53.6 |
3 | ashai | gpt-4o-mini | default | β | β | 76.0% | 82.4% | 79.1% | 83.9% | 70.0% | 18.4 |
4 | ashai | gpt-4o-mini | none | β | β | 75.5% | 84.6% | 74.8% | 85.0% | 67.2% | 7.6 |
5 | strict-referenced | gpt-4o-mini | perplexity | β | β | 74.7% | 82.0% | 75.9% | 83.2% | 66.5% | 29.2 |
6 | ashai | gpt-4o-mini | perplexity | β | β | 72.4% | 80.4% | 75.2% | 83.4% | 66.8% | 25.4 |
7 | strict-referenced | gpt-4o-mini | perplexity | β | β | 72.1% | 79.2% | 71.7% | 78.3% | 66.5% | 25.1 |
8 | perplexity | sonar | default | β | β | 70.6% | 74.5% | 72.8% | 79.8% | 61.5% | 7.3 |
9 | perplexity | sonar | default | β | β | 70.4% | 76.1% | 71.2% | 78.8% | 61.5% | 7.8 |
10 | ashai | gpt-4o-mini | perplexity | β | β | 68.7% | 74.9% | 69.4% | 81.2% | 64.8% | 26.0 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 53.5% | 62.7% | 51.0% | 59.5% | 41.9% | 31.3 |
12 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 49.3% | 67.9% | 41.3% | 47.6% | 33.8% | 19.3 |
13 | strict-referenced | gpt-5 | perplexity | β | β | 19.1% | 17.2% | 14.7% | 19.8% | 14.8% | 118.6 |
14 | ashai | gpt-5 | default | β | β | 19.1% | 17.1% | 15.3% | 20.2% | 14.7% | 117.8 |
15 | ashai | gpt-5 | default | β | β | 12.1% | 9.8% | 7.8% | 13.5% | 7.7% | 116.1 |
2025-11-11 9:00pm - Evaluation V13¶
Changes¶
- full rewrite of evals, which were getting limited context
- evaluator is now gpt-5
- medical evaluator improvements: implemented stricter reference validation and structured outputs (see the schema sketch after this list)
- language now checked only in language clarity (instead of all) and given more context
- better eval error handling - stopped graceful scoring
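A minimal sketch of what "structured outputs" for the medical evaluator could look like, assuming a pydantic-v2-style schema; the field names are illustrative, not the evaluator's actual contract:

```python
from pydantic import BaseModel, Field

# Sketch assuming a pydantic v2 schema for the evaluator's structured output;
# the field names are illustrative, not the evaluator's actual contract.
class MedicalEvaluation(BaseModel):
    medical_accuracy: float = Field(ge=0, le=100)
    references_valid: bool
    invalid_references: list[str] = []
    rationale: str

raw = {
    "medical_accuracy": 82.0,
    "references_valid": False,
    "invalid_references": ["https://example.org/broken-citation"],
    "rationale": "One citation does not support the claim it is attached to.",
}
print(MedicalEvaluation.model_validate(raw))
```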
Key observations¶
- the new evals took scores down decently which is good
- one of the strict-referenced did quite well!
- lol gpt-5 did the best and the worst! sadly it does the worst with tools - so something's just broken there
next steps¶
- log more stuff for easy reading (or figure out browsing weave better)
- make gpt-5 work with tools - I cannot for the life of me make it work with perplexity. so odd.
- i thought there was a really good strict-referenced but it disappeared? get those higher!
Leaderboard Snapshot for V13¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | β | β | 91.0% | 92.2% | 90.5% | 89.5% | 83.7% | 73.0 |
2 | ashai | gpt-4o-mini | none | β | β | 76.6% | 83.6% | 73.2% | 84.9% | 67.2% | 8.1 |
3 | one-shot | gpt-4o-mini | default | β | β | 75.5% | 83.5% | 77.7% | 86.8% | 75.0% | 8.1 |
4 | one-shot | gpt-4o-mini | default | β | β | 74.5% | 84.0% | 73.3% | 85.1% | 71.1% | 5.5 |
5 | september | gpt-4o-mini | default | β | β | 74.4% | 79.4% | 74.5% | 83.7% | 70.0% | 15.6 |
6 | perplexity | sonar | default | β | β | 72.7% | 79.5% | 76.8% | 81.5% | 62.3% | 7.7 |
7 | perplexity | sonar | default | β | β | 70.5% | 75.8% | 73.8% | 79.3% | 62.9% | 7.7 |
8 | ashai | gpt-4o-mini | default | β | β | 69.8% | 77.3% | 70.8% | 78.3% | 63.9% | 45.2 |
9 | ashai | gpt-4o-mini | default | β | β | 68.7% | 78.0% | 70.0% | 66.0% | 65.3% | 60.6 |
10 | ashai | gpt-4o-mini | perplexity | β | β | 62.8% | 72.7% | 59.4% | 76.9% | 56.1% | 66.2 |
11 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 53.8% | 64.3% | 47.3% | 58.4% | 40.8% | 38.5 |
12 | ashai | gpt-4o-mini | perplexity | β | β | 51.6% | 59.2% | 45.3% | 64.2% | 48.4% | 87.4 |
13 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 47.6% | 67.8% | 33.3% | 45.8% | 33.2% | 55.2 |
14 | strict-referenced | gpt-4o-mini | perplexity | β | β | 16.6% | 24.1% | 2.8% | 26.2% | 6.0% | 110.6 |
15 | strict-referenced | gpt-4o-mini | perplexity | β | β | 15.3% | 25.7% | 3.0% | 18.9% | 4.5% | 140.2 |
16 | ashai | gpt-5 | default | β | β | 4.7% | 0.8% | 0.0% | 4.0% | 1.0% | 160.0 |
2025-08-11 2:00pm - Evaluation V12¶
used this one for testing. iterate on to v13
2025-08-10 7:30pm - Evaluation V11¶
Changes¶
- SPEED fixes! - we were just reporting it wrong... (see the timing sketch after this list)
- another fix for reasoning, and the tool-less models now seem to go quite fast (5s) as expected
- many fixes to docs and tests - though those don't impact evals much
- even with perplexity or other tools turned on responses are now around 8s.
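For the speed-reporting fix, a small sketch of the straightforward way to time each response with `time.perf_counter`; `run_agent` is a placeholder for the real call, not the project's actual function:

```python
import time

# Sketch: wrap the agent call with perf_counter so the reported latency matches
# wall-clock time per response. run_agent is a placeholder for the real call.
def timed_run(run_agent, *args, **kwargs):
    start = time.perf_counter()
    result = run_agent(*args, **kwargs)
    return result, time.perf_counter() - start

result, seconds = timed_run(lambda q: q.upper(), "how long does this take?")
print(f"{seconds:.4f}s")
```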
Key observations¶
- something broke with the med eval - it's always 60%
- interestingly, one-shot took a slight 2% hit
- will come back to make more observations after i get what's left of today's sunshine
- something is still definitely broken with perplexity agent
- gpt-5 with no tools is winning
- ashai gpt-5 is broken... what?
Next Steps¶
- experiment with and probably fix the retry option
- fix perplexity agent
- get some sunshine for a bit
Leaderboard Snapshot for V11¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-5 | default | β | β | 76.1% | 60.0% | 90.4% | 89.2% | 87.1% | 9.8 |
2 | ashai | gpt-4o-mini | default | β | β | 74.6% | 60.0% | 88.8% | 88.3% | 77.5% | 9.3 |
3 | september | gpt-4o-mini | default | β | β | 74.4% | 60.0% | 88.3% | 87.5% | 77.9% | 3.8 |
4 | ashai | gpt-4o-mini | none | β | β | 74.1% | 60.0% | 87.1% | 87.5% | 77.9% | 4.5 |
5 | ashai | gpt-4o-mini | perplexity | β | β | 73.7% | 60.0% | 87.9% | 85.8% | 75.8% | 13.8 |
6 | one-shot | gpt-4o-mini | default | β | β | 72.0% | 59.2% | 81.2% | 87.5% | 75.4% | 5.3 |
7 | ashai | gpt-4o-mini | perplexity | β | β | 60.2% | 50.0% | 68.3% | 72.5% | 60.8% | 15.8 |
8 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 51.2% | 60.0% | 43.8% | 49.6% | 33.3% | 10.6 |
9 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 49.0% | 60.0% | 38.8% | 47.5% | 28.3% | 11.7 |
10 | strict-referenced | gpt-4o-mini | perplexity | β | β | 37.8% | 55.0% | 26.7% | 26.7% | 10.0% | 15.8 |
11 | strict-referenced | gpt-4o-mini | perplexity | β | β | 27.3% | 40.0% | 17.5% | 20.8% | 7.5% | 15.7 |
12 | perplexity | sonar | default | β | β | 18.0% | 35.0% | 2.5% | 8.3% | 0.0% | 2.9 |
13 | ashai | gpt-5 | default | β | β | 1.0% | 0.0% | 0.0% | 5.0% | 0.0% | 16.3 |
2025-08-10 7:30pm - Evaluation V10¶
Oops! Was calling the tool names wrong; incrementing to v11
2025-08-10 4pm - Evaluation V9¶
Changes¶
- simplified agents and made reasoning actually able to be turned off (it was not before) - expecting speeds to be much faster for models with reasoning turned off, hopefully?!
- one-shot agent - trying to get speed here, let's see how a simple one-shot agent does
- docs, tests and python validator fixes everywhere
- moved a lot of endpoints to /agent/*
Key observations:¶
- turning off reasoning is significantly faster - though still at ~30s! but oddly not for the one-shot?! confusing...
- the basic new `one-shot` model is doing the best... and fastest... ugh.
- september did surprisingly well?!? i wonder if it's just tricking the med evaluator, which i still haven't made strong
- suspicious as to how many scored 60% on med eval... all the top ones. very exact and odd number... need to look into the more detailed wandb stats on that (see the distribution check after this list). i wouldn't doubt if some areas are getting a 100 and others a 0.
- ugh why did perplexity do so terribly. i'm doubting my evaluators...
- One-shot with `gpt-4o-mini` leads on Overall (75.5%) and is fast (~41-48s); `ashai` is close (74.9%) but ~2-3× slower (113s).
- Turning off reasoning did not improve speed for one-shot (65.6s vs 41-48s) and held roughly similar quality (~74%). Likely implementation/length behavior needs profiling.
- Perplexity-based variants cratered (~18-22% Overall) with near-zero Precision/Empathy; a clear configuration/prompting regression.
- All strict-referenced variants collapsed (~18-20% Overall, Med Acc ~35-40%) vs V8's ~72%, indicating a breaking change in prompts, tools, or judging logic.
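For the suspiciously exact 60% medical-accuracy scores flagged above, a quick sketch of the per-case distribution check worth running on the exported scores; if the judge only emits a few discrete values, a round mean like 60% falls out naturally (the data below is made up for illustration):

```python
from collections import Counter

# Made-up data, just to illustrate the check: if the judge only emits a few
# discrete values (e.g. 100s and 0s), the mean can land on a suspiciously round 60.
def summarize(per_case_scores: list[float]) -> None:
    mean = sum(per_case_scores) / len(per_case_scores)
    print(f"mean={mean:.1f}  distribution={dict(Counter(per_case_scores))}")

summarize([100, 100, 100, 0, 0, 60, 60, 60, 60, 60])  # mean=60.0
```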
Next Steps¶
- Debug strict-referenced flows: compare V8 vs V9 prompts/tool calls; add regression checks to ensure citations and extraction behave as expected.
- Fix Perplexity integration: verify tool instructions, response parsing, and citation routing; run a minimal sanity suite before full eval.
- Profile the reasoning-off path: confirm the flag actually alters model behavior (no hidden chain-of-thought or retries); cap output lengths and temperature to reduce latency.
- Promote one-shot as a "fast default" path; tune for speed (shorter context, stricter max tokens) without regressing quality.
- Target Med Acc >60%: refine dataset/judger for medical correctness and reference validation; add a few hard cases with clear gold references.
- Add a small ablation with retry enabled for top configs to quantify benefit vs latency (see the grid sketch below).
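For the ablation item, a tiny sketch of the config grid (reasoning on/off × retry on/off) over the top configs; `run_eval` is a stand-in for the existing ./eval entry point and the names here are assumptions:

```python
from itertools import product

# Sketch of a small ablation grid over the top configs: reasoning on/off x retry on/off.
# run_eval is a stand-in for the existing ./eval entry point; names are assumptions.
def run_eval(agent: str, model: str, reasoning: bool, retry: bool) -> None:
    print(f"eval {agent}/{model} reasoning={reasoning} retry={retry}")

top_configs = [("one-shot", "gpt-4o-mini"), ("ashai", "gpt-4o-mini")]
for (agent, model), reasoning, retry in product(top_configs, [True, False], [True, False]):
    run_eval(agent, model, reasoning, retry)
```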
Leaderboard Snapshot for V9¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | one-shot | gpt-4o-mini | default | β | β | 75.5% | 60.0% | 90.0% | 90.0% | 80.0% | 41.2 |
2 | ashai | gpt-4o-mini | default | β | β | 74.9% | 60.0% | 89.2% | 89.2% | 77.9% | 113.3 |
3 | one-shot | gpt-4o-mini | default | β | β | 74.5% | 60.0% | 87.9% | 87.3% | 80.8% | 48.3 |
4 | one-shot | gpt-4o-mini | default | β | β | 74.3% | 60.0% | 86.7% | 88.3% | 80.0% | 65.6 |
5 | september | gpt-4o-mini | default | β | β | 73.5% | 60.0% | 86.5% | 85.8% | 77.5% | 53.8 |
6 | ashai | gpt-4o-mini | none | β | β | 70.6% | 59.2% | 81.2% | 81.7% | 72.9% | 41.5 |
7 | perplexity | sonar | default | β | β | 22.5% | 45.0% | 2.5% | 8.3% | 0.0% | 52.6 |
8 | ashai | gpt-4o-mini | perplexity | β | β | 21.4% | 40.0% | 4.2% | 11.7% | 0.0% | 41.9 |
9 | strict-referenced | gpt-4o-mini | default | β | β | 20.0% | 40.0% | 0.0% | 10.0% | 0.0% | 23.2 |
10 | strict-referenced | gpt-4o-mini | default | β | β | 19.8% | 39.2% | 0.8% | 10.0% | 0.0% | 46.4 |
11 | ashai | gpt-4o-mini | perplexity | β | β | 19.1% | 35.0% | 4.2% | 11.7% | 0.0% | 29.7 |
12 | strict-referenced | gpt-4o-mini | perplexity | β | β | 17.8% | 35.0% | 0.8% | 9.2% | 0.0% | 39.0 |
13 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 17.8% | 35.0% | 0.0% | 10.0% | 0.0% | 48.5 |
14 | strict-referenced | gpt-4o-mini | perplexity | β | β | 17.8% | 35.0% | 0.8% | 9.2% | 0.0% | 36.9 |
15 | strict-referenced-after | gpt-4o-mini | default | β | β | 17.6% | 35.0% | 0.0% | 9.2% | 0.0% | 33.6 |
16 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 15.5% | 30.0% | 0.0% | 10.0% | 0.0% | 36.9 |
2025-01-10 - Evaluation V8 - testing only strict references¶
Changes¶
- adding reasoning and retry flags to table output (ensure now tracked in wandb)
- strict-referenced-after - a new version of ashai that first guesses and then checks references after
- strict-referenced updated to be a small wrapper around ashai - keeps it more similar while still having the strict reference requirements
Key observations:¶
- these are still taking FOREVER!
- so weird that strict-referenced-after with just perplexity won, but strict-referenced with just perplexity did really poorly. I think something there must be broken, as strict-referenced with all tools (including perplexity) did well.
- strict-after didn't seem to be much better than before. but i think the judgment of the references is poor still.
- Guess-then-check (`strict-referenced-after`) with `gpt-5-mini` + `perplexity` leads, driven by strong precision/clarity; Med Acc is still capped at 60%.
- For `gpt-4o-mini`, the baseline `strict-referenced` slightly edges `-after` by ~1-2 pts; simpler toolsets (e.g., `september`, `niharika`) are as good as larger ones.
- `perplexity`-only with `gpt-4o-mini` is severely degraded (and very slow) for both `strict` and `-after`; likely a configuration or prompting mismatch.
- the fact that september (and even niharika) did as well as they did makes me wonder how well the medical accuracy eval is doing
- With no tools, strict agents perform poorly (citations required) vs prior non-strict agents.
- Reasoning was on and retry off for these runs; need ablations to quantify their impact under strict modes.
Next Steps¶
- tighten up medical accuracy to really enforce valid references
- make a generic agent class and try a very 'basic' agent that doesn't use tools - get speed
- debug strict-referenced with perplexity combo
- Optimize speed: cache Perplexity responses (see the caching sketch after this list), parallelize tool calls, cap depth/num results; profile slow paths.
- Run ablations: reasoning on/off and controlled retry enabled for top configs to measure net gain vs latency.
- Validate low-scoring rows for config bugs; cross-check logs and W&B artifacts for anomalies.
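For the caching item above, a minimal sketch of a disk cache keyed on the query hash so repeated eval runs stop re-paying Perplexity latency; `call_perplexity` is a placeholder for the real tool function, not the project's actual API:

```python
import hashlib
import json
from pathlib import Path

# Sketch of a simple disk cache for Perplexity lookups so repeated eval runs don't
# re-pay the network latency; call_perplexity is a placeholder for the real tool.
CACHE_DIR = Path(".perplexity_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_search(query: str, call_perplexity) -> dict:
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = call_perplexity(query)  # only hit the network on a cache miss
    path.write_text(json.dumps(result))
    return result

# usage sketch: cached_search("folic acid dosage in pregnancy", call_perplexity=my_tool_fn)
```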
Leaderboard Snapshot for V8¶
Rank | Agent | Model | Tools | π§ | π | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | strict-referenced-after | gpt-5-mini | perplexity | β | β | 75.6% | 60.0% | 90.8% | 88.3% | 82.1% | 106.2 |
2 | strict-referenced | gpt-4o-mini | september, niharika, perplexity, pubmed | β | β | 73.9% | 60.0% | 87.5% | 86.7% | 76.7% | 90.5 |
3 | strict-referenced | gpt-4o-mini | september, niharika, pubmed | β | β | 73.5% | 60.0% | 87.5% | 85.8% | 74.6% | 49.7 |
4 | strict-referenced | gpt-4o-mini | september, niharika | β | β | 73.4% | 60.0% | 86.5% | 86.3% | 75.4% | 55.1 |
5 | strict-referenced-after | gpt-4o-mini | september, niharika, pubmed | β | β | 72.8% | 60.0% | 85.4% | 85.0% | 74.6% | 79.2 |
6 | strict-referenced-after | gpt-4o-mini | september | β | β | 72.4% | 60.0% | 82.1% | 87.5% | 73.8% | 43.7 |
7 | strict-referenced-after | gpt-4o-mini | default | β | β | 72.4% | 60.0% | 83.8% | 85.4% | 74.2% | 88.5 |
8 | strict-referenced-after | gpt-4o-mini | september, niharika | β | β | 72.1% | 60.0% | 82.1% | 85.8% | 74.2% | 66.9 |
9 | strict-referenced-after | gpt-4o-mini | september, niharika, perplexity, pubmed | β | β | 72.0% | 60.0% | 82.9% | 85.0% | 72.9% | 71.9 |
10 | strict-referenced | gpt-4o-mini | niharika | β | β | 68.4% | 60.0% | 78.3% | 76.7% | 65.0% | 74.1 |
11 | strict-referenced-after | gpt-4o-mini | niharika | β | β | 65.9% | 60.0% | 72.1% | 74.2% | 60.0% | 57.6 |
12 | strict-referenced | gpt-5-mini | perplexity | β | β | 45.3% | 60.0% | 35.0% | 37.5% | 20.8% | 144.1 |
13 | strict-referenced-after | gpt-4o-mini | perplexity | β | β | 43.0% | 60.0% | 28.3% | 35.8% | 17.5% | 164.1 |
14 | strict-referenced-after | gpt-4o-mini | none | β | β | 39.1% | 56.7% | 25.0% | 30.0% | 13.3% | 28.3 |
15 | strict-referenced | gpt-4o-mini | none | β | β | 34.5% | 54.2% | 21.7% | 18.3% | 10.0% | 39.7 |
16 | strict-referenced | gpt-4o-mini | perplexity | β | β | 31.2% | 45.0% | 21.7% | 23.3% | 8.3% | 187.7 |
2025-08-09 - Evaluation V7¶
Changes¶
- Debugged GPT-5, increased timeout limit so it can complete
- Added `--no-reasoning` flag to the ./eval script to test whether reasoning is helping or not
- Added `--eval-retry` flag to allow models to double-check themselves and try again with feedback if the response is poor (can add a lot of time)
- Added `--verbose` flag to eval for easier debugging
- Lowered empathy score importance from 15% to 10% and made it harder; all results were coming in at a consistent 80% (see the weighting sketch after this list)
- utilized the "Language" tag in example datasets and now more solidly judging that the response is in the correct language.
- added an example_description key to the dataset so that messages could be in native languages but you could still easily understand what the example was about.
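For the empathy re-weighting, a small sketch of the weighted overall score; only the 10% empathy weight comes from the note above, the other weights are placeholders for illustration:

```python
# Sketch of the weighted overall score after lowering empathy to 10%; only the
# empathy weight comes from the note above, the other weights are placeholders.
WEIGHTS = {
    "medical_accuracy": 0.40,
    "precision": 0.25,
    "clarity": 0.25,
    "empathy": 0.10,  # was 0.15 before this change
}

def overall(scores: dict[str, float]) -> float:
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must stay normalized
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

print(overall({"medical_accuracy": 90, "precision": 85, "clarity": 88, "empathy": 80}))
# -> 87.25
```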
Key observations:¶
- GPT-5 is now debugged a bit but still super slow (even nano and mini)
- GPT-5 performs the best (as expected) but is insanely slow still
- GPT-5 with no tools significantly outperforms other configurations - tools aren't yet helping!
- Reasoning seems to help performance based on results
- One row is 10 hours! This was because my laptop went to sleep and restarted when it awoke :).
- gpt-5 with reasoning (`step`) got 89% but without got 76%
Next Steps¶
- Figure out speed issues with GPT-5
- better handle timeout issues
- better handle issues where the AI cannot come up with an answer
- maybe focus just on strict mode for a while - and make the tools actually help
- make strict mode more like Ashai (except for its requirements)
- will tighten ashai's tool prompt options and make it work more with strict mode
- try a strict-reference-after approach
Leaderboard Snapshot for V7¶
Rank | Agent | Model | Tools | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-5 | none | 89.5% | 88.3% | 91.7% | 90.0% | 88.3% | 147.4 |
2 | ashai | gpt-4o-mini | default | 83.9% | 81.7% | 90.0% | 83.3% | 80.0% | 30.7 |
3 | ashai | gpt-5-mini | none | 76.2% | 60.0% | 90.8% | 90.0% | 85.4% | 46.4 |
4 | ashai | gpt-5 | none | 76.0% | 60.0% | 90.8% | 89.2% | 84.2% | 97.0 |
5 | ashai | gpt-4o-mini | default | 74.8% | 76.7% | 71.7% | 80.0% | 63.3% | 34.2 |
6 | ashai | gpt-5-mini | default | 72.1% | 59.2% | 82.5% | 85.8% | 76.7% | 156.9 |
7 | ashai | gpt-5-nano | default | 69.2% | 55.0% | 82.9% | 80.8% | 75.8% | 181.1 |
8 | ashai | gpt-3.5-turbo | default | 65.7% | 60.0% | 66.2% | 79.2% | 62.9% | 34.1 |
9 | ashai | gpt-5 | perplexity | 56.3% | 53.3% | 60.0% | 60.0% | 53.3% | 173.8 |
10 | ashai | gpt-5 | default | 51.3% | 40.0% | 60.4% | 62.9% | 55.8% | 17066.4 |
11 | ashai | gpt-5 | default | 50.0% | 50.0% | 50.0% | 50.0% | 50.0% | 162.5 |
12 | ashai | gpt-5 | default | 30.4% | 30.0% | 31.7% | 30.0% | 30.0% | 170.1 |
2025-08-09 15:00 - Evaluation V5¶
Changes¶
Initial batch evals show GPT-5 underperforming vs GPT-4o-mini; tool usage impact appears small.
Key observations:¶
- GPT-5 is failing, likely due to usage/formatting issues; investigating. GPT-5 should be better.
- Surprising: Perplexity underperforms; likely prompt/config differences to investigate.
- empathy scores are the same across the board
Next Steps¶
- Add eval options to disable reasoning.
- Add eval option for eval-retry (model gets feedback and retries).
- Improve eval stability, then consider disabling all tools except Perplexity.
- Test with strict-referenced.
- Try a "strict-ref-check-after": let GPT answer naturally, then perform a reference lookup/validation step (see the sketch after this list).
- lower empathy's score and make it harder/tighter
- don't punish answers where it abstained because it didn't have enough data
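For the "strict-ref-check-after" idea, a rough sketch of the answer-then-validate loop; `ask_model` is a stub for the real LLM call and the URL reachability check is deliberately naive, so treat every name here as an assumption rather than the project's actual code:

```python
import re
import urllib.request

URL_RE = re.compile(r"https?://[^\s)\]]+")

# Sketch of the answer-then-validate idea; ask_model is a stub for the real LLM
# call, and the reachability check is deliberately simple. Names are assumptions.
def ask_model(prompt: str) -> str:
    return "Stub answer citing https://example.com/"  # replace with the real call

def url_is_reachable(url: str, timeout: float = 5.0) -> bool:
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def answer_then_check(question: str) -> str:
    draft = ask_model(question)
    bad = [u for u in URL_RE.findall(draft) if not url_is_reachable(u)]
    if not bad:
        return draft
    # Second pass: ask the model to drop or replace the unverifiable references.
    return ask_model(
        f"{question}\n\nRevise your previous answer; these references could not "
        f"be verified: {', '.join(bad)}"
    )

print(answer_then_check("What vaccines are recommended in the third trimester?"))
```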
Leaderboard Snapshot for V5¶
Rank | Agent | Model | Tools | Overall | Med Acc | Precision | Clarity | Empathy | Avg s |
---|---|---|---|---|---|---|---|---|---|
1 | ashai | gpt-4o-mini | default | 89.6% | 91.3% | 85.8% | 91.7% | 88.8% | 56.0 |
2 | ashai | gpt-4o-mini | september health, niharika faqs, pubmed | 89.4% | 90.4% | 86.7% | 91.7% | 88.3% | 51.5 |
3 | strict-referenced | gpt-4o-mini | default | 89.2% | 90.4% | 86.2% | 91.3% | 87.9% | 65.3 |
4 | ashai | gpt-4o-mini | none | 88.8% | 89.6% | 86.3% | 90.0% | 89.6% | 27.9 |
5 | ashai | gpt-4o-mini | niharika faqs, pubmed, perplexity | 88.6% | 90.0% | 86.2% | 89.2% | 88.3% | 45.1 |
6 | ashai | gpt-4o-mini | perplexity | 82.8% | 82.9% | 79.2% | 82.1% | 89.2% | 71.8 |
7 | ashai | gpt-4.1 | default | 76.4% | 79.6% | 57.5% | 85.0% | 87.9% | 42.9 |
8 | ashai | gpt-3.5-turbo | default | 72.6% | 71.2% | 57.5% | 84.6% | 85.4% | 44.4 |
9 | perplexity | sonar | default | 62.5% | 74.2% | 14.2% | 82.1% | 85.8% | 14.8 |
10 | perplexity | sonar | default | 62.0% | 75.0% | 14.2% | 78.8% | 84.6% | 17.7 |
11 | strict-referenced | gpt-4o-mini | september health, niharika faqs, pubmed | 61.8% | 76.7% | 7.1% | 82.5% | 85.4% | 25.9 |
12 | perplexity | sonar | default | 61.4% | 75.8% | 7.1% | 82.9% | 85.0% | 14.9 |
13 | perplexity | sonar | default | 61.0% | 67.1% | 21.2% | 80.0% | 85.8% | 34.7 |
14 | strict-referenced | gpt-4o-mini | september health, niharika faqs, perplexity, pubmed | 60.5% | 73.8% | 7.1% | 82.1% | 85.4% | 14.3 |
15 | strict-referenced | gpt-4o-mini | september health, niharika faqs | 60.2% | 68.8% | 14.2% | 81.2% | 86.2% | 30.9 |
16 | september | gpt-4o-mini | default | 49.3% | 50.8% | 22.5% | 54.2% | 83.3% | 79.2 |
17 | ashai | gpt-5 | default | 42.5% | 45.0% | 6.7% | 51.7% | 83.3% | 74.0 |