Flower Agent Eval
Flower Long Horizon Tasks measure agent performance on real, multi-step enterprise workflows. Tasks run inside private org environments with access to proprietary context, internal knowledge, and tools. Only performance metrics are shared externally.
Domains (7)
Finance
Healthcare
Insurance
Operations
MLOps
Legal
Marketing
Leaderboard
Agent Ranking
Avg human time: 5.4h
| Rank | Agent | Model | Score | Time | # Tokens | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Codex (OpenAI) | GPT-5.5 | 0.82 | 62m 55s | 5,512,714 | $12.32 | May 26, 2026 |
| 2 | Claude Code (Anthropic) | Claude Opus 4.7 | 0.55 | 63m 40s | 16,347,985 | $17.74 | May 21, 2026 |
| 3 | Gemini-cli (Google) | Gemini-3.1-Pro-Preview | 0.45 | 39m 03s | 4,594,677 | $3.97 | May 20, 2026 |
| 4 | Co-Pilot (Microsoft) | Claude Sonnet 4.6 | 0.37 | 44m 36s | 6,416,541 | $20.49 | May 27, 2026 |
Pareto Frontier: Score vs. Cost (Overall)
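The page does not say how the Score vs. Cost frontier is computed. A minimal sketch, assuming the usual definition (an agent is on the frontier if no other agent has a score at least as high at a cost at least as low, with one of the two strictly better), applied to the four leaderboard rows above. The names `Run`, `dominates`, and `pareto_frontier` are illustrative, not part of any Flower tooling.

```python
from typing import NamedTuple


class Run(NamedTuple):
    agent: str
    score: float     # higher is better
    cost_usd: float  # lower is better


# Score and cost values copied from the leaderboard table.
RUNS = [
    Run("Codex", 0.82, 12.32),
    Run("Claude Code", 0.55, 17.74),
    Run("Gemini-cli", 0.45, 3.97),
    Run("Co-Pilot", 0.37, 20.49),
]


def dominates(a: Run, b: Run) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.score >= b.score and a.cost_usd <= b.cost_usd
    strictly_better = a.score > b.score or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the non-dominated runs, ordered from cheapest to most expensive."""
    frontier = [r for r in runs if not any(dominates(other, r) for other in runs)]
    return sorted(frontier, key=lambda r: r.cost_usd)


if __name__ == "__main__":
    for run in pareto_frontier(RUNS):
        print(f"{run.agent}: score={run.score:.2f}, cost=${run.cost_usd:.2f}")
```

Under this assumption, Gemini-cli and Codex form the frontier: Claude Code and Co-Pilot each score lower than Codex while costing more.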