Flower Agent Eval

Flower Long Horizon Tasks measure agent performance on real, multi-step enterprise workflows. Tasks run inside private org environments with access to proprietary context, internal knowledge, and tools. Only performance metrics are shared externally.

Domains (7)

Finance
Healthcare
Insurance
Operations
MLOps
Legal
Marketing

Leaderboard

Agent Ranking

Avg human time: 5.4h

Rank  Agent                   Model                   Score  Time     # Tokens    Cost    Date
1     Codex (OpenAI)          GPT-5.5                 0.82   62m 55s  5,512,714   $12.32  May 26, 2026
2     Claude Code (Anthropic) Claude Opus 4.7         0.55   63m 40s  16,347,985  $17.74  May 21, 2026
3     Gemini-cli (Google)     Gemini-3.1-Pro-Preview  0.45   39m 03s  4,594,677   $3.97   May 20, 2026
4     Co-Pilot (Microsoft)    Claude Sonnet 4.6       0.37   44m 36s  6,416,541   $20.49  May 27, 2026

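The leaderboard figures can be turned into two rough derived metrics: how much faster an agent is than the average human, and what it pays per million tokens. The sketch below is illustrative only. It assumes the Time column and the 5.4h average human time refer to the same set of tasks, and the helper names (`parse_duration`, `speedup_vs_human`, `cost_per_million_tokens`) are hypothetical, not part of any Flower tooling.

```python
import re

AVG_HUMAN_HOURS = 5.4  # "Avg human time" from the leaderboard


def parse_duration(text: str) -> int:
    """Convert a leaderboard time such as '62m 55s' into seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds


def speedup_vs_human(agent_time: str, human_hours: float = AVG_HUMAN_HOURS) -> float:
    """Ratio of average human time to agent time (assumes comparable tasks)."""
    return human_hours * 3600 / parse_duration(agent_time)


def cost_per_million_tokens(cost_usd: float, tokens: int) -> float:
    """Dollar cost per one million tokens consumed."""
    return cost_usd / tokens * 1_000_000


# Rank 1 row: Codex, 62m 55s, 5,512,714 tokens, $12.32
codex_speedup = speedup_vs_human("62m 55s")
codex_rate = cost_per_million_tokens(12.32, 5_512_714)
print(f"{codex_speedup:.2f}x vs. human, ${codex_rate:.2f} per 1M tokens")
```

These ratios are only as meaningful as the assumption that human and agent times were measured on the same workload.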
Pareto Frontier

[Chart: Score vs. Cost, Overall]
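A score-versus-cost Pareto frontier keeps only the runs that no other run beats on both axes at once. As a minimal sketch, the snippet below computes it from the four leaderboard rows, assuming the number that follows each model name (0.82, 0.55, 0.45, 0.37) is the score and that higher score and lower cost are both preferred. The `Run` type and `pareto_frontier` function are hypothetical names introduced for illustration; how the page itself draws the chart is not stated.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    agent: str
    score: float     # higher is better (assumed)
    cost_usd: float  # lower is better


# Values copied from the leaderboard rows.
RUNS = [
    Run("Codex", 0.82, 12.32),
    Run("Claude Code", 0.55, 17.74),
    Run("Gemini-cli", 0.45, 3.97),
    Run("Co-Pilot", 0.37, 20.49),
]


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the non-dominated runs, ordered from cheapest to most expensive."""
    frontier = []
    best_score = float("-inf")
    # Walk runs by ascending cost (ties: higher score first). A run is on the
    # frontier only if it scores higher than every cheaper run seen so far.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.score)):
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


frontier_agents = [run.agent for run in pareto_frontier(RUNS)]
print(frontier_agents)  # ['Gemini-cli', 'Codex']
```

Under these assumptions Claude Code and Co-Pilot fall off the frontier because Codex scores higher at a lower cost than either.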