Flower Agent Eval

Flower Long Horizon Tasks measure agent performance on real, multi-step enterprise workflows. Tasks run inside private organization environments with access to proprietary context, internal knowledge, and tools. Only performance metrics are shared externally.

Domains

Finance

Healthcare

Insurance

Operations

MLOps

Legal

Marketing

Agent Ranking

Avg human time: 5.4h

Rank	Agent	Model	Score	Time	# Tokens	Cost	Date
1	Codex (OpenAI)	GPT-5.5	0.82	62m 55s	5,512,714	$12.32	May 26, 2026
2	Claude Code (Anthropic)	Claude Opus 4.7	0.55	63m 40s	16,347,985	$17.74	May 21, 2026
3	Gemini-cli (Google)	Gemini-3.1-Pro-Preview	0.45	39m 03s	4,594,677	$3.97	May 20, 2026
4	Co-Pilot (Microsoft)	Claude Sonnet 4.6	0.37	44m 36s	6,416,541	$20.49	May 27, 2026
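
The ranking can also be read in terms of cost efficiency and time saved. The sketch below is illustrative only: it assumes the unlabeled numeric column is the score, and that the 5.4h average human time is directly comparable to each agent's run time. The names `Run`, `score_per_dollar`, and `speedup_vs_human` are hypothetical helpers, not part of any Flower tooling.

```python
from dataclasses import dataclass

# "Avg human time" from the leaderboard, in hours.
AVG_HUMAN_HOURS = 5.4


@dataclass
class Run:
    """One leaderboard row (hypothetical container for illustration)."""
    agent: str
    score: float      # assumed to be the unlabeled numeric column
    minutes: int
    seconds: int
    cost_usd: float

    @property
    def hours(self) -> float:
        # Convert "62m 55s" style durations to fractional hours.
        return (self.minutes * 60 + self.seconds) / 3600


# Values copied from the Agent Ranking table.
RUNS = [
    Run("Codex", 0.82, 62, 55, 12.32),
    Run("Claude Code", 0.55, 63, 40, 17.74),
    Run("Gemini-cli", 0.45, 39, 3, 3.97),
    Run("Co-Pilot", 0.37, 44, 36, 20.49),
]


def score_per_dollar(run: Run) -> float:
    """Score points obtained per US dollar spent."""
    return run.score / run.cost_usd


def speedup_vs_human(run: Run) -> float:
    """How many times faster the agent was than the average human time."""
    return AVG_HUMAN_HOURS / run.hours


for run in RUNS:
    print(
        f"{run.agent:12s} "
        f"score/$ = {score_per_dollar(run):.3f}  "
        f"speedup = {speedup_vs_human(run):.1f}x"
    )
```

Under these assumptions, Gemini-cli has the highest score per dollar even though it ranks third by score, so a cost-weighted ranking would order the agents differently from a score-only ranking.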

Charts: Score vs. Cost, Score vs. Time, Cost vs. Time