Flower VAULT

The Flower VAULT (Verified Agent Utility on Long-Horizon Tasks) benchmark measures agent performance on real, multi-step enterprise workflows. Tasks run inside private organization environments with access to proprietary context, internal knowledge, and tools. Only performance metrics are shared externally.

Domains: Finance, Healthcare, Insurance, Operations, MLOps, Legal, Marketing

Agent Ranking

Avg human time: 5.4h

| Rank | Agent (organization) | Model | Score | Time | # Tokens | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Codex (OpenAI) | GPT-5.5 | 0.82 | 87m 32s | 14,006,326 | $17.12 | May 26, 2026 |
| 2 | Claude Code (Anthropic) | Claude Opus 4.8 | 0.66 | 68m 01s | 14,837,326 | $15.30 | June 3, 2026 |
| 3 | Claude Code (Anthropic) | Claude Opus 4.7 | 0.56 | 86m 04s | 40,103,623 | $22.64 | May 21, 2026 |
| 4 | OpenCode (OpenCode) | Qwen3.6 Plus | 0.49 | 60m 46s | 2,842,147 | $1.19 | June 12, 2026 |
| 5 | Gemini-cli (Google) | Gemini-3.1-Pro-Preview | 0.45 | 54m 39s | 11,479,285 | $5.42 | May 20, 2026 |
| 6 | Kimi CLI (Moonshot AI) | Kimi K2.6 | 0.44 | 156m 47s | 4,329,465 | $2.94 | June 5, 2026 |
| 7 | Pi (Earendil) | Qwen3.6 Plus | 0.42 | 64m 03s | 1,660,240 | $1.24 | June 10, 2026 |
| 8 | Mini-SWE-Agent (mini-swe-agent) | Qwen3.6 Plus | 0.42 | 84m 14s | 3,029,808 | $2.73 | June 10, 2026 |
| 9 | Terminus2 (Harbor) | Nemotron-3 Ultra 550B A55B | 0.39 | 38m 07s | 2,205,656 | $1.41 | June 11, 2026 |
| 10 | Hermes (NousResearch) | Qwen3.6 Plus | 0.39 | 92m 25s | 5,295,191 | $5.21 | June 10, 2026 |
| 11 | Hermes (NousResearch) | Kimi K2.6 | 0.38 | 170m 56s | 2,511,436 | $1.54 | June 10, 2026 |
| 12 | Co-Pilot (Microsoft) | Claude Sonnet 4.6 | 0.37 | 62m 24s | 8,926,554 | $28.58 | May 27, 2026 |
| 13 | Qwen Coder (Qwen) | Qwen3.6 Plus | 0.36 | 64m 31s | 4,235,808 | $3.60 | June 5, 2026 |

[Charts: Score vs. Cost; Score vs. Time; Cost vs. Time]
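As a rough illustration of the score, cost, and time trade-offs in the ranking, the following Python sketch derives two ratios from three of the listed entries: score per dollar and speedup relative to the 5.4h average human time. Both ratios, and the names `Row`, `parse_minutes`, `score_per_dollar`, and `speedup_vs_human`, are illustrative assumptions and not metrics published by the benchmark. The sketch also assumes that the Time column and the average human time refer to comparable units of work, which the page does not state.

```python
from dataclasses import dataclass

# "Avg human time: 5.4h" from the ranking header, converted to minutes.
AVG_HUMAN_MINUTES = 5.4 * 60


@dataclass(frozen=True)
class Row:
    """One leaderboard entry (hypothetical container, not a benchmark API)."""
    agent: str
    model: str
    score: float
    minutes: float
    cost_usd: float


def parse_minutes(text: str) -> float:
    """Convert a leaderboard time such as '87m 32s' to minutes."""
    minutes_part, seconds_part = text.split()
    return int(minutes_part.rstrip("m")) + int(seconds_part.rstrip("s")) / 60


# Three rows copied from the ranking; the remaining rows follow the same shape.
ROWS = [
    Row("Codex", "GPT-5.5", 0.82, parse_minutes("87m 32s"), 17.12),
    Row("Claude Code", "Claude Opus 4.8", 0.66, parse_minutes("68m 01s"), 15.30),
    Row("OpenCode", "Qwen3.6 Plus", 0.49, parse_minutes("60m 46s"), 1.19),
]


def score_per_dollar(row: Row) -> float:
    """Illustrative ratio: score points obtained per US dollar spent."""
    return row.score / row.cost_usd


def speedup_vs_human(row: Row) -> float:
    """Illustrative ratio: average human time divided by agent time."""
    return AVG_HUMAN_MINUTES / row.minutes


if __name__ == "__main__":
    for row in sorted(ROWS, key=score_per_dollar, reverse=True):
        print(
            f"{row.agent} / {row.model}: "
            f"{score_per_dollar(row):.3f} score per dollar, "
            f"{speedup_vs_human(row):.1f}x faster than the human average"
        )
```

Neither ratio accounts for the score gap between entries, so a cheap, fast agent with a low score can rank well on both; they are best read alongside the raw Score column rather than in place of it.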