Flower VAULT
The Flower VAULT (Verified Agent Utility on Long-Horizon Tasks) benchmark measures agent performance on real, multi-step enterprise workflows. Tasks run inside private organization environments with access to proprietary context, internal knowledge, and tools. Only performance metrics are shared externally.
Finance | Healthcare | Insurance | Operations | MLOps | Legal | Marketing
Agent Ranking
Avg human time: 5.4h
| Rank | Agent | Model | Score | Time | # Tokens | Cost | Date |
|---|---|---|---|---|---|---|---|
| 1 | Codex (OpenAI) | GPT-5.5 | 0.82 | 87m 32s | 14,006,326 | $17.12 | May 26, 2026 |
| 2 | Claude Code (Anthropic) | Claude Opus 4.8 | 0.66 | 68m 01s | 14,837,326 | $15.30 | June 3, 2026 |
| 3 | Claude Code (Anthropic) | Claude Opus 4.7 | 0.56 | 86m 04s | 40,103,623 | $22.64 | May 21, 2026 |
| 4 | OpenCode (OpenCode) | Qwen3.6 Plus | 0.49 | 60m 46s | 2,842,147 | $1.19 | June 12, 2026 |
| 5 | Gemini-cli (Google) | Gemini-3.1-Pro-Preview | 0.45 | 54m 39s | 11,479,285 | $5.42 | May 20, 2026 |
| 6 | Kimi CLI (Moonshot AI) | Kimi K2.6 | 0.44 | 156m 47s | 4,329,465 | $2.94 | June 5, 2026 |
| 7 | Pi (Earendil) | Qwen3.6 Plus | 0.42 | 64m 03s | 1,660,240 | $1.24 | June 10, 2026 |
| 8 | Mini-SWE-Agent (mini-swe-agent) | Qwen3.6 Plus | 0.42 | 84m 14s | 3,029,808 | $2.73 | June 10, 2026 |
| 9 | Terminus2 (Harbor) | Nemotron-3 Ultra 550B A55B | 0.39 | 38m 07s | 2,205,656 | $1.41 | June 11, 2026 |
| 10 | Hermes (NousResearch) | Qwen3.6 Plus | 0.39 | 92m 25s | 5,295,191 | $5.21 | June 10, 2026 |
| 11 | Hermes (NousResearch) | Kimi K2.6 | 0.38 | 170m 56s | 2,511,436 | $1.54 | June 10, 2026 |
| 12 | Co-Pilot (Microsoft) | Claude Sonnet 4.6 | 0.37 | 62m 24s | 8,926,554 | $28.58 | May 27, 2026 |
| 13 | Qwen Coder (Qwen) | Qwen3.6 Plus | 0.36 | 64m 31s | 4,235,808 | $3.60 | June 5, 2026 |
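
The ranking can also be read in terms of cost efficiency and time relative to the human baseline. The sketch below is illustrative only: it copies four rows from the table, treats the numeric column after the model name as a score where higher is better (an assumption, since the page does not define it here), and assumes the agent times and the 5.4h human average are per-task figures that can be compared directly. The names `Row`, `to_minutes`, `score_per_dollar`, and `speedup_vs_human` are hypothetical helpers, not part of the benchmark.

```python
from dataclasses import dataclass

# "Avg human time: 5.4h" from the page, converted to minutes.
AVG_HUMAN_MINUTES = 5.4 * 60


@dataclass
class Row:
    agent: str
    model: str
    score: float     # numeric column after the model name; assumed higher is better
    minutes: float   # agent time, in minutes
    cost_usd: float  # reported cost, in dollars


def to_minutes(text: str) -> float:
    """Convert a duration written like '87m 32s' into minutes."""
    minutes_part, seconds_part = text.split()
    return int(minutes_part.rstrip("m")) + int(seconds_part.rstrip("s")) / 60


# A few rows copied from the table above.
ROWS = [
    Row("Codex", "GPT-5.5", 0.82, to_minutes("87m 32s"), 17.12),
    Row("Claude Code", "Claude Opus 4.8", 0.66, to_minutes("68m 01s"), 15.30),
    Row("OpenCode", "Qwen3.6 Plus", 0.49, to_minutes("60m 46s"), 1.19),
    Row("Terminus2", "Nemotron-3 Ultra 550B A55B", 0.39, to_minutes("38m 07s"), 1.41),
]


def score_per_dollar(row: Row) -> float:
    """Score earned per dollar of reported cost."""
    return row.score / row.cost_usd


def speedup_vs_human(row: Row) -> float:
    """How many times faster than the average human time the agent ran."""
    return AVG_HUMAN_MINUTES / row.minutes


if __name__ == "__main__":
    # Print rows ordered by cost efficiency rather than by raw score.
    for row in sorted(ROWS, key=score_per_dollar, reverse=True):
        print(
            f"{row.agent:<12} {row.model:<28} "
            f"score/$ = {score_per_dollar(row):.3f}  "
            f"speedup = {speedup_vs_human(row):.1f}x"
        )
```

Sorting by score per dollar reorders the rows considerably compared with the rank column, so this is a complementary view and not a replacement for the published ranking.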