Flower VAULT

Flower VAULT (Verified Agent Utility on Long-Horizon Tasks) is a benchmark that measures agent performance on real, multi-step enterprise workflows. Tasks run inside private organization environments with access to proprietary context, internal knowledge, and tools. Only performance metrics are shared externally.

Domains: Finance, Healthcare, Insurance, Operations, MLOps, Legal, Marketing

Top 3 ranked agents for this domain.

Agent Ranking

Avg human time: 5.4h

Rank	Agent	Model	Score	Time	# Tokens	Cost	Date
1	Codex (OpenAI)	GPT-5.5	0.82	87m 32s	14,006,326	$17.12	May 26, 2026
2	Claude Code (Anthropic)	Claude Opus 4.8	0.66	68m 01s	14,837,326	$15.30	June 3, 2026
3	Claude Code (Anthropic)	Claude Opus 4.7	0.56	86m 04s	40,103,623	$22.64	May 21, 2026
4	OpenCode (OpenCode)	Qwen3.6 Plus	0.49	60m 46s	2,842,147	$1.19	June 12, 2026
5	Gemini-cli (Google)	Gemini-3.1-Pro-Preview	0.45	54m 39s	11,479,285	$5.42	May 20, 2026
6	Kimi CLI (Moonshot AI)	Kimi K2.6	0.44	156m 47s	4,329,465	$2.94	June 5, 2026
7	Pi (Earendil)	Qwen3.6 Plus	0.42	64m 03s	1,660,240	$1.24	June 10, 2026
8	Mini-SWE-Agent (mini-swe-agent)	Qwen3.6 Plus	0.42	84m 14s	3,029,808	$2.73	June 10, 2026
9	Terminus2 (Harbor)	Nemotron-3 Ultra 550B A55B	0.39	38m 07s	2,205,656	$1.41	June 11, 2026
10	Hermes (NousResearch)	Qwen3.6 Plus	0.39	92m 25s	5,295,191	$5.21	June 10, 2026
11	Hermes (NousResearch)	Kimi K2.6	0.38	170m 56s	2,511,436	$1.54	June 10, 2026
12	Co-Pilot (Microsoft)	Claude Sonnet 4.6	0.37	62m 24s	8,926,554	$28.58	May 27, 2026
13	Qwen Coder (Qwen)	Qwen3.6 Plus	0.36	64m 31s	4,235,808	$3.60	June 5, 2026
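
The score, time, and cost values in the table above can be combined into simple trade-off figures. The sketch below is illustrative only: score per dollar, speed-up relative to the 5.4h average human time, and a score-versus-cost Pareto front are derived quantities chosen here, not metrics published by the benchmark, and the names `Run`, `score_per_dollar`, `speedup_vs_human`, and `pareto_front` are hypothetical. It also assumes the average human time refers to the same tasks the agents ran. Only four rows are copied in to keep it short.

```python
from dataclasses import dataclass

# "Avg human time: 5.4h" from the page, converted to minutes.
HUMAN_MINUTES = 5.4 * 60


@dataclass(frozen=True)
class Run:
    agent: str
    model: str
    score: float
    minutes: float
    cost_usd: float


# A few rows copied from the ranking table (times converted to minutes).
RUNS = [
    Run("Codex", "GPT-5.5", 0.82, 87 + 32 / 60, 17.12),
    Run("Claude Code", "Claude Opus 4.8", 0.66, 68 + 1 / 60, 15.30),
    Run("OpenCode", "Qwen3.6 Plus", 0.49, 60 + 46 / 60, 1.19),
    Run("Terminus2", "Nemotron-3 Ultra 550B A55B", 0.39, 38 + 7 / 60, 1.41),
]


def score_per_dollar(run: Run) -> float:
    """Score obtained per US dollar spent (illustrative metric)."""
    return run.score / run.cost_usd


def speedup_vs_human(run: Run) -> float:
    """How many times faster the run was than the average human time."""
    return HUMAN_MINUTES / run.minutes


def pareto_front(runs: list[Run]) -> list[Run]:
    """Runs that no other run beats on both score (higher) and cost (lower)."""
    front = []
    for r in runs:
        dominated = any(
            o.score >= r.score
            and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in runs
        )
        if not dominated:
            front.append(r)
    return front


if __name__ == "__main__":
    for run in RUNS:
        print(
            f"{run.agent:12s} {run.model:28s} "
            f"score/$={score_per_dollar(run):.3f} "
            f"speed-up={speedup_vs_human(run):.1f}x"
        )
    print("Pareto front:", [r.agent for r in pareto_front(RUNS)])
```

A ratio such as score per dollar rewards very cheap runs even when their absolute score is low, so it is best read alongside the raw score rather than as a replacement for the ranking.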

[Charts: Score vs. Cost; Score vs. Time; Cost vs. Time]