Affine-I vs Qwen3-32B headline results come first, followed by the archived four-model EDA (Qwen3-32B-TEE vs Axon1 M19, Leary CX, and Leary CS) on BrowseComp-ZH, MCP-Bench, and MemoryAgentBench, plus ToolSandbox and Tau2 snapshots.
These benchmarks are useful for hypothesis generation but should not be used for strong leaderboard claims.
Ahead on six of eight charts: HumanEval, Tau2 Bench, SWE-Bench Multilingual, Terminal Bench V2, SWE Rebench, and BrowseComp+ (win tally sketched below). The biggest lift over the base is on multilingual SWE, terminal work, and SWE Rebench; it trails only on BBH and MCP Agent Bench in this suite.
Leads on BBH and MCP Agent Bench here, though the BBH margin is effectively a tie. A narrow second on HumanEval. Serves as the reference Qwen3-32B base for the headline Affine-I comparison above.
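For readers who want to reproduce the six-of-eight headline count from the chart data, a minimal Python sketch follows. The benchmark list is taken from the notes above; the `tally_wins` helper and the all-zero score dicts are hypothetical placeholders, not the dashboard's actual scores or pipeline.

```python
# Hypothetical sketch of the headline win tally. The benchmark names come from
# the charts above; the score values below are PLACEHOLDERS (all zeros), not
# the measured results; substitute the real chart data.

HEADLINE_BENCHMARKS = [
    "HumanEval", "Tau2 Bench", "SWE-Bench Multilingual", "Terminal Bench V2",
    "SWE Rebench", "BrowseComp+", "BBH", "MCP Agent Bench",
]

def tally_wins(a: dict[str, float], b: dict[str, float]) -> list[str]:
    """Benchmarks on which model `a` strictly outscores model `b`."""
    return [name for name in HEADLINE_BENCHMARKS if a[name] > b[name]]

affine_i = {name: 0.0 for name in HEADLINE_BENCHMARKS}   # placeholder scores
qwen3_32b = {name: 0.0 for name in HEADLINE_BENCHMARKS}  # placeholder scores

wins = tally_wins(affine_i, qwen3_32b)
print(f"Affine-I ahead on {len(wins)} of {len(HEADLINE_BENCHMARKS)} charts: {wins}")
```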
Original EDA run: Qwen3-32B-TEE baseline vs three Affine fine-tunes. Key findings for this suite are below the charts.
These benchmarks are useful for hypothesis generation but should not be used for strong leaderboard claims.
Takeaways from the four-model evaluation (Qwen3-32B-TEE base vs Axon1 M19, Leary CX, Leary CS) on BrowseComp-ZH, MCP-Bench, MemoryAgentBench, ToolSandbox, and Tau2 — corresponding to the charts above.
Excels on complex long-context and agentic workloads, with the best accuracy on BrowseComp-ZH and the strongest F1 on MemoryAgentBench (metric definitions are sketched after these model notes).
Very strong on tool-use tasks, with the top task-completion and tool-selection scores. Materially weaker on Chinese retrieval QA.
Competitive on long-context memory and agent-style tasks, with the highest ROUGE-L recall. Less stable than the Leary models on tool use.
A respectable baseline, especially on Chinese retrieval QA and MCP tool-call correctness. Trails on complex agentic workloads.
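The model notes above cite answer F1, ROUGE-L recall, and tool-selection scores. As a reference point, here is a minimal Python sketch of the standard definitions of those metrics; the actual MemoryAgentBench and MCP-Bench scoring scripts may tokenize, normalize, or aggregate differently, so treat these helpers as illustrative assumptions rather than the harness's code.

```python
# Minimal sketches of the metrics referenced in the notes above, using their
# standard textbook definitions. The exact EDA scoring (tokenization,
# normalization, multi-reference handling) is assumed, not confirmed.

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1, commonly used for QA answer scoring."""
    pred, ref = prediction.split(), reference.split()
    common = 0
    ref_pool = list(ref)
    for tok in pred:
        if tok in ref_pool:
            ref_pool.remove(tok)  # count each reference token at most once
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def rouge_l_recall(prediction: str, reference: str) -> float:
    """ROUGE-L recall: longest-common-subsequence length over reference length."""
    pred, ref = prediction.split(), reference.split()
    # Classic O(len(pred) * len(ref)) LCS dynamic program.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if p == r else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / len(ref) if ref else 0.0

def tool_selection_accuracy(calls: list[tuple[str, str]]) -> float:
    """Fraction of turns where the predicted tool name matches the expected one.
    `calls` is a list of (predicted_tool, expected_tool) pairs."""
    if not calls:
        return 0.0
    return sum(p == e for p, e in calls) / len(calls)

print(token_f1("the cat sat", "a cat sat down"))        # 0.571...
print(rouge_l_recall("the cat sat", "a cat sat down"))  # 0.5
```

Both text metrics here operate on raw whitespace tokens; real harnesses typically lowercase and strip punctuation before scoring, which shifts absolute numbers but not the relative ordering claims above.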
DATA CUTOFF: 2026-04-17 UTC · STABLE BENCHMARKS: BROWSECOMP-ZH, MCP-BENCH, MEMORYAGENTBENCH · SNAPSHOT: TOOLSANDBOX, TAU2