Affine-I vs Qwen3-32B headline results come first, followed by the archived four-model EDA (Qwen3-32B-TEE vs Axon1 M19, Leary CX, and Leary CS) on BrowseComp-ZH, MCP-Bench, and MemoryAgentBench, plus ToolSandbox and Tau2 snapshots.
These benchmarks are useful for hypothesis generation but should not be used for strong leaderboard claims.
Ahead on six of eight charts: HumanEval, Tau2 Bench, SWE-Bench Multilingual, Terminal Bench V2, SWE Rebench, and BrowseComp+ (win tally sketched below). The biggest lift over the base is on multilingual SWE, terminal work, and SWE Rebench; it trails only on BBH and MCP Agent Bench in this suite.
Leads on BBH and MCP Agent Bench here, though the BBH margin is effectively a tie. A narrow second on HumanEval. Serves as the reference Qwen3-32B base for the headline Affine-I comparison above.
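For readers who want to reproduce the six-of-eight headline count from the chart data, a minimal Python sketch follows. The benchmark list is taken from the notes above; the `tally_wins` helper and the all-zero score dicts are hypothetical placeholders, not the dashboard's actual scores or pipeline.

```python
# Hypothetical sketch of the headline win tally. The benchmark names come from
# the charts above; the score values below are PLACEHOLDERS (all zeros), not
# the measured results; substitute the real chart data.

HEADLINE_BENCHMARKS = [
    "HumanEval", "Tau2 Bench", "SWE-Bench Multilingual", "Terminal Bench V2",
    "SWE Rebench", "BrowseComp+", "BBH", "MCP Agent Bench",
]

def tally_wins(a: dict[str, float], b: dict[str, float]) -> list[str]:
    """Benchmarks on which model `a` strictly outscores model `b`."""
    return [name for name in HEADLINE_BENCHMARKS if a[name] > b[name]]

affine_i = {name: 0.0 for name in HEADLINE_BENCHMARKS}   # placeholder scores
qwen3_32b = {name: 0.0 for name in HEADLINE_BENCHMARKS}  # placeholder scores

wins = tally_wins(affine_i, qwen3_32b)
print(f"Affine-I ahead on {len(wins)} of {len(HEADLINE_BENCHMARKS)} charts: {wins}")
```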
Original EDA run: Qwen3-32B-TEE baseline vs three Affine fine-tunes. Key findings for this suite are below the charts.
These benchmarks are useful for hypothesis generation but should not be used for strong leaderboard claims.
Takeaways from the four-model evaluation (Qwen3-32B-TEE base vs Axon1 M19, Leary CX, Leary CS) on BrowseComp-ZH, MCP-Bench, MemoryAgentBench, ToolSandbox, and Tau2 — corresponding to the charts above.
Excels on complex long-context and agentic workloads, with the best accuracy on BrowseComp-ZH and the strongest F1 on MemoryAgentBench (metric definitions are sketched after these model notes).
Very strong on tool-use tasks, with the top task-completion and tool-selection scores. Materially weaker on Chinese retrieval QA.
Competitive on long-context memory and agent-style tasks, with the highest ROUGE-L recall. Less stable than the Leary models on tool use.
A respectable baseline, especially on Chinese retrieval QA and MCP tool-call correctness. Trails on complex agentic workloads.
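The model notes above cite answer F1, ROUGE-L recall, and tool-selection scores. As a reference point, here is a minimal Python sketch of the standard definitions of those metrics; the actual MemoryAgentBench and MCP-Bench scoring scripts may tokenize, normalize, or aggregate differently, so treat these helpers as illustrative assumptions rather than the harness's code.

```python
# Minimal sketches of the metrics referenced in the notes above, using their
# standard textbook definitions. The exact EDA scoring (tokenization,
# normalization, multi-reference handling) is assumed, not confirmed.

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1, commonly used for QA answer scoring."""
    pred, ref = prediction.split(), reference.split()
    common = 0
    ref_pool = list(ref)
    for tok in pred:
        if tok in ref_pool:
            ref_pool.remove(tok)  # count each reference token at most once
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def rouge_l_recall(prediction: str, reference: str) -> float:
    """ROUGE-L recall: longest-common-subsequence length over reference length."""
    pred, ref = prediction.split(), reference.split()
    # Classic O(len(pred) * len(ref)) LCS dynamic program.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if p == r else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / len(ref) if ref else 0.0

def tool_selection_accuracy(calls: list[tuple[str, str]]) -> float:
    """Fraction of turns where the predicted tool name matches the expected one.
    `calls` is a list of (predicted_tool, expected_tool) pairs."""
    if not calls:
        return 0.0
    return sum(p == e for p, e in calls) / len(calls)

print(token_f1("the cat sat", "a cat sat down"))        # 0.571...
print(rouge_l_recall("the cat sat", "a cat sat down"))  # 0.5
```

Both text metrics here operate on raw whitespace tokens; real harnesses typically lowercase and strip punctuation before scoring, which shifts absolute numbers but not the relative ordering claims above.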
DATA CUTOFF: 2026-04-17 UTC · STABLE BENCHMARKS: BROWSECOMP-ZH, MCP-BENCH, MEMORYAGENTBENCH · SNAPSHOT: TOOLSANDBOX, TAU2