Harness Arena

Methodology

Harness Arena aggregates metadata exported from local AI coding harness histories. Before anything reaches this site, the uploader normalizes each harness's native history format into a shared session metric schema. No prompt or response content is displayed. This page defines the metrics shown in the UI and explains what each data-completeness badge means.

Normalization Model

Each supported harness stores history differently. Claude, Codex, Gemini, Cursor Agent, and OpenCode all have their own log layouts, retention behavior, and native field names.

The uploader parses those harness-specific sources and maps them into a common session schema before upload. The site then aggregates only that normalized schema, rather than reinterpreting each harness independently.

That means cross-harness totals such as tokens, prompts, sessions, tool calls, subagents, and MCP calls are all computed from the same normalized fields after ingestion.
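As a rough illustration of aggregating over the normalized schema, the sketch below sums the same fields across sessions from different harnesses. The field names match those listed under Metric Definitions. The `Session` class, the `harness` field, and the `aggregate` helper are assumptions made for this example, not the site's actual code.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One normalized session record (hypothetical container for this sketch)."""
    harness: str                 # assumed label for the source harness
    total_tokens: int = 0
    message_count_user: int = 0
    total_tool_calls: int = 0
    subagent_calls: int = 0
    mcp_calls: int = 0


def aggregate(sessions):
    """Sum normalized fields across a scope, regardless of source harness."""
    return {
        "tokens": sum(s.total_tokens for s in sessions),
        "prompts": sum(s.message_count_user for s in sessions),
        "sessions": len(sessions),
        "tool_calls": sum(s.total_tool_calls for s in sessions),
        "subagents": sum(s.subagent_calls for s in sessions),
        "mcp_calls": sum(s.mcp_calls for s in sessions),
    }


# Two sessions from different harnesses contribute to one set of totals.
scope = [
    Session("claude", total_tokens=12000, message_count_user=8,
            total_tool_calls=20, subagent_calls=1, mcp_calls=2),
    Session("codex", total_tokens=3000, message_count_user=2,
            total_tool_calls=5),
]
totals = aggregate(scope)
```

Because every harness is reduced to the same fields before this step, the aggregation never needs to branch on the harness name.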

Metric Definitions

| Metric | Definition |
| --- | --- |
| Tokens | Sum of `sessions.total_tokens` across the selected scope. |
| Prompts | Sum of `sessions.message_count_user` across the selected scope. |
| Sessions | Count of session records in the selected scope. |
| Tool Calls | Sum of `sessions.total_tool_calls` across the selected scope. |
| Subagents | Sum of `sessions.subagent_calls` across the selected scope. |
| MCP Calls | Sum of `sessions.mcp_calls` across the selected scope. |
| Days Active | Count of distinct daily activity rows after daily data is merged by date. |
| Tokens / Prompt | Project detail view: `round(total_tokens / total_prompts)` when prompts > 0, else `0`. |
| Tools / Prompt | Project detail view: `round((total_tool_calls / total_prompts) * 10) / 10` when prompts > 0, else `0`. |
| Intervention Rate | Current site formula: `round((total_prompts / (total_tokens / 1000)) * 100) / 100` when tokens > 0, else `0`. |
| Leaderboard Rank | Users are ranked by total tokens descending. |
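The three derived ratios can be sketched as follows. The formulas are copied from the definitions above. The function names are hypothetical, and because the definitions write `round(...)` without naming a rounding mode, this sketch assumes round-half-up.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, ties upward (assumed rounding mode)."""
    return math.floor(x + 0.5)


def tokens_per_prompt(total_tokens, total_prompts):
    """round(total_tokens / total_prompts), or 0 when there are no prompts."""
    if total_prompts <= 0:
        return 0
    return round_half_up(total_tokens / total_prompts)


def tools_per_prompt(total_tool_calls, total_prompts):
    """Tool calls per prompt to one decimal place, or 0 when there are no prompts."""
    if total_prompts <= 0:
        return 0
    return round_half_up((total_tool_calls / total_prompts) * 10) / 10


def intervention_rate(total_prompts, total_tokens):
    """Prompts per 1,000 tokens to two decimal places, or 0 when there are no tokens."""
    if total_tokens <= 0:
        return 0
    return round_half_up((total_prompts / (total_tokens / 1000)) * 100) / 100


# Worked example: 15,000 tokens, 10 prompts, 25 tool calls.
print(tokens_per_prompt(15000, 10))   # 1500
print(tools_per_prompt(25, 10))       # 2.5
print(intervention_rate(10, 15000))   # 0.67
```

Read this way, Intervention Rate is the number of user prompts per 1,000 tokens, so a lower value means more tokens were processed for each prompt.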

Data Completeness

Completeness describes how much of the original harness history still survives locally for the sessions contributing to a view. It affects how confidently metrics such as tokens, tool calls, and daily detail can be interpreted.

Full data

Every counted session for the current scope came from the harness's primary history source, so token, prompt, model, timing, and tool details are available at normal fidelity.

Incomplete data

Some or all sessions for the current scope are missing detailed metrics. Session existence, prompt counts, and dates may be available, but token counts, tool calls, and richer per-session metadata may be absent due to harness garbage collection or partial sync.
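One way to read the two badges is as an all-or-nothing rule over the sessions in scope. The sketch below assumes each session carries a boolean `from_primary_source` flag. That field name and the `completeness_badge` helper are hypothetical, introduced only to illustrate the rule described above.

```python
def completeness_badge(sessions):
    """Return 'Full data' only when every session came from the primary source.

    `sessions` is a list of dicts with a hypothetical `from_primary_source`
    flag. How an empty scope is labeled is not specified by the definitions
    above, so callers should handle that case separately.
    """
    if all(s["from_primary_source"] for s in sessions):
        return "Full data"
    # Some or all sessions lack detailed metrics.
    return "Incomplete data"


full = [{"from_primary_source": True}, {"from_primary_source": True}]
mixed = [{"from_primary_source": True}, {"from_primary_source": False}]

print(completeness_badge(full))   # Full data
print(completeness_badge(mixed))  # Incomplete data
```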

How Completeness Affects Metrics

`Tokens`, `Tool Calls`, `Subagents`, `MCP Calls`, and most derived metrics are most trustworthy when completeness is `Full data`. These metrics may be understated when a project shows `Incomplete data`.

`Prompts`, `Sessions`, and coarse activity timing can often survive longer because some harnesses keep lightweight indexes or prompt-history files after richer session logs are pruned.