What We Track
We monitor agent releases, version notes, and source links, then store each release as a version record with capability and impact metadata. Capabilities are tracked across six dimensions:
- Coding
- Reasoning
- Tool use
- Memory
- Multimodal
- Speed
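As a rough illustration of what such a version record could look like, here is a minimal Python sketch. The class name, field names, snake_case dimension keys, and sample values are all assumptions made for this example and do not describe AgentCodex's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional

# The six dimensions listed above; the snake_case keys are illustrative.
CAPABILITY_DIMENSIONS = (
    "coding", "reasoning", "tool_use", "memory", "multimodal", "speed",
)


@dataclass
class VersionRecord:
    """Hypothetical shape of one stored release (not the real schema)."""
    agent: str
    version: str
    released_on: date
    source_links: List[str] = field(default_factory=list)
    # None marks a dimension with no usable evidence yet.
    capabilities: Dict[str, Optional[float]] = field(
        default_factory=lambda: {d: None for d in CAPABILITY_DIMENSIONS}
    )
    impact: Optional[str] = None  # hypothetical free-form impact label


# Made-up example values, for illustration only.
record = VersionRecord(
    agent="example-agent",
    version="1.2.0",
    released_on=date(2025, 1, 15),
    source_links=["https://example.com/release-notes"],
)
record.capabilities["coding"] = 8.0
```

Keeping unscored dimensions as None rather than 0 makes it possible to tell "no evidence" apart from "scored low" later on.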
How Scoring Works
1. Parse release evidence into structured fields.
2. Score capability dimensions on a normalized 1 to 10 scale.
3. Validate scores and sanitize invalid values before publishing.
4. Rank release movement and velocity using version timelines.
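Steps 2 to 4 can be sketched roughly as follows. This is a simplified illustration, not the production pipeline: the function names, the choice to clamp out-of-range values instead of discarding them, and the "score change per 30 days" definition of velocity are all assumptions.

```python
import math
from datetime import date
from typing import Any, List, Optional, Tuple

SCALE_MIN, SCALE_MAX = 1.0, 10.0


def sanitize_score(raw: Any) -> Optional[float]:
    """Return a score on the 1 to 10 scale, or None if the value is unusable."""
    if isinstance(raw, bool):  # bool is an int subclass; treat it as invalid
        return None
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if math.isnan(value) or math.isinf(value):
        return None
    # Assumption: out-of-range values are clamped rather than dropped.
    return min(SCALE_MAX, max(SCALE_MIN, value))


def velocity(timeline: List[Tuple[date, float]]) -> Optional[float]:
    """Score change per 30 days between the earliest and latest release."""
    if len(timeline) < 2:
        return None
    ordered = sorted(timeline)  # sorts by release date first
    (start, first), (end, last) = ordered[0], ordered[-1]
    days = (end - start).days
    if days == 0:
        return None
    return (last - first) * 30 / days
```

For example, `sanitize_score("n/a")` yields None, so a field that failed to parse never reaches publication as a number, while `velocity` returns None when there is only one release to compare.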
How Comparison Works
Compare uses the latest published version per agent and applies workflow presets with weighted capabilities. This means every shortlist reflects both capability fit and current release state.
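A weighted comparison of this kind could look like the sketch below. The preset name, its weights, the agent names, and the scores are invented for illustration. Renormalizing weights over the dimensions that actually have a score is one possible way to handle gaps, not necessarily the one AgentCodex uses.

```python
from typing import Dict, Optional

# Hypothetical workflow preset; the weights are illustrative and sum to 1.0.
CODING_ASSISTANT_PRESET: Dict[str, float] = {
    "coding": 0.5,
    "reasoning": 0.25,
    "tool_use": 0.25,
}


def weighted_fit(
    capabilities: Dict[str, Optional[float]],
    preset: Dict[str, float],
) -> Optional[float]:
    """Weighted mean over the preset's dimensions that have a score.

    Weights are renormalized over the scored dimensions, so a missing
    value is not silently counted as zero.
    """
    total = 0.0
    total_weight = 0.0
    for dimension, weight in preset.items():
        score = capabilities.get(dimension)
        if score is None:
            continue
        total += weight * score
        total_weight += weight
    if total_weight == 0:
        return None
    return total / total_weight


# Made-up capability scores from each agent's latest published version.
latest = {
    "agent-a": {"coding": 8.0, "reasoning": 6.0, "tool_use": 4.0},
    "agent-b": {"coding": 6.0, "reasoning": 8.0, "tool_use": None},
}
fits = {name: weighted_fit(caps, CODING_ASSISTANT_PRESET) for name, caps in latest.items()}
# Agents with no scored dimensions at all are left out of the shortlist.
shortlist = sorted(
    (name for name, fit in fits.items() if fit is not None),
    key=lambda name: fits[name],
    reverse=True,
)
```

Because the input is always the latest published version, a new release can reorder the shortlist without any change to the preset itself.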
Frequently Asked Questions
What mechanism do you use to rate agents?
AgentCodex scores each release version across standardized capability dimensions on a 1 to 10 scale. Scores are derived from release evidence and normalized through validation and calibration rules so missing or invalid values do not distort results.
What are you using under the hood during comparison?
Comparison is version-aware and capability-weighted. AgentCodex aligns each selected agent's latest published profile, applies preset workflow weights, and shows capability fit side by side, with direct links back to each version's source context.
Is this benchmark data or live product intelligence?
It is release intelligence, not a synthetic benchmark leaderboard. AgentCodex tracks shipped updates from public sources and highlights what changed, when it changed, and where the source signal came from.
How do you reduce bias or inconsistency?
We use fixed capability dimensions, deterministic validation rules, source-quality signals, and human review workflows for drafts. This keeps scoring more consistent and easier to audit over time.