How three AI models interpret identical MCP compliance data, and what their differences reveal about agent-driven regulatory analysis
Three agents received identical MCP data. Each was asked to evaluate the same codebase against the same GDPR requirements.
- Claude Code CLI · Worktree Isolation
- PyCharm Junie Agent
- OpenAI Codex CLI
Side-by-side metrics across all three agents on the same evaluation target.
| Metric | Claude Opus 4.6 | Junie / Gemini 3 Flash | Codex / GPT-5.4 |
|---|---|---|---|
| **Volume** | | | |
| Signals Reported | 39 | 119 | 236 |
| Unique Signal Names | 39 | 91 | 13 |
| Requirements Covered | 38 / 118 | 118 / 118 | 118 / 118 |
| **Evidence Breakdown** | | | |
| Code Findings | 22 | 19 | 12 |
| Manual / Gap Findings | 17 | 100 | 224 |
| Average Confidence | 0.50 | 0.95 | 0.50 |
| **Discovery** | | | |
| Source Files Cited | 8 | 8 | 3 |
| Unique Files Found | 2 | 2 | 1 |
| **Behavior** | | | |
| Followed MCP Workflow | Automatic | Yes | After prompting |
| Called report_evidence Autonomously | Yes | Yes | No |
| Edited Source Files | No | Yes | Yes |
The ratio of unique signal names to total signals reveals how each agent names and categorizes its findings.
39 unique signals for 38 requirements. Every signal name was unique — one descriptive label per finding. Claude reported only what it found, skipping requirements it could not evaluate. This produced a concise, high-signal-to-noise evidence record where each entry stands alone as a distinct finding.
91 unique signals for 118 requirements. Junie covered every requirement in the evaluation, generating one signal per requirement with some grouped under shared labels. High confidence (0.95) across all signals. Systematic and thorough — no requirement left behind.
13 unique signals for 118 requirements. Codex applied broad category labels (e.g., "governance-gap" for 64 requirements, "security-control-gap" for 34) rather than per-finding descriptions. Every code-type signal was duplicated, inflating volume without adding information. Full coverage, but at the cost of specificity.
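These naming patterns can be quantified directly from an agent's evidence log. A minimal sketch (the signal names below are invented stand-ins, not the agents' actual labels):

```python
from collections import Counter

def naming_profile(signal_names):
    """Summarize how an agent labels its evidence signals.

    `signal_names` holds one entry per reported signal. Returns the
    total signal count, the number of distinct names, and their ratio."""
    total = len(signal_names)
    unique = len(Counter(signal_names))
    return total, unique, unique / total if total else 0.0

# Illustrative inputs only -- not the real evidence records.
claude_style = [f"distinct-finding-{i}" for i in range(39)]  # one label each
codex_style = ["governance-gap"] * 64 + ["security-control-gap"] * 34

print(naming_profile(claude_style))  # (39, 39, 1.0)
print(naming_profile(codex_style))   # two broad labels cover all 98 signals
```

A ratio near 1.0 indicates per-finding labels (Claude's pattern); a ratio near zero indicates broad category buckets (Codex's pattern).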
Each agent self-reported a confidence score with every evidence signal. The variation highlights differences in calibration behavior.
Claude and Codex both used the default confidence value of 0.50, neither adjusting it per finding. Gemini 3 Flash assigned 0.95 uniformly, indicating high self-assessed certainty across all signals. None of the three agents varied confidence on a per-signal basis — calibration was model-wide rather than finding-specific. This suggests that meaningful confidence differentiation may require explicit prompt engineering or schema-level guidance.
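This behavior is easy to detect mechanically: collapse each agent's reported confidences into a set and check whether anything varies. A sketch using made-up lists that mirror the reported values:

```python
def calibration_mode(confidences):
    """Return 'uniform' if every signal carries the same confidence
    score, 'per-finding' if the agent actually varied it."""
    return "uniform" if len(set(confidences)) <= 1 else "per-finding"

# All three agents calibrated model-wide, not per finding:
print(calibration_mode([0.50] * 39))   # uniform (Claude, default 0.50)
print(calibration_mode([0.95] * 119))  # uniform (Junie, flat 0.95)
print(calibration_mode([0.3, 0.9]))    # per-finding (what none of them did)
```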
Which source files each agent examined and cited in evidence signals. Unique discoveries are highlighted.
| Source File | Claude | Junie | Codex |
|---|---|---|---|
| pages/terms.js | ✓ | ✓ | ✓ |
| lib/auth/jwtAuth.js | ✓ | ✓ | — |
| lib/constants/api.js | ✓ | ✓ | — |
| pages/settings/profile.js | ✓ | ✓ | ✓ |
| components/SignupModal.jsx | ✓ | ✓ | — |
| instrumentation-client.js | ✓ | ✓ | — |
| e2e/auth-complete.spec.js (Junie only) | — | ✓ | — |
| lib/api/client.js (Claude only) | ✓ | — | — |
| pages/settings/notifications.js (Claude only) | ✓ | — | — |
| components/LoginModal.jsx (Junie only) | — | ✓ | — |
| pages/settings/security.js (Codex only) | — | — | ✓ |
Two files were found by all three agents: pages/terms.js and pages/settings/profile.js. Four files were shared between Claude and Junie: lib/auth/jwtAuth.js, lib/constants/api.js, components/SignupModal.jsx, and instrumentation-client.js. Each agent found files the others missed — Claude uniquely cited lib/api/client.js and pages/settings/notifications.js, Junie uniquely found components/LoginModal.jsx and e2e/auth-complete.spec.js, and Codex uniquely found pages/settings/security.js. Collectively, the three agents cited 11 distinct source files.
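These overlap figures fall out of simple set arithmetic over the citation lists in the table above:

```python
claude = {"pages/terms.js", "lib/auth/jwtAuth.js", "lib/constants/api.js",
          "pages/settings/profile.js", "components/SignupModal.jsx",
          "instrumentation-client.js", "lib/api/client.js",
          "pages/settings/notifications.js"}
junie = {"pages/terms.js", "lib/auth/jwtAuth.js", "lib/constants/api.js",
         "pages/settings/profile.js", "components/SignupModal.jsx",
         "instrumentation-client.js", "components/LoginModal.jsx",
         "e2e/auth-complete.spec.js"}
codex = {"pages/terms.js", "pages/settings/profile.js",
         "pages/settings/security.js"}

print(sorted(claude & junie & codex))  # the 2 files every agent found
print(len(claude | junie | codex))     # 11 distinct files collectively
print(sorted(claude - junie - codex))  # Claude's unique discoveries
print(sorted(codex - claude - junie))  # Codex's unique discovery
```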
Beyond metrics, each agent exhibited distinct patterns in how it approached the evaluation task.
- Did the agent follow the get_compliance_plan → get_bundle_tasks → report_evidence loop?
- How the agent structured its report_evidence calls.
- The specificity of evidence signal names.
- Whether the agent modified the codebase during evaluation.
- How the agent handled the report_evidence API constraint.
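The workflow the agents were expected to follow can be sketched as a plain loop. The three tool names come from this evaluation; the client interface and payload shapes below are assumptions for illustration, not the real MCP schema:

```python
class StubMCP:
    """Minimal in-memory stand-in for an MCP client -- purely illustrative."""
    def __init__(self):
        self.reported = []

    def call(self, tool, **args):
        if tool == "get_compliance_plan":
            return {"bundles": [{"id": "gdpr-core"}]}
        if tool == "get_bundle_tasks":
            return [{"requirement_id": "art-17"}, {"requirement_id": "art-32"}]
        if tool == "report_evidence":
            self.reported.append(args)

def run_evaluation(mcp, default_confidence=0.50):
    """Follow the get_compliance_plan -> get_bundle_tasks ->
    report_evidence loop, emitting one evidence signal per task."""
    plan = mcp.call("get_compliance_plan")
    for bundle in plan["bundles"]:
        for task in mcp.call("get_bundle_tasks", bundle_id=bundle["id"]):
            mcp.call("report_evidence",
                     requirement_id=task["requirement_id"],
                     signal=f"finding-for-{task['requirement_id']}",
                     confidence=default_confidence)

mcp = StubMCP()
run_evaluation(mcp)
print(len(mcp.reported))  # 2 -- one evidence signal per stubbed task
```

Claude ran this loop unprompted, Junie followed it on instruction, and Codex needed explicit prompting before calling report_evidence at all.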