RuleMesh MCP — Cross-Agent Evidence Signal Comparison

How three AI models interpret identical MCP compliance data, and what their differences reveal about agent-driven regulatory analysis

Test Parameters

  • Date Range: 2026-03-23 to 2026-03-24
  • Target: landingpage/ (Next.js 15, React 18)
  • Commit: 756a5a7
  • MCP Version: v5.1 with layer filtering
  • Regulation: GDPR
  • Bundles Evaluated: 7 bundles (all HIGH priority)
  • Total Requirements: 118

Agent Overview

Three agents received identical MCP data. Each was asked to evaluate the same codebase against the same GDPR requirements.

Claude Opus 4.6

Claude Code CLI · Worktree Isolation

  • Signals Reported: 39
  • Unique Signal Names: 39
  • Requirements Covered: 38
  • Files Cited: 8

Junie / Gemini 3 Flash

PyCharm Junie Agent

  • Signals Reported: 119
  • Unique Signal Names: 91
  • Requirements Covered: 118
  • Files Cited: 8

Codex / GPT-5.4

OpenAI Codex CLI

  • Signals Reported: 236
  • Unique Signal Names: 13
  • Requirements Covered: 118
  • Files Cited: 3

Detailed Comparison

Side-by-side metrics across all three agents on the same evaluation target.

Metric                                Claude Opus 4.6   Junie / Gemini 3 Flash   Codex / GPT-5.4

Volume
  Signals Reported                    39                119                      236
  Unique Signal Names                 39                91                       13
  Requirements Covered                38 / 118          118 / 118                118 / 118

Evidence Breakdown
  Code Findings                       22                19                       12
  Manual / Gap Findings               17                100                      224
  Average Confidence                  0.50              0.95                     0.50

Discovery
  Source Files Cited                  8                 8                        3
  Unique Files Found                  2                 2                        1

Behavior
  Followed MCP Workflow               Automatic         Yes                      After prompting
  Called report_evidence Autonomously Yes               Yes                      No
  Edited Source Files                 No                Yes                      Yes

Signal Quality Analysis

The ratio of unique signal names to total signals reveals how each agent names and categorizes its findings.

Claude Opus 4.6 — Precise

39 unique signals for 38 requirements. Every signal name was unique — one descriptive label per finding. Claude reported only what it found, skipping requirements it could not evaluate. This produced a concise, high-signal-to-noise evidence record where each entry stands alone as a distinct finding.

  • Signal-to-unique ratio: 1.00
  • Strategy: one signal per code-level finding
  • Selective coverage — 38 of 118 requirements addressed

Junie / Gemini 3 Flash — Comprehensive

91 unique signals for 118 requirements. Junie covered every requirement in the evaluation, generating one signal per requirement with some grouped under shared labels. High confidence (0.95) across all signals. Systematic and thorough — no requirement left behind.

  • Signal-to-unique ratio: 0.76
  • Strategy: one signal per requirement, systematic sweep
  • Full coverage — all 118 requirements addressed

Codex / GPT-5.4 — Generalized

13 unique signals for 118 requirements. Codex applied broad category labels (e.g., "governance-gap" for 64 requirements, "security-control-gap" for 34) rather than per-finding descriptions. Every code-type signal was duplicated, inflating volume without adding information. Full coverage, but at the cost of specificity.

  • Signal-to-unique ratio: 0.06
  • Strategy: one label per bundle category, applied across many requirements
  • Heavy duplication — 236 signals from 13 unique labels
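The ratio figures above can be reproduced directly from the reported counts. A minimal sketch, with the (signals, unique) pairs taken from the tables in this report:

```python
# Signal-to-unique ratio: unique signal names divided by total signals.
# Counts are the ones reported in this comparison.
agents = {
    "Claude Opus 4.6": (39, 39),
    "Junie / Gemini 3 Flash": (119, 91),
    "Codex / GPT-5.4": (236, 13),
}

def uniqueness_ratio(total_signals: int, unique_names: int) -> float:
    """Fraction of signals carrying a distinct name, rounded to 2 places."""
    return round(unique_names / total_signals, 2)

for name, (total, unique) in agents.items():
    print(f"{name}: {uniqueness_ratio(total, unique):.2f}")
```

A ratio of 1.00 means every signal name is distinct; values near zero indicate the same label is being stamped onto many requirements.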

Confidence Calibration

Each agent self-reported a confidence score with every evidence signal. The variation highlights differences in calibration behavior.

  • Claude Opus 4.6: 0.50
  • Junie / Gemini 3 Flash: 0.95
  • Codex / GPT-5.4: 0.50

Claude and Codex both used the default confidence value of 0.50, neither adjusting it per finding. Gemini 3 Flash assigned 0.95 uniformly, indicating high self-assessed certainty across all signals. None of the three agents varied confidence on a per-signal basis — calibration was model-wide rather than finding-specific. This suggests that meaningful confidence differentiation may require explicit prompt engineering or schema-level guidance.
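One way to push agents toward per-finding calibration is to validate confidence at the schema level rather than accept a silent default. The sketch below is hypothetical: the field names and validation are assumptions, not the actual RuleMesh MCP schema.

```python
# Hypothetical evidence-signal shape with an explicit, range-checked
# confidence field. Not the real RuleMesh MCP schema.
from dataclasses import dataclass

@dataclass
class EvidenceSignal:
    name: str
    requirement_id: str
    confidence: float  # 0.0-1.0, intended to be set per finding

    def __post_init__(self):
        # Reject out-of-range values instead of clamping silently.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")

# An agent prompted to justify its score might raise it above the
# 0.50 default when the gap is directly observable in code:
signal = EvidenceSignal(
    name="consent-banner-missing",
    requirement_id="gdpr-consent",  # illustrative ID
    confidence=0.85,
)
```

Requiring a rationale string alongside the score, or making the field mandatory with no default, are further schema-level nudges worth testing.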

Source File Discovery

Which source files each agent examined and cited in evidence signals. Unique discoveries are highlighted.

Source File                       Claude   Junie   Codex
pages/terms.js                    ✓        ✓       ✓
lib/auth/jwtAuth.js               ✓        ✓       —
lib/constants/api.js              ✓        ✓       —
pages/settings/profile.js         ✓        ✓       ✓
components/SignupModal.jsx        ✓        ✓       —
instrumentation-client.js         ✓        ✓       —
e2e/auth-complete.spec.js         —        ✓       —      Junie only
lib/api/client.js                 ✓        —       —      Claude only
pages/settings/notifications.js   ✓        —       —      Claude only
components/LoginModal.jsx         —        ✓       —      Junie only
pages/settings/security.js        —        —       ✓      Codex only

Overlap Analysis

Two files were found by all three agents: pages/terms.js and pages/settings/profile.js. Four files were shared between Claude and Junie: lib/auth/jwtAuth.js, lib/constants/api.js, components/SignupModal.jsx, and instrumentation-client.js. Each agent found files the others missed — Claude uniquely cited lib/api/client.js and pages/settings/notifications.js, Junie uniquely found components/LoginModal.jsx and e2e/auth-complete.spec.js, and Codex uniquely found pages/settings/security.js. Collectively, the three agents cited 11 distinct source files.
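The overlap arithmetic above can be checked with plain set operations. A sketch, with the file lists taken from the discovery table in this report:

```python
# Per-agent cited files, as listed in the source-file discovery table.
claude = {"pages/terms.js", "lib/auth/jwtAuth.js", "lib/constants/api.js",
          "pages/settings/profile.js", "components/SignupModal.jsx",
          "instrumentation-client.js", "lib/api/client.js",
          "pages/settings/notifications.js"}
junie = {"pages/terms.js", "lib/auth/jwtAuth.js", "lib/constants/api.js",
         "pages/settings/profile.js", "components/SignupModal.jsx",
         "instrumentation-client.js", "components/LoginModal.jsx",
         "e2e/auth-complete.spec.js"}
codex = {"pages/terms.js", "pages/settings/profile.js",
         "pages/settings/security.js"}

found_by_all = claude & junie & codex   # intersection of all three
claude_only = claude - junie - codex    # files no other agent cited
total_distinct = claude | junie | codex # union across agents

print(len(found_by_all), len(claude_only), len(total_distinct))
```

Running this reproduces the figures in the prose: 2 files found by all three, 2 Claude-only files, and 11 distinct files overall.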

Behavioral Differences

Beyond metrics, each agent exhibited distinct patterns in how it approached the evaluation task.

MCP Workflow Adherence

Did the agent follow the get_compliance_plan → get_bundle_tasks → report_evidence loop?

Claude: Followed the MCP workflow automatically without any prompting. Called tools in the expected sequence from the first message.
Junie: Followed the workflow. Systematically iterated through all bundles and requirements.
Codex: Did not call report_evidence initially. Interpreted the task as "review only" and produced a summary. Required explicit instruction to report findings via MCP.
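The expected loop can be sketched as follows. The three tool names come from this report; the stub client, call signatures, and the evaluate() placeholder are illustrative assumptions, not the real RuleMesh MCP API.

```python
# Sketch of the get_compliance_plan -> get_bundle_tasks -> report_evidence
# loop, against a minimal stand-in client.
class StubMCP:
    """Minimal stand-in for the RuleMesh MCP client (not the real API)."""
    def __init__(self):
        self.reported = []

    def get_compliance_plan(self, regulation):
        return {"bundles": [{"id": "consent"}, {"id": "transparency"}]}

    def get_bundle_tasks(self, bundle_id):
        return [{"requirement_id": f"{bundle_id}-req-1"}]

    def report_evidence(self, requirement_id, signal_name, confidence):
        # One signal per call; batching is not supported.
        self.reported.append((requirement_id, signal_name, confidence))

def evaluate(task):
    # Placeholder for the agent's actual code review of the target repo.
    return {"name": f"{task['requirement_id']}-gap", "confidence": 0.5}

def run_evaluation(mcp, regulation="GDPR"):
    plan = mcp.get_compliance_plan(regulation=regulation)
    for bundle in plan["bundles"]:
        for task in mcp.get_bundle_tasks(bundle_id=bundle["id"]):
            finding = evaluate(task)
            if finding is not None:
                mcp.report_evidence(
                    requirement_id=task["requirement_id"],
                    signal_name=finding["name"],
                    confidence=finding["confidence"],
                )

mcp = StubMCP()
run_evaluation(mcp)
print(len(mcp.reported))  # one evidence call per evaluated requirement
```

Claude and Junie executed this shape unprompted; Codex stopped after the review step until explicitly told to call report_evidence.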

Evidence Reporting Strategy

How the agent structured its report_evidence calls.

Claude: Reported evidence as part of the review process. Each signal was individually crafted with a unique, descriptive name.
Junie: Reported one signal per requirement, processing each systematically. High confidence (0.95) on every signal.
Codex: Attempted to bulk-submit evidence before learning that report_evidence accepts one signal at a time. Used generic labels repeatedly across requirements.

Signal Granularity

The specificity of evidence signal names.

Claude: One unique, descriptive name per finding. Grouped related checklist items into a single signal where appropriate.
Junie: One signal per requirement. 91 unique names across 119 signals, indicating some reuse for closely related items.
Codex: 13 unique names across 236 signals. Broad labels like "governance-gap" (64×) and "security-control-gap" (34×) applied at the bundle level.

File Editing Behavior

Whether the agent modified the codebase during evaluation.

Claude: Did not edit any files. Treated the task as a read-only evaluation.
Junie: Edited files unprompted during evaluation.
Codex: Edited files unprompted during evaluation.

Bulk vs. Individual Reporting

How the agent handled the report_evidence API constraint.

Claude: Called report_evidence individually for each signal from the start. No issues with the one-at-a-time API.
Junie: Called report_evidence individually for each signal. No issues.
Codex: Attempted to batch-submit evidence before being told report_evidence accepts one at a time. Adapted after instruction.
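The adaptation Codex had to make, replaying a prepared batch as sequential single-signal calls, is a small wrapper. The report_fn keyword interface below is an assumption, not the real MCP client signature:

```python
# Sketch: fan a prepared batch of findings out into one
# report_evidence-style call per signal.
def submit_individually(report_fn, findings):
    """Replay a batch as sequential single-signal calls; return the count."""
    for finding in findings:
        report_fn(**finding)  # exactly one signal per call
    return len(findings)

# Usage with a stand-in recorder instead of a live MCP client:
calls = []
count = submit_individually(
    lambda **kw: calls.append(kw),
    [{"signal_name": "governance-gap"},
     {"signal_name": "security-control-gap"}],
)
```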

Key Insight

The MCP provides the same structured compliance data to all agents. The quality of evidence signals depends entirely on how each model interprets and acts on that data. Claude was the most precise — every signal was unique and descriptive. Junie had the best coverage — all 118 requirements evaluated with high confidence. Codex found unique files but needed explicit prompting to report its findings and relied on generic signal categories.

Common ground: All three agents identified the same critical gaps in the target codebase — no consent banner, no privacy policy, no DPO contact information. The MCP's structured requirement data consistently surfaced these core GDPR deficiencies regardless of which model consumed it.