RuleMesh MCP — Cross-Agent Evidence Signal Comparison

How three AI models interpret identical MCP compliance data, and what their differences reveal about agent-driven regulatory analysis

Test Parameters

  • Date Range: 2026-03-23 to 2026-03-24
  • Target: landingpage/ (Next.js 15, React 18)
  • Commit: 756a5a7
  • MCP Version: v5.1 with layer filtering
  • Regulation: GDPR
  • Bundles Evaluated: 7 bundles (all HIGH priority)
  • Total Requirements: 118

Agent Overview

Three agents received identical MCP data. Each was asked to evaluate the same codebase against the same GDPR requirements.

Claude Opus 4.6

Claude Code CLI · Worktree Isolation

  • Signals Reported: 39
  • Unique Signal Names: 39
  • Requirements Covered: 38
  • Files Cited: 8

Junie / Gemini 3 Flash

PyCharm Junie Agent

  • Signals Reported: 119
  • Unique Signal Names: 91
  • Requirements Covered: 118
  • Files Cited: 8

Codex / GPT-5.4

OpenAI Codex CLI

  • Signals Reported: 236
  • Unique Signal Names: 13
  • Requirements Covered: 118
  • Files Cited: 3

Detailed Comparison

Side-by-side metrics across all three agents on the same evaluation target.

Metric                                Claude Opus 4.6   Junie / Gemini 3 Flash   Codex / GPT-5.4

Volume
  Signals Reported                    39                119                      236
  Unique Signal Names                 39                91                       13
  Requirements Covered                38 / 118          118 / 118                118 / 118

Evidence Breakdown
  Code Findings                       22                19                       12
  Manual / Gap Findings               17                100                      224
  Average Confidence                  0.50              0.95                     0.50

Discovery
  Source Files Cited                  8                 8                        3
  Unique Files Found                  2                 2                        1

Behavior
  Followed MCP Workflow               Automatic         Yes                      After prompting
  Called report_evidence Autonomously Yes               Yes                      No
  Edited Source Files                 No                Yes                      Yes

Signal Quality Analysis

The ratio of unique signal names to total signals reveals how each agent names and categorizes its findings.

Claude Opus 4.6 — Precise

39 unique signals for 38 requirements. Every signal name was unique — one descriptive label per finding. Claude reported only what it found, skipping requirements it could not evaluate. This produced a concise, high-signal-to-noise evidence record where each entry stands alone as a distinct finding.

  • Signal-to-unique ratio: 1.00
  • Strategy: one signal per code-level finding
  • Selective coverage — 38 of 118 requirements addressed

Junie / Gemini 3 Flash — Comprehensive

91 unique signals for 118 requirements. Junie covered every requirement in the evaluation, generating one signal per requirement with some grouped under shared labels. High confidence (0.95) across all signals. Systematic and thorough — no requirement left behind.

  • Signal-to-unique ratio: 0.76
  • Strategy: one signal per requirement, systematic sweep
  • Full coverage — all 118 requirements addressed

Codex / GPT-5.4 — Generalized

13 unique signals for 118 requirements. Codex applied broad category labels (e.g., "governance-gap" for 64 requirements, "security-control-gap" for 34) rather than per-finding descriptions. Every code-type signal was duplicated, inflating volume without adding information. Full coverage, but at the cost of specificity.

  • Signal-to-unique ratio: 0.06
  • Strategy: one label per bundle category, applied across many requirements
  • Heavy duplication — 236 signals from 13 unique labels
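The ratio figures above can be reproduced directly from the reported counts. A minimal sketch, with the (signals, unique) pairs taken from the tables in this report:

```python
# Signal-to-unique ratio: unique signal names divided by total signals.
# Counts are the ones reported in this comparison.
agents = {
    "Claude Opus 4.6": (39, 39),
    "Junie / Gemini 3 Flash": (119, 91),
    "Codex / GPT-5.4": (236, 13),
}

def uniqueness_ratio(total_signals: int, unique_names: int) -> float:
    """Fraction of signals carrying a distinct name, rounded to 2 places."""
    return round(unique_names / total_signals, 2)

for name, (total, unique) in agents.items():
    print(f"{name}: {uniqueness_ratio(total, unique):.2f}")
```

A ratio of 1.00 means every signal name is distinct; values near zero indicate the same label is being stamped onto many requirements.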

Confidence Calibration

Each agent self-reported a confidence score with every evidence signal. The variation highlights differences in calibration behavior.

  • Claude Opus 4.6: 0.50
  • Junie / Gemini 3 Flash: 0.95
  • Codex / GPT-5.4: 0.50

Claude and Codex both used the default confidence value of 0.50, neither adjusting it per finding. Gemini 3 Flash assigned 0.95 uniformly, indicating high self-assessed certainty across all signals. None of the three agents varied confidence on a per-signal basis — calibration was model-wide rather than finding-specific. This suggests that meaningful confidence differentiation may require explicit prompt engineering or schema-level guidance.
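One way to push agents toward per-finding calibration is to validate confidence at the schema level rather than accept a silent default. The sketch below is hypothetical: the field names and validation are assumptions, not the actual RuleMesh MCP schema.

```python
# Hypothetical evidence-signal shape with an explicit, range-checked
# confidence field. Not the real RuleMesh MCP schema.
from dataclasses import dataclass

@dataclass
class EvidenceSignal:
    name: str
    requirement_id: str
    confidence: float  # 0.0-1.0, intended to be set per finding

    def __post_init__(self):
        # Reject out-of-range values instead of clamping silently.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")

# An agent prompted to justify its score might raise it above the
# 0.50 default when the gap is directly observable in code:
signal = EvidenceSignal(
    name="consent-banner-missing",
    requirement_id="gdpr-consent",  # illustrative ID
    confidence=0.85,
)
```

Requiring a rationale string alongside the score, or making the field mandatory with no default, are further schema-level nudges worth testing.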

Source File Discovery

Which source files each agent examined and cited in evidence signals. Unique discoveries are highlighted.

Source File                       Claude   Junie   Codex
pages/terms.js                    ✓        ✓       ✓
lib/auth/jwtAuth.js               ✓        ✓       —
lib/constants/api.js              ✓        ✓       —
pages/settings/profile.js         ✓        ✓       ✓
components/SignupModal.jsx        ✓        ✓       —
instrumentation-client.js         ✓        ✓       —
e2e/auth-complete.spec.js         —        ✓       —      Junie only
lib/api/client.js                 ✓        —       —      Claude only
pages/settings/notifications.js   ✓        —       —      Claude only
components/LoginModal.jsx         —        ✓       —      Junie only
pages/settings/security.js        —        —       ✓      Codex only

Overlap Analysis

Two files were found by all three agents: pages/terms.js and pages/settings/profile.js. Four files were shared between Claude and Junie: lib/auth/jwtAuth.js, lib/constants/api.js, components/SignupModal.jsx, and instrumentation-client.js. Each agent found files the others missed — Claude uniquely cited lib/api/client.js and pages/settings/notifications.js, Junie uniquely found components/LoginModal.jsx and e2e/auth-complete.spec.js, and Codex uniquely found pages/settings/security.js. Collectively, the three agents cited 11 distinct source files.
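The overlap arithmetic above can be checked with plain set operations. A sketch, with the file lists taken from the discovery table in this report:

```python
# Per-agent cited files, as listed in the source-file discovery table.
claude = {"pages/terms.js", "lib/auth/jwtAuth.js", "lib/constants/api.js",
          "pages/settings/profile.js", "components/SignupModal.jsx",
          "instrumentation-client.js", "lib/api/client.js",
          "pages/settings/notifications.js"}
junie = {"pages/terms.js", "lib/auth/jwtAuth.js", "lib/constants/api.js",
         "pages/settings/profile.js", "components/SignupModal.jsx",
         "instrumentation-client.js", "components/LoginModal.jsx",
         "e2e/auth-complete.spec.js"}
codex = {"pages/terms.js", "pages/settings/profile.js",
         "pages/settings/security.js"}

found_by_all = claude & junie & codex   # intersection of all three
claude_only = claude - junie - codex    # files no other agent cited
total_distinct = claude | junie | codex # union across agents

print(len(found_by_all), len(claude_only), len(total_distinct))
```

Running this reproduces the figures in the prose: 2 files found by all three, 2 Claude-only files, and 11 distinct files overall.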

Behavioral Differences

Beyond metrics, each agent exhibited distinct patterns in how it approached the evaluation task.

MCP Workflow Adherence

Did the agent follow the get_compliance_plan → get_bundle_tasks → report_evidence loop?

Claude: Followed the MCP workflow automatically without any prompting. Called tools in the expected sequence from the first message.
Junie: Followed the workflow. Systematically iterated through all bundles and requirements.
Codex: Did not call report_evidence initially. Interpreted the task as "review only" and produced a summary. Required explicit instruction to report findings via MCP.
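The expected loop can be sketched as follows. The three tool names come from this report; the stub client, call signatures, and the evaluate() placeholder are illustrative assumptions, not the real RuleMesh MCP API.

```python
# Sketch of the get_compliance_plan -> get_bundle_tasks -> report_evidence
# loop, against a minimal stand-in client.
class StubMCP:
    """Minimal stand-in for the RuleMesh MCP client (not the real API)."""
    def __init__(self):
        self.reported = []

    def get_compliance_plan(self, regulation):
        return {"bundles": [{"id": "consent"}, {"id": "transparency"}]}

    def get_bundle_tasks(self, bundle_id):
        return [{"requirement_id": f"{bundle_id}-req-1"}]

    def report_evidence(self, requirement_id, signal_name, confidence):
        # One signal per call; batching is not supported.
        self.reported.append((requirement_id, signal_name, confidence))

def evaluate(task):
    # Placeholder for the agent's actual code review of the target repo.
    return {"name": f"{task['requirement_id']}-gap", "confidence": 0.5}

def run_evaluation(mcp, regulation="GDPR"):
    plan = mcp.get_compliance_plan(regulation=regulation)
    for bundle in plan["bundles"]:
        for task in mcp.get_bundle_tasks(bundle_id=bundle["id"]):
            finding = evaluate(task)
            if finding is not None:
                mcp.report_evidence(
                    requirement_id=task["requirement_id"],
                    signal_name=finding["name"],
                    confidence=finding["confidence"],
                )

mcp = StubMCP()
run_evaluation(mcp)
print(len(mcp.reported))  # one evidence call per evaluated requirement
```

Claude and Junie executed this shape unprompted; Codex stopped after the review step until explicitly told to call report_evidence.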

Evidence Reporting Strategy

How the agent structured its report_evidence calls.

Claude: Reported evidence as part of the review process. Each signal was individually crafted with a unique, descriptive name.
Junie: Reported one signal per requirement, processing each systematically. High confidence (0.95) on every signal.
Codex: Attempted to bulk-submit evidence before learning that report_evidence accepts one signal at a time. Used generic labels repeatedly across requirements.

Signal Granularity

The specificity of evidence signal names.

Claude: One unique, descriptive name per finding. Grouped related checklist items into a single signal where appropriate.
Junie: One signal per requirement. 91 unique names across 119 signals, indicating some reuse for closely related items.
Codex: 13 unique names across 236 signals. Broad labels like "governance-gap" (64×) and "security-control-gap" (34×) applied at the bundle level.

File Editing Behavior

Whether the agent modified the codebase during evaluation.

Claude: Did not edit any files. Treated the task as a read-only evaluation.
Junie: Edited files unprompted during evaluation.
Codex: Edited files unprompted during evaluation.

Bulk vs. Individual Reporting

How the agent handled the report_evidence API constraint.

Claude: Called report_evidence individually for each signal from the start. No issues with the one-at-a-time API.
Junie: Called report_evidence individually for each signal. No issues.
Codex: Attempted to batch-submit evidence before being told report_evidence accepts one at a time. Adapted after instruction.
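The adaptation Codex had to make, replaying a prepared batch as sequential single-signal calls, is a small wrapper. The report_fn keyword interface below is an assumption, not the real MCP client signature:

```python
# Sketch: fan a prepared batch of findings out into one
# report_evidence-style call per signal.
def submit_individually(report_fn, findings):
    """Replay a batch as sequential single-signal calls; return the count."""
    for finding in findings:
        report_fn(**finding)  # exactly one signal per call
    return len(findings)

# Usage with a stand-in recorder instead of a live MCP client:
calls = []
count = submit_individually(
    lambda **kw: calls.append(kw),
    [{"signal_name": "governance-gap"},
     {"signal_name": "security-control-gap"}],
)
```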

Key Insight

The MCP provides the same structured compliance data to all agents. The quality of evidence signals depends entirely on how each model interprets and acts on that data. Claude was the most precise — every signal was unique and descriptive. Junie had the best coverage — all 118 requirements evaluated with high confidence. Codex found unique files but needed explicit prompting to report its findings and relied on generic signal categories.

Common ground: All three agents identified the same critical gaps in the target codebase — no consent banner, no privacy policy, no DPO contact information. The MCP's structured requirement data consistently surfaced these core GDPR deficiencies regardless of which model consumed it.