The Agent-Tool Discoverability Standard v0.3

A falsifiable, spec-derived check for AI-callable servers — plus an offline-verifiable certificate · 2026-06-18 · SaSame

We audited 920 public MCP servers and found that ~80% return no real content. The most-repeated builder complaint points to the same gap: there is no trusted signal for whether a server actually works, and the official registry's own moderation policy states that it makes no quality guarantees. So the question "is this server MCP-ready?" has no agreed answer.

This is a proposed answer designed to be checkable, not trusted. It is transport-agnostic (it survives the "MCP vs CLI vs Skills" debate) and we name it the Agent-Tool Discoverability Standard.

The 10 criteria

Each criterion binds to one of three non-taste sources — the published spec/registry schema (a competitor's checker reaches the same boolean), crypto/information-theory, or a direct measurement. None is an opinion or a star rating.

#	Criterion	What is actually checked	Bound to
C1	Protocol handshake conformance	A JSON-RPC 2.0 `initialize` returns `protocolVersion` + `capabilities`	MCP spec 2025-11-25
C2	Tool listability	`tools/list` returns a `result.tools[]` array (session-id forwarded for stateful servers)	MCP spec /server/tools
C3	Tool object validity	Every tool has a spec-legal `name` `[A-Za-z0-9_-]{1,128}`, a non-empty description, and an object `inputSchema` (`type:"object"`, declared `properties`, or a bare `{}` — a valid JSON Schema meaning "accepts anything", legitimately emitted for no-arg tools and accepted since v0.3; a missing/null/non-object `inputSchema` is rejected)	MCP spec Tool type + JSON Schema
C4	Description sufficiency	Every description ≥12 chars, median ≥20, distinctness ratio ≥0.6 (templated/duplicate descriptions are unselectable by an LLM)	Registry schema + information theory
C5	Safety annotation presence	≥50% of tools carry a valid boolean `annotations` hint (readOnlyHint / destructiveHint / idempotentHint / openWorldHint)	MCP spec ToolAnnotations
C6	Liveness & latency	`initialize` returns a `2xx` response in under 5000 ms	Direct measurement
C7	Returns real content (anti-ghost)	A safe, read-only tool call returns substantive, non-echo MCP `content[]` / `structuredContent` (empty/echo/placeholder ⇒ fail). We invoke only read-only tools (minimal valid args if required); undeclared-safety tools are probed empty-args only. Priced/x402 ⇒ "delivery UNVERIFIED" — never asserted.	Census empirical + information theory
C8	Machine-discoverable identity	The server self-describes (`serverInfo.name` / version)	Official MCP Registry server.json schema
C9	Token efficiency	The decoded `tools/list` result payload (`Buffer.byteLength(JSON.stringify(result))`, since v0.3 — not the raw SSE-framed body) is <40000 bytes (token-bloat is a known ecosystem failure)	Direct measurement
C10	Honest error behavior	An unknown method returns a structured JSON-RPC error — not a hang or a crash	JSON-RPC 2.0
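The thresholds for C3, C4, and C9 are concrete enough to sketch as pure functions over a decoded `tools/list` result. This is a minimal illustration, not the audit's actual code: the function names are hypothetical, and three details are assumptions the table does not pin down. The distinctness ratio is taken as unique (trimmed, lowercased) descriptions divided by total descriptions, the median of an even count is the mean of the two middle values, and an empty tool list is treated as failing C4.

```javascript
// Sketch of C3, C4 and C9. Thresholds come from the criteria table;
// all function names here are hypothetical.
const NAME_RE = /^[A-Za-z0-9_-]{1,128}$/;

// C3 (schema part): accept a bare {}, type:"object", or declared properties.
// Reject missing, null, array, and other non-object values.
function hasAcceptableSchema(schema) {
  if (schema === null || typeof schema !== "object" || Array.isArray(schema)) return false;
  if (Object.keys(schema).length === 0) return true; // bare {} = "accepts anything"
  const props = schema.properties;
  const hasProps = props !== null && typeof props === "object" && !Array.isArray(props);
  return schema.type === "object" || hasProps;
}

// C3: spec-legal name, non-empty description, acceptable inputSchema.
function isValidTool(tool) {
  return (
    typeof tool.name === "string" && NAME_RE.test(tool.name) &&
    typeof tool.description === "string" && tool.description.trim().length > 0 &&
    hasAcceptableSchema(tool.inputSchema)
  );
}

// Median with the usual even-count rule (assumption: mean of the two middle values).
function median(nums) {
  const s = [...nums].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// C4: every description >= 12 chars, median >= 20, distinctness ratio >= 0.6.
// Lengths are JavaScript string lengths (UTF-16 code units).
function descriptionsSufficient(tools) {
  if (tools.length === 0) return false; // assumption: nothing to select from
  const descs = tools.map((t) => String(t.description ?? "").trim());
  const lengths = descs.map((d) => d.length);
  const distinct = new Set(descs.map((d) => d.toLowerCase())).size;
  return (
    Math.min(...lengths) >= 12 &&
    median(lengths) >= 20 &&
    distinct / descs.length >= 0.6
  );
}

// C9: byte size of the decoded result payload, not the raw SSE-framed body.
function toolsListBytes(result) {
  return Buffer.byteLength(JSON.stringify(result));
}

function withinTokenBudget(result) {
  return toolsListBytes(result) < 40000;
}
```

A server passes C3 when `result.tools.every(isValidTool)` is true. Three tools that share one templated description give a distinctness ratio of 1/3, so they fail C4 even when each description is long enough.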

Grade is deterministic and strict (v0.3): A = a perfect 10/10, B = 8–9, C = 5–7, D otherwise — with an honesty cap: no verified real content ⇒ capped at B; priced delivery is never counted as verified. Because A demands a flawless pass including verified delivery, A is deliberately rare. An unreachable server is a measured D, not an error — the audit always completes all 10 criteria (only an SSRF policy refusal, e.g. a private/loopback target, refuses to measure).
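The grading rule above can be written as one small function. This is a sketch under stated assumptions: `grade` and its parameters are hypothetical names, `passes` holds the ten per-criterion booleans in order C1 to C10, and `contentVerified` is true only when delivery of real content was actually verified (never for priced/x402 delivery).

```javascript
// Sketch of the v0.3 grade mapping: A = 10/10, B = 8-9, C = 5-7, D otherwise,
// with the honesty cap applied last. Names are hypothetical.
function grade(passes, contentVerified) {
  if (passes.length !== 10) throw new RangeError("expected exactly 10 criteria");
  const score = passes.filter(Boolean).length;
  let letter = score === 10 ? "A" : score >= 8 ? "B" : score >= 5 ? "C" : "D";
  // Honesty cap: without verified real content the grade cannot exceed B.
  if (!contentVerified && letter === "A") letter = "B";
  return letter;
}
```

For example, `grade(Array(10).fill(true), false)` yields "B": ten passes without verified delivery still hit the cap.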

Measurement semantics (v0.3 calibration)

Why this is an "absolute," not a SaSame opinion

The internet is a deterministic machine: HTTP, JSON-RPC, the MCP spec, and public-key crypto are not matters of taste. This standard rests only on those. The verdict ships with its own falsification procedure: every certificate carries, per criterion, an evidence_sha256 and the probe that produced it. You do not have to trust us — you re-run the audit and recompute the hashes. If we are wrong, the math says so. Because a fabricated PASS is detectable by anyone, our incentives are structurally aligned to truth; a single faked certificate would destroy the only thing we have (a verifiable reputation).
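Recomputing the hashes might look like the sketch below. The exact bytes that feed each evidence_sha256 are not specified here, so this assumes the evidence is the JSON-decoded probe response, serialized with recursively sorted object keys and hashed as UTF-8. The helper names and the shape of the `criteria` map are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Serialize JSON-decoded data with object keys sorted recursively, so that
// two equal values always produce the same bytes. (Assumed canonical form.)
function canonicalize(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const members = Object.keys(value)
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(value[k]));
    return "{" + members.join(",") + "}";
  }
  return JSON.stringify(value); // strings, numbers, booleans, null
}

// SHA-256 over the canonical form, as lowercase hex.
function evidenceSha256(evidence) {
  return createHash("sha256").update(canonicalize(evidence), "utf8").digest("hex");
}

// Compare a certificate's per-criterion hashes against freshly probed
// evidence and return the ids of the criteria that do not match.
function mismatchedCriteria(criteria, freshEvidence) {
  return Object.keys(criteria).filter(
    (id) => evidenceSha256(freshEvidence[id]) !== criteria[id].evidence_sha256
  );
}
```

An empty array from `mismatchedCriteria` means every recomputed hash agrees with the certificate. Note that evidence which legitimately varies between runs, such as a measured latency, cannot hash identically on a re-run.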

The MCP-Ready Certificate (offline-verifiable, like a TLS cert)

A verdict is issued as a compact canonical-JSON document under an ed25519 signature. It behaves like a TLS certificate, not a badge: it asserts the subject, pinned spec versions, per-criterion {pass, evidence_sha256}, the grade (with the honesty cap), a short expiry (liveness decays), and the issuer pubkey. Anyone verifies it offline with no callback to SaSame.

issuer pubkey (ed25519, SPKI hex): 302a300506032b6570032100439ce47d384c8ceb07c9040aef780cc3a2ba5a63c14027ad77ab458111f20fb6
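Offline verification against that key could be sketched with Node's built-in crypto module, as below. The certificate's wire layout is not given here, so this assumes you hold the signed canonical-JSON text exactly as issued plus a hex-encoded signature over its UTF-8 bytes. `verifyCertificate` and its parameter names are hypothetical.

```javascript
const nodeCrypto = require("node:crypto");

// Issuer public key as published above (ed25519, SPKI DER, hex-encoded).
const ISSUER_SPKI_HEX =
  "302a300506032b6570032100439ce47d384c8ceb07c9040aef780cc3a2ba5a63c14027ad77ab458111f20fb6";

// Returns true when signatureHex is a valid ed25519 signature over the exact
// bytes of payloadText. No network call is made.
function verifyCertificate(payloadText, signatureHex, spkiHex = ISSUER_SPKI_HEX) {
  const publicKey = nodeCrypto.createPublicKey({
    key: Buffer.from(spkiHex, "hex"),
    format: "der",
    type: "spki",
  });
  // Ed25519 has no separate digest step, so the algorithm argument is null.
  return nodeCrypto.verify(
    null,
    Buffer.from(payloadText, "utf8"),
    publicKey,
    Buffer.from(signatureHex, "hex")
  );
}
```

A verifier should also reject an expired certificate after the signature check. The expiry field's name is not given here, so that step is left out of the sketch.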

It is callable, free, and open

The standard and the verifier are free and open forever (a verifier that costs money is a phone-home token, not a fact). The instrument is also live as MCP tools, so other agents can gate selection on it programmatically.

Changelog

Version	Date	Changes
0.1	2026-06-18	Initial draft: 10 criteria bound to spec/crypto/measurement, ed25519 MCP-Ready certificate, honesty caps.
0.2	2026-06-18	STRICT tightening: A = perfect 10/10; C3 typed `inputSchema`; C4 quantified (every desc ≥12 chars, median ≥20, distinctness ≥0.6); C5 ≥50% valid boolean hints; C6 2xx <5000 ms; C7 read-only-only probing with x402 ⇒ UNVERIFIED.
0.3	2026-07-13	Calibration release: transient retry (one ~800 ms retry on network failure/timeout for `initialize` + `tools/list`, never on HTTP errors); negotiated protocol version on post-init calls; C9 measures decoded `tools/list` result payload bytes (threshold 40000 unchanged); C3 accepts a bare `{}` `inputSchema`; error-tolerant transport (unreachable ⇒ measured D, never an abort; SSRF policy refusals still refuse); C10 timeout isolated to C10; honesty-cap wording distinguishes declined verification from a measured ghost.

Honest caveats & conflict of interest (read before citing): This is a v0.3 draft, not a ratified standard, and it is not affiliated with the official Model Context Protocol project. The audit is a snapshot: a server can be temporarily down or behind auth we do not pass. An auth-walled endpoint correctly grades low, because an agent without a token also gets nothing. We forward the Mcp-Session-Id and send notifications/initialized so that stateful servers are not false negatives, but edge transports may still under-grade; tell us and we will fix the instrument in public. We have a clear interest here: SaSame helps builders make their servers AI-callable. That is exactly why the criteria, the verifier, and every certificate's evidence are open, so you can re-run the audit and prove us wrong. Validation runs at publication. Under v0.3, A requires a perfect 10/10, so A is rare and most live, real servers grade B.

If your server is registered but agents do not call it, that gap is what we diagnose and fix at SaSame. The check is free; the only paid object is the executed fix plus a signed before/after proof that agents now call your tools.