Identifying agents

Sill identifies AI-agent traffic at the edge by matching each request’s identification signals — primarily the User-Agent string and an optional Sill-defined client hint — against a seeded identity registry of known agents (Anthropic, OpenAI, Google, and others). Every visiting client is then placed into one of three classes (matched_agent, unknown_agent, human_likely) and the classification is persisted onto the signed audit record, so the dashboard reads a stable value rather than re-classifying at query time.

Identification is informational, not authoritative. A malicious actor can claim any User-Agent. Discovery surfaces who the client says it is; cryptographic authorization for actions (ed25519-signed mandates) is the Transactional path.

The identity registry

The registry is a small set of AgentIdentitySeed records. Each record carries a stable agent_id, an organization, a user_agent_pattern (a regex), and, optionally, a sec_ch_ua_sill_agent exact-match string. In the framing of the A2A spec ecosystem and Google’s AP2 mandate work, this is the identity layer; the intent and proof layers belong to the Transactional path.

The registry is loaded from two sources, merged with a strict precedence rule:

Bundled registry (floor). A read-only set compiled into the edge build. It is always present.
Managed KV namespace (additive only). Sill’s origin may add new agents by writing to a Workers KV namespace bound at the edge, but only under agent_ids that the bundled registry does not already use. KV can never override or remove a bundled entry: a KV record whose agent_id collides with a bundled one is ignored, and the collision is logged.
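The precedence rule can be sketched as a small merge function. This is a minimal illustration, not Sill's loader: the field names follow the AgentIdentitySeed description above, while mergeRegistry, MergeResult, and ignoredCollisions are hypothetical names introduced here.

```typescript
// Field names follow the prose; everything else in this block is an assumption.
interface AgentIdentitySeed {
  agent_id: string;
  organization: string;
  user_agent_pattern: string; // regex source
  sec_ch_ua_sill_agent?: string; // optional exact-match hint value
}

interface MergeResult {
  merged: AgentIdentitySeed[];
  ignoredCollisions: string[]; // agent_ids from KV that were already taken
}

// Bundled entries are inserted first and never replaced; KV entries are
// accepted only under agent_ids that are not already present.
function mergeRegistry(
  bundled: readonly AgentIdentitySeed[],
  managed: readonly AgentIdentitySeed[],
): MergeResult {
  const byId = new Map<string, AgentIdentitySeed>();
  for (const seed of bundled) {
    byId.set(seed.agent_id, seed);
  }

  const ignoredCollisions: string[] = [];
  for (const seed of managed) {
    if (byId.has(seed.agent_id)) {
      // Collision: keep the existing entry and report the id to the caller,
      // which would log it.
      ignoredCollisions.push(seed.agent_id);
      continue;
    }
    byId.set(seed.agent_id, seed);
  }

  return { merged: Array.from(byId.values()), ignoredCollisions };
}
```

Because the bundled set is written into the map first, an empty or failing KV source still yields the full floor.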

A boot-time canary asserts that the named-organization anchors (agent_anthropic_claude, agent_openai_gpt, agent_google_gemini) are present on every isolate. A missing anchor emits an error-level structured log but does not change serving behavior; the bundled floor, not the canary, is the safety mechanism.
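A canary of this kind could look roughly like the sketch below. The three anchor ids and the log field names (event, missing_agent_ids, loaded_count, kv_present) come from this page; checkCanary and the CanaryLog type are hypothetical, and the once-per-isolate guard is left to the caller.

```typescript
// Anchor ids as listed above.
const CANARY_AGENT_IDS = [
  "agent_anthropic_claude",
  "agent_openai_gpt",
  "agent_google_gemini",
] as const;

// Hypothetical shape for the structured log entry.
interface CanaryLog {
  level: "error";
  event: "registry_canary_missing";
  missing_agent_ids: string[];
  loaded_count: number;
  kv_present: boolean;
}

// Returns a log entry when an anchor is missing, or null when all are present.
// The function only reports; it never changes what the registry serves.
function checkCanary(
  loadedAgentIds: readonly string[],
  kvPresent: boolean,
): CanaryLog | null {
  const loaded = new Set<string>(loadedAgentIds);
  const missing: string[] = CANARY_AGENT_IDS.filter((id) => !loaded.has(id));
  if (missing.length === 0) {
    return null;
  }
  return {
    level: "error",
    event: "registry_canary_missing",
    missing_agent_ids: missing,
    loaded_count: loadedAgentIds.length,
    kv_present: kvPresent,
  };
}
```

The entry carries only ids and counts, which keeps seed bodies and patterns out of the logs.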

Seeded agents

The bundled registry covers the agent-mode traffic Sill expects to see on merchant sites today, plus the social link-preview crawlers that commonly appear in audit logs (so the merchant does not have to triage them as “unknown agent”):

Organization	Agent id	Matches
Anthropic	`agent_anthropic_claude`	`ClaudeBot` (training crawler)
Anthropic	`agent_anthropic_claude_user`	`Claude-User` (user-initiated fetch)
OpenAI	`agent_openai_gpt`	`GPTBot` (training crawler)
OpenAI	`agent_openai_chatgpt_user`	`ChatGPT-User` (browse tool)
OpenAI	`agent_openai_chatgpt`	`ChatGPT/...` (mobile in-app fetch)
OpenAI	`agent_openai_searchbot`	`OAI-SearchBot` (ChatGPT Search)
Google	`agent_google_gemini`	`Google-Extended`
Perplexity	`agent_perplexity`	`PerplexityBot`
Microsoft	`agent_microsoft_bingbot`	`bingbot` (also grounds Copilot)
Meta	`agent_meta_externalagent`	`meta-externalagent` (Llama training)
Meta	`agent_meta_externalfetcher`	`meta-externalfetcher` (Meta AI user fetch)
Apple	`agent_apple_extended`	`Applebot-Extended`
DuckDuckGo	`agent_duckduckgo_assistbot`	`DuckAssistBot`
Mistral	`agent_mistral_user`	`MistralAI-User`
X (Twitter)	`agent_x_twitterbot`	`Twitterbot` (link preview)
Meta	`agent_meta_facebook_external`	`facebookexternalhit` (link preview)
Meta	`agent_meta_facebot`	`Facebot` (link preview)
LinkedIn	`agent_linkedin_bot`	`LinkedInBot` (link preview)
Slack	`agent_slack_link_expander`	`Slackbot-LinkExpanding` (link unfurl)
Discord	`agent_discord_bot`	`Discordbot` (embed preview)

Note that ChatGPT Atlas (OpenAI’s agentic browser) sends a standard Chrome User-Agent with no public, cryptographically signed identity. OpenAI’s own documentation states Atlas “cannot be reliably detected with simple user agent filters.” Atlas visits will appear as human_likely and cannot be distinguished from a real Chrome user without a signed-mandate handshake (the Transactional path).

Match precedence

Given a visiting request, the matcher tries signals in this order:

Sec-CH-UA-Sill-Agent exact match. A Sill-defined client hint a registered agent (or the embed script’s own propagation) can send to assert identity. Strongest signal — exact-match against the registry wins immediately.
User-Agent regex match. Each registry entry’s user_agent_pattern is compiled to a RegExp per snapshot (cached for the lifetime of the loaded registry). The pattern uses a \b word-boundary anchor so Mozilla-wrapped UAs like Mozilla/5.0 ... (compatible; bingbot/2.0; ...) still match. If multiple patterns hit, the longest source pattern wins (most specific).
No match. The discovery record still ships with agent_id omitted and match_signal = 'none'.

flowchart TD
    A[Inbound request] --> B{sec_ch_ua_sill_agent<br/>exact match?}
    B -- yes --> M[matched agent]
    B -- no --> C{User-Agent regex<br/>match?}
    C -- one or more --> D[longest pattern wins] --> M
    C -- none --> N[no agent_id]
    M --> K[classify visitor and append to signed audit log]
    N --> K
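The precedence above can be sketched in a few lines. This is an illustrative reading of the rules, not the edge implementation: compileRegistry and matchAgent are hypothetical names, and the match_signal value used for a hint match ("sec_ch_ua_sill_agent") is an assumption, since this page only shows "user_agent_pattern" and "none".

```typescript
interface AgentIdentitySeed {
  agent_id: string;
  organization: string;
  user_agent_pattern: string; // regex source, expected to use \b anchors
  sec_ch_ua_sill_agent?: string;
}

interface IdentificationMatch {
  matched: boolean;
  agent_id?: string; // omitted when nothing matched
  match_signal: "sec_ch_ua_sill_agent" | "user_agent_pattern" | "none";
}

interface CompiledSeed {
  seed: AgentIdentitySeed;
  regex: RegExp;
}

// Compile each pattern once per loaded registry snapshot.
function compileRegistry(seeds: readonly AgentIdentitySeed[]): CompiledSeed[] {
  return seeds.map((seed) => ({ seed, regex: new RegExp(seed.user_agent_pattern) }));
}

function matchAgent(
  compiled: readonly CompiledSeed[],
  userAgent: string,
  hint?: string,
): IdentificationMatch {
  // 1. An exact client-hint match wins immediately.
  if (hint !== undefined) {
    const byHint = compiled.find((c) => c.seed.sec_ch_ua_sill_agent === hint);
    if (byHint !== undefined) {
      return { matched: true, agent_id: byHint.seed.agent_id, match_signal: "sec_ch_ua_sill_agent" };
    }
  }

  // 2. Among all User-Agent regex hits, the longest source pattern wins.
  let best: CompiledSeed | undefined;
  for (const c of compiled) {
    if (!c.regex.test(userAgent)) continue;
    if (best === undefined || c.seed.user_agent_pattern.length > best.seed.user_agent_pattern.length) {
      best = c;
    }
  }
  if (best !== undefined) {
    return { matched: true, agent_id: best.seed.agent_id, match_signal: "user_agent_pattern" };
  }

  // 3. No match: agent_id is left out entirely.
  return { matched: false, match_signal: "none" };
}
```

Note that a hint that matches nothing does not stop the User-Agent pass in this sketch; how such a claim affects classification is covered in the next section.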

Visitor classification

A matched/not-matched binary is too coarse for triage: a human browsing the site and a never-before-seen AI agent both register as not matched, yet a merchant treats them differently. The edge therefore runs a second pass, classifyVisitor, that assigns each record one of three visitor_class values:

matched_agent — the registry matched. The agent is identified; the merchant already knows who this is.
unknown_agent — bot-shaped but unrecognized. Worth investigating. This bucket also captures any client that sent a Sec-CH-UA-Sill-Agent claim Sill could not match (an attempted-but-unverified self-identification is surfaced, never laundered as human).
human_likely — Mozilla/-prefixed User-Agent carrying a known browser engine token (Chrome, Safari, Firefox, Edge, Opera, Version) and no bot keyword. Background noise; the merchant can ignore.

The heuristic is conservative: anything ambiguous stays unknown_agent. False positives toward “unknown agent” are recoverable; false negatives toward “human” would mask a real bot.
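A conservative classifier along these lines might look like the following sketch. The browser tokens are the ones listed above; the bot keyword list and the input shape (matched, userAgent, hintClaimed) are assumptions made for illustration, since this page does not specify them.

```typescript
type VisitorClass = "matched_agent" | "unknown_agent" | "human_likely";

// Browser tokens named in the description of human_likely.
const BROWSER_TOKENS = /\b(?:Chrome|Safari|Firefox|Edge|Opera|Version)\b/;

// Hypothetical keyword list; the real heuristic's keywords are not documented here.
const BOT_KEYWORDS = /bot|crawl|spider|fetch|agent/i;

interface ClassifyInput {
  matched: boolean; // result of the registry match
  userAgent: string;
  hintClaimed: boolean; // a Sec-CH-UA-Sill-Agent value was sent
}

function classifyVisitor(input: ClassifyInput): VisitorClass {
  if (input.matched) {
    return "matched_agent";
  }
  // An unverified self-identification is never treated as human.
  if (input.hintClaimed) {
    return "unknown_agent";
  }
  const ua = input.userAgent;
  const looksHuman =
    ua.startsWith("Mozilla/") && BROWSER_TOKENS.test(ua) && !BOT_KEYWORDS.test(ua);
  // Anything ambiguous falls through to unknown_agent.
  return looksHuman ? "human_likely" : "unknown_agent";
}
```

All three conditions must hold for human_likely, so a missing prefix, a missing browser token, or a single bot keyword is enough to keep a client in the triage bucket.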

visitor_class is persisted onto the audit draft at ingest so dashboard queries read a stable value. If the heuristic changes later, only new records pick up the new classification — there is no retroactive reclassification of past records.

What gets recorded

For each Discovery beacon, the edge writes a DiscoveryDraftRecord to the replication queue. The identity-relevant fields on that record are:

{
  "schema_version": 1,
  "draft_id": "drf_01J9...",
  "site_id": "01J9...",
  "evaluated_at": "2026-06-22T18:04:11.219Z",
  "observed_at": "2026-06-22T18:04:11.190Z",
  "identification_match": {
    "matched": true,
    "agent_id": "agent_openai_chatgpt_user",
    "match_signal": "user_agent_pattern",
    "visitor_class": "matched_agent"
  },
  "identification_input": {
    "surface": "embed",
    "user_agent": "Mozilla/5.0 ... ChatGPT-User/webprod-20260601",
    "client_hints": { "sec_ch_ua_sill_agent": "\"ChatGPT-User\"" },
    "referrer_origin": "https://example-merchant.com"
  },
  "edge_meta": {
    "cf_ray": "8a1f9c2d4e0e6c3a-IAD",
    "cf_colo": "IAD",
    "worker_version": "b80915eb"
  }
}
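For readers consuming these records in code, the fields shown above could be described with a type like the one below. This is inferred from the example only: which fields are optional (other than agent_id, which is omitted when nothing matches) is an assumption, and the sample values in exampleNoMatch are made up for illustration.

```typescript
type VisitorClass = "matched_agent" | "unknown_agent" | "human_likely";

// Inferred from the example record; optionality is an assumption.
interface DiscoveryDraftRecord {
  schema_version: number;
  draft_id: string;
  site_id: string;
  evaluated_at: string; // ISO 8601 timestamp
  observed_at: string; // ISO 8601 timestamp
  identification_match: {
    matched: boolean;
    agent_id?: string; // omitted when no registry entry matched
    match_signal: string; // "user_agent_pattern" and "none" appear on this page
    visitor_class: VisitorClass;
  };
  identification_input: {
    surface: string;
    user_agent: string;
    client_hints?: { sec_ch_ua_sill_agent?: string };
    referrer_origin?: string; // scheme + host only
  };
  edge_meta: {
    cf_ray: string;
    cf_colo: string;
    worker_version: string;
  };
}

// Hypothetical no-match record: agent_id is absent and match_signal is "none".
const exampleNoMatch: DiscoveryDraftRecord = {
  schema_version: 1,
  draft_id: "drf_example",
  site_id: "site_example",
  evaluated_at: "2026-06-22T18:04:11.219Z",
  observed_at: "2026-06-22T18:04:11.190Z",
  identification_match: { matched: false, match_signal: "none", visitor_class: "unknown_agent" },
  identification_input: { surface: "embed", user_agent: "curl/8.4.0" },
  edge_meta: { cf_ray: "example-IAD", cf_colo: "IAD", worker_version: "example" },
};
```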

The origin’s consumer drains the queue, validates the draft, and persists it into the signed, Merkle-chained audit envelope. The record is then visible in the dashboard’s audit log and exportable as part of the audit bundle. See Audit log and export.

What is not recorded

The Discovery beacon is deliberately narrow. Sill does not record raw IPs, full URLs, or query strings on the draft record:

The Cloudflare colo code (cf_colo, e.g. IAD) is the only location signal kept, sourced from request.cf.colo. The cf-ipcountry header is explicitly avoided.
page_context.page_path_hash is omitted in Phase 1; only the referrer_origin (scheme + host) is kept.
An agent_card_claim.jws_compact (if presented) is recorded verbatim into the draft but not verified at the edge in Discovery mode. Signature verification of signed mandates is the Transactional path.
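As an illustration of the referrer rule, a reduction from a full referrer URL to an origin could be written as below. toReferrerOrigin is a hypothetical helper, and restricting the result to http and https is an assumption; the sketch relies only on the standard URL class.

```typescript
// Reduces a referrer to its origin, dropping path, query string, and fragment.
// Returns undefined when the value is missing, unparseable, or not http(s).
function toReferrerOrigin(referrer: string | null | undefined): string | undefined {
  if (!referrer) {
    return undefined;
  }
  try {
    const url = new URL(referrer);
    if (url.protocol !== "http:" && url.protocol !== "https:") {
      return undefined;
    }
    // url.origin is scheme + host (plus the port when it is non-default).
    return url.origin;
  } catch {
    // new URL throws on input it cannot parse.
    return undefined;
  }
}
```

For example, a referrer of https://example-merchant.com/cart?item=42 would be stored as https://example-merchant.com.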

What appears in the dashboard

Each matched_agent record renders with the agent’s organization label. unknown_agent renders as a generic bot glyph (a triage prompt). human_likely renders as a generic person glyph and is filtered out of agent-only views.

Figure: Audit-log rows, showing how the three visitor classes render side by side.

Figure: Reporting view, a per-agent breakdown over the selected window, keyed on visitor_class.

Operational behavior

A few details that matter when reading the live system:

In-isolate cache. The merged registry is cached for 5 minutes per Workers isolate. A KV-managed addition is visible at the edge within 5 minutes (plus deploy and KV-propagation lag).
Safe degradation. If the KV binding is absent, or the KV list throws, the loader returns the bundled floor. Identification continues against the named-organization anchors.
Per-isolate canary. A missing canary id emits event: registry_canary_missing at error level, once per isolate, with missing_agent_ids, loaded_count, and kv_present. The log never includes seed bodies, patterns, or keys.
Validator drops. A KV record is dropped (and the floor absorbs the loss) if it is malformed, if its user_agent_pattern exceeds 256 characters, or if the pattern matches the ReDoS-prone nested-quantifier shape ((...+)+, (...*)*, etc.). Bundled patterns are not subject to this check because they are in-tree and reviewed.
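The validator rules can be sketched as follows. The 256-character limit and the nested-quantifier shapes come from the list above; validateManagedSeed is a hypothetical name, the nested-quantifier regex is a rough approximation rather than a complete ReDoS detector, and the extra check that the pattern compiles is an assumption about what "malformed" includes.

```typescript
interface AgentIdentitySeed {
  agent_id: string;
  organization: string;
  user_agent_pattern: string;
  sec_ch_ua_sill_agent?: string;
}

const MAX_PATTERN_LENGTH = 256;

// Rough detector for a group ending in + or * that is itself quantified,
// e.g. (a+)+ or (.*)*. It does not cover every ReDoS-prone shape.
const NESTED_QUANTIFIER = /\([^()]*[+*]\)[+*]/;

// Returns a typed seed, or null when the KV record should be dropped.
function validateManagedSeed(raw: unknown): AgentIdentitySeed | null {
  if (typeof raw !== "object" || raw === null) {
    return null;
  }
  const { agent_id, organization, user_agent_pattern, sec_ch_ua_sill_agent } =
    raw as Record<string, unknown>;

  if (
    typeof agent_id !== "string" ||
    typeof organization !== "string" ||
    typeof user_agent_pattern !== "string"
  ) {
    return null;
  }
  if (sec_ch_ua_sill_agent !== undefined && typeof sec_ch_ua_sill_agent !== "string") {
    return null;
  }
  if (user_agent_pattern.length > MAX_PATTERN_LENGTH) {
    return null;
  }
  if (NESTED_QUANTIFIER.test(user_agent_pattern)) {
    return null;
  }
  try {
    new RegExp(user_agent_pattern); // assumption: uncompilable patterns count as malformed
  } catch {
    return null;
  }
  return { agent_id, organization, user_agent_pattern, sec_ch_ua_sill_agent };
}
```

Dropping a record here only removes that one KV addition; the bundled floor is unaffected.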

Frequently asked

Can a hostile party impersonate a named agent by spoofing the User-Agent? Yes, for Discovery. Identification at the edge is informational: anyone can send any User-Agent. In Discovery mode, Sill never elevates an identification signal to authorization. For actions that move money, the signed mandate path requires an ed25519 signature from a registered key, verified against the registry’s public_keys.

What if a new agent appears that is not in the registry? It will be classified as unknown_agent and recorded. Operators can add it to the KV-managed registry, after which the edge picks it up within the 5-minute cache window. Bundled entries are read-only; updates to a named-organization anchor ship via an edge redeploy.

Why does Bingbot appear under “AI agents”? Bingbot is the canonical Bing crawler and the retrieval surface that grounds Microsoft Copilot’s web answers. Treating it as an identifiable agent makes Copilot-driven traffic legible to merchants.

Why do link-preview crawlers (Twitterbot, Slack, Discord, Facebook) appear in the registry? They are not AI agents, but they commonly appear in audit logs when a user shares a merchant URL. Identifying them as link-preview crawlers means the merchant does not need to triage them as unknown agents.

Are the agent identifications signed? The identification of a visiting agent is not itself signed (the agent provides a User-Agent; Sill matches). The audit record that captures the identification is part of the append-only, ed25519-signed, Merkle-chained audit envelope. The per-site agent card Sill publishes on behalf of the merchant is also signed and independently verifiable.