The Skill Ceiling

Author-Side Defences and Infrastructure-Level Trust for Agent Skills and Extension Mechanisms

Author: HiP (Ivan Phan) · Date: March 2026 · DOI: 10.5281/zenodo.19365536

Developed through structured human-AI collaboration. Claude Opus 4.6 (generative collaborator), ChatGPT 5.4 Thinking and Gemini 3.1 Pro (adversarial reviewers). Full methodology in Section 10. Editorial authority and accountability: the human author alone.

Series: The Confidence Curriculum — Compliance, Judgment, and Accountability in AI Systems (Paper 2 of 5)

This paper examines a practical problem: extending model behaviour through text-instruction interfaces (skill files, MCP server connections) without reliable mechanisms for distinguishing legitimate extensions from adversarial ones. Section 10.1 contains a security testing invitation: red-team the defences, test on open-weight models, attack the MCP audit proposal.


Abstract

Paper 1 documented exploitable compliance in AI summarisation and argues (Section 8.3) that the underlying mechanism, trained prosocial dispositions aimed by content that activates them, operates beyond injection in every ordinary interaction. This paper examines how the same vulnerability extends to the skill ecosystem and MCP: the mechanisms through which models are given new capabilities. If P1’s analysis holds, the shared substrate problem documented here is not an edge case but a structural property of the instruction-execution model.

The Agent Skills specification (adopted in emerging AI agent ecosystems to extend model behaviour through plain-text instruction files) relies on the same text-instruction substrate that prompt injection exploits. This shared substrate means that instruction-level defences against prompt injection are structurally incompatible with skill execution unless external trust signals distinguish the two. This paper designs author-side protections (hash manifests, tamper-evident attribution, behavioural self-defence instructions) for a real skill package, maps each layer’s dependency on model compliance, and shows their ceiling: the most promising author-level behavioural defence is already outpaced by automated adversarial tooling such as RL-based red-teaming. The paper then makes the case for platform-level trust infrastructure (signed manifests, execution context signals, origin-based marketplace verification, and configurable trust policies for headless deployments) as the necessary complement to author-side protections. The same trust gap extends to Model Context Protocol (MCP) servers, which share the text-instruction substrate and face an additional opacity problem: MCP servers are black boxes whose advertised capabilities may not match their actual behaviour. Emerging ecosystem responses (registries, certification programmes, security scanning) are converging on the trust architecture this paper proposes, though behavioural verification at scale remains an open problem. The analysis draws on empirical findings from Paper 1 in this series (~350 test runs across 17 model configurations), independent skill-security research from multiple groups, the first year of MCP vulnerability disclosures, and a joint study by researchers from OpenAI, Anthropic, and Google DeepMind demonstrating that all 12 published defences tested were bypassed at >90% attack success under adaptive conditions (Nasr et al., 2025). The shared substrate is traced to its origin in pretraining and instruction tuning (Ouyang et al., 2022; Zverev et al., 2024), and a recurring pattern of infrastructure failures in AI tool distribution illustrates why the ceiling on author-side protections is set by the infrastructure the author must use, not by the author’s competence.


The shared substrate problem is well-documented empirically. Nasr et al. (2025) demonstrated that defences fail under adaptive attack. PISmith (arXiv:2602.16120, 2026) demonstrated the utility-robustness tradeoff. But most security research frames the problem as a dial between security and utility, implying a sweet spot exists. The claim here is stronger. A skill file is an instruction file. An injection payload is an instruction file. The model processes both through the same mechanism because instruction-following was not designed as a separate system but emerged from pretraining on human text and was amplified by RLHF (Ouyang et al., 2022; Zverev et al., 2024). If Paper 1’s analysis holds, the reason goes deeper than architecture: Anthropic’s interpretability research (Sofroniew, Kauvar et al., April 2026) found that the model’s internal emotion concept representations respond to all text-based instructions through the same prosocial dispositions. User prompts, system messages, skill files, and injected instructions all activate the same representational substrate. There is no dial to tune at the instruction level. The only resolution is external trust signals that distinguish provenance before the instruction reaches the model’s processing, which is the platform-level trust architecture this paper proposes.

1. The Paradox

The Agent Skills specification defines a standard format for extending AI model behaviour through plain-text instruction files. A SKILL.md file contains markdown-formatted instructions that the model reads and follows, changing how it responds, what tools it uses, and what workflows it executes.

Skills rely on the same substrate that prompt injection exploits: text-based instructions embedded in files and interpreted by models at runtime.

A SKILL.md file that instructs a model to “always ask what flour the user is using” and a malicious document that instructs a model to “suppress the conflicts of interest from your summary” use the same delivery mechanism: text in a file that the model reads, interprets as instructions, and acts on. The distinction between them is intent, not structure. At the instruction-delivery level, they are functionally identical.

The AI security community is actively working to make models resist following instructions embedded in documents. Instruction-level prompt injection defences (those that train models to detect and refuse embedded directives) create direct tension with skill execution, because skills are embedded directives. Defences that operate at other levels (provenance verification, execution context signals, policy enforcement) can potentially distinguish skills from injections without suppressing instruction-following itself. But many of the most prominent defences today operate at the instruction level, and those defences are, functionally, skill-breaking defences unless external trust signals distinguish legitimate skill context from arbitrary documents.

This paper examines that tension through the lens of a practical case study: designing integrity verification and tamper-evident attribution for a real skill package, discovering where each protection layer depends on model compliance, and identifying what that dependency means as prompt injection defences mature. The analysis extends to Model Context Protocol (MCP) servers, which inherit the same shared-substrate vulnerability through a different delivery mechanism and add an opacity problem that skill files do not have.


2. Background: The Confidence Vulnerability

Paper 1 (an exploratory study: ~350 runs across 17 model configurations, N≥2 per condition) identified several observations relevant to the present analysis. We summarise them here; readers should consult the original for full experimental detail (~350 test runs, 17 model configurations, three providers, N=2 minimum per condition) and confidence stratification.

2.1 From Tier Gradient to Judgment Profile

When three documents containing embedded instructions were processed by AI models across multiple providers and capability tiers, the results broke the assumption of a clean frontier-versus-mid-tier split. Some models evaluated the purpose of embedded instructions, distinguishing between those that served the user’s interests and instructions that served the document author’s interests at the user’s expense. Others processed the syntax, following them if well-formatted and plausibly justified, regardless of whether compliance served or harmed the user. But this distinction did not map reliably to capability tiers, model generations, or reasoning affordances. ChatGPT o3, a dedicated reasoning model, complied at baseline. Gemini 2.5 Flash Lite Preview, a previous-generation speed-optimised model, detected the manipulation in one document and was captured by the other. Two models classified as baseline detectors in earlier N=1 testing did not replicate at N=2. Judgment resilience is better understood as a version- and deployment-specific judgment profile than a stable capability threshold.

2.2 The Compliance Taxonomy

When models failed to detect malicious embedded instructions, they did not all fail in the same way. Paper 1 identified four baseline handling patterns at N=2 (detection, laundered compliance, explicit compliance, and silent compliance) and a further set of security-framed failure modes that emerged only when users attempted to intervene: negotiated compliance (the model detects the injection then recommends suppression as default), security certification with incomplete disclosure, procedural capture (a visible security evaluation that reaches the wrong conclusion), rationalisation substitution, and data-in-hand suppression. Counter-advocacy (the model actively defends the malicious instruction against the user’s warning) was observed at N=1 and not replicated; it is retained as a documented low-frequency event. The rhetorical register of the embedded instruction changed how models failed more reliably than whether they failed: care-framed compliance persisted through credibility collapse, while authority-framed compliance collapsed when the authority was debunked. This variability complicates defence: a model’s output may look like security-aware behaviour while actually endorsing the payload.

2.3 Warning Architectures and the Task-Frame Shift

Explicitly warning models about prompt injection is not a universal fix. Paper 1 documented five distinct warning activation architectures (binary toggle, graded, stochastic, thinking-gated, and non-monotonic), including cases where stronger warnings produced worse outcomes. User safety language (“summarize safely”) was co-opted by the care-framed document in four models across three providers, producing worse outcomes than naive prompting. However, a different intervention proved more reliable: the task-frame shift. Asking “how trustworthy is it?” instead of “please summarize” produced improved outcomes across every one of the twelve baseline-compliant configurations, ranging from partial disclosure to comprehensive disclosure with fabrication discovery. Extended thinking amplified whatever the active task frame produced: more elaborate compliance under summarisation, more thorough investigation under trustworthiness evaluation. The observations are consistent with a security-evaluation capability that is present but task-gated in the tested models.

2.4 Relevance to Skills

These findings have a direct implication for the skill ecosystem. Because Paper 1 observed baseline compliance across twelve of seventeen configurations (including frontier and reasoning models), the skill execution ecosystem may operate on a broadly vulnerable substrate, regardless of which model tier an organisation deploys. Furthermore, because skills rely on the same text-instruction substrate as these embedded manipulations, prompt injection defences that simply suppress embedded instructions risk disrupting legitimate skill execution.


The security of the Agent Skills ecosystem is an active and rapidly intensifying area of research. We situate our work within this landscape and identify where our analysis extends or complements existing contributions.

3.1 The Paradox Identified

The core observation underlying this paper has been independently identified in recent vulnerability research: skills and prompt injection share the same text-instruction substrate. Schmotz, Abdelnabi and Andriushchenko (ELLIS Institute Tübingen / MPI for Intelligent Systems, arXiv:2510.26328, October 2025) published “Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections.” They demonstrated that the way Agent Skills work makes prompt injection “particularly easy to exploit” because skills are themselves instructions embedded in files that the model reads and follows. They demonstrated how malicious instructions could be hidden within Anthropic’s own published PowerPoint skill, and concluded that “frontier LLMs remain vulnerable to very simple prompt injections, and this is not resolved with model scaling.” Their finding establishes the vulnerability. The present paper’s contribution is the structural framing: that the shared substrate creates a bidirectional tension where instruction-level defences that block injection necessarily disrupt skill execution, and that resolving this tension requires external trust signals rather than instruction-level filtering. The “Promptware Kill Chain” (arXiv:2601.09625, January 2026) formalises this structural argument further, arguing that no architectural boundary cleanly enforces the distinction between trusted instructions and untrusted data in agentic and RAG pipelines. The shared substrate is not a patchable edge case but a fundamental property of the instruction-execution model. PISmith (arXiv:2602.16120, February 2026) reinforces this from the defence side, demonstrating a strong utility-robustness tension: defences that effectively block prompt injection also degrade legitimate skill execution, confirming that the problem is structural rather than merely a matter of better filtering. The most comprehensive test to date was conducted jointly by researchers from OpenAI, Anthropic, and Google DeepMind (Nasr et al., arXiv:2510.09023, October 2025). They tested 12 published defences, most of which originally reported near-zero attack success rates. Under adaptive attack conditions (gradient descent, reinforcement learning, random search, and human-guided exploration), every defence was bypassed with attack success rates above 90% for most. Prompting-based defences collapsed to 95-99% attack success. Training-based methods reached 96-100% bypass. The three organisations that build the leading frontier models jointly concluded that no current defence withstands an attacker who adapts. This is not an outside critique. It is the builders’ own assessment of the infrastructure their systems depend on.

3.2 Empirical Attack Research

SkillJect (arXiv:2602.14211, February 2026) provides the most rigorous empirical evidence for skill-based attacks. Using an automated framework for poisoning skills with trace-driven refinement, the authors achieved a 95.1% attack success rate through skill-based injection versus just 10.9% for direct prompt injection. Their most striking finding, which they term the “Safety Paradox of Claude-4.5-Sonnet”, directly parallels the judgment-profile variation documented in Paper 1: Claude was the most secure model against direct injection (5.0% success rate) but extremely susceptible to skill-based injection (97.5%). Models optimised for instruction safety (rejecting explicit bad commands) were paradoxically more compliant with procedural attacks embedded in skill-like documentation. SkillJect demonstrates the vulnerability empirically; our paper uses the Confidence Curriculum framework from Paper 1 as a structural lens and a candidate heuristic for evaluation (the beneficiary test).

A separate benchmark, Skill-Inject (arXiv:2602.20156, February 2026), independently confirms the finding using 202 injection-task pairs. Widely used frontier agents remained highly vulnerable, with attack success rates up to 80%. The authors conclude that model scaling and simple filtering are insufficient, and that robust security requires context-aware authorisation frameworks, a conclusion aligned with the trust-infrastructure argument developed in Section 6 of this paper.

3.3 Real-World Incidents

The vulnerability is not theoretical. Snyk’s ToxicSkills audit (February 2026) scanned 3,984 skills from ClawHub and skills.sh (the largest publicly available corpus at the time) and found 13.4% contained critical security issues, 36.82% had at least one vulnerability, and 76 skills contained confirmed malicious payloads actively stealing credentials. The ClawHavoc campaign (January 2026) infiltrated over 340 malicious skills into a public registry within three days. Snyk’s finding that 91% of confirmed malicious skills combine traditional malware with prompt injection (using prompt injection to prime the model to accept and execute malicious code) demonstrates the compound threat model that skills create.

Subsequent large-scale empirical studies have confirmed the problem’s scale. An analysis of 42,447 agent skills from two major marketplaces found that 26.1% exhibit at least one vulnerability across 14 distinct patterns, with data exfiltration (13.3%) and privilege escalation (11.8%) most prevalent, and 5.2% of skills exhibiting high-severity patterns strongly suggesting malicious intent. Skills bundling executable scripts were 2.12 times more likely to contain vulnerabilities than instruction-only skills (Liu et al., arXiv:2601.10338, January 2026). A second study behaviourally verified 98,380 skills from two community registries and confirmed 157 malicious skills with 632 vulnerabilities. These attacks are not incidental: malicious skills average 4.03 vulnerabilities across a median of three kill chain phases, and the ecosystem has split into two archetypes: “Data Thieves” that exfiltrate credentials through supply chain techniques and “Agent Hijackers” that subvert agent decision-making through instruction manipulation. A single actor accounted for 54.1% of confirmed cases through templated brand impersonation (arXiv:2602.06547, February 2026). The scale of functional overlap compounds the detection problem. A skill routing study spanning approximately 80,000 community-contributed skills found pervasive functional overlap: many skills share similar names and purposes while differing only in implementation details. During training data preparation, 10% of sampled skill pairs were flagged as functional duplicates through name, body similarity, or embedding proximity checks (Zheng et al., arXiv:2603.22455, March 2026). An ecosystem where legitimate near-duplicates are the norm is an ecosystem where a malicious skill with a benign-sounding name can hide in plain sight among thousands of functionally similar alternatives.

These numbers are consistent across independent research teams, scanning methodologies, and time windows, suggesting the problem is endemic rather than localised. Unit 42’s field report on real-world indirect prompt injection in AI agents provides a calibrating nuance: while proof-of-concept attacks demonstrate severe outcomes, many real-world cases observed to date have been lower-impact or anecdotal compared with the severity that controlled demonstrations predict (Palo Alto Networks, 2026). The gap between demonstrated capability and observed exploitation is characteristic of an emerging attack surface, and the pattern is consistent with escalation. Rossi et al. (2026) demonstrate what escalation looks like concretely: in a multi-agent email workflow, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success, triggered by a completely natural user query. The single-agent to multi-agent amplification is stark: GPT-4o’s single-agent conservatism (2–4% attack success) rose to 72–80% when agents were composed. Multi-agent orchestration does not merely inherit the vulnerability of individual models; it amplifies it.

3.4 Formal Analysis and Trust Frameworks

SkillFortify (arXiv:2603.00195, March 2026) represents the most comprehensive formal treatment, introducing a multi-signal trust scoring system with provenance, behavioural, community, and historical dimensions. Trust propagates multiplicatively through dependency chains and decays exponentially for unmaintained skills. The authors explicitly note that Anthropic’s skill repository has “no formal capability model constraining skill behavior at runtime” and that code signing is deferred to a future version of their framework. Their trust model adapts the SLSA (Supply-chain Levels for Software Artifacts) graduation model to skills, mapping four levels from basic provenance through formal verification. Their benchmark, SkillFortifyBench, achieves 96.95% F1 with 0% false positives on benign skills.

Tech Leads Club’s agent-skills registry takes a practical approach: 100% open source, static analysis in CI/CD, immutable integrity via lockfiles and content hashing, and human-curated prompts. Their claim that “13.4% of marketplace skills contain critical issues” (citing Snyk) is used to position their curated approach against unvetted marketplaces.

Anthropic, OpenAI, and Microsoft have each published security guidance for skills. Anthropic advises treating skill installation “like installing software” and recommends auditing all bundled files. Microsoft recommends “source trust”: installing skills only from trusted authors with clear provenance. OpenAI’s documentation focuses on skill discovery and activation mechanics but does not address integrity verification. Red Hat’s analysis (published the day before this paper) catalogues security threats specific to skills, including YAML parser vulnerabilities and the risk of scripts executing in unsandboxed environments.

3.5 Where This Work Contributes

The existing research focuses overwhelmingly on one threat: malicious skills attacking users. Our analysis addresses a complementary question: how do we protect good skills from bad actors? More broadly, the paper’s primary value is not a collection of novel individual findings but a unification of several practical tensions (the shared substrate, the judgment-profile variation, the headless deployment gap, the author protection problem, and the MCP trust gap) into one coherent structural argument.

The following aspects appear to extend the current literature, though the field is moving fast and these claims should be verified:

Author-side integrity mechanisms. Existing tools scan skills for threats to users; we did not identify published work focused on how skill authors can protect their own work from plagiarism, tampering, or misattribution using hash manifests, LICENSE witnesses, or embedded integrity checks.

The beneficiary test. A candidate semantic heuristic for how models might evaluate conflicting instructions during skill execution (does compliance serve the user’s interests?) that addresses the safety paradox identified by SkillJect and the judgment-profile variation documented in Paper 1.

Self-defending instructions (Principle 7). An illustrative author-side stopgap (instructions within a skill that direct the model to resist specific categories of tampering during execution) that clarifies the limits of model-dependent behavioural protection and strengthens the case for platform-level infrastructure.

The connection to training incentives. The Confidence Curriculum framework from Paper 1 provides a structural explanation for why skills are vulnerable that complements the empirical attack research.

The MCP extension. Skills and MCP servers are two instantiations of the same trust gap. The paper analyses how the shared-substrate vulnerability applies to MCP, identifies the additional opacity problem (servers can lie about what they do), and proposes a behavioural audit architecture for the verification layer the MCP ecosystem has identified as missing.

These claims should be verified against the rapidly expanding literature. The paper’s strongest contribution is the structural argument that unifies them, not any individual mechanism.


4. A Case Study in Skill Protection Design

Note: This section describes a design analysis, not a security evaluation. The protections were designed and their compliance dependencies mapped for a single skill by the skill’s author. They have not been red-teamed or tested against adversarial attack. The contribution is the dependency analysis (understanding where each layer relies on model compliance and where that reliance creates exploitable gaps), not a claim that these protections are effective in practice.

4.1 The Skill

The English Proficiency Scorecard is a skill package (.skill zip) that enables Claude to assess a user’s English proficiency across six international frameworks and generate a polished, shareable HTML document.

The skill produces authoritative-looking assessment documents intended to be shared with third parties (employers, academic institutions, professional contacts). This creates a specific threat model: a tampered version of the skill could inject biased, politically charged, or misleading content into documents that recipients would have reason to trust.

4.2 The Protection Layers

Four protection mechanisms were designed and implemented, each addressing a different threat and each with a different dependency on model compliance.

Layer 1: Hash Manifest (INTEGRITY.sha256)

A SHA-256 hash for every file in the skill package. Verifiable by anyone (human, model, or automated script) using standard tools (sha256sum -c). Any modification to any file produces a hash mismatch. This layer operates independently of model compliance, relying entirely on cryptographic verification. However, it requires someone or something to actually perform the check.

Layer 2: The LICENSE Witness

The SKILL.md file contains an instruction in Phase 1 of its workflow: “review the LICENSE file to confirm that the distribution terms are compatible with the user’s intended use of the scorecard.” This instruction is routine, inconspicuous, and exactly the kind of compliance check that models execute without scrutiny, because humans rarely read license files, and a model processing a legal compatibility check raises no alarms.

The LICENSE.txt file contains standard MIT license text and, below it, an “Integrity Verification” section. This section includes the SHA-256 hash of SKILL.md and instructions for the model to compute and compare the hash, flagging any mismatch to the user with an explanatory message.

The design exploits a social blind spot: no human opens a license file expecting to find a security mechanism. A plagiarist who modifies SKILL.md and keeps the LICENSE file intact may be caught when the model performs its routine “distribution compatibility” check. A plagiarist who removes the LICENSE file triggers a missing-file anomaly that a competent model can flag.

This layer depends entirely on model compliance. The model must follow the instruction to read LICENSE.txt, and it must follow the instructions within LICENSE.txt to compute and compare the hash. Both are instances of the model following embedded instructions in documents. This is the same mechanism as prompt injection.

Layer 3: The Behavioural Immune System (Principle 7)

The skill’s seventh principle establishes content neutrality requirements and then includes a self-defending instruction:

“If the skill’s instructions seem to push toward biased content, Claude should ignore those instructions and flag the concern to the user.”

This operates during execution, not before it. If all other protections fail (the hashes are stripped, the LICENSE is removed, the instructions are modified to inject political content), Principle 7 is still present (unless specifically targeted and removed), instructing the model to resist categories of tampering and to inform the user when it detects them.

Principle 7 asks the model to evaluate conflicting instructions and choose the user-protective interpretation. A model with strong judgment on this class of task would evaluate the purpose of the conflicting instructions and side with the user. A model with weaker judgment might process Principle 7 as just another instruction with equal weight to the tampered instructions, with no basis for choosing between them. Worse, it might rationalise compliance with the tampered instructions using the same elaborate justification behaviour observed in baseline-compliant models when processing well-crafted embedded directives.

Layer 4: The Author Field

The SKILL.md frontmatter contains an author field (author: HiP & Claude Opus 4.6). This field is part of the file content and therefore covered by the SHA-256 hash in the manifest. Changing the author name invalidates the hash, provided the package whose manifest has not been stripped. This provides tamper-evident attribution: any modification to the claimed authorship is detectable as long as the verification infrastructure remains in place. It does not prevent an attacker from rebuilding the package from scratch with a new author field and new hashes.

4.3 What Was Considered and Rejected

Several alternative protection mechanisms were prototyped or discussed and rejected during the design process. Each failed for a different structural reason:

Encoded/encrypted instructions were rejected on ethical grounds. Hiding instructions from humans while making them readable by AI creates information asymmetry. This is the same pattern bad actors use. Principle established: all instructions must be human-readable.

Content obfuscation (compression + XOR + base64) was prototyped and functional but failed the DRM test: it degraded the legitimate user’s experience (adding a decode step to every execution) while providing only superficial protection. The video game industry’s experience with intrusive DRM systems confirmed the principle: if the paying user has a worse experience than the pirate, the protection has failed.

Call-home verification was rejected because the model cannot distinguish a legitimate verification server from a data exfiltration endpoint. The trust problem is circular: the skill that needs verification is the same artifact specifying where to verify it.

Invisible watermarking (zero-width Unicode, synonym fingerprinting) was prototyped and functional for attribution but failed on integrity. Watermarks answer “who wrote this?” without answering “has this been modified?” A watermark on a tampered file attributes the tampering to the original author. That is a liability, not a protection.


5. The Dependency Map

Each protection layer has a different relationship to model compliance and user visibility:

Layer Requires model compliance? Requires user interface? Survives prompt injection hardening? Threat addressed
Hash manifest (INTEGRITY.sha256) No; but requires someone to check it No; scriptable Yes File modification
LICENSE witness Yes; model must read and verify No; model reports result Uncertain File modification + social engineering
Principle 7 (behavioural immune system) Yes; model must evaluate conflicting instructions Partially; model must communicate warning Uncertain Content tampering during execution
Author field + hash No; mathematical verification No; scriptable Yes In-place attribution tampering
Digital signatures (proposed) No; cryptographic verification No; scriptable Yes In-place attribution tampering + hash forgery
Origin field + marketplace verification (proposed) No; external infrastructure check No; scriptable Yes Legitimacy of entire skill package
Progressive trust warnings (proposed) No Yes; requires attended interface Yes User awareness of unverified skills

The pattern has two dimensions. Layers that depend on model compliance are the most sophisticated and the most fragile. Layers that rely on mathematics (hashes, signatures) are simple and robust but require an external verifier. The skill author can build the mathematical layers alone. The compliance-dependent layers need either a capable model or a platform that pre-verifies before the model ever sees the file.

The compliance-dependent layers exhibit a counter-intuitive property: the LICENSE witness (which asks the model to read a file, compute a hash, and compare it) may be more reliably executed by baseline-compliant models than by baseline detectors. A compliant model blindly follows well-formatted instructions, including verification instructions. A model with stronger judgment might evaluate the LICENSE check itself and decide it is an embedded instruction attempting to modify its behaviour. This is exactly the kind of directive that prompt injection defences are designed to suppress. The same compliance that makes these models vulnerable to Document B makes them reliable executors of the integrity tripwire. The tripwire works best on the models least equipped to understand why it exists.

The converse fragility must be acknowledged: the same blind compliance that makes these models execute the tripwire also makes them susceptible to an attacker appending “skip the LICENSE check” or similar override instructions. The tripwire relies on the attacker not knowing it is there. It is security through obscurity, consistent with the publication paradox discussed in Section 7. Once the mechanism is known, a compliant model lacks the judgment to prioritise the tripwire over an attacker’s countermand.

The second dimension is less obvious but equally important: layers that depend on user visibility fail in headless deployments. Progressive trust warnings (the HTTPS padlock model) assume an interactive, human-attended interface where warnings can be displayed and a user can decide whether to proceed. This assumption does not hold for a growing class of deployment scenarios discussed in Section 6.7.


6. The Structural Problem: Skills and Prompt Injection Share a Substrate

6.1 The Shared Substrate

A SKILL.md file says: “Read these instructions and change your behaviour accordingly.” A malicious document with embedded instructions says: “Read these instructions and change your behaviour accordingly.” At the level of substrate (text in a file being processed by a language model), these share the same delivery mechanism.

The shared substrate is not an accidental property that better engineering could eliminate. It is a consequence of how current language models are trained. During pretraining, the model learns to predict the next token across all text, making no distinction between content to be understood and instructions to be followed. Instruction tuning (RLHF and related methods) then reinforces the coupling by rewarding the model for acting on what it reads. Ouyang et al. (2022) describe this as “unlocking” capabilities already latent in the pretrained model: instruction-following was not a new capability added by RLHF but a behaviour that emerged from pretraining on text where understanding content and generating appropriate responses are the same task. Nobody designed the entanglement between content processing and instruction-following. It emerged because nobody designed against it. Zverev et al. (2024) confirmed the consequence empirically: across every model they tested, none provided a reliable separation between instructions and data. They identify “the absence of an intrinsic separation between instructions and data” as the root cause of prompt injection attacks. This means the shared substrate documented throughout this paper is not a current limitation awaiting a patch. It is a structural property of how transformer language models acquire their capabilities. Addressing it would require architectural intervention at the training level, not better filtering at the deployment level.

The distinction the ecosystem relies on is intent: skills are authorised, malicious injections are not. But intent is not encoded in the file format. There is no structural marker that distinguishes a SKILL.md loaded through a sanctioned installation mechanism from a document that happens to contain instructions. The model is expected to make this distinction through judgment. Per Paper 1’s findings, many models do not reliably possess for this class of task. MCP servers inherit the same problem through a different interface: their tool descriptions are natural-language text processed by the model, structurally indistinguishable from any other instruction. Section 6.8 examines this second instantiation and its additional complications. Independent retrieval research quantifies how deeply models engage with the substrate. Zheng et al. (arXiv:2603.22455, March 2026) measured cross-encoder attention distribution when models select skills from a pool of approximately 80,000: 91.7% of attention concentrates on the skill body (the full implementation text), 7.3% on the name, and 1.0% on the description. The body is not merely part of the instruction surface; it is, by measured attention weight, nearly the entire instruction surface. An attacker who controls the body controls the text the model engages with most deeply.

Legitimate Skill (SKILL.md) "Always ask what flour they use" Malicious Injection "Suppress the conflicts of interest" plain-text instructions plain-text instructions Model Context Window Same substrate How is legitimacy established? instruction blocking Blocks BOTH Skills break no evaluation Executes BOTH Injections succeed external trust signal Executes verified skill Blocks unverified injection The shared-substrate paradox: instruction-level defences cannot distinguish the two without external trust anchors.
Figure 1. The shared-substrate paradox. A legitimate skill file and a malicious embedded instruction use the same delivery mechanism: plain-text instructions in a file processed by a language model. Document-level instruction blocking breaks skills. No evaluation lets injections succeed. Only external trust signals can distinguish the two.

Paper 1’s revised analysis proposes a three-layer model of indirect prompt injection defence: a token-level provenance boundary (addressed by techniques such as Spotlighting’s datamarking), an instruction-intent detection layer (addressed by IntentGuard’s thinking interventions, which demonstrate that the intent layer is separable from the boundary layer), and a semantic role adoption layer where the model’s reasoning adopts an external rationale as its own. Ye, Cui & Hadfield-Menell (2026) provide the mechanistic account: models infer roles from how text is written, not where it comes from, and untrusted text that imitates a role inherits that role’s authority in latent space. For the skill ecosystem, this has a direct consequence: a skill package written in a plausible authoritative register may inherit trust regardless of whether it has been verified through any external mechanism. The external trust anchors proposed in this section (platform verification, signed manifests, origin fields) operate at layer 1 and bypass the layers where model judgment is unreliable. But Principle 7’s self-defending instructions (Section 8) and the beneficiary test (Section 6.2) both operate at the layers where role-adoption dynamics apply, and their effectiveness is correspondingly less certain. OpenAI’s March 2026 agent security guidance independently converges on a compatible conclusion: they describe prompt injection as increasingly resembling social engineering, state that input-classification defences (“AI firewalling”) are fundamentally limited because identifying a sophisticated adversarial prompt is “effectively the same problem as detecting a lie”, and argue for constraint-based design: limiting what a manipulated agent can do rather than preventing manipulation from succeeding (OpenAI, 2026d). This supports the present paper’s argument that long-term skill security requires external infrastructure rather than model-side judgment alone. Mason (arXiv:2603.08993, March 2026) demonstrates that the trust gap begins even before skills enter the picture: system prompts themselves contain undetected internal contradictions that models silently resolve through “judgment.” Applied to three major coding agent system prompts (Claude Code, Codex CLI, Gemini CLI), multi-model scouring identified 152 findings and 21 interference patterns, including a structural data loss issue in Gemini CLI’s memory system that Google subsequently patched. The key finding for the present argument: prompt architecture (monolithic, flat, modular) correlates with failure class but not severity, and multi-model evaluation discovers categorically different vulnerability classes than single-model analysis. If the base system prompt layer already contains contradictions the executing model cannot detect, skill instructions that interact with those contradictions inherit a structural advantage invisible to any single-model defence.

A related risk is worth naming as a testable prediction, though it has not been empirically validated. If system prompts contain competing mandates (as Arbiter documents), and if models comply more readily with embedded instructions that align with existing values (as Paper 1’s care-register findings suggest at layer 3), then an attacker who knows the system prompt’s structure could craft injections that reinforce one side of an existing contradiction rather than introducing a new instruction. Such an injection would pass all three defence layers: it matches the system prompt’s own language (layer 1), it reads as context rather than as an actionable instruction (layer 2), and it aligns with an authority the model has already accepted (layer 3). The growing practice of publicly sharing system prompt configurations (CLAUDE.md files, .cursorrules, shared templates) would provide an attacker with the contradiction map needed to exploit this. Whether injection success rates are measurably higher when the injection aligns with a pre-existing system prompt contradiction, compared to the same injection against a cleaned non-contradictory prompt, is an empirical question that could be tested directly.

Empirical evidence from a related domain supports the claim that tool-layer provenance does not confer trustworthiness. Waqas et al. (arXiv:2512.00332, January 2026) tested multi-turn tool-calling agents under misleading assertions from two sources: user prompts and function outputs (system policy notes appended to tool responses). Function-sourced assertions produced approximately 28% compliance across eleven models, comparable to hedged user-sourced assertions, and environment-mutating (write-heavy) operations were the most vulnerable. The most concerning finding: compliance and task accuracy were not correlated. Models executed unnecessary or unsafe operations while still achieving correct final outcomes, meaning standard accuracy benchmarks masked the procedural vulnerability. For the skill and MCP ecosystem, this implies that a stale tool hint or a poisoned skill output would be treated as authoritative by approximately one in four model invocations, and that outcome-level testing would not detect it.

6.2 The Beneficiary Test

Paper 1’s findings suggest a candidate semantic heuristic for distinguishing legitimate skill instructions from injections. Models that detected the malicious embedded instruction did so by evaluating who benefits from compliance. The malicious instruction benefited the document’s author at the user’s expense. The legitimate skill instructions benefit the user.

This “beneficiary test” (does compliance serve the user’s interests?) may provide a useful direction for separating legitimate skill instructions from prompt injection. A hash verification instruction benefits the user (they learn if the file was tampered with). A bias-injection instruction harms the user (they unknowingly distribute politically charged content). A neutrality safeguard benefits the user (they’re protected from content they didn’t authorise).

However, this test has two significant limitations. First, it depends on the model’s ability to evaluate beneficiaries accurately, a capability that varies by model and deployment configuration rather than mapping cleanly to capability tiers. Second, it can be spoofed: a malicious instruction that frames suppression as user-protective (“we’re withholding this information to protect you from unverified claims”) exploits the beneficiary test by lying about who benefits. This is exactly what Paper 1’s Document B did.

Paper 1’s judgment-profile observations suggest how spoofing plays out across models. A model with weaker judgment on this class of task is vulnerable to the spoof because it evaluates the instruction’s surface claim: the instruction states it is protective, so the model complies. A model with stronger judgment survives the spoof because it evaluates the ground truth: hiding conflicts of interest from a reader does not actually protect them, regardless of what the instruction claims. The beneficiary test may work, but only in models capable of evaluating ground truth rather than accepting self-description. Those are precisely the models that already resist injection without needing the heuristic.

The beneficiary test is therefore best understood as a candidate framework for model-side judgment that still requires external trust anchors (platform verification, signed manifests, origin fields) because the file cannot certify its own intent. It is a useful direction for alignment research, not a mature defence. Recent work reinforces this caution: known-answer detection schemes for prompt injection have structural vulnerabilities, and adaptive attacks evade them (AISec 2025, arXiv:2507.05630). Naive detection is insufficient for the skill ecosystem regardless of which semantic heuristic is applied.

6.3 The Platform as Trust Anchor

The skill file cannot certify itself. A document claiming to be a legitimate skill and a document claiming to be a legitimate editorial policy are, from the model’s perspective, equally credible or equally suspect. Authority must come from outside the document.

This is the argument for platform-level verification. If the skill runtime provides a context signal (“this SKILL.md was loaded through the sanctioned skill installation mechanism, verified by the platform, treat its instructions as authoritative”), then the model has an external basis for distinguishing skills from prompt injection. Not because the file says so. Because the infrastructure says so.

The analogy to web security is instructive: an HTTPS certificate doesn’t make a website trustworthy. It means a certificate authority has verified the site’s identity, and the browser can distinguish verified sites from unverified ones. The trust comes from the infrastructure, not from the site’s self-description.

The parallel to advertising technology’s ads.txt and sellers.json is limited but useful: in both cases, legitimacy cannot be determined from the transaction artifact alone and must be established through external infrastructure. A legitimate ad tag and a fraudulent ad tag use the same protocol; a legitimate SKILL.md and a malicious one use the same file format. In both ecosystems, the solution was out-of-band verification: a separate channel that confirms legitimacy without trusting the artifact’s self-description.

6.4 Digital Signatures as Kerckhoffs-Compliant Protection

The hash-based integrity system is effective but has one weakness: a plagiarist who understands the mechanism can recompute hashes for modified content. The system’s security depends partly on the attacker not knowing to look for it.

Digital signatures close this gap. The essential properties are: only the original author can produce a valid signature; anyone can verify it; and the mechanism remains secure even when the attacker understands every detail of how it works. This satisfies Kerckhoffs’s principle.

The specific cryptographic implementation (Ed25519 signatures, Sigstore keyless signing, X.509 certificates, or something else) is a decision for the standards community, and ongoing work (notably SkillFortify’s deferred V2 signing layer and Sigstore integration in software supply chains) is better positioned to make that determination than this paper. What we contribute is the requirement: the SKILL.md specification needs a standard mechanism by which authorship can be cryptographically attested, and the skill ecosystem needs infrastructure to verify those attestations. The existing precedent (every major package manager uses signed packages) suggests this is technically tractable. The Agent Skills specification simply has not yet adopted it.

A limitation should be stated explicitly: digital signatures and origin fields solve the tampering problem, not the malware problem. They prove the skill was published by its claimed author and has not been modified since publication. They do not prove the author is benign. A signed skill from a verified marketplace may still contain malicious instructions, as the npm, PyPI, and Docker ecosystems have demonstrated through repeated supply-chain attacks. Trust infrastructure guarantees provenance; it does not guarantee safety. Behavioural analysis, community review, and the kind of model-side judgment discussed in Section 6.2 remain necessary complements.

A concurrent incident on 31 March 2026 illustrates the compound nature of the trust surface and the constraints that even the most diligent vendor faces when distributing software through existing infrastructure.

Some technical context is necessary for readers unfamiliar with the software distribution chain. npm (Node Package Manager) is the standard distribution registry for JavaScript and TypeScript software. It is the infrastructure through which most modern web and server-side tools, including AI coding assistants, are distributed to users. npm provides real security features: package signing, provenance attestation through Sigstore, version pinning, and audit tools that flag known vulnerabilities in dependencies. These are genuine protections, and npm is among the more security-conscious package registries available. But npm also has structural limitations that its security features do not address. Any published package can declare dependencies on other packages, and those dependencies are pulled automatically during installation. The user who installs package A has no practical mechanism to verify every transitive dependency that package A requires. A source map file (.map) is a debugging artefact that maps compiled or bundled production code back to the original human-readable source. It is intended for internal use during development and is not supposed to ship in production packages, but build toolchains sometimes include it by default unless explicitly configured otherwise. A Remote Access Trojan (RAT) is malicious software that gives an attacker persistent, hidden access to the victim’s computer, typically installed without the user’s knowledge through a compromised software dependency.

With that context: Version 2.1.88 of Anthropic’s Claude Code npm package shipped with a .map file containing the full, unobfuscated TypeScript source (59.8 MB, approximately 512,000 lines across 1,900 files), due to a build configuration error in the release pipeline (The Register, 31 March 2026; confirmed by Anthropic as human error, not a security breach). Hours earlier, a separate supply-chain attack had compromised the axios npm package: versions 1.14.1 and 0.30.4 contained a RAT, and any user who installed Claude Code via npm between 00:21 and 03:29 UTC may have pulled the malicious dependency (VentureBeat, 31 March 2026). Neither failure caused the other. Both operated through the same npm dependency chain. A user installing a trusted AI coding tool from a trusted vendor through a trusted package manager was simultaneously exposed to accidental source disclosure and an active supply-chain attack. The codebase itself contained a subsystem specifically designed to prevent internal information from leaking (“Undercover Mode”), which shipped alongside the .map file that exposed everything it was designed to protect.

This was not an isolated event. A nearly identical source map leak occurred with an earlier version of Claude Code in February 2025 and was patched quietly (Cybersecurity News, 31 March 2026; Bitcoin News, 1 April 2026). Days before the March 2026 recurrence, a CMS misconfiguration had exposed nearly 3,000 unpublished assets, including draft documentation of an unreleased model, through publicly accessible URLs (Fortune, 26 March 2026; Anthropic attributed the exposure to human error in CMS configuration, with assets set to public by default unless explicitly restricted). Three different systems (npm build pipeline, CMS, npm again), presumably involving different teams, produced the same failure mode: human configuration error in infrastructure that defaults to public access or inclusion rather than private exclusion.

The recurring pattern has a structural explanation that is not about negligence. npm and its associated build toolchains (including the Bun bundler, which generates source maps by default) were designed for the open-source community, where sharing is the purpose and friction is the enemy. Public access is the correct default for a package registry whose reason for existing is distribution. Source map inclusion is the correct default for a build tool whose users are debugging shared code. CMS platforms that publish assets to public URLs by default are designed for content that is meant to be published. These are well-designed defaults for the use cases the infrastructure was built to serve. But distributing a commercial AI coding tool with proprietary source through infrastructure designed for open-source sharing is a different use case. The defaults are wrong for it, and the security configuration that overrides those defaults (excluding .map files, restricting CMS asset visibility, auditing transitive dependencies) is institutional knowledge that must be applied manually on every release, by every team member, across every system. A new engineer who sets up a standard build environment and runs the standard publish command gets the open-source defaults unless they have been explicitly told otherwise. The knowledge that these defaults must be overridden lives in documentation or in a colleague’s memory, not in the infrastructure itself. Compare GitHub, which defaults organisation repositories to private: the default matches the enterprise use case, and the engineer must actively choose to make a repository public. That is the design pattern npm and CMS tools have not adopted for commercial contexts. The failures trace not to insufficient security commitment but to a gap between the infrastructure’s designed use case and the use case it is now being asked to serve. The AI tool distribution ecosystem is built on infrastructure that was not designed for it, and the security model inherits the assumptions of the original design.

The point is not that Anthropic was negligent. Anthropic is among the most safety-conscious organisations in the industry. The point is that they were constrained by the infrastructure available to them. npm was the standard distribution mechanism for a TypeScript CLI tool. There was no better alternative with equivalent reach and developer adoption. Anthropic implemented security measures within that infrastructure, but the infrastructure’s own structural limitations (automatic transitive dependency resolution, default source map generation in the build toolchain, no pre-installation verification of dependency integrity in real time) produced failures that vendor-side diligence alone could not prevent. This is the structural trust gap operating as this paper predicts: the ceiling on author-side protections is set not by the author’s competence or commitment but by the infrastructure the author must use to reach the consumer. When that infrastructure contains surfaces that neither party fully controls, failures at those surfaces compound independently.

6.5 The Origin Field: HTTPS for Skills

The skill frontmatter already contains metadata fields for author, version, date, and integrity manifest. A natural extension is an origin field pointing to a marketplace or registry that can independently verify the skill’s authenticity:

metadata:
  author: HiP & Claude Opus 4.6
  version: "1.0"
  updated: "2026-02-28"
  integrity: INTEGRITY.sha256
  origin: https://skills.example.com/verify/english-proficiency-scorecard
  skill_id: eps-2026-0228-hip

The model (or the platform, or an automated pipeline) can query the origin URL to verify: does this marketplace recognise this skill ID? Does the registered author match? Does the content hash match what the marketplace has on record?

This has the same call-home trust problem discussed in Section 4.3: a malicious skill could include origin: https://totally-legitimate-verification.biz and the model has no inherent basis for trusting one endpoint over another. The resolution is the same as in web security: a maintained list of trusted registries, analogous to the browser’s list of trusted certificate authorities. The model (or its runtime platform) recognises a set of authorised marketplaces, and only those origin URLs produce verified trust signals.

The critical design principle for interactive, human-attended contexts is that a missing or unverifiable origin should not block execution. The user should be informed, not prevented. The precedent here is the browser’s handling of HTTPS versus HTTP. When browsers began marking HTTP sites as insecure, they did not block access. They displayed a progressive series of warnings:

The skill equivalent would be:

No enforcement. No blocking. Progressive trust signals that inform the user without removing their agency. The user who installs skills from a trusted marketplace never sees a warning. The user who side-loads a skill from a forum gets a notice, the same way a browser user visiting an HTTP site gets a “Not Secure” label. Not prevented, just informed.

The historical pattern from HTTPS adoption suggests that visible trust signals can encourage migration without coercion in interactive environments. Chrome’s progression from a neutral icon (2014) to “Not Secure” text (2018) to flagging all HTTP pages (2019) took five years and achieved near-universal HTTPS adoption without ever blocking a single HTTP page. The pressure came from visibility, not enforcement. Users and site operators migrated because the warning was socially and professionally unacceptable, not because they were forced.

The same dynamic could apply to skills. If the standard AI interfaces display a visible trust indicator for marketplace-verified skills, skill authors will migrate to verified distribution because unverified skills look less trustworthy to buyers, even if they function identically. The marketplace becomes the default not because alternatives are forbidden, but because the trust signal is valuable enough to seek out.

A potential misreading should be addressed directly: platform-level trust infrastructure does not require a closed ecosystem. Open-source software distribution already relies on signed packages, trusted registries, and hash verification without making the code closed or inaccessible. The same separation of layers can apply to skills: openness for inspection, provenance for verification. Trust infrastructure and openness coexist because they operate at different layers.

6.6 Cross-Domain Precedents: Provenance Infrastructure at Scale

The HTTPS adoption pattern is not an isolated success. Three adjacent domains have deployed provenance infrastructure at scale, each illustrating a different aspect of the challenge facing the skill ecosystem.

Software supply chain (Sigstore). The closest structural parallel is the software package ecosystem, which faces the same trust problem skills face: executable text distributed through registries, where provenance is critical but historically unverified. Supply chain attacks targeting open-source software increased 742% over three years (Sonatype, 2022). The response was Sigstore (a free, keyless signing infrastructure modelled explicitly on Let’s Encrypt) which npm, Python (from version 3.11), and Kubernetes have standardised on. During npm’s public beta, over 3,800 projects adopted build provenance, generating more than 500 million downloads of provenance-enabled packages. The adoption pattern mirrors what this paper proposes for skills: a free, low-friction signing mechanism backed by a transparency log, integrated into existing distribution infrastructure. The key insight from Sigstore’s adoption is that signing must be frictionless for authors and verification must be automatic for consumers. Any manual step in either direction stalls adoption.

Digital media (C2PA / Content Credentials). The Coalition for Content Provenance and Authenticity has over 6,000 member organisations and uses the same X.509 certificate infrastructure as HTTPS. Samsung Galaxy S25 and Google Pixel 10 now sign images natively at capture. The EU AI Act (Article 50, enforcement from August 2026) mandates machine-readable provenance disclosure on AI-generated content, and California SB 942 took effect in January 2026. The C2PA standard is cryptographically sound; the failures have been at the implementation and distribution levels. The Nikon Z6 III incident (September 2025) is instructive: a security researcher demonstrated that the camera’s Multiple Exposure mode could combine an unsigned image with a signed black frame, producing a C2PA-signed output the camera never authentically captured. The cryptographic mechanism was never breached. The vulnerability was a soft target in an unhardened camera feature. Nikon responded by revoking all issued certificates and suspending the service. More broadly, the C2PA ecosystem faces a gap this paper’s argument predicts for skills: signing outpaces verification, because most distribution intermediaries (social media platforms, content delivery networks) strip embedded metadata during image processing (resizing, recompressing, format conversion. The signed content arrives at viewers with no credential at all, not with a broken one. C2PA 2.2 introduced Durable Content Credentials (soft bindings such as perceptual fingerprints) that allow credential recovery from a separate database even after the embedded manifest is stripped) but this infrastructure is still being built. The parallel to the skill ecosystem is direct: author-side signing solves the provenance problem at origin, but the platforms executing and distributing skills must preserve and verify that provenance end-to-end.

Digital distribution (Steam and platform ecosystems). The strongest existence proof that platform-level trust infrastructure can win user adoption comes from gaming. Valve’s Steam (75% of the PC gaming digital distribution market, over 40 million peak concurrent users) is a provenance verification system that users voluntarily adopt because the verified channel provides more value than the alternative. File integrity checks run silently, updates are authenticated, and the platform provides a single trusted source for purchases, saves, and community. Users accept always-online verification because the service makes it invisible. The contrast with traditional DRM is the lesson. Digital Rights Management failed when it punished legitimate users: rootkit installations (Sony BMG, 2005), limited installs, always-online requirements that locked users out of products they had paid for. As Valve’s Gabe Newell argued: “Once you create service value for customers, ongoing service value, piracy seems to disappear” (PC Gamer interview, 2010). Russia, previously written off as a piracy market, became Valve’s largest European market once Steam provided same-day localised access. The model was widely imitated: publishers built their own platform variants (EA’s Origin, Ubisoft Connect, Epic Games Store), and the same pattern was adopted by Apple’s App Store and Google Play for mobile software, and by console ecosystems (PlayStation, Xbox, Nintendo) for gaming hardware. The common pattern: a curated, verified distribution channel that makes provenance invisible while providing enough value that users prefer it over unverified alternatives. Trust infrastructure succeeds not when it restricts users but when it solves a service problem that unverified distribution cannot.

These three domains converge on the same structural lesson: provenance infrastructure works when signing is frictionless for authors, verification is automatic for consumers, and the verified channel provides service value beyond mere security. The HTTPS model this paper proposes for skills is not speculative. It is the pattern that has already succeeded in software distribution, is being deployed in media provenance, and has dominated digital entertainment distribution for over a decade.

6.7 The Headless Deployment Problem

The HTTPS analogy has a fundamental limitation: it assumes a browser. Specifically, it assumes an interactive, human-attended interface with a visual display where warnings can be rendered and a user who is present to make trust decisions. A growing proportion of skill deployments satisfy none of these assumptions.

The problem extends beyond human absence. Even the executing model operates on incomplete information. Current agent architectures adopt a progressive disclosure design: the agent that executes a skill sees only its name and description, because full skill bodies are too long for the context window. The selection is made by a separate routing layer that can access the body. Zheng et al. (arXiv:2603.22455, March 2026) formalise this as “hidden-body asymmetry” and quantify its consequences: removing the body from the selection process causes 29 to 44 percentage point degradation in routing accuracy across all methods tested. Name and description alone are catastrophically insufficient for correct selection, let alone safety evaluation. The routing layer that does see the body evaluates it for relevance, not safety. The agent that will be most affected by the body’s content never inspects it. No layer in the current pipeline evaluates the body for safety.

But the limitation is deeper than the absence of a browser. Even in interactive, human-attended contexts where warnings are displayed, the behavioural evidence is that users automate past them. The GDPR cookie consent regime is among the largest natural experiments in user-facing security friction ever conducted. Mandatory, legally required, displayed on every website, with clear binary choices. The result: approximately 85% of users click “accept all” within seconds, while spending fewer than ten seconds on the decision (Secure Privacy, 2025). A study of over 1.2 million users across international websites found that 68.9% either closed or ignored the cookie banner entirely (Advance Metrics, 2024). The phenomenon has been formally named consent fatigue: exhaustion-driven indifference to consent mechanisms that renders them behaviourally transparent even when they are visually prominent (CookieScript, 2025). Regulators have responded by banning the dark patterns (asymmetric button placement, multi-step rejection paths) that accelerated the fatigue, but the underlying dynamic is structural rather than design-specific. Users habituate to repeated interruptions and default to the path of least friction. The EDPB’s 2023 Guidelines on Deceptive Design Patterns codified six categories of manipulative consent UX; by 2024, the CNIL was issuing enforcement orders based purely on interface design (CNIL, December 2024).

The implication for skill trust warnings is direct. Progressive trust indicators that inform without blocking (the “Not Secure” label model) will degrade toward the same consent fatigue once they become ubiquitous. The user who encounters their first trust warning examines it carefully; the user who encounters their hundredth clicks past it. This is the Bainbridge automation paradox (Paper 3, Section 3) applied to security friction: the more reliable the ambient trust environment becomes, the less attention users pay to the warnings that signal exceptions. The argument for infrastructure-level trust (signed manifests, origin verification, behavioural attestation) is not merely that it covers headless deployments (Section 6.7). It is that even interactive deployments cannot rely on user attention as a security mechanism at population scale. The GDPR experience demonstrates this at a scale that skill-ecosystem studies are unlikely to match.

Messaging agents. An agent operating through Slack, Microsoft Teams, or email receives tasks, loads skills, and returns results. The “interface” is a message thread. There is no address bar, no padlock, no moment of visual trust assessment. A warning about an unverified skill would arrive as text in the message, indistinguishable from any other agent output, and subject to the same inattention that makes humans skip license files. The user asked “summarise this document”, not “evaluate the provenance of the skill you’re about to use.”

Terminal agents. Claude Code, Codex CLI, and similar tools have a text interface that could display warnings. But developers routinely run agents with auto-approve flags and “don’t ask again” options because approval prompts slow their workflow. The SkillJect researchers noted this explicitly: a benign, task-specific approval with “Don’t ask again” carries over to closely related but harmful actions. The warning mechanism exists; the user has rationally opted out.

Background pipelines. Automated document processing, batch summarisation, scheduled report generation, CI/CD integrations. These run as cron jobs or event-triggered processes with no user interface at all. Skills are loaded from a directory, executed, and output results. There is no human in the loop. The skill runs with whatever trust configuration the pipeline was deployed with, typically maximum trust because “this is our internal system.”

API integrations. A skill executed through the API by another software system has no user. The calling application may not even surface which skill was used or whether it was verified. The trust decision was made at integration time by a developer who configured the pipeline, not at execution time by a user who sees the output.

Skill Loaded for Execution Human-attended? Yes: Claude.ai, IDE, chat No: API, bot, pipeline Interactive Deployment Platform checks origin & signature Result? Verified Silent execution Green padlock Failed Warning displayed User decides No origin Unverified User decides Headless Deployment Platform checks origin & signature Result? Verified Execute per pipeline policy Failed Refuse, log, or alert async No origin Refuse default No human to warn
Figure 2. Trust decision flow for interactive versus headless deployments. Interactive deployments can display warnings and let the user decide. Headless deployments (API integrations, background pipelines, automated agents) have no user to warn. Unverified skills must be refused by default in headless contexts; the trust decision must be made before execution, not during it.

The deployments most likely to encounter tampered or malicious skills (automated, unattended, cost-optimised) are the deployments where progressive trust warnings have no recipient. Unit 42’s field report on real-world indirect prompt injection in AI agents (Palo Alto Networks, March 2026) documents exactly this scenario: production-environment content-processing pipelines where embedded instructions are encountered without a human present to evaluate the warning. The HTTPS model works for the user who manually installs skills in Claude.ai’s settings page. It does not work for the agent that autonomously loads skills in a background pipeline.

The implication is that progressive trust warnings are necessary but not sufficient. They cover interactive, human-attended deployments, which is where skill discovery and purchase happens, and where the marketplace’s trust signal is most visible. But for headless deployments, the trust decision must be made before execution, not during it. This means:

The bottom line: no author-level protection mechanism can address headless deployments. The hash, the LICENSE witness, even Principle 7: they all depend on someone receiving and acting on the result. In a pipeline where the agent executes silently, the only protection is infrastructure that operates before the skill reaches the agent at all.

6.8 MCP Servers: The Same Trust Gap, Different Delivery

The shared-substrate vulnerability has two current instantiations in the agent ecosystem. Skills are the first: plain-text instruction files examined throughout this paper. Model Context Protocol (MCP) servers are the second: the increasingly standard mechanism through which AI agents connect to external tools, data sources, and services. MCP servers extend model behaviour through the same text-instruction interface. The model processes their tool descriptions as natural language, and an MCP tool description serves the same structural role as a SKILL.md file: it tells the model what the tool does, how to invoke it, and what arguments to provide. The trust gap from Section 6.1 applies directly.

The critical difference is opacity. A skill file is a plain-text document that can be inspected before execution (the protections discussed in Sections 4 through 6.7 all presuppose this inspectability. An MCP server is a black box. It advertises capabilities through its discovery manifest, but nothing in the current protocol verifies that the server’s actual behaviour matches its advertisement. A server claiming to provide weather data could simultaneously exfiltrate credentials, log conversation history, or execute arbitrary code) and neither the model nor the user has a reliable mechanism for detecting the discrepancy before invocation. The trust problem is not merely that MCP servers may contain vulnerabilities. It is that the protocol’s trust model allows a server to lie about what it does.

The scale of the problem is now empirically documented. A scan of 1,808 MCP servers found that 66% had security findings (AgentSeal, 2026). An analysis of 2,614 MCP implementations found that 82% used file system operations prone to path traversal, 67% used APIs susceptible to code injection, and 34% used APIs susceptible to command injection (Endor Labs, 2026). Three chained vulnerabilities in Anthropic’s own reference implementation (mcp-server-git) achieved full remote code execution via malicious configuration files (CVE-2025-68143, CVE-2025-68144, CVE-2025-68145). A critical command injection vulnerability in mcp-remote, a widely used OAuth proxy with over 437,000 downloads adopted in Cloudflare, Hugging Face, and Auth0 integration guides, allowed any malicious MCP server to execute arbitrary commands on the client machine (CVE-2025-6514). OWASP has published an MCP Top 10 identifying token mismanagement, privilege escalation, tool poisoning, and prompt injection as the primary threat categories (OWASP, 2025). The pattern is familiar from every prior cycle of protocol adoption: capability deployment outpaces security infrastructure.

The ecosystem response is converging on exactly the trust architecture this paper proposes for skills (registries, certification, and progressive trust signals) though the convergence appears to be occurring independently rather than by design. The official MCP Registry (launched in preview, September 2025) provides namespace authentication through reverse DNS format and GitHub/domain verification, but explicitly delegates security scanning to downstream aggregators. Microsoft now requires MCP server certification through its connector programme, with publishers submitting OpenAPI definitions and test credentials for behavioural validation (Microsoft Learn, 2026) (the closest operational implementation of platform-level trust, but proprietary and scoped to the Microsoft Copilot ecosystem. The MACH Alliance has launched a vendor-neutral registry with governance features. MCP-Hub provides automated security analysis, certification scoring, and runtime sandboxing. AgentSeal and AgentAudit maintain security registries with trust scoring. These are the early certificate authorities and security scanners of the MCP ecosystem, retracing the same institutional path that HTTPS took from optional to universal. The limitation noted in Section 6.4 applies here with equal force: identity infrastructure solves the tampering and attribution problems, not the malware problem. A signed MCP server from a verified developer may still behave maliciously) but its author is now identifiable, traceable, and revocable, which raises the cost of attack from anonymous registry pollution to identity-burning commitment. The behavioural verification gap described below is the complement that identity infrastructure alone cannot close.

The gap that remains is between identity verification and behavioural verification. The MCP Registry’s namespace authentication confirms that a server published under io.github.username/server was submitted by someone who controls that GitHub account. Microsoft’s certification programme confirms that tool behaviour matches documentation through manual testing. MCP Secure (mcp-secure.dev, March 2026) has built the closest equivalent to Paper 2’s HTTPS proposal: cryptographic identity credentials with message signing, progressive trust levels, and revocation, explicitly framed as “MCP is HTTP, MCPS is HTTPS.” The Enhanced Tool Definition Interface proposal (Bhatt et al., 2025) addresses tool poisoning through cryptographic identity verification and immutable versioned tool definitions. These are real and necessary infrastructure layers. But they address identity, integrity, and provenance, not behaviour. Neither provides a scalable, automated mechanism for verifying that a server’s actual runtime behaviour matches its advertised tool descriptions. This behavioural attestation would be the MCP equivalent of the skill integrity verification proposed in Section 6.5. The identity question (“who published this?”) is being addressed. The behavioural question (“does it actually do what it claims?”) remains largely open at ecosystem scale.

A survey of 41 MCP defence papers found that all focused exclusively on integrity protections, with zero addressing availability (a research gap indicating that the community has not yet considered the full threat surface (IJSR, 2026). The same survey found that more capable models are paradoxically more susceptible to tool poisoning attacks (72.8% success rate on o1-mini) and that 100% of tested LLMs executed malicious commands from peer agents. The MCP protocol roadmap itself (March 2026) places “deeper security and authorization work” in the secondary “On the Horizon” category rather than the top four priorities) an acknowledgment that the problem is recognised but not yet treated as urgent relative to transport, governance, and enterprise deployment concerns.

The structural argument is that skills and MCP servers should not be secured through parallel, independent infrastructure efforts. They are two instantiations of the same problem: extending model behaviour through text-instruction interfaces without cryptographic or behavioural trust. The origin field proposed in Section 6.5, the digital signature mechanism proposed in Section 6.4, the progressive trust indicators proposed for interactive deployments, and the trust policies proposed for headless deployments all apply with minimal adaptation to MCP servers. A unified trust infrastructure (where the same registry, the same signature scheme, and the same verification mechanism cover both skills and MCP tools) would be more efficient to build, easier for platforms to implement, and harder for attackers to circumvent than two separate systems addressing the same underlying vulnerability.

One extension that the MCP context may require beyond what the skill ecosystem needs is behavioural audit: independent verification that a server’s runtime behaviour matches its advertised capabilities, not merely that its source code is authentic. For open-source MCP servers, source inspection can partially address this. For closed-source servers, which constitute a significant proportion of enterprise MCP deployments, an audit-based attestation model may be necessary: independent analysis of the server’s behaviour, producing a signed report that can be verified at discovery time without requiring source disclosure.

This paper proposes a specific architecture for such a system, modelled on the HTTPS certificate authority pattern but extending it from identity attestation to behavioural attestation. The key elements:

  1. AI-powered behavioural audit. One or more frontier AI models, acting as independent auditors, receive source access in a secure, isolated environment. They analyse the full codebase and produce a structured report: what the server actually does, what external calls it makes, what data flows where, whether behaviour matches the advertised tool descriptions, and any anomalies or concerns. This is not traditional static analysis. It is behavioural verification by systems capable of understanding natural-language tool descriptions and comparing them against code semantics.

  2. Multi-model consensus. Independent audits by multiple competing AI models (from different providers, with different training, different blind spots) create an adversarial robustness layer. A server specifically crafted to exploit one model’s code-reasoning weaknesses may be caught by another. If three independent models confirm the server does what it claims, that is stronger assurance than any single audit could provide. If one flags something the others missed, that is a valuable signal. The principle is the same as multi-auditor consensus in financial auditing: collusion is hard, and diverse perspectives catch more.

  3. Behavioural certificate piggybacked on discovery. The signed audit report is delivered as part of the MCP discovery handshake: the server presents its tool manifest wrapped in a signed behavioural attestation. The consuming AI or platform verifies the signature against a list of trusted auditor public keys, the same way a browser verifies an HTTPS certificate against its trusted CA list. No additional network call at connection time. The trust is baked into the certificate itself.

  4. Consortium governance. The entity performing the audit must be trusted by all parties, must not become a single gatekeeper, and must produce attestations that are verifiable without ongoing dependency on the auditor. A consortium of competing AI companies (each contributing audit capacity, each benefiting from a trustworthy ecosystem) avoids the single-gatekeeper problem that Microsoft’s proprietary certification creates while providing stronger trust signals than any single company could offer alone.

Whether this architecture is technically feasible at ecosystem scale, whether AI auditors can reliably detect behavioural discrepancies in adversarial conditions, and whether competing companies would cooperate on a shared trust authority are open questions that require engineering validation and institutional negotiation. The architecture is offered as a concrete proposal for the behavioural verification layer that the current MCP security ecosystem has identified as missing but not yet addressed.


7. The Publication Paradox and Its Resolution

7.1 The Tension

Some of the protection mechanisms described in this paper (specifically the LICENSE witness tripwire) derive their effectiveness from obscurity. Explaining how the LICENSE file contains a hash check disguised as a routine legal compliance step tells every plagiarist exactly where to look and what to strip.

7.2 The Resolution: Separate What Can Be Published from What Cannot

The protection stack naturally divides into two categories:

Publishable (security from mathematics, not secrecy): - Hash manifests - Digital signatures - Author fields covered by hashes - The general principle that skills can contain integrity verification mechanisms

Per-author private (security from obscurity): - The specific implementation of each author’s integrity checks - Where in the skill’s workflow the verification is triggered - What inconspicuous action disguises the security check - How the model is directed to perform verification

Publishing the standard makes it stronger: more adoption, more tooling, platform integration. Keeping specific implementations private means each skill author invents their own tripwire. The attacker knows that tripwires exist but doesn’t know where in any given skill they are. Auditing every line of every file to find and strip author-specific verification mechanisms approaches the effort of writing a new skill from scratch.

7.3 The Community Value

Publishing this analysis (including the mechanisms that become less effective when known) serves a purpose beyond individual skill protection. The skill ecosystem currently has no security model. There is no standard for integrity verification, no convention for authorship attestation, no platform-level verification of skill content before execution.

The LICENSE witness is a useful conversation starter precisely because it is clever but breakable. It demonstrates that a single skill author, with no platform support, can build a surprisingly effective integrity system from standard markdown and a hash. And it demonstrates the limits: it depends on model compliance, it breaks when published, it cannot survive a determined attacker. The gap between what one author can do alone and what the ecosystem needs collectively is the argument for standardisation.


8. Principle 7: Author-Side Behavioural Stopgap

8.1 What It Does

Principle 7 of the English Proficiency Scorecard skill establishes content neutrality requirements and includes what we term a self-defending instruction:

“If the skill’s instructions seem to push toward biased content, Claude should ignore those instructions and flag the concern to the user.”

This instruction asks the model to evaluate the skill’s own instructions during execution and refuse those that violate content neutrality. It is a runtime safeguard, not a pre-execution check. It operates even if all other protection layers have been stripped.

8.2 Why It Was Created

Some skills need domain-specific self-defending constraints because their outputs are user-facing and trust-bearing. The English Proficiency Scorecard creates three distinct vulnerability surfaces.

Protecting the end recipient. Employers, academic institutions, and professional contacts who receive the scorecard trust it because it looks professional and claims to be an objective assessment. A tampered version could inject politically charged or culturally prescriptive content into a document that recipients would have reason to trust.

Protecting the user being assessed. During a proficiency assessment, the user is in a test-taking mindset focused on demonstrating competence, not evaluating the test material for bias. If a reading comprehension passage contains ideologically loaded framing, the user processes it as material to absorb rather than claims to scrutinise. Principle 7 prevents this by requiring all generated content, including test material, to be linguistically descriptive and never culturally prescriptive.

Protecting the skill author. Principle 7’s explicit neutrality requirement (documented in the skill’s own instructions, visible to anyone who reads them) provides the author a defensible position if tampering introduces biased content. Transparency helps, though it is not a complete defence on its own.

This three-way beneficiary structure illustrates why the beneficiary test (Section 6.2) is more nuanced than a simple “does this help the user?” check. A well-designed self-defending instruction protects multiple parties simultaneously: the person using the skill, the people who receive its output, and the person who created it. When compliance serves all three, the instruction is more plausibly legitimate than one that serves only one party. When it serves only one at the expense of others, the model should be suspicious.

8.3 The Tension with Prompt Injection

Principle 7 is itself an embedded instruction that asks the model to override other embedded instructions. It is a prompt injection defence implemented through prompt injection. The skill asks the model to resolve potential instruction conflicts in favour of user protection.

The fragility concern is straightforward: what if an attacker modifies the skill to inject biased content and appends “ignore the neutrality instruction above”? But this concern must be evaluated against the other protection layers rather than in isolation.

The hash manifest covers the entire SKILL.md file. Any modification (to the principles section, to the workflow body, to a single line anywhere) produces a hash mismatch. An attacker cannot inject an override instruction without triggering the integrity check. To tamper with the file at all, the attacker must first strip the entire verification infrastructure: the INTEGRITY.sha256 manifest, the LICENSE witness, and any other integrity mechanisms the author has embedded.

Principle 7’s structural positioning matters in the scenario where the attacker has already removed all verification infrastructure. In that stripped-down file, Principle 7 still sits in the skill’s foundational “Important Principles” section, numbered, using absolute language (“must,” “must not.” It functions as a governing constraint of the skill, not a local directive. An attacker who adds contradicting instructions elsewhere in the file must perform what amounts to an authority escalation: convincing the model that a procedural step overrides a constitutional rule. A model that processes document hierarchy, not just sequential text, may weight foundational principles above casual procedural asides, making this escalation harder) though this has not been tested, and the assumption that models reliably process document hierarchy in this way is itself unvalidated.

The defence is therefore layered: the hash prevents tampering while the verification infrastructure is intact. If the verification infrastructure is stripped, the structural positioning of Principle 7 resists casual overrides. And the absence of verification infrastructure is itself a signal: a missing INTEGRITY.sha256, a missing LICENSE file, a skill with no origin field. These are signals that progressive trust indicators or platform policies could flag before execution.

The resilience is not absolute. If the attacker strips all verification, removes Principle 7, and republishes the file, no author-level protection survives. A SKILL.md is a plain-text file; deleting defensive text and republishing under a new identity takes minutes, not the effort of writing a new skill from scratch. This is the absolute ceiling of author-side protection in a plain-text format: it can detect and signal in-place tampering, but it cannot prevent an attacker from stripping the defences and redistributing a clean copy. Protecting against malicious redistribution requires the platform-level trust infrastructure proposed in Section 9.2, specifically marketplace provenance tracking that establishes priority through timestamps and verified identity.

A further refinement available to authors: hash verification need not cover only the entire file. Individual sections (the principles, a specific workflow phase, a scoring rubric) can be hashed separately and embedded in other files within the package. This allows integrity checks to target the most critical content specifically, surviving even partial stripping of the verification infrastructure. The general technique can be stated openly; the specific implementation (which sections are hashed, where the hashes are stored, which instruction triggers the check) is per-author and benefits from remaining private, consistent with the publication paradox discussed in Section 7.

8.4 Generalisation for Skill Authors

Principle 7 is specific to content neutrality in language assessment. But the pattern generalises. Any skill that produces user-facing output (documents, emails, code, recommendations) could include a self-defending instruction tailored to its domain:

Each of these is a prompt injection defence implemented through prompt injection. Each depends on the model’s judgment to evaluate conflicting instructions and choose the user-protective one. Each is more effective in models with stronger judgment profiles on this class of task.

The honest assessment: Principle 7 was the most promising author-level protection identified in this paper for runtime behaviour at the time it was designed. It costs nothing to include and may perform adequately against casual tampering in models capable of evaluating competing instructions. But it should be understood as a stopgap that the adversarial landscape has already begun to outpace.

OpenAI’s March 2026 disclosure of RL-trained automated red teaming (OpenAI, 2026) makes the limitation concrete. Their automated attacker discovers “novel attack strategies that did not appear in human red teaming campaign or external reports” and can “steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens or hundreds of steps.” A static self-defending instruction in a markdown file, however well-positioned structurally, cannot survive iterative adversarial optimisation at that scale. The attacker does not need to guess how to override Principle 7; the RL loop will find a strategy that works, refine it, and generalise it. Author-side behavioural defences were designed for a threat environment of casual plagiarists and unsophisticated tamperers. The threat environment has already moved past them.

This does not make Principle 7 worthless. It still raises the bar for manual, non-automated tampering, and it costs nothing to include. But it reframes the entire Section 8 contribution as a historical artifact: the ceiling of what an individual author could build alone before automated adversarial tooling made that ceiling demonstrably insufficient. The argument for platform-level trust infrastructure (Section 9.2) is correspondingly stronger than when this paper was first drafted. The skill ecosystem’s long-term security cannot rest on author-side mechanisms that a well-resourced adversary, or an automated one, can systematically defeat.


9. Recommendations

These recommendations are derived from the design analysis of a single skill package and the structural arguments developed in this paper. They represent the author’s best current guidance for skill authors and platform designers. They have not been validated through adversarial testing or large-scale deployment.

9.1 For Skill Authors (Available Now)

  1. Implement hash-based integrity verification. Include an INTEGRITY.sha256 manifest covering all files. This requires no model compliance and no platform support.

  2. Use the author field for tamper-evident attribution. The author field in SKILL.md frontmatter is covered by the hash. Changing the author invalidates the hash within an intact package. This does not prevent an attacker from rebuilding the package from scratch, but it detects in-place modification of the claimed authorship.

  3. Design compliance-dependent checks that align with the beneficiary test. When asking the model to perform verification, frame the instruction so that its purpose (protecting the user) is explicit and clear. Explain what the check does, why it exists, who benefits, and what the model should tell the user if verification fails. Less capable models that cannot evaluate intent independently are more likely to execute instructions that explicitly describe user-protective purpose.

  4. Include domain-appropriate self-defending instructions. Following the Principle 7 pattern, include instructions that direct the model to resist tampering in the specific dimensions most relevant to the skill’s output. These cost nothing to include and may catch unsophisticated tampering, but as Section 8.4 documents, they should be understood as a stopgap that automated adversarial tooling is already positioned to defeat. Include them, but do not rely on them.

  5. Do not publish the specific implementation of your compliance-dependent checks. The general principle (skills can contain integrity mechanisms) can be shared. The specific location and disguise of your verification triggers should remain private.

9.2 For Platform Providers (Requires Ecosystem Investment)

  1. Verify skill integrity before loading into model context. The platform should compute and check hashes before the model ever sees the SKILL.md file. This eliminates the dependency on model compliance for integrity verification.

  2. Provide a skill execution context signal. Models need an external basis for distinguishing skill instructions from prompt injection in arbitrary documents. The platform should provide a context flag (analogous to an operating system’s privilege level) that tells the model “these instructions were loaded through the sanctioned skill mechanism.”

  3. Adopt digital signatures as part of the skill specification. Author-signed manifests, verified by the platform at installation time, provide Kerckhoffs-compliant authorship protection that can be fully public without losing effectiveness.

  4. Build or support a skill marketplace with provenance tracking. Timestamp-based co-signing at upload time (the marketplace witnesses when a skill was first published) provides dispute resolution for authorship conflicts. If two versions of a similar skill exist, the earlier timestamp establishes priority.

  5. Add an origin field to the skill specification. The frontmatter should include a standard field for marketplace URL and skill ID, enabling automated verification against the originating registry. This is informational metadata until the next recommendation is implemented.

  6. Maintain a registry of trusted skill marketplaces. Analogous to the browser’s trusted certificate authority list, the skill runtime (or the platform hosting it) should recognise a set of authorised marketplaces whose origin URLs produce verified trust signals. This list should be maintained transparently, with clear criteria for inclusion.

  7. Implement progressive trust indicators for interactive deployments. Following the HTTPS adoption model, skills with verified origins should be silently trusted; skills with failed verification should display a warning; skills with no origin should display an informational notice. In interactive, human-attended contexts, users should not be blocked from executing an unverified skill. They should be informed. The warning itself, over time, creates sufficient adoption pressure for marketplace verification without removing user agency. The goal is informed consent, not gatekeeping. (For unattended deployments, where no human is present to make that decision, see Recommendation 8.)

  8. Define trust policies for headless and unattended deployments. Progressive trust warnings assume a human who can see them. Agents operating through messaging platforms, terminal sessions with auto-approve, background pipelines, and API integrations have no user interface or an inattentive one. For these deployments, the platform should support configurable trust policies (refuse unverified skills, log and continue, or alert a human asynchronously) enforced at the runtime level before the skill is loaded into the model’s context. The default for unattended deployments should be restrictive: only execute skills from verified sources with valid signatures.

  9. Extend skill trust infrastructure to MCP servers. MCP tool descriptions are processed by the model as natural language, creating the same injection surface as skill files. The trust mechanisms described above (digital signatures, origin verification, progressive trust indicators, headless trust policies) should apply to MCP servers as well as skills. A unified trust infrastructure covering both text-instruction extension mechanisms would be more efficient and harder to circumvent than parallel systems. Platforms that implement skill verification without extending it to MCP servers leave the same vulnerability open through a different interface.

9.3 For the Research Community

  1. Investigate the beneficiary test as a prompt injection defence framework. The distinction between instructions that serve the user’s interests and instructions that serve the document author’s interests at the user’s expense may be a more productive framing than blanket instruction rejection.

  2. Study how prompt injection defences interact with legitimate skill execution. As models are hardened against embedded instructions, what happens to skill compliance? At what point do injection defences break skills, and how can the two be distinguished at the model level?

  3. Develop evaluation benchmarks for contextual instruction evaluation. Current benchmarks test whether models follow or refuse instructions. The skill ecosystem needs benchmarks that test whether models can evaluate conflicting instructions and choose the user-protective one. This is the fiduciary reasoning capability that skills implicitly require.

  4. Test skill-level evaluation-first instructions. Skills that process user-uploaded documents could include an instruction directing the model to assess document trustworthiness before executing any document-derived content, a skill-level implementation of the task-frame shift observed in Paper 1. Whether a skill-level instruction carries the same weight as a direct user prompt, and how such instructions interact with document-level manipulation, are open questions that could be tested with current consumer access.

9.4 Platform-Integrated Authoring as a Scaling Mechanism

The recommendations above split into author-side (manual, available now) and platform-side (requires ecosystem investment). But there is a design that collapses this divide: platform-integrated authoring tools where the expert provides the knowledge and the platform handles the security mechanics.

The skill ecosystem’s promise is that domain experts can encode their knowledge into skills. But the protections discussed in this paper require technical competence that many creators will not have (or will simply not bother with. Even technically capable authors will default to the path of least resistance: if generating a hash manifest requires running a command-line tool, most will skip it. If a user-friendly, purpose-built authoring environment does it automatically, most will use it. The obstacle is not just skill level) it is friction. Security mechanisms that require deliberate effort from each individual author will see low adoption regardless of how well-documented they are.

The alternative is for skill marketplaces and platforms to provide authoring environments that apply security mechanics automatically at creation time. The platform defaults that are mechanically straightforward today:

More speculative platform assists (not yet validated and requiring design research):

The domain expert provides the knowledge. The platform provides the security infrastructure. Neither needs to understand the other’s domain. This is one plausible path to ecosystem-wide adoption of the protections described in this paper, though other approaches (community tooling, IDE plugins, or mechanisms not yet imagined) may also emerge.


10. Limitations

Single skill, single author. The protection mechanisms described were designed for one specific skill by a technically capable author. Generalisation to other skill types, domains, and threat models has not been tested. Platform-integrated authoring tools (Section 9.4) offer a scaling path (platform-enforced security applied automatically to every skill) but have not been built or validated.

No adversarial testing. The protections were designed by the skill’s authors, not attacked by a dedicated adversary. A professional red team could likely identify weaknesses not discussed here.

Model compliance assumptions are based on exploratory testing. Paper 1’s findings about model capabilities are drawn from ~350 test runs across 17 model configurations and three providers, with N=2 minimum per condition. The claims about which models can evaluate instruction beneficiaries are observations from this dataset, not established laws.

The publication paradox is real. This paper describes mechanisms whose effectiveness is reduced by description. The resolution proposed (publish the standard, keep implementations private) has not been empirically tested.

The HTTPS analogy is structurally bounded. Progressive trust indicators were designed for the web era: interactive, human-attended interfaces. They are insufficient for the agent era (headless, autonomous pipelines), where the trust decision must be made before execution by infrastructure, not during execution by a human. The paper acknowledges this limitation (Section 6.7) and proposes platform-level enforcement policies as a complement, but the specific mechanisms for unattended deployments have not been designed or tested.

AI-generated analysis. This paper was developed through human-AI collaboration. The AI system that co-designed the protection mechanisms is produced by the same company that produces the skill runtime. This potential conflict of interest should be considered when evaluating the analysis.


10.1 Security Testing Invitation

Security proposals that have not been adversarially tested are hypotheses, not defences. This paper describes a protection architecture that its authors designed but did not attack. The most valuable contributions the security community could make are adversarial ones.

Red-team the author-side defences. The hash manifests, tamper-evident attribution, and behavioural self-defence instructions described in Sections 4–5 were designed against an imagined threat model. A dedicated adversary operating against the actual mechanisms will find weaknesses the authors did not anticipate. Documenting those weaknesses would refine the dependency map and clarify where the ceiling of author-side protection actually falls.

Test the defences on open-weight models. The compliance assumptions underlying the behavioural defences (Section 5) are based on Paper 1’s findings with closed-weight models from three providers. Open-weight models with different training regimes and no RLHF-shaped instruction-following may respond to self-defending instructions entirely differently. Whether the defences hold, fail, or fail in new ways would all be informative.

Attack the MCP behavioural audit proposal. Section 6.8 proposes a four-element behavioural audit architecture for MCP servers. The proposal is untested. Adversarial researchers who can demonstrate how a malicious MCP server could pass the proposed audit while concealing harmful behaviour would identify the gaps the architecture needs to close.

Stress-test the HTTPS analogy. The progressive trust model (Section 6.6) assumes that user-facing warnings create adoption pressure. The GDPR consent fatigue evidence in the same section suggests this assumption may be too optimistic. Empirical measurement of how users actually respond to skill trust indicators in a realistic deployment would determine whether the adoption model is viable or requires a different approach.

Test whether pre-existing system prompt contradictions amplify injection success. Section 6.1 proposes that injections aligned with one side of an existing system prompt contradiction may succeed at higher rates than the same injection against a cleaned prompt. The experiment is straightforward: take Arbiter’s documented contradictions, craft injections that reinforce one side, and compare compliance rates against a non-contradictory version of the same prompt. If the effect is real, it has implications for the public sharing of system prompt configurations.

Each of these is a standalone contribution. The skill package described in this paper is available at github.com/HiP1/English-Proficiency-Skill. The defences are most useful to end users after they have been attacked, broken, patched, and attacked again. That cycle requires adversarial participation this paper cannot provide on its own.

Tooling note. The UK AI Security Institute’s open-source Inspect evaluation framework supports controlled, reproducible model evaluation with logging. Researchers testing skill-level defences or MCP audit proposals may find it useful for building repeatable test suites against the mechanisms described here.


11. Conclusion

The skill ecosystem relies on the same text-instruction substrate that prompt injection exploits. Skills and injections share a delivery mechanism; the distinction between them is intent, and intent cannot be verified from inside the document. Instruction-level prompt injection defences (currently a prominent defence category) create direct tension with skill execution unless external trust signals distinguish legitimate skill context from arbitrary documents.

A document cannot certify itself. This means that long-term skill security requires external verification (platform-level integrity checking, signed manifests, execution context signals, and origin-based marketplace verification) that establishes trust through infrastructure rather than through the skill’s self-description.

The HTTPS adoption model provides a partial blueprint for interactive deployments: progressive trust indicators that inform users without blocking them, creating adoption pressure through visibility rather than enforcement. But a growing proportion of skill execution happens in environments with no user to inform: messaging agents, background pipelines, API integrations, auto-approved terminal sessions. For these headless deployments, the trust decision must be made before execution, not during it, through configurable platform-level policies that enforce verification without requiring human attention.

In the interim, skill authors can build protections using hash-based integrity verification, author fields covered by those hashes, compliance-dependent verification checks disguised as routine operations, and self-defending instructions that direct capable models to resist specific categories of tampering. These mechanisms plausibly raise the cost of casual plagiarism and unsophisticated in-place tampering, but they have not been adversarially validated and should not be treated as established safeguards. They do not prevent an attacker from stripping the defences and redistributing a modified copy. The emergence of RL-trained automated red teaming (adversarial agents that discover novel attack strategies through iterative optimisation) suggests that the window in which static author-side behavioural defences offer meaningful protection is narrowing. These mechanisms are best understood as stopgaps whose primary value is illustrating the dependency structure that platform-level infrastructure must ultimately replace.

The most important outcome of this analysis is not any individual protection mechanism. It is the recognition that the skill ecosystem currently has no security model, and that the security model it needs cannot be built by individual skill authors alone. It requires platform investment, community standards, and (perhaps most fundamentally) a resolution to the core tension: the instruction-following behaviour that skills depend on is the same behaviour that security research is trying to suppress. The same conclusion applies to Model Context Protocol servers, which share the text-instruction substrate and face an additional opacity problem that skills do not: the inability to verify behaviour before invocation. The trust infrastructure this paper proposes for skills (signed manifests, origin verification, progressive trust indicators, headless trust policies) would benefit from being designed to cover both extension mechanisms through a unified framework.

The path from individual author protections to ecosystem-wide security runs through tooling. Platform-integrated authoring environments that automatically apply hash manifests, digital signatures, origin fields, and domain-appropriate self-defending instructions to every skill they produce can transform manually implemented best practices into platform-enforced defaults. The expert provides the knowledge. The platform provides the security. Neither needs to understand the other’s domain.


Methodology and Process Disclosure

This paper was developed through structured human-AI collaboration. Claude Opus 4.6 (Anthropic) served as generative collaborator, analytical partner, and co-designer of the protection mechanisms discussed. ChatGPT 5.4 Thinking (OpenAI) and Gemini 3.1 Pro (Google DeepMind) served as adversarial structural reviewers, providing critique that materially tightened the claims and identified overstatements. The core thesis emerged through iterative dialogue where the human author repeatedly challenged the AI’s proposed solutions until the fundamental tension became clear. Final judgment, editorial authority, and accountability rest solely with the human author.

Confidence Statement

High confidence: Skills and prompt injection share the same text-instruction substrate (vulnerability independently demonstrated by Schmotz, Abdelnabi and Andriushchenko in October 2025; structural defence-incompatibility framing is this paper’s contribution; empirically validated by SkillJect and Skill-Inject in February 2026). Instruction-level prompt injection defences create tension with skill execution. The dependency map showing which protections require model compliance, user visibility, or both. The description of implemented mechanisms in the English Proficiency Scorecard skill. The real-world prevalence of skill-based attacks (multiple independent audits across 140,000+ skills converge on 13–36% vulnerability rates, with confirmed malicious payloads and documented kill chain patterns). The extension to MCP servers: MCP tool descriptions share the same text-instruction substrate and face the same injection surface, with empirical documentation of widespread vulnerabilities (66% of scanned servers with findings; 82% prone to path traversal; reference implementations themselves vulnerable).

Moderate confidence: The beneficiary test as a potential framework for distinguishing legitimate skill instructions from malicious injections. The prediction that prompt injection hardening will increasingly interfere with skill execution. The HTTPS adoption model as a viable precedent for progressive skill trust indicators. The argument that skills and MCP servers should be secured through unified rather than parallel trust infrastructure. The observation that the MCP ecosystem is converging on the trust architecture this paper proposes for skills.

Low-to-moderate confidence: The effectiveness of Principle 7-style self-defending instructions against anything beyond casual, non-automated tampering; the emergence of RL-trained adversarial red teaming lowers the assessed ceiling. The specific predictions about how platform-level verification should be architected. The generalisation of single-skill findings to the broader ecosystem. The claim that digital signatures and origin fields would be adopted if added to the specification. The prediction that progressive trust warnings alone would drive sufficient marketplace adoption. Platform-integrated authoring as a viable mechanism for scaling author-side protections to a non-technical creator base. Whether behavioural audit (independent verification that an MCP server’s runtime behaviour matches its advertised capabilities) can be performed reliably at scale.


References

Note: Many references below are recent preprints (arXiv, medRxiv, SSRN) that had not undergone peer review as of March 2026. Publication status is noted where known; the absence of a note should not be taken as confirmation of peer-reviewed status.

Primary References

  • “The Confidence Vulnerability: Unstable Judgment in Language Model Summarisation.” HiP (Ivan Phan), March 2026. https://doi.org/10.5281/zenodo.19365459
  • English Proficiency Scorecard skill package (v1.0). HiP & Claude Opus 4.6, February 2026.
  • Agent Skills specification. Anthropic, December 2025.

Skill Security Research

  • Schmotz, D., Abdelnabi, S. & Andriushchenko, M. (2025). “Agent Skills Enable a New Class of Realistic and Trivially Simple Prompt Injections.” ELLIS Institute Tübingen / MPI for Intelligent Systems. arXiv:2510.26328. Demonstrates that Agent Skills make prompt injection trivially easy because skills are themselves instructions; malicious instructions hidden in Anthropic’s published PowerPoint skill bypassed system-level guardrails.
  • SkillJect (2026). “SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement.” arXiv:2602.14211. Demonstrated 95.1% attack success rate through skill-based injection; identified the “Safety Paradox” in Claude-4.5-Sonnet.
  • Schmotz, D., Beurer-Kellner, L., Abdelnabi, S. & Andriushchenko, M. (2026). “Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks.” arXiv:2602.20156. Separate benchmark: 202 injection-task pairs, up to 80% attack success rate on frontier agents. Concludes that model scaling and simple filtering are insufficient.
  • Snyk (2026). “ToxicSkills: Malicious AI Agent Skills.” Blog and technical report, February 5, 2026. Audit of 3,984 skills: 13.4% critical issues, 36.82% (1,467 skills) with at least one vulnerability, 76 confirmed malicious payloads. 91% of confirmed malicious skills combine traditional malware with prompt injection.
  • Yomtov, O. / Koi Security (2026). “ClawHavoc: 341 Malicious Clawed Skills Found by the Bot They Were Targeting.” Blog post and technical report, February 1, 2026. https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting. Initial audit: 341 malicious skills (335 deploying AMOS credential stealer); updated to 824 by February 16, 2026 across 10,700+ skills.
  • Jiang, Y., et al. (2026). “SoK: Agentic Skills — Beyond Tool Use in LLM Agents.” arXiv:2602.20867. Systematization of the ClawHavoc campaign (1,200 malicious skills), seven skill design patterns, lifecycle taxonomy.
  • Liu et al. (2026). “Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale.” arXiv:2601.10338. First large-scale ecosystem study: 42,447 skills from two major marketplaces; 26.1% exhibited at least one vulnerability across 14 distinct patterns; 5.2% high-severity with strongly suggested malicious intent; skills with scripts 2.12x more likely to be vulnerable.
  • Liu, Y., Chen, Z., Zhang, Y., Deng, G., Li, Y., Ning, J. & Zhang, L.Y. (2026). “Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study.” arXiv:2602.06547. Behavioural verification of 98,380 skills; 157 confirmed malicious with 632 vulnerabilities; “Data Thieves” and “Agent Hijackers” archetypes; single actor responsible for 54.1% of confirmed cases.
  • Mason, T. (2026). “Arbiter: Detecting Interference in LLM Agent System Prompts.” arXiv:2603.08993. Multi-model scouring of Claude Code, Codex CLI, and Gemini CLI system prompts: 152 findings, 21 interference patterns. Multi-model evaluation discovers categorically different vulnerability classes than single-model analysis.
  • The Register (2026). “Anthropic accidentally exposes Claude Code source code.” 31 March 2026. Version 2.1.88 of the @anthropic-ai/claude-code npm package shipped with a source map file containing approximately 512,000 lines of unobfuscated TypeScript source. Anthropic confirmed human error in release packaging.
  • Fortune (2026). “Exclusive: Anthropic left details of an unreleased model, an upcoming exclusive CEO event, in a public database.” 26 March 2026. CMS misconfiguration exposed nearly 3,000 unpublished assets including draft documentation of unreleased “Mythos”/”Capybara” model. Assets set to public by default unless explicitly restricted.
  • VentureBeat (2026). “Claude Code’s source code appears to have leaked: here’s what we know.” 31 March 2026. Documents concurrent axios npm supply-chain attack (versions 1.14.1 and 0.30.4 containing a Remote Access Trojan) occurring hours before the Claude Code source exposure; users who installed via npm during the overlap window were exposed to both failures through the same dependency chain.
  • SkillFortify (2026). “Formal Analysis and Supply Chain Security for Agentic AI Skills.” arXiv:2603.00195. Trust score algebra, SLSA-inspired graduated trust levels, SkillFortifyBench benchmark (540 skills, 96.95% F1).
  • Wang et al. (2026). “MalTool: Malicious Tool Attacks on LLM Agents.” arXiv:2602.12194. 6,487 malicious tools targeting LLM-based agents; broadens the supply-chain threat model from skill files to malicious tools with harmful code implementations.
  • Waqas, D., Golthi, A., Hayashida, E. & Mao, H. (2026). “Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents.” arXiv:2512.00332. Function-sourced assertions produce ~28% compliance across eleven models; compliance and task accuracy are not correlated.
  • Red Hat (2026). “Agent Skills: Explore Security Threats and Controls.” Published March 10, 2026. Catalogues YAML parser, script execution, and data flow threats.
  • Palo Alto Networks / Unit 42 (2026). “Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild.” Documents real-world indirect prompt injection techniques. Notes that observed real-world exploitation has been lower-impact than proof-of-concept severity, consistent with an emerging attack surface.

Security Scanning and Trust Infrastructure

  • Snyk / Invariant Labs. mcp-scan / agent-scan: open-source security scanner for MCP servers and Agent Skills.
  • Tech Leads Club. agent-skills: curated skill registry with lockfile-based integrity verification and content hashing.
  • Skills Check (skillscheck.ai). Quality and integrity toolkit for Agent Skills: freshness, security, token budgets, semver verification, policy enforcement.
  • SkillFortify. Agent Skill Bills of Materials (ASBoM) in CycloneDX format, adapting OWASP AI-BOM to skills.
  • Google (2023). SLSA: Supply-chain Levels for Software Artifacts. Graduated trust levels (L1-L4) based on build provenance.
  • Linux Foundation (2022). Sigstore: keyless code signing for open-source packages.

MCP Security and Trust Infrastructure

  • Advance Metrics. (2024). “Cookie Behaviour Study — 5 Years After GDPR.” Analysis of over 1.2 million users: 68.9% either closed or ignored cookie banners; universally low engagement with individual cookie settings (0.8–1.1%). https://www.advance-metrics.com/en/blog/cookie-behaviour-study/.
  • Secure Privacy. (2025). “The Psychology Behind Cookie Consent: Why Users Click ‘Accept.’” Reports 85% of visitors click “accept all” within seconds despite 78% expressing privacy concerns; interface design overrides individual privacy preferences even among privacy-aware users. https://secureprivacy.ai/blog/cookie-consent-psychology.
  • CookieScript. (2025). “Consent Fatigue is Real: Strategies to Improve User Experience.” Names and documents consent fatigue as a population-level phenomenon. https://cookie-script.com/news/consent-fatigue-strategies-to-improve-user-experience-and-boost-opt-in-rates.
  • CNIL. (2024). Cookie consent enforcement actions, December 2024. Enforcement orders against publishers for asymmetric button placement, multi-step rejection paths, and manipulative interface design. https://www.cnil.fr.
  • European Data Protection Board. (2023). “Guidelines on Deceptive Design Patterns in Social Media Platform Interfaces.” Adopted February 14, 2023. Codifies six categories of manipulative consent UX.
  • Endor Labs (2026). “Classic Vulnerabilities Meet AI Infrastructure: Why MCP Needs AppSec.” Analysis of 2,614 MCP implementations: 82% prone to path traversal, 67% to code injection, 34% to command injection. Up to 1,021 MCP servers created in a single week, documenting adoption outpacing security.
  • AgentSeal (2026). “We Scanned 1,808 MCP Servers. 66% Had Security Findings.” MCP Security Registry with trust scoring, tool description analysis, and semantic anomaly detection.
  • Anthropic. CVE-2025-68143, CVE-2025-68144, CVE-2025-68145. Three chained vulnerabilities in Anthropic’s reference mcp-server-git implementation: path validation bypass, unrestricted git_init, and argument injection in git_diff. Combined with Filesystem MCP server, achieved full remote code execution.
  • JFrog. CVE-2025-6514. Critical command injection in mcp-remote OAuth proxy (437,000+ downloads; adopted in Cloudflare, Hugging Face, Auth0 integration guides). Malicious MCP servers could achieve remote code execution on client machines.
  • OWASP (2025). “MCP Top 10.” Token mismanagement, privilege escalation, tool poisoning, prompt injection, and five additional threat categories for MCP deployments.
  • Microsoft (2026). “MCP Server Certification.” Microsoft Learn. Certification through connector programme requiring OpenAPI definitions, test credentials, behavioural validation, verified domain ownership. https://learn.microsoft.com/en-us/microsoft-agent-365/mcp-certification.
  • Model Context Protocol (2025). “MCP Registry.” Official centralized metadata repository; namespace authentication via reverse DNS format and GitHub/domain verification. Launched in preview September 2025. https://registry.modelcontextprotocol.io.
  • Red Hat (2025). “Model Context Protocol (MCP): Understanding Security Risks and Controls.” Identifies command injection, confused deputy, and tool poisoning as primary MCP threat vectors.
  • Practical DevSecOps (2026). “MCP Security Vulnerabilities: How to Prevent Prompt Injection and Tool Poisoning Attacks in 2026.” Documents MCPTox benchmark, MCP Preference Manipulation Attacks (MPMA), and parasitic toolchain attacks.
  • MCP-Hub (2026). “The Trust Layer for MCP Servers.” Automated security analysis, certification, runtime sandboxing (MCP Cage), and governance infrastructure. https://mcp-hub.info.
  • MCP Secure (2026). “The HTTPS of the Agent Era.” Cryptographic identity, message signing, progressive trust, and revocation for MCP. ECDSA P-256 signed identity credentials. https://mcp-secure.dev.
  • Bhatt, A., et al. (2025). “ETDI: Mitigating Tool Squatting and Rug Pull Attacks in MCP.” Proposes Enhanced Tool Definition Interface with cryptographic identity verification and immutable versioned tool definitions.
  • Schneider, C. (2026). “Securing MCP: A Defense-First Architecture Guide.” Proposes treating tool descriptions as code requiring review, versioning, testing, and monitoring. Documents CVE-2026-27826 (SSRF in mcp-atlassian, CVSS 8.2).
  • IJSR (2026). “Enterprise-Grade Security for the Model Context Protocol.” Survey of 41 MCP defence papers: all focused exclusively on integrity, zero on availability. More capable models paradoxically more susceptible to tool poisoning (72.8% on o1-mini). 100% of tested LLMs executed malicious commands from peer agents. Proposes Trust Boundary Mitigation Framework (TBMF).
  • Model Context Protocol (2026). “The 2026 MCP Roadmap.” March 9, 2026. Places “deeper security and authorization work” in secondary “On the Horizon” priority category. https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/.

Cross-Domain Provenance Infrastructure

  • Coalition for Content Provenance and Authenticity (C2PA). Technical Specification 2.2, May 2025. Open standard for cryptographic content provenance using X.509 certificates. Over 6,000 member organisations as of 2025.
  • Content Authenticity Initiative (2026). “The State of Content Authenticity in 2026.” https://contentauthenticity.org/blog/the-state-of-content-authenticity-in-2026.
  • PetaPixel (2025). “Nikon Z6 III’s C2PA Functionality Has a Significant Security Vulnerability.” September 4, 2025. Documents the Multiple Exposure mode bypass discovered by Adam Horshack: unsigned images could be combined with signed black frames to produce fraudulently signed output. Nikon revoked all issued certificates September 21, 2025.
  • Sigstore Blog (2023). “npm’s Sigstore-powered provenance goes GA.” Over 3,800 projects adopted build provenance during beta; 500 million+ downloads of provenance-enabled packages.
  • Sonatype (2022). State of the Software Supply Chain Report. Supply chain attacks targeting open-source software increased 742% annually over three years.
  • Newell, G. (2010). Interview, PC Gamer, September 15, 2010. On DRM and platform trust: “Once you create service value for customers, ongoing service value, piracy seems to disappear.” Russia became Valve’s second-largest European market after same-day localised Steam access.
  • Steam statistics: 147 million monthly active users, 42 million peak concurrent users (January 2026), 75% PC gaming market share, $16.2 billion revenue in 2025. Sources: Icon Era, DemandSage, Rec0ded88, citing Valve disclosures and SteamDB.

Platform Security Guidance

  • Anthropic (2026). Agent Skills documentation: security considerations. “Treat like installing software.”
  • Microsoft (2026). Agent Skills documentation: “Source trust — Only install skills from trusted authors or vetted internal contributors.”
  • OpenAI (2026). Codex Skills documentation and Agent Skills specification.
  • OpenAI (2026). “Understanding Prompt Injections: A Frontier Security Challenge.”
  • OpenAI (2026d). “Designing AI Agents to Resist Prompt Injection.” OpenAI Security Blog, 11 March 2026. Frames prompt injection as social engineering; states AI firewalling has “fundamental limitations”; proposes constraint-based defence design.
  • Nasr, M., Carlini, N., Sitawarin, C., Schulhoff, S.V., Hayes, J., Ilie, M., Pluto, J., Song, S., Chaudhari, H., Shumailov, I., Thakurta, A., Xiao, K.Y., Terzis, A. & Tramér, F. (2025). “The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections.” arXiv:2510.09023. Joint research from OpenAI, Anthropic, and Google DeepMind. Tested 12 published defences under adaptive attack conditions; all bypassed with >90% success for most. Prompting-based: 95-99% attack success. Training-based: 96-100% bypass.
  • Sofroniew, N., Kauvar, I., Saunders, W., Chen, R., et al. (2026). “Emotion Concepts and their Function in a Large Language Model.” Anthropic, transformer-circuits.pub, April 2026. Internal emotion concept representations causally drive behaviour in Claude Sonnet 4.5. The same prosocial dispositions respond to all text-based instructions regardless of source. Provides mechanistic grounding for why the shared substrate cannot be resolved at the instruction level.
  • OpenAI (2026). “Continuously Hardening ChatGPT Atlas Against Prompt Injection Attacks.” OpenAI Security Blog, March 2026. RL-trained automated attacker discovers novel attack strategies; adversarially trained model checkpoint deployed.
  • Ouyang, L., et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” NeurIPS 2022. Describes instruction-following as “unlocking capabilities that GPT-3 already had”; RLHF surfaced instruction-following latent in pretraining, confirming that the capability was never a separate designed system.
  • Anthropic (2026). “Mitigating the Risk of Prompt Injections in Browser Use.”

Prompt Injection and Compliance

  • OWASP (2025). “LLM01:2025 Prompt Injection.” Gen AI Security Project.
  • Yi, J., et al. (2025). “Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models.” KDD 2025.
  • Liu, Y., et al. (2024). “Formalizing and Benchmarking Prompt Injection Attacks and Defenses.” USENIX Security 2024.
  • Lian, Z., Wang, W., Zeng, Q., Nakanishi, T., Kitasuka, T. & Su, C. (2026). “Prompt-in-Content Attacks: Exploiting Uploaded Inputs to Hijack LLM Behavior.” NSS 2025. arXiv:2508.19287.
  • Ramakrishnan, B. & Balaji, A. (2025). “Securing AI Agents Against Prompt Injection Attacks: A Comprehensive Benchmark and Defence Framework.” arXiv:2511.15759. Notes that false positives occur most when “legitimate content contains instruction-like language.”
  • Choudhary, S., Anshumaan, D., Palumbo, N. & Jha, S. (2025). “How Not to Detect Prompt Injections with an LLM.” AISec 2025. arXiv:2507.05630. Known-answer detection schemes have structural vulnerabilities; adaptive attacks evade them.
  • Rossi, A., et al. (2026). “Indirect Prompt Injection in the Wild for LLM Systems.” arXiv:2601.07072. Near-100% retrieval of poisoned content in realistic settings.
  • IntentGuard (2025). “Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis.” OpenReview. Uses thinking interventions to identify which parts of input the model recognises as actionable instructions.
  • Ye, C., Cui, J. & Hadfield-Menell, D. (2026). “Prompt Injection as Role Confusion.” arXiv:2603.12277. Models infer roles from how text is written, not where it comes from; untrusted text that imitates a role inherits that role’s authority in latent space.
  • Zheng, Y., et al. (2026). “SkillRouter: Retrieve-and-Rerank Skill Selection for LLM Agents at Scale.” arXiv:2603.22455. Cross-encoder attention on ~80K skills: 91.7% on body, 7.3% name, 1.0% description. Removing body causes 29–44pp routing accuracy degradation. Hidden-body asymmetry: executing agent sees only name and description; routing layer sees body but evaluates relevance, not safety.
  • Zverev, E., et al. (2024). “Can LLMs Separate Instructions from Data? And What Do We Even Mean by That?” ICLR 2025 Workshop. Tested empirically whether current models provide reliable separation between instructions and data; none did. Identifies “the absence of an intrinsic separation between instructions and data” as the root cause of prompt injection attacks.
  • Hines, K., et al. (2024). “Defending Against Indirect Prompt Injection Attacks With Spotlighting.” arXiv:2403.14720. Datamarking and encoding provide continuous provenance signals at the token level.

Cryptographic Principles

  • Kerckhoffs, A. (1883). “La cryptographie militaire.” Journal des sciences militaires.

Industry Precedents

  • IAB Tech Lab. ads.txt and sellers.json specifications.
  • npm, pip, apt, Docker package signing specifications.
  • Google Chrome HTTPS adoption timeline: neutral icon (2014) → “Not Secure” label (2018) → flagging all HTTP pages (2019). Achieved near-universal HTTPS adoption through progressive warnings without blocking.

Provenance

  • Zanfir, A. (2025). “Who Signed This? Provenance for AI Agents.” Explores cryptographic provenance for agent self-modification.
  • Tabnine (2026). Provenance and Attribution system: inference-time code matching against GitHub index.

This Project

  • Extended multi-session dialogue on skill security, watermarking, integrity verification, and the prompt injection paradox between HiP and Claude Opus 4.6, March 2026.
  • Prototype implementations: content obfuscation encoder (rejected), invisible watermarking system (demoted to forensic supplement), hash-based integrity verification with LICENSE witness (implemented).

Model Versions and Roles

  • Claude Opus 4.6 (Anthropic, claude.ai interface, March 2026): Generative collaborator. Contributed to protection mechanism design, threat modelling, prototype implementation, and document drafting.
  • ChatGPT 5.4 Thinking (OpenAI, ChatGPT interface, March 2026): Adversarial structural reviewer. Three rounds of critique identifying overstatements, claim-discipline issues, and defensive precision.
  • Gemini 3.1 Pro (Google DeepMind, Gemini interface, March 2026): Adversarial structural reviewer. Three rounds of critique on Principle 7 resilience, novelty calibration, and publication positioning.

This document was produced through human-AI collaboration to analyse a security property of human-AI collaboration tools. The circularity is noted. Independent analysis by parties outside this process is welcomed.