Multimodal Agent Skills Guide

Author: Jia Jingqiu - English web edition: 2026-07-01

Abstract

Multimodal agents are moving from "can call tools" toward "can execute complex workflows reliably." In production, the bottleneck is often not model capability but skill design: when a skill should trigger, how much context it should load, which steps it should follow, how it should choose branches, how it should avoid stale instructions, and how success should be evaluated. Inspired by Matt Pocock's framework for writing great agent skills, this paper proposes a multimodal multi-skill architecture for commerce workflows. Instead of a single large skill, a production system should compose small, inspectable skills for product truth, evidence gating, routing, keyframes, generation prompts, video QA, publication acceptance and failure memory. The goal is not more instructions. The goal is a more predictable process.

1. Thesis: skills are instruments for predictability

A skill is not valuable because it makes an agent sound smarter. A skill is valuable when it turns a repeated task into a more predictable process. A brainstorming skill may produce different ideas each time, but it should reliably understand the goal, generate candidates, filter them and suggest next actions. In the same way, a multimodal production skill should reliably protect product facts, inspect evidence, choose routes and reject weak outputs.

This distinction matters because multimodal output can be visually convincing while commercially wrong. A video can look polished while the package drifts. A product image can look premium while the logo, material or scale is wrong. A report can read fluently while it never checks the screenshots. Skill design is the layer that makes the agent slow down, inspect the right evidence and leave an auditable trail.

The first principle of multimodal skill systems is predictability: every skill should stabilize one process, not promise one magical output.

2. Why multimodal work needs many skills

A commerce video task is not one task. It is a chain of smaller decisions across text, image, video, audio, product pages, platform rules and business goals. If everything is placed inside one giant skill, the agent sees too many future goals at once and often rushes to the final artifact.

Stage	Input	Output	Main Risk
Product truth	Links, photos, titles, specs, reviews	Product Truth Card	Facts get polluted by style words
Evidence gate	Images, screenshots, listings, historic assets	Ready / needs evidence / blocked	The system generates before evidence is sufficient
Intent routing	Raw user text, target platform, available assets	Task type, platform, route	TikTok, Amazon and site-hero standards get mixed
Keyframes	Product Truth Card, platform goal	Storyboard and keyframe prompts	The scene looks good but proves nothing
Multimodal QA	Candidate image or video	Accept, repair or reject	The review rewards aesthetics instead of truth
Publication and memory	Final asset, channel page, QA result	Acceptance record and next-time rule	Successful and failed decisions are not reusable

3. A taxonomy of multimodal skills

Product Truth Skill

Extract product name, category, material, shape, logo, packaging, visual anchors, supported claims and unknowns. It should not write creative copy.

Evidence Gate Skill

Decide whether the task is ready, needs more evidence or must be blocked before generation. This is the safety valve before expensive output.
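The three-way decision can be sketched in a few lines of Python. This is only an illustration: the `Evidence` fields and the rule that an unsupported claim blocks the task are assumptions made for the example, not requirements stated in this paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class GateStatus(Enum):
    READY = "ready"
    NEEDS_EVIDENCE = "needs_evidence"
    BLOCKED = "blocked"


@dataclass
class Evidence:
    # Hypothetical evidence inventory; a real gate would inspect the assets.
    product_images: int = 0
    listing_text: bool = False
    unsupported_claims: int = 0


def evidence_gate(ev: Evidence) -> Tuple[GateStatus, List[str]]:
    """Return a gate status plus the reasons behind it."""
    # Assumed rule: claims without evidence block the task outright.
    if ev.unsupported_claims > 0:
        return GateStatus.BLOCKED, ["claims lack supporting evidence"]
    missing = []
    if ev.product_images == 0:
        missing.append("at least one product image")
    if not ev.listing_text:
        missing.append("listing text or specs")
    if missing:
        return GateStatus.NEEDS_EVIDENCE, missing
    return GateStatus.READY, []
```

Returning the reasons with the status keeps the gate auditable: the operator sees what is missing, not only that generation was stopped.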

Router Skill

Convert raw user text into structured intent: modality, product category, platform, task stage and risk. Style requests belong in generation payload, not retrieval intent.

Storyboard and Keyframe Skill

Turn product truth into scenes, actions, product anchors, on-screen text, voiceover and QA risks. Video should often be keyframe-first.

Generation Prompt Skill

Use approved facts, references and scene plans to build model-specific prompt contracts. It should execute, not invent missing facts.

Multimodal QA Skill

Inspect product shape, material, hand-object contact, action logic, temporal fidelity, audio-visual alignment and commercial readability.
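A hedged sketch of how the inspection list could turn into a verdict. The check names mirror the list above. The verdict rule (identity failures reject, other failures repair, an uninspected check counts as failed) is an assumption for illustration.

```python
from typing import Dict, List

CHECKS: List[str] = [
    "product_shape", "material", "hand_object_contact", "action_logic",
    "temporal_fidelity", "audio_visual_alignment", "commercial_readability",
]
# Assumption: failures on product identity cannot be repaired in place.
IDENTITY_CHECKS = {"product_shape", "material"}


def qa_verdict(results: Dict[str, bool]) -> dict:
    """Map per-check results to accept / repair / reject with failure labels."""
    # A check that was never run is treated as failed, not as passed.
    failed = [c for c in CHECKS if not results.get(c, False)]
    if not failed:
        verdict = "accept"
    elif IDENTITY_CHECKS & set(failed):
        verdict = "reject"
    else:
        verdict = "repair"
    return {"verdict": verdict, "failure_labels": failed}
```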

Publication Acceptance Skill

Verify file existence, format, visible page, platform fit, product truth and acceptance evidence. "Generated" is not the same as "usable."

Failure Memory Skill

Store failure labels, evidence, repair actions and next-time rules so future routing can avoid repeated mistakes.
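As a minimal sketch, a failure record can carry exactly the four fields named above, and the store can refuse entries that lack a next-time rule. The class names and the `platform` field are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureRecord:
    label: str
    evidence: str
    repair_action: str
    next_time_rule: str
    platform: str = "any"  # hypothetical scoping field


@dataclass
class FailureMemory:
    records: List[FailureRecord] = field(default_factory=list)

    def add(self, record: FailureRecord) -> None:
        # A record without a rule is a retrospective, not reusable memory.
        if not record.next_time_rule.strip():
            raise ValueError("a failure record needs a next-time rule")
        self.records.append(record)

    def rules_for(self, platform: str) -> List[str]:
        """Return the rules a router should consider for this platform."""
        return [r.next_time_rule for r in self.records
                if r.platform in (platform, "any")]
```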

4. Multi-skill architecture

A production-grade multimodal agent should not begin with a universal skill. It should begin with a small router and a set of specialized skills. The router preserves the user's original request, extracts structured intent and points to the next skill.

raw_user_text
  -> Router Skill
  -> Product Truth Skill
  -> Evidence Gate Skill
  -> Platform / Task Route
  -> Storyboard or Keyframe Skill
  -> Generation Prompt Skill
  -> Multimodal QA Skill
  -> Repair or Publish
  -> Failure Memory Skill

The boundary matters. The router should not write prompts. The product-truth skill should not do style expansion. The evidence gate can stop generation. The QA skill must label failures and propose repair actions. The memory skill must produce next-time rules, not vague retrospectives.
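The chain and its boundaries can be sketched as plain functions that each add one field to a shared state. Every function body here is a placeholder and the sample values are invented. The point is only that the evidence gate can stop the run before any prompt is written.

```python
from typing import Callable, List, Tuple

State = dict
Skill = Callable[[State], State]


def router(state: State) -> State:
    # Adds structured intent next to the untouched original request.
    return {**state, "retrieval_intent": {"task": "video", "platform": "tiktok"}}


def product_truth(state: State) -> State:
    return {**state, "product_truth_card": {"name": "steel bottle", "unknowns": ["capacity"]}}


def evidence_gate(state: State) -> State:
    # Hypothetical rule: any unknown on the card means more evidence is needed.
    card = state["product_truth_card"]
    return {**state, "gate": "needs_evidence" if card["unknowns"] else "ready"}


def generation_prompt(state: State) -> State:
    return {**state, "prompt": "placeholder prompt contract"}


def run_pipeline(raw_user_text: str,
                 skills: List[Tuple[str, Skill]]) -> Tuple[State, List[str]]:
    state: State = {"raw_user_text": raw_user_text}
    trace: List[str] = []
    for name, skill in skills:
        state = skill(state)
        trace.append(name)
        if state.get("gate") in ("needs_evidence", "blocked"):
            break  # the evidence gate can stop generation
    return state, trace
```

The returned trace is the auditable trail: it records which skills ran and where the run stopped.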

5. Invocation design: model-invoked or user-invoked

Model-invoked skills reduce user effort but increase context load because their descriptions sit in the agent's environment. User-invoked skills keep the agent leaner but increase cognitive load on the operator. A multimodal system should choose based on risk and frequency.

Skill Type	Recommended Trigger	Reason
Evidence gate	Model-invoked	High-risk pre-generation check.
Product truth	Model-invoked or router-called	Most commerce tasks need it.
Router	User-invoked	The user remembers one entry point.
Storyboard / keyframe	Router-called	Not every task needs video planning.
Generation prompt	User-invoked or explicit route	Avoid accidental generation before evidence is ready.
QA and acceptance	Model-invoked before publication	Risk control should be hard to skip.
Failure memory	Automatic at the end	Each completion or failure should produce learning data.
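The context-load trade-off can be made concrete with a small registry. This sketch assumes a hypothetical `SkillEntry` type and a simplified single trigger value per skill. Only model-invoked skills contribute their descriptions to the agent's standing context.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SkillEntry:
    name: str
    trigger: str  # "model", "user", "router" or "auto" (simplified)
    description: str


def context_descriptions(registry: List[SkillEntry]) -> List[str]:
    """Descriptions that must sit in the agent's environment at all times."""
    return [s.description for s in registry if s.trigger == "model"]


REGISTRY = [
    SkillEntry("evidence-gate", "model", "Check evidence before any generation."),
    SkillEntry("router", "user", "Single entry point for commerce tasks."),
    SkillEntry("storyboard-keyframe", "router", "Plan scenes and keyframes."),
    SkillEntry("multimodal-qa", "model", "Inspect candidates before publication."),
    SkillEntry("failure-memory", "auto", "Record next-time rules at the end."),
]
```

Counting the characters or tokens in `context_descriptions(REGISTRY)` gives a rough measure of what each additional model-invoked skill costs.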

6. Product Truth Card and clean skill routing

The clean route is not a long prompt that mixes facts, style and task intent together. It is a data boundary: preserve the user's original request, extract a narrow retrieval intent, freeze product facts, pass style into generation only, then verify the candidate output with multimodal QA.

raw_user_text
  -> retrieval_intent
  -> product_truth_card
  -> generation_payload
  -> multimodal_verification

Layer	Used For	Must Not Do
`raw_user_text`	Evidence, replay and debugging. Keep the original request intact.	Do not feed it directly into skill retrieval.
`retrieval_intent`	Skill selection only: task type, platform, input asset, target action and product-truth status.	Do not include style words such as cinematic, premium, dark, streetwear or low-AI-feel.
`product_truth_card`	Visible facts, unknowns, claim boundaries, forbidden mutations and identity references.	Do not write marketing copy or invent missing selling points.
`generation_payload`	Scene, camera, lighting, subtitle, voiceover, CTA and style instructions.	Do not override the Product Truth Card.
`multimodal_verification`	Accept, repair or reject based on product identity, action logic, realism and commercial readability.	Do not score beauty alone.
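One way to picture the boundary between `retrieval_intent` and `generation_payload` is a splitter that moves style words out of the retrieval path. This is a deliberately naive sketch: the word list comes from the table above, and whitespace token matching is an assumption that a real router would replace with structured extraction.

```python
STYLE_WORDS = {"cinematic", "premium", "dark", "streetwear", "low-ai-feel"}


def split_request(raw_user_text: str) -> dict:
    """Keep the raw text intact and separate style from retrieval terms."""
    tokens = raw_user_text.lower().replace(",", " ").split()
    style = [t for t in tokens if t in STYLE_WORDS]
    intent_terms = [t for t in tokens if t not in STYLE_WORDS]
    return {
        "raw_user_text": raw_user_text,        # evidence and replay only
        "retrieval_intent": intent_terms,      # skill selection only
        "generation_payload": {"style": style},
    }
```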

The hard gate is simple: if a Product Truth Card does not exist, the system should not enter storyboard, script or video-prompt generation. "Do not change the product" is too weak as a prompt sentence; the product facts need to be a locked state that later skills cannot override.
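A locked state can be approximated in Python with a frozen dataclass plus a guard that downstream skills call before doing any work. The field names are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProductTruthCard:
    # Illustrative fields; tuples keep the contents immutable as well.
    name: str
    visible_facts: Tuple[str, ...]
    unknowns: Tuple[str, ...] = ()
    forbidden_mutations: Tuple[str, ...] = ()


def require_card(card: Optional[ProductTruthCard]) -> ProductTruthCard:
    """Hard gate: refuse to continue when no card exists."""
    if card is None:
        raise RuntimeError(
            "Product Truth Card missing: storyboard, script and "
            "video-prompt generation stay closed"
        )
    return card
```

Any later skill that tries to assign to a field of the card raises `FrozenInstanceError`, which turns "do not change the product" from a prompt sentence into an enforced constraint.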

Generation priority: Product geometry first, then color, label, logo, packaging, visible accessories, known claims, platform format, scene logic and finally creative style.

Model route: Use image-to-video or reference-to-video when product appearance matters. Pure text-to-video is better for concept films than SKU-level product fidelity.

TikTok branch: Discovery commerce: hook attention first, then prove the product action clearly enough for a viewer who was not searching for it.

Amazon branch: Intent commerce: the buyer is already comparing. Prioritize clarity, trust, feature proof, claim safety and doubt reduction.
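The generation priority and the model route can both be expressed as small lookups. In this sketch the identifier names are invented, and the `needs-reference` outcome is an assumption about what to do when appearance matters but no reference asset exists.

```python
from typing import List

# Order taken from the generation priority: earlier entries win conflicts.
PRIORITY: List[str] = [
    "geometry", "color", "label", "logo", "packaging", "visible_accessories",
    "known_claims", "platform_format", "scene_logic", "creative_style",
]


def winner(conflicting: List[str]) -> str:
    """Return the constraint that takes precedence when several collide."""
    return min(conflicting, key=PRIORITY.index)


def model_route(appearance_matters: bool, has_reference_image: bool) -> str:
    if not appearance_matters:
        return "text-to-video"    # concept films
    if has_reference_image:
        return "image-to-video"   # SKU-level fidelity
    return "needs-reference"      # assumed: ask for evidence instead of guessing
```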

7. Information design: keep SKILL.md small

The main skill file should be a control surface, not a knowledge dump. Put the trigger, core steps and completion criteria in SKILL.md. Move branch-specific material into references, templates and examples that the agent opens only when needed.

multimodal-commerce-skills/
  router/SKILL.md
  product-truth/SKILL.md
  product-truth/templates/product_truth_card.json
  evidence-gate/SKILL.md
  evidence-gate/references/evidence_requirements.md
  storyboard-keyframe/SKILL.md
  storyboard-keyframe/templates/storyboard_table.md
  generation-prompt/SKILL.md
  generation-prompt/templates/prompt_contract.json
  multimodal-qa/SKILL.md
  multimodal-qa/references/failure_labels.md
  publish-acceptance/SKILL.md
  failure-memory/SKILL.md

This is progressive disclosure for agents. Most tasks should load only the steps. Platform rules, failure taxonomies, templates and examples should live behind context pointers.
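Progressive disclosure can be sketched as a loader that reads only `SKILL.md` by default and opens a pointer when a branch asks for it. The `SkillLoader` class is hypothetical. It assumes the directory layout shown above.

```python
from pathlib import Path
from typing import List


class SkillLoader:
    def __init__(self, root) -> None:
        self.root = Path(root)
        self.opened: List[str] = []  # audit trail of what entered context

    def load_steps(self, skill: str) -> str:
        """Default path: load only the control surface."""
        return self._read(Path(skill) / "SKILL.md")

    def open_pointer(self, skill: str, relative_path: str) -> str:
        """Branch path: load a reference, template or example on demand."""
        return self._read(Path(skill) / relative_path)

    def _read(self, rel: Path) -> str:
        self.opened.append(rel.as_posix())
        return (self.root / rel).read_text(encoding="utf-8")
```

The `opened` list doubles as a measurement: if most runs open every reference, the split between the main file and its pointers is not doing its job.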

8. Leading words for multimodal agents

A leading word compresses a behavior into a short reusable phrase. The phrase should appear in the skill, the templates, the QA labels and the agent's own operational language. This makes the desired behavior easier to repeat.

Leading Word	Meaning	Use
`product truth`	The product facts cannot drift.	All commerce generation.
`evidence-first`	Inspect source evidence before generation.	Listings, reviews, RAG and QA.
`visual anchor`	Shape, logo, color, material and usage action must be preserved.	Image and video fidelity.
`keyframe-first`	Plan inspectable frames before video generation.	Complex actions and product demos.
`commercial readability`	The buyer can understand the value without internal context.	Landing pages and video QA.
`next-time rule`	A failure becomes a future routing rule.	Failure memory.

9. Pruning: remove sediment and no-ops

Multimodal skills often grow by sediment. Every failed generation adds another warning. Every platform adds another exception. Every bad output adds another style phrase. Eventually the skill becomes long, duplicated and stale.

Use deletion tests. If removing a paragraph does not change the agent's behavior, it is probably a no-op. If a rule applies only to one branch, move it to that branch. If a concept appears in multiple files, choose one source of truth. If a style request is repeated as prose, turn it into a structured field.

Each rule must change behavior.
Each concept should have one source of truth.
Branch-specific references should live outside the main skill file.
Style phrases should become parameters or leading words.
Failure notes should first go into failure memory, then be promoted only if they repeatedly matter.
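The deletion test itself needs agent runs, but the "one source of truth" rule can be checked mechanically. The following sketch, under the assumption that each rule sits on its own line, reports rules that appear in more than one file. The normalization is intentionally crude.

```python
from collections import defaultdict
from typing import Dict, List


def duplicated_rules(files: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each normalized rule to the files that repeat it."""
    seen = defaultdict(set)
    for name, text in files.items():
        for line in text.splitlines():
            # Lowercase, collapse whitespace, drop bullets and final periods.
            rule = " ".join(line.lower().split()).strip("-* .")
            if rule:
                seen[rule].add(name)
    return {rule: sorted(names) for rule, names in seen.items() if len(names) > 1}
```

Each reported rule is a candidate for consolidation: keep it in one file and replace the other copies with a pointer.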

10. Evaluation framework

A multi-skill system needs evaluation at four layers: invocation, process, output and memory.

Invocation: Did the right skill fire at the right time? Did model-invoked skills increase context load too much? Can the user remember the entry points?

Process: Did the agent complete every step, do enough inspection and choose the correct branch before moving forward?

Output: Did the asset preserve product truth, fit the platform, align with evidence and reduce repair time?

Memory: Are repeated failures decreasing? Are next-time rules retrieved and used in future routing decisions?
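A rough sketch of how the four layers could be scored from run logs. The `RunLog` fields and the specific metrics are assumptions chosen to give one number per layer. They are not a benchmark defined in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunLog:
    expected_skill: str
    fired_skill: str
    steps_done: int
    steps_total: int
    qa_verdict: str                      # "accept", "repair" or "reject"
    failure_label: Optional[str] = None


def evaluate(runs: List[RunLog]) -> Dict[str, float]:
    if not runs:
        raise ValueError("no runs to evaluate")
    n = len(runs)
    labels = [r.failure_label for r in runs if r.failure_label]
    return {
        # Invocation: did the right skill fire?
        "invocation_accuracy": sum(r.expected_skill == r.fired_skill for r in runs) / n,
        # Process: were all steps completed?
        "process_completion": sum(r.steps_done == r.steps_total for r in runs) / n,
        # Output: how often was the asset accepted without repair?
        "output_accept_rate": sum(r.qa_verdict == "accept" for r in runs) / n,
        # Memory: how many failures repeated an already seen label?
        "repeated_failures": len(labels) - len(set(labels)),
    }
```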

11. Minimal router template

The router should be small. It should preserve raw user text, extract retrieval intent and route the work without absorbing every downstream instruction.

# Multimodal Commerce Router

Use this when a user gives product links, product images, videos,
ecommerce pages, or asks for multimodal generation, QA, repair or publication.

1. Preserve raw_user_text.
2. Extract retrieval_intent:
   modality, product category, platform, task stage, risk.
3. Route:
   - facts missing -> product-truth
   - evidence unclear -> evidence-gate
   - video plan needed -> storyboard-keyframe
   - prompt needed after approval -> generation-prompt
   - asset exists -> multimodal-qa
   - final upload/check needed -> publish-acceptance
   - failure observed -> failure-memory
4. Keep style requests in generation_payload, not retrieval_intent.
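Step 3 of the template can be mirrored in code as an ordered rule table. The flag names are hypothetical, and treating the list as "first match wins" is an assumption, since the template does not state how overlapping conditions are resolved.

```python
from typing import Dict, List, Tuple

# Same order as the routing list in the template above.
ROUTES: List[Tuple[str, str]] = [
    ("facts_missing", "product-truth"),
    ("evidence_unclear", "evidence-gate"),
    ("video_plan_needed", "storyboard-keyframe"),
    ("prompt_needed_after_approval", "generation-prompt"),
    ("asset_exists", "multimodal-qa"),
    ("final_check_needed", "publish-acceptance"),
    ("failure_observed", "failure-memory"),
]


def route(flags: Dict[str, bool]) -> str:
    """Return the next skill; the router itself writes no prompts."""
    for flag, skill in ROUTES:
        if flags.get(flag):
            return skill
    raise ValueError("no route matched; ask the user to clarify the task")
```

Because facts and evidence come first in the table, a request that already has an asset but lacks a Product Truth Card is still sent to `product-truth` before QA.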

12. Relationship to Product-Truth Feedback Training

Product-Truth Feedback Training (PTFT) defines what data should be captured around a generation: product truth, route choice, QA label, failure evidence, repair action and preference signal. The multi-skill system is the operating layer that produces that data reliably.

PTFT Module	Skill-Layer Implementation
Product Truth Card	Product Truth Skill
Failure Memory Retriever	Failure Memory Skill
Model Router	Router Skill
Video QA Scorer	Multimodal QA Skill
Preference Ranking	Candidate comparison and acceptance skill
Repair Policy	Repair action from QA and failure memory

In short, PTFT says what should be learned. The skill system says how an agent reliably produces the evidence needed to learn it. Product Truth Card routing adds the missing runtime boundary: facts are frozen before creative generation starts.

13. Conclusion

The next step for multimodal agents is not a longer prompt and not a single massive manual. It is a set of small, composable, auditable skills. Product truth, evidence gates, routing, keyframes, generation prompts, QA, acceptance and failure memory should each be explicit. A mature multimodal system does not simply generate polished assets. It leaves clearer judgments, fewer repeated errors and more reusable production intelligence after every run.


Source Materials

Part of the product-truth research system.