Abstract
Multimodal agents are moving from "can call tools" toward "can execute complex workflows reliably." In production, the bottleneck is often not model capability but skill design: when a skill should trigger, how much context it should load, which steps it should follow, how it should choose branches, how it should avoid stale instructions, and how success should be evaluated. Inspired by Matt Pocock's framework for writing great agent skills, this paper proposes a multimodal multi-skill architecture for commerce workflows. Instead of a single large skill, a production system should compose small, inspectable skills for product truth, evidence gating, routing, keyframes, generation prompts, video QA, publication acceptance and failure memory. The goal is not more instructions. The goal is a more predictable process.
1. Thesis: skills are instruments for predictability
A skill is not valuable because it makes an agent sound smarter. A skill is valuable when it turns a repeated task into a more predictable process. A brainstorming skill may produce different ideas each time, but it should reliably understand the goal, generate candidates, filter them and suggest next actions. In the same way, a multimodal production skill should reliably protect product facts, inspect evidence, choose routes and reject weak outputs.
This distinction matters because multimodal output can be visually convincing while commercially wrong. A video can look polished while the package drifts. A product image can look premium while the logo, material or scale is wrong. A report can read fluently while it never checks the screenshots. Skill design is the layer that makes the agent slow down, inspect the right evidence and leave an auditable trail.
The first principle of multimodal skill systems is predictability: every skill should stabilize one process, not promise one magical output.
2. Why multimodal work needs many skills
A commerce video task is not one task. It is a chain of smaller decisions across text, image, video, audio, product pages, platform rules and business goals. If everything is placed inside one giant skill, the agent sees too many future goals at once and often rushes to the final artifact.
| Stage | Input | Output | Main Risk |
|---|---|---|---|
| Product truth | Links, photos, titles, specs, reviews | Product Truth Card | Facts get polluted by style words |
| Evidence gate | Images, screenshots, listings, historic assets | Ready / needs evidence / blocked | The system generates before evidence is sufficient |
| Intent routing | Raw user text, target platform, available assets | Task type, platform, route | TikTok, Amazon and site-hero standards get mixed |
| Keyframes | Product Truth Card, platform goal | Storyboard and keyframe prompts | The scene looks good but proves nothing |
| Multimodal QA | Candidate image or video | Accept, repair or reject | The review rewards aesthetics instead of truth |
| Publication and memory | Final asset, channel page, QA result | Acceptance record and next-time rule | Successful and failed decisions are not reusable |
3. A taxonomy of multimodal skills
Product Truth Skill
Extract product name, category, material, shape, logo, packaging, visual anchors, supported claims and unknowns. It should not write creative copy.
Evidence Gate Skill
Decide whether the task is ready, needs more evidence or must be blocked before generation. This is the safety valve before expensive output.
Router Skill
Convert raw user text into structured intent: modality, product category, platform, task stage and risk. Style requests belong in the generation payload, not in the retrieval intent.
Storyboard and Keyframe Skill
Turn product truth into scenes, actions, product anchors, on-screen text, voiceover and QA risks. Video work should usually be keyframe-first: plan inspectable frames before generating motion.
Generation Prompt Skill
Use approved facts, references and scene plans to build model-specific prompt contracts. It should execute, not invent missing facts.
Multimodal QA Skill
Inspect product shape, material, hand-object contact, action logic, temporal fidelity, audio-visual alignment and commercial readability.
Publication Acceptance Skill
Verify file existence, format, visible page, platform fit, product truth and acceptance evidence. "Generated" is not the same as "usable."
Failure Memory Skill
Store failure labels, evidence, repair actions and next-time rules so future routing can avoid repeated mistakes.
4. Multi-skill architecture
A production-grade multimodal agent should not begin with a universal skill. It should begin with a small router and a set of specialized skills. The router preserves the user's original request, extracts structured intent and points to the next skill.
raw_user_text
-> Router Skill
-> Product Truth Skill
-> Evidence Gate Skill
-> Platform / Task Route
-> Storyboard or Keyframe Skill
-> Generation Prompt Skill
-> Multimodal QA Skill
-> Repair or Publish
-> Failure Memory Skill
The boundaries matter. The router should not write prompts. The product-truth skill should not do style expansion. The evidence gate can stop generation. The QA skill must label failures and propose repair actions. The memory skill must produce next-time rules, not vague retrospectives.
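The chain above can be sketched as a small state machine. This is a minimal illustration, not the paper's implementation: the names `RunState`, `GateStatus` and `run_pipeline` are hypothetical, the gate status and QA verdict are passed in as arguments so the sketch stays self-contained, and the handling of a "reject" verdict (neither repair nor publish) is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class GateStatus(Enum):
    READY = "ready"
    NEEDS_EVIDENCE = "needs_evidence"
    BLOCKED = "blocked"


@dataclass
class RunState:
    raw_user_text: str  # preserved untouched for replay and debugging
    trace: List[str] = field(default_factory=list)  # ordered audit trail of skills that ran
    qa_verdict: Optional[str] = None  # "accept", "repair" or "reject"


def run_pipeline(state: RunState, gate: GateStatus, qa_verdict: str) -> RunState:
    """Walk the skill chain in order and record each skill that ran."""
    state.trace += ["router", "product-truth", "evidence-gate"]
    if gate is not GateStatus.READY:
        # The evidence gate can stop generation: nothing downstream runs.
        state.trace.append("failure-memory")
        return state

    state.trace += [
        "platform-task-route",
        "storyboard-keyframe",
        "generation-prompt",
        "multimodal-qa",
    ]
    state.qa_verdict = qa_verdict
    if qa_verdict == "accept":
        state.trace.append("publish-acceptance")
    elif qa_verdict == "repair":
        state.trace.append("repair")
    # Assumption: a rejected candidate is neither repaired nor published.
    state.trace.append("failure-memory")
    return state
```

In this sketch a blocked gate produces the trace `router, product-truth, evidence-gate, failure-memory`, so the audit trail itself shows that no generation skill ran.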
5. Invocation design: model invoked or user invoked
Model-invoked skills reduce user effort but increase context load because their descriptions sit in the agent's environment. User-invoked skills keep the agent leaner but increase cognitive load on the operator. A multimodal system should choose based on risk and frequency.
| Skill Type | Recommended Trigger | Reason |
|---|---|---|
| Evidence gate | Model-invoked | High-risk pre-generation check. |
| Product truth | Model-invoked or router-called | Most commerce tasks need it. |
| Router | User-invoked | The user remembers one entry point. |
| Storyboard / keyframe | Router-called | Not every task needs video planning. |
| Generation prompt | User-invoked or explicit route | Avoid accidental generation before evidence is ready. |
| QA and acceptance | Model-invoked before publication | Risk control should be hard to skip. |
| Failure memory | Automatic at the end | Each completion or failure should produce learning data. |
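One way to keep the trigger policy inspectable is to store it as data. The sketch below encodes the table above as a mapping and lists which skills carry a standing context cost because the model can invoke them. The `Trigger` enum, the `TRIGGERS` mapping and `always_in_context` are hypothetical names, and treating "explicit route" as router-called is an interpretation of the table, not something the table states.

```python
from enum import Enum
from typing import Dict, List, Set


class Trigger(Enum):
    MODEL = "model-invoked"
    USER = "user-invoked"
    ROUTER = "router-called"
    AUTO = "automatic"


# Mirrors the invocation table; skill names follow the directory-style slugs.
TRIGGERS: Dict[str, Set[Trigger]] = {
    "evidence-gate": {Trigger.MODEL},
    "product-truth": {Trigger.MODEL, Trigger.ROUTER},
    "router": {Trigger.USER},
    "storyboard-keyframe": {Trigger.ROUTER},
    "generation-prompt": {Trigger.USER, Trigger.ROUTER},  # assumption: "explicit route" = router-called
    "multimodal-qa": {Trigger.MODEL},
    "publish-acceptance": {Trigger.MODEL},
    "failure-memory": {Trigger.AUTO},
}


def always_in_context(triggers: Dict[str, Set[Trigger]]) -> List[str]:
    """Skills whose descriptions must sit in the agent's environment."""
    return sorted(name for name, kinds in triggers.items() if Trigger.MODEL in kinds)
```

Under these assumptions only four skill descriptions are always loaded; the rest are reached through the router or an explicit user call.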
6. Product Truth Card and clean skill routing
The clean route is not a long prompt that mixes facts, style and task intent together. It is a data boundary: preserve the user's original request, extract a narrow retrieval intent, freeze product facts, pass style into generation only, then verify the candidate output with multimodal QA.
raw_user_text
-> retrieval_intent
-> product_truth_card
-> generation_payload
-> multimodal_verification
| Layer | Used For | Must Not Do |
|---|---|---|
| raw_user_text | Evidence, replay and debugging. Keep the original request intact. | Do not feed it directly into skill retrieval. |
| retrieval_intent | Skill selection only: task type, platform, input asset, target action and product-truth status. | Do not include style words such as cinematic, premium, dark, streetwear or low-AI-feel. |
| product_truth_card | Visible facts, unknowns, claim boundaries, forbidden mutations and identity references. | Do not write marketing copy or invent missing selling points. |
| generation_payload | Scene, camera, lighting, subtitle, voiceover, CTA and style instructions. | Do not override the Product Truth Card. |
| multimodal_verification | Accept, repair or reject based on product identity, action logic, realism and commercial readability. | Do not score beauty alone. |
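The split between `retrieval_intent` and `generation_payload` can be illustrated with a deliberately naive filter. This is a sketch under assumptions: `split_request` and `STYLE_WORDS` are hypothetical names, the style list contains only the example words from the table above, and a real router would extract structured fields rather than match tokens.

```python
import string
from typing import Dict, List, Union

# Example style words taken from the layer table; a real list would be longer.
STYLE_WORDS = frozenset({"cinematic", "premium", "dark", "streetwear", "low-ai-feel"})


def split_request(raw_user_text: str) -> Dict[str, Union[str, List[str]]]:
    """Keep the raw text intact and separate style terms from retrieval terms."""
    tokens = [t.strip(string.punctuation).lower() for t in raw_user_text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    return {
        "raw_user_text": raw_user_text,  # never rewritten
        "retrieval_terms": [t for t in tokens if t not in STYLE_WORDS],
        "style_terms": [t for t in tokens if t in STYLE_WORDS],  # generation payload only
    }
```

The point of the sketch is the boundary, not the tokenizer: style terms are carried forward for generation but never reach skill selection.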
The hard gate is simple: if a Product Truth Card does not exist, the system should not enter storyboard, script or video-prompt generation. "Do not change the product" is too weak as a prompt sentence; the product facts need to be a locked state that later skills cannot override.
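A locked state can be approximated in code with an immutable record plus a guard at the generation boundary. The sketch below is illustrative only: the field names on `ProductTruthCard`, the `MissingTruthCard` exception and `build_generation_payload` are assumptions, not a schema defined by this paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProductTruthCard:
    # Illustrative fields, loosely following the product_truth_card layer.
    product_name: str
    visible_facts: Tuple[str, ...]
    unknowns: Tuple[str, ...] = ()
    forbidden_mutations: Tuple[str, ...] = ()


class MissingTruthCard(RuntimeError):
    """Raised when generation is attempted without a Product Truth Card."""


def build_generation_payload(card: Optional[ProductTruthCard], style: dict) -> dict:
    if card is None:
        # Hard gate: no card, no storyboard, script or video-prompt generation.
        raise MissingTruthCard("Product Truth Card is required before generation")
    # Style travels next to the card; it cannot rewrite the card's fields
    # because the dataclass is frozen.
    return {"product_truth_card": card, "style": dict(style)}
```

Because the dataclass is frozen, an attempt such as `card.product_name = "other"` raises `dataclasses.FrozenInstanceError`, which is the code-level analogue of facts that later skills cannot override.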
7. Information design: keep SKILL.md small
The main skill file should be a control surface, not a knowledge dump. Put the trigger, core steps and completion criteria in SKILL.md. Move branch-specific material into references, templates and examples that the agent opens only when needed.
multimodal-commerce-skills/
router/SKILL.md
product-truth/SKILL.md
product-truth/templates/product_truth_card.json
evidence-gate/SKILL.md
evidence-gate/references/evidence_requirements.md
storyboard-keyframe/SKILL.md
storyboard-keyframe/templates/storyboard_table.md
generation-prompt/SKILL.md
generation-prompt/templates/prompt_contract.json
multimodal-qa/SKILL.md
multimodal-qa/references/failure_labels.md
publish-acceptance/SKILL.md
failure-memory/SKILL.md
This is progressive disclosure for agents. Most tasks should load only the steps. Platform rules, failure taxonomies, templates and examples should live behind context pointers.
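Progressive disclosure can be made measurable by logging which files the agent actually opens. The following sketch assumes the directory layout listed above; `SkillLoader` and its methods are hypothetical names, not part of any agent framework.

```python
from pathlib import Path
from typing import List


class SkillLoader:
    """Loads SKILL.md first and opens references only on explicit request."""

    def __init__(self, root: Path):
        self.root = root
        self.loaded: List[str] = []  # relative paths read so far, for context audits

    def _read(self, rel: str) -> str:
        self.loaded.append(rel)
        return (self.root / rel).read_text(encoding="utf-8")

    def load_skill(self, name: str) -> str:
        # The control surface: trigger, core steps and completion criteria.
        return self._read(f"{name}/SKILL.md")

    def open_reference(self, name: str, rel: str) -> str:
        # Branch-specific material, e.g. "references/failure_labels.md".
        return self._read(f"{name}/{rel}")
```

Inspecting `loader.loaded` after a run shows whether a task stayed on the main steps or pulled in references, which is one simple signal for finding files that are never opened.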
8. Leading words for multimodal agents
A leading word compresses a behavior into a short reusable phrase. The phrase should appear in the skill, the templates, the QA labels and the agent's own operational language. This makes the desired behavior easier to repeat.
| Leading Word | Meaning | Use |
|---|---|---|
| product truth | The product facts cannot drift. | All commerce generation. |
| evidence-first | Inspect source evidence before generation. | Listings, reviews, RAG and QA. |
| visual anchor | Shape, logo, color, material and usage action must be preserved. | Image and video fidelity. |
| keyframe-first | Plan inspectable frames before video generation. | Complex actions and product demos. |
| commercial readability | The buyer can understand the value without internal context. | Landing pages and video QA. |
| next-time rule | A failure becomes a future routing rule. | Failure memory. |
9. Pruning: remove sediment and no-ops
Multimodal skills often grow by sediment. Every failed generation adds another warning. Every platform adds another exception. Every bad output adds another style phrase. Eventually the skill becomes long, duplicated and stale.
Use deletion tests. If removing a paragraph does not change the agent's behavior, it is probably a no-op. If a rule applies only to one branch, move it to that branch. If a concept appears in multiple files, choose one source of truth. If a style request is repeated as prose, turn it into a structured field.
- Each rule must change behavior.
- Each concept should have one source of truth.
- Branch-specific references should live outside the main skill file.
- Style phrases should become parameters or leading words.
- Failure notes should first go into failure memory, then be promoted only if they repeatedly matter.
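The last rule in the list above, promoting a failure note only when it keeps recurring, can be sketched with a counter. The class name `FailureMemory`, the `promote_after` parameter and its default of 3 are assumptions chosen for illustration; the paper does not specify a threshold.

```python
from collections import Counter
from typing import Dict


class FailureMemory:
    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after  # assumed threshold, tune per team
        self.counts: Counter = Counter()
        self.rules: Dict[str, str] = {}

    def record(self, label: str, next_time_rule: str) -> bool:
        """Store a failure; return True when the label just reached the threshold."""
        self.counts[label] += 1
        self.rules[label] = next_time_rule  # latest wording wins
        return self.counts[label] == self.promote_after

    def promoted(self) -> Dict[str, str]:
        """Rules that have recurred often enough to move into a skill file."""
        return {
            label: self.rules[label]
            for label, n in self.counts.items()
            if n >= self.promote_after
        }
```

A one-off failure stays in memory and costs nothing in the skill file; only a repeated label becomes a candidate for a permanent rule.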
10. Evaluation framework
A multi-skill system needs evaluation at four layers: invocation, process, output and memory.
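The text names the four layers but does not define their checks, so the following is only one possible reading. In this sketch each layer reports a pass or fail, and a run counts as evaluated only when all four layers have reported. `LayerResult` and `summarize` are hypothetical names, and the question attached to each layer in the comments is an interpretation, not a definition from this paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed reading of each layer:
#   invocation: did the right skill trigger?
#   process:    were the skill's steps followed?
#   output:     did the candidate pass multimodal QA?
#   memory:     was a next-time rule recorded?
LAYERS = ("invocation", "process", "output", "memory")


@dataclass
class LayerResult:
    layer: str
    passed: bool
    note: str = ""


def summarize(results: List[LayerResult]) -> Dict[str, object]:
    seen = {r.layer: r for r in results}
    missing = [name for name in LAYERS if name not in seen]
    failed = [name for name in LAYERS if name in seen and not seen[name].passed]
    return {"missing": missing, "failed": failed, "ok": not missing and not failed}
```

Treating a missing layer as a failure of the evaluation itself keeps a polished output from being scored without any check on how it was produced.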
11. Minimal router template
The router should be small. It should preserve raw user text, extract retrieval intent and route the work without absorbing every downstream instruction.
# Multimodal Commerce Router
Use this when a user gives product links, product images, videos,
ecommerce pages, or asks for multimodal generation, QA, repair or publication.
1. Preserve raw_user_text.
2. Extract retrieval_intent: modality, product category, platform, task stage, risk.
3. Route:
   - facts missing -> product-truth
   - evidence unclear -> evidence-gate
   - video plan needed -> storyboard-keyframe
   - prompt needed after approval -> generation-prompt
   - asset exists -> multimodal-qa
   - final upload/check needed -> publish-acceptance
   - failure observed -> failure-memory
4. Keep style requests in generation_payload, not retrieval_intent.
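Step 3 of the template can be expressed as an ordered rule table. This is a minimal sketch: the signal names are hypothetical, and evaluating the rules top to bottom with first match winning is an assumption, since the template lists the routes without stating a priority.

```python
from typing import Mapping, Optional, Tuple

# (signal, target skill) pairs copied from step 3 of the router template.
# Order-as-priority is an assumption of this sketch.
ROUTE_RULES: Tuple[Tuple[str, str], ...] = (
    ("facts_missing", "product-truth"),
    ("evidence_unclear", "evidence-gate"),
    ("video_plan_needed", "storyboard-keyframe"),
    ("prompt_needed_after_approval", "generation-prompt"),
    ("asset_exists", "multimodal-qa"),
    ("final_check_needed", "publish-acceptance"),
    ("failure_observed", "failure-memory"),
)


def route(signals: Mapping[str, bool]) -> Optional[str]:
    """Return the first skill whose signal is set, or None if nothing matches."""
    for signal, skill in ROUTE_RULES:
        if signals.get(signal, False):
            return skill
    return None
```

For example, `route({"facts_missing": True, "asset_exists": True})` returns `"product-truth"` under this ordering, which keeps fact extraction ahead of QA when both apply.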
12. Relationship to Product-Truth Feedback Training
Product-Truth Feedback Training (PTFT) defines what data should be captured around a generation: product truth, route choice, QA label, failure evidence, repair action and preference signal. The multi-skill system is the operating layer that produces that data reliably.
| PTFT Module | Skill-Layer Implementation |
|---|---|
| Product Truth Card | Product Truth Skill |
| Failure Memory Retriever | Failure Memory Skill |
| Model Router | Router Skill |
| Video QA Scorer | Multimodal QA Skill |
| Preference Ranking | Candidate comparison and acceptance skill |
| Repair Policy | Repair action from QA and failure memory |
In short, PTFT says what should be learned. The skill system says how an agent reliably produces the evidence needed to learn it. Product Truth Card routing adds the missing runtime boundary: facts are frozen before creative generation starts.
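As an illustration of the data a run could emit for PTFT, the sketch below collects the six items named at the start of this section into one serializable record. The class `PTFTRecord`, its field types and the JSON Lines output are assumptions; the paper does not define a storage format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PTFTRecord:
    product_truth: dict                      # frozen facts used for this run
    route_choice: str                        # skill or model route that was taken
    qa_label: str                            # e.g. "accept", "repair", "reject"
    failure_evidence: Optional[str] = None   # pointer to frame, screenshot or note
    repair_action: Optional[str] = None
    preference_signal: Optional[str] = None  # e.g. which candidate was preferred


def to_jsonl_line(record: PTFTRecord) -> str:
    """Serialize one run as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one line per run keeps successful and failed runs in the same log, so both kinds of decision stay reusable.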
13. Conclusion
The next step for multimodal agents is not a longer prompt and not a single massive manual. It is a set of small, composable, auditable skills. Product truth, evidence gates, routing, keyframes, generation prompts, QA, acceptance and failure memory should each be explicit. A mature multimodal system does not simply generate polished assets. It leaves clearer judgments, fewer repeated errors and more reusable production intelligence after every run.