
Multimodal Agent Skills: From Predictability to Production Workflows

An English guide that adapts Matt Pocock's "Building Great Agent Skills: The Missing Manual" into a practical multi-skill architecture for product-truth multimodal commerce work.

Figure: Visual anchor example. A real robot vacuum product-video QA case from the Product-Truth Feedback Training (PTFT) workflow: the skill system must preserve product shape, action logic and commercial readability before an asset becomes usable.

Author: Jia Jingqiu - English web edition: 2026-07-01

Abstract


Multimodal agents are moving from "can call tools" toward "can execute complex workflows reliably." In production, the bottleneck is often not model capability but skill design: when a skill should trigger, how much context it should load, which steps it should follow, how it should choose branches, how it should avoid stale instructions, and how success should be evaluated. Inspired by Matt Pocock's framework for writing great agent skills, this guide proposes a multimodal multi-skill architecture for commerce workflows. Instead of a single large skill, a production system should compose small, inspectable skills for product truth, evidence gating, routing, keyframes, generation prompts, video QA, publication acceptance and failure memory. The goal is not more instructions. The goal is a more predictable process.

1. Thesis: skills are instruments for predictability

A skill is not valuable because it makes an agent sound smarter. A skill is valuable when it turns a repeated task into a more predictable process. A brainstorming skill may produce different ideas each time, but it should reliably understand the goal, generate candidates, filter them and suggest next actions. In the same way, a multimodal production skill should reliably protect product facts, inspect evidence, choose routes and reject weak outputs.

This distinction matters because multimodal output can be visually convincing while commercially wrong. A video can look polished while the package drifts. A product image can look premium while the logo, material or scale is wrong. A report can read fluently while it never checks the screenshots. Skill design is the layer that makes the agent slow down, inspect the right evidence and leave an auditable trail.

The first principle of multimodal skill systems is predictability: every skill should stabilize one process, not promise one magical output.

2. Why multimodal work needs many skills

A commerce video task is not one task. It is a chain of smaller decisions across text, image, video, audio, product pages, platform rules and business goals. If everything is placed inside one giant skill, the agent sees too many future goals at once and often rushes to the final artifact.

| Stage | Input | Output | Main Risk |
| --- | --- | --- | --- |
| Product truth | Links, photos, titles, specs, reviews | Product Truth Card | Facts get polluted by style words |
| Evidence gate | Images, screenshots, listings, historic assets | Ready / needs evidence / blocked | The system generates before evidence is sufficient |
| Intent routing | Raw user text, target platform, available assets | Task type, platform, route | TikTok, Amazon and site-hero standards get mixed |
| Keyframes | Product Truth Card, platform goal | Storyboard and keyframe prompts | The scene looks good but proves nothing |
| Multimodal QA | Candidate image or video | Accept, repair or reject | The review rewards aesthetics instead of truth |
| Publication and memory | Final asset, channel page, QA result | Acceptance record and next-time rule | Successful and failed decisions are not reusable |

3. A taxonomy of multimodal skills

01. Product Truth Skill
Extract product name, category, material, shape, logo, packaging, visual anchors, supported claims and unknowns. It should not write creative copy.

02. Evidence Gate Skill
Decide whether the task is ready, needs more evidence or must be blocked before generation. This is the safety valve before expensive output.

03. Router Skill
Convert raw user text into structured intent: modality, product category, platform, task stage and risk. Style requests belong in generation payload, not retrieval intent.

04. Storyboard and Keyframe Skill
Turn product truth into scenes, actions, product anchors, on-screen text, voiceover and QA risks. Video should often be keyframe-first.

05. Generation Prompt Skill
Use approved facts, references and scene plans to build model-specific prompt contracts. It should execute, not invent missing facts.

06. Multimodal QA Skill
Inspect product shape, material, hand-object contact, action logic, temporal fidelity, audio-visual alignment and commercial readability.

07. Publication Acceptance Skill
Verify file existence, format, visible page, platform fit, product truth and acceptance evidence. "Generated" is not the same as "usable."

08. Failure Memory Skill
Store failure labels, evidence, repair actions and next-time rules so future routing can avoid repeated mistakes.

4. Multi-skill architecture

A production-grade multimodal agent should not begin with a universal skill. It should begin with a small router and a set of specialized skills. The router preserves the user's original request, extracts structured intent and points to the next skill.

raw_user_text
  -> Router Skill
  -> Product Truth Skill
  -> Evidence Gate Skill
  -> Platform / Task Route
  -> Storyboard or Keyframe Skill
  -> Generation Prompt Skill
  -> Multimodal QA Skill
  -> Repair or Publish
  -> Failure Memory Skill

The boundary matters. The router should not write prompts. The product-truth skill should not do style expansion. The evidence gate can stop generation. The QA skill must label failures and propose repair actions. The memory skill must produce next-time rules, not vague retrospectives.
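
The boundaries above can be made concrete with a small sketch. It assumes a hypothetical RunState object and a placeholder evidence rule (at least one evidence item must be present); neither comes from the source framework, and the skill bodies are stubs that only record that they ran.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class GateStatus(Enum):
    READY = "ready"
    NEEDS_EVIDENCE = "needs_evidence"
    BLOCKED = "blocked"


@dataclass
class RunState:
    raw_user_text: str                                # preserved verbatim for replay
    evidence: list[str] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)    # auditable trail of skills run
    gate: Optional[GateStatus] = None


def product_truth(state: RunState) -> None:
    # Stub: a real skill would extract facts here, never style.
    state.trace.append("product-truth")


def evidence_gate(state: RunState) -> None:
    # Hypothetical rule: at least one piece of source evidence is required.
    state.gate = GateStatus.READY if state.evidence else GateStatus.NEEDS_EVIDENCE
    state.trace.append("evidence-gate")


def generation_prompt(state: RunState) -> None:
    # Stub: a real skill would build a prompt contract from approved facts.
    state.trace.append("generation-prompt")


def run_pipeline(state: RunState) -> RunState:
    product_truth(state)
    evidence_gate(state)
    if state.gate is not GateStatus.READY:
        return state  # the gate stops the run before any generation skill
    generation_prompt(state)
    return state
```

The point of the sketch is the early return: generation is unreachable unless the gate says ready, and the trace shows which skills actually ran.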

5. Invocation design: model invoked or user invoked

Model-invoked skills reduce user effort but increase context load because their descriptions sit in the agent's environment. User-invoked skills keep the agent leaner but increase cognitive load on the operator. A multimodal system should choose based on risk and frequency.

| Skill Type | Recommended Trigger | Reason |
| --- | --- | --- |
| Evidence gate | Model-invoked | High-risk pre-generation check. |
| Product truth | Model-invoked or router-called | Most commerce tasks need it. |
| Router | User-invoked | The user remembers one entry point. |
| Storyboard / keyframe | Router-called | Not every task needs video planning. |
| Generation prompt | User-invoked or explicit route | Avoid accidental generation before evidence is ready. |
| QA and acceptance | Model-invoked before publication | Risk control should be hard to skip. |
| Failure memory | Automatic at the end | Each completion or failure should produce learning data. |

6. Product Truth Card and clean skill routing

The clean route is not a long prompt that mixes facts, style and task intent together. It is a data boundary: preserve the user's original request, extract a narrow retrieval intent, freeze product facts, pass style into generation only, then verify the candidate output with multimodal QA.

raw_user_text
  -> retrieval_intent
  -> product_truth_card
  -> generation_payload
  -> multimodal_verification

| Layer | Used For | Must Not Do |
| --- | --- | --- |
| raw_user_text | Evidence, replay and debugging. Keep the original request intact. | Do not feed it directly into skill retrieval. |
| retrieval_intent | Skill selection only: task type, platform, input asset, target action and product-truth status. | Do not include style words such as cinematic, premium, dark, streetwear or low-AI-feel. |
| product_truth_card | Visible facts, unknowns, claim boundaries, forbidden mutations and identity references. | Do not write marketing copy or invent missing selling points. |
| generation_payload | Scene, camera, lighting, subtitle, voiceover, CTA and style instructions. | Do not override the Product Truth Card. |
| multimodal_verification | Accept, repair or reject based on product identity, action logic, realism and commercial readability. | Do not score beauty alone. |

The hard gate is simple: if a Product Truth Card does not exist, the system should not enter storyboard, script or video-prompt generation. "Do not change the product" is too weak as a prompt sentence; the product facts need to be a locked state that later skills cannot override.
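
One way to picture the locked state and the hard gate is the sketch below. The field names on ProductTruthCard, the word-level split_style helper and the build_generation_payload function are illustrative assumptions, not part of the source design; the style word list is just the five examples named in the layer table.

```python
from dataclasses import dataclass
from typing import Optional

# Style words named in the layer table; a real list would be longer.
STYLE_WORDS = {"cinematic", "premium", "dark", "streetwear", "low-ai-feel"}


@dataclass(frozen=True)  # frozen: later skills cannot reassign product facts
class ProductTruthCard:
    name: str
    visible_facts: tuple[str, ...]
    unknowns: tuple[str, ...] = ()
    forbidden_mutations: tuple[str, ...] = ()


def split_style(raw_user_text: str) -> tuple[list[str], list[str]]:
    """Return (intent_words, style_words); raw_user_text itself is untouched."""
    intent: list[str] = []
    style: list[str] = []
    for word in raw_user_text.lower().split():
        token = word.strip(".,!?")
        if not token:
            continue
        if token in STYLE_WORDS:
            style.append(token)   # goes to generation_payload only
        else:
            intent.append(token)  # candidate input for retrieval_intent
    return intent, style


def build_generation_payload(card: Optional[ProductTruthCard], style: list[str]) -> dict:
    if card is None:
        # Hard gate: no card means no storyboard, script or video prompt.
        raise ValueError("Product Truth Card missing; generation is blocked")
    return {"product": card, "style": style}
```

A naive word split is only a stand-in for real intent extraction, but it shows the boundary: style tokens never reach the retrieval side, and the payload cannot be built without a card.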

Generation priority: Product geometry first, then color, label, logo, packaging, visible accessories, known claims, platform format, scene logic and finally creative style.
Model route: Use image-to-video or reference-to-video when product appearance matters. Pure text-to-video is better for concept films than SKU-level product fidelity.
TikTok branch: Discovery commerce: hook attention first, then prove the product action clearly enough for a viewer who was not searching for it.
Amazon branch: Intent commerce: the buyer is already comparing. Prioritize clarity, trust, feature proof, claim safety and doubt reduction.
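
The generation priority and model route can be expressed as data plus two small functions. This is a sketch under assumptions: the short field labels, the resolve_conflict helper and the "needs-reference" fallback value are hypothetical names introduced here for illustration.

```python
# Priority order taken from the generation priority rule; labels are shortened.
GENERATION_PRIORITY = [
    "geometry", "color", "label", "logo", "packaging",
    "visible accessories", "known claims", "platform format",
    "scene logic", "creative style",
]


def resolve_conflict(field_a: str, field_b: str) -> str:
    """Return whichever field ranks higher (earlier) in the priority list."""
    return min(field_a, field_b, key=GENERATION_PRIORITY.index)


def choose_model_route(appearance_matters: bool, has_reference_image: bool) -> str:
    if not appearance_matters:
        return "text-to-video"       # acceptable for concept films
    if has_reference_image:
        return "image-to-video"      # SKU-level fidelity needs a reference
    return "needs-reference"         # hypothetical: ask for evidence first
```

For example, when a style instruction conflicts with the logo, resolve_conflict("creative style", "logo") returns "logo", so the product fact wins.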

7. Information design: keep SKILL.md small

The main skill file should be a control surface, not a knowledge dump. Put the trigger, core steps and completion criteria in SKILL.md. Move branch-specific material into references, templates and examples that the agent opens only when needed.

multimodal-commerce-skills/
  router/SKILL.md
  product-truth/SKILL.md
  product-truth/templates/product_truth_card.json
  evidence-gate/SKILL.md
  evidence-gate/references/evidence_requirements.md
  storyboard-keyframe/SKILL.md
  storyboard-keyframe/templates/storyboard_table.md
  generation-prompt/SKILL.md
  generation-prompt/templates/prompt_contract.json
  multimodal-qa/SKILL.md
  multimodal-qa/references/failure_labels.md
  publish-acceptance/SKILL.md
  failure-memory/SKILL.md

This is progressive disclosure for agents. Most tasks should load only the steps. Platform rules, failure taxonomies, templates and examples should live behind context pointers.
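
Progressive disclosure can be sketched as a loader that reads SKILL.md up front and opens anything else only when asked. The Skill class and its open_pointer method are hypothetical names; real agent runtimes handle this differently, so treat it as an illustration of the idea rather than an implementation of any specific tool.

```python
from pathlib import Path


class Skill:
    """Loads SKILL.md eagerly and everything else only on request."""

    def __init__(self, root: Path):
        self.root = root
        # The control surface: trigger, core steps, completion criteria.
        self.steps = (root / "SKILL.md").read_text(encoding="utf-8")
        self.opened: list[str] = []  # which context pointers were followed

    def open_pointer(self, relative_path: str) -> str:
        """Read a reference, template or example and record that it was used."""
        self.opened.append(relative_path)
        return (self.root / relative_path).read_text(encoding="utf-8")
```

Logging which pointers were opened also gives a cheap signal for pruning: a reference that is never opened across many runs is a candidate for removal.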

8. Leading words for multimodal agents

A leading word compresses a behavior into a short reusable phrase. The phrase should appear in the skill, the templates, the QA labels and the agent's own operational language. This makes the desired behavior easier to repeat.

| Leading Word | Meaning | Use |
| --- | --- | --- |
| product truth | The product facts cannot drift. | All commerce generation. |
| evidence-first | Inspect source evidence before generation. | Listings, reviews, RAG and QA. |
| visual anchor | Shape, logo, color, material and usage action must be preserved. | Image and video fidelity. |
| keyframe-first | Plan inspectable frames before video generation. | Complex actions and product demos. |
| commercial readability | The buyer can understand the value without internal context. | Landing pages and video QA. |
| next-time rule | A failure becomes a future routing rule. | Failure memory. |

9. Pruning: remove sediment and no-ops

Multimodal skills often grow by sediment. Every failed generation adds another warning. Every platform adds another exception. Every bad output adds another style phrase. Eventually the skill becomes long, duplicated and stale.

Use deletion tests. If removing a paragraph does not change the agent's behavior, it is probably a no-op. If a rule applies only to one branch, move it to that branch. If a concept appears in multiple files, choose one source of truth. If a style request is repeated as prose, turn it into a structured field.

  • Each rule must change behavior.
  • Each concept should have one source of truth.
  • Branch-specific references should live outside the main skill file.
  • Style phrases should become parameters or leading words.
  • Failure notes should first go into failure memory, then be promoted only if they repeatedly matter.
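
A true deletion test requires rerunning the agent, but the "one source of truth" check can be partly mechanized. The sketch below flags rule lines that repeat across skill files after light normalization. The duplicated_rules function and the sample file contents in the usage note are assumptions made for illustration.

```python
from collections import defaultdict


def normalize(line: str) -> str:
    """Lowercase, collapse whitespace and drop leading bullet characters."""
    return " ".join(line.lower().split()).strip("-•* ")


def duplicated_rules(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each normalized rule line to the files that repeat it."""
    seen: dict[str, list[str]] = defaultdict(list)
    for name, text in files.items():
        # A set avoids counting the same file twice for one rule.
        rules = {normalize(line) for line in text.splitlines()}
        for rule in rules - {""}:
            seen[rule].append(name)
    return {rule: sorted(names) for rule, names in seen.items() if len(names) > 1}
```

Given a router file and a QA file that both contain "Never change the logo.", the function reports that rule with both file names, which is the prompt to pick one home for it.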

10. Evaluation framework

A multi-skill system needs evaluation at four layers: invocation, process, output and memory.

Invocation: Did the right skill fire at the right time? Did model-invoked skills increase context load too much? Can the user remember the entry points?
Process: Did the agent complete every step, do enough inspection and choose the correct branch before moving forward?
Output: Did the asset preserve product truth, fit the platform, align with evidence and reduce repair time?
Memory: Are repeated failures decreasing? Are next-time rules retrieved and used in future routing decisions?
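
Two of these questions reduce to simple ratios over a run log. The sketch below assumes the log is available as plain lists of skill names and failure labels; the function names and that log shape are hypothetical, and the process and output layers would need human or model review rather than counting.

```python
from collections import Counter


def invocation_accuracy(expected: list[str], fired: list[str]) -> float:
    """Fraction of runs where the skill that fired was the expected one."""
    if not expected or len(expected) != len(fired):
        raise ValueError("expected and fired must be non-empty and equal length")
    hits = sum(e == f for e, f in zip(expected, fired))
    return hits / len(expected)


def repeated_failure_rate(failure_labels: list[str]) -> float:
    """Share of failures whose label had already occurred earlier in the log."""
    if not failure_labels:
        return 0.0
    counts = Counter(failure_labels)
    repeats = sum(n - 1 for n in counts.values())  # first occurrence is not a repeat
    return repeats / len(failure_labels)
```

Tracking repeated_failure_rate per time window gives a direct reading of whether next-time rules are actually reducing repeat mistakes.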

11. Minimal router template

The router should be small. It should preserve raw user text, extract retrieval intent and route the work without absorbing every downstream instruction.

# Multimodal Commerce Router

Use this when a user gives product links, product images, videos,
ecommerce pages, or asks for multimodal generation, QA, repair or publication.

1. Preserve raw_user_text.
2. Extract retrieval_intent:
   modality, product category, platform, task stage, risk.
3. Route:
   - facts missing -> product-truth
   - evidence unclear -> evidence-gate
   - video plan needed -> storyboard-keyframe
   - prompt needed after approval -> generation-prompt
   - asset exists -> multimodal-qa
   - final upload/check needed -> publish-acceptance
   - failure observed -> failure-memory
4. Keep style requests in generation_payload, not retrieval_intent.
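
The routing step of this template can also be sketched as a lookup table. The signal names and the first-match precedence (template order) are assumptions; the template itself does not say which route wins when several conditions hold.

```python
# (signal, next skill) pairs in the same order as the template's route list.
ROUTES = [
    ("facts_missing", "product-truth"),
    ("evidence_unclear", "evidence-gate"),
    ("video_plan_needed", "storyboard-keyframe"),
    ("prompt_needed_after_approval", "generation-prompt"),
    ("asset_exists", "multimodal-qa"),
    ("final_check_needed", "publish-acceptance"),
    ("failure_observed", "failure-memory"),
]


def route(raw_user_text: str, signals: dict[str, bool]) -> dict:
    """Return the next skill for the first matching signal, in template order."""
    next_skill = next((skill for key, skill in ROUTES if signals.get(key)), None)
    # raw_user_text is passed through unchanged; the router writes no prompts.
    return {"raw_user_text": raw_user_text, "next_skill": next_skill}
```

With both asset_exists and facts_missing set, this sketch routes to product-truth first, which matches the rule that facts are frozen before anything downstream runs.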

12. Relationship to Product-Truth Feedback Training

Product-Truth Feedback Training (PTFT) defines what data should be captured around a generation: product truth, route choice, QA label, failure evidence, repair action and preference signal. The multi-skill system is the operating layer that produces that data reliably.

| PTFT Module | Skill-Layer Implementation |
| --- | --- |
| Product Truth Card | Product Truth Skill |
| Failure Memory Retriever | Failure Memory Skill |
| Model Router | Router Skill |
| Video QA Scorer | Multimodal QA Skill |
| Preference Ranking | Candidate comparison and acceptance skill |
| Repair Policy | Repair action from QA and failure memory |
In short, PTFT says what should be learned. The skill system says how an agent reliably produces the evidence needed to learn it. Product Truth Card routing adds the missing runtime boundary: facts are frozen before creative generation starts.
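
As an illustration of what one run could leave behind, the sketch below models a single record with the six signals PTFT is described as capturing. The PTFTRecord class, its field types and the JSON Lines output are assumptions; the actual PTFT dataset schema is not given in this guide.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PTFTRecord:
    product_truth: dict                       # snapshot of the Product Truth Card
    route_choice: str                         # which route the router selected
    qa_label: str                             # e.g. accept, repair or reject
    failure_evidence: Optional[str] = None    # pointer to the frame or screenshot
    repair_action: Optional[str] = None
    preference_signal: Optional[str] = None


def to_jsonl_line(record: PTFTRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one such line per run is enough for the failure memory skill to retrieve past labels and repair actions later.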

13. Conclusion

The next step for multimodal agents is not a longer prompt and not a single massive manual. It is a set of small, composable, auditable skills. Product truth, evidence gates, routing, keyframes, generation prompts, QA, acceptance and failure memory should each be explicit. A mature multimodal system does not simply generate polished assets. It leaves clearer judgments, fewer repeated errors and more reusable production intelligence after every run.

Source Materials

  • Building Great Agent Skills: The Missing Manual. YouTube talk by Matt Pocock on the AI Engineer channel.
  • Matt Pocock / skills. Open-source skill repository used as the primary source for the skill-writing framework.
  • writing-great-skills / SKILL.md. Source skill for predictability, invocation, structure, steering and pruning.
  • writing-great-skills / GLOSSARY.md. Definitions for model-invoked skills, user-invoked skills, context load, cognitive load, progressive disclosure and no-op.
  • Product Truth Card and Skill Routing field note. Internal design note by Jia Jingqiu, 2026-07-02.

Part of the product-truth research system. This guide is the skill-design layer next to the CMV-24 / PTFT dataset and paper. Together, they turn multimodal commerce production into an inspectable feedback loop.