Product-Truth Feedback Training: A Commerce Multimodal Video Feedback Dataset and Training Protocol
Abstract
Recent multimodal models have rapidly advanced from image-text understanding toward native multimodal reasoning, long-video comprehension, audio-visual interaction, and any-to-any generation. However, commercial AI video production still lacks a practical training framework that captures product truth, platform constraints, generation failures, repair actions, and downstream business use. This paper proposes Product-Truth Feedback Training (PTFT), an industry-side multimodal training protocol for cross-border ecommerce video production. Instead of training a foundation video model from scratch, PTFT trains the surrounding production intelligence: a failure-memory retriever, a model-routing policy, a multimodal QA scorer, and a preference-ranking policy. We introduce CMV-24, a seed dataset of 24 AI-generated commerce video samples covering TikTok/UGC, independent-site hero assets, and Amazon product videos. Each sample is annotated with product category, target platform, generation route, risk/failure labels, repair action, final use, verified video path, duration, and resolution. The paper argues that commercial multimodal training should not begin with more prompts or more generated clips, but with structured feedback data that teaches a production system when to generate, when to use keyframes, when to route to a different model, and when to reject an output.
1. Introduction
Multimodal AI systems are moving quickly toward native multimodal modeling. Models and research systems such as GPT-4o, Gemini, Qwen2.5-VL, Qwen2.5-Omni, InternVL3, LLaVA-OneVision, Chameleon, Transfusion, Emu3, and 4M-21 show that image, video, audio, text, screen, and document inputs are becoming part of unified reasoning and generation pipelines.
Yet ecommerce AI video production exposes a gap that general benchmarks rarely capture. A generated video can look visually impressive while still being commercially unusable: the product package drifts, a hand-object interaction becomes impossible, a supplement claim becomes non-compliant, a pet motion looks fake, the Amazon listing proof is unclear, or the first three seconds fail to communicate the product problem.
This gap motivates the paper's central question: how can ecommerce AI video production be turned into trainable multimodal feedback data?
2. Dataset: CMV-24
CMV-24 is a seed dataset of 24 commerce AI video samples curated from cross-border ecommerce video work. It covers:
| Group | Count | Primary QA Focus |
|---|---|---|
| TikTok Skill / UGC | 17 | Hook, low-AI feel, phone-native realism, action believability |
| Independent-site Hero / Product-page | 3 | Commercial readability, trust proof, landing-page fit |
| Amazon Product Video / Product Proof | 4 | Product clarity, feature proof, listing-claim accuracy |
Each sample includes product category, platform, route, QA risk labels, repair action, final use, public video URL, duration, resolution, and evidence status.
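As an illustration of the annotation fields just listed, a sample could be held in a record like the one below. This is a minimal sketch in Python: the class name `CMVSample`, the field names, the field types, and the helper `check_group_counts` are assumptions made here, not the dataset's published schema. Only the group sizes come from the table above.

```python
from dataclasses import dataclass


@dataclass
class CMVSample:
    """One annotated video; field names are hypothetical."""
    product_category: str
    platform: str
    route: str
    qa_risk_labels: list[str]
    repair_action: str
    final_use: str
    video_url: str
    duration_s: float
    resolution: str  # assumed to be a "WIDTHxHEIGHT" string
    evidence_status: str


# Group sizes copied from the table above.
GROUP_COUNTS = {
    "TikTok Skill / UGC": 17,
    "Independent-site Hero / Product-page": 3,
    "Amazon Product Video / Product Proof": 4,
}


def check_group_counts(counts: dict[str, int], expected_total: int = 24) -> bool:
    """Return True when the per-group counts add up to the dataset size."""
    return sum(counts.values()) == expected_total
```

A check of this kind is cheap to run whenever samples are added, so the group table and the stored records do not drift apart.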
3. Method: Product-Truth Feedback Training
PTFT is a staged protocol for training the production system around generative models.
3.1 Product Truth Card
Each task starts with a Product Truth Card:
{
  "product": "string",
  "category": "string",
  "market": "string",
  "channels": ["TikTok", "Amazon", "Shopify"],
  "buying_reasons": ["string"],
  "visual_anchors": ["shape", "logo", "material", "usage action"],
  "claims_to_avoid": ["medical cure", "guaranteed result"],
  "commercial_goal": "hook / product proof / hero / listing"
}
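A card can be checked for completeness before it feeds the retriever or router. The sketch below assumes Python and a hypothetical helper `validate_card`. The field names mirror the card above, while the expected types and the non-empty rule are assumptions made for illustration.

```python
from typing import Any

# Field names mirror the Product Truth Card; the expected types are assumed.
CARD_SCHEMA: dict[str, type] = {
    "product": str,
    "category": str,
    "market": str,
    "channels": list,
    "buying_reasons": list,
    "visual_anchors": list,
    "claims_to_avoid": list,
    "commercial_goal": str,
}


def validate_card(card: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the card passed."""
    problems = []
    for name, expected in CARD_SCHEMA.items():
        if name not in card:
            problems.append(f"missing field: {name}")
        elif not isinstance(card[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
        elif not card[name]:
            problems.append(f"empty field: {name}")
    return problems
```

Returning a list of problems rather than raising on the first one lets an annotator fix an incomplete card in a single pass.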
3.2 Failure Memory Retriever
Each failure is stored as:
product category + platform + route + failed label + evidence + repair action
The next similar task retrieves these failures before prompt writing and model routing.
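The retrieval step can be sketched with exact-match scoring over the stored fields. The example assumes Python. The `FailureRecord` class, the `retrieve_failures` function, and the weighting that counts a category match twice are illustrative assumptions; the paper does not specify a similarity function, and an embedding-based retriever would be an equally plausible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    category: str
    platform: str
    route: str
    failed_label: str
    evidence: str
    repair_action: str


def retrieve_failures(memory, category, platform, route, top_k=3):
    """Rank stored failures by field overlap with the new task."""

    def score(rec: FailureRecord) -> int:
        # Assumed weighting: a category match counts double.
        return (
            2 * (rec.category == category)
            + (rec.platform == platform)
            + (rec.route == route)
        )

    ranked = sorted(memory, key=score, reverse=True)
    # Drop records that share nothing with the new task.
    return [rec for rec in ranked if score(rec) > 0][:top_k]
```

The retrieved records can then be attached to the prompt-writing step so that known repair actions are visible before generation starts.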
3.3 Model Router
The router predicts:
Product Truth Card + platform + risk flags -> route_type + keyframe_required + QA labels
Route types include:
- text-to-video
- image-first
- reference-image
- keyframe-first
- manual-edit-enhanced
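A rule-based version of this mapping might look like the sketch below, written in Python. The specific rules (which risk flags force keyframes, how Amazon tasks are routed) are hypothetical placeholders, not the authors' routing policy. Only the route names and label names are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class RouteDecision:
    route_type: str
    keyframe_required: bool
    qa_labels: list[str] = field(default_factory=list)


def route(card: dict, platform: str, risk_flags: set[str]) -> RouteDecision:
    """Map a task to a route using hand-written rules (illustrative only)."""
    labels = set(risk_flags)
    # Assumed rule: any listed claim to avoid triggers a compliance check.
    if card.get("claims_to_avoid"):
        labels.add("platform_compliance_risk")
    # Assumed rule: shape or hand-contact risk is safer from fixed keyframes.
    if risk_flags & {"product_shape_risk", "hand_object_contact"}:
        return RouteDecision("keyframe-first", True, sorted(labels))
    # Assumed rule: listing videos start from a reference image of the product.
    if platform == "Amazon":
        labels.add("listing_claim_accuracy")
        return RouteDecision("reference-image", True, sorted(labels))
    return RouteDecision("text-to-video", False, sorted(labels))
```

Keeping the rules explicit makes the baseline easy to audit, and its decisions can later serve as weak labels for a learned router.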
3.4 Video QA Scorer
The QA scorer predicts labels and repair actions:
keyframes + video summary + Product Truth Card + platform -> QA labels + acceptability score + repair action
Key labels include product_shape_risk, hand_object_contact, action_logic, scene_consistency, temporal_fidelity, commercial_readability, listing_claim_accuracy, first_three_seconds_hook, and platform_compliance_risk.
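One way to turn per-label model outputs into the three outputs named above is sketched here in Python. The threshold of 0.5, the rule that acceptability is one minus the worst risk, and the `REPAIR_ACTIONS` strings are all assumptions made for illustration. The paper does not define the scoring rule or a label-to-repair mapping.

```python
# Hypothetical mapping from a flagged label to a repair action.
REPAIR_ACTIONS = {
    "product_shape_risk": "regenerate from a product keyframe",
    "hand_object_contact": "cut around the contact beat",
    "platform_compliance_risk": "rewrite the claim and send to compliance review",
}


def score_video(risk_probs: dict[str, float], threshold: float = 0.5):
    """Return (flagged labels, acceptability score, repair action)."""
    flagged = sorted(label for label, p in risk_probs.items() if p >= threshold)
    # Assumed scoring rule: acceptability is one minus the worst risk.
    acceptability = 1.0 - max(risk_probs.values(), default=0.0)
    # The repair targets the single highest-risk label.
    worst = max(risk_probs, key=risk_probs.get, default=None)
    repair = REPAIR_ACTIONS.get(worst, "manual review") if flagged else "none"
    return flagged, acceptability, repair
```

In practice the probabilities would come from the multi-label classifier, and labels without a mapped repair fall back to manual review.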
3.5 Preference Ranking
When multiple candidate videos exist:
chosen_video > rejected_video
The chosen output should be more faithful to product truth, more platform-native, more commercially readable, and easier to use or repair.
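The relation chosen_video > rejected_video can be turned into a training signal with a pairwise logistic (Bradley-Terry style) loss. The paper does not name a loss, so this choice is an assumption, as are the function name and the idea that each video receives a scalar score from some scoring model. A minimal Python sketch:

```python
import math


def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_score - rejected_score
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the chosen video's score exceeds the rejected one's, and it equals log 2 when the two scores are tied.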
4. Minimal Viable Experiment
CMV-24 supports a first experiment:
- Extract four keyframes per video: hook, product proof, usage/action, CTA/payoff.
- Label each keyframe and full video with QA labels.
- Generate two candidate variants per product, expanding the set to 72 videos.
- Create chosen/rejected preference pairs.
- Implement a rule-based router baseline.
- Train a lightweight multi-label QA classifier.
- Measure retries, usable output rate, product drift, and human review time.
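The four measurements in the last step could be aggregated from per-video review records as sketched below. The Python function `summarize` and the record keys (`usable`, `retries`, `product_drift`, `review_minutes`) are hypothetical; the paper does not define a logging format.

```python
from statistics import mean


def summarize(reviews: list[dict]) -> dict[str, float]:
    """Aggregate per-video review records into the four experiment metrics."""
    return {
        "usable_output_rate": mean(1.0 if r["usable"] else 0.0 for r in reviews),
        "mean_retries": mean(r["retries"] for r in reviews),
        "product_drift_rate": mean(1.0 if r["product_drift"] else 0.0 for r in reviews),
        "mean_review_minutes": mean(r["review_minutes"] for r in reviews),
    }
```

Running the same summary before and after the router and QA classifier are introduced gives a direct comparison on the 72-video set.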
5. Limitations
CMV-24 is a seed dataset, not a benchmark. It lacks original model names, detailed timecode-level failure evidence, A/B preference pairs, and live campaign metrics. It should be treated as a reproducible structure for expanding a production dataset.
6. Ethics and Commercial Safety
Commerce multimodal training must avoid turning generated assets into false product evidence. PTFT tracks product truth and unsupported claims. Health, beauty, pet, supplement, and safety-related clips require conservative language and explicit compliance review.
7. Conclusion
Commercial multimodal training should begin with feedback structure. CMV-24 and PTFT show how product truth, platform fit, visual failure, compliance risk, and commercial usability can become trainable data for routing, QA, preference ranking, and rejection policies.
The most valuable training signal in commerce multimodal video is not the prompt itself, but the structured explanation of why a generated output is usable, risky, repairable, or rejected.