Name: CMV-24: Commerce Multimodal Video Feedback Dataset
Creator: Jia Jingqiu

Author: Jia Jingqiu · Release draft: 2026-06-29

Product-Truth Feedback Training: A Commerce Multimodal Video Feedback Dataset and Training Protocol

Abstract

Recent multimodal models have rapidly advanced from image-text understanding toward native multimodal reasoning, long-video comprehension, audio-visual interaction, and any-to-any generation. However, commercial AI video production still lacks a practical training framework that captures product truth, platform constraints, generation failures, repair actions, and downstream business use. This paper proposes Product-Truth Feedback Training (PTFT), an industry-side multimodal training protocol for cross-border ecommerce video production. Instead of training a foundation video model from scratch, PTFT trains the surrounding production intelligence: a failure-memory retriever, a model-routing policy, a multimodal QA scorer, and a preference-ranking policy. We introduce CMV-24, a seed dataset of 24 AI-generated commerce video samples covering TikTok/UGC, independent-site hero assets, and Amazon product videos. Each sample is annotated with product category, target platform, generation route, risk/failure labels, repair action, final use, verified video path, duration, and resolution. The paper argues that commercial multimodal training should not begin with more prompts or more generated clips, but with structured feedback data that teaches a production system when to generate, when to use keyframes, when to route to a different model, and when to reject an output.

1. Introduction

Multimodal AI systems are moving quickly toward native multimodal modeling. Models and research systems such as GPT-4o, Gemini, Qwen2.5-VL, Qwen2.5-Omni, InternVL3, LLaVA-OneVision, Chameleon, Transfusion, Emu3, and 4M-21 show that image, video, audio, text, screen, and document inputs are becoming part of unified reasoning and generation pipelines.

Yet ecommerce AI video production exposes a gap that general benchmarks rarely capture. A generated video can look visually impressive while still being commercially unusable: the product package drifts, a hand-object interaction becomes impossible, a supplement claim becomes non-compliant, a pet motion looks fake, the Amazon listing proof is unclear, or the first three seconds fail to communicate the product problem.

This paper addresses one question: how can ecommerce AI video production be turned into trainable multimodal feedback data?

2. Dataset: CMV-24

CMV-24 is a seed dataset of 24 commerce AI video samples curated from cross-border ecommerce video work. It covers:

Group	Count	Primary QA Focus
TikTok Skill / UGC	17	Hook, low-AI feel, phone-native realism, action believability
Independent-site Hero / Product-page	3	Commercial readability, trust proof, landing-page fit
Amazon Product Video / Product Proof	4	Product clarity, feature proof, listing-claim accuracy

Each sample includes product category, platform, route, QA risk labels, repair action, final use, public video URL, duration, resolution, and evidence status.
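As a minimal sketch of how one record could be represented in code, the dataclass below mirrors the fields listed above. The field names, types, and the example values are assumptions made for illustration; they are not a released CMV-24 schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CMVSample:
    """One CMV-24 record. Field names are illustrative, not a released schema."""
    sample_id: str            # hypothetical identifier
    product_category: str
    platform: str
    route: str                # generation route
    qa_risk_labels: List[str]
    repair_action: str
    final_use: str
    video_url: str
    duration_s: float
    resolution: str
    evidence_status: str


# Group sizes as listed in the table above.
GROUP_COUNTS = {
    "TikTok Skill / UGC": 17,
    "Independent-site Hero / Product-page": 3,
    "Amazon Product Video / Product Proof": 4,
}


def check_group_counts(expected_total: int = 24) -> bool:
    """Return True when the group sizes add up to the dataset size."""
    return sum(GROUP_COUNTS.values()) == expected_total


# Placeholder record; every value here is invented for illustration.
example = CMVSample(
    sample_id="sample-001",
    product_category="pet grooming",
    platform="TikTok",
    route="keyframe-first",
    qa_risk_labels=["hand_object_contact"],
    repair_action="regenerate usage shot from a keyframe",
    final_use="UGC hook clip",
    video_url="https://example.com/placeholder.mp4",
    duration_s=12.0,
    resolution="1080x1920",
    evidence_status="verified",
)
```

The count check is a cheap guard against a group being dropped or duplicated when the dataset is expanded.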

3. Method: Product-Truth Feedback Training

PTFT is a staged protocol for training the production system around generative models.

3.1 Product Truth Card

Each task starts with a Product Truth Card:

{
  "product": "string",
  "category": "string",
  "market": "string",
  "channels": ["TikTok", "Amazon", "Shopify"],
  "buying_reasons": ["string"],
  "visual_anchors": ["shape", "logo", "material", "usage action"],
  "claims_to_avoid": ["medical cure", "guaranteed result"],
  "commercial_goal": "hook / product proof / hero / listing"
}
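A Product Truth Card in this shape can be checked before any generation step runs. The sketch below assumes the card is loaded as a Python dict; the helper names `validate_card` and `find_claim_violations` are hypothetical, and the substring match is a deliberately naive stand-in for a real compliance check.

```python
from typing import Dict, List

# Expected field types, taken from the card layout above.
REQUIRED_FIELDS = {
    "product": str,
    "category": str,
    "market": str,
    "channels": list,
    "buying_reasons": list,
    "visual_anchors": list,
    "claims_to_avoid": list,
    "commercial_goal": str,
}


def validate_card(card: Dict) -> List[str]:
    """Return a list of problems; an empty list means the card is well-formed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in card:
            problems.append(f"missing field: {name}")
        elif not isinstance(card[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


def find_claim_violations(card: Dict, script_text: str) -> List[str]:
    """Naive check: which claims_to_avoid phrases appear verbatim in the script."""
    text = script_text.lower()
    return [claim for claim in card["claims_to_avoid"] if claim.lower() in text]


# Invented example card for illustration only.
card = {
    "product": "collagen gummies",
    "category": "supplement",
    "market": "US",
    "channels": ["TikTok", "Amazon"],
    "buying_reasons": ["easy daily routine"],
    "visual_anchors": ["shape", "logo"],
    "claims_to_avoid": ["medical cure", "guaranteed result"],
    "commercial_goal": "hook",
}
```

For instance, `find_claim_violations(card, "A guaranteed result in 7 days")` flags the second phrase, while paraphrased claims would pass unnoticed, which is why such a check can only complement human compliance review.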

3.2 Failure Memory Retriever

Each failure is stored as:

product category + platform + route + failure label + evidence + repair action

When a similar task arrives, the system retrieves the matching failure records before prompt writing and model routing.
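One simple way to realize this retrieval is exact-match scoring over the stored fields. The sketch below is an assumption, not the paper's retriever: the `FailureRecord` class, the weights (category match counts double), and `top_k` are all invented for illustration, and a production system might use embeddings instead.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FailureRecord:
    product_category: str
    platform: str
    route: str
    failure_label: str
    evidence: str
    repair_action: str


def retrieve_failures(
    memory: List[FailureRecord],
    product_category: str,
    platform: str,
    route: Optional[str] = None,
    top_k: int = 3,
) -> List[FailureRecord]:
    """Rank stored failures by field overlap with the new task (hypothetical weights)."""
    def score(rec: FailureRecord) -> int:
        s = 0
        if rec.product_category == product_category:
            s += 2  # assumed: category similarity matters most
        if rec.platform == platform:
            s += 1
        if route is not None and rec.route == route:
            s += 1
        return s

    scored = [(score(rec), idx, rec) for idx, rec in enumerate(memory)]
    matches = [item for item in scored if item[0] > 0]
    # Highest score first; ties keep insertion order.
    matches.sort(key=lambda item: (-item[0], item[1]))
    return [rec for _, _, rec in matches[:top_k]]


# Invented records for illustration only.
memory = [
    FailureRecord("supplement", "TikTok", "text-to-video",
                  "platform_compliance_risk", "caption implied a cure", "rewrite caption"),
    FailureRecord("pet", "TikTok", "text-to-video",
                  "action_logic", "dog motion looked synthetic", "switch to keyframe-first"),
    FailureRecord("kitchen", "Amazon", "image-first",
                  "product_shape_risk", "lid shape drifted", "add reference image"),
]
```

Retrieved records can then be prepended to the prompt-writing context or passed to the router as risk flags.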

3.3 Model Router

The router predicts:

Product Truth Card + platform + risk flags -> route_type + keyframe_required + QA labels

Route types include:

text-to-video
image-first
reference-image
keyframe-first
manual-edit-enhanced
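A rule-based version of this mapping could look like the sketch below. The route names and QA label names come from this paper, but every rule (which flag leads to which route, and which routes require keyframes) is an assumption chosen only to show the input and output shape.

```python
from typing import Dict, Set

ROUTE_TYPES = [
    "text-to-video",
    "image-first",
    "reference-image",
    "keyframe-first",
    "manual-edit-enhanced",
]


def route_task(card: Dict, platform: str, risk_flags: Set[str]) -> Dict:
    """Map card + platform + risk flags to a route decision (illustrative rules)."""
    qa_labels = ["commercial_readability"]

    if "product_shape_risk" in risk_flags:
        # Assumed rule: shape drift is best anchored by a reference image.
        route_type = "reference-image"
        qa_labels.append("product_shape_risk")
    elif risk_flags & {"hand_object_contact", "action_logic"}:
        # Assumed rule: fragile actions get keyframes first.
        route_type = "keyframe-first"
        qa_labels.extend(sorted(risk_flags & {"hand_object_contact", "action_logic"}))
    elif platform == "Amazon":
        route_type = "image-first"
    else:
        route_type = "text-to-video"

    if platform == "Amazon":
        qa_labels.append("listing_claim_accuracy")
    if platform == "TikTok":
        qa_labels.append("first_three_seconds_hook")
    if card.get("claims_to_avoid"):
        qa_labels.append("platform_compliance_risk")

    return {
        "route_type": route_type,
        "keyframe_required": route_type in {"keyframe-first", "reference-image"},
        "qa_labels": qa_labels,
    }
```

A table of rules like this is easy to audit and gives a baseline against which a learned router can later be compared.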

3.4 Video QA Scorer

The QA scorer predicts labels and repair actions:

keyframes + video summary + Product Truth Card + platform -> QA labels + acceptability score + repair action

Key labels include product_shape_risk, hand_object_contact, action_logic, scene_consistency, temporal_fidelity, commercial_readability, listing_claim_accuracy, first_three_seconds_hook, and platform_compliance_risk.
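To make the scorer's output concrete, the sketch below turns per-label problem scores into flagged labels, an acceptability score, and a repair action. It assumes a hypothetical upstream classifier that emits, for each label, a score in [0, 1] where higher means a more likely problem; the threshold, the "one minus worst score" aggregation, and the repair mapping are all illustrative choices rather than the paper's method.

```python
from typing import Dict

QA_LABELS = [
    "product_shape_risk",
    "hand_object_contact",
    "action_logic",
    "scene_consistency",
    "temporal_fidelity",
    "commercial_readability",
    "listing_claim_accuracy",
    "first_three_seconds_hook",
    "platform_compliance_risk",
]

# Hypothetical mapping from the worst flagged label to a repair action.
REPAIR_ACTIONS = {
    "product_shape_risk": "regenerate with a reference image",
    "hand_object_contact": "regenerate from keyframes",
    "platform_compliance_risk": "send to compliance review",
}


def summarize_qa(problem_scores: Dict[str, float], threshold: float = 0.5) -> Dict:
    """Aggregate per-label problem scores into a single QA decision."""
    unknown = set(problem_scores) - set(QA_LABELS)
    if unknown:
        raise ValueError(f"unknown QA labels: {sorted(unknown)}")

    flagged = [label for label, s in problem_scores.items() if s >= threshold]
    flagged.sort(key=lambda label: -problem_scores[label])  # worst problem first

    worst = max(problem_scores.values(), default=0.0)
    repair = REPAIR_ACTIONS.get(flagged[0], "manual review") if flagged else "none"
    return {
        "flagged_labels": flagged,
        "acceptability": round(1.0 - worst, 3),
        "repair_action": repair,
    }
```

Using the worst label rather than an average reflects the idea that a single product-truth failure can make an otherwise polished clip unusable.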

3.5 Preference Ranking

When multiple candidate videos exist:

chosen_video > rejected_video

The chosen output should be more faithful to product truth, more platform-native, more commercially readable, and easier to use or repair.
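Preference pairs of this form can be derived from any scalar judgment of the candidates. The sketch below assumes each candidate already has a score, for example a human reviewer rating; the `min_margin` cutoff is a hypothetical parameter that skips near-ties, since those make noisy training pairs.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_preference_pairs(
    candidates: List[Tuple[str, float]],
    min_margin: float = 0.1,
) -> List[Dict[str, str]]:
    """Turn scored candidates for one product into chosen/rejected pairs.

    candidates: (video_id, score) tuples, where a higher score is better.
    """
    pairs = []
    for (id_a, score_a), (id_b, score_b) in combinations(candidates, 2):
        if abs(score_a - score_b) < min_margin:
            continue  # too close to call; skip rather than guess
        if score_a > score_b:
            pairs.append({"chosen": id_a, "rejected": id_b})
        else:
            pairs.append({"chosen": id_b, "rejected": id_a})
    return pairs


# Invented scores for three candidates of one product.
pairs = build_preference_pairs([("v0", 0.9), ("v1", 0.5), ("v2", 0.45)])
```

Here `v1` and `v2` differ by less than the margin, so only the two pairs involving `v0` are kept.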

4. Minimal Viable Experiment

CMV-24 supports a first experiment:

Extract four keyframes per video: hook, product proof, usage/action, CTA/payoff.
Label each keyframe and full video with QA labels.
Generate two candidate variants per sample, expanding the set from 24 to 72 videos.
Create chosen/rejected preference pairs.
Build a rule-based router baseline.
Train a lightweight multi-label QA classifier.
Measure retries, usable output rate, product drift, and human review time.
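The bookkeeping for the last step can be sketched as follows. The per-run record keys (`usable`, `retries`, `product_drift`, `review_minutes`) are assumed names for the four measurements listed above, and the sample values are invented; the size helper simply reproduces the 24-to-72 expansion.

```python
from typing import Dict, List


def expanded_size(n_samples: int, variants_per_sample: int) -> int:
    """Originals plus generated variants, e.g. 24 * (1 + 2) = 72."""
    return n_samples * (1 + variants_per_sample)


def summarize_runs(runs: List[Dict]) -> Dict[str, float]:
    """Aggregate per-video outcomes into the four experiment metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "usable_output_rate": sum(r["usable"] for r in runs) / n,
        "mean_retries": sum(r["retries"] for r in runs) / n,
        "product_drift_rate": sum(r["product_drift"] for r in runs) / n,
        "mean_review_minutes": sum(r["review_minutes"] for r in runs) / n,
    }


# Invented outcomes for four videos, for illustration only.
runs = [
    {"usable": True, "retries": 0, "product_drift": False, "review_minutes": 2.0},
    {"usable": True, "retries": 1, "product_drift": False, "review_minutes": 4.0},
    {"usable": False, "retries": 3, "product_drift": True, "review_minutes": 6.0},
    {"usable": True, "retries": 0, "product_drift": False, "review_minutes": 4.0},
]
summary = summarize_runs(runs)
```

Computing the same summary with and without the failure-memory retriever or router would give a simple before-and-after comparison for the protocol.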

5. Limitations

CMV-24 is a seed dataset, not a benchmark. It lacks original model names, detailed timecode-level failure evidence, A/B preference pairs, and live campaign metrics. It should be treated as a reproducible structure for expanding a production dataset.

6. Ethics and Commercial Safety

Commerce multimodal training must avoid turning generated assets into false product evidence. PTFT tracks product truth and unsupported claims. Health, beauty, pet, supplement, and safety-related clips require conservative language and explicit compliance review.

7. Conclusion

Commercial multimodal training should begin with feedback structure. CMV-24 and PTFT illustrate how product truth, platform fit, visual failure, compliance risk, and commercial usability can become trainable data for routing, QA, preference ranking, and rejection policies.

The most valuable training signal in commerce multimodal video is not the prompt itself, but the structured explanation of why a generated output is usable, risky, repairable, or rejected.