
Product-Truth Feedback Training

A commerce multimodal video feedback dataset and training protocol built from CMV-24, a 24-sample AI commerce video metadata dataset.

Author: Jia Jingqiu · Release draft: 2026-06-29

Product-Truth Feedback Training: A Commerce Multimodal Video Feedback Dataset and Training Protocol

Abstract

Recent multimodal models have rapidly advanced from image-text understanding toward native multimodal reasoning, long-video comprehension, audio-visual interaction, and any-to-any generation. However, commercial AI video production still lacks a practical training framework that captures product truth, platform constraints, generation failures, repair actions, and downstream business use. This paper proposes Product-Truth Feedback Training (PTFT), an industry-side multimodal training protocol for cross-border ecommerce video production. Instead of training a foundation video model from scratch, PTFT trains the surrounding production intelligence: a failure-memory retriever, a model-routing policy, a multimodal QA scorer, and a preference-ranking policy. We introduce CMV-24, a seed dataset of 24 AI-generated commerce video samples covering TikTok/UGC, independent-site hero assets, and Amazon product videos. Each sample is annotated with product category, target platform, generation route, risk/failure labels, repair action, final use, verified video path, duration, and resolution. The paper argues that commercial multimodal training should not begin with more prompts or more generated clips, but with structured feedback data that teaches a production system when to generate, when to use keyframes, when to route to a different model, and when to reject an output.

1. Introduction

Multimodal AI systems are moving quickly toward native multimodal modeling. Models and research systems such as GPT-4o, Gemini, Qwen2.5-VL, Qwen2.5-Omni, InternVL3, LLaVA-OneVision, Chameleon, Transfusion, Emu3, and 4M-21 show that image, video, audio, text, screen, and document inputs are becoming part of unified reasoning and generation pipelines.

Yet ecommerce AI video production exposes a gap that general benchmarks rarely capture. A generated video can look visually impressive while still being commercially unusable: the product package drifts, a hand-object interaction becomes impossible, a supplement claim becomes non-compliant, a pet motion looks fake, the Amazon listing proof is unclear, or the first three seconds fail to communicate the product problem.

This gap motivates the central question of the paper: how can ecommerce AI video production be turned into trainable multimodal feedback data?

2. Dataset: CMV-24

CMV-24 is a seed dataset of 24 commerce AI video samples curated from cross-border ecommerce video work. It covers:

Group | Count | Primary QA Focus
TikTok Skill / UGC | 17 | Hook, low-AI feel, phone-native realism, action believability
Independent-site Hero / Product-page | 3 | Commercial readability, trust proof, landing-page fit
Amazon Product Video / Product Proof | 4 | Product clarity, feature proof, listing-claim accuracy

Each sample includes product category, platform, route, QA risk labels, repair action, final use, public video URL, duration, resolution, and evidence status.
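
The field list above can be checked mechanically before any training run. The following is a minimal sketch, assuming snake_case column names that mirror the fields listed in this section; the actual column names in the released metadata file may differ, and the two sample rows (including the example.com URLs) are hypothetical placeholders, not CMV-24 records.

```python
import csv
import io

# Assumed column names, derived from the field list in Section 2.
# The released CSV may use different names.
REQUIRED_FIELDS = [
    "product_category", "platform", "route", "qa_risk_labels",
    "repair_action", "final_use", "video_url", "duration",
    "resolution", "evidence_status",
]


def validate_rows(rows):
    """Return (row_index, missing_field_names) for every incomplete row."""
    problems = []
    for index, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical rows for illustration only; the second row omits its QA labels.
sample_csv = (
    "product_category,platform,route,qa_risk_labels,repair_action,"
    "final_use,video_url,duration,resolution,evidence_status\n"
    "pet,TikTok,text-to-video,action_logic,regenerate with keyframes,"
    "ugc ad,https://example.com/v1.mp4,12,1080x1920,verified\n"
    "supplement,Amazon,image-first,,soften claim,"
    "listing video,https://example.com/v2.mp4,20,1920x1080,verified\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = validate_rows(rows)
```

Rows with missing fields are reported rather than dropped, so an annotator can repair the record instead of silently shrinking a 24-sample set.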

3. Method: Product-Truth Feedback Training

PTFT is a staged protocol for training the production system around generative models.

3.1 Product Truth Card

Each task starts with a Product Truth Card:

{
  "product": "string",
  "category": "string",
  "market": "string",
  "channels": ["TikTok", "Amazon", "Shopify"],
  "buying_reasons": ["string"],
  "visual_anchors": ["shape", "logo", "material", "usage action"],
  "claims_to_avoid": ["medical cure", "guaranteed result"],
  "commercial_goal": "hook / product proof / hero / listing"
}
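
One way to make the card usable in code is a typed record with two simple checks. This is a minimal sketch under stated assumptions: the class name, the `missing_fields` helper, and the naive substring check in `find_forbidden_claims` are illustrative and are not part of the released protocol. The example card values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductTruthCard:
    # Field names follow the JSON template above.
    product: str
    category: str
    market: str
    channels: List[str]
    buying_reasons: List[str]
    visual_anchors: List[str]
    claims_to_avoid: List[str]
    commercial_goal: str

    def missing_fields(self) -> List[str]:
        """Return names of fields left as an empty string or empty list."""
        return [name for name, value in vars(self).items() if not value]


def find_forbidden_claims(card: ProductTruthCard, script: str) -> List[str]:
    """Naive case-insensitive substring match against claims_to_avoid.

    This is an assumption for illustration; a real compliance check
    would need paraphrase handling and human review.
    """
    text = script.lower()
    return [claim for claim in card.claims_to_avoid if claim.lower() in text]


# Hypothetical card for illustration only.
card = ProductTruthCard(
    product="collapsible pet water bottle",
    category="pet",
    market="US",
    channels=["TikTok", "Amazon"],
    buying_reasons=["easy hydration on walks"],
    visual_anchors=["shape", "logo", "usage action"],
    claims_to_avoid=["medical cure", "guaranteed result"],
    commercial_goal="product proof",
)

flagged = find_forbidden_claims(card, "Guaranteed result in one week!")
```

Checking a draft script against `claims_to_avoid` before generation keeps the compliance constraint attached to the card rather than to any single prompt.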

3.2 Failure Memory Retriever

Each failure is stored as:

product category + platform + route + failed label + evidence + repair action

The next similar task retrieves these failures before prompt writing and model routing.
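
The retrieval step can be sketched as exact-match scoring over the stored fields. The weights below (category counts double, platform and route count once) are an assumption chosen for illustration, not a rule from the protocol, and the three records are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureRecord:
    category: str
    platform: str
    route: str
    failed_label: str
    evidence: str
    repair_action: str


def retrieve_failures(memory: List[FailureRecord], category: str,
                      platform: str, route: str, top_k: int = 3) -> List[FailureRecord]:
    """Return up to top_k past failures that share fields with the new task."""
    def score(record: FailureRecord) -> int:
        # Assumed weighting: a category match matters most.
        return (2 * (record.category == category)
                + (record.platform == platform)
                + (record.route == route))

    scored = [(score(r), i, r) for i, r in enumerate(memory)]
    hits = [item for item in scored if item[0] > 0]
    # Highest score first; ties keep insertion order.
    hits.sort(key=lambda item: (-item[0], item[1]))
    return [record for _, _, record in hits[:top_k]]


# Hypothetical failure memory for illustration only.
memory = [
    FailureRecord("supplement", "TikTok", "text-to-video",
                  "platform_compliance_risk", "voiceover implies a cure",
                  "rewrite claim"),
    FailureRecord("pet", "Amazon", "image-first",
                  "action_logic", "pet motion looks fake",
                  "switch to keyframe-first"),
    FailureRecord("pet", "TikTok", "text-to-video",
                  "product_shape_risk", "bottle shape drifts mid-clip",
                  "switch to reference-image"),
]

similar = retrieve_failures(memory, "pet", "TikTok", "text-to-video")
```

With a seed set this small, exact matching is easy to audit; an embedding-based retriever would be a natural replacement once the memory grows.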

3.3 Model Router

The router predicts:

Product Truth Card + platform + risk flags -> route_type + keyframe_required + QA labels

Route types include:

  • text-to-video
  • image-first
  • reference-image
  • keyframe-first
  • manual-edit-enhanced
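
A rule-based version of the router mapping can be sketched as an ordered list of checks. The specific rules and their priority order below are assumptions made for illustration; the paper does not specify which risk flag maps to which route.

```python
from typing import Dict, Set

ROUTE_TYPES = [
    "text-to-video",
    "image-first",
    "reference-image",
    "keyframe-first",
    "manual-edit-enhanced",
]


def route_task(platform: str, risk_flags: Set[str]) -> Dict[str, object]:
    """Map platform and risk flags to a route decision (hypothetical rules)."""
    if "platform_compliance_risk" in risk_flags:
        # Assumed: compliance-sensitive clips get a human editing pass.
        route = "manual-edit-enhanced"
    elif risk_flags & {"hand_object_contact", "action_logic"}:
        # Assumed: motion-heavy risks are pinned down with keyframes first.
        route = "keyframe-first"
    elif "product_shape_risk" in risk_flags:
        # Assumed: shape drift is reduced by conditioning on a reference image.
        route = "reference-image"
    elif platform == "Amazon":
        route = "image-first"
    else:
        route = "text-to-video"
    return {
        "route_type": route,
        "keyframe_required": route == "keyframe-first",
        "qa_labels": sorted(risk_flags),
    }


decision = route_task("TikTok", {"action_logic", "product_shape_risk"})
```

Because the checks run in a fixed order, the first matching rule wins, which makes each routing decision traceable to a single flag.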

3.4 Video QA Scorer

The QA scorer predicts labels and repair actions:

keyframes + video summary + Product Truth Card + platform -> QA labels + acceptability score + repair action

Key labels include product_shape_risk, hand_object_contact, action_logic, scene_consistency, temporal_fidelity, commercial_readability, listing_claim_accuracy, first_three_seconds_hook, and platform_compliance_risk.
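
The scorer output can be illustrated with a small aggregation step. This sketch assumes a classifier has already produced, for each label, a probability in [0, 1] that the label is a problem for the clip; the 0.5 threshold and the "one minus worst risk" acceptability rule are assumptions, not part of the protocol.

```python
from typing import Dict

# Label names as listed in Section 3.4.
QA_LABELS = [
    "product_shape_risk", "hand_object_contact", "action_logic",
    "scene_consistency", "temporal_fidelity", "commercial_readability",
    "listing_claim_accuracy", "first_three_seconds_hook",
    "platform_compliance_risk",
]


def score_clip(risks: Dict[str, float], threshold: float = 0.5) -> Dict[str, object]:
    """Turn per-label problem probabilities into flags and one score."""
    unknown = set(risks) - set(QA_LABELS)
    if unknown:
        raise ValueError(f"unknown QA labels: {sorted(unknown)}")
    flagged = sorted(label for label, p in risks.items() if p >= threshold)
    worst = max(risks.values(), default=0.0)
    # Assumed rule: a clip is only as acceptable as its worst label.
    return {"flagged": flagged, "acceptability": round(1.0 - worst, 3)}


# Hypothetical classifier output for one clip.
result = score_clip({"product_shape_risk": 0.8, "action_logic": 0.2})
```

Using the worst label rather than an average reflects the observation in the introduction that a single failure, such as package drift, can make an otherwise impressive clip unusable.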

3.5 Preference Ranking

When multiple candidate videos exist:

chosen_video > rejected_video

The chosen output should be more faithful to product truth, more platform-native, more commercially readable, and easier to use or repair.
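
Preference pairs of this form can be derived from a reviewer's ranking of the candidates for one product. The sketch below assumes each candidate carries an integer rank where a lower number means preferred; the record layout and video identifiers are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_preference_pairs(candidates: List[Tuple[str, int]]) -> List[Dict[str, str]]:
    """Build chosen/rejected pairs from (video_id, rank) tuples.

    A lower rank means the reviewer preferred that video. Ties are
    skipped because they carry no preference signal.
    """
    pairs = []
    for (id_a, rank_a), (id_b, rank_b) in combinations(candidates, 2):
        if rank_a == rank_b:
            continue
        if rank_a < rank_b:
            chosen, rejected = id_a, id_b
        else:
            chosen, rejected = id_b, id_a
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Hypothetical ranking of three candidates for one product.
pairs = build_preference_pairs([("v1", 2), ("v2", 1), ("v3", 2)])
```

Each pair can then carry the reviewer's reason (for example, which QA label separated the two clips), so the ranking signal stays tied to product truth rather than to visual polish alone.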

4. Minimal Viable Experiment

CMV-24 supports a first experiment:

  1. Extract four keyframes per video: hook, product proof, usage/action, CTA/payoff.
  2. Label each keyframe and full video with QA labels.
  3. Generate two candidate variants per product, expanding the set to 72 videos.
  4. Create chosen/rejected preference pairs.
  5. Build a rule-based router baseline.
  6. Train a lightweight multi-label QA classifier.
  7. Measure retries, usable output rate, product drift, and human review time.
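
The measurements in step 7 can be sketched as a summary over per-task run logs. The record keys below (`usable`, `retries`, `product_drift`, `review_minutes`) are assumed names, and the four runs are hypothetical values, not experimental results. The snippet also spells out the arithmetic behind step 3: 24 originals plus two variants each gives 72 videos.

```python
from statistics import mean
from typing import Dict, List

ORIGINAL_VIDEOS = 24
VARIANTS_PER_PRODUCT = 2
# 24 originals + 24 * 2 variants = 72 videos.
EXPANDED_SET_SIZE = ORIGINAL_VIDEOS * (1 + VARIANTS_PER_PRODUCT)


def summarize_runs(runs: List[Dict[str, object]]) -> Dict[str, float]:
    """Aggregate per-task logs into the four metrics named in step 7."""
    if not runs:
        raise ValueError("at least one run is required")
    count = len(runs)
    return {
        "usable_output_rate": sum(bool(r["usable"]) for r in runs) / count,
        "mean_retries": mean(r["retries"] for r in runs),
        "product_drift_rate": sum(bool(r["product_drift"]) for r in runs) / count,
        "mean_review_minutes": mean(r["review_minutes"] for r in runs),
    }


# Hypothetical run logs for illustration only.
runs = [
    {"usable": True, "retries": 0, "product_drift": False, "review_minutes": 4},
    {"usable": True, "retries": 2, "product_drift": False, "review_minutes": 6},
    {"usable": False, "retries": 3, "product_drift": True, "review_minutes": 10},
    {"usable": True, "retries": 1, "product_drift": False, "review_minutes": 4},
]

summary = summarize_runs(runs)
```

Comparing this summary with and without the failure-memory retriever and router in the loop is one way to test whether the feedback structure reduces retries and review time.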

5. Limitations

CMV-24 is a seed dataset, not a benchmark. It lacks original model names, detailed timecode-level failure evidence, A/B preference pairs, and live campaign metrics. It should be treated as a reproducible structure for expanding a production dataset.

6. Ethics and Commercial Safety

Commerce multimodal training must avoid turning generated assets into false product evidence. PTFT tracks product truth and unsupported claims. Health, beauty, pet, supplement, and safety-related clips require conservative language and explicit compliance review.

7. Conclusion

Commercial multimodal training should begin with feedback structure. CMV-24 and PTFT show how product truth, platform fit, visual failure, compliance risk, and commercial usability can become trainable data for routing, QA, preference ranking, and rejection policies.

The most valuable training signal in commerce multimodal video is not the prompt itself, but the structured explanation of why a generated output is usable, risky, repairable, or rejected.

Dataset Appendix

24 AI video samples
17 TikTok / UGC assets
3 site hero assets
4 Amazon proof videos

Training Modules

01 Product Truth Card
Lock product category, visual anchors, channel goals, supported claims and claims to avoid.

02 Failure Memory
Store product category, platform, route, failed label, evidence and repair action for retrieval.

03 Model Router
Predict whether the task should use text-to-video, image-first, reference-image, keyframe-first or manual-edit-enhanced.

04 Video QA Scorer
Predict labels such as product shape risk, hand-object contact, action logic and compliance risk.

05 Preference Ranker
Rank candidate videos by product truth, platform fit, commercial readability and repairability.

QA Label Taxonomy

product_shape_risk: Product shape, logo, package, color, scale or material may drift.
hand_object_contact: Human hand and product contact may be unnatural.
action_logic: Product usage or physical action may be impossible or unclear.
scene_consistency: Person, pet, room, product, lighting or outfit may drift.
commercial_readability: Buyer may not understand product value without internal context.
platform_compliance_risk: Health, beauty, pet, supplement or safety claims may be risky.

Supporting Files

  • data/cmv24_metadata.csv: 24 annotated commerce AI video metadata records.
  • data/label_taxonomy.md: QA label definitions for commerce video review.
  • schema/cmv24.schema.json: Structured dataset field schema.
  • docs/training_protocol.md: Product-Truth Feedback Training protocol.
  • docs/paper.md: Release paper draft.
  • Dataset repository and sample video.

    The paper is fully readable on this page. The repository is kept as the data package for metadata, taxonomy, schema and reproducible protocol files.