GigaVideo-1
Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions such as instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve the fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy that adaptively weights samples using feedback from pre-trained vision-language models under a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark across 17 evaluation dimensions, using Wan2.1 as the baseline. Experiments show that GigaVideo-1 consistently improves performance on almost all dimensions, with an average gain of about 4% using only 4 GPU-hours. Requiring no manual annotations and only minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
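As a rough illustration of the data side described above, the sketch below outlines a prompt-driven data engine: an LLM writes prompts that target weak dimensions, a pre-trained T2V model renders videos for them, and a frozen MLLM scores each video; real-caption prompts are mixed in for realism. All names here (generate_weakness_prompts, t2v_model, mllm, the sample dictionary layout) are hypothetical placeholders for illustration, not the released GigaVideo-1 API.

```python
# Minimal sketch of a prompt-driven data engine (hypothetical interfaces).

def generate_weakness_prompts(llm, dimension, n=50):
    """Ask an LLM for prompts that stress a weak evaluation dimension,
    e.g. 'physical plausibility' or 'motion rationality'."""
    instruction = (
        f"Write {n} short text-to-video prompts that specifically test "
        f"the '{dimension}' capability of a video generator."
    )
    return llm(instruction)  # assumed to return a list of prompt strings


def build_training_set(llm, t2v_model, mllm, weak_dims, real_captions):
    """Combine weakness-oriented synthetic prompts with real-caption prompts."""
    samples = []
    # Synthetic, weakness-oriented samples.
    for dim in weak_dims:
        for prompt in generate_weakness_prompts(llm, dim):
            video = t2v_model(prompt)           # sample from the pre-trained T2V model
            score = mllm(video, criterion=dim)  # dimension-specific reward from a frozen MLLM
            samples.append({"prompt": prompt, "video": video,
                            "reward": score, "from_real_caption": False})
    # Real-caption samples to keep the data close to the real distribution.
    for caption in real_captions:
        video = t2v_model(caption)
        score = mllm(video, criterion="overall")
        samples.append({"prompt": caption, "video": video,
                        "reward": score, "from_real_caption": True})
    return samples
```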
GigaVideo-1 Training Pipeline. Our pipeline consists of two components: a prompt-driven data engine and reward-guided optimization. On the left, we generate synthetic prompts targeting weak dimensions using LLMs and synthesize training videos via a pre-trained T2V model. These are combined with real-caption-based samples to balance diversity and realism. On the right, a frozen MLLM scores each video on dimension-specific criteria, and these scores guide training via a weighted denoising loss. For synthetic videos generated from real-world caption prompts, an extra realism constraint is applied for distribution alignment. GigaVideo-1 enables efficient, automatic fine-tuning without manual labels or extra data collection.
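The reward-guided optimization in the pipeline can be pictured as a per-sample weighted denoising loss with an extra realism term for real-caption samples. The sketch below is a minimal stand-in under assumed details: the noise schedule, the reward normalization, the UNet call signature, and realism_penalty are illustrative choices for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def add_noise(latents, noise, t, num_steps=1000):
    # Simple linear-alpha stand-in for the diffusion forward process.
    alpha = 1.0 - t.float() / num_steps
    alpha = alpha.view(-1, 1, 1, 1, 1)  # broadcast over (C, T, H, W)
    return alpha.sqrt() * latents + (1.0 - alpha).sqrt() * noise


def realism_penalty(latents, real_mean=0.0, real_std=1.0):
    # Crude distribution-alignment term: match first/second moments
    # of an assumed real-video latent distribution.
    return (latents.mean() - real_mean) ** 2 + (latents.std() - real_std) ** 2


def reward_weighted_loss(unet, batch, lambda_realism=0.1):
    """One training step: per-sample denoising loss weighted by the frozen MLLM
    reward, plus a realism term for samples whose prompts come from real captions."""
    latents, cond = batch["latents"], batch["text_emb"]      # (B, C, T, H, W), text conditioning
    rewards, mask = batch["reward"], batch["from_real_caption"]

    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    pred = unet(add_noise(latents, noise, t), t, cond)       # assumed denoiser signature

    # Per-sample MSE denoising loss, weighted by normalized reward scores.
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3, 4))
    weights = rewards / rewards.mean().clamp_min(1e-6)
    loss = (weights.detach() * per_sample).mean()

    # Realism constraint for real-caption samples (distribution alignment).
    if mask.any():
        loss = loss + lambda_realism * realism_penalty(latents[mask])
    return loss
```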
GigaVideo-1 improves the performance of our baseline Wan2.1 across different real-world dimensions.
"A metal coin is gently placed on the surface of a shallow pool of water."
Wan2.1
GigaVideo-1
"A plastic toy is placed on the surface of a pond filled with water."
Wan2.1
GigaVideo-1
"A clear glass of baking soda is gently poured into a glass of vinegar."
Wan2.1
GigaVideo-1
"Alhambra, zoom out."
Wan2.1
GigaVideo-1
"Pyramid, pan right"
Wan2.1
GigaVideo-1
"A timelapse captures the gradual transformation of a block of cheese as the temperature rises to 60°C."
Wan2.1
GigaVideo-1
"A timelapse captures the gradual transformation of a piece of ice as the temperature rises to 10°C"
Wan2.1
GigaVideo-1
"An ant gradually grow big."
Wan2.1
GigaVideo-1
"A snowman changes from large to small."
Wan2.1
GigaVideo-1
If you use our work in your research, please cite:
@article{gigavideo1,
  title={GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning},
  author={Bao, Xiaoyi and Lv, Jindi and Wang, Xiaofeng and Zhu, Zheng and Chen, Xinze and Zhou, Yukun and Lv, Jiancheng and Wang, Xingang and Huang, Guan},
  journal={arXiv preprint arXiv:2506.10639},
  year={2025}
}