GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

Xiaoyi Bao1,2,3*, Jindi Lv1,4*, Xiaofeng Wang1, Zheng Zhu1†, Xinze Chen1, Yukun Zhou1, Jiancheng Lv4, Xingang Wang2†, Guan Huang1
1GigaAI,  2Institute of Automation, Chinese Academy of Sciences,  3School of Artificial Intelligence, University of Chinese Academy of Sciences,  4School of Computer Science, Sichuan University 
*Equal Contribution  †Corresponding Authors 

Visualization of GigaVideo-1 performance. The left panel compares videos generated by the baseline Wan2.1 and our GigaVideo-1 on two evaluation dimensions. The right panel reports the performance of GigaVideo-1 and other state-of-the-art T2V models on VBench-2.0. With only ~4 GPU-hours of training, GigaVideo-1 achieves notable improvements over the baseline, demonstrating both effectiveness and efficiency.

Abstract

Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions such as instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve the fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy that adaptively weights samples using feedback from pre-trained vision-language models, together with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all dimensions, with an average gain of ~4% using only 4 GPU-hours. Requiring no manual annotations and only minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.
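To make the prompt-driven data engine concrete, below is a minimal sketch of how weakness-oriented prompts might be generated, assuming any LLM exposed as a plain text-completion callable. The dimension list, the prompt template, and the generate_video() call are illustrative assumptions, not the authors' released pipeline.

from typing import List

# Illustrative weak dimensions, taken from the comparison sections below.
WEAK_DIMENSIONS = ["mechanics", "material", "thermotics", "dynamic attribute"]

PROMPT_TEMPLATE = (
    "Write {n} short text-to-video prompts that specifically test the "
    "'{dimension}' dimension (e.g., physically plausible object behavior). "
    "Return one prompt per line."
)

def build_weakness_prompts(llm_complete, n_per_dim: int = 5) -> List[str]:
    """Ask an LLM for prompts targeting each weak dimension.

    llm_complete: any callable str -> str (e.g., a chat-completion wrapper).
    """
    prompts = []
    for dim in WEAK_DIMENSIONS:
        reply = llm_complete(PROMPT_TEMPLATE.format(n=n_per_dim, dimension=dim))
        prompts += [line.strip() for line in reply.splitlines() if line.strip()]
    return prompts

# Usage: synthesize training videos with a pre-trained T2V model, then mix
# them with real-caption-based samples to balance diversity and realism.
# for p in build_weakness_prompts(llm_complete):
#     video = generate_video(p)   # hypothetical T2V call (e.g., Wan2.1)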

Approach Overview

GigaVideo-1 Training Pipeline. Our pipeline consists of two components: a prompt-driven data engine and reward-guided optimization. On the left, we generate synthetic prompts targeting weak dimensions using LLMs and synthesize training videos via a pre-trained T2V model. These are combined with real-caption-based samples to balance diversity and realism. On the right, a frozen MLLM scores each video on dimension-specific criteria, and these scores guide training via a weighted denoising loss. For synthetic videos generated from real-world caption prompts, an extra realism constraint is applied for distribution alignment, as sketched in the code below. GigaVideo-1 enables efficient, automatic fine-tuning without manual labels or extra data collection.
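The reward-guided optimization can be summarized as a per-sample reward-weighted denoising loss. The following PyTorch sketch illustrates one training step under simplifying assumptions: a toy linear noise schedule, an epsilon-prediction model with signature model(noisy, t, cond), rewards precomputed by a frozen MLLM, and a simple feature-mean alignment standing in for the realism constraint. None of these names correspond to the released code.

import torch
import torch.nn.functional as F

def reward_weighted_step(model, latents, cond, rewards,
                         from_real_caption, real_latents=None,
                         lambda_realism: float = 0.1):
    """One training step with an MLLM-reward-weighted denoising loss.

    rewards:           (B,) dimension-specific scores from a frozen MLLM,
                       assumed normalized to [0, 1].
    from_real_caption: (B,) bool mask for videos synthesized from real-world
                       caption prompts (these receive the realism constraint).
    """
    b = latents.size(0)
    t = torch.randint(0, 1000, (b,), device=latents.device)       # timesteps
    noise = torch.randn_like(latents)
    alpha = (1.0 - t.float() / 1000).view(b, *([1] * (latents.dim() - 1)))
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise   # toy schedule

    pred = model(noisy, t, cond)                                  # predict noise
    per_sample = F.mse_loss(pred, noise, reduction="none")
    per_sample = per_sample.flatten(1).mean(dim=1)                # (B,)

    loss = (rewards * per_sample).mean()                          # reward weighting

    # Realism constraint, sketched as moment matching between synthetic
    # latents (from real-caption prompts) and real-video latents.
    if real_latents is not None and from_real_caption.any():
        fake_mu = latents[from_real_caption].flatten(1).mean(dim=0)
        real_mu = real_latents.flatten(1).mean(dim=0)
        loss = loss + lambda_realism * F.mse_loss(fake_mu, real_mu)
    return loss

Because the MLLM is frozen and only scores finished videos, the weighting is plain per-sample scaling of the standard denoising objective; no reward gradients flow through the scorer.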

Visual comparisons


GigaVideo-1 improves the performance of the baseline Wan2.1 across different real-world dimensions.


Mechanics

"A metal coin is gently placed on the surface of a shallow pool of water."


Wan2.1


GigaVideo-1


"A plastic toy is placed on the surface of a pond filled with water."


Wan2.1


GigaVideo-1

Material

"A clear glass of baking soda is gently poured into a glass of vinegar."



Wan2.1


GigaVideo-1


Camera Motion

"Alhambra, zoom out."


Wan2.1


GigaVideo-1


"Pyramid, pan right"


Wan2.1


GigaVideo-1


Thermotics


"A timelapse captures the gradual transformation of a block of cheese as the temperature rises to 60°C."


Wan2.1


GigaVideo-1


"A timelapse captures the gradual transformation of a piece of ice as the temperature rises to 10°C"


Wan2.1


GigaVideo-1


Dynamic Attribute


"An ant gradually grow big."


Wan2.1


GigaVideo-1


"A snowman changes from large to small."


Wan2.1


GigaVideo-1

Quantitative Results

BibTeX

If you use our work in your research, please cite:

@article{gigavideo1,
  title={GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning},
  author={Bao, Xiaoyi and Lv, Jindi and Wang, Xiaofeng and Zhu, Zheng and Chen, Xinze and Zhou, Yukun and Lv, Jiancheng and Wang, Xingang and Huang, Guan},
  journal={arXiv preprint arXiv:2506.10639},
  year={2025}
}