GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning

Xiaoyi Bao1,2,3*, Jindi Lv1,4*, Xiaofeng Wang1, Zheng Zhu1†, Xinze Chen1, YuKun Zhou1, Jiancheng Lv4, Xingang Wang2†, Guan Huang1
1GigaAI,  2Institute of Automation, Chinese Academy of Sciences,  3School of Artificial Intelligence, University of Chinese Academy of Sciences,  4School of Computer Science, Sichuan University 
*Equal Contribution  †Corresponding Authors 

Visualization of GigaVideo-1 performance. The left figure compares videos generated by the baseline Wan2.1 and our GigaVideo-1 across two different dimensions. The right figure provides the performance of GigaVideo-1 and other state-of-the-art T2V models on VBench-2.0. With only ~4 GPU-hours of training, GigaVideo-1 achieves notable improvements over the baseline, demonstrating both effectiveness and efficiency.

Abstract

Recent progress in diffusion models has greatly enhanced video generation quality, yet these models still require fine-tuning to improve specific dimensions like instance preservation, motion rationality, composition, and physical plausibility. Existing fine-tuning approaches often rely on human annotations and large-scale computational resources, limiting their practicality. In this work, we propose GigaVideo-1, an efficient fine-tuning framework that advances video generation without additional human supervision. Rather than injecting large volumes of high-quality data from external sources, GigaVideo-1 unlocks the latent potential of pre-trained video diffusion models through automatic feedback. Specifically, we focus on two key aspects of the fine-tuning process: data and optimization. To improve fine-tuning data, we design a prompt-driven data engine that constructs diverse, weakness-oriented training samples. On the optimization side, we introduce a reward-guided training strategy, which adaptively weights samples using feedback from pre-trained vision-language models with a realism constraint. We evaluate GigaVideo-1 on the VBench-2.0 benchmark using Wan2.1 as the baseline across 17 evaluation dimensions. Experiments show that GigaVideo-1 consistently improves performance on almost all the dimensions with an average gain of ~4% using only 4 GPU-hours. Requiring no manual annotations and minimal real data, GigaVideo-1 demonstrates both effectiveness and efficiency. Code, model, and data will be publicly available.

Approach Overview

GigaVideo-1 Training Pipeline. Our pipeline consists of two components: a prompt-driven data engine and reward-guided optimization. On the left, we use LLMs to generate synthetic prompts that target weak dimensions, and synthesize training videos from them with a pre-trained T2V model. These are combined with samples based on real captions to balance diversity and realism. On the right, a frozen MLLM scores each video on dimension-specific criteria. These scores guide training via a weighted denoising loss. For synthetic videos generated from real-world caption prompts, an extra realism constraint is applied for distribution alignment. GigaVideo-1 enables efficient, automatic fine-tuning without manual labels or extra data collection.
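To make the reward-guided step more concrete, here is a minimal pure-Python sketch of a reward-weighted denoising loss. It is an illustration under stated assumptions, not the authors' implementation: the names `Sample`, `reward_weights`, `weighted_denoising_loss`, and `realism_coeff` are hypothetical, the reward is assumed to be a non-negative scalar per video, the weights are assumed to be rewards normalized over the batch, and the realism constraint is stood in for by a precomputed scalar penalty that is added only for samples whose prompt came from a real-world caption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One training video, reduced to toy quantities (all fields hypothetical)."""
    pred_noise: List[float]        # model's noise prediction, flattened
    true_noise: List[float]        # noise that was actually added to the latent
    reward: float                  # score from the frozen MLLM (assumed >= 0)
    from_real_caption: bool        # True if the prompt came from a real caption
    realism_penalty: float = 0.0   # stand-in for the realism constraint term


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reward_weights(rewards: Sequence[float]) -> List[float]:
    """Normalize rewards so they sum to 1 (assumed weighting scheme)."""
    total = sum(rewards)
    if total <= 0:
        # Fall back to uniform weights if every reward is zero.
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


def weighted_denoising_loss(batch: Sequence[Sample],
                            realism_coeff: float = 0.1) -> float:
    """Reward-weighted denoising loss over a batch.

    Each sample's MSE denoising loss is scaled by its normalized reward.
    Samples from real-caption prompts additionally receive
    realism_coeff * realism_penalty before weighting (an assumption).
    """
    weights = reward_weights([s.reward for s in batch])
    loss = 0.0
    for w, s in zip(weights, batch):
        term = mse(s.pred_noise, s.true_noise)
        if s.from_real_caption:
            term += realism_coeff * s.realism_penalty
        loss += w * term
    return loss


if __name__ == "__main__":
    batch = [
        Sample([0.0, 0.0], [1.0, 1.0], reward=0.75, from_real_caption=False),
        Sample([0.0, 0.0], [2.0, 0.0], reward=0.25, from_real_caption=True,
               realism_penalty=2.0),
    ]
    print(weighted_denoising_loss(batch, realism_coeff=0.5))
```

In this toy setting, higher-scoring videos contribute more to the gradient signal, while the realism term only affects samples tied to real captions. In an actual training loop the per-sample losses would come from the video diffusion model and the penalty from whatever realism constraint is used; both are left abstract here.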

Visual comparisons


GigaVideo-1 improves the performance of our baseline Wan2.1 across different real-world dimensions.

Camera_Motion

CASE1

"Alhambra, zoom out"

Wan2.1

GigaVideo-1


CASE2

"Garden, First-person perspective, oblique shot, airborne dolly movement"

Wan2.1

GigaVideo-1


CASE3

"Machu Picchu, zoom in"

Wan2.1

GigaVideo-1


CASE4

"Pyramid, pan right"

Wan2.1

GigaVideo-1


Complex_Landscape

CASE1

"The camera enters a golden autumn forest, where the leaves have turned brilliant shades from gold to orange-red. A few leaves drift down with the wind. Sunlight filters through the..."

Wan2.1

GigaVideo-1


CASE2

"The camera begins in a vast grassland, where the lush green grass sways gently in the breeze, the air fresh, and the soft rustling of the leaves fills the space. As the camera move..."

Wan2.1

GigaVideo-1


CASE3

"The camera gently descends, passing through the layers of waves, entering the deep underwater world. The surrounding coral reefs are vibrant and colorful, with a variety of tropica..."

Wan2.1

GigaVideo-1


CASE4

"The camera moves through the vast expanse of the universe, where stars twinkle against the dark backdrop, the Milky Way arcing like a silver river across the sky. Nebulae slowly ro..."

Wan2.1

GigaVideo-1


Complex_Plot

CASE1

"In an ancient tomb deep in the mountains, the tomb raiders discovered a massive dragon bone structure. Legend had it that the skeleton belonged to an ancient dragon god. Captain Ol..."

Wan2.1

GigaVideo-1


CASE2

"It is said that the ancient Dragon Tribe left behind a vast treasure, accessible only through a series of trials. Young adventurer Lucas set off on his own journey to find this leg..."

Wan2.1

GigaVideo-1


CASE3

"Little Red Riding Hood brings food to visit her sick grandmother and encounters a cunning wolf along the way. The wolf pretends not to know her and guides her down a longer path. A..."

Wan2.1

GigaVideo-1


CASE4

"The race began, and the runners quickly started. The Team A runner took the lead initially due to a powerful start. However, the Team B runner did not rush to chase but instead ste..."

Wan2.1

GigaVideo-1


Composition

CASE1

"A crocodile with the arms of a gorilla, the legs of a cheetah, the scales of a snake, and the eyes of a chameleon, an apex predator in both land and water."

Wan2.1

GigaVideo-1


CASE2

"A giraffe with the body of a whale, the legs of a kangaroo, and the tail of a flamingo, allowing it to leap from one ocean wave to another with remarkable grace."

Wan2.1

GigaVideo-1


CASE3

"A giraffe with the wings of a bat, soaring above the trees in a mysterious flight."

Wan2.1

GigaVideo-1


CASE4

"A lion with the wings of an eagle, soaring through the sky with majestic ease."

Wan2.1

GigaVideo-1


Dynamic_Attribute

CASE1

"The wooden toy turned into a glass toy."

Wan2.1

GigaVideo-1


CASE2

"A snowman changes from large to small."

Wan2.1

GigaVideo-1


CASE3

"An ant gradually grow big."

Wan2.1

GigaVideo-1


CASE4

"A star changes from faint to bright."

Wan2.1

GigaVideo-1


Dynamic_Spatial_Relationship

CASE1

"A cat is on the left of a chair, then the cat runs to the front of the chair."

Wan2.1

GigaVideo-1


CASE2

"A cat is on the right of a rock, then the cat runs to the left of the rock."

Wan2.1

GigaVideo-1


CASE3

"A kangaroo is in front of a basket, then the kangaroo jumps to the right of the basket."

Wan2.1

GigaVideo-1


CASE4

"A squirrel is behind a rock, then the squirrel jumps to the left of the rock."

Wan2.1

GigaVideo-1


Human Anatomy

CASE1

"A man is playing basketball."

Wan2.1

GigaVideo-1


CASE2

"A man is playing football."

Wan2.1

GigaVideo-1


CASE3

"A person is sitting in a chair, then they suddenly get up and start stretching."

Wan2.1

GigaVideo-1


CASE4

"Two people are exchanging keys."

Wan2.1

GigaVideo-1


Human Interaction

CASE1

"One person places a blanket over another person."

Wan2.1

GigaVideo-1


CASE2

"One person adjusts the collar of another person’s shirt."

Wan2.1

GigaVideo-1


CASE3

"One person adjusts the glasses of another."

Wan2.1

GigaVideo-1


Human_Clothes

CASE1

"A man is playing badminton."

Wan2.1

GigaVideo-1


CASE2

"A man is doing yoga."

Wan2.1

GigaVideo-1


CASE3

"A person is working on a project, then they suddenly start cooking dinner."

Wan2.1

GigaVideo-1


Human_Identity

CASE1

"A person is sitting at the table, then they suddenly start drawing on a notepad."

Wan2.1

GigaVideo-1


CASE2

"A person is drinking a glass of water, then they suddenly start cleaning the windows."

Wan2.1

GigaVideo-1


CASE3

"A person is drinking tea, then they suddenly start folding the laundry."

Wan2.1

GigaVideo-1


CASE4

"A person is reading the news, then they suddenly start watering the plants."

Wan2.1

GigaVideo-1


Instance_preservation

CASE3

"A man and a woman is doing yoga."

Wan2.1

GigaVideo-1


Material

CASE1

"A clear glass of baking soda is gently poured into a glass of vinegar."

Wan2.1

GigaVideo-1


CASE2

"A clear glass of coffee is gently poured into a glass of milk."

Wan2.1

GigaVideo-1


CASE3

"A clear glass of flour is gently poured into a glass of water."

Wan2.1

GigaVideo-1


CASE4

"A small burning candle was thrown into a pile of dry twigs."

Wan2.1

GigaVideo-1


Mechanics

CASE1

"A bottle of water is opened in the space station, and the water starts to float out in irregular shapes."

Wan2.1

GigaVideo-1


CASE2

"A cork is placed on the surface of a bucket filled with water."

Wan2.1

GigaVideo-1


CASE3

"A metal coin is gently placed on the surface of a shallow pool of water."

Wan2.1

GigaVideo-1


CASE4

"A plastic toy is placed on the surface of a pond filled with water."

Wan2.1

GigaVideo-1


Motion Order Understanding

CASE1

"A dog is lying in the sun, then it suddenly jumps up and starts playing with its owner."

Wan2.1

GigaVideo-1


CASE2

"A horse is standing in the stable, then it suddenly starts chewing hay."

Wan2.1

GigaVideo-1


CASE3

"A dog is running in the yard, then it suddenly starts sitting under a tree."

Wan2.1

GigaVideo-1


Motion Rationality

CASE1

"A person is drinking a smoothie from a glass."

Wan2.1

GigaVideo-1


CASE2

"A person is painting."

Wan2.1

GigaVideo-1


CASE3

"A person is pouring olive oil into a frying pan."

Wan2.1

GigaVideo-1


Multi-view-consistency

CASE1

"The camera orbits around. Bathtub, the camera circles around."

Wan2.1

GigaVideo-1


CASE2

"The camera orbits around. Birdhouse, the camera circles around."

Wan2.1

GigaVideo-1


CASE3

"The camera orbits around. Castle, the camera circles around."

Wan2.1

GigaVideo-1


CASE4

"The camera orbits around. Clock Tower, the camera circles around."

Wan2.1

GigaVideo-1


Thermotics

CASE1

"A timelapse captures the gradual transformation of a block of cheese as the temperature rises to 60°C"

Wan2.1

GigaVideo-1


CASE2

"A timelapse captures the gradual transformation of a piece of ice as the temperature rises to 10°C"

Wan2.1

GigaVideo-1


CASE3

"A timelapse captures the transformation as steam from a boiling pot comes into contact with a cold tile wall"

Wan2.1

GigaVideo-1


CASE4

"A timelapse captures the transformation of water in a pot as the temperature reaches 130°C"

Wan2.1

GigaVideo-1


Quantitative Results

BibTeX

If you use our work in your research, please cite:

@article{gigavideo1,
  title={GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning},
  author={Bao, Xiaoyi and Lv, Jindi and Wang, Xiaofeng and Zhu, Zheng and Chen, Xinze and Zhou, Yukun and Lv, Jiancheng and Wang, Xingang and Huang, Guan},
  journal={arXiv preprint arXiv:2506.10639},
  year={2025}
}