📝 Publications
A full publication list is available on my Google Scholar page.
(*: equal contribution; †: corresponding authors.)
Visual Generation Alignment and its Application

[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou.
- EvolveDirector leverages large vision-language models (VLMs) to evaluate visual generation results, guiding the evolution of a T2I model by dynamically refining the training dataset through selection and mutation.
- The trained T2I model, Edgen, powered by EvolveDirector, achieves SOTA performance using only 1% of the data typically required by other models.

[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[Project page]
- InstructVideo is the first research attempt to instruct video diffusion models with human feedback.
- InstructVideo significantly enhances the visual quality of generated videos without compromising generalization capabilities, with merely 0.1% of the parameters being fine-tuned.
Visual Generation / Editing

[arXiv] FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu.
[Project page]
- FreeScale introduces a groundbreaking tuning-free inference paradigm that enables high-resolution visual generation by seamlessly fusing information from multiple receptive scales.
- FreeScale unlocks the potential for generating 8k-resolution images and videos, setting a new benchmark in the field.

[AAAI 2025] FreeMask: Rethinking the Importance of Attention Masks for Zero-shot Video Editing
Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang.
[Project page] [code]
- FreeMask uncovers a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep.
- We quantify this variability and propose FreeMask to select optimal masks for various video editing tasks.

[arXiv] DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, et al.
[Project page] [code] (to be updated)
- DreamVideo-2 is the first zero-shot video customization framework capable of generating videos adhering to a specific subject and motion trajectory.

[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[Project page]
- DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.

[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou.
[Project page]
[ModelScope]
- I2VGen-XL is the first publicly available foundation model for image-to-video generation.
- I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) the base stage that generates low-resolution semantically coherent videos, and 2) the refinement stage that enhances the video’s details and improves the resolution to 1280×720.

[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[Project page]
- TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
- TF-T2V is proven effective for both text-to-video generation and compositional video synthesis.
[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[Project page]
[机器之心]
[CVer]
- VideoComposer pioneers controllable video synthesis, seamlessly integrating textual, spatial, and, crucially, temporal conditions through a unified interface for information injection.
- VideoComposer can craft videos using various input control signals, from intricate hand-drawn sketches to defined motions.
[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers]
[ModelScope]
- ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale, and it has been used by millions of people.
- ModelScopeT2V is selected for inclusion in Diffusers.
Visual Relation Detection (HOI Detection / Scene Graph Generation)

[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.
- RLIPv2 elevates RLIP by leveraging a new language-image fusion mechanism, designed for expansive data scales.
- The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.
[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[video talk]
- RLIP is the first work to use relation texts as a language-image pre-training signal.
- RLIP-ParSe achieves SOTA results on fully fine-tuned, few-shot, and zero-shot HOI detection benchmarks, as well as when learning from noisy labels.
[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[video talk]
- OCN proposes a two-stage HOI detection method by decoupling entity detection and relation inference.
- OCN incorporates language and statistical priors to facilitate verb inference.
Video Understanding
[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[知乎]
[将门创投]
[video talk]
- DIN proposes to perform spatio-temporal dynamic inference.
- DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much less computational overhead in the reasoning module.
[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]
- This paper proposes incorporating visual context when recognizing group activities; the idea should carry over to other related problems as well.
- This paper achieves SOTA results on the Volleyball and CAD benchmarks.
- [arXiv] Few-shot Action Recognition with Captioning Foundation Models, Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang.
AI for Science and Engineering

[ICML 2024] PAPM: A Physics-aware Proxy Model for Process Systems
Pengwei Liu, Zhongkai Hao, Xingyu Ren, Hangjie Yuan, Jiayang Ren, Dong Ni.
[code]
- PAPM is a pioneering work that fully incorporates partial prior physics of process systems to enable better generalization capabilities.
- [ICLR 2024] LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition, Lingfeng Liu, Dong Ni, Hangjie Yuan. [code]
Incremental / Continual Learning
[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.
- This paper proposes a response-only distillation method for Incremental Object Detection, dubbed Elastic Response Distillation.
- This paper achieves SOTA results on COCO benchmarks while utilizing far fewer responses for distillation.
- [NeurIPS 2024] Make Continual Learning Stronger via C-Flat, Ang Bian, Wei Li, Hangjie Yuan, Chengrong Yu, Zixiang Zhao, Mang Wang, Aojun Lu, Tao Feng. [code] (to be updated)
- [IJCAI 2024] Revisiting Neural Networks for Continual Learning: An Architectural Perspective, Aojun Lu, Tao Feng, Hangjie Yuan, Xiaotian Song, Yanan Sun.
- [arXiv] Progressive Learning without Forgetting, Tao Feng, Hangjie Yuan, Mang Wang, Ziyuan Huang, Ang Bian, Jianzhou Zhang.
Other Interesting Topics
- [WACV 2024] From Denoising Training to Test-Time Adaptation: Enhancing Domain Generalization for Medical Image Segmentation, Ruxue Wen, Hangjie Yuan, Dong Ni, Wenbo Xiao, Yaoyao Wu. [code]