📝 Publications

A full publication list is available on my Google Scholar page.

(*: equal contribution; †: corresponding authors.)

Visual Generation Alignment and its Application

[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou.
[code]

  • EvolveDirector leverages large vision-language models (VLMs) to evaluate visual generation results, guiding the evolution of a T2I model by dynamically refining the training dataset through selection and mutation (sketched below).
  • The trained T2I model, Edgen, powered by EvolveDirector, achieves SOTA performance using only 1% of the data typically required by other models.
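
A minimal sketch of this evolution loop, assuming hypothetical stand-ins (`advanced_t2i`, `vlm_prefers`, `mutate_prompt`, and the `student` interface) for the actual components:

```python
# Hypothetical sketch of EvolveDirector-style dataset evolution.
import random

def evolve_dataset(dataset, student, advanced_t2i, vlm_prefers, mutate_prompt, rounds=3):
    """Iteratively refine (prompt, image) training pairs for the student T2I model."""
    for _ in range(rounds):
        kept = []
        for prompt, image in dataset:
            student_img = student.generate(prompt)
            # Selection: keep pairs on which the student still lags the teacher image.
            if vlm_prefers(image, over=student_img, prompt=prompt):
                kept.append((prompt, image))
        # Mutation: expand the frontier with variations of surviving prompts.
        new_prompts = [mutate_prompt(p) for p, _ in random.sample(kept, k=min(8, len(kept)))]
        kept += [(p, advanced_t2i(p)) for p in new_prompts]
        dataset = kept
        student.finetune(dataset)
    return dataset
```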

[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[code] [Project page]

  • InstructVideo is the first work to instruct video diffusion models with human feedback.
  • InstructVideo significantly enhances the visual quality of generated videos without compromising generalization, while fine-tuning merely 0.1% of the parameters (see the sketch below).
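
A rough sketch of this style of reward fine-tuning, not the paper's exact recipe; `model`, `reward_fn`, and the `"to_out"` parameter filter are illustrative assumptions:

```python
# Rough sketch: freeze the diffusion model and fine-tune only a tiny
# parameter subset against a human-preference reward.
import torch

def reward_finetune_step(model, reward_fn, prompts, opt=None):
    # Unfreeze only attention output projections (an illustrative choice);
    # the paper reports fine-tuning merely ~0.1% of all parameters.
    for name, p in model.named_parameters():
        p.requires_grad = "to_out" in name
    opt = opt or torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-5
    )
    videos = model.sample(prompts)             # differentiable sampling assumed
    loss = -reward_fn(videos, prompts).mean()  # ascend the preference reward
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```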

Visual Generation / Editing

[arXiv] FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu.
[code] [Project page]

  • FreeScale introduces a tuning-free inference paradigm that enables high-resolution visual generation by fusing information from multiple receptive scales.
  • FreeScale unlocks 8K-resolution image and video generation.

[AAAI 2025] FreeMask: Rethinking the Importance of Attention Masks for Zero-shot Video Editing
Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang.
[Project page] [code]

  • FreeMask uncovers a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep.
  • We quantify this variability and propose FreeMask to select masks suited to different video editing tasks (see the sketch below).
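
A hypothetical sketch of mask selection; the clarity metric and thresholding rule below are illustrative, not the paper's exact formulation:

```python
# Hypothetical sketch: score cross-attention masks for "clarity" and pick the
# best (layer, timestep) source before binarizing.
import torch

def mask_clarity(attn_map: torch.Tensor) -> float:
    """Higher when attention values are close to binary (clear fg/bg split)."""
    m = attn_map / attn_map.max().clamp(min=1e-8)
    return 1.0 - (4.0 * m * (1.0 - m)).mean().item()

def select_mask(attn_maps: dict):
    """attn_maps maps (layer, timestep) -> cross-attention map tensor."""
    best = max(attn_maps, key=lambda k: mask_clarity(attn_maps[k]))
    return best, attn_maps[best] > attn_maps[best].mean()  # key + binary mask
```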

[arXiv] DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, et al.
[Project page] [code] (to be updated)

  • DreamVideo-2 is the first zero-shot video customization framework capable of generating videos adhering to a specific subject and motion trajectory.

[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[code] [Project page]

  • DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.

[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou.
[code] [Project page] [ModelScope]

  • I2VGen-XL is the first publicly available foundation model for image-to-video generation.
  • I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) a base stage that generates low-resolution, semantically coherent videos, and 2) a refinement stage that enhances detail and upsamples to 1280×720 (sketched below).
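
A schematic of the cascade, with `base_model` and `refiner` as placeholders for the released components:

```python
# Schematic two-stage cascade; `base_model` and `refiner` stand in for the
# actual I2VGen-XL components, and the shapes shown are illustrative.
import torch

def i2v_cascade(image: torch.Tensor, base_model, refiner):
    # Stage 1: low-resolution video that stays semantically faithful to the image.
    low_res_video = base_model(image)                    # e.g., (T, 3, 256, 448)
    # Stage 2: enhance detail and upsample to the target resolution.
    hd_video = refiner(low_res_video, size=(720, 1280))  # (T, 3, 720, 1280)
    return hd_video
```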

[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[code] [Project page]

  • TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
  • TF-T2V proves effective for both text-to-video generation and compositional video synthesis.

[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[code] [Project page] [机器之心] [CVer]

  • VideoComposer pioneers controllable video synthesis, integrating textual, spatial and, crucially, temporal conditions through a unified interface for condition injection (sketched below).
  • VideoComposer can craft videos from a variety of control signals, from intricate hand-drawn sketches to specified motions.
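
A rough sketch of such a unified interface, with module names and dimensions chosen purely for illustration: each supplied condition is projected to a shared width and summed before injection into the video diffusion UNet.

```python
# Illustrative unified condition interface (names and dims are assumptions).
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    def __init__(self, cond_dims: dict, hidden: int = 1024):
        super().__init__()
        self.proj = nn.ModuleDict({k: nn.Linear(d, hidden) for k, d in cond_dims.items()})

    def forward(self, conds: dict) -> torch.Tensor:
        # Fuse whatever subset of conditions the user supplies.
        return torch.stack([self.proj[k](v) for k, v in conds.items()]).sum(dim=0)

fusion = ConditionFusion({"text": 768, "sketch": 512, "motion": 128})
fused = fusion({"text": torch.randn(2, 77, 768), "motion": torch.randn(2, 77, 128)})
print(fused.shape)  # torch.Size([2, 77, 1024]); injected into the UNet's attention
```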

[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers] [ModelScope]

  • ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale, and has been used by millions of people.
  • ModelScopeT2V has been integrated into Diffusers (see the usage example below).
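
A minimal usage example via Diffusers; the model id and return shapes reflect the Diffusers documentation at the time of writing and may change across versions:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()  # keeps peak GPU memory modest

frames = pipe("An astronaut riding a horse", num_inference_steps=25).frames[0]
export_to_video(frames, "astronaut.mp4")
```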

Visual Relation Detection (HOI Detection / Scene Graph Generation)

[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.
[code]

  • RLIPv2 elevates RLIP with a new language-image fusion mechanism designed for much larger data scales.
  • The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.

[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[code] [video talk]

  • RLIP is the first work to use relation texts as a language-image pre-training signal (see the sketch below).
  • RLIP-ParSe achieves SOTA results on HOI detection benchmarks under fully fine-tuned, few-shot, and zero-shot settings, as well as when learning from noisy labels.
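
A minimal, CLIP-style sketch of the core idea: align detected (subject, object) pair features with free-form relation text embeddings. The encoders and shapes are placeholders, not RLIP's actual architecture.

```python
# CLIP-style contrastive alignment between relation features and relation texts.
import torch
import torch.nn.functional as F

def relation_contrastive_loss(pair_feats, text_feats, temperature=0.07):
    """pair_feats: (N, D) subject-object pair features; text_feats: (N, D)."""
    pair_feats = F.normalize(pair_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = pair_feats @ text_feats.t() / temperature  # (N, N) similarities
    targets = torch.arange(len(logits))                 # i-th pair <-> i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```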

[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[code] [video talk]

  • OCN proposes a two-stage HOI detection method by decoupling entity detection and relation inference.
  • OCN incorporates language and statistical priors to facilitate verb inference.

Video Understanding

[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[code] [知乎] [将门创投] [video talk]

  • DIN proposes to perform spatio-temporal dynamic inference.
  • DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much lower computational overhead in the reasoning module.

[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]

  • This paper proposes to incorporate visual context when recognizing group activities, an idea that may also benefit related recognition problems.
  • This paper achieves SOTA results on the Volleyball and CAD benchmarks.

AI for Science and Engineering

[ICML 2024] PAPM: A Physics-aware Proxy Model for Process Systems
Pengwei Liu, Zhongkai Hao, Xingyu Ren, Hangjie Yuan, Jiayang Ren, Dong Ni.
[code]

  • PAPM is a pioneering work that fully incorporates partial prior physics of process systems to enable better generalization capabilities.

Incremental / Continual Learning

[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.
[code]

  • This paper proposes a response-only distillation method for incremental object detection, dubbed Elastic Response Distillation (ERD; see the sketch below).
  • It achieves SOTA results on the COCO benchmark while utilizing far fewer responses for distillation.
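
An illustrative response-level distillation loss, where the new model mimics the old model's classification and box responses; the "elastic" selection of which responses to distill is simplified here to a given mask:

```python
# Simplified response distillation: KL on classification logits plus L1 on
# box responses, restricted to an elastically selected subset.
import torch
import torch.nn.functional as F

def response_distillation_loss(student_cls, teacher_cls,
                               student_box, teacher_box,
                               keep_mask, T=2.0):
    """keep_mask: bool (N,) marking responses selected for distillation."""
    cls_loss = F.kl_div(
        F.log_softmax(student_cls[keep_mask] / T, dim=-1),
        F.softmax(teacher_cls[keep_mask] / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    box_loss = F.l1_loss(student_box[keep_mask], teacher_box[keep_mask])
    return cls_loss + box_loss
```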

Other Interesting Topics