📝 Publications

A full publication list is available on my Google Scholar page.

(*: equal contribution; †: corresponding authors.)

Visual Generation Alignment and its Application

[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou.
[code]

  • EvolveDirector leverages large vision-language models (VLMs) to evaluate visual generation results, guiding the evolution of a T2I model by dynamically refining the training dataset through selection and mutation (sketched below).
  • The trained T2I model, Edgen, powered by EvolveDirector, achieves SOTA performance using only 1% of the data typically required by other models.
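
A minimal sketch of this evolution loop, assuming hypothetical stand-ins (`advanced_t2i`, `vlm_prefers`, `mutate_prompt`, and the `student` interface) for the actual components:

```python
# Hypothetical sketch of EvolveDirector-style dataset evolution.
import random

def evolve_dataset(dataset, student, advanced_t2i, vlm_prefers, mutate_prompt, rounds=3):
    """Iteratively refine (prompt, image) training pairs for the student T2I model."""
    for _ in range(rounds):
        kept = []
        for prompt, image in dataset:
            student_img = student.generate(prompt)
            # Selection: keep pairs on which the student still lags the teacher image.
            if vlm_prefers(image, over=student_img, prompt=prompt):
                kept.append((prompt, image))
        # Mutation: expand the frontier with variations of surviving prompts.
        new_prompts = [mutate_prompt(p) for p, _ in random.sample(kept, k=min(8, len(kept)))]
        kept += [(p, advanced_t2i(p)) for p in new_prompts]
        dataset = kept
        student.finetune(dataset)
    return dataset
```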

[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[code] [Project page]

  • InstructVideo is the first work to instruct video diffusion models with human feedback.
  • InstructVideo significantly enhances the visual quality of generated videos without compromising generalization, while fine-tuning merely 0.1% of the parameters (see the sketch below).
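
A rough sketch of this style of reward fine-tuning, not the paper's exact recipe; `model`, `reward_fn`, and the `"to_out"` parameter filter are illustrative assumptions:

```python
# Rough sketch: freeze the diffusion model and fine-tune only a tiny
# parameter subset against a human-preference reward.
import torch

def reward_finetune_step(model, reward_fn, prompts, opt=None):
    # Unfreeze only attention output projections (an illustrative choice);
    # the paper reports fine-tuning merely ~0.1% of all parameters.
    for name, p in model.named_parameters():
        p.requires_grad = "to_out" in name
    opt = opt or torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-5
    )
    videos = model.sample(prompts)             # differentiable sampling assumed
    loss = -reward_fn(videos, prompts).mean()  # ascend the preference reward
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```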

Visual Generation / Editing

[arXiv] FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu.
[code] [Project page]

  • FreeScale introduces a tuning-free inference paradigm that enables high-resolution visual generation by fusing information from multiple receptive scales.
  • FreeScale unlocks 8K-resolution image and video generation.

[AAAI 2025] FreeMask: Rethinking the Importance of Attention Masks for Zero-shot Video Editing
Lingling Cai, Kang Zhao, Hangjie Yuan, Yingya Zhang, Shiwei Zhang, Kejie Huang.
[Project page] [code]

  • FreeMask uncovers a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep.
  • We quantify this variability and propose FreeMask to select masks suited to different video editing tasks (see the sketch below).
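
A hypothetical sketch of mask selection; the clarity metric and thresholding rule below are illustrative, not the paper's exact formulation:

```python
# Hypothetical sketch: score cross-attention masks for "clarity" and pick the
# best (layer, timestep) source before binarizing.
import torch

def mask_clarity(attn_map: torch.Tensor) -> float:
    """Higher when attention values are close to binary (clear fg/bg split)."""
    m = attn_map / attn_map.max().clamp(min=1e-8)
    return 1.0 - (4.0 * m * (1.0 - m)).mean().item()

def select_mask(attn_maps: dict):
    """attn_maps maps (layer, timestep) -> cross-attention map tensor."""
    best = max(attn_maps, key=lambda k: mask_clarity(attn_maps[k]))
    return best, attn_maps[best] > attn_maps[best].mean()  # key + binary mask
```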

[arXiv] DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, et al.
[Project page] [code] (to be updated)

  • DreamVideo-2 is the first zero-shot video customization framework capable of generating videos adhering to a specific subject and motion trajectory.

[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[code] [Project page]

  • DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.

[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou.
[code] [Project page] [ModelScope]

  • I2VGen-XL is the first publicly available foundation model for image-to-video generation.
  • I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) a base stage that generates low-resolution, semantically coherent videos, and 2) a refinement stage that enhances detail and upsamples to 1280×720 (sketched below).
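
A schematic of the cascade, with `base_model` and `refiner` as placeholders for the released components:

```python
# Schematic two-stage cascade; `base_model` and `refiner` stand in for the
# actual I2VGen-XL components, and the shapes shown are illustrative.
import torch

def i2v_cascade(image: torch.Tensor, base_model, refiner):
    # Stage 1: low-resolution video that stays semantically faithful to the image.
    low_res_video = base_model(image)                    # e.g., (T, 3, 256, 448)
    # Stage 2: enhance detail and upsample to the target resolution.
    hd_video = refiner(low_res_video, size=(720, 1280))  # (T, 3, 720, 1280)
    return hd_video
```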

[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[code] [Project page]

  • TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
  • TF-T2V proves effective for both text-to-video generation and compositional video synthesis.

[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[code] [Project page] [机器之心] [CVer]

  • VideoComposer pioneers controllable video synthesis, integrating textual, spatial and, crucially, temporal conditions through a unified interface for condition injection (sketched below).
  • VideoComposer can craft videos from a variety of control signals, from intricate hand-drawn sketches to specified motions.
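
A rough sketch of such a unified interface, with module names and dimensions chosen purely for illustration: each supplied condition is projected to a shared width and summed before injection into the video diffusion UNet.

```python
# Illustrative unified condition interface (names and dims are assumptions).
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    def __init__(self, cond_dims: dict, hidden: int = 1024):
        super().__init__()
        self.proj = nn.ModuleDict({k: nn.Linear(d, hidden) for k, d in cond_dims.items()})

    def forward(self, conds: dict) -> torch.Tensor:
        # Fuse whatever subset of conditions the user supplies.
        return torch.stack([self.proj[k](v) for k, v in conds.items()]).sum(dim=0)

fusion = ConditionFusion({"text": 768, "sketch": 512, "motion": 128})
fused = fusion({"text": torch.randn(2, 77, 768), "motion": torch.randn(2, 77, 128)})
print(fused.shape)  # torch.Size([2, 77, 1024]); injected into the UNet's attention
```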

[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers] [ModelScope]

  • ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale, and has been used by millions of people.
  • ModelScopeT2V has been integrated into Diffusers (see the usage example below).
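
A minimal usage example via Diffusers; the model id and return shapes reflect the Diffusers documentation at the time of writing and may change across versions:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()  # keeps peak GPU memory modest

frames = pipe("An astronaut riding a horse", num_inference_steps=25).frames[0]
export_to_video(frames, "astronaut.mp4")
```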

Visual Relation Detection (HOI Detection / Scene Graph Generation)

[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.
[code]

  • RLIPv2 elevates RLIP with a new language-image fusion mechanism designed for much larger data scales.
  • The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.

[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[code] [video talk]

  • RLIP is the first work to use relation texts as a language-image pre-training signal (see the sketch below).
  • RLIP-ParSe achieves SOTA results on HOI detection benchmarks under fully fine-tuned, few-shot, and zero-shot settings, as well as when learning from noisy labels.
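
A minimal, CLIP-style sketch of the core idea: align detected (subject, object) pair features with free-form relation text embeddings. The encoders and shapes are placeholders, not RLIP's actual architecture.

```python
# CLIP-style contrastive alignment between relation features and relation texts.
import torch
import torch.nn.functional as F

def relation_contrastive_loss(pair_feats, text_feats, temperature=0.07):
    """pair_feats: (N, D) subject-object pair features; text_feats: (N, D)."""
    pair_feats = F.normalize(pair_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = pair_feats @ text_feats.t() / temperature  # (N, N) similarities
    targets = torch.arange(len(logits))                 # i-th pair <-> i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```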

[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[code] [video talk]

  • OCN proposes a two-stage HOI detection method by decoupling entity detection and relation inference.
  • OCN incorporates language and statistical priors to facilitate verb inference.

Video Understanding

[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[code] [知乎] [将门创投] [video talk]

  • DIN proposes to perform spatio-temporal dynamic inference.
  • DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much lower computational overhead in the reasoning module.

[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]

  • This paper proposes to incorporate visual context when recognizing group activities, an idea that may also benefit related recognition problems.
  • This paper achieves SOTA results on the Volleyball and CAD benchmarks.

AI for Science and Engineering

[ICML 2024] PAPM: A Physics-aware Proxy Model for Process Systems
Pengwei Liu, Zhongkai Hao, Xingyu Ren, Hangjie Yuan, Jiayang Ren, Dong Ni.
[code]

  • PAPM is a pioneering work that fully incorporates partial prior physics of process systems to enable better generalization capabilities.

Incremental / Continual Learning

[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.
[code]

  • This paper proposes a response-only distillation method for incremental object detection, dubbed Elastic Response Distillation (ERD; see the sketch below).
  • It achieves SOTA results on the COCO benchmark while utilizing far fewer responses for distillation.
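
An illustrative response-level distillation loss, where the new model mimics the old model's classification and box responses; the "elastic" selection of which responses to distill is simplified here to a given mask:

```python
# Simplified response distillation: KL on classification logits plus L1 on
# box responses, restricted to an elastically selected subset.
import torch
import torch.nn.functional as F

def response_distillation_loss(student_cls, teacher_cls,
                               student_box, teacher_box,
                               keep_mask, T=2.0):
    """keep_mask: bool (N,) marking responses selected for distillation."""
    cls_loss = F.kl_div(
        F.log_softmax(student_cls[keep_mask] / T, dim=-1),
        F.softmax(teacher_cls[keep_mask] / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    box_loss = F.l1_loss(student_box[keep_mask], teacher_box[keep_mask])
    return cls_loss + box_loss
```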

Other Interesting Topics