😃 Welcome to my personal page!

I am Hangjie Yuan (袁杭杰 in Chinese), currently pursuing my Ph.D. at Zhejiang University and serving as a long-term research intern at Alibaba DAMO Academy. I am fortunate to be supervised by Prof. Dong Ni at ZJU, who is open-minded and wise. Additionally, I am undertaking a visiting Ph.D. program at MMLab@NTU, supervised by Prof. Ziwei Liu, and I am also fortunate to be supervised by Prof. Samuel Albanie at the University of Cambridge. As part of Alibaba’s Research Intern Program, I am/was supervised by Deli Zhao, Shiwei Zhang, Jianwen Jiang, and Mang Wang. They are brilliant!

My representative projects include InstructVideo, VideoComposer, and the RLIP series (RLIP & RLIPv2).

My current research interests include:

  • 1️⃣ Video Synthesis/Understanding,
  • 2️⃣ Human-Object Interaction Detection/Scene Graph Generation, and
  • 3️⃣ AI for Science and Engineering.

Feel free to drop me an email 📧 if you are interested in having a chat or collaborating with me.

I am currently seeking job opportunities and anticipate graduating in 2024. I am exploring options in both academia, such as postdoctoral positions, and industry, focusing on research-oriented roles. I would be delighted to discuss any job openings or research projects. Please don’t hesitate to contact me at hj.yuan@zju.edu.cn.

🔥 News

  • 2024-02 : 📑 InstructVideo, DreamVideo and TF-T2V are accepted to CVPR 2024. Thrilled to have collaborated with my co-authors on these promising projects.

  • 2024-01 : 👑 I am honored to receive the Outstanding Research Intern Award (Top 20 of 1000+ candidates) for my contributions to video generation at Alibaba.

  • 2024-01 : 📑 LUM-ViT is accepted to ICLR 2024. Congrats to Lingfeng Liu!

  • 2023-09 : 📑 VideoComposer is accepted to NeurIPS 2023. Thrilled to have collaborated with my co-authors on this project.

  • 2023-08 : 📑 RLIPv2 is accepted to ICCV 2023. Code and models are publicly available here!

  • 2023-08 : 🏡 We release ModelScopeT2V (the default T2V model in Diffusers) and VideoComposer, two foundation models for video generation.

  • 2022-09 : 📑 RLIP: Relational Language-Image Pre-training is accepted to NeurIPS 2022 as a Spotlight paper (Top 5%). It’s my honor to work with Samuel and Jianwen. Btw, the pronunciation of RLIP is /ˈɑːlɪp/.

  • 2022-05 : 📑 Elastic Response Distillation is accepted to CVPR 2022. A great pleasure to work with Tao Feng and Mang Wang.

  • 2022-02 : 👑 I am awarded an AAAI-22 Scholarship. Many thanks to AAAI!

  • 2021-12 : 📑 Object-guided Cross-modal Calibration Network is accepted to AAAI 2022. A great pleasure to work with Mang Wang.

  • 2021-07 : 📑 Spatio-Temporal Dynamic Inference Network is accepted to ICCV 2021.

  • 2021-03 : 👷 I start my internship at Alibaba DAMO Academy.

  • 2020-12 : 📑 Learning Visual Context (for Group Activity Recognition) is accepted to AAAI 2021.

📝 Publications

A full publication list is available on my Google Scholar page.

(* denotes equal contribution.)

Video Generation


[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[Project page]

  • InstructVideo is the first research attempt that instructs video diffusion models with human feedback.
  • InstructVideo significantly enhances the visual quality of generated videos without compromising generalization capabilities, with merely 0.1% of the parameters being fine-tuned.

[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou.
[Project page] [ModelScope]

  • I2VGen-XL is the first publicly available foundation model for image-to-video generation.
  • I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) the base stage that generates low-resolution semantically coherent videos, and 2) the refinement stage that enhances the video’s details and improves the resolution to 1280×720.

[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[Project page]

  • TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
  • TF-T2V is proven effective for both text-to-video generation and compositional video synthesis.

[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[Project page]

  • DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.

[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[Project page] [机器之心] [CVer]

  • VideoComposer pioneers controllable video synthesis, seamlessly integrating textual, spatial, and crucially, temporal conditions, through a unified interface for information injection.
  • VideoComposer can craft videos using various input control signals, from intricate hand-drawn sketches to defined motions.

[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers] [ModelScope]

  • ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale, which has been used by millions of people.
  • ModelScopeT2V is selected for inclusion in Diffusers.

HOI Detection / Scene Graph Generation


[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.

  • RLIPv2 elevates RLIP by leveraging a new language-image fusion mechanism, designed for expansive data scales.
  • The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.

[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[video talk]

  • RLIP is the first work to use relation texts as a language-image pre-training signal.
  • RLIP-ParSe achieves SOTA results on fully fine-tuned, few-shot, and zero-shot HOI detection benchmarks, as well as on learning from noisy labels.

[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[video talk]

  • OCN proposes a two-stage HOI detection method by decoupling entity detection and relation inference.
  • OCN incorporates language and statistical priors to facilitate verb inference.

Video Understanding


[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[知乎] [将门创投] [video talk]

  • DIN proposes to perform spatio-temporal dynamic inference.
  • DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much lower computational overhead in the reasoning module.

[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]

  • This paper proposes to incorporate visual context when recognizing group activities; the idea should transfer to other related problems as well.
  • This paper achieves SOTA results on the Volleyball and CAD benchmarks.

Incremental / Continual Learning


[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.

  • This paper proposes a response-only distillation method for Incremental Object Detection, dubbed Elastic Response Distillation.
  • This paper achieves SOTA results on the COCO benchmark while utilizing far fewer responses for distillation.

AI for Science and Engineering

Other Interesting Topics

💻 Internships

🎓 Academic Service

  • Reviewing
    • Conferences:
      • ICLR 2024
      • SIGGRAPH 2024
      • NeurIPS 2023
      • CVPR 2022-2024
      • ICCV 2023
      • ECCV 2024
      • AAAI 2023-2024
    • Journals:
      • IEEE Transactions on Pattern Analysis and Machine Intelligence
      • IEEE Transactions on Multimedia
      • IEEE Transactions on Circuits and Systems for Video Technology
      • Knowledge-Based Systems
      • Pattern Recognition

💬 Miscellaneous

Goal of My Research: While conducting research, I prioritize humanity above all else; accordingly, the ultimate goal of my research is to advance human well-being.

Characteristics: I am friendly from a personal perspective (this is not biased!). Although I major in Computer Science Engineering, I am quite emotionally sensitive, often grasping other people’s feelings faster than most people do. (Some people may say this is a gift. Well, I’ll take it!) However, I am not emotionally vulnerable.

English Proficiency: I once dabbled with the TOEFL and snagged a score of 107. Not to brag, but I also clinched the Special Prize (top 0.1%) in the National English Competition for College Students. English reading and writing? Totally my jam, although there’s much room for improvement. Flashback to high school: I entertained the idea of moonlighting as a translator. Fast forward to now: English has morphed into more of a hobby as I dive deep into the world of AI research. While I’m certainly no linguistic prodigy, there’s a certain joy I find in crafting sentences in English, especially when penning down my research papers. But then GPT-4 came along and made my hobby feel, well, a tad redundant. 😅 Why? Because this section is also polished by GPT-4.

🎖 Honors and Awards

Below, I list some of the honors and awards that have inspired me a lot.

  • 2024-01    Outstanding Research Intern in Alibaba Group (Top 20 in 1000+ candidates)
  • 2024-01    Outstanding Graduates of Zhejiang University
  • 2023-09    International Travel Grant for Graduate Students
  • 2022-02    AAAI-22 Scholarship
  • 2021-10    Runner-up in ICCV 2021 Masked Face Recognition Challenge (Webface260M track) [ICCV report]
  • 2020-12    Social Practice Scholarship of Zhejiang University
  • 2019-05    Outstanding Graduates of Zhejiang Province
  • 2017~2018    Provincial Government Scholarship for two consecutive years (Top 5%)
  • 2016~2018    The First Prize Scholarship for three consecutive years (Top 3%)
  • 2018-05    Honorable Mention in Mathematical Contest in Modeling
  • 2017-11    Second Prize in National Mathematical Modeling Contest (Top 5%)
  • 2017-01    First Prize in Physics Competition for College Students in Zhejiang (Top 5%)
  • 2016-11    National Scholarship (Top 1% among all undergraduates)
  • 2016-05    Special Prize, National English Competition for College Students (Top 0.1%)