😃 Welcome to my personal page!
I am Hangjie Yuan (袁杭杰 in Chinese), currently pursuing my Ph.D. at Zhejiang University and serving as a long-term research intern at Alibaba DAMO Academy. I am fortunate to be supervised by Prof. Dong Ni at ZJU, who is open-minded and wise. I am also undertaking a visiting Ph.D. program at MMLab@NTU, supervised by Prof. Ziwei Liu, and I have the good fortune to be supervised by Prof. Samuel Albanie at the University of Cambridge. As part of Alibaba's Research Intern Program, I am/was supervised by Deli Zhao, Shiwei Zhang, Jianwen Jiang and Mang Wang. They are brilliant!
My representative projects include InstructVideo, VideoComposer, and the RLIP series (RLIP & RLIPv2).
My current research interests include:
- 1️⃣ Video Synthesis/Understanding,
- 2️⃣ Human-Object Interaction Detection/Scene Graph Generation, and
- 3️⃣ AI for Science and Engineering.
Feel free to drop me an email📧 if you are interested in having a chat or collaborating with me.
I am currently seeking job opportunities and anticipate graduating in 2024. I am exploring options in both academia, such as postdoctoral positions, and industry, focusing on research-oriented roles. I would be delighted to discuss any job openings or research projects. Please don’t hesitate to contact me at hj.yuan@zju.edu.cn.
🔥 News
- 2024-02: 📑 InstructVideo, DreamVideo and TF-T2V are accepted to CVPR 2024. Thrilled to collaborate with my co-authors on these promising projects.
- 2024-01: 👑 I am honored to receive the Outstanding Research Intern Award (20 of 1000+ candidates) for my contributions to video generation at Alibaba.
- 2024-01: 📑 LUM-ViT is accepted to ICLR 2024. Congrats to Lingfeng Liu!
- 2023-09: 📑 VideoComposer is accepted to NeurIPS 2023. Thrilled to collaborate with my co-authors on this project.
- 2023-08: 📑 RLIPv2 is accepted to ICCV 2023. Code and models are publicly available here!
- 2023-08: 🏡 We release ModelScopeT2V (the default T2V model in Diffusers) and VideoComposer, two foundations for video generation.
- 2022-09: 📑 RLIP: Relational Language-Image Pre-training is accepted to NeurIPS 2022 as a Spotlight paper (Top 5%). It's my honor to work with Samuel and Jianwen. By the way, RLIP is pronounced /ˈɑːlɪp/.
- 2022-05: 📑 Elastic Response Distillation is accepted to CVPR 2022. A great pleasure to work with Tao Feng and Mang Wang.
- 2022-02: 👑 I am awarded the AAAI-22 Scholarship. Thanks to AAAI!
- 2021-12: 📑 Object-guided Cross-modal Calibration Network is accepted to AAAI 2022. A great pleasure to work with Mang Wang.
- 2021-07: 📑 Spatio-Temporal Dynamic Inference Network is accepted to ICCV 2021.
- 2021-03: 👷 I started my internship at Alibaba DAMO Academy.
- 2020-12: 📑 Learning Visual Context (for Group Activity Recognition) is accepted to AAAI 2021.
📝 Publications
A full publication list is available on my Google Scholar page.
(* denotes equal contribution.)
Video Generation
[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[Project page]
- InstructVideo is the first research attempt that instructs video diffusion models with human feedback.
- InstructVideo significantly enhances the visual quality of generated videos without compromising generalization capabilities, with merely 0.1% of the parameters being fine-tuned.
[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou.
[Project page]
[ModelScope]
- I2VGen-XL is the first publicly available foundation model for image-to-video generation.
- I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) the base stage that generates low-resolution semantically coherent videos, and 2) the refinement stage that enhances the video’s details and improves the resolution to 1280×720.
[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[Project page]
- TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
- TF-T2V is proven effective for both text-to-video generation and compositional video synthesis.
[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[Project page]
- DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.
[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[Project page]
[机器之心]
[CVer]
- VideoComposer pioneers controllable video synthesis, seamlessly integrating textual, spatial and, crucially, temporal conditions through a unified interface for information injection.
- VideoComposer can craft videos using various input control signals, from intricate hand-drawn sketches to defined motions.
[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers]
[ModelScope]
- ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale and has been used by millions of people.
- ModelScopeT2V is selected for inclusion in Diffusers.
HOI Detection / Scene Graph Generation
[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.
- RLIPv2 elevates RLIP by leveraging a new language-image fusion mechanism, designed for expansive data scales.
- The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.
[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[video talk]
- RLIP is the first work to use relation texts as a language-image pre-training signal.
- RLIP-ParSe achieves SOTA results on fully-fine-tuned, few-shot and zero-shot HOI detection benchmarks, as well as on learning from noisy labels.
[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[video talk]
- OCN proposes a two-stage HOI detection method by decoupling entity detection and relation inference.
- OCN incorporates language and statistical priors to facilitate verb inference.
Video Understanding
[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[知乎]
[将门创投]
[video talk]
- DIN proposes to perform spatio-temporal dynamic inference.
- DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much lower computational overhead in the reasoning module.
[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]
- This paper proposes to incorporate visual context when recognizing group activities, an idea that should also transfer to related problems.
- This paper achieves SOTA results on Volleyball and CAD benchmarks.
- [arXiv] Few-shot Action Recognition with Captioning Foundation Models, Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang.
Incremental / Continual Learning
[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.
- This paper proposes a response-only distillation method for Incremental Object Detection, dubbed Elastic Response Distillation.
- This paper achieves SOTA results on COCO benchmarks while utilizing much fewer responses for distillation.
- [arXiv] Refined Response Distillation for Class-Incremental Player Detection, Liang Bai, Hangjie Yuan, Tao Feng, Hong Song, Jian Yang.
- [arXiv] Progressive Learning without Forgetting, Tao Feng, Hangjie Yuan, Mang Wang, Ziyuan Huang, Ang Bian, Jianzhou Zhang.
AI for Science and Engineering
- [ICLR 2024] LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition, Lingfeng Liu, Dong Ni, Hangjie Yuan. [code]
Other Interesting Topics
- [WACV 2024] From Denoising Training to Test-Time Adaptation: Enhancing Domain Generalization for Medical Image Segmentation, Ruxue Wen, Hangjie Yuan, Dong Ni, Wenbo Xiao, Yaoyao Wu. [code]
💻 Internships
- 2021.03 - Present, DAMO Academy, Alibaba Group, Hangzhou.
- Advisors: Deli Zhao, Shiwei Zhang, Jianwen Jiang and Mang Wang
🎓 Academic Service
- Reviewing
- Conferences:
- ICLR 2024
- SIGGRAPH 2024
- NeurIPS 2023
- CVPR 2022-2024
- ICCV 2023
- ECCV 2024
- AAAI 2023-2024
- Journals:
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- IEEE Transactions on Multimedia
- IEEE Transactions on Circuits and Systems for Video Technology
- Knowledge-Based Systems
- Pattern Recognition
💬 Miscellaneous
Goal of My Research: While conducting research, I prioritize humanity above all else; accordingly, the ultimate goal of my research is to advance human well-being.
Characteristics: I am friendly from a personal perspective (this is not biased!). Although I major in Computer Science and Engineering, I am quite emotionally sensitive, often picking up on other people's feelings faster than most. (Some people may say this is a gift. Well, I'll take it!) However, I am not emotionally vulnerable.
English Proficiency: I once dabbled with the TOEFL and snagged a score of 107. Not to brag, but I also clinched the Special Prize (top 0.1%) in the National English Competition for College Students. English reading and writing? Totally my jam, although there's much room for improvement. Flashback to high school: I entertained the idea of moonlighting as a translator. Fast forward to now: English has morphed into more of a hobby as I dive deep into the world of AI research. While I'm certainly no linguistic prodigy, there's a certain joy I find in crafting sentences in English, especially when penning down my research papers. But then GPT-4 came along and made my hobby feel, well, a tad redundant. 😅 Why? Because this section is also polished by GPT-4.
🎖 Honors and Awards
Below, I list some of the honors and awards that have inspired me a lot.
- 2024-01 Outstanding Research Intern in Alibaba Group (Top 20 in 1000+ candidates)
- 2024-01 Outstanding Graduates of Zhejiang University
- 2023-09 International Travel Grant for Graduate Students
- 2022-02 AAAI-22 Scholarship
- 2021-10 Runner-up in ICCV 2021 Masked Face Recognition Challenge (Webface260M track) [ICCV report]
- 2020-12 Social Practice Scholarship of Zhejiang University
- 2019-05 Outstanding Graduates of Zhejiang Province
- 2017~2018 Provincial Government Scholarship for two consecutive years (Top 5%)
- 2016~2018 The First Prize Scholarship for three consecutive years (Top 3%)
- 2018-05 Honorable Mention in Mathematical Contest in Modeling
- 2017-11 Second Prize in National Mathematical Modeling Contest (Top 5%)
- 2017-01 First Prize in Physics Competition for College Students in Zhejiang (Top 5%)
- 2016-11 National Scholarship (Top 1% among all undergraduates)
- 2016-05 Special Prize, National English Competition for College Students (Top 0.1%)