😃 Welcome to my personal page!

I am Hangjie Yuan (袁杭杰 in Chinese), currently pursuing my Ph.D. at Zhejiang University and serving as a long-term research intern at Alibaba DAMO Academy. I am fortunate to be supervised by Prof. Dong Ni at ZJU, who is open-minded and wise. Additionally, I am undertaking a visiting Ph.D. program at MMLab@NTU, supervised by Prof. Ziwei Liu, and I am also fortunate to be supervised by Prof. Samuel Albanie at the University of Cambridge. As part of Alibaba’s Research Intern Program, I am/was supervised by Deli Zhao, Shiwei Zhang, Jianwen Jiang, and Mang Wang. They are brilliant!

My representative projects include InstructVideo, VideoComposer, and the RLIP series (RLIP & RLIPv2).

My current research interests include:

  • 1️⃣ Video Synthesis/Understanding,
  • 2️⃣ Human-Object Interaction Detection/Scene Graph Generation, and
  • 3️⃣ AI for Science and Engineering.

Feel free to drop me an email 📧 if you are interested in having a chat or collaborating with me.

I am currently seeking job opportunities and anticipate graduating in 2024. I am exploring options in both academia, such as postdoctoral positions, and industry, focusing on research-oriented roles. I would be delighted to discuss any job openings or research projects. Please don’t hesitate to contact me at hj.yuan@zju.edu.cn.

🔥 News

  • 2024-02 : 📑 InstructVideo, DreamVideo and TF-T2V are accepted to CVPR 2024. Thrilled to have collaborated with my co-authors on these promising projects.

  • 2024-01 : 👑 I am honored to receive the Outstanding Research Intern Award (Top 20 of 1000+ candidates) for my contributions to video generation at Alibaba.

  • 2024-01 : 📑 LUM-ViT is accepted to ICLR 2024. Congrats to Lingfeng Liu!

  • 2023-09 : 📑 VideoComposer is accepted to NeurIPS 2023. Thrilled to have collaborated with my co-authors on this project.

  • 2023-08 : 📑 RLIPv2 is accepted to ICCV 2023. Code and models are publicly available here!

  • 2023-08 : 🏡 We release ModelScopeT2V (the default T2V model in Diffusers) and VideoComposer, two foundation models for video generation.

  • 2022-09 : 📑 RLIP: Relational Language-Image Pre-training is accepted to NeurIPS 2022 as a Spotlight paper (Top 5%). It’s my honor to work with Samuel and Jianwen. Btw, the pronunciation of RLIP is /ˈɑːlɪp/.

  • 2022-05 : 📑 Elastic Response Distillation is accepted to CVPR 2022. A great pleasure to work with Tao Feng and Mang Wang.

  • 2022-02 : 👑 I am awarded an AAAI-22 Scholarship. Many thanks to AAAI!

  • 2021-12 : 📑 Object-guided Cross-modal Calibration Network is accepted to AAAI 2022. A great pleasure to work with Mang Wang.

  • 2021-07 : 📑 Spatio-Temporal Dynamic Inference Network is accepted to ICCV 2021.

  • 2021-03 : 👷 I start my internship at Alibaba DAMO Academy.

  • 2020-12 : 📑 Learning Visual Context (for Group Activity Recognition) is accepted to AAAI 2021.

📝 Publications

A full publication list is available on my Google Scholar page.

(* denotes equal contribution.)

Video Generation


[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[Project page]

  • InstructVideo is the first research attempt that instructs video diffusion models with human feedback.
  • InstructVideo significantly enhances the visual quality of generated videos without compromising generalization capabilities, with merely 0.1% of the parameters being fine-tuned.

[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou.
[Project page] [ModelScope]

  • I2VGen-XL is the first publicly available foundation model for image-to-video generation.
  • I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) the base stage that generates low-resolution semantically coherent videos, and 2) the refinement stage that enhances the video’s details and improves the resolution to 1280×720.

[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[Project page]

  • TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
  • TF-T2V is proven effective for both text-to-video generation and compositional video synthesis.

[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[Project page]

  • DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.

[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[Project page] [机器之心] [CVer]

  • VideoComposer pioneers controllable video synthesis, seamlessly integrating textual, spatial, and crucially, temporal conditions, through a unified interface for information injection.
  • VideoComposer can craft videos using various input control signals, from intricate hand-drawn sketches to defined motions.

[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers] [ModelScope]

  • ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale, which has been used by millions of people.
  • ModelScopeT2V is selected for inclusion in Diffusers.

HOI Detection / Scene Graph Generation


[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.

  • RLIPv2 elevates RLIP by leveraging a new language-image fusion mechanism, designed for expansive data scales.
  • The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.

[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[video talk]

  • RLIP is the first work to use relation texts as a language-image pre-training signal.
  • RLIP-ParSe achieves SOTA results on fully fine-tuned, few-shot, and zero-shot HOI detection benchmarks, as well as on learning from noisy labels.

[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[video talk]

  • OCN proposes a two-stage HOI detection method by decoupling entity detection and relation inference.
  • OCN incorporates language and statistical priors to facilitate verb inference.

Video Understanding


[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[知乎] [将门创投] [video talk]

  • DIN proposes to perform spatio-temporal dynamic inference.
  • DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much lower computational overhead in the reasoning module.

[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]

  • This paper proposes to incorporate visual context when recognizing group activities; the idea should transfer to other related problems as well.
  • This paper achieves SOTA results on the Volleyball and CAD benchmarks.

Incremental / Continual Learning


[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.

  • This paper proposes a response-only distillation method for Incremental Object Detection, dubbed Elastic Response Distillation.
  • This paper achieves SOTA results on the COCO benchmark while utilizing far fewer responses for distillation.

AI for Science and Engineering

Other Interesting Topics

💻 Internships

🎓 Academic Service

  • Reviewing
    • Conferences:
      • ICLR 2024
      • SIGGRAPH 2024
      • NeurIPS 2023
      • CVPR 2022-2024
      • ICCV 2023
      • ECCV 2024
      • AAAI 2023-2024
    • Journals:
      • IEEE Transactions on Pattern Analysis and Machine Intelligence
      • IEEE Transactions on Multimedia
      • IEEE Transactions on Circuits and Systems for Video Technology
      • Knowledge-Based Systems
      • Pattern Recognition

💬 Miscellaneous

Goal of My Research: While conducting research, I prioritize humanity above all else; accordingly, the ultimate goal of my research is to advance human well-being.

Characteristics: I am friendly from a personal perspective (this is not biased!). Although I major in Computer Science Engineering, I am quite emotionally sensitive, often grasping other people’s feelings faster than most people do. (Some people may say this is a gift. Well, I’ll take it!) However, I am not emotionally vulnerable.

English Proficiency: I once dabbled with the TOEFL and snagged a score of 107. Not to brag, but I also clinched the Special Prize (top 0.1%) in the National English Competition for College Students. English reading and writing? Totally my jam, although there’s much room for improvement. Flashback to high school: I entertained the idea of moonlighting as a translator. Fast forward to now: English has morphed into more of a hobby as I dive deep into the world of AI research. While I’m certainly no linguistic prodigy, there’s a certain joy I find in crafting sentences in English, especially when penning down my research papers. But then GPT-4 came along and made my hobby feel, well, a tad redundant. 😅 Why? Because this section is also polished by GPT-4.

🎖 Honors and Awards

Below, I list some of the honors and awards that have inspired me a lot.

  • 2024-01    Outstanding Research Intern in Alibaba Group (Top 20 in 1000+ candidates)
  • 2024-01    Outstanding Graduates of Zhejiang University
  • 2023-09    International Travel Grant for Graduate Students
  • 2022-02    AAAI-22 Scholarship
  • 2021-10    Runner-up in ICCV 2021 Masked Face Recognition Challenge (Webface260M track) [ICCV report]
  • 2020-12    Social Practice Scholarship of Zhejiang University
  • 2019-05    Outstanding Graduates of Zhejiang Province
  • 2017~2018    Provincial Government Scholarship for two consecutive years (Top 5%)
  • 2016~2018    The First Prize Scholarship for three consecutive years (Top 3%)
  • 2018-05    Honorable Mention in Mathematical Contest in Modeling
  • 2017-11    Second Prize in National Mathematical Modeling Contest (Top 5%)
  • 2017-01    First Prize in Physics Competition for College Students in Zhejiang (Top 5%)
  • 2016-11    National Scholarship (Top 1% among all undergraduates)
  • 2016-05    Special Prize, National English Competition for College Students (Top 0.1%)