😃 Greetings!

I have joined Alibaba DAMO Academy as a research scientist via the Alibaba Star (‘阿里星’) program, working on cutting-edge problems in foundation models. I also hold a research position at Zhejiang University, working with Prof. Yi Yang. I obtained my PhD from Zhejiang University in the beautiful summer of 2024 (having started in September 2019), under the supervision of Prof. Dong Ni, Prof. Samuel Albanie (University of Cambridge/DeepMind), Deli Zhao (Alibaba DAMO) and Shiwei Zhang (Alibaba Tongyi Lab/DAMO). I also undertook a visiting PhD program at MMLab@NTU, supervised by Prof. Ziwei Liu.

My representative projects include InstructVideo, VideoComposer, the RLIP series (v1 and v2), the DreamVideo series (v1 and v2) and ModelScopeT2V.

My current research interests include:

  • 1️⃣ Generative models: visual generation/editing, visual autoregressive models, and visual generation alignment;
  • 2️⃣ Representation learning: video understanding, visual relation detection (HOI detection/scene graph generation);
  • 3️⃣ AI for science and engineering.

📧 Feel free to drop me an email at hj.yuan@zju.edu.cn if you are interested in collaborating with me, whether as a full-time researcher, an intern, or a remote collaborator.

🔥 News

  • 2024-09 : 📑 EvolveDirector and C-Flat are accepted to NeurIPS 2024. Congrats to Rui and Tao.

  • 2024-05 : 📑 PAPM is accepted to ICML 2024 and ArchCraft is accepted to IJCAI 2024. I am delighted to see their work published at top-tier conferences.

  • 2024-02 : 📑 InstructVideo, DreamVideo and TF-T2V are accepted to CVPR 2024. Thrilled to have collaborated with my co-authors on these promising projects.

  • 2024-01 : 👑 I am honored to receive the Outstanding Research Intern Award (20 out of 1000+ candidates) for my contributions to video generation at Alibaba.

  • 2024-01 : 📑 LUM-ViT is accepted to ICLR 2024. Congrats to Lingfeng Liu!

  • 2023-09 : 📑 VideoComposer is accepted to NeurIPS 2023. Thrilled to have collaborated with my co-authors on this project.

  • 2023-08 : 📑 RLIPv2 is accepted to ICCV 2023. Code and models are publicly available here!

  • 2023-08 : 🏡 We release ModelScopeT2V (the default T2V model in Diffusers) and VideoComposer, two foundation models for video generation.

  • 2022-09 : 📑 RLIP: Relational Language-Image Pre-training is accepted to NeurIPS 2022 as a Spotlight paper (Top 5%). It’s my honor to work with Samuel and Jianwen. By the way, RLIP is pronounced /ˈɑːlɪp/.

  • 2022-05 : 📑 Elastic Response Distillation is accepted to CVPR 2022. A great pleasure to work with Tao Feng and Mang Wang.

  • 2022-02 : 👑 I am awarded AAAI-22 Scholarship. Acknowledgement to AAAI!

  • 2021-12 : 📑 Object-guided Cross-modal Calibration Network is accepted to AAAI 2022. A great pleasure to work with Mang Wang.

  • 2021-07 : 📑 Spatio-Temporal Dynamic Inference Network is accepted to ICCV 2021.

  • 2021-03 : 👷 I start my internship at Alibaba DAMO Academy.

  • 2020-12 : 📑 Learning Visual Context (for Group Activity recognition) is accepted to AAAI 2021.

📝 Publications

A full publication list is available on my Google Scholar page.

(* denotes equal contribution.)

Visual Generation Alignment and its Application


[NeurIPS 2024] EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou.

  • EvolveDirector leverages large vision-language models (VLMs) to evaluate visual generation results, guiding the evolution of a T2I model by dynamically refining the training dataset through selection and mutation.
  • The trained T2I model, Edgen, powered by EvolveDirector, achieves SOTA performance using only 1% of the data typically required by other models.

[CVPR 2024] InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni.
[Project page]

  • InstructVideo is the first research attempt that instructs video diffusion models with human feedback.
  • InstructVideo significantly enhances the visual quality of generated videos without compromising generalization capabilities, with merely 0.1% of the parameters being fine-tuned.

Visual Generation / Editing


[arXiv] I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou.
[Project page] [ModelScope]

  • I2VGen-XL is the first publicly available foundation model for image-to-video generation.
  • I2VGen-XL decouples high-resolution image-to-video synthesis into two stages: 1) the base stage that generates low-resolution semantically coherent videos, and 2) the refinement stage that enhances the video’s details and improves the resolution to 1280×720.

[CVPR 2024] A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang.
[Project page]

  • TF-T2V proposes to separate the process of text decoding from that of temporal modeling during pre-training.
  • TF-T2V is proven effective for both text-to-video generation and compositional video synthesis.

[CVPR 2024] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan.
[Project page]

  • DreamVideo is the first method that generates personalized videos from a few static images of the desired subject and a few videos of target motion.

[NeurIPS 2023] VideoComposer: Compositional Video Synthesis with Motion Controllability
Xiang Wang*, Hangjie Yuan*, Shiwei Zhang*, Dayou Chen*, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou.
[Project page] [机器之心] [CVer]

  • VideoComposer pioneers controllable video synthesis, seamlessly integrating textual, spatial, and crucially, temporal conditions, through a unified interface for information injection.
  • VideoComposer can craft videos using various input control signals, from intricate hand-drawn sketches to defined motions.

[Technical report] ModelScope Text-to-Video Technical Report
Jiuniu Wang*, Hangjie Yuan*, Dayou Chen*, Yingya Zhang*, Xiang Wang, Shiwei Zhang.
[Diffusers] [ModelScope]

  • ModelScopeT2V is the first publicly available diffusion-based text-to-video model at scale, and it has been used by millions of people.
  • ModelScopeT2V is selected for inclusion in Diffusers.

Visual Relation Detection (HOI Detection / Scene Graph Generation)


[ICCV 2023] RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao.

  • RLIPv2 elevates RLIP by leveraging a new language-image fusion mechanism, designed for expansive data scales.
  • The most advanced pre-trained RLIPv2 (Swin-L) matches the performance of RLIPv1 (R50) while utilizing a mere 1% of the data.

[NeurIPS 2022 Spotlight] RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection
Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni and Mingqian Tang.
[video talk]

  • RLIP is the first work to use relation texts as a language-image pre-training signal.
  • RLIP-ParSe achieves SOTA results on fully-finetuned, few-shot and zero-shot HOI detection benchmarks, as well as on learning from noisy labels.

[AAAI 2022] Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu.
[video talk]

  • OCN proposes a two-stage HOI detection method that decouples entity detection from relation inference.
  • OCN incorporates language and statistical priors to facilitate verb inference.

Video Understanding


[ICCV 2021] Spatio-Temporal Dynamic Inference Network for Group Activity Recognition
Hangjie Yuan, Dong Ni and Mang Wang.
[知乎] [将门创投] [video talk]

  • DIN proposes to perform spatio-temporal dynamic inference.
  • DIN achieves SOTA results on the Volleyball and CAD benchmarks while incurring much lower computational overhead in the reasoning module.

[AAAI 2021] Learning Visual Context for Group Activity Recognition
Hangjie Yuan and Dong Ni.
[video talk]

  • This paper proposes to incorporate visual context when recognizing group activities, an idea that may also benefit related recognition problems.
  • This paper achieves SOTA results on Volleyball and CAD benchmarks.

AI for Science and Engineering


[ICML 2024] PAPM: A Physics-aware Proxy Model for Process Systems
Pengwei Liu, Zhongkai Hao, Xingyu Ren, Hangjie Yuan, Jiayang Ren, Dong Ni.
[code]

  • PAPM is a pioneering work that fully incorporates partial prior physics of process systems to enable better generalization capabilities.

Incremental / Continual Learning


[CVPR 2022] Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation
Tao Feng, Mang Wang and Hangjie Yuan.

  • This paper proposes a response-only distillation method for Incremental Object Detection, dubbed Elastic Response Distillation.
  • This paper achieves SOTA results on COCO benchmarks while utilizing much fewer responses for distillation.

Other Interesting Topics

💻 Internships

🎓 Academic Service

  • Reviewing
    • Conferences:
      • ICLR 2024-2025
      • SIGGRAPH 2024
      • NeurIPS 2023-2024
      • CVPR 2022-2025
      • ICCV 2023
      • ECCV 2024
      • AAAI 2023-2025
    • Journals:
      • IEEE Transactions on Pattern Analysis and Machine Intelligence
      • IEEE Transactions on Multimedia
      • IEEE Transactions on Circuits and Systems for Video Technology
      • Knowledge-Based Systems
      • Pattern Recognition

💬 Miscellaneous

  • Goal of my research: While conducting research, I prioritize humanity above all else; accordingly, the ultimate goal of my research is to advance human well-being.

  • Characteristics:
    • I am friendly, personally speaking (no bias here!). Although I majored in Computer Science and Engineering, I am quite emotionally sensitive, often picking up on other people’s feelings faster than most. (Some people may say this is a gift. Well, I’ll take it!) However, I am not emotionally vulnerable.

    • I describe myself as a realistic idealist. For years, I searched for the purpose of life and found my Goal of Research (see above). I resonate with Steve Jobs’ philosophy: “The people who are crazy enough to think they can change the world are the ones who do.”

  • The following list might be dynamic 😆.

    • Favorite singer: Taylor Swift. I officially became a Swiftie after the release of “Safe & Sound”.

    • Concerts that I have been to: Taylor Swift’s Eras Tour in London, 2024 (where I saw Ed Sheeran as well!), Coldplay’s Music of the Spheres World Tour in Singapore, 2024, and One Love Asia Festival in Singapore, 2023.

    • Favorite athlete / football club: Kylian Mbappé and Tottenham Hotspur F.C. We’re on a mission to win a trophy! And yeah, I respect the big guys like Man City, Man United, Chelsea, Arsenal… even if they are the competition.

    • Favorite movie / TV series: Iron Man I, While You Were Sleeping (i.e., 당신이 잠든 사이에, starring Bae Suzy and Lee Jong-suk), the Harry Potter series (I am a fan of Hermione Granger), Batman (starring Christian Bale, including Batman Begins, The Dark Knight and The Dark Knight Rises), and The Amazing Spider-Man I&II (starring Andrew Garfield and Emma Stone).

  • English proficiency: I once dabbled with the TOEFL and snagged a score of 107. Not to brag, but I also clinched the Special Prize (top 0.1%) in the National English Competition for College Students. English reading and writing? Totally my jam, although there’s much room for improvement. Flashback to high school: I entertained the idea of moonlighting as a translator. Fast forward to now: English has morphed into more of a hobby as I dive deep into the world of AI research. While I’m certainly no linguistic prodigy, there’s a certain joy I find in crafting sentences in English, especially when penning down my research papers. But then GPT-4 came along and made my hobby feel, well, a tad redundant. 😅 Why? Because this section is also polished by GPT-4.

🎖 Honors and Awards

Below, I list some of the honors and awards that have inspired me.

  • 2024-01    Outstanding Research Intern in Alibaba Group (Top 20 in 1000+ candidates)
  • 2024-01    Outstanding Graduates of Zhejiang University
  • 2023-09    International Travel Grant for Graduate Students
  • 2022-02    AAAI-22 Scholarship
  • 2021-10    Runner-up in ICCV 2021 Masked Face Recognition Challenge (Webface260M track) [ICCV report]
  • 2020-12    Social Practice Scholarship of Zhejiang University
  • 2019-05    Outstanding Graduates of Zhejiang Province
  • 2017~2018    Provincial Government Scholarship for two consecutive years (Top 5%)
  • 2016~2018    The First Prize Scholarship for three consecutive years (Top 3%)
  • 2018-05    Honorable Mention in Mathematical Contest in Modeling
  • 2017-11    Second Prize in National Mathematical Modeling Contest (Top 5%)
  • 2017-01    First Prize in Physics Competition for College Students in Zhejiang (Top 5%)
  • 2016-11    National Scholarship (Top 1% among all undergraduates)
  • 2016-05    Special Prize, National English Competition for College Students (Top 0.1%)