I am a fourth-year direct PhD student in the Department of Computer Science and Technology at Peking University (expected graduation in 2026). Before that, I obtained my undergraduate degree from the School of Electronics and Information Engineering at South China University of Technology in 2021.

📌 Research Interests

My research primarily focuses on the field of “Multimodal Large Language Models and Image/Video Understanding”, specifically including:

  • Multimodal Large Language Model (video understanding), including:
    • General video understanding: Qwen2.5-VL core contributor
    • Audio-visual understanding: VideoLLaMA2; CMM
    • Streaming video understanding: VideoLLaMA3
    • Long video understanding: Inf-CL (CVPR 2025 Highlight)
    • Fine-grained video understanding: VideoRefer (CVPR 2025)
  • Image/video segmentation, including:
    • Weakly supervised segmentation: OCR (CVPR 2023)
    • Video instance segmentation: TAR (ICCV 2025)
    • Multimodal segmentation: WiCo (IJCAI 2023, Neurocomputing 2024); PVD (AAAI 2024); BriVIS (AAAI 2025)
    • Medical image segmentation: Fused U-Net (Medical Physics 2021)

📈 Academic Achievements

I have published over 20 papers; the current citation count is listed on my Google Scholar profile.

The open-source projects I have participated in have received widespread attention on GitHub. Representative projects include:

VideoLLaMA2 | VideoLLaMA3 | Inf-CL | CMM | VideoRefer

💬 Contact Information

If you are interested in my research, please feel free to contact me for collaboration or to discuss internship/full-time opportunities 🙏🙏. My email address is: cyanlaser@stu.pku.edu.cn

🔥 News

  • 2021.03: I joined SenseTime as a research intern in Shenzhen to develop the MMSegmentation toolkit.

📝 Publications

🎞️ Multi-modal LLM (Video Understanding)

Qwen2.5-VL

Qwen2.5-VL Technical Report
Core Contributors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, …, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin

Project | Code

VideoLLaMA3

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang*, Kehan Li*, Zesen Cheng*, Zhiqiang Hu*, Yuqian Yuan*, Guanzheng Chen*, Sicong Leng*, Yuming Jiang*, Hang Zhang*, Xin Li*, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

Code | hf_space | hf_space | hf_paper

VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

Code | hf_space | hf_space | hf_paper

CMM

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

Project | Code | hf_data

CVPR 2025 Highlight

Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy (Highlight)
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

Code | hf_paper | PyPI

CVPR 2025

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, et al.

Project | Code

🧩 Image/Video Segmentation

ICCV 2025
AAAI 2024

Parallel Vertex Diffusion for Unified Visual Grounding
Zesen Cheng, Kehan Li, Peng Jin, et al.

IJCAI 2023
CVPR 2023

Out-of-Candidate Rectification for Weakly-supervised Semantic Segmentation
Zesen Cheng, Pengchong Qiao, Kehan Li, Siheng Li, et al.

Others

🥇 Honors and Awards

  • 2023.10 Pingan Scholarship
  • 2020.10 National Scholarship (Undergraduate) (Top 1%)
  • 2019.10 The Second Prize Scholarship
  • 2018.10 National Scholarship (Undergraduate) (Top 1%)

📖 Education

💻 Internships
