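I am a fourth-year direct-entry Ph.D. student in Computer Science and Technology at Peking University (expected to graduate in 2026). I received my bachelor's degree from the School of Electronic and Information Engineering, South China University of Technology (class of 2021).
Motto: Unity of knowledge and action, investigating things to extend knowledge; aim high, and keep your feet on the ground.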
📌 Research Interests
My research focuses on multimodal large models and image/video understanding, specifically:
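- Multimodal large models (video understanding), including:
  - General video understanding: Qwen2.5-VL (core contributor)
  - Audio-visual understanding: VideoLLaMA2; CMM
  - Streaming video understanding: VideoLLaMA3
  - Long video understanding: Inf-CL (CVPR 2025 Highlight)
  - Fine-grained video understanding: VideoRefer (CVPR 2025)
- Image/video segmentation, including:
  - Weakly supervised segmentation: OCR (CVPR 2023)
  - Video instance segmentation: TAR (ICCV 2025)
  - Multimodal segmentation: WiCo (IJCAI 2023, Neurocomputing 2024); PVD (AAAI 2024); BriVIS (AAAI 2025)
  - Medical image segmentation: Fused U-Net (Medical Physics 2021)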
📈 Academic Achievements
I have published 20+ papers to date; the total citation count is available on Google Scholar.
The open-source projects I have contributed to have attracted wide attention on GitHub.
💬 Contact
If you are interested in my research, feel free to reach out about collaboration or internship / full-time opportunities 🙏🙏. My email: cyanlaser@stu.pku.edu.cn
🔥 News
- 2021.03: I joined SenseTime as a research intern in Shenzhen to develop the MMSegmentation toolkit.
📝 Publications
🎞️ Multi-modal LLM (Video Understanding)

Qwen2.5-VL Technical Report
Core Contributors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, …, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang*, Kehan Li*, Zesen Cheng*, Zhiqiang Hu*, Yuqian Yuan*, Guanzheng Chen*, Sicong Leng*, Yuming Jiang*, Hang Zhang*, Xin Li*, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy (Highlight)
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, et al.
🧩 Image/Video Segmentation

Temporal-aware Query Routing for Real-time Video Instance Segmentation
Zesen Cheng, Kehan Li, Yian Zhao, et al.

Aligning Instance Brownian Bridge with Texts for Open-vocabulary Video Instance Segmentation
Zesen Cheng, Kehan Li, Hao Li, Peng Jin, et al.

Parallel Vertex Diffusion for Unified Visual Grounding
Zesen Cheng, Kehan Li, Peng Jin, et al.
- CVPR 2024 (Highlight)
  GraCo: Granularity-Controllable Interactive Segmentation
  Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, et al.
- Neurocomputing 2024
  Hierarchical collaboration for referring image segmentation
  Wei Zhang, Zesen Cheng, et al.
- ICCV 2023
  Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation
  Kehan Li, Yian Zhao, Zhennan Wang, Zesen Cheng, Peng Jin, et al.

WiCo: Win-win Cooperation of Bottom-up and Top-down Referring Image Segmentation
Zesen Cheng, Peng Jin, Hao Li, Kehan Li, et al.

Out-of-Candidate Rectification for Weakly-supervised Semantic Segmentation
Zesen Cheng, Pengchong Qiao, Kehan Li, Siheng Li, et al.
- CVPR 2023 (Highlight)
  ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation
  Kehan Li, Zhennan Wang, Zesen Cheng, Runyi Yu, et al.
- CVPR 2023
  EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
  Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang.
- Medical Physics 2021
  Integrating multiple MRI sequences for pelvic organs segmentation via the attention mechanism
  Sijuan Huang*, Zesen Cheng*, et al.
Others
- ECCV 2024
  Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation
  Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, et al.
- ECCV 2024
  FreestyleRet: Retrieving Images from Style-Diversified Queries
  Hao Li, Curise Jia, Peng Jin, Zesen Cheng, Kehan Li, et al.
- PRCV 2023 (Oral)
  Object-Aware Transfer-Based Black-Box Adversarial Attack on Object Detector
  Zhuo Leng, Zesen Cheng, et al.
- ICCV 2023
  DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
  Peng Jin, Hao Li, Zesen Cheng, Kehan Li, et al.
- IJCAI 2023
  Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
  Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, et al.
- IJCAI 2023
  TG-VQA: Ternary Game of Video Question Answering
  Hao Li, Peng Jin, Zesen Cheng, et al.
🥇 Honors and Awards
- 2023.10 Pingan Scholarship
- 2020.10 National Scholarship (Undergraduate) (Top 1%)
- 2019.10 The Second Prize Scholarship
- 2018.10 National Scholarship (Undergraduate) (Top 1%)
📖 Education
- 2021.09 - Present, Ph.D. Candidate, School of Electronic and Computer Engineering, Peking University.
- 2017.09 - 2021.06, Undergraduate, School of Electronic and Information Engineering, South China University of Technology.
💻 Internships
- 2025.01 - Present, Alibaba, Qwen Team, Hangzhou.
- 2024.01 - 2024.12, Alibaba, DAMO Academy, Hangzhou.
- 2021.03 - 2021.10, SenseTime, OpenMMLab, Shenzhen.