Zesen Cheng (成泽森)

I am a fourth-year direct-entry PhD student in the Department of Computer Science and Technology at Peking University, expecting to graduate in 2026. I received my bachelor's degree from the School of Electronics and Information Engineering at South China University of Technology in 2021.

📌 Research Interests

My research focuses on multimodal large language models and image/video understanding, specifically:

Multimodal large language models (video understanding), including:
- General video understanding: Qwen2.5-VL core contributor
- Audio-visual understanding: VideoLLaMA2; CMM
- Streaming video understanding: VideoLLaMA3
- Long video understanding: Inf-CL (CVPR 2025 Highlight)
- Fine-grained video understanding: VideoRefer (CVPR 2025)

Image/video segmentation, including:
- Weakly supervised segmentation: OCR (CVPR 2023)
- Video instance segmentation: TAR (ICCV 2025)
- Multimodal segmentation: WiCo (IJCAI 2023, Neurocomputing 2024); PVD (AAAI 2024); BriVIS (AAAI 2025)
- Medical image segmentation: Fused U-Net (Medical Physics 2021)

📈 Academic Achievements

I have published over 20 papers; the total citation count is available on my Google Scholar profile.

The open-source projects I have contributed to have received widespread attention on GitHub.

💬 Contact Information

If you are interested in my research, please feel free to contact me about collaboration or internship/full-time opportunities 🙏🙏. My email address is: cyanlaser@stu.pku.edu.cn