A Dive into Vision-Language Models (Hugging Face)

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

MM-LLMs: Recent Advances in MultiModal Large Language Models

Vision Transformers Need Registers

Concepts

ViT (Vision Transformer)

SSL (Self-Supervised Learning)

Contrastive Learning: CLIP (Contrastive Language-Image Pre-Training) https://github.com/OpenAI/CLIP https://github.com/mlfoundations/open_clip
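
To make the contrastive-learning entry concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in pure Python. It is an illustration under stated assumptions, not the code from either repository above: the function names (`contrastive_loss`, `_normalize`, `_cross_entropy_diagonal`) are hypothetical, the embeddings are toy vectors, and the temperature of 0.07 is only an illustrative default. The idea is that the i-th image and the i-th text form a matching pair, so the loss rewards high similarity on the diagonal of the image-text similarity matrix.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct "class" for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Hypothetical helper: symmetric image-to-text and text-to-image loss.
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits_t))

# Toy embeddings (assumed for illustration): pair i of images matches pair i of texts.
pairs = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(pairs, pairs)           # correctly aligned pairs
mismatched = contrastive_loss(pairs, pairs[::-1])  # texts swapped
print(matched < mismatched)
```

Aligned pairs give a loss near zero, while swapping the texts gives a much larger one, which is the signal the training objective uses to pull matching image and text embeddings together.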

COCO (Common Objects in Context)

PEFT (Parameter-Efficient Fine-Tuning)