Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
MM-LLMs: Recent Advances in MultiModal Large Language Models
Vision Transformers Need Registers
Concepts
ViT (Vision Transformer)
SSL (Self-Supervised Learning)
Contrastive Learning: CLIP (Contrastive Language-Image Pre-Training) https://github.com/OpenAI/CLIP https://github.com/mlfoundations/open_clip
COCO (Common Objects in Context)
PEFT (Parameter-Efficient Fine-Tuning)
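
To make the contrastive learning entry above concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in pure Python. It is an illustration, not the code from either repository linked above: the function names (`clip_style_loss`, `l2_normalize`, `log_softmax`) are hypothetical, the embeddings are toy lists, and the temperature is fixed at 0.07 as an assumption, whereas real implementations typically learn it.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Assumes image_embs[k] and text_embs[k] form the k-th matched pair;
    every other combination in the batch is treated as a negative.
    The fixed temperature is an assumption for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row should pick out its own column.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return (loss_i2t + loss_t2i) / 2


# Toy usage: matched pairs score a much lower loss than mismatched pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is averaged over both directions so that neither modality is treated as the sole query; in practice the embeddings would come from an image encoder (such as a ViT) and a text encoder rather than hand-written vectors.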