CLIP: https://github.com/openai/CLIP | https://openai.com/index/clip/

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a wide variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without being directly optimized for that task, similar to the zero-shot capabilities of GPT-2 and GPT-3. CLIP matches the performance of the original ResNet-50 on ImageNet "zero-shot", without using any of the 1.28M labeled training examples, overcoming several major challenges in computer vision.

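At inference time, the zero-shot prediction described above reduces to a simple scoring step: embed the image and each candidate caption, then pick the caption whose embedding has the highest cosine similarity to the image embedding. The sketch below illustrates only that scoring step with random stand-in vectors; the function name, embedding dimension, and logit scale of 100 are illustrative assumptions, not the library's API (the real encoders come from `clip.load` in the repository above).

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs):
    # L2-normalize both sides, then take cosine similarities.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scale logits before softmax; 100.0 stands in for a learned logit scale.
    logits = 100.0 * (txt @ img)
    e = np.exp(logits - logits.max())  # stable softmax
    return e / e.sum()

# Toy embeddings standing in for CLIP's image/text encoders (illustrative only).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))
text_embs[1] = image_emb + 0.1 * rng.normal(size=512)  # make caption 1 the match

probs = zero_shot_scores(image_emb, text_embs)
print(probs.argmax())  # index of the best-matching caption
```

The key design point is that classification is reframed as retrieval: the "classifier weights" are just the text embeddings of the candidate captions, so new classes can be added at inference time by writing new captions.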