Multimodal Few-Shot Learning with Frozen Language Models: A Review
Source: Multimodal Few-Shot Learning with Frozen Language Models

Introduction

Recent advances in natural language processing have produced large transformer-based language models with impressive few-shot learning abilities. When trained at sufficient scale, these models can learn new language tasks after being prompted with just a few examples, without any further gradient updates.

Despite these capabilities, however, large language models understand only text. They cannot process inputs in other modalities such as vision, which prevents us from directly communicating visual information, tasks, or concepts to them. We cannot simply show the model an image
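To make the few-shot setup concrete, here is a minimal sketch (not from the paper) of how a task can be specified purely through the prompt: a handful of input/output demonstrations are concatenated, followed by the new query, and the model is asked to continue the text. The `build_few_shot_prompt` helper and the demonstration pairs are illustrative assumptions, not the paper's actual prompts.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (input, output) demonstrations followed by a new query.

    No weights are updated; the task is conveyed entirely by the prompt,
    and the model is expected to complete the final "Output:" line.
    """
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Hypothetical translation task specified with two demonstrations.
prompt = build_few_shot_prompt(
    [("bonjour", "hello"), ("merci", "thank you")],
    "au revoir",
)
print(prompt)
```

Fed to a sufficiently large language model, a prompt like this typically elicits the task's pattern (here, French-to-English translation) from the demonstrations alone, which is the few-shot behavior the paper builds on.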