- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
- Visual-RFT: Visual Reinforcement Fine-Tuning
- Rethinking Overlooked Aspects in Vision-Language Models
- CogVLM2: Visual Language Models for Image and Video Understanding
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
- Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- ImageBind: One Embedding Space To Bind Them All
- BLIP: Bootstrapping Language-Image Pre-training
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- VL-T5: Unifying Vision-and-Language Tasks via Text Generation
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- Frozen: Multimodal Few-Shot Learning with Frozen Language Models
- Flamingo: a Visual Language Model for Few-Shot Learning
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Kosmos-1: Language Is Not All You Need: Aligning Perception with Language Models
- PaLM-E: An Embodied Multimodal Language Model
- Qwen2.5-VL Technical Report
- Vision-Language Models (CS 685, Spring 2024)
- https://www.cs.cmu.edu/~mgormley/courses/10423//slides/lecture14-vlm-ink.pdf
- Traditional vision models need large task-specific labeled datasets
- NLP saw breakthroughs with pretraining + prompts (e.g., GPT-3)
- Goal: Train a single model to understand images and text jointly using web-scale supervision
- Leverage natural language supervision instead of manual labels
- CLIP = Contrastive Language-Image Pretraining
- Trained to match (image, text) pairs: learn alignment without labels
- Dataset: 400M image-text pairs collected from the internet
- Diverse and noisy, but rich in semantics
- No manual annotations; relies on web alt-text and captions
- Image Encoder: ResNet-50 / Vision Transformer (ViT)
- Text Encoder: Transformer (similar to GPT-style)
- Both encode into a shared embedding space (512-D for the ViT-B variants)
- Loss Function: Contrastive loss (InfoNCE)
- Pull matching (image, text) pairs close
- Push non-matching pairs apart
- Softmax over similarity matrix across batch
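The contrastive objective above can be sketched as a symmetric cross-entropy over the batch similarity matrix. A minimal NumPy sketch, assuming pre-computed embeddings and a fixed temperature (CLIP actually learns the temperature and trains at very large batch sizes):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix, scaled by temperature
    logits = img @ txt.T / temperature

    # Cross-entropy in both directions; the correct targets are the diagonal
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_probs[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling matched pairs together and pushing mismatched pairs apart both fall out of this one loss: the diagonal entries are the positives, every off-diagonal entry in the same row or column is a negative.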
- Advantages:
- Strong zero-shot performance: no fine-tuning required
- Generalizes to tasks it wasn't trained on
- Works well on rare classes & non-standard categories
- Limitations:
- Still struggles on fine-grained tasks
- Requires large compute and dataset for training
- Sensitive to prompt phrasing (zero-shot brittleness)
- Zero-shot classification with prompts like "a photo of a cat"
- Image retrieval from text (and vice versa)
- Basis for downstream models (e.g., DALL·E, Flamingo, BLIP)
- Used in multimodal agents, content filtering, and visual Q&A
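Zero-shot classification reduces to embedding one prompt per class and taking the argmax cosine similarity. A sketch with a stand-in encoder (real use would call CLIP's image and text encoders; `toy_text_encoder` here is purely illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="a photo of a {}"):
    """CLIP-style zero-shot classification: embed one prompt per class,
    score the image against each by cosine similarity, pick the argmax."""
    txt = np.stack([text_encoder(template.format(c)) for c in class_names])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    scores = txt @ img  # one cosine score per class
    return class_names[int(np.argmax(scores))]

# Stand-in encoder: a deterministic pseudo-random vector per string.
# A real pipeline would use CLIP's text encoder here.
def toy_text_encoder(s):
    return np.random.default_rng(abs(hash(s)) % 2**32).normal(size=16)
```

The prompt template matters in practice (the "zero-shot brittleness" noted above); CLIP's authors ensemble many templates to stabilize accuracy.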
- GPT-4 shows remarkable multimodal capabilities but is closed-source and heavy
- There's a need for a lightweight, open alternative that mimics GPT-4-like image-text reasoning
- Goal: Enable vision-to-language understanding and generation using smaller models and open tools
- Leverage frozen LLMs (like Vicuna) + vision encoders with minimal training
- Image-text pairs with instruction-following format
- Trained in two stages:
- Pre-alignment: align vision features with language embeddings using 5M image-text pairs
- Instruction tuning: fine-tune on ~3K high-quality vision-language instruction-following examples
- Datasets include LAION, Conceptual Captions, and curated instruct data
- Image Encoder: frozen ViT + pretrained Q-Former (both from BLIP-2)
- Language Model: Vicuna-7B (frozen)
- A linear projection layer maps vision features into Vicuna's embedding space
- Uses cross-entropy loss during instruction fine-tuning
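The projection layer is the only trained component, which is what makes MiniGPT-4 cheap to align. A minimal sketch; the 768/4096 widths are assumptions matching Q-Former outputs and Vicuna-7B embeddings:

```python
import numpy as np

# Illustrative dimensions (assumptions): the Q-Former emits 32 query
# tokens of width 768; Vicuna-7B's input-embedding width is 4096.
VIS_DIM, LLM_DIM = 768, 4096

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(VIS_DIM, LLM_DIM))  # the only trained weights
b = np.zeros(LLM_DIM)

def project_vision_tokens(qformer_tokens):
    """Map (num_tokens, VIS_DIM) Q-Former outputs into the frozen LLM's
    input-embedding space, so they can be spliced into the text prompt."""
    return qformer_tokens @ W + b

# 32 visual "soft prompt" tokens, now in the LLM's embedding space
soft_prompt = project_vision_tokens(rng.normal(size=(32, VIS_DIM)))
```

The projected tokens are concatenated with ordinary text embeddings and fed to the frozen Vicuna; gradients flow only into `W` and `b`.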
- Advantages:
- Efficient: 7B LLM + lightweight vision alignment
- Open-source alternative to GPT-4's multimodal reasoning
- Good at describing, reasoning, and interacting over images
- Limitations:
- Still lags behind GPT-4 in fine-grained reasoning
- Heavily dependent on instruction tuning quality
- Inference may be slower without quantization or optimization
- Image captioning (dense, factual, creative)
- Visual question answering
- Dialogue with images
- Foundation for multimodal agents (e.g., using LangChain, Gradio)
- Existing vision-language models perform well with full fine-tuning
- Few-shot generalization (like GPT-3) is missing in multimodal models
- Goal: Build a model that can understand and respond to image/text sequences in few-shot settings
- Pretrained vision encoder + Perceiver Resampler + frozen text decoder (Chinchilla)
- Large-scale pretraining on interleaved image-text corpora (public + proprietary)
- Uses captioning, QA, and multimodal dialogue data
- Combines:
- Vision encoder: pretrained NFNet backbone (frozen)
- Perceiver Resampler: compresses vision features
- Gated cross-attention layers (xAttn): insert into frozen language model
- Loss: Language modeling loss (causal LM) over text tokens conditioned on images
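The gated cross-attention insertion can be sketched as below: a single-head, unmasked simplification (the real model uses multi-head attention, masking so text attends only to the most recent images, and learnable per-layer gates):

```python
import numpy as np

def gated_cross_attention(text_h, vis_tokens, Wq, Wk, Wv, alpha):
    """One Flamingo-style gated xattn step (single head, no masking).

    text_h:     (T, D) hidden states from the frozen LM layer
    vis_tokens: (V, D) Perceiver Resampler outputs
    alpha:      scalar gate; initialized to 0 so training starts from
                the unmodified frozen language model.
    """
    q = text_h @ Wq
    k = vis_tokens @ Wk
    v = vis_tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # tanh gate: the new branch contributes nothing when alpha == 0,
    # so the frozen LM's behavior is preserved at initialization
    return text_h + np.tanh(alpha) * (attn @ v)
```

Zero-initializing the gate is the key trick: the inserted layers start as identity functions, and the model gradually learns how much visual signal to mix in.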
- Advantages:
- Supports few-shot VQA, captioning, reasoning with no fine-tuning
- Highly modular: frozen backbone + light insertions (xAttn)
- Sets SOTA on several multimodal benchmarks
- Limitations:
- Proprietary data makes full reproducibility hard
- Model size and training cost are high
- Text generation is sensitive to formatting of few-shot prompts
- Image-based QA (e.g., ScienceQA, OKVQA)
- Multimodal dialogue and image-grounded reasoning
- Visual understanding with few-shot prompts (like in NLP)
- Inspires models like OpenFlamingo, MiniGPT-4, IDEFICS
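Few-shot prompting in these models interleaves image placeholders with text, mirroring in-context learning in NLP. A sketch of the prompt format, using the `<image>` and `<|endofchunk|>` marker tokens from the OpenFlamingo convention (Flamingo's own special tokens differ in name but play the same role):

```python
def build_fewshot_prompt(support_captions, query):
    """Build an interleaved image-text prompt: each <image> marks where
    that example's vision tokens are spliced in; the final <image> is
    the query image whose continuation the model must generate."""
    parts = [f"<image>{cap}<|endofchunk|>" for cap in support_captions]
    parts.append(f"<image>{query}")
    return "".join(parts)

prompt = build_fewshot_prompt(
    ["Output: a dog on the beach.", "Output: two cats on a sofa."],
    "Output:")
```

As with GPT-3 prompting, small formatting changes (separator tokens, caption phrasing) can shift accuracy noticeably, which is the prompt-sensitivity limitation noted above.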