
@maulikmadhavi
Last active May 25, 2025 17:08
Vision-Language Model

Related papers

Recent Vision-Language Papers

Key Models in Detail

CLIP Overview

📘 Motivation Behind CLIP

  • Traditional vision models need large task-specific labeled datasets
  • NLP saw breakthroughs with pretraining + prompts (e.g., GPT-3)
  • Goal: Train a single model to understand images and text jointly using web-scale supervision
  • Leverage natural language supervision instead of manual labels

🧠 Core Idea & Training Data

  • CLIP = Contrastive Language–Image Pre-training
  • Trained to match (image, text) pairs: learns alignment without class labels
  • Dataset: 400M image–text pairs collected from the internet
  • Diverse and noisy, but rich in semantics
  • No manual annotations; relies on web alt-text and captions

๐Ÿ—๏ธ Model Architecture & Loss

  • Image Encoder: ResNet-50 / Vision Transformer (ViT)
  • Text Encoder: Transformer (similar to GPT-style)
  • Both encoders project into a shared embedding space (512-D for the base CLIP models)
  • Loss Function: Contrastive loss (InfoNCE)
    • Pull matching (image, text) pairs close
    • Push non-matching pairs apart
    • Softmax over similarity matrix across batch
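The batch-wise contrastive loss above can be sketched in a few lines of PyTorch. This is a minimal illustration of the symmetric InfoNCE form, not the exact OpenAI implementation (which, among other details, learns the temperature as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs.

    image_emb, text_emb: (B, D) tensors; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the matching pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))

    # Cross-entropy in both directions: image->text and text->image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: random embeddings for a batch of 8 pairs in a 512-D space
torch.manual_seed(0)
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
loss = clip_contrastive_loss(img, txt)
```

Minimizing this loss pulls each matching pair onto the diagonal of the similarity matrix while pushing the off-diagonal (non-matching) entries down, exactly as the bullets describe.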

✅ Advantages & Limitations

  • Advantages:
    • Strong zero-shot performance: no fine-tuning required
    • Generalizes to tasks it wasn't trained on
    • Works well on rare classes & non-standard categories
  • Limitations:
    • Still struggles on fine-grained tasks
    • Requires large compute and dataset for training
    • Sensitive to prompt phrasing (zero-shot brittleness)

๐Ÿ› ๏ธ Supported Tasks & Use Cases

  • Zero-shot classification with prompts like "a photo of a cat"
  • Image retrieval from text (and vice versa)
  • Basis for downstream models (e.g., DALL·E, Flamingo, BLIP)
  • Used in multimodal agents, content filtering, and visual Q&A
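Zero-shot classification with CLIP reduces to a nearest-neighbor search in the shared embedding space: embed one prompt per class (e.g., "a photo of a cat") and pick the class whose prompt is closest to the image. The sketch below assumes precomputed embeddings; `zero_shot_classify` is a hypothetical helper, not part of any CLIP library:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose prompt embedding is most similar to the image.

    image_emb: (D,) image embedding; class_text_embs: (C, D), one row per
    class prompt such as "a photo of a cat". Embeddings are L2-normalized
    here so the dot product is a cosine similarity.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True
    )
    sims = class_text_embs @ image_emb  # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

In practice the embeddings come from the CLIP image and text encoders, and averaging several prompt templates per class ("a photo of a {}", "a blurry photo of a {}", ...) typically improves accuracy, which is also why zero-shot results are sensitive to prompt phrasing.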

MiniGPT-4 Overview

🧠 Motivation Behind MiniGPT-4

  • GPT-4 shows remarkable multimodal capabilities but is closed-source and heavy
  • There's a need for a lightweight, open alternative that mimics GPT-4-like image-text reasoning
  • Goal: Enable vision-to-language understanding and generation using smaller models and open tools
  • Leverage frozen LLMs (like Vicuna) + vision encoders with minimal training

📦 Training Data & Setup

  • Image-text pairs with instruction-following format
  • Trained in two stages:
    • Pre-alignment: align vision features with language embeddings using 5M image-text pairs
    • Instruction tuning: fine-tune on ~3K high-quality vision-language instruction-following examples
  • Datasets include LAION, Conceptual Captions, and curated instruct data

๐Ÿ—๏ธ Architecture & Loss Function

  • Image Encoder: ViT with pretrained Q-Former (from BLIP-2)
  • Language Model: Vicuna-7B (frozen)
  • A linear projection layer maps vision features into Vicuna's embedding space
  • Uses cross-entropy loss during instruction fine-tuning
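The key design point is how little is trained: both the vision side and the LLM stay frozen, and only a single linear projection bridges them. A sketch of that bridge is below; the dimensions (32 query tokens, 768-D Q-Former output, 4096-D Vicuna embeddings) are commonly cited values and should be treated as assumptions:

```python
import torch
import torch.nn as nn

# Assumed dimensions: the Q-Former emits 32 query tokens of width 768;
# Vicuna-7B uses a 4096-D token-embedding space.
NUM_QUERY_TOKENS, QFORMER_DIM, LLM_DIM = 32, 768, 4096

# The main trainable component in MiniGPT-4: a linear map from frozen
# Q-Former vision features into the frozen LLM's embedding space.
vision_to_llm = nn.Linear(QFORMER_DIM, LLM_DIM)

qformer_out = torch.randn(1, NUM_QUERY_TOKENS, QFORMER_DIM)  # frozen vision side
soft_prompt = vision_to_llm(qformer_out)  # (1, 32, 4096)
```

The projected tokens act as a soft prompt: they are prepended to the text-token embeddings, and the frozen LLM is trained with the usual next-token cross-entropy on the response.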

✅ Advantages & Limitations

  • Advantages:
    • Efficient: 7B LLM + lightweight vision alignment
    • Open-source alternative to GPT-4's multimodal reasoning
    • Good at describing, reasoning, and interacting over images
  • Limitations:
    • Still lags behind GPT-4 in fine-grained reasoning
    • Heavily dependent on instruction tuning quality
    • Inference may be slower without quantization or optimization

๐Ÿ› ๏ธ Supported Tasks & Use Cases

  • Image captioning (dense, factual, creative)
  • Visual question answering
  • Dialogue with images
  • Foundation for multimodal agents (e.g., using LangChain, Gradio)

Flamingo Overview

🧠 Motivation Behind Flamingo

  • Existing vision-language models perform well, but only after task-specific fine-tuning
  • Few-shot generalization (like GPT-3) is missing in multimodal models
  • Goal: Build a model that can understand and respond to image/text sequences in few-shot settings

📦 Training Data & Setup

  • Pretrained vision encoder feeding a Perceiver Resampler + frozen text decoder (Chinchilla)
  • Large-scale pretraining on interleaved image-text corpora (public + proprietary)
  • Uses captioning, QA, and multimodal dialogue data

๐Ÿ—๏ธ Architecture & Loss Function

  • Combines:
    • Vision encoder: pretrained NFNet (Normalizer-Free ResNet) backbone
    • Perceiver Resampler: compresses vision features into a fixed set of tokens
    • Gated cross-attention layers (xAttn) inserted into the frozen language model
  • Loss: Language modeling loss (causal LM) over text tokens conditioned on images
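The gated cross-attention insertion is what lets Flamingo modify a frozen LM safely. A simplified sketch is below, using a plain `nn.MultiheadAttention` in place of Flamingo's full block (which also includes a gated feed-forward sublayer); the zero-initialized tanh gate is the essential trick:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention block (simplified sketch).

    Text hidden states attend to vision tokens; a tanh gate initialized
    at zero makes the block an identity map at the start of training, so
    the frozen LM's behavior is preserved before any tuning happens.
    """
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> identity

    def forward(self, text, vision):
        attended, _ = self.attn(query=text, key=vision, value=vision)
        return text + torch.tanh(self.gate) * attended

block = GatedCrossAttention(dim=64)
text = torch.randn(2, 10, 64)    # (batch, text_len, dim)
vision = torch.randn(2, 5, 64)   # (batch, vision_tokens, dim)
out = block(text, vision)
```

Because the gate starts at zero, inserting these layers between the frozen LM's blocks initially changes nothing; the gates open gradually as the causal LM loss over text tokens (conditioned on images) is optimized.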

✅ Advantages & Limitations

  • Advantages:
    • Supports few-shot VQA, captioning, reasoning with no fine-tuning
    • Highly modular: frozen backbone + light insertions (xAttn)
    • Sets SOTA on several multimodal benchmarks
  • Limitations:
    • Proprietary data makes full reproducibility hard
    • Model size and training cost are high
    • Text generation is sensitive to formatting of few-shot prompts

๐Ÿ› ๏ธ Supported Tasks & Use Cases

  • Image-based QA (e.g., ScienceQA, OKVQA)
  • Multimodal dialogue and image-grounded reasoning
  • Visual understanding with few-shot prompts (like in NLP)
  • Inspires models like OpenFlamingo, MiniGPT-4, IDEFICS

PaliGemma

Qwen2.5-VL
