- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
- Visual-RFT: Visual Reinforcement Fine-Tuning
- Rethinking Overlooked Aspects in Vision-Language Models
- CogVLM2: Visual Language Models for Image and Video Understanding
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
- Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- CLIP: Learning Transferable Visual Models From Natural Language Supervision
- ImageBind: One Embedding Space To Bind Them All
- BLIP: Bootstrapping Language-Image Pre-training
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- VL-T5: Unifying Vision-and-Language Tasks via Text Generation
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- Frozen: Multimodal Few-Shot Learning with Frozen Language Models
- Flamingo: a Visual Language Model for Few-Shot Learning
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
- Kosmos-1: Language Is Not All You Need: Aligning Perception with Language Models
- PaLM-E: An Embodied Multimodal Language Model
- Qwen2.5-VL Technical Report
- Vision-Language Models (CS 685, Spring 2024)
- https://www.cs.cmu.edu/~mgormley/courses/10423//slides/lecture14-vlm-ink.pdf
- Traditional vision models need large task-specific labeled datasets
- NLP saw breakthroughs with pretraining + prompts (e.g., GPT-3)
- Goal: Train a single model to understand images and text jointly using web-scale supervision
- Leverage natural language supervision instead of manual labels
- CLIP = Contrastive Language-Image Pretraining
- Trained to match (image, text) pairs: learn alignment without labels
- Dataset: 400M image-text pairs collected from the internet
- Diverse and noisy, but rich in semantics
- No manual annotations; relies on web alt-text and captions
- Image Encoder: ResNet-50 / Vision Transformer (ViT)
- Text Encoder: Transformer (similar to GPT-style)
- Both encode into a shared embedding space (512-D for the ViT-B variants)
- Loss Function: Contrastive loss (InfoNCE)
- Pull matching (image, text) pairs close
- Push non-matching pairs apart
- Softmax over similarity matrix across batch
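The contrastive objective above can be sketched as a symmetric cross-entropy over the batch similarity matrix. A minimal NumPy sketch, assuming pre-computed embeddings and a fixed temperature (CLIP actually learns the temperature and trains at very large batch sizes):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix, scaled by temperature
    logits = img @ txt.T / temperature

    # Cross-entropy in both directions; the correct targets are the diagonal
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_probs[idx, idx].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

Pulling matched pairs together and pushing mismatched pairs apart both fall out of this one loss: the diagonal entries are the positives, every off-diagonal entry in the same row or column is a negative.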
- Advantages:
- Strong zero-shot performance: no fine-tuning required
- Generalizes to tasks it wasn't trained on
- Works well on rare classes & non-standard categories
- Limitations:
- Still struggles on fine-grained tasks
- Requires large compute and dataset for training
- Sensitive to prompt phrasing (zero-shot brittleness)
- Zero-shot classification with prompts like "a photo of a cat"
- Image retrieval from text (and vice versa)
- Basis for downstream models (e.g., DALL·E, Flamingo, BLIP)
- Used in multimodal agents, content filtering, and visual Q&A
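Zero-shot classification reduces to embedding one prompt per class and taking the argmax cosine similarity. A sketch with a stand-in encoder (real use would call CLIP's image and text encoders; `toy_text_encoder` here is purely illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="a photo of a {}"):
    """CLIP-style zero-shot classification: embed one prompt per class,
    score the image against each by cosine similarity, pick the argmax."""
    txt = np.stack([text_encoder(template.format(c)) for c in class_names])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    img = image_emb / np.linalg.norm(image_emb)
    scores = txt @ img  # one cosine score per class
    return class_names[int(np.argmax(scores))]

# Stand-in encoder: a deterministic pseudo-random vector per string.
# A real pipeline would use CLIP's text encoder here.
def toy_text_encoder(s):
    return np.random.default_rng(abs(hash(s)) % 2**32).normal(size=16)
```

The prompt template matters in practice (the "zero-shot brittleness" noted above); CLIP's authors ensemble many templates to stabilize accuracy.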
- GPT-4 shows remarkable multimodal capabilities but is closed-source and heavy
- There's a need for a lightweight, open alternative that mimics GPT-4-like image-text reasoning
- Goal: Enable vision-to-language understanding and generation using smaller models and open tools
- Leverage frozen LLMs (like Vicuna) + vision encoders with minimal training
- Image-text pairs with instruction-following format
- Trained in two stages:
- Pre-alignment: align vision features with language embeddings using 5M image-text pairs
- Instruction tuning: fine-tune on ~3K high-quality vision-language instruction-following examples
- Datasets include LAION, Conceptual Captions, and curated instruct data
- Image Encoder: frozen ViT + pretrained Q-Former (both from BLIP-2)
- Language Model: Vicuna-7B (frozen)
- A linear projection layer maps vision features into Vicuna's embedding space
- Uses cross-entropy loss during instruction fine-tuning
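The projection layer is the only trained component, which is what makes MiniGPT-4 cheap to align. A minimal sketch; the 768/4096 widths are assumptions matching Q-Former outputs and Vicuna-7B embeddings:

```python
import numpy as np

# Illustrative dimensions (assumptions): the Q-Former emits 32 query
# tokens of width 768; Vicuna-7B's input-embedding width is 4096.
VIS_DIM, LLM_DIM = 768, 4096

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(VIS_DIM, LLM_DIM))  # the only trained weights
b = np.zeros(LLM_DIM)

def project_vision_tokens(qformer_tokens):
    """Map (num_tokens, VIS_DIM) Q-Former outputs into the frozen LLM's
    input-embedding space, so they can be spliced into the text prompt."""
    return qformer_tokens @ W + b

# 32 visual "soft prompt" tokens, now in the LLM's embedding space
soft_prompt = project_vision_tokens(rng.normal(size=(32, VIS_DIM)))
```

The projected tokens are concatenated with ordinary text embeddings and fed to the frozen Vicuna; gradients flow only into `W` and `b`.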
- Advantages:
- Efficient: 7B LLM + lightweight vision alignment
- Open-source alternative to GPT-4's multimodal reasoning
- Good at describing, reasoning, and interacting over images
- Limitations:
- Still lags behind GPT-4 in fine-grained reasoning
- Heavily dependent on instruction tuning quality
- Inference may be slower without quantization or optimization
- Image captioning (dense, factual, creative)
- Visual question answering
- Dialogue with images
- Foundation for multimodal agents (e.g., using LangChain, Gradio)
- Existing vision-language models perform well with full fine-tuning
- Few-shot generalization (like GPT-3) is missing in multimodal models
- Goal: Build a model that can understand and respond to image/text sequences in few-shot settings
- Pretrained vision encoder + Perceiver Resampler + frozen text decoder (Chinchilla)
- Large-scale pretraining on interleaved image-text corpora (public + proprietary)
- Uses captioning, QA, and multimodal dialogue data
- Combines:
- Vision encoder: pretrained NFNet backbone (frozen)
- Perceiver Resampler: compresses vision features
- Gated cross-attention layers (xAttn): insert into frozen language model
- Loss: Language modeling loss (causal LM) over text tokens conditioned on images
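The gated cross-attention insertion can be sketched as below: a single-head, unmasked simplification (the real model uses multi-head attention, masking so text attends only to the most recent images, and learnable per-layer gates):

```python
import numpy as np

def gated_cross_attention(text_h, vis_tokens, Wq, Wk, Wv, alpha):
    """One Flamingo-style gated xattn step (single head, no masking).

    text_h:     (T, D) hidden states from the frozen LM layer
    vis_tokens: (V, D) Perceiver Resampler outputs
    alpha:      scalar gate; initialized to 0 so training starts from
                the unmodified frozen language model.
    """
    q = text_h @ Wq
    k = vis_tokens @ Wk
    v = vis_tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # tanh gate: the new branch contributes nothing when alpha == 0,
    # so the frozen LM's behavior is preserved at initialization
    return text_h + np.tanh(alpha) * (attn @ v)
```

Zero-initializing the gate is the key trick: the inserted layers start as identity functions, and the model gradually learns how much visual signal to mix in.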
- Advantages:
- Supports few-shot VQA, captioning, reasoning with no fine-tuning
- Highly modular: frozen backbone + light insertions (xAttn)
- Sets SOTA on several multimodal benchmarks
- Limitations:
- Proprietary data makes full reproducibility hard
- Model size and training cost are high
- Text generation is sensitive to formatting of few-shot prompts
- Image-based QA (e.g., ScienceQA, OKVQA)
- Multimodal dialogue and image-grounded reasoning
- Visual understanding with few-shot prompts (like in NLP)
- Inspires models like OpenFlamingo, MiniGPT-4, IDEFICS
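Few-shot prompting in these models interleaves image placeholders with text, mirroring in-context learning in NLP. A sketch of the prompt format, using the `<image>` and `<|endofchunk|>` marker tokens from the OpenFlamingo convention (Flamingo's own special tokens differ in name but play the same role):

```python
def build_fewshot_prompt(support_captions, query):
    """Build an interleaved image-text prompt: each <image> marks where
    that example's vision tokens are spliced in; the final <image> is
    the query image whose continuation the model must generate."""
    parts = [f"<image>{cap}<|endofchunk|>" for cap in support_captions]
    parts.append(f"<image>{query}")
    return "".join(parts)

prompt = build_fewshot_prompt(
    ["Output: a dog on the beach.", "Output: two cats on a sofa."],
    "Output:")
```

As with GPT-3 prompting, small formatting changes (separator tokens, caption phrasing) can shift accuracy noticeably, which is the prompt-sensitivity limitation noted above.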