IPT is a sequential pipeline of seven models, the seventh (Tracklet2ID) being the newest addition, in which changing any upstream model forces retraining of multiple downstream models. The coupling is concrete: 64 of 71 SUSHI node features are raw ReID embeddings, and track classification parses JND output positionally at [:3], [3:33], [33:].
The root cause is structural: models were added incrementally as independent units rather than designed as a coherent system.
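For concreteness, a minimal sketch of that positional contract (the function name is hypothetical; the slice boundaries match the 133-dim JND layout of objectness + digit + whole-number logits described later):

```python
import numpy as np

# Hypothetical illustration of the positional coupling: track classification
# slices the 133-dim JND output by hard-coded index ranges, so any change to
# the JND head layout silently breaks this downstream consumer.
def split_jnd_features(jnd: np.ndarray):
    assert jnd.shape[-1] == 133  # 3 + 30 + 100
    objectness = jnd[..., :3]        # objectness scores
    digit_logits = jnd[..., 3:33]    # per-digit logits
    number_logits = jnd[..., 33:]    # whole-number logits (0-99)
    return objectness, digit_logits, number_logits
```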
```mermaid
graph TD
V[Video<br/>L+R cameras] --> P[Radial Rainbowmelt<br/>Panorama 512x1024]
P --> D["<b>MODEL 1: Heatmap Detector</b><br/>→ (x,z) foot-points"]
D -->|"project to perspective<br/>cameras, extract crops"| C[Perspective Crops<br/>128x256]
C --> R["<b>MODEL 2: ReID</b><br/>64-dim embedding"]
C --> J["<b>MODEL 3: JND</b><br/>133-dim JN logits"]
R --> S["<b>MODEL 4: SUSHI Tracker</b><br/>GNN on detection graphs<br/>→ track IDs"]
J --> S
S --> CC["<b>MODEL 5: Category Classifier</b><br/>→ team category"]
S --> JC["<b>MODEL 6: JN Classifier</b><br/>→ jersey number"]
R --> CC
J --> CC
R --> JC
J --> JC
S -->|"track structure"| T["<b>MODEL 7: Tracklet2ID</b><br/>→ global player IDs"]
R -->|"ReID embeddings"| T
J -.->|"JND features (optional)"| T
D -->|"spatial features"| T
style D fill:#f9d,stroke:#333
style R fill:#f9d,stroke:#333
style J fill:#f9d,stroke:#333
style S fill:#f9d,stroke:#333
style CC fill:#f9d,stroke:#333
style JC fill:#f9d,stroke:#333
style T fill:#f9d,stroke:#333
```
Seven models and roughly ten dependency edges. Changing ReID forces retraining of SUSHI, Cat Cls, JN Cls, and Tracklet2ID (its four direct consumers).
Two paths, both reducing to 5 models. Path A is the pragmatic consolidation. Path B layers on MOTIP and temporal features as an independent upgrade.
```mermaid
graph TD
V[Video<br/>L+R cameras] --> P[Radial Rainbowmelt<br/>Panorama 512x1024]
P --> D["<b>MODEL 1: Detector</b><br/>existing heatmap or DETR<br/>→ (x,z) foot-points"]
D -->|"project to perspective<br/>cameras, extract crops"| CE["<b>MODEL 2: Crop Encoder</b><br/>multi-task, per-frame<br/>• appearance embedding<br/>• jersey number features<br/>• preliminary team"]
CE --> MT["<b>MODEL 3: Tracker + Category Head</b><br/>SUSHI GNN with category classification<br/>→ track IDs + team category"]
CE -->|"JND features<br/>+ track assignments"| JN["<b>MODEL 4: JN Classifier</b><br/>temporal transformer<br/>→ jersey number"]
MT --> JN
MT -->|"track structure"| IR["<b>MODEL 5: Identity Resolver</b><br/>→ global player IDs + refined labels"]
CE -.->|"embeddings +<br/>spatial features"| IR
style D fill:#bfb,stroke:#333
style CE fill:#bfb,stroke:#333
style MT fill:#bfb,stroke:#333
style JN fill:#bfb,stroke:#333
style IR fill:#bfb,stroke:#333
```
5 models. Merges ReID+JND into Crop Encoder, adds category head to SUSHI. JN classification stays separate (architectural mismatch with GNN). Changing Crop Encoder retrains Tracker+Cat and JN Classifier (2 models instead of 4).
```mermaid
graph TD
V[Video<br/>L+R cameras] --> P[Radial Rainbowmelt<br/>Panorama 512x1024]
P --> M["<b>MODEL 1: MOTIP</b><br/>Keypoint DETR + ID Decoder<br/>→ (x,z) foot-points<br/>→ micro-track IDs (40-frame, 50 slots)<br/>→ 256-dim DETR embeddings"]
M -->|"project to perspective cameras<br/>extract crops grouped by<br/>MOTIP micro-track IDs"| TCE["<b>MODEL 2: Temporal Crop Encoder</b><br/>backbone + temporal attention<br/>• stabilized appearance embedding<br/>• stabilized JN features<br/>• preliminary team"]
TCE --> MT["<b>MODEL 3: Tracker + Category Head</b><br/>SUSHI GNN with category classification<br/>→ track IDs + team category<br/>(graph ~10x smaller vs Path A)"]
M -.->|"optional DETR<br/>embeddings"| MT
TCE -->|"JND features<br/>+ track assignments"| JN["<b>MODEL 4: JN Classifier</b><br/>temporal transformer<br/>→ jersey number"]
MT --> JN
MT -->|"track structure"| IR["<b>MODEL 5: Identity Resolver</b><br/>→ global player IDs + refined labels"]
TCE -.->|"embeddings +<br/>spatial features"| IR
style M fill:#bbf,stroke:#333
style TCE fill:#bbf,stroke:#333
style MT fill:#bbf,stroke:#333
style JN fill:#bbf,stroke:#333
style IR fill:#bbf,stroke:#333
```
Same 5 models, but with MOTIP providing micro-track IDs for temporal crop grouping, and an optional DETR embedding shortcut to the Tracker.
```mermaid
graph LR
subgraph "Current -- change ReID: retrain 4 models"
det1[Detection] --> sushi1[SUSHI]
reid1[ReID] --> sushi1
jnd1[JND] --> sushi1
sushi1 --> cat1[Cat Cls]
sushi1 --> jn1[JN Cls]
reid1 --> cat1
jnd1 --> cat1
reid1 --> jn1
jnd1 --> jn1
sushi1 --> t2id1[Tracklet2ID]
reid1 --> t2id1
jnd1 -.-> t2id1
det1 --> t2id1
end
subgraph "Proposed -- change Crop Encoder: retrain 2 models"
det2[Detector/MOTIP] --> crop2[Crop Encoder]
crop2 --> macro2[Tracker<br/>+ Cat Head]
crop2 --> jn2[JN Classifier]
macro2 --> jn2
macro2 --> t2id2[Identity Resolver]
crop2 -.-> t2id2
end
style reid1 fill:#f99,stroke:#333
style sushi1 fill:#fcc,stroke:#333
style cat1 fill:#fcc,stroke:#333
style jn1 fill:#fcc,stroke:#333
style t2id1 fill:#fcc,stroke:#333
style crop2 fill:#9f9,stroke:#333
style macro2 fill:#cfc,stroke:#333
style jn2 fill:#cfc,stroke:#333
```
| Model 1: Detector | Path A | Path B |
|---|---|---|
| Architecture | Existing heatmap or keypoint-mode DETR | MOTIP: Keypoint DETR (output_dim=2) + ID Decoder |
| Output | (x,z) foot-points | (x,z) foot-points + micro-track IDs (40-frame, 50 slots) + 256-dim DETR embeddings |
| Status | Already in production | Already adapted for keypoint annotations and working in codebase |
Annotations are foot-points only (no bounding boxes). The existing MOTIP project already solved the keypoint adaptation: output_dim=2 propagates through the DETR head, matcher, loss, and evaluation. The ID decoder operates on 256-dim DETR embeddings and required zero changes from the original paper. Deployment is two TRT engines (DETR + ID decoder). Crops are extracted by projecting (x,z) into perspective camera space using camera geometry.
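A geometry sketch of that last step, under standard pinhole assumptions (K, R, t are hypothetical camera parameters; the production geometry utilities are not shown):

```python
import numpy as np

def project_footpoint(x: float, z: float, K: np.ndarray,
                      R: np.ndarray, t: np.ndarray) -> tuple[float, float]:
    """Project a ground-plane foot-point (x, z), assuming y = 0, into a
    perspective camera via the pinhole model."""
    world = np.array([x, 0.0, z])
    cam = R @ world + t            # world -> camera coordinates
    u, v, w = K @ cam              # camera -> homogeneous image coordinates
    return u / w, v / w

def crop_box(u: float, v: float, width: int = 128, height: int = 256):
    # Center the 128x256 crop horizontally on the foot-point and anchor its
    # bottom edge there, so the player's body fills the crop.
    return (int(u - width / 2), int(v - height), int(u + width / 2), int(v))
```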
| Model 2: Crop Encoder | Path A | Path B |
|---|---|---|
| Temporal scale | Single frame | Micro-tracklet (5-15 frames, grouped by MOTIP track IDs) |
| Input | Single perspective crop (128x256) per detection | Perspective crops grouped by MOTIP micro-track IDs |
| Architecture | Shared crop backbone + three task heads | Same backbone + temporal attention layer + three task heads |
| Replaces | ReID + JND (2 models into 1) | ReID + JND + implicit temporal aggregation |
Both paths use the same backbone and three task heads (appearance embedding, jersey number features, preliminary team category). Path B adds temporal attention that pools evidence across frames. MOTIP's micro-track IDs solve the chicken-and-egg problem: crops are grouped by MOTIP's 40-frame ID persistence without needing separate association logic.
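A minimal PyTorch sketch of the shared-backbone design (module names and dimensions are illustrative, not the production code; the team-class count is a placeholder). Path A runs it with T = 1 and no temporal layer; Path B enables the temporal attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CropEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, temporal: bool = False):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for the real crop backbone
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Path B only: attention across the frames of a micro-tracklet.
        self.temporal = (
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
            if temporal else None
        )
        self.reid_head = nn.Linear(feat_dim, 64)   # appearance embedding
        self.jnd_head = nn.Linear(feat_dim, 133)   # jersey-number features
        self.team_head = nn.Linear(feat_dim, 4)    # preliminary team (placeholder count)

    def forward(self, crops: torch.Tensor) -> dict:
        # crops: (B, T, 3, 256, 128); T = 1 for Path A, 5-15 for Path B.
        b, t = crops.shape[:2]
        feats = self.backbone(crops.flatten(0, 1)).view(b, t, -1)
        if self.temporal is not None:
            feats = self.temporal(feats)  # stabilize evidence across frames
        pooled = feats.mean(dim=1)
        return {
            "appearance": F.normalize(self.reid_head(pooled), dim=-1),
            "jnd": self.jnd_head(feats),  # kept per-frame: the JN classifier
                                          # aggregates per-detection features itself
            "team": self.team_head(pooled),
        }
```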
Path A is the migration entry point. Path B is the upgrade.
| Model 3: Tracker + Category Head | |
|---|---|
| Temporal scale | ~128 frames (local temporal) |
| Input | Embeddings from Model 2 (per-frame or per-micro-tracklet), (x,z) positions, optionally DETR embeddings from Model 1 |
| Output | Track IDs + team category |
| Replaces | SUSHI + Category Classification (2 models into 1) |
Why category fits in the tracker but jersey number doesn't:
Category classification already uses SUSHI's GNN building blocks (FeatureEncoder, MPNTrackConv). SUSHI produces per-node embeddings at every hierarchy level that it currently discards -- only edge logits are returned. Adding a category MLP head to the final depth level's node embeddings is a small change. The precomputed parquets already contain category columns that SUSHI currently ignores.
Jersey number classification uses a TrackTransformer (temporal aggregation of per-detection JND features via mean pooling + attention), not a GNN. JND features (133-dim: objectness + digit logits + whole-number logits) need per-detection temporal aggregation. Forcing this into a GNN framework would lose the inductive bias that makes it work.
Risks: Multi-task training (edge association + category classification) has no precedent in this codebase. Category loss gradients could degrade tracking HOTA. Mitigation: start with a stop-gradient option on the classification head's contribution to the GNN backbone.
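A sketch of the category head with that stop-gradient option (SUSHI internals are stubbed; node_embeddings stands for the final-depth per-node embeddings the GNN already computes and currently discards):

```python
import torch
import torch.nn as nn

class CategoryHead(nn.Module):
    def __init__(self, node_dim: int, num_categories: int, stop_gradient: bool = True):
        super().__init__()
        self.stop_gradient = stop_gradient
        self.mlp = nn.Sequential(
            nn.Linear(node_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, num_categories),
        )

    def forward(self, node_embeddings: torch.Tensor) -> torch.Tensor:
        if self.stop_gradient:
            # Category loss cannot perturb the GNN backbone, so tracking HOTA
            # is protected while the multi-task setup is validated.
            node_embeddings = node_embeddings.detach()
        return self.mlp(node_embeddings)

# Joint objective sketch: with stop_gradient=True this reduces to the current
# tracking loss plus an independent classifier on frozen node embeddings.
# loss = edge_association_loss + lambda_cat * F.cross_entropy(logits, labels)
```

Flipping stop_gradient to False turns on true multi-task training, which is the riskier end-to-end variant.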
Tracker architecture candidates:
| Candidate | Basis | Notes |
|---|---|---|
| SUSHI + category head | Current codebase | Proven GNN, easiest migration, category classifier already uses SUSHI's blocks |
| MOTIP + category head | Current codebase | End-to-end, no heuristic matching |
| OVTR-style | Literature (ICLR 2025) | Joint tracking+classification with category propagation |
| PuTR-style | Literature (2024) | Pure Transformer, strong on SportsMOT |
| Model 4: JN Classifier | |
|---|---|
| Temporal scale | Per-track (aggregates across all detections in a track) |
| Input | JND features from Crop Encoder + track assignments from Tracker |
| Output | Jersey number per track (101 classes: 0-99 + unknown) |
| Architecture | Temporal transformer (largely as-is from current TrackTransformer) |
Lightweight model. Consumes track assignments (which detections belong to which track) and per-detection JND features, then aggregates via mean pooling + attention. Architecturally identical to the current JN classifier -- the only change is that JND features come from the Crop Encoder instead of a separate JND model.
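A sketch of that aggregation in the current TrackTransformer's spirit (layer sizes are illustrative; the padding mask marks slots without a real detection):

```python
import torch
import torch.nn as nn

class JNClassifier(nn.Module):
    def __init__(self, jnd_dim: int = 133, d_model: int = 128, num_classes: int = 101):
        super().__init__()
        self.proj = nn.Linear(jnd_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, num_classes)  # 0-99 + unknown

    def forward(self, jnd_feats: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # jnd_feats: (tracks, detections, 133); pad_mask: True where padded.
        x = self.encoder(self.proj(jnd_feats), src_key_padding_mask=pad_mask)
        x = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        # Mean-pool over the real detections of each track.
        pooled = x.sum(dim=1) / (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.head(pooled)
```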
| Model 5: Identity Resolver | |
|---|---|
| Temporal scale | Full match |
| Input | Track structure from Model 3, embeddings + spatial features from Model 2, player roster. (ReID + spatial proven in current Tracklet2ID; all inputs optional.) |
| Output | Global player IDs, refined team + jersey labels |
| Replaces | Tracklet2ID (largely as-is) |
| What changed | Retrain (current 7) | Retrain (proposed 5) |
|---|---|---|
| Detection/MOTIP | 5 (ReID, JND, SUSHI, Cat Cls, JN Cls) + Tracklet2ID | 2 (Crop Encoder, Tracker+Cat) + JN Cls + Identity Resolver |
| Crop Encoder | 4 (SUSHI, Cat Cls, JN Cls, Tracklet2ID) | 2 (Tracker+Cat, JN Cls) + Identity Resolver |
| Tracker approach | 2 (Cat Cls, JN Cls) | 1 (JN Cls) |
| JN Classification | 0 | 0 |
Why the five models can't be merged further:
- Model 1 vs 2: Different image spaces. Detector: radial_rainbowmelt panorama. Crop encoder: raw perspective crops. Cannot share a backbone.
- Model 2 vs 3: Different modalities. CNN on pixels vs GNN/Transformer on embedding sequences.
- Model 3 vs 4 (why not merge JN into tracker): Architectural mismatch. JN classification uses a temporal transformer aggregating per-detection JND features; the GNN-based tracker doesn't provide the right inductive bias, and the current TrackTransformer outperforms GNN-based JN classification.
- Models 3-4 vs 5: Different temporal scales. Models 3-4 are streaming (~128-frame chunks); Model 5 is offline (full match).
Each step is independently validatable against the current pipeline:
1. Multi-task crop model (Path A): Merge ReID + JND into one model with a shared backbone and three heads. Drop it into the current pipeline in place of ReID + JND. Validate via PTFE.
2. Category head on tracker (Model 3): Add a category classification MLP to SUSHI's final-depth node embeddings. Validate that tracking HOTA is preserved and category accuracy matches the standalone classifier.
3. MOTIP as detector (Model 1, Path B): Already in the codebase. Evaluate against the heatmap detector; replace it when it matches or exceeds.
4. Temporal crop encoding (Path B, optional): Add temporal attention to Model 2. Measure the improvement over Path A.
5. Identity Resolver (Model 5): Tracklet2ID continues as-is, consuming richer features from Models 2-3.
Steps 1-2 deliver the core consolidation. Steps 3-4 are independent enhancements.
The current infrastructure pain points -- monolithic precompute hash, implicit model compatibility, manual cascade -- become simpler with fewer models and fewer dependency edges.
Currently, the PlayerTrackingDataGenerator hashes all model IDs together, and nothing prevents deploying a SUSHI trained with PE-262 embeddings alongside a PE-300 ReID model. With 7 models and ~10 dependency edges, tracking compatibility is error-prone.
With 5 models and 4 dependency edges, provenance becomes tractable. Each model records which upstream models it was trained with. The ModelCollection in models.yaml (from the active AI-302/AI-303 migration) validates compatibility at promotion time. Fewer models means fewer compatibility relationships and fewer ways to get it wrong.
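An illustrative provenance check in that spirit (names are hypothetical, not the AI-302/AI-303 API):

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    ref: str                                          # e.g. "crop-encoder@PE-300"
    trained_with: dict = field(default_factory=dict)  # role -> upstream ref

def validate_collection(deployed: dict) -> list:
    """Flag any model deployed alongside a different upstream version than
    the one it was trained with."""
    errors = []
    for name, model in deployed.items():
        for role, expected_ref in model.trained_with.items():
            upstream = deployed.get(role)
            if upstream is not None and upstream.ref != expected_ref:
                errors.append(f"{name}: trained with {role}={expected_ref}, "
                              f"but {upstream.ref} is deployed")
    return errors
```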
The monolithic PlayerTrackingDataGenerator invalidates all artifacts when any model changes. With the proposed architecture, precompute splits naturally into two layers:
| Layer | Produces | Hash depends on | Invalidated by |
|---|---|---|---|
| Frame-level | Detections + crop encoder embeddings per frame | Detector ref + Crop Encoder ref | Detector or Crop Encoder change |
| Track-level | Track assignments + category labels | Frame-level hash + Tracker ref | Tracker change (or upstream) |
Changing the JN classifier or Identity Resolver requires zero precompute regeneration -- they consume existing embeddings and track assignments. Changing the tracker regenerates only track-level artifacts. This aligns with the existing lab.datasets BaseArtifactGenerator framework (two generators with content-addressable hashing). The existing AI-244 issue covers the implementation path.
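A sketch of the two-layer hashing (function names are hypothetical; the real implementation would live in the lab.datasets BaseArtifactGenerator framework):

```python
import hashlib
import json

def content_hash(**inputs) -> str:
    # Content-addressable: identical inputs always map to the same key.
    blob = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

frame_hash = content_hash(detector="detector@v7", crop_encoder="crop-encoder@v3")
track_hash = content_hash(frame_level=frame_hash, tracker="tracker-cat@v2")

# Changing the JN classifier or Identity Resolver touches neither hash, so no
# precompute regenerates; changing the tracker invalidates only track_hash.
```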
Current cascade for a ReID change: precompute all, retrain SUSHI, retrain Cat Cls, retrain JN Cls, run PTFE -- four sequential manual steps after the precompute.
Proposed cascade for a Crop Encoder change: precompute frame-level, retrain Tracker+Cat, retrain JN Cls, run PTFE -- three steps after the precompute, and JN Cls can train in parallel with PTFE (no dependency between them). The DAG is small enough that a shell script suffices; a full Metaflow orchestration flow is likely YAGNI until model upgrades become more frequent.
Open questions:
- Tracker architecture for Model 3: SUSHI vs MOTIP vs OVTR vs PuTR
- Multi-task training: does adding a category head to SUSHI degrade tracking HOTA?
- Stop-gradient vs end-to-end: should classification gradients flow through the GNN backbone?
- Path A vs Path B: does temporal aggregation improve over frame-level embeddings?
- Micro-tracklet window size (5 vs 10 vs 15 frames)
- Whether DETR coarse embeddings help the tracker alongside crop embeddings
References:
- MOTIP (Gao et al., CVPR 2025): Tracking as ID prediction. In codebase, adapted for keypoints. 82 citations.
- OVTR (Li et al., ICLR 2025): End-to-end open-vocabulary tracker modeling motion, appearance, and category simultaneously.
- PuTR (Liu et al., 2024): Pure Transformer for separated online MOT. Strong on SportsMOT.
- Koshkina et al. (WACV 2025): Jersey number + team ID features improve long-term sports player tracking.
- Hybrid-SORT (Yang et al., AAAI 2024): Training-free weak-cue enhancement, plug-and-play on any detector.