IPT is a sequential pipeline of seven models, the seventh (Tracklet2ID) being the newest addition, in which changing any upstream model forces retraining of multiple downstream models. The coupling is concrete: 64 of 71 SUSHI node features are raw ReID embeddings, and track classification parses JND output positionally at [:3], [3:33], [33:].
The root cause is structural: models were added incrementally as independent units rather than designed as a coherent system.
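For concreteness, a minimal sketch of that positional contract (the function name is hypothetical; the slice boundaries match the 133-dim JND layout of objectness + digit + whole-number logits described later):

```python
import numpy as np

# Hypothetical illustration of the positional coupling: track classification
# slices the 133-dim JND output by hard-coded index ranges, so any change to
# the JND head layout silently breaks this downstream consumer.
def split_jnd_features(jnd: np.ndarray):
    assert jnd.shape[-1] == 133  # 3 + 30 + 100
    objectness = jnd[..., :3]        # objectness scores
    digit_logits = jnd[..., 3:33]    # per-digit logits
    number_logits = jnd[..., 33:]    # whole-number logits (0-99)
    return objectness, digit_logits, number_logits
```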
```mermaid
graph TD
V[Video<br/>L+R cameras] --> P[Radial Rainbowmelt<br/>Panorama 512x1024]
P --> D["<b>MODEL 1: Heatmap Detector</b><br/>→ (x,z) foot-points"]
D -->|"project to perspective<br/>cameras, extract crops"| C[Perspective Crops<br/>128x256]
C --> R["<b>MODEL 2: ReID</b><br/>64-dim embedding"]
C --> J["<b>MODEL 3: JND</b><br/>133-dim JN logits"]
R --> S["<b>MODEL 4: SUSHI Tracker</b><br/>GNN on detection graphs<br/>→ track IDs"]
J --> S
S --> CC["<b>MODEL 5: Category Classifier</b><br/>→ team category"]
S --> JC["<b>MODEL 6: JN Classifier</b><br/>→ jersey number"]
R --> CC
J --> CC
R --> JC
J --> JC
S -->|"track structure"| T["<b>MODEL 7: Tracklet2ID</b><br/>→ global player IDs"]
R -->|"ReID embeddings"| T
J -.->|"JND features (optional)"| T
D -->|"spatial features"| T
style D fill:#f9d,stroke:#333
style R fill:#f9d,stroke:#333
style J fill:#f9d,stroke:#333
style S fill:#f9d,stroke:#333
style CC fill:#f9d,stroke:#333
style JC fill:#f9d,stroke:#333
style T fill:#f9d,stroke:#333
```
Seven models and roughly ten dependency edges. Changing ReID forces retraining of SUSHI, Cat Cls, JN Cls, and Tracklet2ID (its four direct consumers).
Two paths, both reducing to 5 models. Path A is the pragmatic consolidation. Path B layers on MOTIP and temporal features as an independent upgrade.
```mermaid
graph TD
V[Video<br/>L+R cameras] --> P[Radial Rainbowmelt<br/>Panorama 512x1024]
P --> D["<b>MODEL 1: Detector</b><br/>existing heatmap or DETR<br/>→ (x,z) foot-points"]
D -->|"project to perspective<br/>cameras, extract crops"| CE["<b>MODEL 2: Crop Encoder</b><br/>multi-task, per-frame<br/>• appearance embedding<br/>• jersey number features<br/>• preliminary team"]
CE --> MT["<b>MODEL 3: Tracker + Category Head</b><br/>SUSHI GNN with category classification<br/>→ track IDs + team category"]
CE -->|"JND features<br/>+ track assignments"| JN["<b>MODEL 4: JN Classifier</b><br/>temporal transformer<br/>→ jersey number"]
MT --> JN
MT -->|"track structure"| IR["<b>MODEL 5: Identity Resolver</b><br/>→ global player IDs + refined labels"]
CE -.->|"embeddings +<br/>spatial features"| IR
style D fill:#bfb,stroke:#333
style CE fill:#bfb,stroke:#333
style MT fill:#bfb,stroke:#333
style JN fill:#bfb,stroke:#333
style IR fill:#bfb,stroke:#333
```
5 models. Merges ReID+JND into Crop Encoder, adds category head to SUSHI. JN classification stays separate (architectural mismatch with GNN). Changing Crop Encoder retrains Tracker+Cat and JN Classifier (2 models instead of 4).
```mermaid
graph TD
V[Video<br/>L+R cameras] --> P[Radial Rainbowmelt<br/>Panorama 512x1024]
P --> M["<b>MODEL 1: MOTIP</b><br/>Keypoint DETR + ID Decoder<br/>→ (x,z) foot-points<br/>→ micro-track IDs (40-frame, 50 slots)<br/>→ 256-dim DETR embeddings"]
M -->|"project to perspective cameras<br/>extract crops grouped by<br/>MOTIP micro-track IDs"| TCE["<b>MODEL 2: Temporal Crop Encoder</b><br/>backbone + temporal attention<br/>• stabilized appearance embedding<br/>• stabilized JN features<br/>• preliminary team"]
TCE --> MT["<b>MODEL 3: Tracker + Category Head</b><br/>SUSHI GNN with category classification<br/>→ track IDs + team category<br/>(graph ~10x smaller vs Path A)"]
M -.->|"optional DETR<br/>embeddings"| MT
TCE -->|"JND features<br/>+ track assignments"| JN["<b>MODEL 4: JN Classifier</b><br/>temporal transformer<br/>→ jersey number"]
MT --> JN
MT -->|"track structure"| IR["<b>MODEL 5: Identity Resolver</b><br/>→ global player IDs + refined labels"]
TCE -.->|"embeddings +<br/>spatial features"| IR
style M fill:#bbf,stroke:#333
style TCE fill:#bbf,stroke:#333
style MT fill:#bbf,stroke:#333
style JN fill:#bbf,stroke:#333
style IR fill:#bbf,stroke:#333
```
Same 5 models, but with MOTIP providing micro-track IDs for temporal crop grouping, and an optional DETR embedding shortcut to the Tracker.
```mermaid
graph LR
subgraph "Current -- change ReID: retrain 4 models"
det1[Detection] --> sushi1[SUSHI]
reid1[ReID] --> sushi1
jnd1[JND] --> sushi1
sushi1 --> cat1[Cat Cls]
sushi1 --> jn1[JN Cls]
reid1 --> cat1
jnd1 --> cat1
reid1 --> jn1
jnd1 --> jn1
sushi1 --> t2id1[Tracklet2ID]
reid1 --> t2id1
jnd1 -.-> t2id1
det1 --> t2id1
end
subgraph "Proposed -- change Crop Encoder: retrain 2 models"
det2[Detector/MOTIP] --> crop2[Crop Encoder]
crop2 --> macro2[Tracker<br/>+ Cat Head]
crop2 --> jn2[JN Classifier]
macro2 --> jn2
macro2 --> t2id2[Identity Resolver]
crop2 -.-> t2id2
end
style reid1 fill:#f99,stroke:#333
style sushi1 fill:#fcc,stroke:#333
style cat1 fill:#fcc,stroke:#333
style jn1 fill:#fcc,stroke:#333
style t2id1 fill:#fcc,stroke:#333
style crop2 fill:#9f9,stroke:#333
style macro2 fill:#cfc,stroke:#333
style jn2 fill:#cfc,stroke:#333
```
| Model 1: Detector | Path A | Path B |
|---|---|---|
| Architecture | Existing heatmap or keypoint-mode DETR | MOTIP: Keypoint DETR (output_dim=2) + ID Decoder |
| Output | (x,z) foot-points | (x,z) foot-points + micro-track IDs (40-frame, 50 slots) + 256-dim DETR embeddings |
| Status | Already in production | Already adapted for keypoint annotations and working in codebase |
Annotations are foot-points only (no bounding boxes). The existing MOTIP project already solved the keypoint adaptation: output_dim=2 propagates through the DETR head, matcher, loss, and evaluation. The ID decoder operates on 256-dim DETR embeddings and required zero changes from the original paper. Deployment is two TRT engines (DETR + ID decoder). Crops are extracted by projecting (x,z) into perspective camera space using camera geometry.
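A geometry sketch of that last step, under standard pinhole assumptions (K, R, t are hypothetical camera parameters; the production geometry utilities are not shown):

```python
import numpy as np

def project_footpoint(x: float, z: float, K: np.ndarray,
                      R: np.ndarray, t: np.ndarray) -> tuple[float, float]:
    """Project a ground-plane foot-point (x, z), assuming y = 0, into a
    perspective camera via the pinhole model."""
    world = np.array([x, 0.0, z])
    cam = R @ world + t            # world -> camera coordinates
    u, v, w = K @ cam              # camera -> homogeneous image coordinates
    return u / w, v / w

def crop_box(u: float, v: float, width: int = 128, height: int = 256):
    # Center the 128x256 crop horizontally on the foot-point and anchor its
    # bottom edge there, so the player's body fills the crop.
    return (int(u - width / 2), int(v - height), int(u + width / 2), int(v))
```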
| Model 2: Crop Encoder | Path A | Path B |
|---|---|---|
| Temporal scale | Single frame | Micro-tracklet (5-15 frames, grouped by MOTIP track IDs) |
| Input | Single perspective crop (128x256) per detection | Perspective crops grouped by MOTIP micro-track IDs |
| Architecture | Shared crop backbone + three task heads | Same backbone + temporal attention layer + three task heads |
| Replaces | ReID + JND (2 models into 1) | ReID + JND + implicit temporal aggregation |
Both paths use the same backbone and three task heads (appearance embedding, jersey number features, preliminary team category). Path B adds temporal attention that pools evidence across frames. MOTIP's micro-track IDs solve the chicken-and-egg problem: crops are grouped by MOTIP's 40-frame ID persistence without needing separate association logic.
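A minimal PyTorch sketch of the shared-backbone design (module names and dimensions are illustrative, not the production code; the team-class count is a placeholder). Path A runs it with T = 1 and no temporal layer; Path B enables the temporal attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CropEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, temporal: bool = False):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for the real crop backbone
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # Path B only: attention across the frames of a micro-tracklet.
        self.temporal = (
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
            if temporal else None
        )
        self.reid_head = nn.Linear(feat_dim, 64)   # appearance embedding
        self.jnd_head = nn.Linear(feat_dim, 133)   # jersey-number features
        self.team_head = nn.Linear(feat_dim, 4)    # preliminary team (placeholder count)

    def forward(self, crops: torch.Tensor) -> dict:
        # crops: (B, T, 3, 256, 128); T = 1 for Path A, 5-15 for Path B.
        b, t = crops.shape[:2]
        feats = self.backbone(crops.flatten(0, 1)).view(b, t, -1)
        if self.temporal is not None:
            feats = self.temporal(feats)  # stabilize evidence across frames
        pooled = feats.mean(dim=1)
        return {
            "appearance": F.normalize(self.reid_head(pooled), dim=-1),
            "jnd": self.jnd_head(feats),  # kept per-frame: the JN classifier
                                          # aggregates per-detection features itself
            "team": self.team_head(pooled),
        }
```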
Path A is the migration entry point. Path B is the upgrade.
| Model 3: Tracker + Category Head | |
|---|---|
| Temporal scale | ~128 frames (local temporal) |
| Input | Embeddings from Model 2 (per-frame or per-micro-tracklet), (x,z) positions, optionally DETR embeddings from Model 1 |
| Output | Track IDs + team category |
| Replaces | SUSHI + Category Classification (2 models into 1) |
Why category fits in the tracker but jersey number doesn't:
Category classification already uses SUSHI's GNN building blocks (FeatureEncoder, MPNTrackConv). SUSHI produces per-node embeddings at every hierarchy level that it currently discards -- only edge logits are returned. Adding a category MLP head to the final depth level's node embeddings is a small change. The precomputed parquets already contain category columns that SUSHI currently ignores.
Jersey number classification uses a TrackTransformer (temporal aggregation of per-detection JND features via mean pooling + attention), not a GNN. JND features (133-dim: objectness + digit logits + whole-number logits) need per-detection temporal aggregation. Forcing this into a GNN framework would lose the inductive bias that makes it work.
Risks: Multi-task training (edge association + category classification) has no precedent in this codebase. Category loss gradients could degrade tracking HOTA. Mitigation: start with a stop-gradient option on the classification head's contribution to the GNN backbone.
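A sketch of the category head with that stop-gradient option (SUSHI internals are stubbed; node_embeddings stands for the final-depth per-node embeddings the GNN already computes and currently discards):

```python
import torch
import torch.nn as nn

class CategoryHead(nn.Module):
    def __init__(self, node_dim: int, num_categories: int, stop_gradient: bool = True):
        super().__init__()
        self.stop_gradient = stop_gradient
        self.mlp = nn.Sequential(
            nn.Linear(node_dim, node_dim), nn.ReLU(),
            nn.Linear(node_dim, num_categories),
        )

    def forward(self, node_embeddings: torch.Tensor) -> torch.Tensor:
        if self.stop_gradient:
            # Category loss cannot perturb the GNN backbone, so tracking HOTA
            # is protected while the multi-task setup is validated.
            node_embeddings = node_embeddings.detach()
        return self.mlp(node_embeddings)

# Joint objective sketch: with stop_gradient=True this reduces to the current
# tracking loss plus an independent classifier on frozen node embeddings.
# loss = edge_association_loss + lambda_cat * F.cross_entropy(logits, labels)
```

Flipping stop_gradient to False turns on true multi-task training, which is the riskier end-to-end variant.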
Tracker architecture candidates:
| Candidate | Basis | Notes |
|---|---|---|
| SUSHI + category head | Current codebase | Proven GNN, easiest migration, category classifier already uses SUSHI's blocks |
| MOTIP + category head | Current codebase | End-to-end, no heuristic matching |
| OVTR-style | Literature (ICLR 2025) | Joint tracking+classification with category propagation |
| PuTR-style | Literature (2024) | Pure Transformer, strong on SportsMOT |
| Model 4: JN Classifier | |
|---|---|
| Temporal scale | Per-track (aggregates across all detections in a track) |
| Input | JND features from Crop Encoder + track assignments from Tracker |
| Output | Jersey number per track (101 classes: 0-99 + unknown) |
| Architecture | Temporal transformer (largely as-is from current TrackTransformer) |
Lightweight model. Consumes track assignments (which detections belong to which track) and per-detection JND features, then aggregates via mean pooling + attention. Architecturally identical to the current JN classifier -- the only change is that JND features come from the Crop Encoder instead of a separate JND model.
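A sketch of that aggregation in the current TrackTransformer's spirit (layer sizes are illustrative; the padding mask marks slots without a real detection):

```python
import torch
import torch.nn as nn

class JNClassifier(nn.Module):
    def __init__(self, jnd_dim: int = 133, d_model: int = 128, num_classes: int = 101):
        super().__init__()
        self.proj = nn.Linear(jnd_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, num_classes)  # 0-99 + unknown

    def forward(self, jnd_feats: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # jnd_feats: (tracks, detections, 133); pad_mask: True where padded.
        x = self.encoder(self.proj(jnd_feats), src_key_padding_mask=pad_mask)
        x = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        # Mean-pool over the real detections of each track.
        pooled = x.sum(dim=1) / (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.head(pooled)
```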
| Model 5: Identity Resolver | |
|---|---|
| Temporal scale | Full match |
| Input | Track structure from Model 3, embeddings + spatial features from Model 2, player roster. (ReID + spatial proven in current Tracklet2ID; all inputs optional.) |
| Output | Global player IDs, refined team + jersey labels |
| Replaces | Tracklet2ID (largely as-is) |
| What changed | Retrain (current 7) | Retrain (proposed 5) |
|---|---|---|
| Detection/MOTIP | 5 (ReID, JND, SUSHI, Cat Cls, JN Cls) + Tracklet2ID | 2 (Crop Encoder, Tracker+Cat) + JN Cls + Identity Resolver |
| Crop Encoder | 4 (SUSHI, Cat Cls, JN Cls, Tracklet2ID) | 2 (Tracker+Cat, JN Cls) + Identity Resolver |
| Tracker approach | 2 (Cat Cls, JN Cls) | 1 (JN Cls) |
| JN Classification | 0 | 0 |
Why the five models can't be merged further:
- Model 1 vs 2: Different image spaces. Detector: radial_rainbowmelt panorama. Crop encoder: raw perspective crops. Cannot share a backbone.
- Model 2 vs 3: Different modalities. CNN on pixels vs GNN/Transformer on embedding sequences.
- Model 3 vs 4 (why not merge JN into tracker): Architectural mismatch. JN classification uses a temporal transformer aggregating per-detection JND features; the GNN-based tracker doesn't provide the right inductive bias, and the current TrackTransformer outperforms GNN-based JN classification.
- Models 3-4 vs 5: Different temporal scales. Models 3-4 are streaming (~128-frame chunks); Model 5 is offline (full match).
Each step is independently validatable against the current pipeline:
1. Multi-task crop model (Path A): Merge ReID + JND into one model with a shared backbone and three heads. Drop it into the current pipeline in place of ReID + JND. Validate via PTFE.
2. Category head on tracker (Model 3): Add a category classification MLP to SUSHI's final-depth node embeddings. Validate that tracking HOTA is preserved and category accuracy matches the standalone classifier.
3. MOTIP as detector (Model 1, Path B): Already in the codebase. Evaluate against the heatmap detector; replace it when it matches or exceeds.
4. Temporal crop encoding (Path B, optional): Add temporal attention to Model 2. Measure the improvement over Path A.
5. Identity Resolver (Model 5): Tracklet2ID continues as-is, consuming richer features from Models 2-3.
Steps 1-2 deliver the core consolidation. Steps 3-4 are independent enhancements.
The current infrastructure pain points -- monolithic precompute hash, implicit model compatibility, manual cascade -- become simpler with fewer models and fewer dependency edges.
Currently, the PlayerTrackingDataGenerator hashes all model IDs together, and nothing prevents deploying a SUSHI trained with PE-262 embeddings alongside a PE-300 ReID model. With 7 models and ~10 dependency edges, tracking compatibility is error-prone.
With 5 models and 4 dependency edges, provenance becomes tractable. Each model records which upstream models it was trained with. The ModelCollection in models.yaml (from the active AI-302/AI-303 migration) validates compatibility at promotion time. Fewer models means fewer compatibility relationships and fewer ways to get it wrong.
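An illustrative provenance check in that spirit (names are hypothetical, not the AI-302/AI-303 API):

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    ref: str                                          # e.g. "crop-encoder@PE-300"
    trained_with: dict = field(default_factory=dict)  # role -> upstream ref

def validate_collection(deployed: dict) -> list:
    """Flag any model deployed alongside a different upstream version than
    the one it was trained with."""
    errors = []
    for name, model in deployed.items():
        for role, expected_ref in model.trained_with.items():
            upstream = deployed.get(role)
            if upstream is not None and upstream.ref != expected_ref:
                errors.append(f"{name}: trained with {role}={expected_ref}, "
                              f"but {upstream.ref} is deployed")
    return errors
```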
The monolithic PlayerTrackingDataGenerator invalidates all artifacts when any model changes. With the proposed architecture, precompute splits naturally into two layers:
| Layer | Produces | Hash depends on | Invalidated by |
|---|---|---|---|
| Frame-level | Detections + crop encoder embeddings per frame | Detector ref + Crop Encoder ref | Detector or Crop Encoder change |
| Track-level | Track assignments + category labels | Frame-level hash + Tracker ref | Tracker change (or upstream) |
Changing the JN classifier or Identity Resolver requires zero precompute regeneration -- they consume existing embeddings and track assignments. Changing the tracker regenerates only track-level artifacts. This aligns with the existing lab.datasets BaseArtifactGenerator framework (two generators with content-addressable hashing). The existing AI-244 issue covers the implementation path.
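A sketch of the two-layer hashing (function names are hypothetical; the real implementation would live in the lab.datasets BaseArtifactGenerator framework):

```python
import hashlib
import json

def content_hash(**inputs) -> str:
    # Content-addressable: identical inputs always map to the same key.
    blob = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

frame_hash = content_hash(detector="detector@v7", crop_encoder="crop-encoder@v3")
track_hash = content_hash(frame_level=frame_hash, tracker="tracker-cat@v2")

# Changing the JN classifier or Identity Resolver touches neither hash, so no
# precompute regenerates; changing the tracker invalidates only track_hash.
```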
Current cascade for a ReID change: precompute all, retrain SUSHI, retrain Cat Cls, retrain JN Cls, run PTFE -- four sequential manual steps after the precompute.
Proposed cascade for a Crop Encoder change: precompute frame-level, retrain Tracker+Cat, retrain JN Cls, run PTFE -- three steps after the precompute, and JN Cls can train in parallel with PTFE (no dependency between them). The DAG is small enough that a shell script suffices; a full Metaflow orchestration flow is likely YAGNI until model upgrades become more frequent.
Open questions:
- Tracker architecture for Model 3: SUSHI vs MOTIP vs OVTR vs PuTR
- Multi-task training: does adding a category head to SUSHI degrade tracking HOTA?
- Stop-gradient vs end-to-end: should classification gradients flow through the GNN backbone?
- Path A vs Path B: does temporal aggregation improve over frame-level embeddings?
- Micro-tracklet window size (5 vs 10 vs 15 frames)
- Whether DETR coarse embeddings help the tracker alongside crop embeddings
References:
- MOTIP (Gao et al., CVPR 2025): Tracking as ID prediction. In codebase, adapted for keypoints. 82 citations.
- OVTR (Li et al., ICLR 2025): End-to-end open-vocabulary tracker modeling motion, appearance, and category simultaneously.
- PuTR (Liu et al., 2024): Pure Transformer for separated online MOT. Strong on SportsMOT.
- Koshkina et al. (WACV 2025): Jersey number + team ID features improve long-term sports player tracking.
- Hybrid-SORT (Yang et al., AAAI 2024): Training-free weak-cue enhancement, plug-and-play on any detector.