Fractional GPU Test - G6f
Slack ML runs GPU inference workloads (Slack AI, Slackbot, semantic search) that currently require full NVIDIA L4 GPUs via g6 instance types. Many of these models only need 1/4 or 1/8 of available GPU RAM, meaning we are significantly over-provisioning.
AWS g6f instances offer fractional NVIDIA L4 GPUs at substantially lower price points. Evaluating these instances is a V2MOM priority for Moonlet Inference cost savings.
g6f instances provide fractional slices of NVIDIA L4 GPUs without requiring MIG (Multi-Instance GPU), which is only available on P4/P5 instance types. Each g6f instance exposes a GPU partition as a single logical GPU to the operating system.
| Instance Type | vCPUs | RAM (GiB) | GPU Partition | GPU Memory (MiB) | On-Demand $/hr |
|---|---|---|---|---|---|
| g6f.large | 2 | 8 | 1/8 | ~2,861 | $0.20 |
| g6f.xlarge | 4 | 16 | 1/8 | ~2,861 | $0.40 |
| g6f.2xlarge | 8 | 32 | 1/4 | ~5,722 | $0.77 |
| g6f.4xlarge | 16 | 64 | 1/2 | ~11,444 | $1.53 |
For comparison, the smallest full-GPU option:
| Instance Type | vCPUs | RAM (GiB) | GPUs | GPU Memory (MiB) | On-Demand $/hr |
|---|---|---|---|---|---|
| g6.xlarge | 4 | 16 | 1 | 22,888 | $0.80 |
If workloads that currently use g6.xlarge ($0.80/hr) can run on g6f.large ($0.20/hr), that represents a 75% cost reduction per instance. Even moving to g6f.2xlarge ($0.77/hr) for workloads needing ~5.7 GiB of GPU memory saves cost while right-sizing GPU memory allocation, freeing full L4 GPUs for workloads that actually need them.
flowchart TD
TF["Terraform<br/>provisioner_instance_families<br/>= [G6F_ENABLED, ...]"] --> AA["ArgoCD ApplicationSet<br/>reads cluster annotations"]
AA --> HV["Helm Values<br/>customdata.instanceFamilyLabels<br/>= JSON array"]
HV --> NP["NodePool<br/>karpenter.k8s.aws/instance-family<br/>In: [g6f]"]
HV --> NO["NodeOverlay<br/>nvidia.com/gpu: 1<br/>(EC2 API reports 0)"]
NP --> KS["Karpenter Scheduler"]
NO --> KS
KS --> EC2["EC2 g6f Instance<br/>NVIDIA L4 fractional GPU<br/>GRID driver"]
POD["Pod<br/>nvidia.com/gpu: 1<br/>instanceFamily: g6f"] --> KS
style NO fill:#f9f,stroke:#333
style EC2 fill:#bfb,stroke:#333
flowchart TD
START["Model needs GPU"] --> Q1{"Need specific<br/>GPU memory tier?"}
Q1 -->|No| Q2{"Need specific<br/>instance family?"}
Q1 -->|Yes| IT["Use instanceType<br/>e.g. g6f.2xlarge"]
Q2 -->|No| GN["Use gpuName: l4<br/>Karpenter picks best fit"]
Q2 -->|Yes, fractional| IF_G6F["Use instanceFamily: g6f"]
Q2 -->|Yes, full GPU| IF_G6["Use instanceFamily: g6"]
GN -.->|WARNING| WARN["If g6 + g6f both in NodePool,<br/>pod may land on either.<br/>g6f.large has only 2.8 GiB GPU RAM"]
style WARN fill:#fbb,stroke:#333
style IT fill:#bbf,stroke:#333
style IF_G6F fill:#bfb,stroke:#333
The EC2 DescribeInstanceTypes API returns GPU Count: 0 for g6f instances (it uses a new LogicalGpuCount field instead). This means Karpenter cannot natively:
- Provision g6f instances for workloads requesting `nvidia.com/gpu`
- Correctly account for GPU resources in NodePool limits
See: aws/karpenter-provider-aws#8368
Karpenter v1.7+ supports NodeOverlay, an alpha feature that allows overriding instance type capabilities. This lets us manually declare that g6f instances have 1 GPU.
Requires the NodeOverlay feature gate enabled in Karpenter settings.
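For reference, a minimal sketch of enabling the gate via the Karpenter Helm values — the exact key (`settings.featureGates.nodeOverlay`) is an assumption here; confirm against the Karpenter Feature Gates doc linked in the references:

```yaml
# Hedged sketch — assumes the Karpenter chart exposes the gate under settings.featureGates;
# verify the exact key name for your Karpenter version before applying.
settings:
  featureGates:
    nodeOverlay: true   # enables the alpha NodeOverlay API (karpenter.sh/v1alpha1)
```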
| Cluster | Karpenter | NodeOverlay Feature Gate | NodeOverlay Applied |
|---|---|---|---|
| ml-dev-cmh | v1.9.0 | Enabled | Yes (from Phase 1) |
| ml-dev-iad | v1.9.0 | Enabled | Yes (applied 2026-04-08) |
| ml-dev-pdx | v1.9.0 | Enabled | Yes (applied 2026-04-13) |
Status: DONE (ml-dev-cmh, ml-dev-iad, and ml-dev-pdx)
Status: DONE (all three clusters)
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: g6f-gpu-overlay
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ['g6f']
  capacity:
    nvidia.com/gpu: "1"

Status: DONE — datacenter driver failed. See original testing notes below.
Both NVIDIA 560.35.05 and 570.172.08 datacenter drivers failed on g6f because PCI device ID 10de:27b8 is not in the datacenter driver's supported device list. AWS documentation confirms g6f only supports GRID drivers.
Status: DONE — PR #888 merged 2026-04-06
PR #888 added NVIDIA GRID driver support to the nvidia-gpu cookbook:
- `driver-type` attribute (`datacenter` vs `grid`) controls which driver is installed
- GRID driver: `.run` installer from `s3://ec2-linux-nvidia-drivers/grid-18.4/` with SHA-256 checksum verification
- `upgrade_envs = %w(dev)` — dev environment gets GRID, all others stay on datacenter
- Tested via `ship quick` on GPU longshoreman ASG (bake + provision both passed, 48/48 ServerSpec)
- `nvidia-smi` confirmed: `Driver Version: 570.172.08`, `NVIDIA L4` GPU
Status: DONE — Phase 1 fully complete
PR #969 merged 2026-04-13 — adds g6f to ip_per_eni mappings, also fixes g6e values.
| Instance Type | Max ENIs | IPv4/ENI | Calculated max_pods |
|---|---|---|---|
| g6f.large | 2 | 10 | 18 |
| g6f.xlarge | 4 | 15 | 23 |
| g6f.2xlarge | 4 | 15 | 23 |
| g6f.4xlarge | 8 | 30 | 38 |
Tested on ml-dev-iad with a g6f.2xlarge node (ip-10-56-126-0.ec2.internal).
| Check | Result |
|---|---|
| NodeOverlay applied | Ready=True, ValidationSucceeded=True |
| g6f.2xlarge node launch | Karpenter created NodeClaim, EC2 instance launched |
| `nvidia.com/gpu` Capacity | 1 |
| `nvidia.com/gpu` Allocatable | 1 (previously 0 before pod limit fix) |
| Allocatable pods | 23 (previously ~11 before pod limit fix) |
| driver-validation (init container) | PASSED |
| toolkit-validation (init container) | PASSED |
| cuda-validation (init container) | PASSED |
| plugin-validation (init container) | PASSED |
| nvidia-operator-validator | "all validations are successful" |
| nvidia-cuda-validator | "cuda workload validation is successful" |
| ML workload pods scheduling | YES — moonlet-ai-search-summary-messages and moonlet-reflector running on g6f node |
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4-6Q On | 00000000:31:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
- GPU: `NVIDIA L4-6Q` — the "6Q" indicates a 6 GiB GRID/vGPU partition of the L4
- GPU Memory: `6144 MiB` (6 GiB) — matches g6f.2xlarge 1/4 partition spec
- Driver: `570.172.08` (GRID)
- CUDA: `12.8`
Phase 1 complete. Phase 2 is now unblocked.
- Re-validate g6f with correct pod limits (after PR #969 merge + AMI bake) — DONE 2026-04-14
- Confirm `nvidia-smi` shows fractional GPU memory from a workload pod — DONE: NVIDIA L4-6Q, 6144 MiB
- Identify candidate models/workloads that fit within g6f GPU memory tiers — DONE 2026-04-14 (see below)
- Add `instanceFamily` node affinity to moonlet chart + mlctl — DONE 2026-04-14 (data-airflow#74317)
- Fix and validate load test tooling for A/B comparison — DONE 2026-04-16 (data-airflow#74389), see Load Test Tooling Validation below
- A/B load test: deploy same model on g6 vs g6f, compare latency — DONE 2026-04-20 for `gte_multilingual_base_query_v1` and `msmarco_tinybert_crossencoder_v1` (see results)
- Validate inference accuracy matches full-GPU baseline
- Test GPU memory pressure / OOM behavior at partition boundaries
Sampled nvidia-smi across multiple full g6 nodes on ml-dev-iad. All current GPU workloads use well under 2 GiB of GPU memory — a fraction of the 23 GiB available on full L4 GPUs.
| Node | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|---|
| ip-10-56-126-93 | 358 MiB | 550 MiB | 1,056 MiB | 1,056 MiB |
| ip-10-55-125-145 | 0 MiB | 1,724 MiB | 0 MiB | 926 MiB |
| ip-10-57-127-57 | 1,056 MiB | 604 MiB | 0 MiB | 0 MiB |
| ip-10-56-252-162 | 1,416 MiB | — | — | — |
| ip-10-57-126-2 | 1,080 MiB | 0 MiB | 0 MiB | 0 MiB |
Peak observed: 1,724 MiB (finetuned-gte-multilingual model) — this fits in the smallest g6f tier (g6f.large at 2,861 MiB) with 40% headroom.
| Model | GPU Memory (observed) | Fits in g6f.large? |
|---|---|---|
| finetuned-gte-multilingual | ~1,416–1,724 MiB | Yes (2,861 MiB available) |
| gte-multilingual-base variants | ~1,056 MiB | Yes |
| bert-* models | ~550–926 MiB | Yes |
| mmarco-roberta-crossencoder | ~550 MiB | Yes |
| msmarco-minilml6-embedder | ~358 MiB | Yes |
| msmarco-tinybert-crossencoder | ~358 MiB | Yes |
Conclusion: Every current ML inference workload could run on g6f.large ($0.20/hr) instead of g6.xlarge ($0.80/hr) — a 75% cost reduction per GPU instance.
Both g6 and g6f nodes have karpenter.k8s.aws/instance-gpu-name=l4 (the EC2 API reports the GPU as "L4" for both). So the existing gpuName: "l4" cannot distinguish them. data-airflow#74317 adds instanceFamily as a new node affinity field targeting karpenter.k8s.aws/instance-family. It is mutually exclusive with gpuName — the chart fails with a clear error if both are set.
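For illustration, a values sketch of the new field — the nesting under `resources` mirrors the `minGpuMemoryMiB` example later in this doc and is an assumption here, as is the deployment name:

```yaml
moonletDeployments:
  example_model_v1:            # hypothetical deployment name
    resources:
      instanceFamily: "g6f"    # targets karpenter.k8s.aws/instance-family; mutually exclusive with gpuName
```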
data-airflow#74389 fixed several issues in the load test tooling that blocked A/B testing. Validated end-to-end on ml-dev-iad via three branch-deployed builds.
- XDS cluster name had per-suffix identity: `_worker_annotations` appended the suffix to the XDS cluster name (e.g. `moonlet-load-test-flat-test`). This identity was never registered with envoy, so workers had no upstream routes — resulting in 0 RPS despite successful pod scheduling. Fixed by hardcoding `moonlet-load-test`.
- LoadTestShape auto-activation: Locust auto-discovers `LoadTestShape` subclasses in the loaded file and uses the first one found, overriding `--users`. This meant flat-rate tests always followed the `Spike` pattern. Fixed by adding `--class-picker` to the locust command, which enables a web UI dropdown for selecting "Default" (flat-rate) or a specific shape (Spike, Soak, Breakpoint).
- Cumulative stage duration bug in `Spike.tick()`: the `tick()` method compared `run_time` directly against each stage's `duration` field instead of accumulating elapsed time. This meant the ramp-down stage (duration=30s) was unreachable after the peak stage (duration=120s). Fixed by accumulating elapsed time across stages.
# 1. Build the ml-locust image from the PR branch
gondolier build ml-locust --branch xjohnson-loadtest-flat-mode
# 2. Deploy the build to dev
# (build IDs from gondolier output)
gondolier deploy ml-locust --build <build-id> --stage dev
# 3. Run a flat-rate load test (uses new image via --image)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 \
--spawn-rate 5 \
--run-time 5m \
--suffix flat-v2 \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:6b1507bfaeab926f4bc64f29201b43df607aff9093291da56b8365ca96045cc8"
# 4. Check test status
mlctl load-test status <test-name>
# 5. Get results
mlctl load-test report <test-name>
# 6. Inspect stats directly from master pod
kubectl -n locust-system --context ml-dev-iad \
exec <master-pod> -c <master-container> -- \
curl -s http://localhost:8089/stats/requests
# 7. Clean up
mlctl load-test delete <test-name>

Test 1: XDS routing fix (build 019d9710-5bbf, deployment 019d9717-05de)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 --spawn-rate 5 --run-time 5m \
--suffix flat-test \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:..."
| Endpoint | Requests | Failures | Avg Latency | p95 | p99 | RPS |
|---|---|---|---|---|---|---|
| /grpc.health.v1.Health/Check | 163,972 | 0 | 4ms | 7ms | 10ms | ~561 |
| /slack.moonlet.MoonletService/Predict | 52,556 | 0 | 42ms | 59ms | 98ms | ~176 |
| Aggregated | 216,528 | 0 (0.0%) | 14ms | 47ms | 62ms | 701.9 |
Confirmed: 20 users (flat-rate), 3 workers, XDS cluster moonlet-load-test received full route table.
Test 2: Flat-rate mode (build 019d9780-af6f, deployment 019d978a-ede2)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 --spawn-rate 5 --run-time 5m \
--suffix flat-v2 \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:6b1507bfaeab926f4bc64f29201b43df607aff9093291da56b8365ca96045cc8"
| Endpoint | Requests | Failures | Avg Latency | p95 | p99 | RPS |
|---|---|---|---|---|---|---|
| /grpc.health.v1.Health/Check | 164,671 | 0 | 4.3ms | 7ms | 10ms | ~560 |
| /slack.moonlet.MoonletService/Predict | 52,083 | 0 | 42.6ms | 59ms | 99ms | ~176 |
| Aggregated | 216,754 | 0 (0.0%) | 13.5ms | 47ms | 62ms | 736.3 |
Confirmed: 20 users (flat-rate respected, no shape auto-activation), 3 workers.
Test 3: Shape picker UI (build 019d97af-a903, deployment 019d97bd-32ce)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 --spawn-rate 5 --run-time 3m \
--suffix flat-v3 \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:4ddd479aa050a7501c39aae68e6a47c9b703d200f50c6587c27120b09d3f6af0"
Verified via curl http://localhost:8089/ on the master pod:
{
"available_shape_classes": ["Default", "Breakpoint", "Soak", "Spike"],
"show_userclass_picker": true,
"available_user_classes": ["BasicUser", "HealthChecker"]
}

Users can select "Default" for flat-rate or a specific shape from the web UI dropdown.
- 0 failures across 400k+ total requests
- Flat-rate mode works (20 users held constant, no spike)
- Shape picker dropdown functional in web UI
- XDS routing stable with base cluster name
- Consistent ~736 RPS throughput across test runs
- Load test tooling is ready for the g6 vs g6f A/B comparison
Once data-airflow#74317 is merged, run the following to compare g6 vs g6f latency for a representative model:
# 1. Deploy the model pinned to g6 (full L4)
mlctl deploy moonlet \
--deployment gte_multilingual_base_query_v1 \
--cluster ml-dev-iad \
--suffix g6 \
--instance-family g6
# 2. Deploy the same model pinned to g6f (fractional L4)
mlctl deploy moonlet \
--deployment gte_multilingual_base_query_v1 \
--cluster ml-dev-iad \
--suffix g6f \
--instance-family g6f
# 3. Run 15-minute load test against g6 deployment
mlctl load-test start \
--model-name gte_multilingual_base_query_v1 \
--suffix g6 \
--run-time 15m \
--users 100 \
--spawn-rate 1
# 4. Run 15-minute load test against g6f deployment
mlctl load-test start \
--model-name gte_multilingual_base_query_v1 \
--suffix g6f \
--run-time 15m \
--users 100 \
--spawn-rate 1
# 5. Compare p50/p95/p99 latency from Locust results
# 6. Clean up
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix g6 --mode cleanup
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix g6f --mode cleanup

Success criteria: g6f p99 latency within 10% of g6 baseline for the same model and load profile.
Attempted a direct g6 vs g6f comparison for msmarco_tinybert_crossencoder_v1 and msmarco_minilml6_embedder_v0. The g6 baseline ran successfully; the g6f side hit multiple infrastructure issues.
msmarco_tinybert_crossencoder_v1 (1 replica, g6 node):
| Metric | At 100 users | At 500 users |
|---|---|---|
| Predict avg latency | ~1,011 ms | 3,164 ms |
| Predict failure rate | 0% | 96% (UNAVAILABLE — upstream timeout) |
| Health check avg | 17 ms | 902 ms |
| Total RPS | ~211 | ~103 |
| Total requests | — | 27,418 |
msmarco_minilml6_embedder_v0 (2 replicas, g6 nodes):
| Metric | At 100 users | At 500 users |
|---|---|---|
| Predict avg latency | ~62 ms | 1,097 ms |
| Predict failure rate | 0% | 0% |
| Health check avg | 3 ms | 372 ms |
| Total RPS | ~573 | ~237 |
| Total requests | — | 66,958 |
Multiple attempts to deploy g6f-pinned copies failed due to compounding infrastructure issues:
1. `minGpuMemoryMiB` and `--instance-type` CLI flags not yet on master — PR #74317 hasn't merged, so `mlctl deploy moonlet` doesn't support `--instance-type` or `--min-gpu-memory`. Had to use raw `helm upgrade --set` commands to pass `instanceType=g6f.2xlarge`.
2. Initial deploy targeted `g6f.xlarge`, but no g6f.xlarge nodes exist — the dev cluster only has g6f.2xlarge nodes, and Karpenter couldn't provision new g6f.xlarge nodes due to the `nvidia.com/gpu: "0"` NodePool label conflict (see #3).
3. `nvidia.com/gpu: "0"` NodePool label blocks new g6f provisioning — the NodePool template sets `nvidia.com/gpu: "0"` as a label on all nodes. EC2 reports g6f GPU count as 0, so Karpenter thinks g6f.xlarge has `nvidia.com/gpu=0`, which conflicts with pod requests of `nvidia.com/gpu: 1`. Existing g6f nodes work (NodeOverlay + device plugin expose the real GPU), but Karpenter can't provision new ones.
4. Existing g6f nodes at capacity — all 3 existing g6f.2xlarge nodes had their single GPU fully allocated. Combined with #3 preventing new node provisioning, g6f pods stayed Pending.
5. Branch confusion — the chart changes (`instanceType` affinity, g6f anti-affinity) live on the `xjohnson-add-instance-family-affinity` branch, and the `mlctl` deploy flags are only available with that branch's code. Running `mlctl` from the wrong branch silently deployed without affinity rules.
The infrastructure blockers from 2026-04-17 were resolved by merging data-airflow#74317 and data-airflow#74489. See the successful A/B comparison below.
Successful direct comparison of gte_multilingual_base_query_v1 on g6 (full L4) vs g6f (fractional L4). Both deployed as suffixed copies with --instance-type affinity, verified on correct node types.
Test configuration: 100 concurrent users, 10/s spawn rate, 10m run time, 3 workers, 1 replica per variant.
Node placement verified:
- g6 copy → `g6.2xlarge` node (ip-10-57-124-136)
- g6f copy → `g6f.2xlarge` node (ip-10-57-126-106)
| Metric | g6 (full L4, 22.8 GiB VRAM) | g6f (1/4 L4, 5.7 GiB VRAM) | Delta |
|---|---|---|---|
| Avg Predict latency | 168 ms | 352 ms | +109% (2.1x slower) |
| Throughput (RPS) | 591 | 271 | -54% |
| Failure rate | 0% | 0% | identical |
| Total requests (10m) | 344,048 | 167,961 | — |
- Latency: g6f is ~2.1x slower than g6 for this model. This is expected — g6f.2xlarge provides a 1/4 partition of the L4 GPU, so compute throughput is proportionally lower.
- Reliability: Both instance types handled 100 concurrent users with zero failures. g6f is fully functional for this workload.
- Cost-adjusted throughput: g6.2xlarge costs ~$1.52/hr vs g6f.2xlarge at ~$0.77/hr (2x cheaper). At 2.1x slower, the cost-per-request is roughly equivalent. However, for latency-sensitive workloads at this scale, g6 is better. For cost-sensitive workloads with lower concurrency or less latency sensitivity, g6f provides identical reliability at half the price.
- Right-sizing opportunity: The gte model uses ~1,056 MiB GPU memory — well within g6f.large (2,861 MiB, $0.20/hr) or g6f.xlarge (2,861 MiB, $0.40/hr). Testing on g6f.large would show whether the smaller vCPU/RAM is the bottleneck, or if the GPU partition size is the limiting factor.
# Deploy
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6f --instance-type g6f.2xlarge
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6
# Load tests
mlctl load-test start --model-name gte_multilingual_base_query_v1 --suffix lt-g6f --name gte-g6f --users 100 --spawn-rate 10 --run-time 10m
mlctl load-test start --model-name gte_multilingual_base_query_v1 --suffix lt-g6 --name gte-g6 --users 100 --spawn-rate 10 --run-time 10m
# Reports
mlctl load-test report gte-g6f
mlctl load-test report gte-g6
# Cleanup
mlctl load-test delete gte-g6f && mlctl load-test delete gte-g6
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6f --mode cleanup --force-cleanup
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6 --mode cleanup --force-cleanup

Direct comparison of msmarco_tinybert_crossencoder_v1 (one of the smallest moonlet models, ~358 MiB GPU memory) on g6.2xlarge (full L4) vs g6f.2xlarge (1/4 L4).
Why tinybert? The gte comparison above showed a 2.1x slowdown. Tinybert is a much smaller model, so the hypothesis was that the latency delta would be narrower since compute isn't the bottleneck.
Test configuration: 100 concurrent users, 5/s spawn rate, 10m run time, 3 workers, 1 replica per variant. Used a custom locustfile that connects directly to k8s service DNS (bypassing envoy) because suffixed deployments register under different envoy cluster names.
Node placement verified:
- g6 copy → `g6.2xlarge` node (ip-10-55-252-253)
- g6f copy → `g6f.2xlarge` node (ip-10-55-124-120)
| Metric | g6.2xlarge (full L4) | g6f.2xlarge (1/4 L4) | Delta |
|---|---|---|---|
| Total requests (10m) | 1,689 | 1,103 | -35% |
| Throughput (RPS) | 2.82 | 1.84 | -35% |
| Avg Predict latency | 33,903 ms | 51,042 ms | +51% |
| Min latency (best proxy for raw inference) | 873 ms | 1,461 ms | +67% |
| Median (p50) | 35,000 ms | 54,000 ms | +54% |
| p95 | 36,000 ms | 55,000 ms | +53% |
| p99 | 36,000 ms | 55,000 ms | +53% |
| Failure rate | 0% | 0% | identical |
- High absolute latencies are queueing artifacts: 100 concurrent users hitting a single GPU creates massive contention. The median/p95/p99 are nearly identical because the queue is always saturated. Min latency is the best measure of raw inference speed.
- Min latency: g6f is ~67% slower (873ms → 1,461ms). Even for a tiny model (~358 MiB VRAM), the fractional GPU partition imposes a significant compute penalty.
- Throughput: g6f delivers 35% fewer requests in the same time period under identical load.
- Zero failures: Both instance types handle the load without errors or OOM.
- Comparison with gte results: gte showed a 2.1x slowdown, tinybert shows a 1.67x slowdown. The hypothesis was partially confirmed — smaller models see a narrower gap, but it's still substantial.
| Instance | $/hr (on-demand) | Min latency | Throughput | Cost per 1K requests |
|---|---|---|---|---|
| g6.2xlarge | ~$1.52 | 873ms | 2.82 RPS | ~$0.150 |
| g6f.2xlarge | $0.77 | 1,461ms | 1.84 RPS | ~$0.116 |
For latency-insensitive workloads, g6f.2xlarge is roughly 23% cheaper per request at these on-demand prices. For latency-sensitive workloads, the 67% min-latency penalty may exceed p99 SLA budgets.
Could not provision g6f.4xlarge nodes due to the nvidia.com/gpu: "0" NodePool label issue (see Blockers). Karpenter refuses to provision nodes when the NodePool template declares nvidia.com/gpu: "0" but the pod requests nvidia.com/gpu: 1. The g6f.4xlarge test would show the performance of a 1/2 L4 partition, which we expect to be ~30-40% slower than full L4 (narrower gap than the 1/4 partition).
The baked-in locustfile uses envoy (localhost:9001) with model routing via x-slack-moonlet-service metadata. Suffixed deployments (--suffix lt-g6) register under different envoy cluster names, so the standard locustfile cannot route to them. A custom locustfile connecting directly to k8s service DNS was required:
mlctl load-test start \
--locustfile /tmp/locustfile-tinybert.py \
--host moonlet-msmarco-tinybert-crossencoder-v1-lt-g6.default.svc.cluster.local:8000 \
--name tinybert-g6 \
--image $LOCUST_ML_IMAGE \
--users 100 --spawn-rate 5 --run-time 10m

Bug found: The custom locustfile command is missing the `locust` prefix (line 364 of loadtest.py), causing exec format error when using non-default images like ml-base-dev. Tracked for fix in a follow-up PR.
The g6 vs g6f comparison should be a single command. Today it requires ~15 manual steps across 3 tools (mlctl, helm, kubectl) with multiple failure modes. Here's what needs to change:
Add a new mlctl benchmark subcommand that automates the entire A/B flow:
mlctl benchmark \
--deployment msmarco_tinybert_crossencoder_v1 \
--variants "g6=default" "g6f=--instance-type g6f.2xlarge" \
--users 100 --spawn-rate 10 --run-time 10m

This should:
- Deploy suffixed copies for each variant (e.g. `lt-g6`, `lt-g6f`)
- Wait for all pods to be Ready
- Run load tests sequentially or in parallel
- Collect reports into a single comparison table
- Clean up all deployments and load tests on completion (or Ctrl-C)
- Output a markdown table suitable for pasting into a PR
These flags exist on the PR branch but not on master. Without them, deploying to specific instance types requires raw helm --set commands with knowledge of chart internals (values file paths, image URIs, set-key syntax). This is the #1 source of friction.
PR: data-airflow#74317 — needs merge.
Tests started with --run-time 10m ran for 20+ minutes. The --class-picker flag combined with autostart: true may cause locust to ignore --run-time. The LocustTest operator's autoquit should respect --run-time, but it didn't.
Fixed in data-airflow#74543: Added --headless to extraArgs when autostart=True, which ensures Locust enforces --run-time.
mlctl load-test stop tries to set spec.worker.replicas: 0, which fails CRD validation (minimum: 1). There's no way to gracefully stop a running test short of deleting it.
Fixed in data-airflow#74543: Now uses kubectl exec curl -X POST /stop via the Locust master pod's REST API.
The current report only shows avg latency. For A/B comparison, percentile latencies (p50, p95, p99) are essential. The Locust /stats/requests endpoint has this data.
Fixed in data-airflow#74543: Report now shows p50/p95/p99 columns. Also added mlctl load-test compare subcommand for side-by-side comparison:
mlctl load-test compare test-g6 test-g6f --labels g6 g6f

Today there's no validation that the target instance type has available capacity or that Karpenter can provision it. This leads to silent Pending pods.
Fix needed: Add a --dry-run or pre-flight check to mlctl deploy moonlet that:
- Checks if matching nodes exist with available GPU resources
- Checks if Karpenter NodePool allows the target instance family
- Warns if the NodePool has conflicting labels (e.g. `nvidia.com/gpu: "0"`)
The NodeOverlay was previously applied manually via kubectl apply. To formalize this:
- `G6F_ENABLED` in `provisioner_instance_families` already adds `g6f` to NodePool allowed families (via `_helpers.tpl` in bedrock-argocd)
- But g6f doesn't work without a NodeOverlay (EC2 API reports 0 GPUs)
- bedrock-argocd#440 couples the NodeOverlay to `G6F_ENABLED` — when g6f is enabled for a cluster, the overlay is automatically created
- This prevents the footgun of enabling g6f without the overlay
- bedrock-argocd#440 merged — NodeOverlay auto-created when G6F_ENABLED
- bedrock-argocd#446 merged — Remove `nvidia.com/gpu: "0"` from NodePool templates
- data-airflow#74389 merged — Locust load test fixes
- Load test tooling validated — 216K requests, 0 failures, 702 avg RPS
- data-airflow#74317 merged — instanceType affinity + minGpuMemoryMiB + g6f anti-affinity (merged 2026-04-17)
- Capture baseline cost metrics — run PromQL queries below before any migration
- Dev Wave 1: Migrate 2 smallest models (`msmarco_tinybert_crossencoder_v1`, `msmarco_minilml6_embedder_v0`) via `minGpuMemoryMiB: 500`
- Dev Wave 2: Migrate shmVolume models (`gte_multilingual_base_v1`, `gte_multilingual_base_file_v1`)
- GRID driver enabled in prod — shipyard-chef-repo PR needed
- Enable g6f in prod-iad-2 (canary) — bedrock-argocd PR
- Prod Wave 3: Tier 3 models (2 lowest-risk)
- Prod Wave 4: Tier 2 models (one at a time, 24h bake)
- Prod Wave 5: Tier 1 models (after 2+ weeks stability)
- Update Grafana dashboard — add g6f cost comparison panels
All currently on g6.xlarge ($0.80/hr on-demand). Migrating to g6f.large ($0.20/hr) = 75% savings.
| Model | Prod minReplicas | Tier | Current $/hr | g6f $/hr | Savings $/hr |
|---|---|---|---|---|---|
| msmarco_tinybert_crossencoder_v1 | 10 | 2 | $8.00 | $2.00 | $6.00 |
| mmarco_roberta_crossencoder_v2 | 10 | 2 | $8.00 | $2.00 | $6.00 |
| gte_multilingual_base_v1 | 5 | 1 | $4.00 | $1.00 | $3.00 |
| gte_multilingual_base_file_v1 | 5 | 1 | $4.00 | $1.00 | $3.00 |
| finetuned_gte_multilingual_v0 | 5 | 2 | $4.00 | $1.00 | $3.00 |
| bert_action_items_classifier_v0 | 2 | 2 | $1.60 | $0.40 | $1.20 |
| msmarco_minilml6_embedder_v0 | 2 | 3 | $1.60 | $0.40 | $1.20 |
| gte_multilingual_base_query_v1 | 2 | 2 | $1.60 | $0.40 | $1.20 |
| bert_semantic_priority_proxy_v1 | 2 | 2 | $1.60 | $0.40 | $1.20 |
| bert_display_explain_cta_v0 | 2 | 3 | $1.60 | $0.40 | $1.20 |
| TOTAL | 45 | — | $36.00 | $9.00 | $27.00 |
Maximum annual savings ceiling: $27/hr x 8,760 hrs = ~$236K/yr (minReplicas only — actual savings higher with HPA scaling)
Note: gte models use shmVolume: true and request 3-4 CPUs — may need g6f.xlarge ($0.40/hr) instead of g6f.large, reducing those to 50% savings. Realistic estimate: **$180K-$200K/yr**.
flowchart LR
subgraph "Dev Validation"
D1["Wave 1: 2 smallest<br/>tinybert + minilml6<br/>no shmVolume"]
D1 -->|1 week soak| D2["Wave 2: shmVolume<br/>gte_base_v1<br/>gte_base_file_v1"]
end
subgraph "Prod Rollout"
D2 -->|stable| GRID["Enable GRID<br/>driver in prod"]
GRID --> P1["Enable g6f NodePool<br/>prod-iad-2 canary"]
P1 --> P3["Wave 3: Tier 3<br/>minilml6, bert_display<br/>2 replicas each"]
P3 -->|1 week| P4["Wave 4: Tier 2<br/>tinybert, roberta, etc<br/>one at a time, 24h bake"]
P4 -->|2 weeks| P5["Wave 5: Tier 1<br/>gte models<br/>highest criticality"]
end
style D1 fill:#bfb,stroke:#333
style P3 fill:#bbf,stroke:#333
style P5 fill:#fbf,stroke:#333
| Blocker | Status | PR/Action |
|---|---|---|
| NodeOverlay GitOps | DONE | bedrock-argocd#440 — merged |
| `nvidia.com/gpu: "0"` NodePool label fix | DONE | bedrock-argocd#446 — merged |
| Load test tooling | DONE | data-airflow#74389 — merged |
| instanceType affinity + g6f anti-affinity | Merged | data-airflow#74317 — merged 2026-04-17 |
| Dev model migration | Open (CI green, needs review + possible rebase) | data-airflow#74336 |
| g6f in prod NodePool | Open (blocked) | bedrock-argocd#443 — blocked on GRID driver |
| GRID driver in prod | BLOCKER | Need shipyard-chef-repo PR to add prod to upgrade_envs |
# 1. GPU node count by instance type
count by (label_node_kubernetes_io_instance_type) (
kube_node_labels{label_type="karpenter", label_node_kubernetes_io_instance_type=~"g6.*"}
)
# 2. Per-model pod → instance type mapping
count by (label_node_kubernetes_io_instance_type, label_slack_com_moonlet_model) (
kube_pod_info{namespace="default", pod=~"moonlet-.*"}
* on(node) group_left(label_node_kubernetes_io_instance_type)
kube_node_labels{}
)
# 3. GPU memory usage per model (DCGM)
avg by (exported_pod) (DCGM_FI_DEV_FB_USED{namespace="default", exported_pod=~"moonlet-.*"})
# 4. GPU memory headroom per model
avg by (exported_pod) (DCGM_FI_DEV_FB_FREE{namespace="default", exported_pod=~"moonlet-.*"})
# 5. Latency baseline (p99) per model
histogram_quantile(0.99, sum by (le, model_name) (
rate(grpc_server_handling_seconds_bucket{grpc_method="Predict"}[5m])
))
# 6. Estimated hourly cost (g6 vs g6f node counts)
(count(kube_node_labels{label_karpenter_k8s_aws_instance_family="g6"}) * 0.80)
+
(count(kube_node_labels{label_karpenter_k8s_aws_instance_family="g6f"}) * 0.20)
# 7. Pod distribution: g6 vs g6f per model
count by (label_karpenter_k8s_aws_instance_family, label_slack_com_moonlet_model) (
kube_pod_info{pod=~"moonlet-.*"}
* on(node) group_left(label_karpenter_k8s_aws_instance_family)
kube_node_labels{}
)
New row — "g6f Cost Comparison":
- Node Count by Instance Family — stat panel (query #1)
- Estimated Hourly GPU Cost — stat panel (query #6)
- Pod Distribution (g6 vs g6f) — table (query #7)
- GPU VRAM Headroom by Instance Family — timeseries (query #4)
- Latency: g6 vs g6f — timeseries (query #5 split by family)
Sarah suggested using karpenter.k8s.aws/instance-gpu-memory with Gt operator to replace the entire minGpuMemoryMiB if/else ladder. This would be ideal but currently doesn't work for g6f — Karpenter populates this label from the EC2 DescribeInstanceTypes API, and g6f reports GPU count as 0 (uses LogicalGpuCount). NodeOverlay can only override capacity, not labels. Once Karpenter adds label override support to NodeOverlay (upstream RFC in progress), this simplification becomes possible.
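For context, a sketch of what that simplification could look like once the label is populated for g6f (illustrative only — this does not work for g6f today):

```yaml
# Hypothetical requirement once karpenter.k8s.aws/instance-gpu-memory is populated for g6f.
# Gt compares against a single integer value (GPU memory in MiB).
- key: karpenter.k8s.aws/instance-gpu-memory
  operator: Gt
  values: ["2999"]   # only instance types with more than 2,999 MiB of GPU memory match
```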
Problem: Kubernetes GPU scheduling uses nvidia.com/gpu (integer count) — there is no VRAM-aware scheduling. Both g6.xlarge (22.8 GiB VRAM) and g6f.large (2.8 GiB VRAM) report nvidia.com/gpu: 1. If g6f is in the same NodePool as g6, Karpenter may choose the cheapest option (g6f.large at $0.20/hr) for any GPU workload, even if the model needs more than 2.8 GiB. This causes silent GPU OOM at runtime — the pod starts, loads the model, and crashes.
flowchart TD
POD["Pod requests<br/>nvidia.com/gpu: 1<br/>gpuName: l4"] --> KS["Karpenter Scheduler"]
KS --> PICK{"Cheapest node<br/>with 1 GPU?"}
PICK -->|g6f.large $0.20/hr| G6F["g6f.large<br/>2,861 MiB VRAM"]
PICK -->|g6.xlarge $0.80/hr| G6["g6.xlarge<br/>22,888 MiB VRAM"]
G6F --> CHECK{"Model fits<br/>in 2.8 GiB?"}
CHECK -->|Yes| OK["OK"]
CHECK -->|No| OOM["GPU OOM<br/>silent failure"]
style OOM fill:#f44,stroke:#333,color:#fff
style G6F fill:#fbb,stroke:#333
style G6 fill:#bfb,stroke:#333
Decision: g6f instances must be opt-in only. A workload should never land on g6f unless it explicitly opts in.
Implemented solution (data-airflow#74317 + stacked PR):
The moonlet chart deployment template now enforces g6f opt-in via anti-affinity. Every GPU pod that does NOT explicitly target g6f gets a NotIn expression excluding all 4 g6f instance types (g6f.large, g6f.xlarge, g6f.2xlarge, g6f.4xlarge). This is safe even when g6f is enabled in the Karpenter NodePool.
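As an illustration, the rendered node affinity for a pod that has not opted in would look roughly like this (sketch — the actual chart may key off a different node label, e.g. `karpenter.k8s.aws/instance-family`):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: NotIn
              values: ["g6f.large", "g6f.xlarge", "g6f.2xlarge", "g6f.4xlarge"]
```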
There are three ways to opt in to g6f:
1. `instanceType` (most explicit): Pin to a specific g6f size, e.g. `instanceType: "g6f.large"`. Use when you know the exact VRAM tier you need.
2. `instanceFamily` (family-level): Set `instanceFamily: "g6f"` to allow any g6f size. Only safe if you're OK with Karpenter choosing any VRAM tier.
3. `minGpuMemoryMiB` (VRAM-aware, recommended): Specify the model's GPU memory requirement in MiB. The template automatically excludes g6f tiers with insufficient VRAM and allows the rest — including full g6 nodes. This is the safest and most ergonomic option.
All three are mutually exclusive with each other (enforced by {{- fail }} in the Helm template).
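A minimal sketch of that guard (not the actual chart code; field paths are assumptions):

```yaml
{{- /* Hedged sketch of the mutual-exclusivity check; the real chart covers all three fields pairwise */}}
{{- if and .Values.instanceType .Values.instanceFamily }}
  {{- fail "instanceType and instanceFamily are mutually exclusive — set at most one" }}
{{- end }}
```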
Since Kubernetes has no native VRAM-aware scheduling, the moonlet chart implements a client-side workaround via the minGpuMemoryMiB field. The template maps the memory requirement to g6f exclusion rules at render time:
| `minGpuMemoryMiB` | g6f.large/xlarge (2,861 MiB) | g6f.2xlarge (5,722 MiB) | g6f.4xlarge (11,444 MiB) | Full g6 (22,888 MiB) |
|---|---|---|---|---|
| 500 | Allowed | Allowed | Allowed | Allowed |
| 3000 | Excluded | Allowed | Allowed | Allowed |
| 6000 | Excluded | Excluded | Allowed | Allowed |
| 12000 | Excluded | Excluded | Excluded | Allowed |
| 0 (default) | Excluded | Excluded | Excluded | Allowed |
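For example, a pod with `minGpuMemoryMiB: 3000` would render an exclusion along these lines (illustrative sketch; the real chart output may differ):

```yaml
- key: node.kubernetes.io/instance-type
  operator: NotIn
  values: ["g6f.large", "g6f.xlarge"]   # 2,861 MiB tiers fall below the 3,000 MiB requirement; larger g6f tiers and full g6 remain allowed
```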
Usage in values:
moonletDeployments:
  msmarco_tinybert_crossencoder_v1:
    resources:
      minGpuMemoryMiB: 500  # observed 358 MiB — all g6f tiers fit

Usage via mlctl CLI:
mlctl deploy moonlet \
--deployment msmarco_tinybert_crossencoder_v1 \
--cluster ml-dev-iad \
--min-gpu-memory 500

All g6f sizes report `nvidia.com/gpu: "1"` via NodeOverlay. Karpenter cannot distinguish memory tiers between g6f.large (2.8 GiB) and g6f.4xlarge (11.4 GiB). If a model needs more GPU memory than the provisioned g6f size provides, it will OOM at runtime — there is no scheduling-time protection.
Mitigation (implemented):
- `minGpuMemoryMiB` auto-excludes g6f tiers that are too small (see table above)
- Use `instanceType` (e.g., `g6f.2xlarge`) for models that need a specific VRAM tier
- Default behavior (`minGpuMemoryMiB: 0`) excludes ALL g6f — safe by default
- Current profiling shows all models fit in g6f.large (2.8 GiB) with 40%+ headroom
NodeOverlay (v1alpha1) can only override capacity (resource counts) and price — it cannot set labels. So we cannot add a custom label like karpenter.k8s.aws/instance-gpu-fractional=true to g6f nodes. The Karpenter team has an upstream RFC proposing instance-gpu-fractional labels, but this is not yet available.
Until then, karpenter.k8s.aws/instance-family is the only reliable way to distinguish g6 from g6f.
With WhenEmptyOrUnderutilized consolidation (default for ML clusters), Karpenter may:
- Consolidate workloads from g6 to g6f if g6f is cheaper and meets resource requirements
- This is generally desirable for cost savings when g6f is in the NodePool
- But could cause GPU OOM if consolidation moves a large-VRAM model to g6f
Mitigation: If g6f is opt-in only (option 2 above), consolidation cannot move workloads to g6f unless the pod already has the right affinity. This is another reason to keep g6f out of the shared GPU NodePool.
The NVIDIA GPU Operator's device plugin exposes nvidia.com/gpu based on actual hardware detection, not Karpenter labels. On g6f nodes with GRID driver, the device plugin correctly detects 1 GPU partition and exposes it. The NodeOverlay's nvidia.com/gpu: "1" must match what the device plugin reports — which it does.
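On a healthy g6f node the two agree; a trimmed `kubectl get node -o yaml` excerpt would look roughly like this (illustrative):

```yaml
status:
  capacity:
    nvidia.com/gpu: "1"     # advertised by the NVIDIA device plugin on the node
  allocatable:
    nvidia.com/gpu: "1"     # matches the count the NodeOverlay declares for Karpenter's scheduling simulation
```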
Two dashboards exist in the ML Services folder:
- Moonlet Service (`moonlet-service`) — fleet health, gRPC latency/success, alert tiers, KEDA scaling, CUDA OOM logging (will fire on g6f OOMs)
- Moonlet GPU (`moonlet-gpu-uid`) — DCGM metrics: GPU util, VRAM used/free/headroom per deployment, utilization percentiles
The GPU dashboard joins DCGM_FI_DEV_* metrics to pods via kube_pod_labels on Prometheus_Kubernetes. No queries currently filter by instance type or family — all scoped to cluster only.
New variable: instance_family via label_values(kube_node_labels{cluster="ml-dev-iad"}, label_karpenter_k8s_aws_instance_family)
New row — "Instance Family Comparison":
| Panel | Query Pattern |
|---|---|
| GPU Nodes by Family | count by (label_node_kubernetes_io_instance_type) (kube_node_labels{...}) |
| VRAM Used by Family | DCGM_FI_DEV_FB_USED joined via kube_pod_info → kube_node_labels |
| VRAM Headroom by Family | DCGM_FI_DEV_FB_FREE — critical for g6f OOM risk detection |
| Latency by Family | grpc_server_handling_seconds joined to node labels |
Key join (pod → node → instance family):
kube_pod_info{cluster="ml-dev-iad"} * on(cluster, node) group_left(label_karpenter_k8s_aws_instance_family) kube_node_labels{cluster="ml-dev-iad"}
| Risk | Severity | Mitigation |
|---|---|---|
| g6f requires NVIDIA GRID driver — datacenter drivers do not work | RESOLVED | PR #888 merged — GRID 570.172.08 installed in dev |
| g6f missing from ip_per_eni — causes pod limit of ~11 | RESOLVED | PR #969 merged 2026-04-13 |
| NodeOverlay applied manually — not managed by GitOps | RESOLVED | bedrock-argocd#440 — auto-creates NodeOverlay when G6F_ENABLED |
| No way to distinguish g6 from g6f via gpuName | RESOLVED | data-airflow#74317 — merged 2026-04-17: instanceFamily affinity + g6f anti-affinity + minGpuMemoryMiB |
| GRID driver AMI cannot be shared with existing g6 nodes | Low | Same AMI works — upgrade workflow installs GRID at provision time in dev |
| NodeOverlay is alpha - API may change | Medium | Pin Karpenter version; monitor upstream RFC progress |
| GPU memory OOM on fractional partitions | MITIGATED | minGpuMemoryMiB auto-excludes undersized g6f tiers; default (0) blocks all g6f; anti-affinity enforced in chart |
| g6f.large pod capacity may be tight (18 pods) | Medium | Use g6f.xlarge+ (23 pods) for production workloads |
| Mixed g6/g6f scheduling with gpuName: "l4" | MITIGATED | Chart auto-adds g6f NotIn anti-affinity for pods not explicitly targeting g6f |
| GRID driver not yet enabled in prod | BLOCKER | Need shipyard-chef-repo PR to add prod to upgrade_envs |
| Karpenter consolidation churn | Low | Monitor consolidation metrics post-rollout |
| PR | Status | Purpose |
|---|---|---|
| chef-repo#154249 | Merged | Create GPU longshoreman ASG |
| chef-repo#154263 | Merged | Add kubernetes tags to GPU ASG |
| chef-repo#154310 | Merged | Fix instance_type + nebula-ca type tag |
| shipyard-chef-repo#888 | Merged | GRID driver support in nvidia-gpu cookbook |
| shipyard-chef-repo#959 | Open | Deduplicate lspci NVLink detection |
| shipyard-chef-repo#969 | Merged | Add g6f to ip_per_eni, fix g6e values, fix shipyard_lookup CI |
| bedrock-argocd#440 | Merged | Auto-create g6f NodeOverlay when G6F_ENABLED |
| bedrock-argocd#446 | Merged | Remove nvidia.com/gpu: "0" label from NodePool templates |
| bedrock-argocd#443 | Open (blocked on GRID driver) | Enable g6f NodePool in prod-iad-2 |
| data-airflow#74317 | Merged (2026-04-17) | instanceType affinity + minGpuMemoryMiB + g6f anti-affinity + helm tests |
| data-airflow#74389 | Merged | Locust load test fixes (--class-picker, cumulative tick, shapes cleanup) |
| data-airflow#74336 | Open (CI green, needs review — assigned to shenkens; may need rebase post-#74317 merge) | Dev migration: 2 smallest GPU models via minGpuMemoryMiB |
| data-airflow#74409 | Open (CI green, no reviews yet; may need rebase post-#74317 merge) | Dev migration values + load test comparison commands |
| data-airflow#74489 | Merged (2026-04-17) | Fix load test CLI: --users/--spawn-rate/--run-time not overridden by LoadTestShapes |
- AWS NVIDIA driver compatibility table (g6f = GRID only)
- AWS NVIDIA GRID driver installation guide
- GRID driver S3 bucket: `s3://ec2-linux-nvidia-drivers/`
- Karpenter g6f support issue (aws/karpenter-provider-aws#8368)
- Karpenter NodeOverlay docs
- Karpenter Feature Gates
Click to expand detailed March testing notes
NVRM: The NVIDIA GPU 0000:31:00.0 (PCI ID: 10de:27b8)
NVRM: NVIDIA 560.35.05 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
nvidia: probe of 0000:31:00.0 failed with error -1
NVRM: The NVIDIA probe routine failed for 1 device(s).
NVRM: None of the NVIDIA devices were initialized.
nvidia-smi cannot communicate with the driver on the g6f node.
NFD sees the GPU hardware. The node labels show nvidia.com/gpu.present=true.
NVRM: The NVIDIA GPU 0000:31:00.0 (PCI ID: 10de:27b8)
NVRM: NVIDIA 560.35.05 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
nvidia: probe of 0000:31:00.0 failed with error -1
g6f fractional L4 GPU presents PCI device ID 10de:27b8 — not in the datacenter driver's supported device list.
AWS documentation confirms: g6f instances do NOT support Tesla/datacenter drivers — only GRID drivers.
Status: DONE — 570.x datacenter driver also fails. Datacenter driver ruled out.
Same failure with driver 570.172.08:
NVRM: The NVIDIA GPU 0000:31:00.0 (PCI ID: 10de:27b8)
NVRM: NVIDIA 570.172.08 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
nvidia: probe of 0000:31:00.0 failed with error -1