Fractional GPU Test - G6f
Slack ML runs GPU inference workloads (Slack AI, Slackbot, semantic search) that currently require full NVIDIA L4 GPUs via g6 instance types. Many of these models only need 1/4 or 1/8 of available GPU RAM, meaning we are significantly over-provisioning.
AWS g6f instances offer fractional NVIDIA L4 GPUs at substantially lower price points. Evaluating these instances is a V2MOM priority for Moonlet Inference cost savings.
g6f instances provide fractional slices of NVIDIA L4 GPUs without requiring MIG (Multi-Instance GPU), which is only available on P4/P5 instance types. Each g6f instance exposes a GPU partition as a single logical GPU to the operating system.
| Instance Type | vCPUs | RAM (GiB) | GPU Partition | GPU Memory (MiB) | On-Demand $/hr |
|---|---|---|---|---|---|
| g6f.large | 2 | 8 | 1/8 | ~2,861 | $0.20 |
| g6f.xlarge | 4 | 16 | 1/8 | ~2,861 | $0.40 |
| g6f.2xlarge | 8 | 32 | 1/4 | ~5,722 | $0.77 |
| g6f.4xlarge | 16 | 64 | 1/2 | ~11,444 | $1.53 |
For comparison, the smallest full-GPU option:
| Instance Type | vCPUs | RAM (GiB) | GPUs | GPU Memory (MiB) | On-Demand $/hr |
|---|---|---|---|---|---|
| g6.xlarge | 4 | 16 | 1 | 22,888 | $0.80 |
If workloads that currently use g6.xlarge ($0.80/hr) can run on g6f.large ($0.20/hr), that represents a 75% cost reduction per instance. Even moving to g6f.2xlarge ($0.77/hr) for workloads needing ~5.7 GiB of GPU memory saves cost while right-sizing GPU memory allocation, freeing full L4 GPUs for workloads that actually need them.
flowchart TD
TF["Terraform<br/>provisioner_instance_families<br/>= [G6F_ENABLED, ...]"] --> AA["ArgoCD ApplicationSet<br/>reads cluster annotations"]
AA --> HV["Helm Values<br/>customdata.instanceFamilyLabels<br/>= JSON array"]
HV --> NP["NodePool<br/>karpenter.k8s.aws/instance-family<br/>In: [g6f]"]
HV --> NO["NodeOverlay<br/>nvidia.com/gpu: 1<br/>(EC2 API reports 0)"]
NP --> KS["Karpenter Scheduler"]
NO --> KS
KS --> EC2["EC2 g6f Instance<br/>NVIDIA L4 fractional GPU<br/>GRID driver"]
POD["Pod<br/>nvidia.com/gpu: 1<br/>instanceFamily: g6f"] --> KS
style NO fill:#f9f,stroke:#333
style EC2 fill:#bfb,stroke:#333
flowchart TD
START["Model needs GPU"] --> Q1{"Need specific<br/>GPU memory tier?"}
Q1 -->|No| Q2{"Need specific<br/>instance family?"}
Q1 -->|Yes| IT["Use instanceType<br/>e.g. g6f.2xlarge"]
Q2 -->|No| GN["Use gpuName: l4<br/>Karpenter picks best fit"]
Q2 -->|Yes, fractional| IF_G6F["Use instanceFamily: g6f"]
Q2 -->|Yes, full GPU| IF_G6["Use instanceFamily: g6"]
GN -.->|WARNING| WARN["If g6 + g6f both in NodePool,<br/>pod may land on either.<br/>g6f.large has only 2.8 GiB GPU RAM"]
style WARN fill:#fbb,stroke:#333
style IT fill:#bbf,stroke:#333
style IF_G6F fill:#bfb,stroke:#333
The EC2 DescribeInstanceTypes API returns GPU Count: 0 for g6f instances (it uses a new LogicalGpuCount field instead). This means Karpenter cannot natively:
- Provision g6f instances for workloads requesting `nvidia.com/gpu`
- Correctly account for GPU resources in NodePool limits
See: aws/karpenter-provider-aws#8368
Karpenter v1.7+ supports NodeOverlay, an alpha feature that allows overriding instance type capabilities. This lets us manually declare that g6f instances have 1 GPU.
Requires the NodeOverlay feature gate enabled in Karpenter settings.
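For reference, a minimal sketch of enabling the gate via the Karpenter Helm values — the exact key (`settings.featureGates.nodeOverlay`) is an assumption here; confirm against the Karpenter Feature Gates doc linked in the references:

```yaml
# Hedged sketch — assumes the Karpenter chart exposes the gate under settings.featureGates;
# verify the exact key name for your Karpenter version before applying.
settings:
  featureGates:
    nodeOverlay: true   # enables the alpha NodeOverlay API (karpenter.sh/v1alpha1)
```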
| Cluster | Karpenter | NodeOverlay Feature Gate | NodeOverlay Applied |
|---|---|---|---|
| ml-dev-cmh | v1.9.0 | Enabled | Yes (from Phase 1) |
| ml-dev-iad | v1.9.0 | Enabled | Yes (applied 2026-04-08) |
| ml-dev-pdx | v1.9.0 | Enabled | Yes (applied 2026-04-13) |
Status: DONE (ml-dev-cmh, ml-dev-iad, and ml-dev-pdx)
Status: DONE (all three clusters)
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: g6f-gpu-overlay
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ['g6f']
  capacity:
    nvidia.com/gpu: "1"

Status: DONE — datacenter driver failed. See original testing notes below.
Both NVIDIA 560.35.05 and 570.172.08 datacenter drivers failed on g6f because PCI device ID 10de:27b8 is not in the datacenter driver's supported device list. AWS documentation confirms g6f only supports GRID drivers.
Status: DONE — PR #888 merged 2026-04-06
PR #888 added NVIDIA GRID driver support to the nvidia-gpu cookbook:
- `driver-type` attribute (`datacenter` vs `grid`) controls which driver is installed
- GRID driver: `.run` installer from `s3://ec2-linux-nvidia-drivers/grid-18.4/` with SHA-256 checksum verification
- `upgrade_envs = %w(dev)` — dev environment gets GRID, all others stay on datacenter
- Tested via `ship quick` on GPU longshoreman ASG (bake + provision both passed, 48/48 ServerSpec)
- `nvidia-smi` confirmed: `Driver Version: 570.172.08`, `NVIDIA L4` GPU
Status: DONE — Phase 1 fully complete
PR #969 merged 2026-04-13 — adds g6f to ip_per_eni mappings, also fixes g6e values.
| Instance Type | Max ENIs | IPv4/ENI | Calculated max_pods |
|---|---|---|---|
| g6f.large | 2 | 10 | 18 |
| g6f.xlarge | 4 | 15 | 23 |
| g6f.2xlarge | 4 | 15 | 23 |
| g6f.4xlarge | 8 | 30 | 38 |
Tested on ml-dev-iad with a g6f.2xlarge node (ip-10-56-126-0.ec2.internal).
| Check | Result |
|---|---|
| NodeOverlay applied | Ready=True, ValidationSucceeded=True |
| g6f.2xlarge node launch | Karpenter created NodeClaim, EC2 instance launched |
| `nvidia.com/gpu` Capacity | 1 |
| `nvidia.com/gpu` Allocatable | 1 (previously 0 before pod limit fix) |
| Allocatable pods | 23 (previously ~11 before pod limit fix) |
| driver-validation (init container) | PASSED |
| toolkit-validation (init container) | PASSED |
| cuda-validation (init container) | PASSED |
| plugin-validation (init container) | PASSED |
| nvidia-operator-validator | "all validations are successful" |
| nvidia-cuda-validator | "cuda workload validation is successful" |
| ML workload pods scheduling | YES — moonlet-ai-search-summary-messages and moonlet-reflector running on g6f node |
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4-6Q On | 00000000:31:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
- GPU: `NVIDIA L4-6Q` — the "6Q" indicates a 6 GiB GRID/vGPU partition of the L4
- GPU Memory: `6144 MiB` (6 GiB) — matches g6f.2xlarge 1/4 partition spec
- Driver: `570.172.08` (GRID)
- CUDA: `12.8`
Phase 1 complete. Phase 2 is now unblocked.
- Re-validate g6f with correct pod limits (after PR #969 merge + AMI bake) — DONE 2026-04-14
- Confirm `nvidia-smi` shows fractional GPU memory from a workload pod — DONE: NVIDIA L4-6Q, 6144 MiB
- Identify candidate models/workloads that fit within g6f GPU memory tiers — DONE 2026-04-14 (see below)
- Add `instanceFamily` node affinity to moonlet chart + mlctl — DONE 2026-04-14 (data-airflow#74317)
- Fix and validate load test tooling for A/B comparison — DONE 2026-04-16 (data-airflow#74389), see Load Test Tooling Validation below
- A/B load test: deploy same model on g6 vs g6f, compare latency — DONE 2026-04-20 for `gte_multilingual_base_query_v1` and `msmarco_tinybert_crossencoder_v1` (see results)
- Validate inference accuracy matches full-GPU baseline
- Test GPU memory pressure / OOM behavior at partition boundaries
Sampled nvidia-smi across multiple full g6 nodes on ml-dev-iad. All current GPU workloads use well under 2 GiB of GPU memory — a fraction of the 23 GiB available on full L4 GPUs.
| Node | GPU 0 | GPU 1 | GPU 2 | GPU 3 |
|---|---|---|---|---|
| ip-10-56-126-93 | 358 MiB | 550 MiB | 1,056 MiB | 1,056 MiB |
| ip-10-55-125-145 | 0 MiB | 1,724 MiB | 0 MiB | 926 MiB |
| ip-10-57-127-57 | 1,056 MiB | 604 MiB | 0 MiB | 0 MiB |
| ip-10-56-252-162 | 1,416 MiB | — | — | — |
| ip-10-57-126-2 | 1,080 MiB | 0 MiB | 0 MiB | 0 MiB |
Peak observed: 1,724 MiB (finetuned-gte-multilingual model) — this fits in the smallest g6f tier (g6f.large at 2,861 MiB) with 40% headroom.
| Model | GPU Memory (observed) | Fits in g6f.large? |
|---|---|---|
| finetuned-gte-multilingual | ~1,416–1,724 MiB | Yes (2,861 MiB available) |
| gte-multilingual-base variants | ~1,056 MiB | Yes |
| bert-* models | ~550–926 MiB | Yes |
| mmarco-roberta-crossencoder | ~550 MiB | Yes |
| msmarco-minilml6-embedder | ~358 MiB | Yes |
| msmarco-tinybert-crossencoder | ~358 MiB | Yes |
Conclusion: Every current ML inference workload could run on g6f.large ($0.20/hr) instead of g6.xlarge ($0.80/hr) — a 75% cost reduction per GPU instance.
Both g6 and g6f nodes have karpenter.k8s.aws/instance-gpu-name=l4 (the EC2 API reports the GPU as "L4" for both). So the existing gpuName: "l4" cannot distinguish them. data-airflow#74317 adds instanceFamily as a new node affinity field targeting karpenter.k8s.aws/instance-family. It is mutually exclusive with gpuName — the chart fails with a clear error if both are set.
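For illustration, a values sketch of the new field — the nesting under `resources` mirrors the `minGpuMemoryMiB` example later in this doc and is an assumption here, as is the deployment name:

```yaml
moonletDeployments:
  example_model_v1:            # hypothetical deployment name
    resources:
      instanceFamily: "g6f"    # targets karpenter.k8s.aws/instance-family; mutually exclusive with gpuName
```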
data-airflow#74389 fixed several issues in the load test tooling that blocked A/B testing. Validated end-to-end on ml-dev-iad via three branch-deployed builds.
- XDS cluster name had per-suffix identity: `_worker_annotations` appended the suffix to the XDS cluster name (e.g. `moonlet-load-test-flat-test`). This identity was never registered with envoy, so workers had no upstream routes — resulting in 0 RPS despite successful pod scheduling. Fixed by hardcoding `moonlet-load-test`.
- LoadTestShape auto-activation: Locust auto-discovers `LoadTestShape` subclasses in the loaded file and uses the first one found, overriding `--users`. This meant flat-rate tests always followed the `Spike` pattern. Fixed by adding `--class-picker` to the locust command, which enables a web UI dropdown for selecting "Default" (flat-rate) or a specific shape (Spike, Soak, Breakpoint).
- Cumulative stage duration bug in `Spike.tick()`: the `tick()` method compared `run_time` directly against each stage's `duration` field instead of accumulating elapsed time. This meant the ramp-down stage (duration=30s) was unreachable after the peak stage (duration=120s). Fixed by accumulating elapsed time across stages.
# 1. Build the ml-locust image from the PR branch
gondolier build ml-locust --branch xjohnson-loadtest-flat-mode
# 2. Deploy the build to dev
# (build IDs from gondolier output)
gondolier deploy ml-locust --build <build-id> --stage dev
# 3. Run a flat-rate load test (uses new image via --image)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 \
--spawn-rate 5 \
--run-time 5m \
--suffix flat-v2 \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:6b1507bfaeab926f4bc64f29201b43df607aff9093291da56b8365ca96045cc8"
# 4. Check test status
mlctl load-test status <test-name>
# 5. Get results
mlctl load-test report <test-name>
# 6. Inspect stats directly from master pod
kubectl -n locust-system --context ml-dev-iad \
exec <master-pod> -c <master-container> -- \
curl -s http://localhost:8089/stats/requests
# 7. Clean up
mlctl load-test delete <test-name>

Test 1: XDS routing fix (build 019d9710-5bbf, deployment 019d9717-05de)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 --spawn-rate 5 --run-time 5m \
--suffix flat-test \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:..."
| Endpoint | Requests | Failures | Avg Latency | p95 | p99 | RPS |
|---|---|---|---|---|---|---|
| /grpc.health.v1.Health/Check | 163,972 | 0 | 4ms | 7ms | 10ms | ~561 |
| /slack.moonlet.MoonletService/Predict | 52,556 | 0 | 42ms | 59ms | 98ms | ~176 |
| Aggregated | 216,528 | 0 (0.0%) | 14ms | 47ms | 62ms | 701.9 |
Confirmed: 20 users (flat-rate), 3 workers, XDS cluster moonlet-load-test received full route table.
Test 2: Flat-rate mode (build 019d9780-af6f, deployment 019d978a-ede2)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 --spawn-rate 5 --run-time 5m \
--suffix flat-v2 \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:6b1507bfaeab926f4bc64f29201b43df607aff9093291da56b8365ca96045cc8"
| Endpoint | Requests | Failures | Avg Latency | p95 | p99 | RPS |
|---|---|---|---|---|---|---|
| /grpc.health.v1.Health/Check | 164,671 | 0 | 4.3ms | 7ms | 10ms | ~560 |
| /slack.moonlet.MoonletService/Predict | 52,083 | 0 | 42.6ms | 59ms | 99ms | ~176 |
| Aggregated | 216,754 | 0 (0.0%) | 13.5ms | 47ms | 62ms | 736.3 |
Confirmed: 20 users (flat-rate respected, no shape auto-activation), 3 workers.
Test 3: Shape picker UI (build 019d97af-a903, deployment 019d97bd-32ce)
mlctl load-test start \
--model-name msmarco_tinybert_crossencoder_v1 \
--users 20 --spawn-rate 5 --run-time 3m \
--suffix flat-v3 \
--image "388288586445.dkr.ecr.us-east-1.amazonaws.com/ml-base-dev@sha256:4ddd479aa050a7501c39aae68e6a47c9b703d200f50c6587c27120b09d3f6af0"
Verified via curl http://localhost:8089/ on the master pod:
{
"available_shape_classes": ["Default", "Breakpoint", "Soak", "Spike"],
"show_userclass_picker": true,
"available_user_classes": ["BasicUser", "HealthChecker"]
}

Users can select "Default" for flat-rate or a specific shape from the web UI dropdown.
- 0 failures across 400k+ total requests
- Flat-rate mode works (20 users held constant, no spike)
- Shape picker dropdown functional in web UI
- XDS routing stable with base cluster name
- Consistent ~736 RPS throughput across test runs
- Load test tooling is ready for the g6 vs g6f A/B comparison
Once data-airflow#74317 is merged, run the following to compare g6 vs g6f latency for a representative model:
# 1. Deploy the model pinned to g6 (full L4)
mlctl deploy moonlet \
--deployment gte_multilingual_base_query_v1 \
--cluster ml-dev-iad \
--suffix g6 \
--instance-family g6
# 2. Deploy the same model pinned to g6f (fractional L4)
mlctl deploy moonlet \
--deployment gte_multilingual_base_query_v1 \
--cluster ml-dev-iad \
--suffix g6f \
--instance-family g6f
# 3. Run 15-minute load test against g6 deployment
mlctl load-test start \
--model-name gte_multilingual_base_query_v1 \
--suffix g6 \
--run-time 15m \
--users 100 \
--spawn-rate 1
# 4. Run 15-minute load test against g6f deployment
mlctl load-test start \
--model-name gte_multilingual_base_query_v1 \
--suffix g6f \
--run-time 15m \
--users 100 \
--spawn-rate 1
# 5. Compare p50/p95/p99 latency from Locust results
# 6. Clean up
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix g6 --mode cleanup
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix g6f --mode cleanup

Success criteria: g6f p99 latency within 10% of g6 baseline for the same model and load profile.
Attempted a direct g6 vs g6f comparison for msmarco_tinybert_crossencoder_v1 and msmarco_minilml6_embedder_v0. The g6 baseline ran successfully; the g6f side hit multiple infrastructure issues.
msmarco_tinybert_crossencoder_v1 (1 replica, g6 node):
| Metric | At 100 users | At 500 users |
|---|---|---|
| Predict avg latency | ~1,011 ms | 3,164 ms |
| Predict failure rate | 0% | 96% (UNAVAILABLE — upstream timeout) |
| Health check avg | 17 ms | 902 ms |
| Total RPS | ~211 | ~103 |
| Total requests | — | 27,418 |
msmarco_minilml6_embedder_v0 (2 replicas, g6 nodes):
| Metric | At 100 users | At 500 users |
|---|---|---|
| Predict avg latency | ~62 ms | 1,097 ms |
| Predict failure rate | 0% | 0% |
| Health check avg | 3 ms | 372 ms |
| Total RPS | ~573 | ~237 |
| Total requests | — | 66,958 |
Multiple attempts to deploy g6f-pinned copies failed due to compounding infrastructure issues:
1. `minGpuMemoryMiB` and `--instance-type` CLI flags not yet on master — PR #74317 hasn't merged, so `mlctl deploy moonlet` doesn't support `--instance-type` or `--min-gpu-memory`. Had to use raw `helm upgrade --set` commands to pass `instanceType=g6f.2xlarge`.
2. Initial deploy targeted `g6f.xlarge`, but no g6f.xlarge nodes exist — the dev cluster only has g6f.2xlarge nodes, and Karpenter couldn't provision new g6f.xlarge nodes due to the `nvidia.com/gpu: "0"` NodePool label conflict (see #3).
3. `nvidia.com/gpu: "0"` NodePool label blocks new g6f provisioning — the NodePool template sets `nvidia.com/gpu: "0"` as a label on all nodes. EC2 reports g6f GPU count as 0, so Karpenter thinks g6f.xlarge has `nvidia.com/gpu=0`, which conflicts with pod requests of `nvidia.com/gpu: 1`. Existing g6f nodes work (NodeOverlay + device plugin expose the real GPU), but Karpenter can't provision new ones.
4. Existing g6f nodes at capacity — all 3 existing g6f.2xlarge nodes had their single GPU fully allocated. Combined with #3 preventing new node provisioning, g6f pods stayed Pending.
5. Branch confusion — the chart changes (`instanceType` affinity, g6f anti-affinity) live on the `xjohnson-add-instance-family-affinity` branch, and the `mlctl` deploy flags are only available with that branch's code. Running `mlctl` from the wrong branch silently deployed without affinity rules.
The infrastructure blockers from 2026-04-17 were resolved by merging data-airflow#74317 and data-airflow#74489. See the successful A/B comparison below.
Successful direct comparison of gte_multilingual_base_query_v1 on g6 (full L4) vs g6f (fractional L4). Both deployed as suffixed copies with --instance-type affinity, verified on correct node types.
Test configuration: 100 concurrent users, 10/s spawn rate, 10m run time, 3 workers, 1 replica per variant.
Node placement verified:
- g6 copy → `g6.2xlarge` node (ip-10-57-124-136)
- g6f copy → `g6f.2xlarge` node (ip-10-57-126-106)
| Metric | g6 (full L4, 22.8 GiB VRAM) | g6f (1/4 L4, 5.7 GiB VRAM) | Delta |
|---|---|---|---|
| Avg Predict latency | 168 ms | 352 ms | +109% (2.1x slower) |
| Throughput (RPS) | 591 | 271 | -54% |
| Failure rate | 0% | 0% | identical |
| Total requests (10m) | 344,048 | 167,961 | — |
- Latency: g6f is ~2.1x slower than g6 for this model. This is expected — g6f.2xlarge provides a 1/4 partition of the L4 GPU, so compute throughput is proportionally lower.
- Reliability: Both instance types handled 100 concurrent users with zero failures. g6f is fully functional for this workload.
- Cost-adjusted throughput: g6.2xlarge costs ~$1.52/hr vs g6f.2xlarge at ~$0.77/hr (2x cheaper). At 2.1x slower, the cost-per-request is roughly equivalent. However, for latency-sensitive workloads at this scale, g6 is better. For cost-sensitive workloads with lower concurrency or less latency sensitivity, g6f provides identical reliability at half the price.
- Right-sizing opportunity: The gte model uses ~1,056 MiB GPU memory — well within g6f.large (2,861 MiB, $0.20/hr) or g6f.xlarge (2,861 MiB, $0.40/hr). Testing on g6f.large would show whether the smaller vCPU/RAM is the bottleneck, or if the GPU partition size is the limiting factor.
# Deploy
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6f --instance-type g6f.2xlarge
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6
# Load tests
mlctl load-test start --model-name gte_multilingual_base_query_v1 --suffix lt-g6f --name gte-g6f --users 100 --spawn-rate 10 --run-time 10m
mlctl load-test start --model-name gte_multilingual_base_query_v1 --suffix lt-g6 --name gte-g6 --users 100 --spawn-rate 10 --run-time 10m
# Reports
mlctl load-test report gte-g6f
mlctl load-test report gte-g6
# Cleanup
mlctl load-test delete gte-g6f && mlctl load-test delete gte-g6
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6f --mode cleanup --force-cleanup
mlctl deploy moonlet --deployment gte_multilingual_base_query_v1 --suffix lt-g6 --mode cleanup --force-cleanup

Direct comparison of msmarco_tinybert_crossencoder_v1 (one of the smallest moonlet models, ~358 MiB GPU memory) on g6.2xlarge (full L4) vs g6f.2xlarge (1/4 L4).
Why tinybert? The gte comparison above showed a 2.1x slowdown. Tinybert is a much smaller model, so the hypothesis was that the latency delta would be narrower since compute isn't the bottleneck.
Test configuration: 100 concurrent users, 5/s spawn rate, 10m run time, 3 workers, 1 replica per variant. Used a custom locustfile that connects directly to k8s service DNS (bypassing envoy) because suffixed deployments register under different envoy cluster names.
Node placement verified:
- g6 copy → `g6.2xlarge` node (ip-10-55-252-253)
- g6f copy → `g6f.2xlarge` node (ip-10-55-124-120)
| Metric | g6.2xlarge (full L4) | g6f.2xlarge (1/4 L4) | Delta |
|---|---|---|---|
| Total requests (10m) | 1,689 | 1,103 | -35% |
| Throughput (RPS) | 2.82 | 1.84 | -35% |
| Avg Predict latency | 33,903 ms | 51,042 ms | +51% |
| Min latency (best proxy for raw inference) | 873 ms | 1,461 ms | +67% |
| Median (p50) | 35,000 ms | 54,000 ms | +54% |
| p95 | 36,000 ms | 55,000 ms | +53% |
| p99 | 36,000 ms | 55,000 ms | +53% |
| Failure rate | 0% | 0% | identical |
- High absolute latencies are queueing artifacts: 100 concurrent users hitting a single GPU creates massive contention. The median/p95/p99 are nearly identical because the queue is always saturated. Min latency is the best measure of raw inference speed.
- Min latency: g6f is ~67% slower (873ms → 1,461ms). Even for a tiny model (~358 MiB VRAM), the fractional GPU partition imposes a significant compute penalty.
- Throughput: g6f delivers 35% fewer requests in the same time period under identical load.
- Zero failures: Both instance types handle the load without errors or OOM.
- Comparison with gte results: gte showed a 2.1x slowdown, tinybert shows a 1.67x slowdown. The hypothesis was partially confirmed — smaller models see a narrower gap, but it's still substantial.
| Instance | $/hr (on-demand) | Min latency | Throughput | Cost per 1K requests |
|---|---|---|---|---|
| g6.2xlarge | ~$1.52 | 873ms | 2.82 RPS | ~$0.150 |
| g6f.2xlarge | $0.77 | 1,461ms | 1.84 RPS | ~$0.116 |
For latency-insensitive workloads, g6f.2xlarge is roughly 23% cheaper per request at these on-demand prices. For latency-sensitive workloads, the 67% min-latency penalty may exceed p99 SLA budgets.
Could not provision g6f.4xlarge nodes due to the nvidia.com/gpu: "0" NodePool label issue (see Blockers). Karpenter refuses to provision nodes when the NodePool template declares nvidia.com/gpu: "0" but the pod requests nvidia.com/gpu: 1. The g6f.4xlarge test would show the performance of a 1/2 L4 partition, which we expect to be ~30-40% slower than full L4 (narrower gap than the 1/4 partition).
The baked-in locustfile uses envoy (localhost:9001) with model routing via x-slack-moonlet-service metadata. Suffixed deployments (--suffix lt-g6) register under different envoy cluster names, so the standard locustfile cannot route to them. A custom locustfile connecting directly to k8s service DNS was required:
mlctl load-test start \
--locustfile /tmp/locustfile-tinybert.py \
--host moonlet-msmarco-tinybert-crossencoder-v1-lt-g6.default.svc.cluster.local:8000 \
--name tinybert-g6 \
--image $LOCUST_ML_IMAGE \
--users 100 --spawn-rate 5 --run-time 10m

Bug found: The custom locustfile command is missing the `locust` prefix (line 364 of loadtest.py), causing exec format error when using non-default images like ml-base-dev. Tracked for fix in a follow-up PR.
The g6 vs g6f comparison should be a single command. Today it requires ~15 manual steps across 3 tools (mlctl, helm, kubectl) with multiple failure modes. Here's what needs to change:
Add a new mlctl benchmark subcommand that automates the entire A/B flow:
mlctl benchmark \
--deployment msmarco_tinybert_crossencoder_v1 \
--variants "g6=default" "g6f=--instance-type g6f.2xlarge" \
--users 100 --spawn-rate 10 --run-time 10m

This should:
- Deploy suffixed copies for each variant (e.g. `lt-g6`, `lt-g6f`)
- Wait for all pods to be Ready
- Run load tests sequentially or in parallel
- Collect reports into a single comparison table
- Clean up all deployments and load tests on completion (or Ctrl-C)
- Output a markdown table suitable for pasting into a PR
These flags exist on the PR branch but not on master. Without them, deploying to specific instance types requires raw helm --set commands with knowledge of chart internals (values file paths, image URIs, set-key syntax). This is the #1 source of friction.
PR: data-airflow#74317 — needs merge.
Tests started with --run-time 10m ran for 20+ minutes. The --class-picker flag combined with autostart: true may cause locust to ignore --run-time. The LocustTest operator's autoquit should respect --run-time, but it didn't.
Fixed in data-airflow#74543: Added --headless to extraArgs when autostart=True, which ensures Locust enforces --run-time.
mlctl load-test stop tries to set spec.worker.replicas: 0, which fails CRD validation (minimum: 1). There's no way to gracefully stop a running test short of deleting it.
Fixed in data-airflow#74543: Now uses kubectl exec curl -X POST /stop via the Locust master pod's REST API.
The current report only shows avg latency. For A/B comparison, percentile latencies (p50, p95, p99) are essential. The Locust /stats/requests endpoint has this data.
Fixed in data-airflow#74543: Report now shows p50/p95/p99 columns. Also added mlctl load-test compare subcommand for side-by-side comparison:
mlctl load-test compare test-g6 test-g6f --labels g6 g6f

Today there's no validation that the target instance type has available capacity or that Karpenter can provision it. This leads to silent Pending pods.
Fix needed: Add a --dry-run or pre-flight check to mlctl deploy moonlet that:
- Checks if matching nodes exist with available GPU resources
- Checks if Karpenter NodePool allows the target instance family
- Warns if the NodePool has conflicting labels (e.g. `nvidia.com/gpu: "0"`)
The NodeOverlay was previously applied manually via kubectl apply. To formalize this:
- `G6F_ENABLED` in `provisioner_instance_families` already adds `g6f` to NodePool allowed families (via `_helpers.tpl` in bedrock-argocd)
- But g6f doesn't work without a NodeOverlay (EC2 API reports 0 GPUs)
- bedrock-argocd#440 couples the NodeOverlay to `G6F_ENABLED` — when g6f is enabled for a cluster, the overlay is automatically created
- This prevents the footgun of enabling g6f without the overlay
- bedrock-argocd#440 merged — NodeOverlay auto-created when G6F_ENABLED
- bedrock-argocd#446 merged — Remove `nvidia.com/gpu: "0"` from NodePool templates
- data-airflow#74389 merged — Locust load test fixes
- Load test tooling validated — 216K requests, 0 failures, 702 avg RPS
- data-airflow#74317 merged — instanceType affinity + minGpuMemoryMiB + g6f anti-affinity (merged 2026-04-17)
- Capture baseline cost metrics — run PromQL queries below before any migration
- Dev Wave 1: Migrate 2 smallest models (`msmarco_tinybert_crossencoder_v1`, `msmarco_minilml6_embedder_v0`) via `minGpuMemoryMiB: 500`
- Dev Wave 2: Migrate shmVolume models (`gte_multilingual_base_v1`, `gte_multilingual_base_file_v1`)
- GRID driver enabled in prod — shipyard-chef-repo PR needed
- Enable g6f in prod-iad-2 (canary) — bedrock-argocd PR
- Prod Wave 3: Tier 3 models (2 lowest-risk)
- Prod Wave 4: Tier 2 models (one at a time, 24h bake)
- Prod Wave 5: Tier 1 models (after 2+ weeks stability)
- Update Grafana dashboard — add g6f cost comparison panels
All currently on g6.xlarge ($0.80/hr on-demand). Migrating to g6f.large ($0.20/hr) = 75% savings.
| Model | Prod minReplicas | Tier | Current $/hr | g6f $/hr | Savings $/hr |
|---|---|---|---|---|---|
| msmarco_tinybert_crossencoder_v1 | 10 | 2 | $8.00 | $2.00 | $6.00 |
| mmarco_roberta_crossencoder_v2 | 10 | 2 | $8.00 | $2.00 | $6.00 |
| gte_multilingual_base_v1 | 5 | 1 | $4.00 | $1.00 | $3.00 |
| gte_multilingual_base_file_v1 | 5 | 1 | $4.00 | $1.00 | $3.00 |
| finetuned_gte_multilingual_v0 | 5 | 2 | $4.00 | $1.00 | $3.00 |
| bert_action_items_classifier_v0 | 2 | 2 | $1.60 | $0.40 | $1.20 |
| msmarco_minilml6_embedder_v0 | 2 | 3 | $1.60 | $0.40 | $1.20 |
| gte_multilingual_base_query_v1 | 2 | 2 | $1.60 | $0.40 | $1.20 |
| bert_semantic_priority_proxy_v1 | 2 | 2 | $1.60 | $0.40 | $1.20 |
| bert_display_explain_cta_v0 | 2 | 3 | $1.60 | $0.40 | $1.20 |
| TOTAL | 45 | — | $36.00 | $9.00 | $27.00 |
Maximum annual savings ceiling: $27/hr x 8,760 hrs = ~$236K/yr (minReplicas only — actual savings higher with HPA scaling)
Note: gte models use shmVolume: true and request 3-4 CPUs — may need g6f.xlarge ($0.40/hr) instead of g6f.large, reducing those to 50% savings. Realistic estimate: **$180K-$200K/yr**.
flowchart LR
subgraph "Dev Validation"
D1["Wave 1: 2 smallest<br/>tinybert + minilml6<br/>no shmVolume"]
D1 -->|1 week soak| D2["Wave 2: shmVolume<br/>gte_base_v1<br/>gte_base_file_v1"]
end
subgraph "Prod Rollout"
D2 -->|stable| GRID["Enable GRID<br/>driver in prod"]
GRID --> P1["Enable g6f NodePool<br/>prod-iad-2 canary"]
P1 --> P3["Wave 3: Tier 3<br/>minilml6, bert_display<br/>2 replicas each"]
P3 -->|1 week| P4["Wave 4: Tier 2<br/>tinybert, roberta, etc<br/>one at a time, 24h bake"]
P4 -->|2 weeks| P5["Wave 5: Tier 1<br/>gte models<br/>highest criticality"]
end
style D1 fill:#bfb,stroke:#333
style P3 fill:#bbf,stroke:#333
style P5 fill:#fbf,stroke:#333
| Blocker | Status | PR/Action |
|---|---|---|
| NodeOverlay GitOps | DONE | bedrock-argocd#440 — merged |
| `nvidia.com/gpu: "0"` NodePool label fix | DONE | bedrock-argocd#446 — merged |
| Load test tooling | DONE | data-airflow#74389 — merged |
| instanceType affinity + g6f anti-affinity | Merged | data-airflow#74317 — merged 2026-04-17 |
| Dev model migration | Open (CI green, needs review + possible rebase) | data-airflow#74336 |
| g6f in prod NodePool | Open (blocked) | bedrock-argocd#443 — blocked on GRID driver |
| GRID driver in prod | BLOCKER | Need shipyard-chef-repo PR to add prod to upgrade_envs |
# 1. GPU node count by instance type
count by (label_node_kubernetes_io_instance_type) (
kube_node_labels{label_type="karpenter", label_node_kubernetes_io_instance_type=~"g6.*"}
)
# 2. Per-model pod → instance type mapping
count by (label_node_kubernetes_io_instance_type, label_slack_com_moonlet_model) (
kube_pod_info{namespace="default", pod=~"moonlet-.*"}
* on(node) group_left(label_node_kubernetes_io_instance_type)
kube_node_labels{}
)
# 3. GPU memory usage per model (DCGM)
avg by (exported_pod) (DCGM_FI_DEV_FB_USED{namespace="default", exported_pod=~"moonlet-.*"})
# 4. GPU memory headroom per model
avg by (exported_pod) (DCGM_FI_DEV_FB_FREE{namespace="default", exported_pod=~"moonlet-.*"})
# 5. Latency baseline (p99) per model
histogram_quantile(0.99, sum by (le, model_name) (
rate(grpc_server_handling_seconds_bucket{grpc_method="Predict"}[5m])
))
# 6. Estimated hourly cost (g6 vs g6f node counts)
(count(kube_node_labels{label_karpenter_k8s_aws_instance_family="g6"}) * 0.80)
+
(count(kube_node_labels{label_karpenter_k8s_aws_instance_family="g6f"}) * 0.20)
# 7. Pod distribution: g6 vs g6f per model
count by (label_karpenter_k8s_aws_instance_family, label_slack_com_moonlet_model) (
kube_pod_info{pod=~"moonlet-.*"}
* on(node) group_left(label_karpenter_k8s_aws_instance_family)
kube_node_labels{}
)
New row — "g6f Cost Comparison":
- Node Count by Instance Family — stat panel (query #1)
- Estimated Hourly GPU Cost — stat panel (query #6)
- Pod Distribution (g6 vs g6f) — table (query #7)
- GPU VRAM Headroom by Instance Family — timeseries (query #4)
- Latency: g6 vs g6f — timeseries (query #5 split by family)
Sarah suggested using karpenter.k8s.aws/instance-gpu-memory with Gt operator to replace the entire minGpuMemoryMiB if/else ladder. This would be ideal but currently doesn't work for g6f — Karpenter populates this label from the EC2 DescribeInstanceTypes API, and g6f reports GPU count as 0 (uses LogicalGpuCount). NodeOverlay can only override capacity, not labels. Once Karpenter adds label override support to NodeOverlay (upstream RFC in progress), this simplification becomes possible.
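For context, a sketch of what that simplification could look like once the label is populated for g6f (illustrative only — this does not work for g6f today):

```yaml
# Hypothetical requirement once karpenter.k8s.aws/instance-gpu-memory is populated for g6f.
# Gt compares against a single integer value (GPU memory in MiB).
- key: karpenter.k8s.aws/instance-gpu-memory
  operator: Gt
  values: ["2999"]   # only instance types with more than 2,999 MiB of GPU memory match
```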
Problem: Kubernetes GPU scheduling uses nvidia.com/gpu (integer count) — there is no VRAM-aware scheduling. Both g6.xlarge (22.8 GiB VRAM) and g6f.large (2.8 GiB VRAM) report nvidia.com/gpu: 1. If g6f is in the same NodePool as g6, Karpenter may choose the cheapest option (g6f.large at $0.20/hr) for any GPU workload, even if the model needs more than 2.8 GiB. This causes silent GPU OOM at runtime — the pod starts, loads the model, and crashes.
flowchart TD
POD["Pod requests<br/>nvidia.com/gpu: 1<br/>gpuName: l4"] --> KS["Karpenter Scheduler"]
KS --> PICK{"Cheapest node<br/>with 1 GPU?"}
PICK -->|g6f.large $0.20/hr| G6F["g6f.large<br/>2,861 MiB VRAM"]
PICK -->|g6.xlarge $0.80/hr| G6["g6.xlarge<br/>22,888 MiB VRAM"]
G6F --> CHECK{"Model fits<br/>in 2.8 GiB?"}
CHECK -->|Yes| OK["OK"]
CHECK -->|No| OOM["GPU OOM<br/>silent failure"]
style OOM fill:#f44,stroke:#333,color:#fff
style G6F fill:#fbb,stroke:#333
style G6 fill:#bfb,stroke:#333
Decision: g6f instances must be opt-in only. A workload should never land on g6f unless it explicitly opts in.
Implemented solution (data-airflow#74317 + stacked PR):
The moonlet chart deployment template now enforces g6f opt-in via anti-affinity. Every GPU pod that does NOT explicitly target g6f gets a NotIn expression excluding all 4 g6f instance types (g6f.large, g6f.xlarge, g6f.2xlarge, g6f.4xlarge). This is safe even when g6f is enabled in the Karpenter NodePool.
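As an illustration, the rendered node affinity for a pod that has not opted in would look roughly like this (sketch — the actual chart may key off a different node label, e.g. `karpenter.k8s.aws/instance-family`):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: NotIn
              values: ["g6f.large", "g6f.xlarge", "g6f.2xlarge", "g6f.4xlarge"]
```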
There are three ways to opt in to g6f:
1. `instanceType` (most explicit): Pin to a specific g6f size, e.g. `instanceType: "g6f.large"`. Use when you know the exact VRAM tier you need.
2. `instanceFamily` (family-level): Set `instanceFamily: "g6f"` to allow any g6f size. Only safe if you're OK with Karpenter choosing any VRAM tier.
3. `minGpuMemoryMiB` (VRAM-aware, recommended): Specify the model's GPU memory requirement in MiB. The template automatically excludes g6f tiers with insufficient VRAM and allows the rest — including full g6 nodes. This is the safest and most ergonomic option.
All three are mutually exclusive with each other (enforced by {{- fail }} in the Helm template).
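A minimal sketch of that guard (not the actual chart code; field paths are assumptions):

```yaml
{{- /* Hedged sketch of the mutual-exclusivity check; the real chart covers all three fields pairwise */}}
{{- if and .Values.instanceType .Values.instanceFamily }}
  {{- fail "instanceType and instanceFamily are mutually exclusive — set at most one" }}
{{- end }}
```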
Since Kubernetes has no native VRAM-aware scheduling, the moonlet chart implements a client-side workaround via the minGpuMemoryMiB field. The template maps the memory requirement to g6f exclusion rules at render time:
| `minGpuMemoryMiB` | g6f.large/xlarge (2,861 MiB) | g6f.2xlarge (5,722 MiB) | g6f.4xlarge (11,444 MiB) | Full g6 (22,888 MiB) |
|---|---|---|---|---|
| 500 | Allowed | Allowed | Allowed | Allowed |
| 3000 | Excluded | Allowed | Allowed | Allowed |
| 6000 | Excluded | Excluded | Allowed | Allowed |
| 12000 | Excluded | Excluded | Excluded | Allowed |
| 0 (default) | Excluded | Excluded | Excluded | Allowed |
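For example, a pod with `minGpuMemoryMiB: 3000` would render an exclusion along these lines (illustrative sketch; the real chart output may differ):

```yaml
- key: node.kubernetes.io/instance-type
  operator: NotIn
  values: ["g6f.large", "g6f.xlarge"]   # 2,861 MiB tiers fall below the 3,000 MiB requirement; larger g6f tiers and full g6 remain allowed
```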
Usage in values:
moonletDeployments:
  msmarco_tinybert_crossencoder_v1:
    resources:
      minGpuMemoryMiB: 500  # observed 358 MiB — all g6f tiers fit

Usage via mlctl CLI:
mlctl deploy moonlet \
--deployment msmarco_tinybert_crossencoder_v1 \
--cluster ml-dev-iad \
--min-gpu-memory 500

All g6f sizes report `nvidia.com/gpu: "1"` via NodeOverlay. Karpenter cannot distinguish memory tiers between g6f.large (2.8 GiB) and g6f.4xlarge (11.4 GiB). If a model needs more GPU memory than the provisioned g6f size provides, it will OOM at runtime — there is no scheduling-time protection.
Mitigation (implemented):
- `minGpuMemoryMiB` auto-excludes g6f tiers that are too small (see table above)
- Use `instanceType` (e.g., `g6f.2xlarge`) for models that need a specific VRAM tier
- Default behavior (`minGpuMemoryMiB: 0`) excludes ALL g6f — safe by default
- Current profiling shows all models fit in g6f.large (2.8 GiB) with 40%+ headroom
NodeOverlay (v1alpha1) can only override capacity (resource counts) and price — it cannot set labels. So we cannot add a custom label like karpenter.k8s.aws/instance-gpu-fractional=true to g6f nodes. The Karpenter team has an upstream RFC proposing instance-gpu-fractional labels, but this is not yet available.
Until then, karpenter.k8s.aws/instance-family is the only reliable way to distinguish g6 from g6f.
With WhenEmptyOrUnderutilized consolidation (default for ML clusters), Karpenter may:
- Consolidate workloads from g6 to g6f if g6f is cheaper and meets resource requirements
- This is generally desirable for cost savings when g6f is in the NodePool
- But could cause GPU OOM if consolidation moves a large-VRAM model to g6f
Mitigation: If g6f is opt-in only (option 2 above), consolidation cannot move workloads to g6f unless the pod already has the right affinity. This is another reason to keep g6f out of the shared GPU NodePool.
The NVIDIA GPU Operator's device plugin exposes nvidia.com/gpu based on actual hardware detection, not Karpenter labels. On g6f nodes with GRID driver, the device plugin correctly detects 1 GPU partition and exposes it. The NodeOverlay's nvidia.com/gpu: "1" must match what the device plugin reports — which it does.
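On a healthy g6f node the two agree; a trimmed `kubectl get node -o yaml` excerpt would look roughly like this (illustrative):

```yaml
status:
  capacity:
    nvidia.com/gpu: "1"     # advertised by the NVIDIA device plugin on the node
  allocatable:
    nvidia.com/gpu: "1"     # matches the count the NodeOverlay declares for Karpenter's scheduling simulation
```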
Two dashboards exist in the ML Services folder:
- Moonlet Service (`moonlet-service`) — fleet health, gRPC latency/success, alert tiers, KEDA scaling, CUDA OOM logging (will fire on g6f OOMs)
- Moonlet GPU (`moonlet-gpu-uid`) — DCGM metrics: GPU util, VRAM used/free/headroom per deployment, utilization percentiles
The GPU dashboard joins DCGM_FI_DEV_* metrics to pods via kube_pod_labels on Prometheus_Kubernetes. No queries currently filter by instance type or family — all scoped to cluster only.
New variable: instance_family via label_values(kube_node_labels{cluster="ml-dev-iad"}, label_karpenter_k8s_aws_instance_family)
New row — "Instance Family Comparison":
| Panel | Query Pattern |
|---|---|
| GPU Nodes by Family | count by (label_node_kubernetes_io_instance_type) (kube_node_labels{...}) |
| VRAM Used by Family | DCGM_FI_DEV_FB_USED joined via kube_pod_info → kube_node_labels |
| VRAM Headroom by Family | DCGM_FI_DEV_FB_FREE — critical for g6f OOM risk detection |
| Latency by Family | grpc_server_handling_seconds joined to node labels |
Key join (pod → node → instance family):
kube_pod_info{cluster="ml-dev-iad"} * on(cluster, node) group_left(label_karpenter_k8s_aws_instance_family) kube_node_labels{cluster="ml-dev-iad"}
| Risk | Severity | Mitigation |
|---|---|---|
| g6f requires NVIDIA GRID driver — datacenter drivers do not work | RESOLVED | PR #888 merged — GRID 570.172.08 installed in dev |
| g6f missing from ip_per_eni — causes pod limit of ~11 | RESOLVED | PR #969 merged 2026-04-13 |
| NodeOverlay applied manually — not managed by GitOps | RESOLVED | bedrock-argocd#440 — auto-creates NodeOverlay when G6F_ENABLED |
| No way to distinguish g6 from g6f via gpuName | RESOLVED | data-airflow#74317 — merged 2026-04-17: instanceFamily affinity + g6f anti-affinity + minGpuMemoryMiB |
| GRID driver AMI cannot be shared with existing g6 nodes | Low | Same AMI works — upgrade workflow installs GRID at provision time in dev |
| NodeOverlay is alpha - API may change | Medium | Pin Karpenter version; monitor upstream RFC progress |
| GPU memory OOM on fractional partitions | MITIGATED | minGpuMemoryMiB auto-excludes undersized g6f tiers; default (0) blocks all g6f; anti-affinity enforced in chart |
| g6f.large pod capacity may be tight (18 pods) | Medium | Use g6f.xlarge+ (23 pods) for production workloads |
| Mixed g6/g6f scheduling with gpuName: "l4" | MITIGATED | Chart auto-adds g6f NotIn anti-affinity for pods not explicitly targeting g6f |
| GRID driver not yet enabled in prod | BLOCKER | Need shipyard-chef-repo PR to add prod to upgrade_envs |
| Karpenter consolidation churn | Low | Monitor consolidation metrics post-rollout |
| PR | Status | Purpose |
|---|---|---|
| chef-repo#154249 | Merged | Create GPU longshoreman ASG |
| chef-repo#154263 | Merged | Add kubernetes tags to GPU ASG |
| chef-repo#154310 | Merged | Fix instance_type + nebula-ca type tag |
| shipyard-chef-repo#888 | Merged | GRID driver support in nvidia-gpu cookbook |
| shipyard-chef-repo#959 | Open | Deduplicate lspci NVLink detection |
| shipyard-chef-repo#969 | Merged | Add g6f to ip_per_eni, fix g6e values, fix shipyard_lookup CI |
| bedrock-argocd#440 | Merged | Auto-create g6f NodeOverlay when G6F_ENABLED |
| bedrock-argocd#446 | Merged | Remove nvidia.com/gpu: "0" label from NodePool templates |
| bedrock-argocd#443 | Open (blocked on GRID driver) | Enable g6f NodePool in prod-iad-2 |
| data-airflow#74317 | Merged (2026-04-17) | instanceType affinity + minGpuMemoryMiB + g6f anti-affinity + helm tests |
| data-airflow#74389 | Merged | Locust load test fixes (--class-picker, cumulative tick, shapes cleanup) |
| data-airflow#74336 | Open (CI green, needs review — assigned to shenkens; may need rebase post-#74317 merge) | Dev migration: 2 smallest GPU models via minGpuMemoryMiB |
| data-airflow#74409 | Open (CI green, no reviews yet; may need rebase post-#74317 merge) | Dev migration values + load test comparison commands |
| data-airflow#74489 | Merged (2026-04-17) | Fix load test CLI: --users/--spawn-rate/--run-time not overridden by LoadTestShapes |
- AWS NVIDIA driver compatibility table (g6f = GRID only)
- AWS NVIDIA GRID driver installation guide
- GRID driver S3 bucket: `s3://ec2-linux-nvidia-drivers/`
- Karpenter g6f support issue (aws/karpenter-provider-aws#8368)
- Karpenter NodeOverlay docs
- Karpenter Feature Gates
Click to expand detailed March testing notes
NVRM: The NVIDIA GPU 0000:31:00.0 (PCI ID: 10de:27b8)
NVRM: NVIDIA 560.35.05 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
nvidia: probe of 0000:31:00.0 failed with error -1
NVRM: The NVIDIA probe routine failed for 1 device(s).
NVRM: None of the NVIDIA devices were initialized.
nvidia-smi cannot communicate with the driver on the g6f node.
NFD sees the GPU hardware. The node labels show nvidia.com/gpu.present=true.
NVRM: The NVIDIA GPU 0000:31:00.0 (PCI ID: 10de:27b8)
NVRM: NVIDIA 560.35.05 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
nvidia: probe of 0000:31:00.0 failed with error -1
g6f fractional L4 GPU presents PCI device ID 10de:27b8 — not in the datacenter driver's supported device list.
AWS documentation confirms: g6f instances do NOT support Tesla/datacenter drivers — only GRID drivers.
Status: DONE — 570.x datacenter driver also fails. Datacenter driver ruled out.
Same failure with driver 570.172.08:
NVRM: The NVIDIA GPU 0000:31:00.0 (PCI ID: 10de:27b8)
NVRM: NVIDIA 570.172.08 driver release.
NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
nvidia: probe of 0000:31:00.0 failed with error -1