Demonstrates that the Apple Neural Engine (ANE) achieves roughly 1.9x higher throughput with INT8 W8A8 quantization than with FP16, consistent with a native INT8 datapath.
| Method | FP16 | INT8 W8A8 | Ratio |
|---|---|---|---|
| anemll-profile (CoreML) | 17.99 TFLOPS, 15.3 ms | 33.86 TOPS, 8.2 ms | 1.88x |
| Private API (wall-clock) | 18.22 TFLOPS, 15.1 ms | 34.22 TOPS, 8.0 ms | 1.88x |
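As a quick arithmetic check (a sketch using only figures from the profiler logs below), throughput is total work divided by latency, and the ratio column follows from the two throughputs:

```python
# Sanity-check the table: GFLOP / ms == TFLOP/s (or GOP / ms == TOP/s),
# using the work and latency figures from the CostModel reports below.
def tera_ops_per_s(gflop: float, ms: float) -> float:
    return gflop / ms

fp16 = tera_ops_per_s(274.816, 15.275)  # conv work / measured latency, FP16 report
print(round(fp16, 2))                   # 17.99 -> matches "17.99 TFLOPS"
print(round(33.8624 / 17.9910, 2))      # 1.88  -> the INT8 W8A8 : FP16 ratio
```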
```
$ OS_ACTIVITY_DT_MODE=YES anemll-profile /tmp/int8_compute_fp16_512.mlpackage
═══════════════════════════════════════════════════════════════
ANE CostModel Report: int8_compute_fp16_512.mlpackage
═══════════════════════════════════════════════════════════════
Model size: 64.1 MB
Format: ML Program
Compute: CPU+ANE
Total ops: 896
ANE ops: 128 (100.0% of cost)
Op Type                          Count      ms/op   Total ms    GFLOP     GB/s  Share  Bound
──────────────────────────────── ───── ────────── ────────── ──────── ──────── ────── ──────
conv                               128   1.800458    230.459 274.8160     4.72 100.0%   Comp
Measured: 15.275 ms/prediction (65.5 iter/s, 10 runs)
Compute: 17990.99 GFLOP/s (17.9910 TFLOPS)
Speedup: 15.1x vs sequential estimate
```
```
$ OS_ACTIVITY_DT_MODE=YES anemll-profile /tmp/int8_compute_w8a8_512.mlpackage
═══════════════════════════════════════════════════════════════
ANE CostModel Report: int8_compute_w8a8_512.mlpackage
═══════════════════════════════════════════════════════════════
Model size: 32.2 MB
Format: ML Program
Compute: CPU+ANE
Total ops: 1531
ANE ops: 382 (100.0% of cost)
Op Type                          Count      ms/op   Total ms    GFLOP     GB/s  Share  Bound
──────────────────────────────── ───── ────────── ────────── ──────── ──────── ────── ──────
conv                               128   1.800458    230.459 274.8160     4.72  89.2%   Comp
quantize                           127   0.103401     13.132   0.5326    58.03   5.1%    Mem
dequantize                         127   0.103401     13.132   0.5326    58.03   5.1%    Mem
ANE Op Types: ios18.conv (128), ios18.quantize (127), ios18.dequantize (127)
Measured: 8.151 ms/prediction (122.7 iter/s, 10 runs)
Compute: 33862.37 GFLOP/s (33.8624 TOPS)
Speedup: 31.7x vs sequential estimate
```
```
$ ./inmem_peak 1 64
=== Programmatic MIL → In-Memory ANE Peak (FP16, batch=1, sp=64x64) ===
ANE hw: type=h17, numANEs=1, numCores=2, QoS=0 (RealTime)
IP clock=2448 MHz, MAC clock=306.0 MHz
20.05 TFLOPS/cluster × 1 = 20.05 TFLOPS system
Config                  W(MB)    GFLOP    ms/eval     TFLOPS  %peak  Est.MHz
-------------------------------------------------------------------------------
128x conv 512ch 64x64    64.0   274.88  15.083 ms      18.22  90.9%      278
96x conv 512ch 64x64     48.0   206.16  11.401 ms      18.08  90.2%      276
64x conv 512ch 64x64     32.0   137.44   7.749 ms      17.74  88.4%      271
128x conv 384ch 64x64    36.0   154.62   8.727 ms      17.72  88.3%      270
256x conv 256ch 64x64    32.0   137.44   7.837 ms      17.54  87.4%      268
128x conv 256ch 64x64    16.0    68.72   4.012 ms      17.13  85.4%      261
```
```
$ ./inmem_peak_int8
=== In-Memory ANE Peak INT8 W8A8 (batch=1, sp=64x64) ===
ANE hw: type=h17, numANEs=1, numCores=2, QoS=0 (RealTime)
IP clock=2448 MHz, MAC clock=306.0 MHz
FP16 peak: 20.05 TFLOPS/cluster
INT8 peak (if native): 40.11 TOPS/cluster
Data: fp16(in) → [conv(W8) → quant → dequant] × N → conv(W8) → fp16(out)
Config                  W(MB)      GOP    ms/eval       TOPS   %fp16  %int8  Est.MHz
------------------------------------------------------------------------------------
128x conv 512ch 64x64    32.0   274.88   8.032 ms      34.22  170.6%  85.3%      522
96x conv 512ch 64x64     24.0   206.16   6.012 ms      34.29  171.0%  85.5%      523
64x conv 512ch 64x64     16.0   137.44   4.111 ms      33.44  166.7%  83.4%      510
128x conv 384ch 64x64    18.0   154.62   4.663 ms      33.16  165.3%  82.7%      506
128x conv 256ch 64x64     8.0    68.72   2.235 ms      30.74  153.3%  76.7%      469
```
Note: the `%fp16`, `%int8`, and `Est.MHz` columns assume 16 MAC arrays × 2048 MACs per cluster at the estimated clock. These are derived estimates, not hardware-measured counters.
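That derivation can be reproduced in a few lines. A sketch, assuming (per the note) 16 MAC arrays × 2048 MACs per cluster and 2 ops per MAC per cycle:

```python
MACS = 16 * 2048   # MAC units per cluster (assumption from the note above)
OPS = 2            # one multiply + one accumulate per MAC per cycle

def peak_tflops(mac_mhz: float) -> float:
    # fp16 peak if every MAC retires one fp16 MAC per cycle
    return MACS * OPS * mac_mhz * 1e6 / 1e12

def est_mhz(achieved_tops: float) -> float:
    # implied MAC clock if the achieved rate came from fully busy MAC arrays
    return achieved_tops * 1e12 / (MACS * OPS) / 1e6

print(round(peak_tflops(306.0), 2))      # 20.05 -> "20.05 TFLOPS/cluster"
print(round(2 * peak_tflops(306.0), 2))  # 40.11 -> INT8 peak (if native, 2x rate)
print(round(est_mhz(18.22)))             # 278   -> Est.MHz for the 18.22 TFLOPS row
print(round(est_mhz(34.22)))             # 522   -> Est.MHz for the 34.22 TOPS row
```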
The speedup comes from halving L2 SRAM bandwidth for activations between tiles:
```
FP16 path: L2 --[2B/elem]--> MAC --[2B/elem]--> L2 (bottleneck)
INT8 path: L2 --[1B/elem]--> MAC --[1B/elem]--> L2 (2x bandwidth)
```
The `quantize`/`dequantize` ops between layers tell the ANE compiler to store activations as int8 in L2. Without them, activations stay fp16, doubling L2 traffic and lowering throughput.
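To put numbers on that, a sketch of the per-layer activation traffic for the benchmark's 512-channel, 64×64 configuration (tensor shape inferred from the configs above):

```python
C, H, W = 512, 64, 64        # activation tensor handed between two convs
elems = C * H * W

fp16_mib = elems * 2 / 2**20  # 2 bytes per element
int8_mib = elems * 1 / 2**20  # 1 byte per element
print(fp16_mib, int8_mib)     # 4.0 2.0 -> MiB per inter-layer transfer
print(fp16_mib / int8_mib)    # 2.0 -> the 2x L2 bandwidth saving
```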
`gen_int8_bench.py` creates FP16 and W8A8 `.mlpackage` files for benchmarking with `anemll-profile`. Both variants use the same `opset_version=ct.target.iOS18` and fp16 I/O for a fair comparison.
```sh
pip install coremltools numpy
python gen_int8_bench.py
# Creates: /tmp/int8_compute_fp16_512.mlpackage
#          /tmp/int8_compute_w8a8_512.mlpackage

# Then profile:
brew tap anemll/tap && brew install anemll-profile
OS_ACTIVITY_DT_MODE=YES anemll-profile /tmp/int8_compute_fp16_512.mlpackage
OS_ACTIVITY_DT_MODE=YES anemll-profile /tmp/int8_compute_w8a8_512.mlpackage
```

`inmem_peak_int8.m` generates MIL programmatically, then compiles and runs it directly on the ANE via `_ANEInMemoryModel`.
```sh
xcrun clang -O2 -Wall -fobjc-arc -framework Foundation -framework IOSurface -ldl \
    -o inmem_peak_int8 inmem_peak_int8.m
./inmem_peak_int8              # default: batch=1, sp=64x64
./inmem_peak_int8 1 64 --relu  # with relu (fused, free)
```

Data path:

```
fp16(in) → conv(int8_weights) → quantize(fp16→int8) → int8 in L2 → dequantize(int8→fp16) → next conv
```
- Weights: `constexpr_affine_dequantize` — int8 stored, dequantized at compile time
- Activations: `quantize`/`dequantize` between layers — stored as int8 in L2 SRAM
- Conv: MAC arrays operate on fp16 activations × int8 weights (or native int8×int8)
- ReLU: fused by the ANE post-processor, zero cost (use the `--relu` flag)
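For intuition, the activation `quantize`/`dequantize` pair is a plain affine int8 round trip. A minimal NumPy sketch (zero_point of 0 and a scale of 0.0625 are illustrative choices — 0.0625 is exactly representable in fp16, so this round trip is lossless; the real models pick per-tensor scales):

```python
import numpy as np

def quantize(x, scale):
    # fp16 -> int8: round to nearest, saturate to the int8 range, zero_point = 0
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    # int8 -> fp16
    return q.astype(np.float16) * np.float16(scale)

x = np.float16([0.5, -1.25, 3.0])
q = quantize(x, 0.0625)
print(q.tolist())                      # [8, -20, 48]
print(dequantize(q, 0.0625).tolist())  # [0.5, -1.25, 3.0] (exact for this scale)
```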
- macOS 15+ (Sequoia) with Apple Silicon (M1+)
- Xcode command line tools
- Python 3.10–3.12 with
coremltools >= 9.0andnumpy(for model generation)- Note:
coremltools 9.0does not yet support Python 3.14; use 3.12 if on Homebrew Python
- Note:
anemll-profilevia Homebrew (for CoreML profiling)
```
Mac NEO
Computer : MacBook Neo
CPU      : Apple A18 Pro
CoreML   : FP16 15.86 TOPS   INT8 30.53 TOPS
```