This note summarizes the "hotpath" discussed in netty/netty#15741, based on the PR discussion plus Frigg inspection of the merged code in this checkout.
Merged commit in this tree: accd981104dfe23dbe6208a16d197b7b3f5b8c94
The PR discussion points to the adaptive allocator's thread-local direct-allocation fast path, not a general allocator slow path.
Why that interpretation is the right one:
- The PR description says it reduces costly atomic operations in the "thread local allocation's fast path."
- The added benchmark focuses on direct allocation throughput.
- The review thread discusses keeping owner-thread segments "hot" in the local free list instead of mixing them with the external queue.
Relevant discussion links:
- PR overview: netty/netty#15741
- Review suggestion about batching from external to local list: netty/netty#15741 (comment)
- Author reply preferring local hot segments: netty/netty#15741 (comment)
- Follow-up note on chunk queue sizing and remaining ref-count work: netty/netty#15741 (comment)
Using Frigg on the current tree, the minimal direct-allocation path is:
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L214: directAllocation(...) calls state.performDirectAllocation().
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L133: performDirectAllocation() picks a size, releases the previous buffer, then calls allocateDirect(allocator, size).
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L129: allocateDirect(...) jumps into ByteBufAllocator.directBuffer(size).
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L393: the allocator takes the thread-local magazine path: tlMag.newBuffer() and then tlMag.tryAllocate(...).
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L844: Magazine.tryAllocate(...) immediately takes the no-lock branch when allocationLock == null, which is the thread-local case.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L881: allocation succeeds when the current chunk can readInitInto(...).
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1306: SizeClassedChunk.readInitInto(...) calls nextAvailableSegmentOffset() to pick the backing segment.
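The shape of that no-lock branch can be sketched as follows. This is an illustrative stand-in, not Netty's actual implementation: the real Magazine is far richer, and MagazineSketch, currentChunkRemaining, and allocateFromCurrentChunk are invented names. The point it shows is that a thread-local magazine is constructed with allocationLock == null, so its fast path is plain field access with no lock or CAS.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of a magazine whose thread-local instance skips locking entirely.
final class MagazineSketch {
    private final ReentrantLock allocationLock; // null => owned by exactly one thread
    private int currentChunkRemaining;

    MagazineSketch(boolean threadLocal, int chunkBytes) {
        this.allocationLock = threadLocal ? null : new ReentrantLock();
        this.currentChunkRemaining = chunkBytes;
    }

    boolean tryAllocate(int size) {
        if (allocationLock == null) {
            // Thread-local case: no atomic operations on the fast path.
            return allocateFromCurrentChunk(size);
        }
        if (!allocationLock.tryLock()) {
            return false; // contended shared magazine: caller falls back elsewhere
        }
        try {
            return allocateFromCurrentChunk(size);
        } finally {
            allocationLock.unlock();
        }
    }

    private boolean allocateFromCurrentChunk(int size) {
        if (currentChunkRemaining < size) {
            return false; // in the real allocator the caller would rotate chunks here
        }
        currentChunkRemaining -= size;
        return true;
    }
}
```

Encoding "thread-local" as a null lock keeps the fast path to a single reference comparison, which is the kind of atomic-free branch the PR description emphasizes.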
The most important hot instruction path is the owner-thread segment selection and release path inside SizeClassedChunk.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1041: Magazine.newBuffer() now gets the wrapper object directly from a recycler on the thread-local path, removing queue/handle indirection from wrapper reuse.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L715: createLocalFreeList() builds a thread-local stack of segment offsets for size-class chunks.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1225: IntStack is the simple local LIFO used to keep owner-thread segment reuse cheap.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1323: nextAvailableSegmentOffset() prefers localFreeList.pop() and only falls back to externalFreeList.poll() when the local stack is empty.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1372: releaseSegmentOffsetIntoFreeList(...) mirrors that split: owner-thread frees go to the local LIFO stack, off-thread frees go to the MPSC external queue.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L125: the shared chunk reuse queue is still capacity-bound to availableProcessors() * 2, which the PR thread later calls out as not fully solved.
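The local-LIFO / external-queue split can be sketched as below. Everything here is a simplified stand-in: IntStack is a bare-bones version of Netty's internal class, SegmentFreeLists and the -1 empty sentinel are invented for illustration, and ConcurrentLinkedQueue substitutes for the MPSC queue Netty actually uses.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal growable int LIFO; a stand-in for the allocator's IntStack.
final class IntStack {
    private int[] a = new int[16];
    private int top;
    void push(int v) {
        if (top == a.length) a = Arrays.copyOf(a, a.length * 2);
        a[top++] = v;
    }
    int pop() { return top == 0 ? -1 : a[--top]; } // -1 == empty (offsets are non-negative)
}

// Sketch of the owner-local vs. external free-list split described above.
final class SegmentFreeLists {
    private final Thread owner = Thread.currentThread();
    private final IntStack localFreeList = new IntStack();
    private final ConcurrentLinkedQueue<Integer> externalFreeList = new ConcurrentLinkedQueue<>();

    // Prefer recently freed owner-thread segments (likely still cache-hot);
    // only fall back to segments released by other threads.
    int nextAvailableSegmentOffset() {
        int offset = localFreeList.pop();
        if (offset != -1) {
            return offset;
        }
        Integer external = externalFreeList.poll();
        return external == null ? -1 : external;
    }

    void releaseSegmentOffset(int offset) {
        if (Thread.currentThread() == owner) {
            localFreeList.push(offset); // owner thread: plain array write, no atomics
        } else {
            externalFreeList.offer(offset); // off-thread: concurrent queue
        }
    }
}
```

The LIFO order matters: the most recently freed segment is handed out first, which is exactly the segment most likely to still sit in the owner core's caches.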
The merged optimization is mostly "keep owner-thread reuse owner-thread local."
- The thread-local magazine avoids lock acquisition on the fast path.
- Recently freed owner-thread segments are reused from a local LIFO stack.
- Cross-thread coordination is deferred until the local free list is empty or a different thread performs the release.
- The design favors cache and core locality over aggressively draining the external queue into the local list.
That matches the author's comment in review: local segments are more likely to still be "hot" in the owner thread's core caches, so they are preferred over mixing in external segments.
The PR added microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java, which exercises this path directly.
Two details from the benchmark are relevant:
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L176: the setup creates the allocator and the optional pollution workload.
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L205: the pollution step uses both a normal thread and a FastThreadLocalThread, which matters because the PR explicitly optimizes thread-local behavior.
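The pollution idea can be sketched in dependency-free form. This is not the PR's benchmark code: in the real benchmark the second thread is io.netty.util.concurrent.FastThreadLocalThread so both thread-local storage paths get exercised, while this sketch substitutes a second plain Thread and a caller-supplied Runnable stands in for the allocate/release workload.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

final class ThreadTypePollution {
    // Run the same workload once from each of two threads and report completions.
    static long runFromBothThreadTypes(Runnable workload) {
        LongAdder completed = new LongAdder();
        CountDownLatch done = new CountDownLatch(2);
        Runnable wrapped = () -> {
            workload.run();
            completed.increment();
            done.countDown();
        };
        new Thread(wrapped, "plain-thread").start();
        // The real benchmark uses a FastThreadLocalThread here instead of a plain Thread.
        new Thread(wrapped, "ftl-stand-in").start();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.sum();
    }
}
```

Running the same workload from both thread types is what makes the benchmark sensitive to the thread-local fast path the PR optimizes.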
The PR did not claim to solve everything. The follow-up items explicitly mentioned in the thread were:
- size-class chunk queue sizing and better use of available queue capacity
- removing extra reference-count operations on size-class chunks
- possibly more improvements in later PRs rather than in this merge
In short: PR #15741 optimized the adaptive allocator by making the common owner-thread allocation/release cycle stay local, simpler, and less atomic-heavy.