This note summarizes the "hotpath" discussed in netty/netty#15741, based on the PR discussion plus Frigg inspection of the merged code in this checkout.
Merged commit in this tree: accd981104dfe23dbe6208a16d197b7b3f5b8c94
The PR discussion points to the adaptive allocator's thread-local direct-allocation fast path, not a general allocator slow path.
Why that interpretation is the right one:
- The PR description says it reduces costly atomic operations in the "thread local allocation's fast path."
- The added benchmark focuses on direct allocation throughput.
- The review thread discusses keeping owner-thread segments "hot" in the local free list instead of mixing them with the external queue.
Relevant discussion links:
- PR overview: netty/netty#15741
- Review suggestion about batching from external to local list: netty/netty#15741 (comment)
- Author reply preferring local hot segments: netty/netty#15741 (comment)
- Follow-up note on chunk queue sizing and remaining ref-count work: netty/netty#15741 (comment)
Using Frigg on the current tree, the minimal direct-allocation path is:
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L214: directAllocation(...) calls state.performDirectAllocation().
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L133: performDirectAllocation() picks a size, releases the previous buffer, then calls allocateDirect(allocator, size).
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L129: allocateDirect(...) jumps into ByteBufAllocator.directBuffer(size).
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L393: the allocator takes the thread-local magazine path: tlMag.newBuffer() and then tlMag.tryAllocate(...).
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L844: Magazine.tryAllocate(...) immediately takes the no-lock branch when allocationLock == null, which is the thread-local case.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L881: allocation succeeds when the current chunk can readInitInto(...).
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1306: SizeClassedChunk.readInitInto(...) calls nextAvailableSegmentOffset() to pick the backing segment.
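The shape of that no-lock branch can be sketched as follows. This is an illustrative stand-in, not Netty's actual implementation: the real Magazine is far richer, and MagazineSketch, currentChunkRemaining, and allocateFromCurrentChunk are invented names. The point it shows is that a thread-local magazine is constructed with allocationLock == null, so its fast path is plain field access with no lock or CAS.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of a magazine whose thread-local instance skips locking entirely.
final class MagazineSketch {
    private final ReentrantLock allocationLock; // null => owned by exactly one thread
    private int currentChunkRemaining;

    MagazineSketch(boolean threadLocal, int chunkBytes) {
        this.allocationLock = threadLocal ? null : new ReentrantLock();
        this.currentChunkRemaining = chunkBytes;
    }

    boolean tryAllocate(int size) {
        if (allocationLock == null) {
            // Thread-local case: no atomic operations on the fast path.
            return allocateFromCurrentChunk(size);
        }
        if (!allocationLock.tryLock()) {
            return false; // contended shared magazine: caller falls back elsewhere
        }
        try {
            return allocateFromCurrentChunk(size);
        } finally {
            allocationLock.unlock();
        }
    }

    private boolean allocateFromCurrentChunk(int size) {
        if (currentChunkRemaining < size) {
            return false; // in the real allocator the caller would rotate chunks here
        }
        currentChunkRemaining -= size;
        return true;
    }
}
```

Encoding "thread-local" as a null lock keeps the fast path to a single reference comparison, which is the kind of atomic-free branch the PR description emphasizes.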
The most important hot instruction path is the owner-thread segment selection and release path inside SizeClassedChunk.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1041: Magazine.newBuffer() now gets the wrapper object directly from a recycler on the thread-local path, removing queue/handle indirection from wrapper reuse.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L715: createLocalFreeList() builds a thread-local stack of segment offsets for size-class chunks.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1225: IntStack is the simple local LIFO used to keep owner-thread segment reuse cheap.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1323: nextAvailableSegmentOffset() prefers localFreeList.pop() and only falls back to externalFreeList.poll() when the local stack is empty.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L1372: releaseSegmentOffsetIntoFreeList(...) mirrors that split: owner-thread frees go to the local LIFO stack, off-thread frees go to the MPSC external queue.
- buffer/src/main/java/io/netty/buffer/AdaptivePoolingAllocator.java#L125: the shared chunk reuse queue is still capacity-bound to availableProcessors() * 2, which the PR thread later calls out as not fully solved.
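The local-LIFO / external-queue split can be sketched as below. Everything here is a simplified stand-in: IntStack is a bare-bones version of Netty's internal class, SegmentFreeLists and the -1 empty sentinel are invented for illustration, and ConcurrentLinkedQueue substitutes for the MPSC queue Netty actually uses.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal growable int LIFO; a stand-in for the allocator's IntStack.
final class IntStack {
    private int[] a = new int[16];
    private int top;
    void push(int v) {
        if (top == a.length) a = Arrays.copyOf(a, a.length * 2);
        a[top++] = v;
    }
    int pop() { return top == 0 ? -1 : a[--top]; } // -1 == empty (offsets are non-negative)
}

// Sketch of the owner-local vs. external free-list split described above.
final class SegmentFreeLists {
    private final Thread owner = Thread.currentThread();
    private final IntStack localFreeList = new IntStack();
    private final ConcurrentLinkedQueue<Integer> externalFreeList = new ConcurrentLinkedQueue<>();

    // Prefer recently freed owner-thread segments (likely still cache-hot);
    // only fall back to segments released by other threads.
    int nextAvailableSegmentOffset() {
        int offset = localFreeList.pop();
        if (offset != -1) {
            return offset;
        }
        Integer external = externalFreeList.poll();
        return external == null ? -1 : external;
    }

    void releaseSegmentOffset(int offset) {
        if (Thread.currentThread() == owner) {
            localFreeList.push(offset); // owner thread: plain array write, no atomics
        } else {
            externalFreeList.offer(offset); // off-thread: concurrent queue
        }
    }
}
```

The LIFO order matters: the most recently freed segment is handed out first, which is exactly the segment most likely to still sit in the owner core's caches.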
The merged optimization is mostly "keep owner-thread reuse owner-thread local."
- The thread-local magazine avoids lock acquisition on the fast path.
- Recently freed owner-thread segments are reused from a local LIFO stack.
- Cross-thread coordination is deferred until the local free list is empty or a different thread performs the release.
- The design favors cache and core locality over aggressively draining the external queue into the local list.
That matches the author's comment in review: local segments are more likely to still be "hot" in the owner thread's core caches, so they are preferred over mixing in external segments.
The PR added microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java, which exercises this path directly.
Two details from the benchmark are relevant:
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L176: the setup creates the allocator and the optional pollution workload.
- microbench/src/main/java/io/netty/microbench/buffer/ByteBufAllocatorAllocPatternBenchmark.java#L205: the pollution step uses both a normal thread and a FastThreadLocalThread, which matters because the PR explicitly optimizes thread-local behavior.
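The pollution idea can be sketched in dependency-free form. This is not the PR's benchmark code: in the real benchmark the second thread is io.netty.util.concurrent.FastThreadLocalThread so both thread-local storage paths get exercised, while this sketch substitutes a second plain Thread and a caller-supplied Runnable stands in for the allocate/release workload.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

final class ThreadTypePollution {
    // Run the same workload once from each of two threads and report completions.
    static long runFromBothThreadTypes(Runnable workload) {
        LongAdder completed = new LongAdder();
        CountDownLatch done = new CountDownLatch(2);
        Runnable wrapped = () -> {
            workload.run();
            completed.increment();
            done.countDown();
        };
        new Thread(wrapped, "plain-thread").start();
        // The real benchmark uses a FastThreadLocalThread here instead of a plain Thread.
        new Thread(wrapped, "ftl-stand-in").start();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.sum();
    }
}
```

Running the same workload from both thread types is what makes the benchmark sensitive to the thread-local fast path the PR optimizes.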
The PR did not claim to solve everything. The follow-up items explicitly mentioned in the thread were:
- size-class chunk queue sizing and better use of available queue capacity
- removing extra reference-count operations on size-class chunks
- possibly more improvements in later PRs rather than in this merge
In short: PR #15741 optimized the adaptive allocator by making the common owner-thread allocation/release cycle stay local, simpler, and less atomic-heavy.