
Nathan Moin Vaziri (nmoinvaz), Los Angeles, California
nmoinvaz / zlib-ng-check-lens-bench-results.md
Created April 23, 2026 00:58
zlib-ng zng_check_lens: SIMD vs SWAR vs scalar benchmark (spun out of PR #2267)

zng_check_lens: SIMD vs SWAR vs scalar

Validity-check benchmark for the zng_check_lens(lens, codes) function proposed in the PR #2267 discussion. All three variants scan lens[0..codes-1] and return -1 if any entry exceeds MAX_BITS (15). Input is all-valid (random values in [0, 15]) so the worst case — a full scan with no early exit — is measured.
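For reference, the scalar variant of such a check is a plain bounded scan. This is a hedged sketch following the description above (function name and exact signature are illustrative, not the PR's actual code):

```c
#include <stdint.h>

#define MAX_BITS 15  /* maximum Huffman code length in deflate */

/* Sketch of the scalar variant: scan lens[0..codes-1] and return -1
 * if any entry exceeds MAX_BITS, 0 otherwise. */
static int check_lens_scalar(const uint16_t *lens, unsigned codes) {
    for (unsigned i = 0; i < codes; i++) {
        if (lens[i] > MAX_BITS)
            return -1;
    }
    return 0;
}
```

On all-valid input this never takes the early exit, which is exactly the worst case the benchmark measures.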

Variants:

nmoinvaz / zlib-ng-count-lengths-swar-results.md
Last active April 23, 2026 01:18
zlib-ng count_lengths: SWAR vs SIMD benchmark investigation (spun out of PR #2267)

count_lengths: SWAR vs SIMD

Investigation spun out of the PR #2267 discussion on zlib-ng: can the SIMD paths in count_lengths (inftrees.c) be replaced with a SWAR implementation using zng_memread_8?

SWAR design

Mirror the pair-interleaved 8-bit-lane structure of the active SIMD path. Two pairs of uint64_t accumulators (s1_lo/s1_hi,
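As background, the lane-exact byte-equality primitive that a SWAR design of this kind builds on can be sketched as follows. Names are illustrative, not from the PR, and the real code operates on pair-interleaved uint16_t data rather than a single flat word:

```c
#include <stdint.h>

/* Illustrative SWAR primitive: return a mask with 0x01 in each byte lane
 * of x that equals v. Lane-exact: the (t | d) construction cannot produce
 * cross-lane carries, unlike the classic haszero() subtract trick. */
static uint64_t swar_eq_bytes(uint64_t x, uint8_t v) {
    const uint64_t lo = 0x0101010101010101ULL;
    const uint64_t hi = 0x8080808080808080ULL;
    uint64_t d = x ^ ((uint64_t)v * lo);  /* matching lanes become 0 */
    uint64_t t = (d & ~hi) + ~hi;         /* bit 7 set iff low 7 bits nonzero */
    return (~(t | d) & hi) >> 7;          /* 0x01 in matching lanes */
}

/* Count the byte lanes of x equal to v by summing the mask lanes. */
static unsigned swar_count_eq(uint64_t x, uint8_t v) {
    return (unsigned)((swar_eq_bytes(x, v) * 0x0101010101010101ULL) >> 56);
}
```

An accumulator scheme like the one described would add such masks into 8-bit lanes and flush periodically before any lane can overflow.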

nmoinvaz / zlib-ng-pr2267-check-lens.c
Last active April 23, 2026 00:46
zlib-ng PR #2267: SIMD lens[] validity prescan with zero-init invariant
/* SIMD validity check for a Huffman code-length buffer.
*
* Returns 0 if every entry in lens[0..codes-1] is <= MAX_BITS,
* otherwise returns -1. Called from zng_inflate_table before
* count_lengths to guard against the out-of-bounds read of one[]
* described in zlib-ng issue #2266.
*
* Main loop scans 8 uint16_t per iteration via 128-bit vector
* compares; a scalar tail handles the remaining 0-7 entries. No
* assumptions about buffer padding or caller identity. */
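A hedged sketch of the loop structure that comment describes (not the PR's actual code). One detail worth noting: a signed _mm_cmpgt_epi16 would misclassify uint16_t values above 0x7FFF, so this sketch uses a saturating subtract to get an unsigned comparison:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

#define MAX_BITS 15

static int check_lens_sse2(const uint16_t *lens, unsigned codes) {
    const __m128i max = _mm_set1_epi16(MAX_BITS);
    const __m128i zero = _mm_setzero_si128();
    unsigned i = 0;
    /* Main loop: 8 uint16_t per iteration via 128-bit compares. */
    for (; i + 8 <= codes; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(lens + i));
        /* Unsigned "greater than": the saturating subtract leaves a
         * nonzero lane exactly where lens[i+k] > MAX_BITS. */
        __m128i over = _mm_subs_epu16(v, max);
        if (_mm_movemask_epi8(_mm_cmpeq_epi16(over, zero)) != 0xFFFF)
            return -1;
    }
    /* Scalar tail for the remaining 0-7 entries. */
    for (; i < codes; i++) {
        if (lens[i] > MAX_BITS)
            return -1;
    }
    return 0;
}
```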
nmoinvaz / zlib-ng-pr2261-visibility-runall.sh
Last active April 18, 2026 04:00
zlib-ng PR #2261 — visibility("internal") vs visibility("hidden") for Z_INTERNAL. Four-way audit: GCC source, linker source (mold/lld/BFD/gold/ld.so), community prior art (LLVM issue #9555 aliased internal→hidden in 2015), and empirical cross-builds (301 zlib-ng objects across 8 arches, 0 disassembly differences). Consensus: internal is dead-end…
#!/bin/bash
# Toy-test cross-compile: compile a minimal C file (see
# zlib-ng-pr2261-visibility-test.c) with every GCC cross compiler we can
# reasonably get, twice per architecture (visibility("hidden") and
# visibility("internal")), and diff the assembly output — ignoring the
# .hidden/.internal pseudo-op so any real codegen difference surfaces.
#
# Run via Docker:
# docker run --rm -v /path/to/scripts:/work -w /work \
# debian:trixie-slim bash /work/zlib-ng-pr2261-visibility-runall.sh
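The two attribute spellings the audit compares can be sketched in C like this (helper names are illustrative; the guard lets the snippet also compile where the GNU visibility extension is absent):

```c
/* The two visibility spellings under comparison. On GCC/Clang, "hidden"
 * keeps a symbol out of the dynamic symbol table; "internal" is hidden
 * plus a promise that the function is never entered from outside its
 * module, which in principle permits processor-specific ABI shortcuts. */
#if defined(__GNUC__)
#  define VIS_HIDDEN   __attribute__((visibility("hidden")))
#  define VIS_INTERNAL __attribute__((visibility("internal")))
#else
#  define VIS_HIDDEN
#  define VIS_INTERNAL
#endif

VIS_HIDDEN   int helper_hidden(void)   { return 1; }
VIS_INTERNAL int helper_internal(void) { return 2; }
```

Compiling a file like this twice, once per spelling, and diffing the assembly minus the .hidden/.internal pseudo-op is the experiment the script above automates.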
nmoinvaz / repro_v142.c
Created April 17, 2026 20:14
zlib-ng: Minimal reproducer for MSVC v142 _mm_set_epi64x miscompile on 32-bit x86
/* Minimal reproducer: MSVC v142 _mm_set_epi64x miscompile, 32-bit x86.
*
* cl /O2 /arch:SSE2 repro_v142.c && repro_v142.exe
*
* v142 Win32: FAIL (wrong result due to register corruption)
* v143+ Win32: PASS
*
* https://developercommunity.visualstudio.com/t/10853479
*/
#include <stdio.h>
nmoinvaz / zlib-ng_chorba_v142_asm_analysis.md
Created April 17, 2026 20:07
zlib-ng: MSVC v142 chorba SSE2 miscompile — assembly analysis of _mm_cvtsi64_si128 polyfill bug and fix

zlib-ng: MSVC v142 Chorba SSE2 miscompile — assembly analysis

Overview

MSVC v142 (Visual Studio 2019) miscompiles the _mm_cvtsi64_si128 polyfill on 32-bit Windows when it is implemented as _mm_set_epi64x(0, a). The bug manifests in chorba_small_nondestructive_sse2, where the ~crc value intended for an XMM register is instead routed through a GPR, overwriting the live edi register.

Replacing _mm_set_epi64x(0, a) with _mm_loadl_epi64((const __m128i*)&a) forces the compiler to emit MOVQ xmm, m64, which sidesteps the buggy synthesis path entirely.
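The before/after can be sketched as two wrappers (illustrative names; the real code is zlib-ng's _mm_cvtsi64_si128 polyfill for 32-bit MSVC):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Buggy form: on MSVC v142 /arch:SSE2 Win32, synthesizing the vector
 * from GPR halves is the path reported to corrupt the live edi. */
static __m128i cvtsi64_si128_set(int64_t a) {
    return _mm_set_epi64x(0, a);
}

/* Fixed form: forces MOVQ xmm, m64, bypassing the GPR synthesis path. */
static __m128i cvtsi64_si128_load(int64_t a) {
    return _mm_loadl_epi64((const __m128i *)&a);
}

/* Helper for checking results: extract the low 64 bits again. */
static int64_t low64(__m128i v) {
    int64_t out;
    _mm_storel_epi64((__m128i *)&out, v);
    return out;
}
```

On a correct compiler both forms are equivalent; the difference is purely which instruction sequence v142 emits for 32-bit targets.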

The source code trigger

nmoinvaz / gist_chorba_v142_asm.md
Created April 17, 2026 19:28
zlib-ng: MSVC v142 chorba SSE2 miscompile — assembly analysis of _mm_cvtsi64_si128 polyfill bug and fix

nmoinvaz / zlib-ng-longest-match-offset-search.md
Created April 14, 2026 01:42
zlib-ng: integer-hash offset-search in longest_match (silesia L8 -17 to -60% time, neutral ratio)

zlib-ng: integer-hash offset-search in longest_match

Branch: improvements/offset-search-int-hash · Commits: a944f45b + 0f83f476

Summary

Extended longest_match (the non-slow variant used by levels 1-8) with the fast-zlib offset-search rewinding, using the 4-byte integer hash that levels 1-8 already use elsewhere in the hash table. The offset search is
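The general shape of a 4-byte multiplicative integer hash of the kind referred to can be sketched as follows. The Knuth-style constant and the table size here are assumptions for illustration, not the branch's actual values:

```c
#include <stdint.h>
#include <string.h>

#define HASH_BITS 16  /* assumed table of 2^16 entries */

/* Illustrative 4-byte multiplicative hash: read 4 bytes, multiply by an
 * odd constant, keep the top HASH_BITS bits of the 32-bit product. */
static uint32_t hash4(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));  /* unaligned-safe 4-byte read */
    return (v * 2654435761u) >> (32 - HASH_BITS);
}
```

The multiply spreads low-byte differences into the high bits, which is why keeping the top bits of the product gives a usable table index.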

nmoinvaz / zlib-ng-slide_hash-neon-interleave.md
Last active April 13, 2026 03:07
zlib-ng: slide_hash NEON+C interleave investigation (Apple M5)

slide_hash NEON + C interleave investigation

Investigation of alternative slide_hash implementations for zlib-ng: can we make slide_hash_c_chain and slide_hash_neon faster by interleaving the slide of head and prev, by widening the unroll, by combining loops, or by switching between ldp q,q / stp q,q and ld1 {v,v} / st1 {v,v} addressing modes?
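For context, the scalar slide that both implementations optimize looks roughly like this (a sketch following zlib's slide_hash; types and names are simplified):

```c
#include <stdint.h>

typedef uint16_t Pos;

/* Clamp one entry: positions that slid out of the window become 0. */
static Pos slide_one(Pos m, uint16_t wsize) {
    return (Pos)(m >= wsize ? m - wsize : 0);
}

/* Scalar slide over one hash table (head or prev): subtract the window
 * size from every entry, saturating at 0. The NEON path does the same
 * with vqsubq_u16 across several lanes per iteration. */
static void slide_hash_chain(Pos *table, unsigned entries, uint16_t wsize) {
    for (unsigned i = 0; i < entries; i++)
        table[i] = slide_one(table[i], wsize);
}
```

Because head and prev get the identical treatment, interleaving or combining their loops is a natural thing to try, which is what the investigation measures.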

TL;DR

nmoinvaz / zlib-ng-2248-longest-match-slow-scan-endstr-offset.md
Created April 11, 2026 03:50
zlib-ng #2248 — longest_match_slow scan_endstr offset fix benchmark

zlib-ng #2248 — longest_match_slow scan_endstr offset fix

Benchmark results for the fix in #2248: LONGEST_MATCH_SLOW was hashing the wrong 3-byte window when looking for a chain head near the end of the current match. The comment in match_tpl.h and the upstream gildor2/fast_zlib source both specify len - (STD_MIN_MATCH - 1) (== len - 2 for STD_MIN_MATCH == 3), but the code was using len - (STD_MIN_MATCH + 1) (== len - 4), placing the hashed window entirely inside the already-matched region instead of ending one byte past the current match.
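The off-by-two can be made concrete with a toy calculation (helper names are illustrative): for a current match of length len, the 3-byte window to hash should start at len - 2 so that it ends exactly one byte past the match.

```c
#define STD_MIN_MATCH 3

/* Correct start offset of the 3-byte window: the window
 * [offset, offset + 2] ends one byte past the current match. */
static int scan_endstr_offset_fixed(int len) {
    return len - (STD_MIN_MATCH - 1);  /* == len - 2 */
}

/* Buggy offset from the pre-#2248 code: the window [offset, offset + 2]
 * lies entirely inside the already-matched region. */
static int scan_endstr_offset_buggy(int len) {
    return len - (STD_MIN_MATCH + 1);  /* == len - 4 */
}
```

For len == 10 the fixed window is [8, 10] (its last byte is the first byte beyond the match), while the buggy window [6, 8] hashes only already-matched data, which is why the chain-head lookup could miss longer continuations.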