
Nathan Moin Vaziri (nmoinvaz), Los Angeles, California
nmoinvaz / zlib-ng-check-lens-bench-results.md
Created April 23, 2026 00:58
zlib-ng zng_check_lens: SIMD vs SWAR vs scalar benchmark (spun out of PR #2267)

zng_check_lens: SIMD vs SWAR vs scalar

Validity-check benchmark for the zng_check_lens(lens, codes) function proposed in the PR #2267 discussion. All three variants scan lens[0..codes-1] and return -1 if any entry exceeds MAX_BITS (15). Input is all-valid (random values in [0, 15]) so the worst case — a full scan with no early exit — is measured.
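For reference, the scalar variant of such a check is a plain bounded scan. This is a hedged sketch following the description above (function name and exact signature are illustrative, not the PR's actual code):

```c
#include <stdint.h>

#define MAX_BITS 15  /* maximum Huffman code length in deflate */

/* Sketch of the scalar variant: scan lens[0..codes-1] and return -1
 * if any entry exceeds MAX_BITS, 0 otherwise. */
static int check_lens_scalar(const uint16_t *lens, unsigned codes) {
    for (unsigned i = 0; i < codes; i++) {
        if (lens[i] > MAX_BITS)
            return -1;
    }
    return 0;
}
```

On all-valid input this never takes the early exit, which is exactly the worst case the benchmark measures.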

Variants:

nmoinvaz / zlib-ng-count-lengths-swar-results.md
Last active April 23, 2026 01:18
zlib-ng count_lengths: SWAR vs SIMD benchmark investigation (spun out of PR #2267)

count_lengths: SWAR vs SIMD

Investigation spun out of the PR #2267 discussion on zlib-ng: can the SIMD paths in count_lengths (inftrees.c) be replaced with a SWAR implementation using zng_memread_8?

SWAR design

Mirror the pair-interleaved 8-bit-lane structure of the active SIMD path. Two pairs of uint64_t accumulators (s1_lo/s1_hi,
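As background, the lane-exact byte-equality primitive that a SWAR design of this kind builds on can be sketched as follows. Names are illustrative, not from the PR, and the real code operates on pair-interleaved uint16_t data rather than a single flat word:

```c
#include <stdint.h>

/* Illustrative SWAR primitive: return a mask with 0x01 in each byte lane
 * of x that equals v. Lane-exact: the (t | d) construction cannot produce
 * cross-lane carries, unlike the classic haszero() subtract trick. */
static uint64_t swar_eq_bytes(uint64_t x, uint8_t v) {
    const uint64_t lo = 0x0101010101010101ULL;
    const uint64_t hi = 0x8080808080808080ULL;
    uint64_t d = x ^ ((uint64_t)v * lo);  /* matching lanes become 0 */
    uint64_t t = (d & ~hi) + ~hi;         /* bit 7 set iff low 7 bits nonzero */
    return (~(t | d) & hi) >> 7;          /* 0x01 in matching lanes */
}

/* Count the byte lanes of x equal to v by summing the mask lanes. */
static unsigned swar_count_eq(uint64_t x, uint8_t v) {
    return (unsigned)((swar_eq_bytes(x, v) * 0x0101010101010101ULL) >> 56);
}
```

An accumulator scheme like the one described would add such masks into 8-bit lanes and flush periodically before any lane can overflow.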

nmoinvaz / zlib-ng-pr2267-check-lens.c
Last active April 23, 2026 00:46
zlib-ng PR #2267: SIMD lens[] validity prescan with zero-init invariant
/* SIMD validity check for a Huffman code-length buffer.
*
* Returns 0 if every entry in lens[0..codes-1] is <= MAX_BITS,
* otherwise returns -1. Called from zng_inflate_table before
* count_lengths to guard against the out-of-bounds read of one[]
* described in zlib-ng issue #2266.
*
* Main loop scans 8 uint16_t per iteration via 128-bit vector
* compares; a scalar tail handles the remaining 0-7 entries. No
* assumptions about buffer padding or caller identity. */
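A hedged sketch of the loop structure that comment describes (not the PR's actual code). One detail worth noting: a signed _mm_cmpgt_epi16 would misclassify uint16_t values above 0x7FFF, so this sketch uses a saturating subtract to get an unsigned comparison:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

#define MAX_BITS 15

static int check_lens_sse2(const uint16_t *lens, unsigned codes) {
    const __m128i max = _mm_set1_epi16(MAX_BITS);
    const __m128i zero = _mm_setzero_si128();
    unsigned i = 0;
    /* Main loop: 8 uint16_t per iteration via 128-bit compares. */
    for (; i + 8 <= codes; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(lens + i));
        /* Unsigned "greater than": the saturating subtract leaves a
         * nonzero lane exactly where lens[i+k] > MAX_BITS. */
        __m128i over = _mm_subs_epu16(v, max);
        if (_mm_movemask_epi8(_mm_cmpeq_epi16(over, zero)) != 0xFFFF)
            return -1;
    }
    /* Scalar tail for the remaining 0-7 entries. */
    for (; i < codes; i++) {
        if (lens[i] > MAX_BITS)
            return -1;
    }
    return 0;
}
```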
nmoinvaz / zlib-ng-pr2261-visibility-runall.sh
Last active April 18, 2026 04:00
zlib-ng PR #2261 — visibility("internal") vs visibility("hidden") for Z_INTERNAL. Four-way audit: GCC source, linker source (mold/lld/BFD/gold/ld.so), community prior art (LLVM issue #9555 aliased internal→hidden in 2015), and empirical cross-builds (301 zlib-ng objects across 8 arches, 0 disassembly differences). Consensus: internal is dead-end…
#!/bin/bash
# Toy-test cross-compile: compile a minimal C file (see
# zlib-ng-pr2261-visibility-test.c) with every GCC cross compiler we can
# reasonably get, twice per architecture (visibility("hidden") and
# visibility("internal")), and diff the assembly output — ignoring the
# .hidden/.internal pseudo-op so any real codegen difference surfaces.
#
# Run via Docker:
# docker run --rm -v /path/to/scripts:/work -w /work \
# debian:trixie-slim bash /work/zlib-ng-pr2261-visibility-runall.sh
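The two attribute spellings the audit compares can be sketched in C like this (helper names are illustrative; the guard lets the snippet also compile where the GNU visibility extension is absent):

```c
/* The two visibility spellings under comparison. On GCC/Clang, "hidden"
 * keeps a symbol out of the dynamic symbol table; "internal" is hidden
 * plus a promise that the function is never entered from outside its
 * module, which in principle permits processor-specific ABI shortcuts. */
#if defined(__GNUC__)
#  define VIS_HIDDEN   __attribute__((visibility("hidden")))
#  define VIS_INTERNAL __attribute__((visibility("internal")))
#else
#  define VIS_HIDDEN
#  define VIS_INTERNAL
#endif

VIS_HIDDEN   int helper_hidden(void)   { return 1; }
VIS_INTERNAL int helper_internal(void) { return 2; }
```

Compiling a file like this twice, once per spelling, and diffing the assembly minus the .hidden/.internal pseudo-op is the experiment the script above automates.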
nmoinvaz / repro_v142.c
Created April 17, 2026 20:14
zlib-ng: Minimal reproducer for MSVC v142 _mm_set_epi64x miscompile on 32-bit x86
/* Minimal reproducer: MSVC v142 _mm_set_epi64x miscompile, 32-bit x86.
*
* cl /O2 /arch:SSE2 repro_v142.c && repro_v142.exe
*
* v142 Win32: FAIL (wrong result due to register corruption)
* v143+ Win32: PASS
*
* https://developercommunity.visualstudio.com/t/10853479
*/
#include <stdio.h>
nmoinvaz / zlib-ng_chorba_v142_asm_analysis.md
Created April 17, 2026 20:07
zlib-ng: MSVC v142 chorba SSE2 miscompile — assembly analysis of _mm_cvtsi64_si128 polyfill bug and fix

zlib-ng: MSVC v142 Chorba SSE2 miscompile — assembly analysis

Overview

MSVC v142 (Visual Studio 2019) miscompiles the _mm_cvtsi64_si128 polyfill on 32-bit Windows when it is implemented as _mm_set_epi64x(0, a). The bug manifests in chorba_small_nondestructive_sse2, where the ~crc value intended for an XMM register is instead routed through a GPR, overwriting the live edi register.

Replacing _mm_set_epi64x(0, a) with _mm_loadl_epi64((const __m128i*)&a) forces the compiler to emit MOVQ xmm, m64, which sidesteps the buggy synthesis path entirely.
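The before/after can be sketched as two wrappers (illustrative names; the real code is zlib-ng's _mm_cvtsi64_si128 polyfill for 32-bit MSVC):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Buggy form: on MSVC v142 /arch:SSE2 Win32, synthesizing the vector
 * from GPR halves is the path reported to corrupt the live edi. */
static __m128i cvtsi64_si128_set(int64_t a) {
    return _mm_set_epi64x(0, a);
}

/* Fixed form: forces MOVQ xmm, m64, bypassing the GPR synthesis path. */
static __m128i cvtsi64_si128_load(int64_t a) {
    return _mm_loadl_epi64((const __m128i *)&a);
}

/* Helper for checking results: extract the low 64 bits again. */
static int64_t low64(__m128i v) {
    int64_t out;
    _mm_storel_epi64((__m128i *)&out, v);
    return out;
}
```

On a correct compiler both forms are equivalent; the difference is purely which instruction sequence v142 emits for 32-bit targets.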

The source code trigger

nmoinvaz / gist_chorba_v142_asm.md
Created April 17, 2026 19:28
zlib-ng: MSVC v142 chorba SSE2 miscompile — assembly analysis of _mm_cvtsi64_si128 polyfill bug and fix

nmoinvaz / zlib-ng-longest-match-offset-search.md
Created April 14, 2026 01:42
zlib-ng: integer-hash offset-search in longest_match (silesia L8 -17 to -60% time, neutral ratio)

zlib-ng: integer-hash offset-search in longest_match

Branch: improvements/offset-search-int-hash · Commits: a944f45b + 0f83f476

Summary

Extended longest_match (the non-slow variant used by levels 1-8) with the fast-zlib offset-search rewinding, using the 4-byte integer hash that levels 1-8 already use elsewhere in the hash table. The offset search is
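The general shape of a 4-byte multiplicative integer hash of the kind referred to can be sketched as follows. The Knuth-style constant and the table size here are assumptions for illustration, not the branch's actual values:

```c
#include <stdint.h>
#include <string.h>

#define HASH_BITS 16  /* assumed table of 2^16 entries */

/* Illustrative 4-byte multiplicative hash: read 4 bytes, multiply by an
 * odd constant, keep the top HASH_BITS bits of the 32-bit product. */
static uint32_t hash4(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof(v));  /* unaligned-safe 4-byte read */
    return (v * 2654435761u) >> (32 - HASH_BITS);
}
```

The multiply spreads low-byte differences into the high bits, which is why keeping the top bits of the product gives a usable table index.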

nmoinvaz / zlib-ng-slide_hash-neon-interleave.md
Last active April 13, 2026 03:07
zlib-ng: slide_hash NEON+C interleave investigation (Apple M5)

slide_hash NEON + C interleave investigation

Investigation of alternative slide_hash implementations for zlib-ng: can we make slide_hash_c_chain and slide_hash_neon faster by interleaving the slide of head and prev, by widening the unroll, by combining loops, or by switching between ldp q,q / stp q,q and ld1 {v,v} / st1 {v,v} addressing modes?
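For context, the scalar slide that both implementations optimize looks roughly like this (a sketch following zlib's slide_hash; types and names are simplified):

```c
#include <stdint.h>

typedef uint16_t Pos;

/* Clamp one entry: positions that slid out of the window become 0. */
static Pos slide_one(Pos m, uint16_t wsize) {
    return (Pos)(m >= wsize ? m - wsize : 0);
}

/* Scalar slide over one hash table (head or prev): subtract the window
 * size from every entry, saturating at 0. The NEON path does the same
 * with vqsubq_u16 across several lanes per iteration. */
static void slide_hash_chain(Pos *table, unsigned entries, uint16_t wsize) {
    for (unsigned i = 0; i < entries; i++)
        table[i] = slide_one(table[i], wsize);
}
```

Because head and prev get the identical treatment, interleaving or combining their loops is a natural thing to try, which is what the investigation measures.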

TL;DR

nmoinvaz / zlib-ng-2248-longest-match-slow-scan-endstr-offset.md
Created April 11, 2026 03:50
zlib-ng #2248 — longest_match_slow scan_endstr offset fix benchmark

zlib-ng #2248 — longest_match_slow scan_endstr offset fix

Benchmark results for the fix in #2248: LONGEST_MATCH_SLOW was hashing the wrong 3-byte window when looking for a chain head near the end of the current match. The comment in match_tpl.h and the upstream gildor2/fast_zlib source both specify len - (STD_MIN_MATCH - 1) (== len - 2 for STD_MIN_MATCH == 3), but the code was using len - (STD_MIN_MATCH + 1) (== len - 4), placing the hashed window entirely inside the already-matched region instead of ending one byte past the current match.
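The off-by-two can be made concrete with a toy calculation (helper names are illustrative): for a current match of length len, the 3-byte window to hash should start at len - 2 so that it ends exactly one byte past the match.

```c
#define STD_MIN_MATCH 3

/* Correct start offset of the 3-byte window: the window
 * [offset, offset + 2] ends one byte past the current match. */
static int scan_endstr_offset_fixed(int len) {
    return len - (STD_MIN_MATCH - 1);  /* == len - 2 */
}

/* Buggy offset from the pre-#2248 code: the window [offset, offset + 2]
 * lies entirely inside the already-matched region. */
static int scan_endstr_offset_buggy(int len) {
    return len - (STD_MIN_MATCH + 1);  /* == len - 4 */
}
```

For len == 10 the fixed window is [8, 10] (its last byte is the first byte beyond the match), while the buggy window [6, 8] hashes only already-matched data, which is why the chain-head lookup could miss longer continuations.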