MSVC v142 (Visual Studio 2019) miscompiles the _mm_cvtsi64_si128 polyfill on 32-bit Windows when it is implemented as _mm_set_epi64x(0, a). The bug manifests in chorba_small_nondestructive_sse2, where the ~crc value intended for an XMM register is instead routed through a GPR, overwriting the live edi register.
Replacing _mm_set_epi64x(0, a) with _mm_loadl_epi64((const __m128i*)&a) forces the compiler to emit MOVQ xmm, m64, which sidesteps the buggy synthesis path entirely.
// arch/x86/crc32_chorba_sse2.c, line 43
uint32_t chorba_small_nondestructive_sse2(uint32_t crc, const uint8_t *buf, size_t len) {
// ...
uint64_t next1 = ~crc;
// ...
__m128i next12 = _mm_cvtsi64_si128(next1); // ← triggers the bug

On 32-bit x86, there are no 64-bit GPRs, so _mm_cvtsi64_si128 is not natively provided by MSVC. The zlib-ng polyfill implements it in arch/x86/x86_intrins.h.
static inline __m128i _mm_cvtsi64_si128(int64_t a) {
- return _mm_set_epi64x(0, a); // buggy on v142
+ return _mm_loadl_epi64((const __m128i*)&a); // correct on all versions
}

Both are semantically identical: they produce { a, 0 } in the XMM register. The difference is in how v142 lowers them to machine code.
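The equivalence claim can be checked at the value level. The following is a minimal sketch, not zlib-ng code: the helper names cvt_via_set, cvt_via_loadl, variants_agree and check_variants are hypothetical, and it assumes an SSE2-capable x86 target whose compiler provides _mm_set_epi64x.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the original polyfill body. */
static inline __m128i cvt_via_set(int64_t a) {
    return _mm_set_epi64x(0, a);                  /* high qword = 0, low qword = a */
}

/* Hypothetical helper: the replacement polyfill body. */
static inline __m128i cvt_via_loadl(int64_t a) {
    return _mm_loadl_epi64((const __m128i *)&a);  /* loads 64 bits, zeroes the high qword */
}

/* Returns 1 when both variants produce the 128-bit value { a, 0 }. */
int variants_agree(int64_t a) {
    uint64_t x[2], y[2];
    _mm_storeu_si128((__m128i *)x, cvt_via_set(a));
    _mm_storeu_si128((__m128i *)y, cvt_via_loadl(a));
    return x[0] == (uint64_t)a && x[1] == 0 && memcmp(x, y, sizeof x) == 0;
}

/* Spot-check a few inputs, including a zero-extended 32-bit value like ~crc. */
void check_variants(void) {
    assert(variants_agree(0));
    assert(variants_agree(-1));
    assert(variants_agree((int64_t)0x00000000FFFFFFFFLL));
}
```

A standalone check like this is not expected to reproduce the miscompile, since the problem described here only appears once the polyfill is inlined into its caller.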
Buggy (_mm_set_epi64x), 4 instructions:
__mm_cvtsi64_si128:
movq xmm0, mmword ptr [esp+4]
xorps xmm1, xmm1
punpcklqdq xmm0, xmm1 ; zero-extend high qword
ret
Fixed (_mm_loadl_epi64) — 2 instructions:
__mm_cvtsi64_si128:
movq xmm0, mmword ptr [esp+4]
ret
Both start with the same MOVQ load. The buggy version adds an unnecessary xorps+punpcklqdq to zero the high qword (which MOVQ already does by definition). These extra instructions are harmless in isolation — the real problem is what v142 does when it inlines this into the caller.
; uint32_t chorba_small_nondestructive_sse2(uint32_t crc, const uint8_t *buf, size_t len)
; cdecl: crc=[ebp+8], buf=[ebp+0Ch], len=[ebp+10h]
_chorba_small_nondestructive_sse2:
push ebp
mov ebp, esp
and esp, 0FFFFFFF0h ; align stack to 16 bytes
sub esp, 148h ; allocate locals
; ... security cookie ...
mov eax, dword ptr [ebp+0Ch] ; eax = buf
push esi
push edi
; ... memset final[9] ...
mov eax, dword ptr [ebp+8] ; eax = crc
mov esi, dword ptr [ebp+10h] ; esi = len ← len lives in esi
not eax ; eax = ~crc
mov ecx, dword ptr [esp+14h] ; ecx = buf (reloaded)
xor edi, edi ; edi = 0 (i = 0) ← i lives in edi
mov dword ptr [esp+18h], eax ; [esp+18h] = ~crc (next1, as 64-bit with high dword = 0)
; ... xorps xmm0/xmm3 to zero next34/next56 ...
At this point, edi = 0 (loop counter i) and esi = len. The value ~crc has been stored at [esp+18h] as a 64-bit value (low dword = ~crc, high dword = 0, written by the instruction at offset 0x4C).
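The stack layout follows from how C evaluates uint64_t next1 = ~crc. As a small sketch (widen_not and check_widen are hypothetical names, and a 32-bit int is assumed, as on Windows x86), the complement is taken at 32 bits and only then zero-extended, so the high dword is always 0:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors `uint64_t next1 = ~crc;`: the complement is computed on the
 * 32-bit operand (assuming int is 32 bits) and then zero-extended. */
uint64_t widen_not(uint32_t crc) {
    uint64_t next1 = ~crc;
    return next1;
}

void check_widen(void) {
    assert(widen_not(0u) == 0x00000000FFFFFFFFull);          /* not 0xFFFFFFFFFFFFFFFF */
    assert((uint32_t)(widen_not(0x12345678u) >> 32) == 0);   /* high dword is zero */
    assert((uint32_t)widen_not(0x12345678u) == 0xEDCBA987u); /* low dword is ~crc */
}
```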
Buggy:
00000057: 8B 7C 24 18 mov edi, dword ptr [esp+18h] ; edi = ~crc ← WRONG!
Fixed:
00000057: F3 0F 7E 7C 24 18 movq xmm7, mmword ptr [esp+18h] ; xmm7 = ~crc ← CORRECT
The compiler is materializing next12 = _mm_cvtsi64_si128(next1). This should load ~crc into an XMM register. Instead, v142 emits a 32-bit GPR load into edi.
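The two encodings at offset 0x57 differ only in their opcode bytes; the operand bytes 7C 24 18 are the same. A minimal sketch of decoding them under the standard x86 ModRM/SIB layout (the types and function names are hypothetical) shows that both forms select register number 7, which is edi for the GPR opcode and xmm7 for the MOVQ opcode. Whether that shared number is related to the bug is not established here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper types for illustration. */
typedef struct { unsigned mod, reg, rm; } modrm_t;
typedef struct { unsigned scale, index, base; } sib_t;

/* Split a ModRM byte into its mod (2 bits), reg (3 bits), rm (3 bits) fields. */
modrm_t decode_modrm(uint8_t b) {
    modrm_t m = { (unsigned)(b >> 6), (unsigned)((b >> 3) & 7), (unsigned)(b & 7) };
    return m;
}

/* Split a SIB byte into its scale (2 bits), index (3 bits), base (3 bits) fields. */
sib_t decode_sib(uint8_t b) {
    sib_t s = { (unsigned)(b >> 6), (unsigned)((b >> 3) & 7), (unsigned)(b & 7) };
    return s;
}

void check_operand_bytes(void) {
    modrm_t m = decode_modrm(0x7C);
    sib_t s = decode_sib(0x24);
    assert(m.mod == 1);    /* an 8-bit displacement follows (0x18) */
    assert(m.reg == 7);    /* register 7: edi for 8B /r, xmm7 for F3 0F 7E /r */
    assert(m.rm == 4);     /* a SIB byte follows */
    assert(s.base == 4);   /* base register = esp */
    assert(s.index == 4);  /* no index register */
}
```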
edi was just zeroed (xor edi, edi at offset 0x43) to initialize the loop counter i = 0. The buggy mov edi, [esp+18h] overwrites edi with the value of ~crc, destroying the loop counter.
This is what Microsoft's Developer Community report describes: "when the program reaches first scope change of function chorba_small_nondestructive_sse, len parameter of the function gets replaced by contents of crc parameter." The register allocator has confused which value goes where — routing an XMM-destined value through a GPR that holds live data.
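To make the consequence concrete, here is a deliberately simplified model. It is not the zlib-ng loop: count_iterations, its parameters, and the step size of 32 are all hypothetical, chosen only to show how a loop behaves when its counter starts at ~crc instead of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: counts iterations of a plain
 * `for (i = start; i < len; i += step)` loop using 32-bit arithmetic. */
uint32_t count_iterations(uint32_t start, uint32_t len, uint32_t step) {
    uint32_t n = 0;
    for (uint32_t i = start; i < len; i += step) {
        n++;
        if (i > UINT32_MAX - step) break;  /* stop before i would wrap around */
    }
    return n;
}

void check_clobber_model(void) {
    uint32_t crc = 0;
    assert(count_iterations(0, 128, 32) == 4);     /* intended start: i = 0 */
    assert(count_iterations(~crc, 128, 32) == 0);  /* clobbered start: i = ~crc = 0xFFFFFFFF */
}
```

In this model the clobbered loop silently processes no data, which is the kind of behavior that would surface as a wrong checksum rather than a crash.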
_mm_loadl_epi64 is a memory-load primitive that corresponds directly to the MOVQ xmm, m64 instruction. As the disassembly shows, v142 lowers it to that single SSE load, so there is no multi-step sequence for the backend to "synthesize" through GPRs.
_mm_set_epi64x(0, a) is a multi-scalar constructor — the compiler must synthesize it from multiple operations (move the 64-bit value into position, zero the upper half). On v142's 32-bit backend, this synthesis triggers a register allocation bug where it routes part of the operation through a GPR, clobbering live data.
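The "multi-scalar" nature can be sketched at the value level. On a 32-bit target the int64_t argument cannot sit in one GPR, so it is effectively two 32-bit halves. The hypothetical helper set_from_halves below builds the same vector from those halves with _mm_set_epi32. This illustrates the decomposition only; it is not a claim about the exact instruction sequence v142 emits.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: builds { a, 0 } from the two 32-bit halves of a. */
static inline __m128i set_from_halves(int64_t a) {
    uint32_t lo = (uint32_t)(uint64_t)a;
    uint32_t hi = (uint32_t)((uint64_t)a >> 32);
    return _mm_set_epi32(0, 0, (int)hi, (int)lo);  /* arguments are lanes 3, 2, 1, 0 */
}

/* Returns 1 when the half-by-half construction equals _mm_set_epi64x(0, a). */
int halves_match_set_epi64x(int64_t a) {
    __m128i x = _mm_set_epi64x(0, a);
    __m128i y = set_from_halves(a);
    return memcmp(&x, &y, sizeof x) == 0;
}

void check_halves(void) {
    assert(halves_match_set_epi64x(0));
    assert(halves_match_set_epi64x((int64_t)0x00000000FFFFFFFFLL));
    assert(halves_match_set_epi64x((int64_t)0x0123456789ABCDEFLL));
}
```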
Build: MSVC v142 (14.29.30133), Win32 (x86), RelWithDebInfo, Ninja
| Polyfill | chorba_sse2 tests | chorba_sse41 tests | All CRC32 tests |
|---|---|---|---|
| _mm_set_epi64x(0, a) | 24 FAILED / 325 | (not tested separately) | 24 FAILED / 2275 |
| _mm_loadl_epi64(&a) | 325 PASSED | 325 PASSED | 2275 PASSED |
The 24 failures are the test cases with buffer sizes large enough (≥100 bytes) to enter the SSE2 code path through chorba_small_nondestructive_sse2. Smaller buffers take the scalar path and are unaffected.
- Windows 11 Pro (10.0.26200)
- MSVC v142 toolchain version 14.29.30133 (via VS 2026 Community with legacy toolset)
- Target: x86 (Win32, 32-bit)
- Build type: RelWithDebInfo
- MS Developer Community #10853479 — original bug report, closed as fixed in VS 2022 17.11
- zlib-ng/zlib-ng#1872 — original SSE2 Chorba PR (where the bug was first hit)
- zlib-ng/zlib-ng#2260 — current cleanup PR