@nmoinvaz
Created April 17, 2026 20:07
zlib-ng: MSVC v142 chorba SSE2 miscompile — assembly analysis of _mm_cvtsi64_si128 polyfill bug and fix


Overview

MSVC v142 (Visual Studio 2019) miscompiles the _mm_cvtsi64_si128 polyfill on 32-bit Windows when it is implemented as _mm_set_epi64x(0, a). The bug manifests in chorba_small_nondestructive_sse2, where the ~crc value intended for an XMM register is instead routed through a GPR, overwriting the live edi register.

Replacing _mm_set_epi64x(0, a) with _mm_loadl_epi64((const __m128i*)&a) forces the compiler to emit MOVQ xmm, m64, which sidesteps the buggy synthesis path entirely.

The source code trigger

// arch/x86/crc32_chorba_sse2.c, line 43
uint32_t chorba_small_nondestructive_sse2(uint32_t crc, const uint8_t *buf, size_t len) {
    // ...
    uint64_t next1 = ~crc;
    // ...
    __m128i next12 = _mm_cvtsi64_si128(next1);   // ← triggers the bug

On 32-bit x86, there are no 64-bit GPRs, so _mm_cvtsi64_si128 is not natively provided by MSVC. The zlib-ng polyfill implements it in arch/x86/x86_intrins.h.

The polyfill change

 static inline __m128i _mm_cvtsi64_si128(int64_t a) {
-   return _mm_set_epi64x(0, a);        // buggy on v142
+    return _mm_loadl_epi64((const __m128i*)&a);  // correct on all versions
 }

Both are semantically identical — they produce { a, 0 } in the XMM register. The difference is in how v142 lowers them to machine code.

Polyfill assembly (standalone, before inlining)

Buggy (_mm_set_epi64x) — 4 instructions:

__mm_cvtsi64_si128:
  movq        xmm0, mmword ptr [esp+4]
  xorps       xmm1, xmm1
  punpcklqdq  xmm0, xmm1          ; zero-extend high qword
  ret

Fixed (_mm_loadl_epi64) — 2 instructions:

__mm_cvtsi64_si128:
  movq        xmm0, mmword ptr [esp+4]
  ret

Both start with the same MOVQ load. The buggy version adds an unnecessary xorps+punpcklqdq to zero the high qword (which MOVQ already does by definition). These extra instructions are harmless in isolation — the real problem is what v142 does when it inlines this into the caller.

The bug: what v142 generates after inlining

Function prologue and register assignments

; uint32_t chorba_small_nondestructive_sse2(uint32_t crc, const uint8_t *buf, size_t len)
;   cdecl: crc=[ebp+8], buf=[ebp+0Ch], len=[ebp+10h]

_chorba_small_nondestructive_sse2:
  push        ebp
  mov         ebp, esp
  and         esp, 0FFFFFFF0h           ; align stack to 16 bytes
  sub         esp, 148h                 ; allocate locals
  ; ... security cookie ...
  mov         eax, dword ptr [ebp+0Ch]  ; eax = buf
  push        esi
  push        edi
  ; ... memset final[9] ...
  mov         eax, dword ptr [ebp+8]    ; eax = crc
  mov         esi, dword ptr [ebp+10h]  ; esi = len          ← len lives in esi
  not         eax                       ; eax = ~crc
  mov         ecx, dword ptr [esp+14h]  ; ecx = buf (reloaded)
  xor         edi, edi                  ; edi = 0 (i = 0)    ← i lives in edi
  mov         dword ptr [esp+18h], eax  ; [esp+18h] = ~crc (next1, as 64-bit with high dword = 0)
  ; ... xorps xmm0/xmm3 to zero next34/next56 ...

At this point, edi = 0 (loop counter i) and esi = len. The value ~crc has been stored to [esp+18h] as a 64-bit value: the low dword holds ~crc, and the high dword is zeroed by an instruction at offset 0x4C (elided above).

The critical instruction — offset 0x57

Buggy:

  00000057: 8B 7C 24 18        mov    edi, dword ptr [esp+18h]   ; edi = ~crc  ← WRONG!

Fixed:

  00000057: F3 0F 7E 7C 24 18  movq   xmm7, mmword ptr [esp+18h] ; xmm7 = ~crc ← CORRECT

What went wrong

The compiler is materializing next12 = _mm_cvtsi64_si128(next1). This should load ~crc into an XMM register. Instead, v142 emits a 32-bit GPR load into edi.

edi was just zeroed (xor edi, edi at offset 0x43) to initialize the loop counter i = 0. The buggy mov edi, [esp+18h] overwrites edi with the value of ~crc, destroying the loop counter.

This is what Microsoft's Developer Community report describes: "when the program reaches first scope change of function chorba_small_nondestructive_sse, len parameter of the function gets replaced by contents of crc parameter." The register allocator has confused which value goes where — routing an XMM-destined value through a GPR that holds live data.

Why _mm_loadl_epi64 fixes it

_mm_loadl_epi64 is a memory-load primitive — it maps directly to the MOVQ xmm, m64 instruction with no intermediate steps. The compiler has no freedom to "synthesize" it through GPRs because the intrinsic's semantics demand a single SSE load instruction.

_mm_set_epi64x(0, a) is a multi-scalar constructor — the compiler must synthesize it from multiple operations (move the 64-bit value into position, zero the upper half). On v142's 32-bit backend, this synthesis triggers a register allocation bug where it routes part of the operation through a GPR, clobbering live data.

Test results

Build: MSVC v142 (14.29.30133), Win32 (x86), RelWithDebInfo, Ninja

Polyfill              | chorba_sse2 tests | chorba_sse41 tests      | All CRC32 tests
_mm_set_epi64x(0, a)  | 24 FAILED / 325   | (not tested separately) | 24 FAILED / 2275
_mm_loadl_epi64(&a)   | 325 PASSED        | 325 PASSED              | 2275 PASSED

The 24 failures are the test cases with buffer sizes large enough (≥100 bytes) to enter the SSE2 code path through chorba_small_nondestructive_sse2. Smaller buffers take the scalar path and are unaffected.

Machine specs

  • Windows 11 Pro (10.0.26200)
  • MSVC v142 toolchain version 14.29.30133 (via VS 2026 Community with legacy toolset)
  • Target: x86 (Win32, 32-bit)
  • Build type: RelWithDebInfo
