public inbox for netdev@vger.kernel.org
* [PATCH 0/1] lib/zlib: fix GCOV-induced crashes from concurrent inflate_fast()
@ 2026-03-30 14:32 Konstantin Khorenko
  2026-03-30 14:32 ` [PATCH 1/1] lib/zlib: use atomic GCOV counters to prevent crash in inflate_fast Konstantin Khorenko
  0 siblings, 1 reply; 3+ messages in thread
From: Konstantin Khorenko @ 2026-03-30 14:32 UTC (permalink / raw)
  To: Peter Oberparleiter, Mikhail Zaslonko
  Cc: Steffen Klassert, Herbert Xu, Masahiro Yamada, Josh Poimboeuf,
	Vasileios Almpanis, Pavel Tikhomirov, linux-kernel, netdev,
	Konstantin Khorenko

## Summary

GCC can merge global GCOV branch counters with loop induction variables,
causing out-of-bounds memory writes when the same function executes
concurrently on multiple CPUs. We observed kernel crashes in zlib's
inflate_fast() during IPComp (IP Payload Compression) decompression when
the kernel is built with CONFIG_GCOV_KERNEL=y.

This patch adds -fprofile-update=atomic to the zlib Makefiles to prevent
the problematic optimization.

## Problem Discovery

The issue was discovered while running the LTP networking stress test on
a 6.12-based kernel with GCOV enabled:

  Test:     LTP net_stress.ipsec_udp -> udp4_ipsec06
  Command:  udp_ipsec.sh -p comp -m transport -s 100:1000:65000:R65000
  Kernel:   6.12.0-55.52.1.el10 based (x86_64) with CONFIG_GCOV_KERNEL=y

The crash occurred in inflate_fast() during concurrent IPComp packet
decompression:

  BUG: unable to handle page fault for address: ffffd0a3c0902ffa
  RIP: inflate_fast+1431
  Call Trace:
   zlib_inflate
   __deflate_decompress
   crypto_comp_decompress
   ipcomp_decompress [xfrm_ipcomp]
   ipcomp_input [xfrm_ipcomp]
   xfrm_input

Analysis showed a write 3.4 MB past the end of a 65 KB decompression
buffer, hitting unmapped vmalloc guard pages.

## Verification on Upstream Kernel

We verified the bug is present in upstream Linux v7.0-rc5 (commit
46b513250491) with both:

- GCC 14.2.1 20250110 (Red Hat 14.2.1-7)
- GCC 16.0.1 20260327 (experimental, built from source)

We compiled lib/zlib_inflate/inffast.c with the kernel's standard GCOV
flags (including the existing -fno-tree-loop-im workaround from commit
2b40e1ea76d4) and inspected the assembly output. Both compilers exhibit
the problematic optimization.

## Root Cause: GCOV Counter IV-Merging

When CONFIG_GCOV_KERNEL=y, GCC instruments the control-flow arcs between
basic blocks with counter updates. In zlib's inner copy loops, GCC can
merge these global GCOV counters with the loop induction variables that
compute store addresses.

For example, in the pattern-fill loop in inflate_fast() (conceptually
`do { *sout++ = pat16; } while (--loops);`), GCC may generate code that:

1. Loads the current GCOV counter value from global memory
2. Uses that value to compute the base address, start index, and end bound
3. On each iteration, uses the counter as both:
   - The value to write back to the global counter
   - The index for the data store

This optimization is valid for single-threaded code but breaks when the
same function executes concurrently on multiple CPUs, because the GCOV
counter is a single global variable shared by all CPUs.
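
The failure mode can be sketched in plain C. This is a deliberately
simplified model, not GCC's output: `model_counter`, `last_store_offset()`
and the `skew` parameter are hypothetical names, and the model assumes that
the setup phase and the iteration phase observe different counter values.
Instead of storing through a pointer, it reports the element offset
(relative to the output buffer) that the final store would hit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one shared GCOV counter slot. */
static int64_t model_counter;

/*
 * Model of a loop whose store index has been merged with the counter.
 * `skew` simulates increments made by another CPU between the setup
 * phase and the iteration phase; 0 means no interference.
 * Returns the element offset of the last store relative to the buffer.
 */
static int64_t last_store_offset(size_t loops, int64_t skew)
{
    /* Setup phase: bias the base by the counter value read now. */
    int64_t bias = -model_counter;

    /* Simulated concurrent update from another CPU. */
    model_counter += skew;

    /* Iteration phase: the counter doubles as the store index. */
    int64_t idx = model_counter;
    int64_t last = 0;
    for (size_t i = 0; i < loops; i++) {
        last = bias + idx;      /* where this store would land */
        idx++;
        model_counter = idx;    /* counter write-back */
    }
    return last;
}
```

Without interference, four iterations end at offset 3, as expected. With a
purely illustrative skew of 1,700,000 increments, the last store lands at
element offset 1,700,003; with 2-byte elements that is roughly 3.4 MB past
the start of the buffer. The actual skew in the observed crash is not
known; the number is chosen only to show the order of magnitude.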

### Assembly Examples

**GCC 14.2.1 (Red Hat)** — pattern-fill loop without -fprofile-update=atomic:

    movq    __gcov0.inflate_fast+248(%rip), %r9
    leaq    1(%r9), %rax                        # rax = counter + 1
    leaq    (%r9,%rdi), %rsi                    # rsi = counter + loops (end)
    negq    %r9
    leaq    (%r15,%r9,2), %r9                   # r9 = base - counter*2
  .L42:
    movq    %rax, __gcov0.inflate_fast+248(%rip)
    movw    %cx, (%r9,%rax,2)                   # WRITE using merged counter
    addq    $1, %rax
    cmpq    %rax, %rsi
    jne     .L42

Here, %rax serves a dual purpose: it is both the GCOV counter value and the
memory index. If another CPU updates __gcov0.inflate_fast+248 between the
initial load and the loop iteration, the write address becomes invalid.

**GCC 16.0.1 (experimental)** — same pattern, different register allocation:

    movq    __gcov0.inflate_fast+248(%rip), %r14
    leaq    1(%r14), %rdx
    leaq    (%r14,%r8), %rdi                    # end bound from counter
    negq    %r14
    leaq    (%rbx,%r14,2), %rbx                 # base address from counter
  .L42:
    movq    %rdx, __gcov0.inflate_fast+248(%rip)
    movw    %si, (%rbx,%rdx,2)                  # WRITE using merged counter
    addq    $1, %rdx
    cmpq    %rdx, %rdi
    jne     .L42

Both compilers merge the counter with the loop IV; both are vulnerable.

**With -fprofile-update=atomic** (GCC 14.2.1):

    movq    %rdi, %r9                           # pure pointer
  .L42:
    lock addq   $1, __gcov0.inflate_fast+248(%rip)
    addq    $2, %r9
    movw    %ax, -2(%r9)                        # WRITE using pure pointer
    subq    $1, %rdx
    jne     .L42

The GCOV counter update is isolated as a standalone atomic instruction.
The write address is computed purely from local registers, making
concurrent execution safe.
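
In C terms, the atomic variant behaves roughly like the sketch below. It is
hand-written for illustration: `atomic_counter` and `fill_pattern_atomic()`
are hypothetical names, and the relaxed `__atomic_fetch_add()` builtin is
used here as an approximation of what the instrumentation emits, not as a
statement about GCC internals.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one shared GCOV counter slot. */
static int64_t atomic_counter;

/*
 * Pattern fill with the counter update kept separate from addressing.
 * The counter is only ever touched by an atomic read-modify-write, so
 * its value never feeds into the store address.
 */
static void fill_pattern_atomic(uint16_t *out, size_t loops, uint16_t pat)
{
    if (loops == 0)
        return;
    do {
        /* Standalone atomic increment (lock addq on x86_64). */
        __atomic_fetch_add(&atomic_counter, 1, __ATOMIC_RELAXED);
        /* Store address comes from the local pointer only. */
        *out++ = pat;
    } while (--loops);
}
```

Calling `fill_pattern_atomic(buf, 6, 0xABCD)` writes exactly six elements
regardless of what other threads do to `atomic_counter` in the meantime.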

## Why Existing Workarounds Don't Help

The kernel already passes -fno-tree-loop-im when building with GCOV
(commit 2b40e1ea76d4, "gcov: disable tree-loop-im to reduce stack usage").
That flag prevents GCC from hoisting loop-invariant memory operations out
of loops, which was causing excessive stack usage.

However, -fno-tree-loop-im does NOT prevent the IVopts (induction variable
optimization) pass from merging GCOV counters with loop induction
variables.  These are separate optimization passes, and the IV-merging
happens even with -fno-tree-loop-im present.
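
The difference between the two transformations can be illustrated with
hand-written C equivalents. These are sketches, not compiler output, and
`lim_counter` and both function names are hypothetical. The first function
shows the kind of rewrite that loop-invariant motion performs and that
-fno-tree-loop-im disables; the second shows the loop shape that remains
with the flag set, which still has two variables advancing in lockstep.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one shared GCOV counter slot. */
static int64_t lim_counter;

/*
 * Store motion, as loop-invariant motion may perform it: the counter
 * lives in a local for the whole loop and is written back once.
 * -fno-tree-loop-im turns this rewrite off.
 */
static void fill_with_store_motion(uint16_t *out, size_t loops, uint16_t pat)
{
    int64_t c = lim_counter;
    for (size_t i = 0; i < loops; i++) {
        c++;
        out[i] = pat;
    }
    lim_counter = c;
}

/*
 * Loop shape with -fno-tree-loop-im: the counter is updated in memory
 * on every iteration. The counter and `i` still step together, which
 * is the relationship an induction-variable pass can exploit.
 */
static void fill_counter_in_loop(uint16_t *out, size_t loops, uint16_t pat)
{
    for (size_t i = 0; i < loops; i++) {
        lim_counter++;
        out[i] = pat;
    }
}
```

Both functions produce the same buffer contents and the same final counter
value when run on a single thread; they differ only in when the shared
counter is read and written.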

## The Fix: -fprofile-update=atomic

Adding -fprofile-update=atomic tells GCC that profile counters may be
accessed concurrently. This causes GCC to:

1. Use atomic instructions (lock addq) for counter updates
2. Treat counter variables as opaque — they cannot be merged with loop IVs

This completely eliminates the problematic optimization while preserving
GCOV functionality.

The flag is added only to lib/zlib_inflate/, lib/zlib_deflate/, and
lib/zlib_dfltcc/ Makefiles to minimize performance overhead. Zlib is
particularly vulnerable because:

- inflate_fast() has tight inner loops that GCC heavily optimizes
- IPComp and other subsystems call zlib from multiple CPUs concurrently
- The per-CPU isolation (scratch buffers, crypto transforms) protects
  data structures but cannot protect global GCOV counters

Applying -fprofile-update=atomic globally would make all GCOV counter
updates use atomic instructions, adding overhead throughout the kernel.
By scoping it to zlib, we fix the known crash site with minimal impact.

## GCC Bug Report Status

A bug report for the upstream GCC project is being filed.

I am not sure whether the GCC team will accept this as a bug: the
optimization looks technically valid under the C standard (data races on
non-atomic variables are undefined behavior), but it creates a practical
problem for kernel code that routinely executes the same function on
multiple CPUs.

In any case, the kernel needs a fix or workaround until the compiler is
changed, if it ever is.

## Testing

Verified the fix by:

1. Compiling lib/zlib_inflate/inffast.c with and without
   -fprofile-update=atomic using the kernel's exact build flags
2. Inspecting assembly output to confirm the atomic flag prevents
   counter-IV merging
3. Testing on upstream v7.0-rc5 with GCC 14.2.1 and GCC 16.0.1

---

Konstantin Khorenko (1):
  lib/zlib: use atomic GCOV counters to prevent crash in inflate_fast

 lib/zlib_deflate/Makefile | 6 ++++++
 lib/zlib_dfltcc/Makefile  | 6 ++++++
 lib/zlib_inflate/Makefile | 7 +++++++
 3 files changed, 19 insertions(+)

-- 
2.43.0


* [PATCH 1/1] lib/zlib: use atomic GCOV counters to prevent crash in inflate_fast
  2026-03-30 14:32 [PATCH 0/1] lib/zlib: fix GCOV-induced crashes from concurrent inflate_fast() Konstantin Khorenko
@ 2026-03-30 14:32 ` Konstantin Khorenko
  2026-04-01  9:44   ` Peter Oberparleiter
  0 siblings, 1 reply; 3+ messages in thread
From: Konstantin Khorenko @ 2026-03-30 14:32 UTC (permalink / raw)
  To: Peter Oberparleiter, Mikhail Zaslonko
  Cc: Steffen Klassert, Herbert Xu, Masahiro Yamada, Josh Poimboeuf,
	Vasileios Almpanis, Pavel Tikhomirov, linux-kernel, netdev,
	Konstantin Khorenko

GCC's GCOV instrumentation can merge global branch counters with loop
induction variables as an optimization. In inflate_fast(), the inner
copy loops can be transformed so that GCOV counter values participate
in computing loop addresses and bounds. Since GCOV counters are global
(not per-CPU), concurrent execution on different CPUs causes the counter
to change mid-computation, producing inconsistent address calculations
and out-of-bounds memory writes.

The crash manifests during IPComp (IP Payload Compression) processing
when inflate_fast() runs concurrently on multiple CPUs:

  BUG: unable to handle page fault for address: ffffd0a3c0902ffa
  RIP: inflate_fast+1431
  Call Trace:
   zlib_inflate
   __deflate_decompress
   crypto_comp_decompress
   ipcomp_decompress [xfrm_ipcomp]
   ipcomp_input [xfrm_ipcomp]
   xfrm_input

In one observed case, the compiler merged a global GCOV counter with the
loop induction variable that also indexed stores. Another CPU modified
the counter between the setup and iteration phases, causing a write
3.4 MB past the end of a 65 KB buffer.

The kernel already uses -fno-tree-loop-im for GCOV builds (commit
2b40e1ea76d4) to prevent a different optimization issue. That flag
prevents GCC from hoisting loop-invariant memory operations but does
NOT prevent the IVopts pass from merging counters with induction
variables.

Add -fprofile-update=atomic to zlib Makefiles. This tells GCC that
GCOV counters may be concurrently accessed, causing counter updates to
use atomic instructions (lock addq) instead of plain load/store.
This prevents the compiler from merging counters with loop induction
variables. The flag is scoped to zlib only to minimize performance
overhead from atomic operations in the rest of the kernel.

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Reviewed-by: Vasileios Almpanis <vasileios.almpanis@virtuozzo.com>
Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 lib/zlib_deflate/Makefile | 6 ++++++
 lib/zlib_dfltcc/Makefile  | 6 ++++++
 lib/zlib_inflate/Makefile | 7 +++++++
 3 files changed, 19 insertions(+)

diff --git a/lib/zlib_deflate/Makefile b/lib/zlib_deflate/Makefile
index 2622e03c0b942..dc0b3e5660e9e 100644
--- a/lib/zlib_deflate/Makefile
+++ b/lib/zlib_deflate/Makefile
@@ -7,6 +7,12 @@
 # decompression code.
 #
 
+# Force atomic GCOV counter updates to prevent GCC from merging global
+# counters with loop induction variables (see lib/zlib_inflate/Makefile).
+ifdef CONFIG_GCOV_KERNEL
+ccflags-y += -fprofile-update=atomic
+endif
+
 obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate.o
 
 zlib_deflate-objs := deflate.o deftree.o deflate_syms.o
diff --git a/lib/zlib_dfltcc/Makefile b/lib/zlib_dfltcc/Makefile
index 66e1c96387c40..fb08749d2ee7b 100644
--- a/lib/zlib_dfltcc/Makefile
+++ b/lib/zlib_dfltcc/Makefile
@@ -6,6 +6,12 @@
 # This is the code for s390 zlib hardware support.
 #
 
+# Force atomic GCOV counter updates to prevent GCC from merging global
+# counters with loop induction variables (see lib/zlib_inflate/Makefile).
+ifdef CONFIG_GCOV_KERNEL
+ccflags-y += -fprofile-update=atomic
+endif
+
 obj-$(CONFIG_ZLIB_DFLTCC) += zlib_dfltcc.o
 
 zlib_dfltcc-objs := dfltcc.o dfltcc_deflate.o dfltcc_inflate.o
diff --git a/lib/zlib_inflate/Makefile b/lib/zlib_inflate/Makefile
index 27327d3e9f541..8707c649adda5 100644
--- a/lib/zlib_inflate/Makefile
+++ b/lib/zlib_inflate/Makefile
@@ -14,6 +14,13 @@
 # uncompression can be done without blocking on allocation).
 #
 
+# Force atomic GCOV counter updates to prevent GCC from merging global
+# counters with loop induction variables — concurrent inflate_fast()
+# execution on multiple CPUs causes out-of-bounds writes otherwise.
+ifdef CONFIG_GCOV_KERNEL
+ccflags-y += -fprofile-update=atomic
+endif
+
 obj-$(CONFIG_ZLIB_INFLATE) += zlib_inflate.o
 
 zlib_inflate-objs := inffast.o inflate.o infutil.o \
-- 
2.43.0



* Re: [PATCH 1/1] lib/zlib: use atomic GCOV counters to prevent crash in inflate_fast
  2026-03-30 14:32 ` [PATCH 1/1] lib/zlib: use atomic GCOV counters to prevent crash in inflate_fast Konstantin Khorenko
@ 2026-04-01  9:44   ` Peter Oberparleiter
  0 siblings, 0 replies; 3+ messages in thread
From: Peter Oberparleiter @ 2026-04-01  9:44 UTC (permalink / raw)
  To: Konstantin Khorenko, Mikhail Zaslonko
  Cc: Steffen Klassert, Herbert Xu, Masahiro Yamada, Josh Poimboeuf,
	Vasileios Almpanis, Pavel Tikhomirov, linux-kernel, netdev

On 30.03.2026 16:32, Konstantin Khorenko wrote:
> GCC's GCOV instrumentation can merge global branch counters with loop
> induction variables as an optimization. In inflate_fast(), the inner
> copy loops can be transformed so that GCOV counter values participate
> in computing loop addresses and bounds. Since GCOV counters are global
> (not per-CPU), concurrent execution on different CPUs causes the counter
> to change mid-computation, producing inconsistent address calculations
> and out-of-bounds memory writes.
> 
> The crash manifests during IPComp (IP Payload Compression) processing
> when inflate_fast() runs concurrently on multiple CPUs:
> 
>   BUG: unable to handle page fault for address: ffffd0a3c0902ffa
>   RIP: inflate_fast+1431
>   Call Trace:
>    zlib_inflate
>    __deflate_decompress
>    crypto_comp_decompress
>    ipcomp_decompress [xfrm_ipcomp]
>    ipcomp_input [xfrm_ipcomp]
>    xfrm_input
> 
> In one observed case, the compiler merged a global GCOV counter with the
> loop induction variable that also indexed stores. Another CPU modified
> the counter between the setup and iteration phases, causing a write
> 3.4 MB past the end of a 65 KB buffer.
> 
> The kernel already uses -fno-tree-loop-im for GCOV builds (commit
> 2b40e1ea76d4) to prevent a different optimization issue. That flag
> prevents GCC from hoisting loop-invariant memory operations but does
> NOT prevent the IVopts pass from merging counters with induction
> variables.
> 
> Add -fprofile-update=atomic to zlib Makefiles. This tells GCC that
> GCOV counters may be concurrently accessed, causing counter updates to
> use atomic instructions (lock addq) instead of plain load/store.
> This prevents the compiler from merging counters with loop induction
> variables. The flag is scoped to zlib only to minimize performance
> overhead from atomic operations in the rest of the kernel.
> 
> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
> Reviewed-by: Vasileios Almpanis <vasileios.almpanis@virtuozzo.com>
> Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

I'm all for introducing -fprofile-update=atomic to GCOV CFLAGS as it not
only addresses this bug, but makes coverage data more consistent
overall. My only suggestion would be to apply it at global scope
(top-level Makefile), not restricting it to zlib alone. Since
GCOV-instrumented kernels already have a significant performance hit due
to the added profiling code, this side-effect of using atomic
instructions can IMO be safely ignored.

Unfortunately, while compile-testing this suggested change to the global
Makefile, I ran into the following build assert which needs more
investigation:

net/core/skbuff.c:5163:9: note: in expansion of macro ‘BUILD_BUG_ON’
 5163 |         BUILD_BUG_ON(skb_ext_total_length() > 255);


-- 
Peter Oberparleiter
Linux on IBM Z Development - IBM Germany R&D


