* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
2026-04-01 16:58 ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
@ 2026-04-02 4:42 ` Nam Cao
2026-04-02 8:59 ` Ionut Nechita (Wind River)
2026-04-02 9:49 ` Tomas Glozar
1 sibling, 1 reply; 4+ messages in thread
From: Nam Cao @ 2026-04-02 4:42 UTC (permalink / raw)
To: Ionut Nechita (Wind River), jan.kiszka
Cc: crwood, florian.bezdeka, ionut.nechita, brauner, linux-fsdevel,
linux-rt-users, stable, linux-kernel, bpf, frederic, vschneid,
gregkh, chris.friesen, viorel-catalin.rapiteanu, iulian.mocanu
"Ionut Nechita (Wind River)" <ionut.nechita@windriver.com> writes:
> Crystal, Jan, Florian, thanks for the detailed feedback. I've redone
> all testing addressing each point raised. All tests below use HT
> disabled (sibling cores offlined), as Jan requested.
>
> Setup:
> - Hardware: Intel Xeon Gold 6338N (Ice Lake, single socket,
> 32 cores, HT disabled via sibling cores offlined)
> - Boot: nohz_full=1-16 isolcpus=nohz,domain,managed_irq,1-16
> rcu_nocbs=1-31 kthread_cpus=0 irqaffinity=17-31
> iommu=pt nmi_watchdog=0 intel_pstate=disable skew_tick=1
> - eosnoise invoked as: ./eosnoise -c 1-15
> - Duration: 120s per test
>
> Tested kernels (all vanilla, built from upstream sources):
> - 6.18.20-vanilla (non-RT, PREEMPT_DYNAMIC)
> - 6.18.20-vanilla (PREEMPT_RT, with and without rwlock revert)
> - 7.0.0-rc6-next-20260331 (PREEMPT_RT, with and without rwlock revert)
>
> I tested 6 configurations to isolate the exact failure mode:
>
> # Kernel Config Tool Revert Result
> -- --------------- -------- --------------- ------- ----------------
> 1 6.18.20 non-RT eosnoise no clean (100%)
> 2 6.18.20 RT eosnoise no D state (hung)
> 3 6.18.20 RT eosnoise yes clean (100%)
> 4 6.18.20 RT kernel osnoise no clean (99.999%)
> 5 7.0-rc6-next RT eosnoise no 93% avail, 57us
> 6 7.0-rc6-next RT eosnoise yes clean (99.99%)
Thanks for the detailed analysis.
> Key findings:
>
> 1. On 6.18.20-rt with spinlock, eosnoise hangs permanently in D state.
>
> The process blocks in do_epoll_ctl() during perf_buffer__new() setup
> (libbpf's perf_event_open + epoll_ctl loop). strace shows progressive
> degradation as fds are added to the epoll instance:
>
> CPU 0-13: epoll_ctl ~8 us (normal)
> CPU 14: epoll_ctl 16 ms (2000x slower)
> CPU 15: epoll_ctl 80 ms (10000x slower)
> CPU 16: epoll_ctl 80 ms
> CPU 17: epoll_ctl 20 ms
> CPU 18: epoll_ctl -- hung, never returns --
>
> Kernel stack of the hung process (3+ minutes in D state):
>
> [<0>] do_epoll_ctl+0xa57/0xf20
> [<0>] __x64_sys_epoll_ctl+0x5d/0xa0
> [<0>] do_syscall_64+0x7c/0xe30
> [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> 2. On 7.0-rc6-next-rt with spinlock, eosnoise runs but with severe
> noise. The difference from 6.18 is likely due to additional fixes in
> linux-next that prevent the complete deadlock but not the contention.
>
> 3. Kernel osnoise tracer (test #4) shows zero noise on the same
> 6.18.20-rt+spinlock kernel where eosnoise hangs. This confirms the
> issue is specifically in the epoll rt_mutex path, not in osnoise
> measurement methodology.
>
> Kernel osnoise output (6.18.20-rt, spinlock, no revert):
> 99.999% availability, 1-4 ns max noise, RES=6 total in 120s
>
> 4. Non-RT kernel (test #1) with the same spinlock change shows zero
> noise. This confirms the issue is the spinlock-to-rt_mutex conversion
> on PREEMPT_RT, not the spinlock change itself.
>
> IRQ deltas on isolated CPU1 (120s):
>
> 6.18.20-rt 6.18.20-rt 6.18.20 6.18.20-rt
> spinlock rwlock(rev) non-RT kernel osnoise
> RES (IPI): (D state) 3 1 6
> LOC (timer): (D state) 3,325 1,185 245
> IWI (irq work): (D state) 565,988 1,433 121
>
> 7.0-rc6-rt 7.0-rc6-rt
> spinlock rwlock(rev)
> RES (IPI): 330,000+ 2
> LOC (timer): 120,585 120,585
> IWI (irq work): 585,785 585,785
>
> The mechanism, refined:
>
> Crystal was right that this is specific to the BPF perf_event_output +
> epoll pattern, not any arbitrary epoll user. I verified this: a plain
> perf_event_open + epoll_ctl program without BPF does not trigger the
> issue.
>
> What triggers it is libbpf's perf_buffer__new(), which creates one
> PERF_COUNT_SW_BPF_OUTPUT perf_event per CPU, mmaps the ring buffer,
> and adds all fds to a single epoll instance. When BPF programs are
> attached to high-frequency tracepoints (irq_handler_entry/exit,
> softirq_entry/exit, sched_switch), every interrupt on every CPU calls
> bpf_perf_event_output() which invokes ep_poll_callback() under
> ep->lock.
>
> On PREEMPT_RT, ep->lock is an rt_mutex. With 15+ CPUs generating
> callbacks simultaneously into the same epoll instance, the rt_mutex
> PI mechanism creates unbounded contention. On 6.18 this results in
> a permanent D state hang. On 7.0 it results in ~330,000 reschedule
> IPIs hitting isolated cores over 120 seconds (~2,750/s per core).
>
> With the rwlock, ep_poll_callback() takes the read lock, which allows
> concurrent readers without cross-CPU contention -- the callbacks
> execute in parallel without generating IPIs.
These IPIs do not exist without eosnoise running; eosnoise introduces
this noise into the system itself. For a noise tracer, it is certainly
eosnoise's responsibility to make sure it does not measure noise
originating from itself.
> This pattern (BPF tracepoint programs + perf ring buffer + epoll) is
> the standard architecture used by BCC tools (opensnoop, execsnoop,
> biolatency, tcpconnect, etc.), bpftrace, and any libbpf-based
> observability tool. A permanent D state hang when running such tools
> on PREEMPT_RT is a significant regression.
7.0-rc6-next still uses the spinlock but shows no hang. You are likely
hitting a different bug here, one that appears when the spinlock is
used and that was fixed somewhere between 6.18.20 and 7.0-rc6-next.
If you still have the energy for it, a git bisect between 6.18.20 and
7.0-rc6-next would tell us which commit made the hang disappear.
> I'm not proposing a specific fix -- the previous suggestions
> (raw_spinlock trylock, lockless path) were rightly rejected. But the
> regression exists and needs to be addressed. The ep->lock contention
> under high-frequency BPF callbacks on PREEMPT_RT is a new problem
> that the rwlock->spinlock conversion introduced.
>
> Separate question: could eosnoise itself be improved to avoid this
> contention? For example, using one epoll instance per CPU instead of
> a single shared one, or using BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
> instead of the per-cpu perf buffer which requires epoll. If the
> consensus is that the kernel side is working as intended and the tool
> should adapt, I'd like to understand what the recommended pattern is
> for BPF observability tools on PREEMPT_RT.
I am not familiar with eosnoise, so I can't tell you. I tried
compiling eosnoise, but that failed; I managed to fix the compile
failure, then hit a run-time failure.
It depends on what eosnoise is using epoll for. If it is just waiting
for PERF_COUNT_SW_BPF_OUTPUT to happen, perhaps we can change to some
sort of polling implementation (e.g. wake up every 100ms to check for
data).
Best regards,
Nam
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
2026-04-01 16:58 ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
2026-04-02 4:42 ` Nam Cao
@ 2026-04-02 9:49 ` Tomas Glozar
1 sibling, 0 replies; 4+ messages in thread
From: Tomas Glozar @ 2026-04-02 9:49 UTC (permalink / raw)
To: Ionut Nechita (Wind River)
Cc: jan.kiszka, crwood, florian.bezdeka, namcao, brauner,
linux-fsdevel, linux-rt-users, stable, linux-kernel, bpf,
frederic, vschneid, gregkh, chris.friesen,
viorel-catalin.rapiteanu, iulian.mocanu
On Wed, Apr 1, 2026 at 19:08 Ionut Nechita (Wind River)
<ionut.nechita@windriver.com> wrote:
>
> Separate question: could eosnoise itself be improved to avoid this
> contention? For example, using one epoll instance per CPU instead of
> a single shared one, or using BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
> instead of the per-cpu perf buffer which requires epoll.
Neither BPF ring buffers nor perf event buffers strictly require you
to use epoll. Just as a BPF ring buffer can be read using libbpf's
ring_buffer__consume() [1] without polling, perf_buffer__consume() [2]
can be used the same way for the perf event ring buffer; neither of
these functions blocks. If you need to poll, the BPF ring buffer also
uses epoll_wait() [3], so that won't make a difference (or is there
another way to poll it?)
[1] https://docs.ebpf.io/ebpf-library/libbpf/userspace/ring_buffer__consume/
[2] https://docs.ebpf.io/ebpf-library/libbpf/userspace/perf_buffer__consume/
[3] https://github.com/libbpf/libbpf/blob/master/src/ringbuf.c#L341
That being said, the BPF ring buffer is not per-CPU, so it allows
collecting data from all CPUs into one buffer.
> If the consensus is that the kernel side is working as intended and the tool
> should adapt, I'd like to understand what the recommended pattern is
> for BPF observability tools on PREEMPT_RT.
The ideal solution is to aggregate data directly in BPF, not in
userspace, and to collect it at the end of the measurement when
possible. This is what rtla-timerlat does for collecting samples [4];
it was implemented that way to prevent the collecting user-space
thread from being overloaded with too many samples on systems with a
large number of CPUs. Polling on the ring buffer is used only to
signal the end of tracing when the latency threshold is crossed, and
no issues have been reported with that. To collect data about system
noise, timerlat records the events in an ftrace ring buffer and then
analyzes the tail of the buffer (i.e. what is relevant to the spike,
not all data from the entire measurement) in user space [5]. The same
could be replicated in eosnoise, i.e. collecting the data into a ring
buffer and reading only the tail in userspace, if that suffices for
your use case.
[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/tracing/rtla/src/timerlat.bpf.c
[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/tracing/rtla/src/timerlat_aa.c
Tomas