public inbox for stable@vger.kernel.org
From: Nam Cao <namcao@linutronix.de>
To: "Ionut Nechita (Wind River)" <ionut.nechita@windriver.com>,
	jan.kiszka@siemens.com
Cc: crwood@redhat.com, florian.bezdeka@siemens.com,
	ionut.nechita@windriver.com, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org,
	bpf@vger.kernel.org, frederic@kernel.org, vschneid@redhat.com,
	gregkh@linuxfoundation.org, chris.friesen@windriver.com,
	viorel-catalin.rapiteanu@windriver.com,
	iulian.mocanu@windriver.com
Subject: Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
Date: Thu, 02 Apr 2026 06:42:32 +0200	[thread overview]
Message-ID: <878qb6x9af.fsf@yellow.woof> (raw)
In-Reply-To: <20260401165841.532687-1-ionut.nechita@windriver.com>

"Ionut Nechita (Wind River)" <ionut.nechita@windriver.com> writes:
> Crystal, Jan, Florian, thanks for the detailed feedback. I've redone
> all testing addressing each point raised. All tests below use HT
> disabled (sibling cores offlined), as Jan requested.
>
> Setup:
>   - Hardware: Intel Xeon Gold 6338N (Ice Lake, single socket,
>     32 cores, HT disabled via sibling cores offlined)
>   - Boot: nohz_full=1-16 isolcpus=nohz,domain,managed_irq,1-16
>     rcu_nocbs=1-31 kthread_cpus=0 irqaffinity=17-31
>     iommu=pt nmi_watchdog=0 intel_pstate=none skew_tick=1
>   - eosnoise run with: ./eosnoise -c 1-15
>   - Duration: 120s per test
>
> Tested kernels (all vanilla, built from upstream sources):
>   - 6.18.20-vanilla      (non-RT, PREEMPT_DYNAMIC)
>   - 6.18.20-vanilla      (PREEMPT_RT, with and without rwlock revert)
>   - 7.0.0-rc6-next-20260331 (PREEMPT_RT, with and without rwlock revert)
>
> I tested 6 configurations to isolate the exact failure mode:
>
>   #  Kernel          Config   Tool            Revert  Result
>   -- --------------- -------- --------------- ------- ----------------
>   1  6.18.20         non-RT   eosnoise        no      clean (100%)
>   2  6.18.20         RT       eosnoise        no      D state (hung)
>   3  6.18.20         RT       eosnoise        yes     clean (100%)
>   4  6.18.20         RT       kernel osnoise  no      clean (99.999%)
>   5  7.0-rc6-next    RT       eosnoise        no      93% avail, 57us
>   6  7.0-rc6-next    RT       eosnoise        yes     clean (99.99%)

Thanks for the detailed analysis.

> Key findings:
>
> 1. On 6.18.20-rt with spinlock, eosnoise hangs permanently in D state.
>
>    The process blocks in do_epoll_ctl() during perf_buffer__new() setup
>    (libbpf's perf_event_open + epoll_ctl loop). strace shows progressive
>    degradation as fds are added to the epoll instance:
>
>      CPU  0-13:  epoll_ctl  ~8 us     (normal)
>      CPU 14:     epoll_ctl  16 ms     (2000x slower)
>      CPU 15:     epoll_ctl  80 ms     (10000x slower)
>      CPU 16:     epoll_ctl  80 ms
>      CPU 17:     epoll_ctl  20 ms
>      CPU 18:     epoll_ctl  -- hung, never returns --
>
>    Kernel stack of the hung process (3+ minutes in D state):
>
>      [<0>] do_epoll_ctl+0xa57/0xf20
>      [<0>] __x64_sys_epoll_ctl+0x5d/0xa0
>      [<0>] do_syscall_64+0x7c/0xe30
>      [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> 2. On 7.0-rc6-next-rt with spinlock, eosnoise runs but with severe
>    noise. The difference from 6.18 is likely additional fixes in
>    linux-next that prevent the complete deadlock but not the contention.
>
> 3. Kernel osnoise tracer (test #4) shows zero noise on the same
>    6.18.20-rt+spinlock kernel where eosnoise hangs. This confirms the
>    issue is specifically in the epoll rt_mutex path, not in osnoise
>    measurement methodology.
>
>    Kernel osnoise output (6.18.20-rt, spinlock, no revert):
>      99.999% availability, 1-4 ns max noise, RES=6 total in 120s
>
> 4. Non-RT kernel (test #1) with the same spinlock change shows zero
>    noise. This confirms the issue is the spinlock-to-rt_mutex conversion
>    on PREEMPT_RT, not the spinlock change itself.
>
> IRQ deltas on isolated CPU1 (120s):
>
>                     6.18.20-rt   6.18.20-rt   6.18.20      6.18.20-rt
>                     spinlock     rwlock(rev)  non-RT       kernel osnoise
>   RES (IPI):        (D state)    3            1            6
>   LOC (timer):      (D state)    3,325        1,185        245
>   IWI (irq work):   (D state)    565,988      1,433        121
>
>                     7.0-rc6-rt   7.0-rc6-rt
>                     spinlock     rwlock(rev)
>   RES (IPI):        330,000+     2
>   LOC (timer):      120,585      120,585
>   IWI (irq work):   585,785      585,785
>
> The mechanism, refined:
>
> Crystal was right that this is specific to the BPF perf_event_output +
> epoll pattern, not any arbitrary epoll user. I verified this: a plain
> perf_event_open + epoll_ctl program without BPF does not trigger the
> issue.
>
> What triggers it is libbpf's perf_buffer__new(), which creates one
> PERF_COUNT_SW_BPF_OUTPUT perf_event per CPU, mmaps the ring buffer,
> and adds all fds to a single epoll instance. When BPF programs are
> attached to high-frequency tracepoints (irq_handler_entry/exit,
> softirq_entry/exit, sched_switch), every interrupt on every CPU calls
> bpf_perf_event_output() which invokes ep_poll_callback() under
> ep->lock.
>
> On PREEMPT_RT, ep->lock is an rt_mutex. With 15+ CPUs generating
> callbacks simultaneously into the same epoll instance, the rt_mutex
> PI mechanism creates unbounded contention.  On 6.18 this results in
> a permanent D state hang. On 7.0 it results in ~330,000 reschedule
> IPIs hitting isolated cores over 120 seconds (~2,750/s per core).
>
> With rwlock, ep_poll_callback() uses read_lock which allows concurrent
> readers without cross-CPU contention — the callbacks execute in
> parallel without generating IPIs.

These IPIs do not exist without eosnoise running; eosnoise introduces
this noise into the system itself. For a noise-tracing tool, it is
certainly eosnoise's responsibility to make sure it does not measure
noise originating from itself.

> This pattern (BPF tracepoint programs + perf ring buffer + epoll) is
> the standard architecture used by BCC tools (opensnoop, execsnoop,
> biolatency, tcpconnect, etc.), bpftrace, and any libbpf-based
> observability tool. A permanent D state hang when running such tools
> on PREEMPT_RT is a significant regression.

7.0-rc6-next still uses the spinlock but has no hang problem. You are
likely hitting a different problem here, one which appears when the
spinlock is used but which has been fixed somewhere between 6.18.20 and
7.0-rc6-next.

If you still have the energy for it, a git bisect between 6.18.20 and
7.0-rc6-next will tell us which commit made the hang issue disappear.

> I'm not proposing a specific fix -- the previous suggestions
> (raw_spinlock trylock, lockless path) were rightly rejected. But the
> regression exists and needs to be addressed. The ep->lock contention
> under high-frequency BPF callbacks on PREEMPT_RT is a new problem
> that the rwlock->spinlock conversion introduced.
>
> Separate question: could eosnoise itself be improved to avoid this
> contention? For example, using one epoll instance per CPU instead of
> a single shared one, or using BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
> instead of the per-cpu perf buffer which requires epoll. If the
> consensus is that the kernel side is working as intended and the tool
> should adapt, I'd like to understand what the recommended pattern is
> for BPF observability tools on PREEMPT_RT.

I am not familiar with eosnoise, so I can't tell you. I tried compiling
eosnoise but that failed; I managed to fix the compile failure, but then
I hit a run-time failure.

It depends on what eosnoise is using epoll for. If it is just waiting
for PERF_COUNT_SW_BPF_OUTPUT events, perhaps it could be changed to some
sort of polling implementation (e.g. wake up every 100 ms to check for
data).

Best regards,
Nam

Thread overview: 13+ messages
2026-03-26 14:00 [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
2026-03-26 14:31 ` Sebastian Andrzej Siewior
2026-03-26 14:52 ` Greg KH
2026-03-26 16:21   ` Ionut Nechita (Wind River)
2026-03-26 18:12 ` Crystal Wood
2026-03-27  7:44   ` Florian Bezdeka
2026-03-27 18:36     ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us " Ionut Nechita (Wind River)
2026-03-27 21:20       ` Crystal Wood
2026-03-28  6:00       ` Jan Kiszka
2026-04-01 16:58         ` Ionut Nechita (Wind River)
2026-04-02  4:42           ` Nam Cao [this message]
2026-04-02  8:59             ` Ionut Nechita (Wind River)
2026-04-02  9:49           ` Tomas Glozar
