From: Crystal Wood <crwood@redhat.com>
To: "Ionut Nechita (Wind River)" <ionut.nechita@windriver.com>,
florian.bezdeka@siemens.com
Cc: namcao@linutronix.de, brauner@kernel.org,
linux-fsdevel@vger.kernel.org, linux-rt-users@vger.kernel.org,
stable@vger.kernel.org, linux-kernel@vger.kernel.org,
frederic@kernel.org, vschneid@redhat.com,
gregkh@linuxfoundation.org, chris.friesen@windriver.com,
viorel-catalin.rapiteanu@windriver.com,
iulian.mocanu@windriver.com, jan.kiszka@siemens.com
Subject: Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
Date: Fri, 27 Mar 2026 16:20:17 -0500
Message-ID: <279cd4c6b4eece55a63936f2ad0912e41be7838b.camel@redhat.com>
In-Reply-To: <20260327183610.594667-1-ionut.nechita@windriver.com>

On Fri, 2026-03-27 at 20:36 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> On Fri, 2026-03-27 at 08:44 +0100, Florian Bezdeka wrote:
> > A revert alone is not an option as it would bring back [1] and [2]
> > for all LTS releases that did not receive [3].
>
> Florian, Crystal, thanks for the feedback.
>
> I understand the revert concern regarding the CFS throttle deadlock.
> However, I want to clarify that the noise regression on isolated cores
> is a separate issue from the deadlock fixed by [3], and it remains
> unfixed even on linux-next, which has [3] merged.

Nobody's saying that [3] would fix your issue. They're saying that the
deadlock issue is the reason why simply reverting the epoll change is
not acceptable, at least on kernels without [3].

> I've done extensive testing across multiple kernels to identify the
> exact mechanism. Here are the results.
>
> Tool: eBPF-based osnoise tracer (https://gitlab.com/rt-linux-tools/eosnoise)
> which uses perf_event_open() + epoll on each monitored CPU, combined
> with /proc/interrupts delta measurement.

I recommend sticking with the kernel's osnoise tracer (with or without
rtla). Besides the IPI issue, eosnoise doesn't appear to have been
maintained since osnoise went into the kernel.

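For reference, a minimal sketch of setting up the in-kernel osnoise tracer
through tracefs, without rtla. The helper name `configure_osnoise` and the
period/runtime values are illustrative assumptions, not taken from this
thread. The tracefs mount point (commonly /sys/kernel/tracing) is passed in
rather than hard-coded, and writing to it requires root.

```python
from pathlib import Path


def configure_osnoise(tracefs: Path, cpus: str,
                      period_us: int = 1_000_000,
                      runtime_us: int = 950_000) -> dict[str, str]:
    """Write osnoise settings under `tracefs` and return what was written.

    `cpus` is a CPU list string such as "1-16,33-48". The period/runtime
    defaults are assumptions for illustration only.
    """
    settings = {
        "osnoise/cpus": cpus,                    # CPUs the tracer threads run on
        "osnoise/period_us": str(period_us),     # length of one sampling period
        "osnoise/runtime_us": str(runtime_us),   # time spent sampling per period
        "current_tracer": "osnoise",             # selecting the tracer starts it
    }
    # Dicts preserve insertion order, so the tracer is selected last,
    # after its parameters have been written.
    for rel_path, value in settings.items():
        (tracefs / rel_path).write_text(value + "\n")
    return settings
```

Called as `configure_osnoise(Path("/sys/kernel/tracing"), "1-16,33-48")`,
this would restrict the tracer to the isolated CPUs from the setup quoted
below; the per-CPU summaries are then read from the `trace` file in the same
directory.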
> Setup:
> - Hardware: x86_64, SMT/HT enabled (CPUs 0-63)
> - Boot: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
>   rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
> - Duration: 120s per test
>
> IRQ delta on isolated CPUs (representative CPU1, 120s sample):
>
>                   6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
>                   spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
> RES (IPI):        324,279     323,864     321,594          0            1
> LOC (timer):      50,827      53,995      59,793           125,791      125,791
> IWI (irq work):   359,590     357,289     357,798          588,245      588,245
>
> osnoise on isolated CPUs (per 950ms sample):
>
>                   6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
>                   spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
> MAX noise (ns):   ~57,000     ~57,000     ~57,000          ~9           ~140
> IRQ/sample:       ~7,280      ~7,030      ~7,020           ~1           ~961
> Thread/sample:    ~6,330      ~6,090      ~6,090           ~1           ~1
> Availability:     ~93.5%      ~93.5%      ~93.5%           ~100%        ~99.99%
>
> The smoking gun is RES (reschedule IPI): ~322,000 on every isolated CPU
> in 120 seconds with the spinlock, essentially zero with rwlock. That is
> ~2,680 reschedule IPIs per second hitting each isolated core.
>
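The per-core IPI rate quoted above can be derived from two /proc/interrupts
snapshots. A minimal sketch, assuming the usual layout (a header row of CPU
names, then one `LABEL: count count ...` row per interrupt source);
`parse_interrupts` and `rate_per_second` are hypothetical helper names, not
part of eosnoise or rtla.

```python
def parse_interrupts(text: str) -> dict[str, list[int]]:
    """Map each interrupt label (e.g. "RES") to its per-CPU counts."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    counts: dict[str, list[int]] = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:ncpu]
        # Skip rows without one numeric column per CPU (e.g. "ERR:", "MIS:").
        if len(fields) == ncpu and all(f.isdigit() for f in fields):
            counts[label.strip()] = [int(f) for f in fields]
    return counts


def rate_per_second(before: str, after: str, label: str,
                    cpu: int, seconds: float) -> float:
    """Interrupts per second for one label on one CPU between two snapshots."""
    start = parse_interrupts(before)[label][cpu]
    end = parse_interrupts(after)[label][cpu]
    return (end - start) / seconds
```

With the 7.0-rc5-next-rt spinlock figure from the table (321,594 RES
interrupts on CPU1 in 120 s), this gives roughly 2,680 reschedule IPIs per
second, matching the rate quoted above.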
> The mechanism: on PREEMPT_RT, spinlock_t becomes rt_mutex. When the
> eBPF osnoise tool (or any BPF/perf tool using epoll) calls
> epoll_ctl(EPOLL_CTL_ADD) for perf events on each CPU,

I don't see BPF calls from the inner loop of osnoise_main(). There are
BPF hooks for various interruptions, though, so I'm guessing there's a
feedback loop where each hook causes an IPI that in turn triggers
another BPF hook. I wouldn't have expected a wakeup for every sample,
but that seems to be the default specified by libbpf (eosnoise doesn't
set sample_period).

> ep_poll_callback()
> runs under ep->lock (now rt_mutex) in IRQ context. The rt_mutex PI
> mechanism sends reschedule IPIs to wake waiters, which hit isolated
> cores. With rwlock, read_lock() in ep_poll_callback() does not generate
> cross-CPU IPIs.

Because it doesn't need to block in the first place (unless there's a
writer).

> Note on the tool: the eBPF osnoise tracer itself creates epoll activity
> on all CPUs via perf_event_open() + epoll_ctl(). This is representative
> of real-world scenarios where any BPF/perf monitoring tool, or system
> services like systemd/journald using epoll, would trigger the same
> regression on isolated cores.

Using BPF to hook IRQ entry/exit isn't representative of real-world
scenarios. Assuming I'm right about the underlying cause, this is an
issue with eosnoise that the epoll change exacerbates.

> When using the kernel's built-in osnoise tracer (which does not use
> epoll), isolated cores show 1ns noise / 1 IRQ per sample on all kernels
> regardless of spinlock vs rwlock — confirming the noise source is
> specifically the epoll spinlock contention path.
>
> Key finding: the task-based CFS throttle series [3] (Aaron Lu, merged
> in 6.18/linux-next) does NOT fix this issue. The regression is identical
> on 6.12, 6.18, and linux-next 7.0-rc5 with the spinlock. Only reverting
> to rwlock eliminates it.
>
> To answer Crystal's question "when would you ever reach that path on an
> isolated CPU?" — the answer is: any tool or service that uses
> perf_event_open() + epoll across all CPUs (BPF tools, perf, monitoring
> agents) will trigger ep_poll_callback() on isolated CPUs. On RT with the
> spinlock, this generates ~2,680 reschedule IPIs/s per isolated core.

Keep in mind that if you use kernel services, you can't expect perfect
isolation, or to never block on a mutex or get a callback -- but this
eosnoise issue does not mean that any perf_event_open() + epoll user
will be getting thousands of IPIs per second.

> The eventpoll spinlock noise regression needs its own fix — perhaps
> a lockless path in ep_poll_callback() for the RT case, or

Again, if you mean the old lockless path, RT is exactly where we don't
want that. What would be the reason to do this *only* for RT?

> converting ep->lock to a raw_spinlock with trylock semantics to avoid
> the rt_mutex IPI overhead.

Among other problems (what happens if the trylock fails? why a trylock
in the first place?), you can't call wake_up() with a raw lock held:
wake_up() takes the wait queue's own non-raw spinlock, which is a
sleeping lock on RT.

-Crystal