public inbox for linux-rt-users@vger.kernel.org
From: Crystal Wood <crwood@redhat.com>
To: "Ionut Nechita (Wind River)" <ionut.nechita@windriver.com>,
	 florian.bezdeka@siemens.com
Cc: namcao@linutronix.de, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-rt-users@vger.kernel.org,
	stable@vger.kernel.org,  linux-kernel@vger.kernel.org,
	frederic@kernel.org, vschneid@redhat.com,
	 gregkh@linuxfoundation.org, chris.friesen@windriver.com,
	 viorel-catalin.rapiteanu@windriver.com,
	iulian.mocanu@windriver.com,  jan.kiszka@siemens.com
Subject: Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
Date: Fri, 27 Mar 2026 16:20:17 -0500	[thread overview]
Message-ID: <279cd4c6b4eece55a63936f2ad0912e41be7838b.camel@redhat.com> (raw)
In-Reply-To: <20260327183610.594667-1-ionut.nechita@windriver.com>

On Fri, 2026-03-27 at 20:36 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> On Thu, 2026-03-27 at 08:44 +0100, Florian Bezdeka wrote:
> > A revert alone is not an option as it would bring back [1] and [2]
> > for all LTS releases that did not receive [3].
> 
> Florian, Crystal, thanks for the feedback.
> 
> I understand the revert concern regarding the CFS throttle deadlock.
> However, I want to clarify that the noise regression on isolated cores
> is a separate issue from the deadlock fixed by [3], and it remains
> unfixed even on linux-next, where [3] is merged.

Nobody's saying that [3] would fix your issue.  They're saying that the
deadlock issue is the reason why simply reverting the epoll change is
not acceptable, at least on kernels without [3].

> I've done extensive testing across multiple kernels to identify the
> exact mechanism. Here are the results.
> 
> Tool: eBPF-based osnoise tracer (https://gitlab.com/rt-linux-tools/eosnoise)
> which uses perf_event_open() + epoll on each monitored CPU, combined
> with /proc/interrupts delta measurement.

I recommend sticking with the kernel's osnoise (with or without rtla).

Besides the IPI issue, it doesn't look like eosnoise has been
maintained since osnoise went into the kernel.
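
For reference, the in-kernel tracer needs no extra instrumentation on
the isolated CPUs.  A minimal invocation (CPU list taken from your
setup below; the 2m duration is an arbitrary choice) would be
something like:

```shell
# Via rtla, the tracer frontend shipped in the kernel tree:
rtla osnoise top -c 1-16,33-48 -d 2m

# ...or directly through tracefs, no extra tooling at all:
cd /sys/kernel/tracing
echo 1-16,33-48 > osnoise/cpus
echo osnoise > current_tracer
cat trace
```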

> Setup:
>   - Hardware: x86_64, SMT/HT enabled (CPUs 0-63)
>   - Boot: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
>     rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
>   - Duration: 120s per test
> 
> IRQ delta on isolated CPUs (representative CPU1, 120s sample):
> 
>                     6.12.79-rt    6.18.20-rt    7.0-rc5-next-rt   6.18.19-rt    7.0-rc5-next-rt
>                     spinlock      spinlock      spinlock          rwlock(rev)   rwlock(rev)
>   RES (IPI):        324,279       323,864       321,594           0             1
>   LOC (timer):      50,827        53,995        59,793            125,791       125,791
>   IWI (irq work):   359,590       357,289       357,798           588,245       588,245
> 
> osnoise on isolated CPUs (per 950ms sample):
> 
>                     6.12.79-rt    6.18.20-rt    7.0-rc5-next-rt   6.18.19-rt    7.0-rc5-next-rt
>                     spinlock      spinlock      spinlock           rwlock(rev)   rwlock(rev)
>   MAX noise (ns):   ~57,000       ~57,000       ~57,000            ~9            ~140
>   IRQ/sample:       ~7,280        ~7,030        ~7,020             ~1            ~961
>   Thread/sample:    ~6,330        ~6,090        ~6,090             ~1            ~1
>   Availability:     ~93.5%        ~93.5%        ~93.5%             ~100%         ~99.99%
> 
> The smoking gun is RES (reschedule IPI): ~322,000 on every isolated CPU
> in 120 seconds with the spinlock, essentially zero with rwlock. That is
> ~2,680 reschedule IPIs per second hitting each isolated core.
> 
> The mechanism: on PREEMPT_RT, spinlock_t becomes rt_mutex. When the
> eBPF osnoise tool (or any BPF/perf tool using epoll) calls
> epoll_ctl(EPOLL_CTL_ADD) for perf events on each CPU, 

I don't see BPF calls from the inner loop of osnoise_main().  There are
BPF hooks for various interruptions...  I'm guessing there's a loop
where each hook causes an IPI that causes another BPF hook.  I
wouldn't have expected a wakeup for every sample, but it seems like
that's the default specified by libbpf (eosnoise doesn't set
sample_period).

> ep_poll_callback()
> runs under ep->lock (now rt_mutex) in IRQ context. The rt_mutex PI
> mechanism sends reschedule IPIs to wake waiters, which hit isolated
> cores. With rwlock, read_lock() in ep_poll_callback() does not generate
> cross-CPU IPIs.

Because it doesn't need to block in the first place (unless there's a
writer).

> Note on the tool: the eBPF osnoise tracer itself creates epoll activity
> on all CPUs via perf_event_open() + epoll_ctl(). This is representative
> of real-world scenarios where any BPF/perf monitoring tool, or system
> services like systemd/journald using epoll, would trigger the same
> regression on isolated cores.

Using BPF to hook IRQ entry/exit isn't representative of real-world
scenarios.  Assuming I'm right about the underlying cause, this is an
issue with eosnoise that the epoll change exacerbates.

> When using the kernel's built-in osnoise tracer (which does not use
> epoll), isolated cores show 1ns noise / 1 IRQ per sample on all kernels
> regardless of spinlock vs rwlock — confirming the noise source is
> specifically the epoll spinlock contention path.
>
> Key finding: the task-based CFS throttle series [3] (Aaron Lu, merged
> in 6.18/linux-next) does NOT fix this issue. The regression is identical
> on 6.12, 6.18, and linux-next 7.0-rc5 with the spinlock. Only reverting
> to rwlock eliminates it.
> 
> To answer Crystal's question "when would you ever reach that path on an
> isolated CPU?" — the answer is: any tool or service that uses
> perf_event_open() + epoll across all CPUs (BPF tools, perf, monitoring
> agents) will trigger ep_poll_callback() on isolated CPUs. On RT with the
> spinlock, this generates ~2,680 reschedule IPIs/s per isolated core.

Keep in mind that if you use kernel services, you can't expect perfect
isolation, or to never block on a mutex or get a callback -- but this
eosnoise issue does not mean that any perf_event_open() + epoll user
will be getting thousands of IPIs per second.

> The eventpoll spinlock noise regression needs its own fix — perhaps 
> a lockless path in ep_poll_callback() for the RT case, or 

Again, if you mean the old lockless path, RT is exactly where we don't
want that.  What would be the reason to do this *only* for RT?

> converting ep->lock to a raw_spinlock with trylock semantics to avoid 
> the rt_mutex IPI overhead.

Among other problems (what happens if the trylock fails?  why a trylock
in the first place?), you can't call wake_up() with a raw lock held. 
It has its own non-raw spinlock.

-Crystal



Thread overview: 13+ messages
2026-03-26 14:00 [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
2026-03-26 14:31 ` Sebastian Andrzej Siewior
2026-03-26 14:52 ` Greg KH
2026-03-26 16:21   ` Ionut Nechita (Wind River)
2026-03-26 18:12 ` Crystal Wood
2026-03-27  7:44   ` Florian Bezdeka
2026-03-27 18:36     ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us " Ionut Nechita (Wind River)
2026-03-27 21:20       ` Crystal Wood [this message]
2026-03-28  6:00       ` Jan Kiszka
2026-04-01 16:58         ` Ionut Nechita (Wind River)
2026-04-02  4:42           ` Nam Cao
2026-04-02  8:59             ` Ionut Nechita (Wind River)
2026-04-02  9:49           ` Tomas Glozar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=279cd4c6b4eece55a63936f2ad0912e41be7838b.camel@redhat.com \
    --to=crwood@redhat.com \
    --cc=brauner@kernel.org \
    --cc=chris.friesen@windriver.com \
    --cc=florian.bezdeka@siemens.com \
    --cc=frederic@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=ionut.nechita@windriver.com \
    --cc=iulian.mocanu@windriver.com \
    --cc=jan.kiszka@siemens.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-users@vger.kernel.org \
    --cc=namcao@linutronix.de \
    --cc=stable@vger.kernel.org \
    --cc=viorel-catalin.rapiteanu@windriver.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
