* [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores
@ 2026-03-26 14:00 Ionut Nechita (Wind River)
2026-03-26 14:31 ` Sebastian Andrzej Siewior
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-26 14:00 UTC (permalink / raw)
To: namcao, brauner
Cc: linux-fsdevel, linux-rt-users, stable, linux-kernel, frederic,
vschneid, gregkh, chris.friesen, viorel-catalin.rapiteanu,
iulian.mocanu
Hi,
I'm reporting a regression introduced by commit 0c43094f8cc9
("eventpoll: Replace rwlock with spinlock"), backported to stable 6.12.y.
On a PREEMPT_RT system with nohz_full isolated cores, this commit causes
significant osnoise degradation on the isolated CPUs.
Setup:
- Kernel: 6.12.78 with PREEMPT_RT
- Hardware: x86_64, dual-socket (CPUs 0-63)
- Boot params: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
- Tool: osnoise tracer (./osnoise -c 1-16,33-48)
With commit applied (spinlock, kernel 6.12.78-vanilla-0):
CPU RUNTIME MAX_NOISE AVAIL% NOISE NMI IRQ SIRQ Thread
[001] 950000 50163 94.719% 14 0 6864 0 5922
[004] 950000 50294 94.705% 14 0 6864 0 5920
[007] 950000 49782 94.759% 14 0 6864 1 5921
[033] 950000 49528 94.786% 15 0 6864 2 5922
[016] 950000 48551 94.889% 20 0 6863 19 5942
[008] 950000 44343 95.332% 14 0 6864 0 5925
With commit reverted (rwlock restored, kernel 6.12.78-vanilla-1):
CPU RUNTIME MAX_NOISE AVAIL% NOISE NMI IRQ SIRQ Thread
[001] 950000 0 100.000% 0 0 6 0 0
[004] 950000 0 100.000% 0 0 4 0 0
[007] 950000 0 100.000% 0 0 4 0 0
[033] 950000 0 100.000% 0 0 4 0 0
[016] 950000 0 100.000% 0 0 5 0 0
[008] 950000 7 99.999% 7 0 5 0 0
Summary across all isolated cores (32 CPUs):
                      With spinlock      With rwlock (reverted)
MAX noise (ns):       44,343 - 51,869    0 - 10
IRQ count/sample:     ~6,650 - 6,870     3 - 7
Thread noise/sample:  ~5,700 - 5,940     0 - 1
CPU availability:     94.5% - 95.3%      ~100%
The regression is roughly 3 orders of magnitude in noise on isolated
cores. The test was run over many consecutive samples and the pattern
is consistent: with the spinlock, every isolated core sees thousands
of IRQs and ~50µs of noise per 950ms sample window. With the rwlock,
the cores are essentially silent.
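One way to read those availability figures (my own arithmetic, on the assumption that RUNTIME is reported in microseconds, i.e. a 950 ms window, and MAX_NOISE is the largest single interruption): the total time lost per window is far larger than any single spike, because it is spread over the ~12,800 counted events. Using the CPU [001] row above:

```python
# CPU [001] row with the spinlock, from the table above
runtime_us = 950_000          # one sample window: 950 ms (RUNTIME in us, assumed)
avail = 0.94719               # reported availability
irqs, threads = 6864, 5922    # noise events counted in the same window

lost_us = runtime_us * (1 - avail)          # total time stolen per window
per_event_us = lost_us / (irqs + threads)   # average cost per interruption

print(f"time lost per window: {lost_us:.0f} us")
print(f"average per event:    {per_event_us:.1f} us")
```

That works out to roughly 50 ms lost per 950 ms window, at about 4 µs per interruption on average, with the worst single interruption around 50 µs.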
Note that CPU 016 occasionally shows SIRQ noise (softirq) with both
kernels, which is a separate known issue with the tick on the first
nohz_full CPU. The eventpoll regression is the dominant noise source.
My understanding of the root cause: the original rwlock allowed
ep_poll_callback() (producer side, running from IRQ context on any CPU)
to use read_lock, which does not cause cross-CPU contention on isolated
cores when no local epoll activity exists. With the spinlock conversion,
on PREEMPT_RT spinlock_t becomes an rt_mutex. This means that even if
the isolated core is not involved in any epoll activity, the lock's
cacheline bouncing and potential PI-boosted wakeups from housekeeping
CPUs can inject noise into the isolated cores via IPI or cache
invalidation traffic.
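To make the reader/writer asymmetry concrete, here is a userspace model (plain Python threads, purely illustrative; it only loosely approximates kernel rwlock_t vs rt_mutex semantics). A read-preferring rwlock admits all concurrent acquirers on the callback path at once, so none of them has to block or be woken:

```python
import threading

class RWLock:
    """Minimal read-preferring rwlock: readers bump a counter under a
    small internal lock; the first reader takes the writer lock and the
    last one releases it, so readers never block each other."""
    def __init__(self):
        self._readers = 0
        self._counter_lock = threading.Lock()
        self._writer_lock = threading.Lock()

    def read_acquire(self):
        with self._counter_lock:
            self._readers += 1
            if self._readers == 1:
                self._writer_lock.acquire()

    def read_release(self):
        with self._counter_lock:
            self._readers -= 1
            if self._readers == 0:
                self._writer_lock.release()

rw = RWLock()
NREADERS = 4
# The barrier only opens if all four "callback" threads are inside the
# read-side critical section at the same time.
inside = threading.Barrier(NREADERS)

def callback():
    rw.read_acquire()
    inside.wait(timeout=5)
    rw.read_release()

threads = [threading.Thread(target=callback) for _ in range(NREADERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("all readers held the lock concurrently:", not inside.broken)
```

Swapping RWLock for a single threading.Lock makes the barrier time out: only one thread can be inside at a time, and every other contender has to block and later be woken, which on PREEMPT_RT is where the rt_mutex wakeup machinery (and its IPIs) enters the picture.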
The commit message acknowledges the throughput regression but argues
real workloads won't notice. However, for RT/latency-sensitive
deployments with CPU isolation, the impact is severe and measurable
even with zero local epoll usage.
I believe this needs either:
a) A revert of the backport for stable RT trees, or
b) A fix that avoids the spinlock contention path for isolated CPUs
I can provide the full osnoise trace data if needed.
Tested on:
Linux system-0 6.12.78-vanilla-{0,1} SMP PREEMPT_RT x86_64
Linux system-0 6.12.57-vanilla-{0,1} SMP PREEMPT_RT x86_64
Thanks,
Ionut.
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores
  2026-03-26 14:00 [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
@ 2026-03-26 14:31 ` Sebastian Andrzej Siewior
  2026-03-26 14:52 ` Greg KH
  2026-03-26 18:12 ` Crystal Wood
  2 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-26 14:31 UTC (permalink / raw)
To: Ionut Nechita (Wind River)
Cc: namcao, brauner, linux-fsdevel, linux-rt-users, stable,
    linux-kernel, frederic, vschneid, gregkh, chris.friesen,
    viorel-catalin.rapiteanu, iulian.mocanu

On 2026-03-26 16:00:57 [+0200], Ionut Nechita (Wind River) wrote:
> Summary across all isolated cores (32 CPUs):
>
>                       With spinlock      With rwlock (reverted)
> MAX noise (ns):       44,343 - 51,869    0 - 10
> IRQ count/sample:     ~6,650 - 6,870     3 - 7
> Thread noise/sample:  ~5,700 - 5,940     0 - 1
> CPU availability:     94.5% - 95.3%      ~100%

Is there some load on the system, or is it idle apart from osnoise?

> My understanding of the root cause: the original rwlock allowed
> ep_poll_callback() (producer side, running from IRQ context on any CPU)
> to use read_lock, which does not cause cross-CPU contention on isolated
> cores when no local epoll activity exists. With the spinlock conversion,
> on PREEMPT_RT spinlock_t becomes an rt_mutex. This means that even if
> the isolated core is not involved in any epoll activity, the lock's
> cacheline bouncing and potential PI-boosted wakeups from housekeeping
> CPUs can inject noise into the isolated cores via IPI or cache
> invalidation traffic.

With read_lock() the lock can be acquired by multiple readers. Each
read_lock() increments the "reader counter", so there is cache line
activity. If an isolated CPU does not participate, it does not
participate. With the change to spinlock_t there can be only one owner
at a time, so the others have to wait, but again: isolated cores which
don't participate are not affected.

> The commit message acknowledges the throughput regression but argues
> real workloads won't notice. However, for RT/latency-sensitive
> deployments with CPU isolation, the impact is severe and measurable
> even with zero local epoll usage.
>
> I believe this needs either:
> a) A revert of the backport for stable RT trees, or

I highly doubt that, since the problem the commit fixed affected RT
loads.

> b) A fix that avoids the spinlock contention path for isolated CPUs
>
> I can provide the full osnoise trace data if needed.

So the question is why the isolated cores are affected if they don't
participate in epoll.

> Tested on:
> Linux system-0 6.12.78-vanilla-{0,1} SMP PREEMPT_RT x86_64
> Linux system-0 6.12.57-vanilla-{0,1} SMP PREEMPT_RT x86_64
>
> Thanks,
> Ionut.

Sebastian

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores
  2026-03-26 14:00 [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
  2026-03-26 14:31 ` Sebastian Andrzej Siewior
@ 2026-03-26 14:52 ` Greg KH
  2026-03-26 16:21   ` Ionut Nechita (Wind River)
  2026-03-26 18:12 ` Crystal Wood
  2 siblings, 1 reply; 13+ messages in thread
From: Greg KH @ 2026-03-26 14:52 UTC (permalink / raw)
To: Ionut Nechita (Wind River)
Cc: namcao, brauner, linux-fsdevel, linux-rt-users, stable,
    linux-kernel, frederic, vschneid, chris.friesen,
    viorel-catalin.rapiteanu, iulian.mocanu

On Thu, Mar 26, 2026 at 04:00:57PM +0200, Ionut Nechita (Wind River) wrote:
> Hi,
>
> I'm reporting a regression introduced by commit 0c43094f8cc9
> ("eventpoll: Replace rwlock with spinlock"), backported to stable 6.12.y.

Does this regression also show up in the 6.18 release and newer? If so,
please work to address it there first, as that is where it needs to be
handled first.

thanks,

greg k-h

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores
  2026-03-26 14:52 ` Greg KH
@ 2026-03-26 16:21   ` Ionut Nechita (Wind River)
  0 siblings, 0 replies; 13+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-26 16:21 UTC (permalink / raw)
To: gregkh
Cc: brauner, chris.friesen, frederic, ionut.nechita, iulian.mocanu,
    linux-fsdevel, linux-kernel, linux-rt-users, namcao, stable,
    viorel-catalin.rapiteanu, vschneid

Hi Greg,

> Does this regression also show up in the 6.18 release and newer?

I haven't tested on 6.18 yet. I'll try to reproduce on a recent 6.18
LTS and mainline kernel with PREEMPT_RT and follow up with results.

Thanks,
Ionut.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores
  2026-03-26 14:00 [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
  2026-03-26 14:31 ` Sebastian Andrzej Siewior
  2026-03-26 14:52 ` Greg KH
@ 2026-03-26 18:12 ` Crystal Wood
  2026-03-27  7:44   ` Florian Bezdeka
  2 siblings, 1 reply; 13+ messages in thread
From: Crystal Wood @ 2026-03-26 18:12 UTC (permalink / raw)
To: Ionut Nechita (Wind River), namcao, brauner
Cc: linux-fsdevel, linux-rt-users, stable, linux-kernel, frederic,
    vschneid, gregkh, chris.friesen, viorel-catalin.rapiteanu,
    iulian.mocanu

On Thu, 2026-03-26 at 16:00 +0200, Ionut Nechita (Wind River) wrote:
> Setup:
> - Kernel: 6.12.78 with PREEMPT_RT
> - Hardware: x86_64, dual-socket (CPUs 0-63)
> - Boot params: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
>   rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
> - Tool: osnoise tracer (./osnoise -c 1-16,33-48)

Is SMT disabled?

[...]

> My understanding of the root cause: the original rwlock allowed
> ep_poll_callback() (producer side, running from IRQ context on any CPU)
> to use read_lock, which does not cause cross-CPU contention on isolated
> cores when no local epoll activity exists. With the spinlock conversion,
> on PREEMPT_RT spinlock_t becomes an rt_mutex. This means that even if
> the isolated core is not involved in any epoll activity, the lock's
> cacheline bouncing and potential PI-boosted wakeups from housekeeping
> CPUs can inject noise into the isolated cores via IPI or cache
> invalidation traffic.

That sounds like a general isolation problem... it's not a bug for
non-isolated CPUs to bounce cachelines or send IPIs to each other.

Whether it's IPIs or not, osnoise is showing IRQs on the isolated CPUs,
so I'd look into which IRQs and why. Even with the patch reverted,
there are some IRQs on the isolated CPUs.

> The commit message acknowledges the throughput regression but argues
> real workloads won't notice. However, for RT/latency-sensitive
> deployments with CPU isolation, the impact is severe and measurable
> even with zero local epoll usage.
>
> I believe this needs either:
> a) A revert of the backport for stable RT trees, or

Even if the patch weren't trying to address an RT issue in the first
place, this would just be a bandaid rather than a real solution.

> b) A fix that avoids the spinlock contention path for isolated CPUs

If there's truly no epoll activity on the isolated CPUs, when would you
ever reach that path on an isolated CPU?

-Crystal

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores
  2026-03-26 18:12 ` Crystal Wood
@ 2026-03-27  7:44   ` Florian Bezdeka
  2026-03-27 18:36     ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Bezdeka @ 2026-03-27 7:44 UTC (permalink / raw)
To: Crystal Wood, Ionut Nechita (Wind River), namcao, brauner
Cc: linux-fsdevel, linux-rt-users, stable, linux-kernel, frederic,
    vschneid, gregkh, chris.friesen, viorel-catalin.rapiteanu,
    iulian.mocanu, jan.kiszka

On Thu, 2026-03-26 at 13:12 -0500, Crystal Wood wrote:
> On Thu, 2026-03-26 at 16:00 +0200, Ionut Nechita (Wind River) wrote:
> [...]
> > I believe this needs either:
> > a) A revert of the backport for stable RT trees, or
>
> Even if the patch weren't trying to address an RT issue in the first
> place, this would just be a bandaid rather than a real solution.

A revert alone is not an option as it would bring back [1] and [2] for
all LTS releases that did not receive [3]. If my memory is correct,
only 6.18 has it. The result was a system lockup easily triggered by
using the epoll interface.

[1] https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/
[2] https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/
[3] https://lore.kernel.org/all/20250829081120.806-1-ziqianlu@bytedance.com/

Florian

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
  2026-03-27  7:44 ` Florian Bezdeka
@ 2026-03-27 18:36   ` Ionut Nechita (Wind River)
  2026-03-27 21:20     ` Crystal Wood
  2026-03-28  6:00     ` Jan Kiszka
  0 siblings, 2 replies; 13+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-27 18:36 UTC (permalink / raw)
To: florian.bezdeka
Cc: crwood, ionut.nechita, namcao, brauner, linux-fsdevel,
    linux-rt-users, stable, linux-kernel, frederic, vschneid, gregkh,
    chris.friesen, viorel-catalin.rapiteanu, iulian.mocanu, jan.kiszka

From: Ionut Nechita <ionut.nechita@windriver.com>

On Thu, 2026-03-27 at 08:44 +0100, Florian Bezdeka wrote:
> A revert alone is not an option as it would bring back [1] and [2]
> for all LTS releases that did not receive [3].

Florian, Crystal, thanks for the feedback.

I understand the revert concern regarding the CFS throttle deadlock.
However, I want to clarify that the noise regression on isolated cores
is a separate issue from the deadlock fixed by [3], and it remains
unfixed even on linux-next, with or without [3] merged.

I've done extensive testing across multiple kernels to identify the
exact mechanism. Here are the results.

Tool: eBPF-based osnoise tracer (https://gitlab.com/rt-linux-tools/eosnoise)
which uses perf_event_open() + epoll on each monitored CPU, combined
with /proc/interrupts delta measurement.

Setup:
- Hardware: x86_64, SMT/HT enabled (CPUs 0-63)
- Boot: nohz_full=1-16,33-48 isolcpus=nohz,domain,managed_irq,1-16,33-48
  rcu_nocbs=1-31,33-63 kthread_cpus=0,32 irqaffinity=17-31,49-63
- Duration: 120s per test

IRQ delta on isolated CPUs (representative CPU1, 120s sample):

                 6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
                 spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
RES (IPI):       324,279     323,864     321,594          0            1
LOC (timer):     50,827      53,995      59,793           125,791      125,791
IWI (irq work):  359,590     357,289     357,798          588,245      588,245

osnoise on isolated CPUs (per 950ms sample):

                 6.12.79-rt  6.18.20-rt  7.0-rc5-next-rt  6.18.19-rt   7.0-rc5-next-rt
                 spinlock    spinlock    spinlock         rwlock(rev)  rwlock(rev)
MAX noise (ns):  ~57,000     ~57,000     ~57,000          ~9           ~140
IRQ/sample:      ~7,280      ~7,030      ~7,020           ~1           ~961
Thread/sample:   ~6,330      ~6,090      ~6,090           ~1           ~1
Availability:    ~93.5%      ~93.5%      ~93.5%           ~100%        ~99.99%

The smoking gun is RES (reschedule IPI): ~322,000 on every isolated CPU
in 120 seconds with the spinlock, essentially zero with rwlock. That is
~2,680 reschedule IPIs per second hitting each isolated core.

The mechanism: on PREEMPT_RT, spinlock_t becomes an rt_mutex. When the
eBPF osnoise tool (or any BPF/perf tool using epoll) calls
epoll_ctl(EPOLL_CTL_ADD) for perf events on each CPU, ep_poll_callback()
runs under ep->lock (now an rt_mutex) in IRQ context. The rt_mutex PI
mechanism sends reschedule IPIs to wake waiters, and those IPIs hit the
isolated cores. With the rwlock, read_lock() in ep_poll_callback() does
not generate cross-CPU IPIs.

Note on the tool: the eBPF osnoise tracer itself creates epoll activity
on all CPUs via perf_event_open() + epoll_ctl(). This is representative
of real-world scenarios where any BPF/perf monitoring tool, or system
services like systemd/journald using epoll, would trigger the same
regression on isolated cores.

When using the kernel's built-in osnoise tracer (which does not use
epoll), isolated cores show 1ns noise / 1 IRQ per sample on all kernels
regardless of spinlock vs rwlock, confirming the noise source is
specifically the epoll spinlock contention path.

Key finding: the task-based CFS throttle series [3] (Aaron Lu, merged
in 6.18/linux-next) does NOT fix this issue. The regression is
identical on 6.12, 6.18, and linux-next 7.0-rc5 with the spinlock.
Only reverting to rwlock eliminates it.

To answer Crystal's question "when would you ever reach that path on an
isolated CPU?": any tool or service that uses perf_event_open() + epoll
across all CPUs (BPF tools, perf, monitoring agents) will trigger
ep_poll_callback() on isolated CPUs. On RT with the spinlock, this
generates ~2,680 reschedule IPIs/s per isolated core.

The eventpoll spinlock noise regression needs its own fix, perhaps a
lockless path in ep_poll_callback() for the RT case, or converting
ep->lock to a raw_spinlock with trylock semantics to avoid the rt_mutex
IPI overhead.

Ionut

^ permalink raw reply [flat|nested] 13+ messages in thread
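For anyone reproducing the IRQ-delta measurements quoted above, a snapshot-and-diff over /proc/interrupts is enough. A minimal sketch follows; the parsing assumes the standard layout (a CPU header row, then one row per source as "NAME:" followed by one count column per CPU), and the sample data is made up for illustration, not taken from the systems under test:

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {source: [per-CPU counts]}.
    Assumes the standard layout: a header row naming the CPUs, then one
    row per source: 'NAME:' followed by one count column per CPU."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        counts = rest.split()[:ncpus]
        if len(counts) == ncpus and all(c.isdigit() for c in counts):
            table[name.strip()] = [int(c) for c in counts]
    return table

def delta(before, after):
    """Per-source, per-CPU difference between two snapshots."""
    return {name: [b - a for a, b in zip(before[name], after[name])]
            for name in after if name in before}

# Made-up snapshots standing in for two reads of /proc/interrupts
# taken 120 s apart; the real numbers are in the tables above.
sample_t0 = """\
           CPU0       CPU1
LOC:       1000        500
RES:        200         10
"""
sample_t1 = """\
           CPU0       CPU1
LOC:       1400        600
RES:        900      32410
"""
d = delta(parse_interrupts(sample_t0), parse_interrupts(sample_t1))
print(d)  # -> {'LOC': [400, 100], 'RES': [700, 32400]}
```

On a live system: read /proc/interrupts, sleep for the measurement window, read it again, and inspect the columns of the isolated CPUs (RES, LOC and IWI are the rows discussed in this thread).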
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
  2026-03-27 18:36 ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
@ 2026-03-27 21:20   ` Crystal Wood
  0 siblings, 0 replies; 13+ messages in thread
From: Crystal Wood @ 2026-03-27 21:20 UTC (permalink / raw)
To: Ionut Nechita (Wind River), florian.bezdeka
Cc: namcao, brauner, linux-fsdevel, linux-rt-users, stable,
    linux-kernel, frederic, vschneid, gregkh, chris.friesen,
    viorel-catalin.rapiteanu, iulian.mocanu, jan.kiszka

On Fri, 2026-03-27 at 20:36 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> On Thu, 2026-03-27 at 08:44 +0100, Florian Bezdeka wrote:
> > A revert alone is not an option as it would bring back [1] and [2]
> > for all LTS releases that did not receive [3].
>
> Florian, Crystal, thanks for the feedback.
>
> I understand the revert concern regarding the CFS throttle deadlock.
> However, I want to clarify that the noise regression on isolated cores
> is a separate issue from the deadlock fixed by [3], and it remains
> unfixed even on linux-next which has [3] merged or not.

Nobody's saying that [3] would fix your issue. They're saying that the
deadlock issue is the reason why simply reverting the epoll change is
not acceptable, at least on kernels without [3].

> I've done extensive testing across multiple kernels to identify the
> exact mechanism. Here are the results.
>
> Tool: eBPF-based osnoise tracer (https://gitlab.com/rt-linux-tools/eosnoise)
> which uses perf_event_open() + epoll on each monitored CPU, combined
> with /proc/interrupts delta measurement.

I recommend sticking with the kernel's osnoise (with or without rtla).
Besides the IPI issue, it doesn't look like eosnoise is being
maintained anymore, ever since osnoise went into the kernel.

[...]

> The smoking gun is RES (reschedule IPI): ~322,000 on every isolated CPU
> in 120 seconds with the spinlock, essentially zero with rwlock. That is
> ~2,680 reschedule IPIs per second hitting each isolated core.
>
> The mechanism: on PREEMPT_RT, spinlock_t becomes rt_mutex. When the
> eBPF osnoise tool (or any BPF/perf tool using epoll) calls
> epoll_ctl(EPOLL_CTL_ADD) for perf events on each CPU,

I don't see BPF calls from the inner loop of osnoise_main(). There are
BPF hooks for various interruptions... I'm guessing there's a loop
where each hook causes an IPI that causes another BPF hook. I wouldn't
have expected a wakeup for every sample, but it seems like that's the
default specified by libbpf (eosnoise doesn't set sample_period).

> ep_poll_callback()
> runs under ep->lock (now rt_mutex) in IRQ context. The rt_mutex PI
> mechanism sends reschedule IPIs to wake waiters, which hit isolated
> cores. With rwlock, read_lock() in ep_poll_callback() does not generate
> cross-CPU IPIs.

Because it doesn't need to block in the first place (unless there's a
writer).

> Note on the tool: the eBPF osnoise tracer itself creates epoll activity
> on all CPUs via perf_event_open() + epoll_ctl(). This is representative
> of real-world scenarios where any BPF/perf monitoring tool, or system
> services like systemd/journald using epoll, would trigger the same
> regression on isolated cores.

Using BPF to hook IRQ entry/exit isn't representative of real-world
scenarios. Assuming I'm right about the underlying cause, this is an
issue with eosnoise, that the epoll change exacerbates.

[...]

> To answer Crystal's question "when would you ever reach that path on an
> isolated CPU?": any tool or service that uses perf_event_open() + epoll
> across all CPUs (BPF tools, perf, monitoring agents) will trigger
> ep_poll_callback() on isolated CPUs. On RT with the spinlock, this
> generates ~2,680 reschedule IPIs/s per isolated core.

Keep in mind that if you use kernel services, you can't expect perfect
isolation, or to never block on a mutex or get a callback -- but this
eosnoise issue does not mean that any perf_event_open() + epoll user
will be getting thousands of IPIs per second.

> The eventpoll spinlock noise regression needs its own fix, perhaps
> a lockless path in ep_poll_callback() for the RT case, or

Again, if you mean the old lockless path, RT is exactly where we don't
want that. What would be the reason to do this *only* for RT?

> converting ep->lock to a raw_spinlock with trylock semantics to avoid
> the rt_mutex IPI overhead.

Among other problems (what happens if the trylock fails? why a trylock
in the first place?), you can't call wake_up() with a raw lock held.
It has its own non-raw spinlock.

-Crystal

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
  2026-03-27 18:36 ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
  2026-03-27 21:20   ` Crystal Wood
@ 2026-03-28  6:00   ` Jan Kiszka
  2026-04-01 16:58     ` Ionut Nechita (Wind River)
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kiszka @ 2026-03-28 6:00 UTC (permalink / raw)
To: Ionut Nechita (Wind River), florian.bezdeka
Cc: crwood, namcao, brauner, linux-fsdevel, linux-rt-users, stable,
    linux-kernel, frederic, vschneid, gregkh, chris.friesen,
    viorel-catalin.rapiteanu, iulian.mocanu

On 27.03.26 19:36, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> [...]
>
> Setup:
> - Hardware: x86_64, SMT/HT enabled (CPUs 0-63)

I think Crystal already asked: Are you disabling HT then by taking the
siblings offline for the isolated cores? If not, the measurements are a
bit questionable from an RT perspective.

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores 2026-03-28 6:00 ` Jan Kiszka @ 2026-04-01 16:58 ` Ionut Nechita (Wind River) 2026-04-02 4:42 ` Nam Cao 2026-04-02 9:49 ` Tomas Glozar 0 siblings, 2 replies; 13+ messages in thread From: Ionut Nechita (Wind River) @ 2026-04-01 16:58 UTC (permalink / raw) To: jan.kiszka Cc: crwood, florian.bezdeka, ionut.nechita, namcao, brauner, linux-fsdevel, linux-rt-users, stable, linux-kernel, bpf, frederic, vschneid, gregkh, chris.friesen, viorel-catalin.rapiteanu, iulian.mocanu From: Ionut Nechita <ionut.nechita@windriver.com> Crystal, Jan, Florian, thanks for the detailed feedback. I've redone all testing addressing each point raised. All tests below use HT disabled (sibling cores offlined), as Jan requested. Setup: - Hardware: Intel Xeon Gold 6338N (Ice Lake, single socket, 32 cores, HT disabled via sibling cores offlined) - Boot: nohz_full=1-16 isolcpus=nohz,domain,managed_irq,1-16 rcu_nocbs=1-31 kthread_cpus=0 irqaffinity=17-31 iommu=pt nmi_watchdog=0 intel_pstate=none skew_tick=1 - eosnoise run with: ./osnoise -c 1-15 - Duration: 120s per test Tested kernels (all vanilla, built from upstream sources): - 6.18.20-vanilla (non-RT, PREEMPT_DYNAMIC) - 6.18.20-vanilla (PREEMPT_RT, with and without rwlock revert) - 7.0.0-rc6-next-20260331 (PREEMPT_RT, with and without rwlock revert) I tested 6 configurations to isolate the exact failure mode: # Kernel Config Tool Revert Result -- --------------- -------- --------------- ------- ---------------- 1 6.18.20 non-RT eosnoise no clean (100%) 2 6.18.20 RT eosnoise no D state (hung) 3 6.18.20 RT eosnoise yes clean (100%) 4 6.18.20 RT kernel osnoise no clean (99.999%) 5 7.0-rc6-next RT eosnoise no 93% avail, 57us 6 7.0-rc6-next RT eosnoise yes clean (99.99%) Key findings: 1. On 6.18.20-rt with spinlock, eosnoise hangs permanently in D state. 
   The process blocks in do_epoll_ctl() during perf_buffer__new() setup
   (libbpf's perf_event_open + epoll_ctl loop). strace shows progressive
   degradation as fds are added to the epoll instance:

     CPU 0-13:  epoll_ctl ~8 us   (normal)
     CPU 14:    epoll_ctl 16 ms   (2000x slower)
     CPU 15:    epoll_ctl 80 ms   (10000x slower)
     CPU 16:    epoll_ctl 80 ms
     CPU 17:    epoll_ctl 20 ms
     CPU 18:    epoll_ctl -- hung, never returns --

   Kernel stack of the hung process (3+ minutes in D state):

     [<0>] do_epoll_ctl+0xa57/0xf20
     [<0>] __x64_sys_epoll_ctl+0x5d/0xa0
     [<0>] do_syscall_64+0x7c/0xe30
     [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

2. On 7.0-rc6-next-rt with spinlock, eosnoise runs but with severe
   noise. The difference from 6.18 is likely additional fixes in
   linux-next that prevent the complete deadlock but not the contention.

3. Kernel osnoise tracer (test #4) shows zero noise on the same
   6.18.20-rt+spinlock kernel where eosnoise hangs. This confirms the
   issue is specifically in the epoll rt_mutex path, not in osnoise
   measurement methodology.

   Kernel osnoise output (6.18.20-rt, spinlock, no revert):
   99.999% availability, 1-4 ns max noise, RES=6 total in 120s

4. Non-RT kernel (test #1) with the same spinlock change shows zero
   noise. This confirms the issue is the spinlock-to-rt_mutex conversion
   on PREEMPT_RT, not the spinlock change itself.

IRQ deltas on isolated CPU1 (120s):

                   6.18.20-rt   6.18.20-rt    6.18.20   6.18.20-rt
                   spinlock     rwlock(rev)   non-RT    kernel osnoise
  RES (IPI):       (D state)    3             1         6
  LOC (timer):     (D state)    3,325         1,185     245
  IWI (irq work):  (D state)    565,988       1,433     121

                   7.0-rc6-rt   7.0-rc6-rt
                   spinlock     rwlock(rev)
  RES (IPI):       330,000+     2
  LOC (timer):     120,585      120,585
  IWI (irq work):  585,785      585,785

The mechanism, refined:

Crystal was right that this is specific to the BPF perf_event_output +
epoll pattern, not any arbitrary epoll user. I verified this: a plain
perf_event_open + epoll_ctl program without BPF does not trigger the
issue.
What triggers it is libbpf's perf_buffer__new(), which creates one
PERF_COUNT_SW_BPF_OUTPUT perf_event per CPU, mmaps the ring buffer,
and adds all fds to a single epoll instance. When BPF programs are
attached to high-frequency tracepoints (irq_handler_entry/exit,
softirq_entry/exit, sched_switch), every interrupt on every CPU calls
bpf_perf_event_output(), which invokes ep_poll_callback() under
ep->lock.

On PREEMPT_RT, ep->lock is an rt_mutex. With 15+ CPUs generating
callbacks simultaneously into the same epoll instance, the rt_mutex
PI mechanism creates unbounded contention. On 6.18 this results in
a permanent D state hang. On 7.0 it results in ~330,000 reschedule
IPIs hitting isolated cores over 120 seconds (~2,750/s per core).

With rwlock, ep_poll_callback() uses read_lock(), which allows
concurrent readers without cross-CPU contention -- the callbacks
execute in parallel without generating IPIs.

This pattern (BPF tracepoint programs + perf ring buffer + epoll) is
the standard architecture used by BCC tools (opensnoop, execsnoop,
biolatency, tcpconnect, etc.), bpftrace, and any libbpf-based
observability tool. A permanent D state hang when running such tools
on PREEMPT_RT is a significant regression.

I'm not proposing a specific fix -- the previous suggestions
(raw_spinlock trylock, lockless path) were rightly rejected. But the
regression exists and needs to be addressed. The ep->lock contention
under high-frequency BPF callbacks on PREEMPT_RT is a new problem
that the rwlock->spinlock conversion introduced.

Separate question: could eosnoise itself be improved to avoid this
contention? For example, by using one epoll instance per CPU instead of
a single shared one, or by using the BPF ring buffer
(BPF_MAP_TYPE_RINGBUF) instead of the per-CPU perf buffer, which
requires epoll. If the consensus is that the kernel side is working as
intended and the tool should adapt, I'd like to understand what the
recommended pattern is for BPF observability tools on PREEMPT_RT.
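To make the fan-in concrete, here is a minimal sketch of the two wirings
discussed above. It is not eosnoise or libbpf code: it is plain Python
(Linux-only, since it uses epoll directly), with pipes standing in for
the per-CPU perf event fds. The first wiring mirrors perf_buffer__new():
all producer fds registered with one shared epoll instance, so every
producer-side wakeup goes through that one instance's eventpoll lock.
The second wiring gives each producer its own instance:

```python
import os
import select

NUM_CPUS = 4  # stand-in for the monitored CPUs

# Wiring used by libbpf's perf_buffer__new(): one fd per CPU, all
# registered with a single shared epoll instance.  In the kernel, every
# producer-side wakeup (ep_poll_callback) then serializes on that one
# instance's ep->lock.
pipes = [os.pipe() for _ in range(NUM_CPUS)]
shared = select.epoll()
for rfd, _ in pipes:
    shared.register(rfd, select.EPOLLIN)

for _, wfd in pipes:          # all "CPUs" produce an event
    os.write(wfd, b"x")

shared_ready = shared.poll(timeout=1)
print("shared instance, ready fds:", len(shared_ready))

# Alternative wiring: one epoll instance per CPU.  A wakeup on one
# pipe never touches the eventpoll state of the others.
per_cpu = []
for rfd, _ in pipes:
    ep = select.epoll()
    ep.register(rfd, select.EPOLLIN)
    per_cpu.append(ep)

per_cpu_ready = sum(len(ep.poll(timeout=0)) for ep in per_cpu)
print("per-CPU instances, ready fds:", per_cpu_ready)
```

This only shows the wiring; the contention itself is a kernel-side
property of how many producers funnel into one eventpoll instance, which
userspace cannot observe from here.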
Ionut

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
  2026-04-01 16:58 ` Ionut Nechita (Wind River)
@ 2026-04-02  4:42   ` Nam Cao
  2026-04-02  8:59     ` Ionut Nechita (Wind River)
  2026-04-02  9:49   ` Tomas Glozar
  1 sibling, 1 reply; 13+ messages in thread
From: Nam Cao @ 2026-04-02 4:42 UTC (permalink / raw)
To: Ionut Nechita (Wind River), jan.kiszka
Cc: crwood, florian.bezdeka, ionut.nechita, brauner, linux-fsdevel,
    linux-rt-users, stable, linux-kernel, bpf, frederic, vschneid,
    gregkh, chris.friesen, viorel-catalin.rapiteanu, iulian.mocanu

"Ionut Nechita (Wind River)" <ionut.nechita@windriver.com> writes:
> Crystal, Jan, Florian, thanks for the detailed feedback. I've redone
> all testing addressing each point raised. All tests below use HT
> disabled (sibling cores offlined), as Jan requested.
>
> Setup:
> - Hardware: Intel Xeon Gold 6338N (Ice Lake, single socket,
>   32 cores, HT disabled via sibling cores offlined)
> - Boot: nohz_full=1-16 isolcpus=nohz,domain,managed_irq,1-16
>   rcu_nocbs=1-31 kthread_cpus=0 irqaffinity=17-31
>   iommu=pt nmi_watchdog=0 intel_pstate=none skew_tick=1
> - eosnoise run with: ./osnoise -c 1-15
> - Duration: 120s per test
>
> Tested kernels (all vanilla, built from upstream sources):
> - 6.18.20-vanilla (non-RT, PREEMPT_DYNAMIC)
> - 6.18.20-vanilla (PREEMPT_RT, with and without rwlock revert)
> - 7.0.0-rc6-next-20260331 (PREEMPT_RT, with and without rwlock revert)
>
> I tested 6 configurations to isolate the exact failure mode:
>
>  #  Kernel          Config   Tool             Revert   Result
>  -- --------------- -------- ---------------- -------- ----------------
>  1  6.18.20         non-RT   eosnoise         no       clean (100%)
>  2  6.18.20         RT       eosnoise         no       D state (hung)
>  3  6.18.20         RT       eosnoise         yes      clean (100%)
>  4  6.18.20         RT       kernel osnoise   no       clean (99.999%)
>  5  7.0-rc6-next    RT       eosnoise         no       93% avail, 57us
>  6  7.0-rc6-next    RT       eosnoise         yes      clean (99.99%)

Thanks for the detailed analysis.

> Key findings:
>
> 1.
> On 6.18.20-rt with spinlock, eosnoise hangs permanently in D state.
>
> The process blocks in do_epoll_ctl() during perf_buffer__new() setup
> (libbpf's perf_event_open + epoll_ctl loop). strace shows progressive
> degradation as fds are added to the epoll instance:
>
>   CPU 0-13:  epoll_ctl ~8 us   (normal)
>   CPU 14:    epoll_ctl 16 ms   (2000x slower)
>   CPU 15:    epoll_ctl 80 ms   (10000x slower)
>   CPU 16:    epoll_ctl 80 ms
>   CPU 17:    epoll_ctl 20 ms
>   CPU 18:    epoll_ctl -- hung, never returns --
>
> Kernel stack of the hung process (3+ minutes in D state):
>
>   [<0>] do_epoll_ctl+0xa57/0xf20
>   [<0>] __x64_sys_epoll_ctl+0x5d/0xa0
>   [<0>] do_syscall_64+0x7c/0xe30
>   [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> 2. On 7.0-rc6-next-rt with spinlock, eosnoise runs but with severe
>    noise. The difference from 6.18 is likely additional fixes in
>    linux-next that prevent the complete deadlock but not the
>    contention.
>
> 3. Kernel osnoise tracer (test #4) shows zero noise on the same
>    6.18.20-rt+spinlock kernel where eosnoise hangs. This confirms the
>    issue is specifically in the epoll rt_mutex path, not in osnoise
>    measurement methodology.
>
>    Kernel osnoise output (6.18.20-rt, spinlock, no revert):
>    99.999% availability, 1-4 ns max noise, RES=6 total in 120s
>
> 4. Non-RT kernel (test #1) with the same spinlock change shows zero
>    noise. This confirms the issue is the spinlock-to-rt_mutex
>    conversion on PREEMPT_RT, not the spinlock change itself.
>
> IRQ deltas on isolated CPU1 (120s):
>
>                    6.18.20-rt   6.18.20-rt    6.18.20   6.18.20-rt
>                    spinlock     rwlock(rev)   non-RT    kernel osnoise
>   RES (IPI):       (D state)    3             1         6
>   LOC (timer):     (D state)    3,325         1,185     245
>   IWI (irq work):  (D state)    565,988       1,433     121
>
>                    7.0-rc6-rt   7.0-rc6-rt
>                    spinlock     rwlock(rev)
>   RES (IPI):       330,000+     2
>   LOC (timer):     120,585      120,585
>   IWI (irq work):  585,785      585,785
>
> The mechanism, refined:
>
> Crystal was right that this is specific to the BPF perf_event_output +
> epoll pattern, not any arbitrary epoll user.
> I verified this: a plain
> perf_event_open + epoll_ctl program without BPF does not trigger the
> issue.
>
> What triggers it is libbpf's perf_buffer__new(), which creates one
> PERF_COUNT_SW_BPF_OUTPUT perf_event per CPU, mmaps the ring buffer,
> and adds all fds to a single epoll instance. When BPF programs are
> attached to high-frequency tracepoints (irq_handler_entry/exit,
> softirq_entry/exit, sched_switch), every interrupt on every CPU calls
> bpf_perf_event_output() which invokes ep_poll_callback() under
> ep->lock.
>
> On PREEMPT_RT, ep->lock is an rt_mutex. With 15+ CPUs generating
> callbacks simultaneously into the same epoll instance, the rt_mutex
> PI mechanism creates unbounded contention. On 6.18 this results in
> a permanent D state hang. On 7.0 it results in ~330,000 reschedule
> IPIs hitting isolated cores over 120 seconds (~2,750/s per core).
>
> With rwlock, ep_poll_callback() uses read_lock which allows concurrent
> readers without cross-CPU contention -- the callbacks execute in
> parallel without generating IPIs.

These IPIs do not exist without eosnoise running; eosnoise introduces
this noise into the system itself. For a noise tracer, it is certainly
eosnoise's responsibility to make sure it does not measure noise
originating from itself.

> This pattern (BPF tracepoint programs + perf ring buffer + epoll) is
> the standard architecture used by BCC tools (opensnoop, execsnoop,
> biolatency, tcpconnect, etc.), bpftrace, and any libbpf-based
> observability tool. A permanent D state hang when running such tools
> on PREEMPT_RT is a significant regression.

7.0-rc6-next still uses the spinlock but has no hang problem. You are
likely hitting a different problem here, one that appears when the
spinlock is used and that was fixed somewhere between 6.18.20 and
7.0-rc6-next. If you still have the energy for it, a git bisect
between 6.18.20 and 7.0-rc6-next will tell us which commit made the
hang issue disappear.
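To put a number on the effort involved: git bisect is a binary search
for the first "fixed" commit over a history that (assuming the hang was
fixed by a single change) flips from broken to fixed exactly once, so
the number of build-boot-test cycles grows only with the logarithm of
the commit count. A sketch of the search it automates, with a made-up
commit count and fix position:

```python
# Binary search for the first "fixed" commit -- the search that
# "git bisect" automates.  Assumes the history flips from broken to
# fixed exactly once between the two endpoints.
def first_fixed(n_commits, is_fixed):
    """Return (index of first fixed commit, number of test builds)."""
    lo, hi = 0, n_commits - 1   # lo: known broken, hi: known fixed
    builds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        builds += 1             # each probe = one build + boot + test
        if is_fixed(mid):
            hi = mid            # fix is at mid or earlier
        else:
            lo = mid            # still broken at mid
    return hi, builds

# Hypothetical history: ~52,000 commits between 6.18.20 and
# 7.0-rc6-next, with the (made-up) fix landing at commit 31,337.
FIX_AT = 31337
idx, builds = first_fixed(52000, lambda i: i >= FIX_AT)
print(f"first fixed commit index: {idx}, builds needed: {builds}")
```

Since the hunt here is for the commit that *fixed* the behavior rather
than one that broke it, git's custom bisect terms
(`git bisect start --term-new=fixed --term-old=broken`) avoid having to
mentally invert good and bad at every step.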
> I'm not proposing a specific fix -- the previous suggestions
> (raw_spinlock trylock, lockless path) were rightly rejected. But the
> regression exists and needs to be addressed. The ep->lock contention
> under high-frequency BPF callbacks on PREEMPT_RT is a new problem
> that the rwlock->spinlock conversion introduced.
>
> Separate question: could eosnoise itself be improved to avoid this
> contention? For example, using one epoll instance per CPU instead of
> a single shared one, or using BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
> instead of the per-cpu perf buffer which requires epoll. If the
> consensus is that the kernel side is working as intended and the tool
> should adapt, I'd like to understand what the recommended pattern is
> for BPF observability tools on PREEMPT_RT.

I am not familiar with eosnoise, so I can't tell you. I tried compiling
eosnoise, but that failed; I managed to fix the compile failure, then
hit a run-time failure.

It depends on what eosnoise is using epoll for. If it is just waiting
for PERF_COUNT_SW_BPF_OUTPUT events to happen, perhaps it could be
changed to some sort of polling implementation (e.g. wake up every
100ms to check for data).

Best regards,
Nam

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
  2026-04-02  4:42 ` Nam Cao
@ 2026-04-02  8:59   ` Ionut Nechita (Wind River)
  0 siblings, 0 replies; 13+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-04-02 8:59 UTC (permalink / raw)
To: namcao
Cc: jan.kiszka, crwood, florian.bezdeka, ionut.nechita, brauner,
    linux-fsdevel, linux-rt-users, stable, linux-kernel, bpf, frederic,
    vschneid, gregkh, chris.friesen, viorel-catalin.rapiteanu,
    iulian.mocanu

From: Ionut Nechita <ionut.nechita@windriver.com>

Nam, thanks for the feedback. I agree with your analysis -- this is
really two separate problems:

1. epoll_ctl D state hang on 6.18.20-rt (kernel-side)

   This hang does not reproduce on 7.0-rc6-next, which still uses the
   spinlock, so something between 6.18.20 and 7.0-rc6-next fixed it.
   A git bisect would identify the fix. I'll try to get to it when
   time permits, but since this is a different issue from the original
   report it will be prioritized separately.

2. eosnoise self-noise on PREEMPT_RT (tool-side)

   You're right that the noise and IPIs measured on 7.0-rc6-next
   originate from eosnoise itself -- the BPF callbacks on every
   tracepoint hit generate ep_poll_callback() contention that the tool
   then measures as system noise. This is a tool problem, not a kernel
   regression.

   I'll flag this internally with my team. The fix is likely one of:
   switching to the BPF ring buffer (BPF_MAP_TYPE_RINGBUF), which
   avoids the per-CPU perf buffer + epoll path entirely; using per-CPU
   epoll instances; polling, as you suggested; or switching to a
   different tool altogether.

Thanks to everyone for the thorough review -- it helped separate what
initially looked like one problem into two distinct issues.

Ionut

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us noise spikes on isolated PREEMPT_RT cores
  2026-04-01 16:58 ` Ionut Nechita (Wind River)
  2026-04-02  4:42   ` Nam Cao
@ 2026-04-02  9:49   ` Tomas Glozar
  1 sibling, 0 replies; 13+ messages in thread
From: Tomas Glozar @ 2026-04-02 9:49 UTC (permalink / raw)
To: Ionut Nechita (Wind River)
Cc: jan.kiszka, crwood, florian.bezdeka, namcao, brauner,
    linux-fsdevel, linux-rt-users, stable, linux-kernel, bpf, frederic,
    vschneid, gregkh, chris.friesen, viorel-catalin.rapiteanu,
    iulian.mocanu

On Wed, 1 Apr 2026 at 19:08, Ionut Nechita (Wind River)
<ionut.nechita@windriver.com> wrote:
>
> Separate question: could eosnoise itself be improved to avoid this
> contention? For example, using one epoll instance per CPU instead of
> a single shared one, or using BPF ring buffer (BPF_MAP_TYPE_RINGBUF)
> instead of the per-cpu perf buffer which requires epoll.

Neither BPF ring buffers nor perf event buffers strictly require you to
use epoll. Just as a BPF ring buffer can be read using libbpf's
ring_buffer__consume() [1] without polling, perf_buffer__consume() [2]
can be used the same way for the perf event ring buffer; neither of the
functions blocks. If you need to poll, the BPF ring buffer also uses
epoll_wait() [3], so that won't make a difference (or is there another
way to poll it?).

[1] https://docs.ebpf.io/ebpf-library/libbpf/userspace/ring_buffer__consume/
[2] https://docs.ebpf.io/ebpf-library/libbpf/userspace/perf_buffer__consume/
[3] https://github.com/libbpf/libbpf/blob/master/src/ringbuf.c#L341

That being said, the BPF ring buffer is not per-CPU and should allow
collecting data from all CPUs into one buffer.

> If the consensus is that the kernel side is working as intended and
> the tool should adapt, I'd like to understand what the recommended
> pattern is for BPF observability tools on PREEMPT_RT.
The ideal solution is to aggregate data in BPF directly, not in user
space, and to collect it at the end of the measurement, when possible.
This is what rtla-timerlat does for collecting samples [4], where it
was implemented to prevent the collecting user-space thread from being
overloaded with too many samples on systems with a large number of
CPUs; polling on the ring buffer is used only to signal the end of
tracing when the latency threshold is crossed, and no issues have been
reported with that.

To collect data about system noise, timerlat collects the events in an
ftrace ring buffer, and then analyzes the tail of the buffer (i.e. what
is relevant to the spike, not all data throughout the entire
measurement) in user space [5]. The same could be replicated in
eosnoise, i.e. collecting the data into a ring buffer and only reading
the tail in user space, if that suffices for your use case.

[4] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/tools/tracing/rtla/src/timerlat.bpf.c
[5] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/tracing/rtla/src/timerlat_aa.c

Tomas

^ permalink raw reply	[flat|nested] 13+ messages in thread
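The collect-then-read-the-tail pattern described in the last message
above can be sketched abstractly. This is plain Python, not rtla or
eosnoise code: a bounded deque stands in for the ftrace/BPF ring
buffer, producers overwrite the oldest entries during the measurement,
and "user space" examines only the retained tail once the run ends:

```python
from collections import deque

# A bounded ring buffer: once full, each append silently drops the
# oldest entry, so memory stays constant regardless of event rate.
RING_SLOTS = 64
ring = deque(maxlen=RING_SLOTS)

# Simulated measurement: an event rate that would overwhelm a
# user-space reader if every event were consumed live.
for seq in range(10_000):
    ring.append({"seq": seq, "noise_ns": seq % 7})

# Post-measurement analysis reads only the retained tail -- the events
# closest to the end of the run (e.g. around a latency spike).
tail = list(ring)
print("events produced: 10000, events retained:", len(tail))
print("oldest retained seq:", tail[0]["seq"], "newest:", tail[-1]["seq"])
```

The key property is that the reader never competes with the producers
during the measurement window; it only inspects what the ring retained
after the fact, which is what keeps the tool from measuring its own
collection overhead as noise.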
end of thread, other threads:[~2026-04-02  9:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2026-03-26 14:00 [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50µs noise spikes on isolated PREEMPT_RT cores Ionut Nechita (Wind River)
2026-03-26 14:31 ` Sebastian Andrzej Siewior
2026-03-26 14:52 ` Greg KH
2026-03-26 16:21 ` Ionut Nechita (Wind River)
2026-03-26 18:12 ` Crystal Wood
2026-03-27  7:44 ` Florian Bezdeka
2026-03-27 18:36 ` [REGRESSION] osnoise: "eventpoll: Replace rwlock with spinlock" causes ~50us " Ionut Nechita (Wind River)
2026-03-27 21:20 ` Crystal Wood
2026-03-28  6:00 ` Jan Kiszka
2026-04-01 16:58 ` Ionut Nechita (Wind River)
2026-04-02  4:42 ` Nam Cao
2026-04-02  8:59 ` Ionut Nechita (Wind River)
2026-04-02  9:49 ` Tomas Glozar
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox