* [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Valentin Schneider @ 2023-10-12 15:07 UTC (permalink / raw)
To: linux-rt-users, linux-kernel
Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Juri Lelli,
Clark Williams, Luis Claudio R. Goncalves
Hi folks,
We've had reports of stalls happening on our v6.0-ish frankenkernels, and while
we haven't been able to come up with a reproducer (yet), I don't see anything
upstream that would prevent them from happening.
The setup involves eventpoll, CFS bandwidth controller and timer
expiry, and the sequence looks as follows (time-ordered):
p_read (on CPUn, CFS with bandwidth controller active)
======
ep_poll_callback()
read_lock_irqsave()
...
try_to_wake_up() <- enqueue causes an update_curr() + sets need_resched
due to having no more runtime
preempt_enable()
preempt_schedule() <- switch out due to p_read being now throttled
p_write
=======
ep_poll()
write_lock_irq() <- blocks due to having active readers (p_read)
ktimers/n
=========
timerfd_tmrproc()
`\
ep_poll_callback()
`\
read_lock_irqsave() <- blocks due to having active writer (p_write)
From this point we have a circular dependency:
p_read -> ktimers/n (to replenish runtime of p_read)
ktimers/n -> p_write (to let ktimers/n acquire the readlock)
p_write -> p_read (to let p_write acquire the writelock)
IIUC reverting
286deb7ec03d ("locking/rwbase: Mitigate indefinite writer starvation")
should unblock this as the ktimers/n thread wouldn't block, but then we're back
to having the indefinite starvation so I wouldn't necessarily call this a win.
Two options I'm seeing:
- Prevent p_read from being preempted when it's doing the wakeups under the
readlock (icky)
- Prevent ktimers / ksoftirqd (*) from running the wakeups that have
ep_poll_callback() as a wait_queue_entry callback. Punting that to e.g. a
kworker /should/ do.
(*) It's not just timerfd, I've also seen it via net::sock_def_readable -
it should be anything that's pollable.
I'm still scratching my head on this, so any suggestions/comments welcome!
Cheers,
Valentin
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Jan Kiszka @ 2025-04-09  6:41 UTC (permalink / raw)
To: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior,
	linux-kernel, kprateek.nayak
Cc: Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

On 12.10.23 17:07, Valentin Schneider wrote:
> Hi folks,
>
> We've had reports of stalls happening on our v6.0-ish frankenkernels, and while
> we haven't been able to come up with a reproducer (yet), I don't see anything
> upstream that would prevent them from happening.
>
> [...]
>
> I'm still scratching my head on this, so any suggestions/comments welcome!
>

We have been hunting sporadic lock-ups on RT systems for quite some time,
first only in the field (sigh), now finally also in the lab. Those have
a fairly high overlap with what was described here. Our baselines so
far: 6.1-rt, Debian and vanilla. We are currently preparing experiments
with the latest mainline.

While this thread remained silent afterwards, we have found [1][2][3] as
apparently related. But does this mean we are still stuck with this RT
bug, even in the latest 6.15-rc1?

Jan

[1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@redhat.com/
[2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/
[3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/

-- 
Siemens AG, Foundational Technologies
Linux Expert Center
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: K Prateek Nayak @ 2025-04-09  9:29 UTC (permalink / raw)
To: Jan Kiszka, Valentin Schneider, linux-rt-users,
	Sebastian Andrzej Siewior, linux-kernel, Aaron Lu
Cc: Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

(+ Aaron)

Hello Jan,

On 4/9/2025 12:11 PM, Jan Kiszka wrote:
> On 12.10.23 17:07, Valentin Schneider wrote:
>> [...]
>
> We have been hunting sporadic lock-ups on RT systems for quite some time,
> first only in the field (sigh), now finally also in the lab. Those have
> a fairly high overlap with what was described here. Our baselines so
> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments
> with the latest mainline.

Do the backtraces from these lockups show tasks (specifically ktimerd)
waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling
is the reason for the long delay / circular dependency. Is cfs bandwidth
throttling in use on the systems that run into these lockups?
Otherwise, your issue might be completely different.

>
> While this thread remained silent afterwards, we have found [1][2][3] as
> apparently related. But does this mean we are still stuck with this RT
> bug, even in the latest 6.15-rc1?

I'm pretty sure a bunch of locking related stuff has been reworked to
accommodate PREEMPT_RT since v6.1. Many rwsem based locking patterns
have been replaced with alternatives like RCU. The recently introduced
dl_server infrastructure also helps prevent starvation of fair tasks,
which can allow progress and prevent lockups. I would recommend
checking whether the most recent -rt release can still reproduce your
issue:
https://lore.kernel.org/lkml/20250331095610.ulLtPP2C@linutronix.de/

Note: Aaron Lu is working on Valentin's approach of deferring cfs
throttling to the exit-to-user-mode boundary:
https://lore.kernel.org/lkml/20250313072030.1032893-1-ziqianlu@bytedance.com/

If you still run into lockups / long latencies on the latest -rt
release and your system is using cfs bandwidth controls, you can
try running with Valentin's or Aaron's series to check whether
throttle deferral helps your scenario.

>
> [1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@redhat.com/
> [2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/
> [3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/

I'm mostly testing and reviewing Aaron's series now, since per-task
throttling seems to be the way forward based on discussions in the
community.

-- 
Thanks and Regards,
Prateek
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Aaron Lu @ 2025-04-09 12:13 UTC (permalink / raw)
To: K Prateek Nayak, Jan Kiszka
Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior,
	linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

On Wed, Apr 09, 2025 at 02:59:18PM +0530, K Prateek Nayak wrote:
> (+ Aaron)

Thank you Prateek for bringing me in.

> [...]
>
> Do the backtraces from these lockups show tasks (specifically ktimerd)
> waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling
> is the reason for the long delay / circular dependency. Is cfs bandwidth
> throttling in use on the systems that run into these lockups?
> Otherwise, your issue might be completely different.

Agree.

> [...]
>
> If you still run into lockups / long latencies on the latest -rt
> release and your system is using cfs bandwidth controls, you can
> try running with Valentin's or Aaron's series to check whether
> throttle deferral helps your scenario.

I just sent out v2 :-)
https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/

Hi Jan,

If you want to give it a try, please try v2.

Thanks.
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Jan Kiszka @ 2025-04-09 13:44 UTC (permalink / raw)
To: Aaron Lu, K Prateek Nayak
Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior,
	linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

On 09.04.25 14:13, Aaron Lu wrote:
> [...]
>
> I just sent out v2 :-)
> https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/
>
> Hi Jan,
>
> If you want to give it a try, please try v2.

Thanks, we are updating our setup right now.

BTW, does anyone already have a test case that reproduces the lockup
with one or two simple programs and some hectic CFS bandwidth settings?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:44 ` Jan Kiszka @ 2025-04-14 14:50 ` K Prateek Nayak 2025-04-14 15:05 ` Sebastian Andrzej Siewior 2025-04-14 16:21 ` K Prateek Nayak 0 siblings, 2 replies; 18+ messages in thread From: K Prateek Nayak @ 2025-04-14 14:50 UTC (permalink / raw) To: Jan Kiszka, Aaron Lu Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka Hello Jan, On 4/9/2025 7:14 PM, Jan Kiszka wrote: > On 09.04.25 14:13, Aaron Lu wrote: >> On Wed, Apr 09, 2025 at 02:59:18PM +0530, K Prateek Nayak wrote: >>> (+ Aaron) >> >> Thank you Prateek for bring me in. >> >>> Hello Jan, >>> >>> On 4/9/2025 12:11 PM, Jan Kiszka wrote: >>>> On 12.10.23 17:07, Valentin Schneider wrote: >>>>> Hi folks, >>>>> >>>>> We've had reports of stalls happening on our v6.0-ish frankenkernels, and while >>>>> we haven't been able to come out with a reproducer (yet), I don't see anything >>>>> upstream that would prevent them from happening. >>>>> >>>>> The setup involves eventpoll, CFS bandwidth controller and timer >>>>> expiry, and the sequence looks as follows (time-ordered): >>>>> >>>>> p_read (on CPUn, CFS with bandwidth controller active) >>>>> ====== >>>>> >>>>> ep_poll_callback() >>>>> read_lock_irqsave() >>>>> ... 
>>>>> try_to_wake_up() <- enqueue causes an update_curr() + sets need_resched >>>>> due to having no more runtime >>>>> preempt_enable() >>>>> preempt_schedule() <- switch out due to p_read being now throttled >>>>> >>>>> p_write >>>>> ======= >>>>> >>>>> ep_poll() >>>>> write_lock_irq() <- blocks due to having active readers (p_read) >>>>> >>>>> ktimers/n >>>>> ========= >>>>> >>>>> timerfd_tmrproc() >>>>> `\ >>>>> ep_poll_callback() >>>>> `\ >>>>> read_lock_irqsave() <- blocks due to having active writer (p_write) >>>>> >>>>> >>>>> From this point we have a circular dependency: >>>>> >>>>> p_read -> ktimers/n (to replenish runtime of p_read) >>>>> ktimers/n -> p_write (to let ktimers/n acquire the readlock) >>>>> p_write -> p_read (to let p_write acquire the writelock) >>>>> >>>>> IIUC reverting >>>>> 286deb7ec03d ("locking/rwbase: Mitigate indefinite writer starvation") >>>>> should unblock this as the ktimers/n thread wouldn't block, but then we're back >>>>> to having the indefinite starvation so I wouldn't necessarily call this a win. >>>>> >>>>> Two options I'm seeing: >>>>> - Prevent p_read from being preempted when it's doing the wakeups under the >>>>> readlock (icky) >>>>> - Prevent ktimers / ksoftirqd (*) from running the wakeups that have >>>>> ep_poll_callback() as a wait_queue_entry callback. Punting that to e.g. a >>>>> kworker /should/ do. >>>>> >>>>> (*) It's not just timerfd, I've also seen it via net::sock_def_readable - >>>>> it should be anything that's pollable. >>>>> >>>>> I'm still scratching my head on this, so any suggestions/comments welcome! >>>>> >>>> >>>> We are hunting for quite some time sporadic lock-ups or RT systems, >>>> first only in the field (sigh), now finally also in the lab. Those have >>>> a fairly high overlap with what was described here. Our baselines so >>>> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments >>>> with latest mainline. 
>>> >>> Do the backtrace from these lockups show tasks (specifically ktimerd) >>> waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling >>> becomes the reason for long delay / circular dependency. Is cfs bandwidth >>> throttling being used on these systems that run into these lockups? >>> Otherwise, your issue might be completely different. >> >> Agree. >> >>>> >>>> While this thread remained silent afterwards, we have found [1][2][3] as >>>> apparently related. But this means we are still with this RT bug, even >>>> in latest 6.15-rc1? >>> >>> I'm pretty sure a bunch of locking related stuff has been reworked to >>> accommodate PREEMPT_RT since v6.1. Many rwsem based locking patterns >>> have been replaced with alternatives like RCU. Recently introduced >>> dl_server infrastructure also helps prevent starvation of fair tasks >>> which can allow progress and prevent lockups. I would recommend >>> checking if the most recent -rt release can still reproduce your >>> issue: >>> https://lore.kernel.org/lkml/20250331095610.ulLtPP2C@linutronix.de/ >>> >>> Note: Aaron Lu is working on Valentin's approach of deferring cfs >>> throttling to exit to user mode boundary >>> https://lore.kernel.org/lkml/20250313072030.1032893-1-ziqianlu@bytedance.com/ >>> >>> If you still run into the issue of a lockup / long latencies on latest >>> -rt release and your system is using cfs bandwidth controls, you can >>> perhaps try running with Valentin's or Aaron's series to check if >>> throttle deferral helps your scenario. >> >> I just sent out v2 :-) >> https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/ >> >> Hi Jan, >> >> If you want to give it a try, please try v2. >> > > Thanks, we are updating our setup right now. > > BTW, does anyone already have a test case that produces the lockup issue > with one or two simple programs and some hectic CFS bandwidth settings? 
This is your cue to grab a brown paper bag since what I'm about to paste below is probably lifetime without parole in the RT land but I believe it gets close to the scenario described by Valentin: (Based on v6.15-rc1; I haven't yet tested this with Aaron's series yet) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e43993a4e580..7ed0a4923ca2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6497,6 +6497,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) int count = 0; raw_spin_lock_irqsave(&cfs_b->lock, flags); + pr_crit("sched_cfs_period_timer: Started on CPU%d\n", smp_processor_id()); for (;;) { overrun = hrtimer_forward_now(timer, cfs_b->period); if (!overrun) @@ -6537,6 +6538,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) } if (idle) cfs_b->period_active = 0; + pr_crit("sched_cfs_period_timer: Finished on CPU%d\n", smp_processor_id()); raw_spin_unlock_irqrestore(&cfs_b->lock, flags); return idle ? HRTIMER_NORESTART : HRTIMER_RESTART; diff --git a/kernel/sys.c b/kernel/sys.c index c434968e9f5d..d68b05963b88 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2470,6 +2470,79 @@ static int prctl_get_auxv(void __user *addr, unsigned long len) return sizeof(mm->saved_auxv); } +/* These variables will be used in dumb ways. */ +raw_spinlock_t dwdt_spin_lock; +struct hrtimer dwtd_timer; +DEFINE_RWLOCK(dwdt_lock); + +/* Should send ktimerd into a deadlock */ +static enum hrtimer_restart deadlock_timer(struct hrtimer *timer) +{ + pr_crit("deadlock_timer: Started on CPU%d\n", smp_processor_id()); + /* Should hit rtlock slowpath after kthread writer. */ + read_lock(&dwdt_lock); + read_unlock(&dwdt_lock); + pr_crit("deadlock_timer: Finished on CPU%d\n", smp_processor_id()); + return HRTIMER_NORESTART; +} + +/* kthread function to preempt fair thread and block on write lock. 
*/ +static int grab_dumb_lock(void *data) +{ + pr_crit("RT kthread: Started on CPU%d\n", smp_processor_id()); + write_lock_irq(&dwdt_lock); + write_unlock_irq(&dwdt_lock); + pr_crit("RT kthread: Finished on CPU%d\n", smp_processor_id()); + + return 0; +} + +/* Try to send ktimerd into a deadlock. */ +static void dumb_ways_to_die(unsigned long loops) +{ + struct task_struct *kt; + unsigned long i; + int cpu; + + migrate_disable(); + + cpu = smp_processor_id(); + pr_crit("dumb_ways_to_die: Started on CPU%d with %lu loops\n", cpu, loops); + + raw_spin_lock_init(&dwdt_spin_lock); + hrtimer_setup(&dwtd_timer, deadlock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); + kt = kthread_create_on_cpu(&grab_dumb_lock, NULL, cpu, "dumb-thread"); + + read_lock_irq(&dwdt_lock); + + /* Dummy lock; Disables preemption. */ + raw_spin_lock(&dwdt_spin_lock); + + pr_crit("dumb_ways_to_die: Queuing timer on CPU%d\n", cpu); + /* Start a timer that will run before the bandwidth timer. */ + hrtimer_forward_now(&dwtd_timer, ns_to_ktime(10000)); + hrtimer_start_expires(&dwtd_timer, HRTIMER_MODE_ABS_PINNED); + + pr_crit("dumb_ways_to_die: Waking up RT kthread on CPU%d\n", cpu); + sched_set_fifo(kt); /* Create a high priority thread. */ + wake_up_process(kt); + + /* Exhaust bandwidth of caller */ + for (i = 0; i < loops; ++i) + cpu_relax(); + + /* Enable preemption; kt should preempt now. */ + raw_spin_unlock(&dwdt_spin_lock); + + /* Waste time just in case RT task has not preempted us. (very unlikely!) 
*/ + for (i = 0; i < loops; ++i) + cpu_relax(); + + read_unlock_irq(&dwdt_lock); + pr_crit("dumb_ways_to_die: Finished on CPU%d with %lu loops\n", cpu, loops); + migrate_enable(); +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2483,6 +2556,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = 0; switch (option) { + case 666: + dumb_ways_to_die(arg2); + break; case PR_SET_PDEATHSIG: if (!valid_signal(arg2)) { error = -EINVAL; -- The above adds a prctl() to trigger the scenario whose flow I've described with some inline comments. Patterns like above is a crime but I've done it in the name of science. Steps to reproduce: # mkdir /sys/fs/cgroup/CG0 # echo $$ > /sys/fs/cgroup/CG0/cgroup.procs # echo "500000 1000000" > /sys/fs/cgroup/CG0/cpu.max # dmesg | tail -n 2 # Find the CPU where bandwidth timer is running [ 175.919325] sched_cfs_period_timer: Started on CPU214 [ 175.919330] sched_cfs_period_timer: Finished on CPU214 # taskset -c 214 perl -e 'syscall 157,666,50000000' # Pin perl to same CPU, 50M loops Note: You have to pin the perl command to the same CPU as the timer for it to run into stalls. It may take a couple of attempts. Also please adjust the number of loops of cpu_relax() based on your setup. In my case, 50M loops runs long enough to exhaust the cfs bandwidth. 
With this I see: sched_cfs_period_timer: Started on CPU214 sched_cfs_period_timer: Finished on CPU214 dumb_ways_to_die: Started on CPU214 with 50000000 loops dumb_ways_to_die: Queuing timer on CPU214 dumb_ways_to_die: Waking up RT kthread on CPU214 RT kthread: Started on CPU214 deadlock_timer: Started on CPU214 rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 208-223): P1975/3:b..l rcu: (detected by 124, t=15002 jiffies, g=3201, q=138 ncpus=256) task:ktimers/214 state:D stack:0 pid:1975 tgid:1975 ppid:2 task_flags:0x4208040 flags:0x00004000 Call Trace: <TASK> __schedule+0x401/0x15a0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? update_rq_clock+0x7c/0x120 ? srso_alias_return_thunk+0x5/0xfbef5 ? rt_mutex_setprio+0x1c2/0x480 schedule_rtlock+0x1e/0x40 rtlock_slowlock_locked+0x20e/0xc60 rt_read_lock+0x8f/0x190 ? __pfx_deadlock_timer+0x10/0x10 deadlock_timer+0x28/0x50 __hrtimer_run_queues+0xfd/0x2e0 hrtimer_run_softirq+0x9d/0xf0 handle_softirqs.constprop.0+0xc1/0x2a0 ? __pfx_smpboot_thread_fn+0x10/0x10 run_ktimerd+0x3e/0x80 smpboot_thread_fn+0xf3/0x220 kthread+0xff/0x210 ? rt_spin_lock+0x3c/0xc0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> I get rcub stall messages after a while (from a separate trace): INFO: task rcub/4:462 blocked for more than 120 seconds. Not tainted 6.15.0-rc1-test+ #743 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:rcub/4 state:D stack:0 pid:462 tgid:462 ppid:2 task_flags:0x208040 flags:0x00004000 Call Trace: <TASK> __schedule+0x401/0x15a0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? rt_mutex_adjust_prio_chain+0xa5/0x7e0 rt_mutex_schedule+0x20/0x40 rt_mutex_slowlock_block.constprop.0+0x42/0x1e0 __rt_mutex_slowlock_locked.constprop.0+0xa7/0x210 rt_mutex_slowlock.constprop.0+0x4e/0xc0 rcu_boost_kthread+0xe3/0x320 ? 
__pfx_rcu_boost_kthread+0x10/0x10 kthread+0xff/0x210 ? rt_spin_lock+0x3c/0xc0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> I left the program on for a good 5 minutes and it did not budge after the splat. Note: I could not reproduce the splat with a !PREEMPT_RT kernel (CONFIG_PREEMPT=y) or with small loop counts that don't exhaust the cfs bandwidth. > > Jan > -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 18+ messages in thread
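For completeness, the perl one-liner from the reproducer can be expressed as a small C helper. This is just a sketch around the debug prctl() above; option 666 is an assumption that only holds with the debug patch applied, and a stock kernel rejects the unknown option with EINVAL:

```c
/*
 * Hypothetical userspace trigger, equivalent to the perl one-liner in the
 * reproducer above. prctl option 666 only exists with the debug patch;
 * on a stock kernel the call simply fails with EINVAL.
 */
#include <errno.h>
#include <sys/prctl.h>

/* Returns 0 if the debug prctl() was accepted, else the errno it produced. */
static int trigger_dumb_ways(unsigned long loops)
{
	if (prctl(666, loops, 0, 0, 0) != 0)
		return errno;
	return 0;
}
```

Pinning a binary built around this helper with taskset -c <timer CPU> reproduces the same flow as the perl invocation.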
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 14:50 ` K Prateek Nayak @ 2025-04-14 15:05 ` Sebastian Andrzej Siewior 2025-04-14 15:18 ` K Prateek Nayak 2025-04-15 5:35 ` Jan Kiszka 2025-04-14 16:21 ` K Prateek Nayak 1 sibling, 2 replies; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-14 15:05 UTC (permalink / raw) To: K Prateek Nayak Cc: Jan Kiszka, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-14 20:20:04 [+0530], K Prateek Nayak wrote: > Note: I could not reproduce the splat with !PREEMPT_RT kernel > (CONFIG_PREEMPT=y) or with small loops counts that don't exhaust the > cfs bandwidth. Not sure what this has to do with anything. On !RT the read_lock() in the timer can be acquired even with a pending writer. The writer keeps spinning until the main thread is gone. There should be no RCU boosting but the RCU still is there, too. On RT the read_lock() in the timer blocks, and the write blocks, too. So every blocker on the lock is scheduled out until the reader is gone. On top of that, the reader gets RCU boosted with FIFO-1 by default to get out. Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 15:05 ` Sebastian Andrzej Siewior @ 2025-04-14 15:18 ` K Prateek Nayak 0 siblings, 0 replies; 18+ messages in thread From: K Prateek Nayak @ 2025-04-14 15:18 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Jan Kiszka, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka Hello Sebastian, On 4/14/2025 8:35 PM, Sebastian Andrzej Siewior wrote: > On 2025-04-14 20:20:04 [+0530], K Prateek Nayak wrote: >> Note: I could not reproduce the splat with !PREEMPT_RT kernel >> (CONFIG_PREEMPT=y) or with small loops counts that don't exhaust the >> cfs bandwidth. > > Not sure what this has to do with anything. Let me clarify a bit more: - Fair task with cfs_bandwidth limits triggers the prctl(666, 50000000) - The prctl() takes a read_lock_irq(), except on PREEMPT_RT this does not disable interrupts. - I take a dummy lock to hold off preemption - Within the read_lock critical section, I queue a timer that takes the read_lock. - I also wake up a high-priority RT task that takes the write_lock As soon as I drop the dummy raw_spin_lock: - High priority RT task runs, tries to take the write_lock but cannot since the preempted fair task still holds the read end. - Next, ktimerd runs trying to grab the read_lock() but is put in the slowpath since the RT task has tried to take the write_lock - The fair task runs out of bandwidth and is preempted, but this requires ktimerd to run the replenish function, which is queued behind the already blocked timer function trying to grab the read_lock() Isn't this the scenario that Valentin's original summary describes? If I've got something wrong, please do correct me. > On !RT the read_lock() in the timer can be acquired even with a pending > writer.
The writer keeps spinning until the main thread is gone. There > should be no RCU boosting but the RCU still is there, too. On !RT, the read_lock_irq() in the fair task will not be preempted in the first place, so progress is guaranteed that way, right? > > On RT the read_lock() in the timer block, the write blocks, too. So > every blocker on the lock is scheduled out until the reader is gone. On > top of that, the reader gets RCU boosted with FIFO-1 by default to get > out. Except there is a circular dependency now: - fair task needs bandwidth replenishment to progress and drop the lock. - rt task needs the fair task to drop the lock and grab the write end. - ktimerd requires the rt task to grab and drop the lock to make progress. I'm fairly new to the PREEMPT_RT bits, so if I've missed something, please do let me know and sorry for any noise. > > Sebastian -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 18+ messages in thread
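The three-way wait described above can be sanity-checked with a toy wait-for graph. This is purely an illustrative model, with task names and edges taken from the list above; it is not kernel code:

```c
/* Toy wait-for graph model of the cycle described in this thread:
 * p_read -> ktimers (needs its bandwidth replenished),
 * ktimers -> p_write (reader forced into the slowpath behind the writer),
 * p_write -> p_read (writer blocked on the active reader).
 * A cycle in this graph means none of the three can make progress. */
#include <stdbool.h>

enum task { P_READ, KTIMERS, P_WRITE, NTASKS };

/* waits_on[a] == b means task a cannot run until task b does */
static const int waits_on[NTASKS] = {
	[P_READ]  = KTIMERS,  /* throttled, needs the replenish timer */
	[KTIMERS] = P_WRITE,  /* read_lock() slowpath behind the writer */
	[P_WRITE] = P_READ,   /* write_lock() blocked on the reader */
};

/* Walk the chain from 'start'; true if it loops back (deadlock). */
static bool in_cycle(int start)
{
	int cur = waits_on[start];

	for (int hops = 0; hops < NTASKS; hops++) {
		if (cur == start)
			return true;
		cur = waits_on[cur];
	}
	return false;
}
```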
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 15:05 ` Sebastian Andrzej Siewior 2025-04-14 15:18 ` K Prateek Nayak @ 2025-04-15 5:35 ` Jan Kiszka 2025-04-15 6:23 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-15 5:35 UTC (permalink / raw) To: Sebastian Andrzej Siewior, K Prateek Nayak Cc: Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 14.04.25 17:05, Sebastian Andrzej Siewior wrote: > On 2025-04-14 20:20:04 [+0530], K Prateek Nayak wrote: >> Note: I could not reproduce the splat with !PREEMPT_RT kernel >> (CONFIG_PREEMPT=y) or with small loops counts that don't exhaust the >> cfs bandwidth. > > Not sure what this has to do with anything. > On !RT the read_lock() in the timer can be acquired even with a pending > writer. The writer keeps spinning until the main thread is gone. There > should be no RCU boosting but the RCU still is there, too. > > On RT the read_lock() in the timer block, the write blocks, too. So > every blocker on the lock is scheduled out until the reader is gone. On > top of that, the reader gets RCU boosted with FIFO-1 by default to get > out. There is no boosting of the active readers on RT as there is no information recorded about who is currently holding a read lock. This is the whole point why rwlocks are hairy with RT, I thought. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 5:35 ` Jan Kiszka @ 2025-04-15 6:23 ` Sebastian Andrzej Siewior 2025-04-15 6:54 ` Jan Kiszka 0 siblings, 1 reply; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-15 6:23 UTC (permalink / raw) To: Jan Kiszka Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: > > On RT the read_lock() in the timer block, the write blocks, too. So > > every blocker on the lock is scheduled out until the reader is gone. On > > top of that, the reader gets RCU boosted with FIFO-1 by default to get > > out. > > There is no boosting of the active readers on RT as there is no > information recorded about who is currently holding a read lock. This is > the whole point why rwlocks are hairy with RT, I thought. Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with SCHED_FIFO 1. If you acquire a readlock you start an RCU section. If you get stuck in an RCU section for too long then this boosting will take effect by making the task within the RCU section the owner of the boost-lock; the boosting task will then try to acquire it. This is used to get SCHED_OTHER tasks out of the RCU section. But if a SCHED_FIFO task is on the CPU then this boosting will have no effect because the scheduler will not switch to a task with lower priority. > Jan Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
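A toy model of that last point: the FIFO-1 default comes from the text above, and the rest is just the generic "higher priority wins" rule, not actual scheduler code:

```c
/* Toy model of why FIFO-1 RCU boosting doesn't help here: the scheduler
 * only switches to the boosted reader if its priority exceeds whatever
 * currently runs on the CPU. Priorities use the usual 0..99 FIFO range,
 * with SCHED_OTHER modeled as 0. */
#include <stdbool.h>

#define RCU_BOOST_PRIO 1  /* PREEMPT_RT default: boost readers to FIFO-1 */

/* Would a reader boosted to FIFO-1 preempt the currently running task? */
static bool boost_unsticks_reader(int curr_prio)
{
	return RCU_BOOST_PRIO > curr_prio;
}
```

So boosting gets a SCHED_OTHER reader running again, but does nothing once any FIFO task of priority 1 or above occupies the CPU.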
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 6:23 ` Sebastian Andrzej Siewior @ 2025-04-15 6:54 ` Jan Kiszka 2025-04-15 8:00 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-15 6:54 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 15.04.25 08:23, Sebastian Andrzej Siewior wrote: > On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: >>> On RT the read_lock() in the timer block, the write blocks, too. So >>> every blocker on the lock is scheduled out until the reader is gone. On >>> top of that, the reader gets RCU boosted with FIFO-1 by default to get >>> out. >> >> There is no boosting of the active readers on RT as there is no >> information recorded about who is currently holding a read lock. This is >> the whole point why rwlocks are hairy with RT, I thought. > > Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with > SCHED_FIFO 1. If you acquire a readlock you start a RCU section. If you > get stuck in a RCU section for too long then this boosting will take > effect by making the task, within the RCU section, the owner of the > boost-lock and the boosting task will try to acquire it. This is used to > get SCHED_OTHER tasks out of the RCU section. > But if a SCHED_FIFO task is on the CPU then this boosting will have to > no effect because the scheduler will not switch to a task with lower > priority. Does that boosting happen to need ktimersd or ksoftirqd (which both are stalling in our case)? I'm still looking for the reason why it does not help in the observed stall scenarios. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 6:54 ` Jan Kiszka @ 2025-04-15 8:00 ` Sebastian Andrzej Siewior 2025-04-15 10:23 ` Jan Kiszka 0 siblings, 1 reply; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-15 8:00 UTC (permalink / raw) To: Jan Kiszka Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-15 08:54:01 [+0200], Jan Kiszka wrote: > On 15.04.25 08:23, Sebastian Andrzej Siewior wrote: > > On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: > >>> On RT the read_lock() in the timer block, the write blocks, too. So > >>> every blocker on the lock is scheduled out until the reader is gone. On > >>> top of that, the reader gets RCU boosted with FIFO-1 by default to get > >>> out. > >> > >> There is no boosting of the active readers on RT as there is no > >> information recorded about who is currently holding a read lock. This is > >> the whole point why rwlocks are hairy with RT, I thought. > > > > Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with > > SCHED_FIFO 1. If you acquire a readlock you start a RCU section. If you > > get stuck in a RCU section for too long then this boosting will take > > effect by making the task, within the RCU section, the owner of the > > boost-lock and the boosting task will try to acquire it. This is used to > > get SCHED_OTHER tasks out of the RCU section. > > But if a SCHED_FIFO task is on the CPU then this boosting will have to > > no effect because the scheduler will not switch to a task with lower > > priority. > > Does that boosting happen to need ktimersd or ksoftirqd (which both are > stalling in our case)? I'm still looking for the reason why it does not > help in the observed stall scenarios. Your problem is that you likely have many readers which need to get out
That spinlock replacement will help. I'm not sure about the CFS patch referenced in the thread here. That boosting requires a RCU reader that starts the mechanism (on rcu unlock). But I don't think that it will help. You would also need to raise the priority above to the writer level (manually) and that will likely break other things. It is meant to unstuck SCHED_OTHER tasks and not boost stuck reader as a side effect. Also I am not sure how that works with multiple tasks. > Jan Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 8:00 ` Sebastian Andrzej Siewior @ 2025-04-15 10:23 ` Jan Kiszka 0 siblings, 0 replies; 18+ messages in thread From: Jan Kiszka @ 2025-04-15 10:23 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 15.04.25 10:00, Sebastian Andrzej Siewior wrote: > On 2025-04-15 08:54:01 [+0200], Jan Kiszka wrote: >> On 15.04.25 08:23, Sebastian Andrzej Siewior wrote: >>> On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: >>>>> On RT the read_lock() in the timer block, the write blocks, too. So >>>>> every blocker on the lock is scheduled out until the reader is gone. On >>>>> top of that, the reader gets RCU boosted with FIFO-1 by default to get >>>>> out. >>>> >>>> There is no boosting of the active readers on RT as there is no >>>> information recorded about who is currently holding a read lock. This is >>>> the whole point why rwlocks are hairy with RT, I thought. >>> >>> Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with >>> SCHED_FIFO 1. If you acquire a readlock you start a RCU section. If you >>> get stuck in a RCU section for too long then this boosting will take >>> effect by making the task, within the RCU section, the owner of the >>> boost-lock and the boosting task will try to acquire it. This is used to >>> get SCHED_OTHER tasks out of the RCU section. >>> But if a SCHED_FIFO task is on the CPU then this boosting will have to >>> no effect because the scheduler will not switch to a task with lower >>> priority. >> >> Does that boosting happen to need ktimersd or ksoftirqd (which both are >> stalling in our case)? I'm still looking for the reason why it does not >> help in the observed stall scenarios. 
> > Your problem is that you likely have many reader which need to get out > first. That spinlock replacement will help. I'm not sure about the CFS > patch referenced in the thread here. Nope, we only have two readers, one which is scheduled out by CFS and another one - in soft IRQ context - that is getting stuck after the writer promoted the held lock to a write lock. > > That boosting requires a RCU reader that starts the mechanism (on rcu > unlock). But I don't think that it will help. You would also need to > raise the priority above to the writer level (manually) and that will > likely break other things. It is meant to unstuck SCHED_OTHER tasks and > not boost stuck reader as a side effect. Also I am not sure how that > works with multiple tasks. Ok, that is likely why we don't see it coming in to help us out. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 14:50 ` K Prateek Nayak 2025-04-14 15:05 ` Sebastian Andrzej Siewior @ 2025-04-14 16:21 ` K Prateek Nayak 1 sibling, 0 replies; 18+ messages in thread From: K Prateek Nayak @ 2025-04-14 16:21 UTC (permalink / raw) To: Jan Kiszka, Aaron Lu Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 4/14/2025 8:20 PM, K Prateek Nayak wrote: >> >> BTW, does anyone already have a test case that produces the lockup issue >> with one or two simple programs and some hectic CFS bandwidth settings? > > This is your cue to grab a brown paper bag since what I'm about to paste > below is probably lifetime without parole in the RT land but I believe > it gets close to the scenario described by Valentin: > > (Based on v6.15-rc1; I haven't yet tested this with Aaron's series yet) I tried this with Aaron's series [1] and I did not run into any rcu stalls yet. 
Following are the dmesg logs: [ 122.853909] sched_cfs_period_timer: Started on CPU248 [ 122.853912] sched_cfs_period_timer: Finished on CPU248 [ 123.726232] dumb_ways_to_die: Started on CPU248 with 50000000 loops [ 123.726574] dumb_ways_to_die: Queuing timer on CPU248 [ 123.726577] dumb_ways_to_die: Waking up RT kthread on CPU248 [ 125.768969] RT kthread: Started on CPU248 [ 125.769050] deadlock_timer: Started on CPU248 # Fair task runs, drops rwlock, is preempted [ 126.666709] RT kthread: Finished on CPU248 # RT kthread finishes [ 126.666737] deadlock_timer: Finished on CPU248 # ktimerd function finishes and unblocks replenish [ 126.666740] sched_cfs_period_timer: Started on CPU248 [ 126.666741] sched_cfs_period_timer: Finished on CPU248 # cfs task runs prctl() to completion and is throttled [ 126.666762] dumb_ways_to_die: Finished on CPU248 with 50000000 loops # cfs_bandwidth continues to catch up on slack accumulated [ 126.851820] sched_cfs_period_timer: Started on CPU248 [ 126.851825] sched_cfs_period_timer: Finished on CPU248 [1] https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/ -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 6:41 ` Jan Kiszka 2025-04-09 9:29 ` K Prateek Nayak @ 2025-04-09 13:21 ` Sebastian Andrzej Siewior 2025-04-09 13:41 ` Jan Kiszka 1 sibling, 1 reply; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-09 13:21 UTC (permalink / raw) To: Jan Kiszka Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-09 08:41:44 [+0200], Jan Kiszka wrote: > We are hunting for quite some time sporadic lock-ups or RT systems, > first only in the field (sigh), now finally also in the lab. Those have > a fairly high overlap with what was described here. Our baselines so > far: 6.1-rt, Debian and vanilla. We are currently preparing experiments > with latest mainline. > > While this thread remained silent afterwards, we have found [1][2][3] as > apparently related. But this means we are still with this RT bug, even > in latest 6.15-rc1? Not sure the commits are related. The problem here is that RW locks are not really real-time friendly. Frederic had a simple fix for it https://lore.kernel.org/all/20210825132754.GA895675@lothringen/ but yeah. The alternative, which I didn't look into, would be to replace the reader side with RCU so we would just have the writer lock. That said, we have the RW lock because of performance… > Jan > > [1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@redhat.com/ > [2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/ > [3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/ > Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
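A rough userspace sketch of the shape Sebastian floats here: readers lockless, writers serialized. The rcu_* stubs are stand-ins (real kernel code would use rcu_read_lock()/rcu_read_unlock() and publish updates with RCU-safe stores), so treat this as a diagram rather than an implementation:

```c
/*
 * Sketch of the proposal: drop the rwlock, let readers run lockless
 * under RCU, and keep only a writer-side lock. The rcu_* stubs below
 * are no-op stand-ins for the real RCU primitives.
 */
#include <pthread.h>

static pthread_mutex_t ep_writer_lock = PTHREAD_MUTEX_INITIALIZER;
static int ready_events; /* stand-in for the state the rwlock guards */

static void rcu_read_lock_stub(void) { }   /* rcu_read_lock() in-kernel */
static void rcu_read_unlock_stub(void) { } /* rcu_read_unlock() in-kernel */

/* Reader path: never blocks, so it cannot take part in the cycle. */
static int reader_sees(void)
{
	rcu_read_lock_stub();
	int seen = ready_events;
	rcu_read_unlock_stub();
	return seen;
}

/* Writer path: writers still serialize against each other. */
static void writer_add_event(void)
{
	pthread_mutex_lock(&ep_writer_lock);
	ready_events++;
	pthread_mutex_unlock(&ep_writer_lock);
}
```

The point of the shape is that a throttled reader can no longer hold up a writer, at the cost of readers possibly observing slightly stale state.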
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:21 ` Sebastian Andrzej Siewior @ 2025-04-09 13:41 ` Jan Kiszka 2025-04-09 13:52 ` Jan Kiszka 0 siblings, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-09 13:41 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 09.04.25 15:21, Sebastian Andrzej Siewior wrote: > On 2025-04-09 08:41:44 [+0200], Jan Kiszka wrote: >> We are hunting for quite some time sporadic lock-ups or RT systems, >> first only in the field (sigh), now finally also in the lab. Those have >> a fairly high overlap with what was described here. Our baselines so >> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments >> with latest mainline. >> >> While this thread remained silent afterwards, we have found [1][2][3] as >> apparently related. But this means we are still with this RT bug, even >> in latest 6.15-rc1? > > Not sure the commits are related. The problem here is that RW locks are > not really real time friendly. Frederick had a simple fix to it > https://lore.kernel.org/all/20210825132754.GA895675@lothringen/ > > but yeah. The alternative, which I didn't look into, would be to replace > the reader side with RCU so we would just have the writer lock. That > mean we need to RW lock because of performance… > We know that epoll is not a good idea for RT programs. However, our problem is that already non-RT programs manage to lock up an RT-enabled system. We are currently collecting more data to show what we are seeing, plus will try out the latest patches on the latest kernels. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:41 ` Jan Kiszka @ 2025-04-09 13:52 ` Jan Kiszka 2025-04-09 13:57 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-09 13:52 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 09.04.25 15:41, Jan Kiszka wrote: > On 09.04.25 15:21, Sebastian Andrzej Siewior wrote: >> On 2025-04-09 08:41:44 [+0200], Jan Kiszka wrote: >>> We are hunting for quite some time sporadic lock-ups or RT systems, >>> first only in the field (sigh), now finally also in the lab. Those have >>> a fairly high overlap with what was described here. Our baselines so >>> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments >>> with latest mainline. >>> >>> While this thread remained silent afterwards, we have found [1][2][3] as >>> apparently related. But this means we are still with this RT bug, even >>> in latest 6.15-rc1? >> >> Not sure the commits are related. The problem here is that RW locks are >> not really real time friendly. Frederick had a simple fix to it >> https://lore.kernel.org/all/20210825132754.GA895675@lothringen/ >> >> but yeah. The alternative, which I didn't look into, would be to replace >> the reader side with RCU so we would just have the writer lock. That >> mean we need to RW lock because of performance… >> > > We know that epoll is not a good idea for RT programs. However, our > problem is that already non-RT programs manage to lock up an RT-enabled > system. On second glance, Frederic's patch would probably also avoid the issue we are seeing as it should bring PI to the CFS-throttled read-lock holder (which is now a write-lock holder). 
But given how old that proposal is, I assume the performance impact was too much even for the RT kernel, wasn't it? Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:52 ` Jan Kiszka @ 2025-04-09 13:57 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-09 13:57 UTC (permalink / raw) To: Jan Kiszka Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-09 15:52:19 [+0200], Jan Kiszka wrote: > On second glance, Frederic's patch would probably also avoid the issue > we are seeing as it should bring PI to the CFS-throttled read-lock > holder (which is now a write-lock holder). > > But given how old that proposal is, I assume the performance impact was > even for the RT kernel too much, wasn't it? I don't remember any numbers, nor how bad things would get, just the fear of it. But I guess this is workload related, in terms of how much the RW lock improves the situation. RT will then multiply the CPU and reader resources to the point where it does not scale and presents you with the lockups. > Jan Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread