* [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Valentin Schneider @ 2023-10-12 15:07 UTC (permalink / raw)
To: linux-rt-users, linux-kernel
Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Juri Lelli,
Clark Williams, Luis Claudio R. Goncalves
Hi folks,
We've had reports of stalls happening on our v6.0-ish frankenkernels, and while
we haven't been able to come up with a reproducer (yet), I don't see anything
upstream that would prevent them from happening.
The setup involves eventpoll, CFS bandwidth controller and timer
expiry, and the sequence looks as follows (time-ordered):
p_read (on CPUn, CFS with bandwidth controller active)
======
ep_poll_callback()
read_lock_irqsave()
...
try_to_wake_up() <- enqueue causes an update_curr() + sets need_resched
due to having no more runtime
preempt_enable()
preempt_schedule() <- switch out due to p_read being now throttled
p_write
=======
ep_poll()
write_lock_irq() <- blocks due to having active readers (p_read)
ktimers/n
=========
timerfd_tmrproc()
`\
ep_poll_callback()
`\
read_lock_irqsave() <- blocks due to having active writer (p_write)
From this point we have a circular dependency:
p_read -> ktimers/n (to replenish runtime of p_read)
ktimers/n -> p_write (to let ktimers/n acquire the readlock)
p_write -> p_read (to let p_write acquire the writelock)
IIUC reverting
286deb7ec03d ("locking/rwbase: Mitigate indefinite writer starvation")
should unblock this as the ktimers/n thread wouldn't block, but then we're back
to having the indefinite starvation so I wouldn't necessarily call this a win.
Two options I'm seeing:
- Prevent p_read from being preempted when it's doing the wakeups under the
readlock (icky)
- Prevent ktimers / ksoftirqd (*) from running the wakeups that have
ep_poll_callback() as a wait_queue_entry callback. Punting that to e.g. a
kworker /should/ do.
(*) It's not just timerfd, I've also seen it via net::sock_def_readable -
it should be anything that's pollable.
I'm still scratching my head on this, so any suggestions/comments welcome!
Cheers,
Valentin
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Jan Kiszka @ 2025-04-09  6:41 UTC (permalink / raw)
To: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior,
	linux-kernel, kprateek.nayak
Cc: Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

On 12.10.23 17:07, Valentin Schneider wrote:
> Hi folks,
>
> We've had reports of stalls happening on our v6.0-ish frankenkernels, and while
> we haven't been able to come up with a reproducer (yet), I don't see anything
> upstream that would prevent them from happening.
>
> [...]
>
> I'm still scratching my head on this, so any suggestions/comments welcome!
>

We have been hunting sporadic lock-ups on RT systems for quite some time,
first only in the field (sigh), now finally also in the lab. Those have
a fairly high overlap with what was described here. Our baselines so
far: 6.1-rt, Debian and vanilla. We are currently preparing experiments
with the latest mainline.

While this thread remained silent afterwards, we have found [1][2][3] as
apparently related. But does this mean we are still stuck with this RT
bug, even in the latest 6.15-rc1?

Jan

[1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@redhat.com/
[2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/
[3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/

-- 
Siemens AG, Foundational Technologies
Linux Expert Center
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: K Prateek Nayak @ 2025-04-09  9:29 UTC (permalink / raw)
To: Jan Kiszka, Valentin Schneider, linux-rt-users,
	Sebastian Andrzej Siewior, linux-kernel, Aaron Lu
Cc: Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

(+ Aaron)

Hello Jan,

On 4/9/2025 12:11 PM, Jan Kiszka wrote:
> On 12.10.23 17:07, Valentin Schneider wrote:
>> [...]
>
> We have been hunting sporadic lock-ups on RT systems for quite some time,
> first only in the field (sigh), now finally also in the lab. Those have
> a fairly high overlap with what was described here. Our baselines so
> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments
> with the latest mainline.

Do the backtraces from these lockups show tasks (specifically ktimerd)
waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling
is the reason for the long delay / circular dependency. Is cfs bandwidth
throttling in use on the systems that run into these lockups?
Otherwise, your issue might be completely different.

>
> While this thread remained silent afterwards, we have found [1][2][3] as
> apparently related. But does this mean we are still stuck with this RT
> bug, even in the latest 6.15-rc1?

I'm pretty sure a bunch of locking related stuff has been reworked to
accommodate PREEMPT_RT since v6.1. Many rwsem based locking patterns
have been replaced with alternatives like RCU. The recently introduced
dl_server infrastructure also helps prevent starvation of fair tasks,
which can allow progress and prevent lockups. I would recommend
checking whether the most recent -rt release can still reproduce your
issue:
https://lore.kernel.org/lkml/20250331095610.ulLtPP2C@linutronix.de/

Note: Aaron Lu is working on Valentin's approach of deferring cfs
throttling to the exit-to-user-mode boundary:
https://lore.kernel.org/lkml/20250313072030.1032893-1-ziqianlu@bytedance.com/

If you still run into lockups / long latencies on the latest -rt
release and your system is using cfs bandwidth controls, you can
try running with Valentin's or Aaron's series to check whether
throttle deferral helps your scenario.

>
> [1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@redhat.com/
> [2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/
> [3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/

I'm mostly testing and reviewing Aaron's series now, since per-task
throttling seems to be the way forward based on discussions in the
community.

-- 
Thanks and Regards,
Prateek
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Aaron Lu @ 2025-04-09 12:13 UTC (permalink / raw)
To: K Prateek Nayak, Jan Kiszka
Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior,
	linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

On Wed, Apr 09, 2025 at 02:59:18PM +0530, K Prateek Nayak wrote:
> (+ Aaron)

Thank you Prateek for bringing me in.

> [...]
>
> Do the backtraces from these lockups show tasks (specifically ktimerd)
> waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling
> is the reason for the long delay / circular dependency. Is cfs bandwidth
> throttling in use on the systems that run into these lockups?
> Otherwise, your issue might be completely different.

Agree.

> [...]
>
> If you still run into lockups / long latencies on the latest -rt
> release and your system is using cfs bandwidth controls, you can
> try running with Valentin's or Aaron's series to check whether
> throttle deferral helps your scenario.

I just sent out v2 :-)
https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/

Hi Jan,

If you want to give it a try, please try v2.

Thanks.
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller
From: Jan Kiszka @ 2025-04-09 13:44 UTC (permalink / raw)
To: Aaron Lu, K Prateek Nayak
Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior,
	linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams,
	Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer,
	Florian Bezdeka

On 09.04.25 14:13, Aaron Lu wrote:
> [...]
>
> I just sent out v2 :-)
> https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/
>
> Hi Jan,
>
> If you want to give it a try, please try v2.

Thanks, we are updating our setup right now.

BTW, does anyone already have a test case that reproduces the lockup
with one or two simple programs and some hectic CFS bandwidth settings?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:44 ` Jan Kiszka @ 2025-04-14 14:50 ` K Prateek Nayak 2025-04-14 15:05 ` Sebastian Andrzej Siewior 2025-04-14 16:21 ` K Prateek Nayak 0 siblings, 2 replies; 18+ messages in thread From: K Prateek Nayak @ 2025-04-14 14:50 UTC (permalink / raw) To: Jan Kiszka, Aaron Lu Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka Hello Jan, On 4/9/2025 7:14 PM, Jan Kiszka wrote: > On 09.04.25 14:13, Aaron Lu wrote: >> On Wed, Apr 09, 2025 at 02:59:18PM +0530, K Prateek Nayak wrote: >>> (+ Aaron) >> >> Thank you Prateek for bring me in. >> >>> Hello Jan, >>> >>> On 4/9/2025 12:11 PM, Jan Kiszka wrote: >>>> On 12.10.23 17:07, Valentin Schneider wrote: >>>>> Hi folks, >>>>> >>>>> We've had reports of stalls happening on our v6.0-ish frankenkernels, and while >>>>> we haven't been able to come out with a reproducer (yet), I don't see anything >>>>> upstream that would prevent them from happening. >>>>> >>>>> The setup involves eventpoll, CFS bandwidth controller and timer >>>>> expiry, and the sequence looks as follows (time-ordered): >>>>> >>>>> p_read (on CPUn, CFS with bandwidth controller active) >>>>> ====== >>>>> >>>>> ep_poll_callback() >>>>> read_lock_irqsave() >>>>> ... 
>>>>> try_to_wake_up() <- enqueue causes an update_curr() + sets need_resched >>>>> due to having no more runtime >>>>> preempt_enable() >>>>> preempt_schedule() <- switch out due to p_read being now throttled >>>>> >>>>> p_write >>>>> ======= >>>>> >>>>> ep_poll() >>>>> write_lock_irq() <- blocks due to having active readers (p_read) >>>>> >>>>> ktimers/n >>>>> ========= >>>>> >>>>> timerfd_tmrproc() >>>>> `\ >>>>> ep_poll_callback() >>>>> `\ >>>>> read_lock_irqsave() <- blocks due to having active writer (p_write) >>>>> >>>>> >>>>> From this point we have a circular dependency: >>>>> >>>>> p_read -> ktimers/n (to replenish runtime of p_read) >>>>> ktimers/n -> p_write (to let ktimers/n acquire the readlock) >>>>> p_write -> p_read (to let p_write acquire the writelock) >>>>> >>>>> IIUC reverting >>>>> 286deb7ec03d ("locking/rwbase: Mitigate indefinite writer starvation") >>>>> should unblock this as the ktimers/n thread wouldn't block, but then we're back >>>>> to having the indefinite starvation so I wouldn't necessarily call this a win. >>>>> >>>>> Two options I'm seeing: >>>>> - Prevent p_read from being preempted when it's doing the wakeups under the >>>>> readlock (icky) >>>>> - Prevent ktimers / ksoftirqd (*) from running the wakeups that have >>>>> ep_poll_callback() as a wait_queue_entry callback. Punting that to e.g. a >>>>> kworker /should/ do. >>>>> >>>>> (*) It's not just timerfd, I've also seen it via net::sock_def_readable - >>>>> it should be anything that's pollable. >>>>> >>>>> I'm still scratching my head on this, so any suggestions/comments welcome! >>>>> >>>> >>>> We are hunting for quite some time sporadic lock-ups or RT systems, >>>> first only in the field (sigh), now finally also in the lab. Those have >>>> a fairly high overlap with what was described here. Our baselines so >>>> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments >>>> with latest mainline. 
>>> >>> Do the backtrace from these lockups show tasks (specifically ktimerd) >>> waiting on a rwsem? Throttle deferral helps if cfs bandwidth throttling >>> becomes the reason for long delay / circular dependency. Is cfs bandwidth >>> throttling being used on these systems that run into these lockups? >>> Otherwise, your issue might be completely different. >> >> Agree. >> >>>> >>>> While this thread remained silent afterwards, we have found [1][2][3] as >>>> apparently related. But this means we are still with this RT bug, even >>>> in latest 6.15-rc1? >>> >>> I'm pretty sure a bunch of locking related stuff has been reworked to >>> accommodate PREEMPT_RT since v6.1. Many rwsem based locking patterns >>> have been replaced with alternatives like RCU. Recently introduced >>> dl_server infrastructure also helps prevent starvation of fair tasks >>> which can allow progress and prevent lockups. I would recommend >>> checking if the most recent -rt release can still reproduce your >>> issue: >>> https://lore.kernel.org/lkml/20250331095610.ulLtPP2C@linutronix.de/ >>> >>> Note: Aaron Lu is working on Valentin's approach of deferring cfs >>> throttling to exit to user mode boundary >>> https://lore.kernel.org/lkml/20250313072030.1032893-1-ziqianlu@bytedance.com/ >>> >>> If you still run into the issue of a lockup / long latencies on latest >>> -rt release and your system is using cfs bandwidth controls, you can >>> perhaps try running with Valentin's or Aaron's series to check if >>> throttle deferral helps your scenario. >> >> I just sent out v2 :-) >> https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/ >> >> Hi Jan, >> >> If you want to give it a try, please try v2. >> > > Thanks, we are updating our setup right now. > > BTW, does anyone already have a test case that produces the lockup issue > with one or two simple programs and some hectic CFS bandwidth settings? 
This is your cue to grab a brown paper bag since what I'm about to paste below is probably lifetime without parole in the RT land but I believe it gets close to the scenario described by Valentin: (Based on v6.15-rc1; I haven't yet tested this with Aaron's series yet) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e43993a4e580..7ed0a4923ca2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6497,6 +6497,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) int count = 0; raw_spin_lock_irqsave(&cfs_b->lock, flags); + pr_crit("sched_cfs_period_timer: Started on CPU%d\n", smp_processor_id()); for (;;) { overrun = hrtimer_forward_now(timer, cfs_b->period); if (!overrun) @@ -6537,6 +6538,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) } if (idle) cfs_b->period_active = 0; + pr_crit("sched_cfs_period_timer: Finished on CPU%d\n", smp_processor_id()); raw_spin_unlock_irqrestore(&cfs_b->lock, flags); return idle ? HRTIMER_NORESTART : HRTIMER_RESTART; diff --git a/kernel/sys.c b/kernel/sys.c index c434968e9f5d..d68b05963b88 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2470,6 +2470,79 @@ static int prctl_get_auxv(void __user *addr, unsigned long len) return sizeof(mm->saved_auxv); } +/* These variables will be used in dumb ways. */ +raw_spinlock_t dwdt_spin_lock; +struct hrtimer dwtd_timer; +DEFINE_RWLOCK(dwdt_lock); + +/* Should send ktimerd into a deadlock */ +static enum hrtimer_restart deadlock_timer(struct hrtimer *timer) +{ + pr_crit("deadlock_timer: Started on CPU%d\n", smp_processor_id()); + /* Should hit rtlock slowpath after kthread writer. */ + read_lock(&dwdt_lock); + read_unlock(&dwdt_lock); + pr_crit("deadlock_timer: Finished on CPU%d\n", smp_processor_id()); + return HRTIMER_NORESTART; +} + +/* kthread function to preempt fair thread and block on write lock. 
*/ +static int grab_dumb_lock(void *data) +{ + pr_crit("RT kthread: Started on CPU%d\n", smp_processor_id()); + write_lock_irq(&dwdt_lock); + write_unlock_irq(&dwdt_lock); + pr_crit("RT kthread: Finished on CPU%d\n", smp_processor_id()); + + return 0; +} + +/* Try to send ktimerd into a deadlock. */ +static void dumb_ways_to_die(unsigned long loops) +{ + struct task_struct *kt; + unsigned long i; + int cpu; + + migrate_disable(); + + cpu = smp_processor_id(); + pr_crit("dumb_ways_to_die: Started on CPU%d with %lu loops\n", cpu, loops); + + raw_spin_lock_init(&dwdt_spin_lock); + hrtimer_setup(&dwtd_timer, deadlock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); + kt = kthread_create_on_cpu(&grab_dumb_lock, NULL, cpu, "dumb-thread"); + + read_lock_irq(&dwdt_lock); + + /* Dummy lock; Disables preemption. */ + raw_spin_lock(&dwdt_spin_lock); + + pr_crit("dumb_ways_to_die: Queuing timer on CPU%d\n", cpu); + /* Start a timer that will run before the bandwidth timer. */ + hrtimer_forward_now(&dwtd_timer, ns_to_ktime(10000)); + hrtimer_start_expires(&dwtd_timer, HRTIMER_MODE_ABS_PINNED); + + pr_crit("dumb_ways_to_die: Waking up RT kthread on CPU%d\n", cpu); + sched_set_fifo(kt); /* Create a high priority thread. */ + wake_up_process(kt); + + /* Exhaust bandwidth of caller */ + for (i = 0; i < loops; ++i) + cpu_relax(); + + /* Enable preemption; kt should preempt now. */ + raw_spin_unlock(&dwdt_spin_lock); + + /* Waste time just in case RT task has not preempted us. (very unlikely!) 
*/ + for (i = 0; i < loops; ++i) + cpu_relax(); + + read_unlock_irq(&dwdt_lock); + pr_crit("dumb_ways_to_die: Finished on CPU%d with %lu loops\n", cpu, loops); + migrate_enable(); +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2483,6 +2556,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = 0; switch (option) { + case 666: + dumb_ways_to_die(arg2); + break; case PR_SET_PDEATHSIG: if (!valid_signal(arg2)) { error = -EINVAL; -- The above adds a prctl() to trigger the scenario whose flow I've described with some inline comments. Patterns like above is a crime but I've done it in the name of science. Steps to reproduce: # mkdir /sys/fs/cgroup/CG0 # echo $$ > /sys/fs/cgroup/CG0/cgroup.procs # echo "500000 1000000" > /sys/fs/cgroup/CG0/cpu.max # dmesg | tail -n 2 # Find the CPU where bandwidth timer is running [ 175.919325] sched_cfs_period_timer: Started on CPU214 [ 175.919330] sched_cfs_period_timer: Finished on CPU214 # taskset -c 214 perl -e 'syscall 157,666,50000000' # Pin perl to same CPU, 50M loops Note: You have to pin the perl command to the same CPU as the timer for it to run into stalls. It may take a couple of attempts. Also please adjust the number of loops of cpu_relax() based on your setup. In my case, 50M loops runs long enough to exhaust the cfs bandwidth. 
With this I see: sched_cfs_period_timer: Started on CPU214 sched_cfs_period_timer: Finished on CPU214 dumb_ways_to_die: Started on CPU214 with 50000000 loops dumb_ways_to_die: Queuing timer on CPU214 dumb_ways_to_die: Waking up RT kthread on CPU214 RT kthread: Started on CPU214 deadlock_timer: Started on CPU214 rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: Tasks blocked on level-1 rcu_node (CPUs 208-223): P1975/3:b..l rcu: (detected by 124, t=15002 jiffies, g=3201, q=138 ncpus=256) task:ktimers/214 state:D stack:0 pid:1975 tgid:1975 ppid:2 task_flags:0x4208040 flags:0x00004000 Call Trace: <TASK> __schedule+0x401/0x15a0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? update_rq_clock+0x7c/0x120 ? srso_alias_return_thunk+0x5/0xfbef5 ? rt_mutex_setprio+0x1c2/0x480 schedule_rtlock+0x1e/0x40 rtlock_slowlock_locked+0x20e/0xc60 rt_read_lock+0x8f/0x190 ? __pfx_deadlock_timer+0x10/0x10 deadlock_timer+0x28/0x50 __hrtimer_run_queues+0xfd/0x2e0 hrtimer_run_softirq+0x9d/0xf0 handle_softirqs.constprop.0+0xc1/0x2a0 ? __pfx_smpboot_thread_fn+0x10/0x10 run_ktimerd+0x3e/0x80 smpboot_thread_fn+0xf3/0x220 kthread+0xff/0x210 ? rt_spin_lock+0x3c/0xc0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> I get rcub stall messages after a while (from a separate trace): INFO: task rcub/4:462 blocked for more than 120 seconds. Not tainted 6.15.0-rc1-test+ #743 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:rcub/4 state:D stack:0 pid:462 tgid:462 ppid:2 task_flags:0x208040 flags:0x00004000 Call Trace: <TASK> __schedule+0x401/0x15a0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? rt_mutex_adjust_prio_chain+0xa5/0x7e0 rt_mutex_schedule+0x20/0x40 rt_mutex_slowlock_block.constprop.0+0x42/0x1e0 __rt_mutex_slowlock_locked.constprop.0+0xa7/0x210 rt_mutex_slowlock.constprop.0+0x4e/0xc0 rcu_boost_kthread+0xe3/0x320 ? 
__pfx_rcu_boost_kthread+0x10/0x10 kthread+0xff/0x210 ? rt_spin_lock+0x3c/0xc0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> I left the program on for a good 5 minutes and it did not budge after the splat. Note: I could not reproduce the splat with a !PREEMPT_RT kernel (CONFIG_PREEMPT=y) or with small loop counts that don't exhaust the cfs bandwidth. > > Jan > -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 18+ messages in thread
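For completeness, the perl one-liner from the reproducer can be expressed as a small C helper. This is just a sketch around the debug prctl() above; option 666 is an assumption that only holds with the debug patch applied, and a stock kernel rejects the unknown option with EINVAL:

```c
/*
 * Hypothetical userspace trigger, equivalent to the perl one-liner in the
 * reproducer above. prctl option 666 only exists with the debug patch;
 * on a stock kernel the call simply fails with EINVAL.
 */
#include <errno.h>
#include <sys/prctl.h>

/* Returns 0 if the debug prctl() was accepted, else the errno it produced. */
static int trigger_dumb_ways(unsigned long loops)
{
	if (prctl(666, loops, 0, 0, 0) != 0)
		return errno;
	return 0;
}
```

Pinning a binary built around this helper with taskset -c <timer CPU> reproduces the same flow as the perl invocation.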
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 14:50 ` K Prateek Nayak @ 2025-04-14 15:05 ` Sebastian Andrzej Siewior 2025-04-14 15:18 ` K Prateek Nayak 2025-04-15 5:35 ` Jan Kiszka 2025-04-14 16:21 ` K Prateek Nayak 1 sibling, 2 replies; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-14 15:05 UTC (permalink / raw) To: K Prateek Nayak Cc: Jan Kiszka, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-14 20:20:04 [+0530], K Prateek Nayak wrote: > Note: I could not reproduce the splat with !PREEMPT_RT kernel > (CONFIG_PREEMPT=y) or with small loops counts that don't exhaust the > cfs bandwidth. Not sure what this has to do with anything. On !RT the read_lock() in the timer can be acquired even with a pending writer. The writer keeps spinning until the main thread is gone. There should be no RCU boosting but the RCU still is there, too. On RT the read_lock() in the timer blocks, and the write blocks, too. So every blocker on the lock is scheduled out until the reader is gone. On top of that, the reader gets RCU boosted with FIFO-1 by default to get out. Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 15:05 ` Sebastian Andrzej Siewior @ 2025-04-14 15:18 ` K Prateek Nayak 0 siblings, 0 replies; 18+ messages in thread From: K Prateek Nayak @ 2025-04-14 15:18 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Jan Kiszka, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka Hello Sebastian, On 4/14/2025 8:35 PM, Sebastian Andrzej Siewior wrote: > On 2025-04-14 20:20:04 [+0530], K Prateek Nayak wrote: >> Note: I could not reproduce the splat with !PREEMPT_RT kernel >> (CONFIG_PREEMPT=y) or with small loops counts that don't exhaust the >> cfs bandwidth. > > Not sure what this has to do with anything. Let me clarify a bit more: - Fair task with cfs_bandwidth limits triggers the prctl(666, 50000000) - The prctl() takes a read_lock_irq(), except on PREEMPT_RT this does not disable interrupts. - I take a dummy lock to hold off preemption - Within the read_lock critical section, I queue a timer that takes the read_lock. - I also wake up a high-priority RT task that takes the write_lock As soon as I drop the dummy raw_spin_lock: - High priority RT task runs, tries to take the write_lock but cannot since the preempted fair task still holds the read end. - Next, ktimerd runs trying to grab the read_lock() but is put in the slowpath since the RT task has tried to take the write_lock - The fair task runs out of bandwidth and is preempted, but this requires ktimerd to run the replenish function, which is queued behind the already blocked timer function trying to grab the read_lock() Isn't this the scenario that Valentin's original summary describes? If I've got something wrong, please do correct me. > On !RT the read_lock() in the timer can be acquired even with a pending > writer.
The writer keeps spinning until the main thread is gone. There > should be no RCU boosting but the RCU still is there, too. On !RT, the read_lock_irq() in the fair task will not be preempted in the first place, so progress is guaranteed that way, right? > > On RT the read_lock() in the timer block, the write blocks, too. So > every blocker on the lock is scheduled out until the reader is gone. On > top of that, the reader gets RCU boosted with FIFO-1 by default to get > out. Except there is a circular dependency now: - fair task needs bandwidth replenishment to progress and drop the lock. - rt task needs the fair task to drop the lock and grab the write end. - ktimerd requires the rt task to grab and drop the lock to make progress. I'm fairly new to the PREEMPT_RT bits, so if I've missed something, please do let me know and sorry for any noise. > > Sebastian -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 18+ messages in thread
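The three-way wait described above can be sanity-checked with a toy wait-for graph. This is purely an illustrative model, with task names and edges taken from the list above; it is not kernel code:

```c
/* Toy wait-for graph model of the cycle described in this thread:
 * p_read -> ktimers (needs its bandwidth replenished),
 * ktimers -> p_write (reader forced into the slowpath behind the writer),
 * p_write -> p_read (writer blocked on the active reader).
 * A cycle in this graph means none of the three can make progress. */
#include <stdbool.h>

enum task { P_READ, KTIMERS, P_WRITE, NTASKS };

/* waits_on[a] == b means task a cannot run until task b does */
static const int waits_on[NTASKS] = {
	[P_READ]  = KTIMERS,  /* throttled, needs the replenish timer */
	[KTIMERS] = P_WRITE,  /* read_lock() slowpath behind the writer */
	[P_WRITE] = P_READ,   /* write_lock() blocked on the reader */
};

/* Walk the chain from 'start'; true if it loops back (deadlock). */
static bool in_cycle(int start)
{
	int cur = waits_on[start];

	for (int hops = 0; hops < NTASKS; hops++) {
		if (cur == start)
			return true;
		cur = waits_on[cur];
	}
	return false;
}
```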
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 15:05 ` Sebastian Andrzej Siewior 2025-04-14 15:18 ` K Prateek Nayak @ 2025-04-15 5:35 ` Jan Kiszka 2025-04-15 6:23 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-15 5:35 UTC (permalink / raw) To: Sebastian Andrzej Siewior, K Prateek Nayak Cc: Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 14.04.25 17:05, Sebastian Andrzej Siewior wrote: > On 2025-04-14 20:20:04 [+0530], K Prateek Nayak wrote: >> Note: I could not reproduce the splat with !PREEMPT_RT kernel >> (CONFIG_PREEMPT=y) or with small loops counts that don't exhaust the >> cfs bandwidth. > > Not sure what this has to do with anything. > On !RT the read_lock() in the timer can be acquired even with a pending > writer. The writer keeps spinning until the main thread is gone. There > should be no RCU boosting but the RCU still is there, too. > > On RT the read_lock() in the timer block, the write blocks, too. So > every blocker on the lock is scheduled out until the reader is gone. On > top of that, the reader gets RCU boosted with FIFO-1 by default to get > out. There is no boosting of the active readers on RT as there is no information recorded about who is currently holding a read lock. This is the whole point why rwlocks are hairy with RT, I thought. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 5:35 ` Jan Kiszka @ 2025-04-15 6:23 ` Sebastian Andrzej Siewior 2025-04-15 6:54 ` Jan Kiszka 0 siblings, 1 reply; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-15 6:23 UTC (permalink / raw) To: Jan Kiszka Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: > > On RT the read_lock() in the timer block, the write blocks, too. So > > every blocker on the lock is scheduled out until the reader is gone. On > > top of that, the reader gets RCU boosted with FIFO-1 by default to get > > out. > > There is no boosting of the active readers on RT as there is no > information recorded about who is currently holding a read lock. This is > the whole point why rwlocks are hairy with RT, I thought. Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with SCHED_FIFO 1. If you acquire a readlock you start an RCU section. If you get stuck in an RCU section for too long then this boosting will take effect by making the task within the RCU section the owner of the boost-lock; the boosting task will then try to acquire it. This is used to get SCHED_OTHER tasks out of the RCU section. But if a SCHED_FIFO task is on the CPU then this boosting will have no effect because the scheduler will not switch to a task with lower priority. > Jan Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
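A toy model of that last point: the FIFO-1 default comes from the text above, and the rest is just the generic "higher priority wins" rule, not actual scheduler code:

```c
/* Toy model of why FIFO-1 RCU boosting doesn't help here: the scheduler
 * only switches to the boosted reader if its priority exceeds whatever
 * currently runs on the CPU. Priorities use the usual 0..99 FIFO range,
 * with SCHED_OTHER modeled as 0. */
#include <stdbool.h>

#define RCU_BOOST_PRIO 1  /* PREEMPT_RT default: boost readers to FIFO-1 */

/* Would a reader boosted to FIFO-1 preempt the currently running task? */
static bool boost_unsticks_reader(int curr_prio)
{
	return RCU_BOOST_PRIO > curr_prio;
}
```

So boosting gets a SCHED_OTHER reader running again, but does nothing once any FIFO task of priority 1 or above occupies the CPU.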
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 6:23 ` Sebastian Andrzej Siewior @ 2025-04-15 6:54 ` Jan Kiszka 2025-04-15 8:00 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-15 6:54 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 15.04.25 08:23, Sebastian Andrzej Siewior wrote: > On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: >>> On RT the read_lock() in the timer block, the write blocks, too. So >>> every blocker on the lock is scheduled out until the reader is gone. On >>> top of that, the reader gets RCU boosted with FIFO-1 by default to get >>> out. >> >> There is no boosting of the active readers on RT as there is no >> information recorded about who is currently holding a read lock. This is >> the whole point why rwlocks are hairy with RT, I thought. > > Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with > SCHED_FIFO 1. If you acquire a readlock you start a RCU section. If you > get stuck in a RCU section for too long then this boosting will take > effect by making the task, within the RCU section, the owner of the > boost-lock and the boosting task will try to acquire it. This is used to > get SCHED_OTHER tasks out of the RCU section. > But if a SCHED_FIFO task is on the CPU then this boosting will have to > no effect because the scheduler will not switch to a task with lower > priority. Does that boosting happen to need ktimersd or ksoftirqd (which both are stalling in our case)? I'm still looking for the reason why it does not help in the observed stall scenarios. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 6:54 ` Jan Kiszka @ 2025-04-15 8:00 ` Sebastian Andrzej Siewior 2025-04-15 10:23 ` Jan Kiszka 0 siblings, 1 reply; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-15 8:00 UTC (permalink / raw) To: Jan Kiszka Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-15 08:54:01 [+0200], Jan Kiszka wrote: > On 15.04.25 08:23, Sebastian Andrzej Siewior wrote: > > On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: > >>> On RT the read_lock() in the timer block, the write blocks, too. So > >>> every blocker on the lock is scheduled out until the reader is gone. On > >>> top of that, the reader gets RCU boosted with FIFO-1 by default to get > >>> out. > >> > >> There is no boosting of the active readers on RT as there is no > >> information recorded about who is currently holding a read lock. This is > >> the whole point why rwlocks are hairy with RT, I thought. > > > > Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with > > SCHED_FIFO 1. If you acquire a readlock you start a RCU section. If you > > get stuck in a RCU section for too long then this boosting will take > > effect by making the task, within the RCU section, the owner of the > > boost-lock and the boosting task will try to acquire it. This is used to > > get SCHED_OTHER tasks out of the RCU section. > > But if a SCHED_FIFO task is on the CPU then this boosting will have to > > no effect because the scheduler will not switch to a task with lower > > priority. > > Does that boosting happen to need ktimersd or ksoftirqd (which both are > stalling in our case)? I'm still looking for the reason why it does not > help in the observed stall scenarios. Your problem is that you likely have many readers which need to get out
That spinlock replacement will help. I'm not sure about the CFS patch referenced in the thread here. That boosting requires a RCU reader that starts the mechanism (on rcu unlock). But I don't think that it will help. You would also need to raise the priority above to the writer level (manually) and that will likely break other things. It is meant to unstuck SCHED_OTHER tasks and not boost stuck reader as a side effect. Also I am not sure how that works with multiple tasks. > Jan Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-15 8:00 ` Sebastian Andrzej Siewior @ 2025-04-15 10:23 ` Jan Kiszka 0 siblings, 0 replies; 18+ messages in thread From: Jan Kiszka @ 2025-04-15 10:23 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: K Prateek Nayak, Aaron Lu, Valentin Schneider, linux-rt-users, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 15.04.25 10:00, Sebastian Andrzej Siewior wrote: > On 2025-04-15 08:54:01 [+0200], Jan Kiszka wrote: >> On 15.04.25 08:23, Sebastian Andrzej Siewior wrote: >>> On 2025-04-15 07:35:50 [+0200], Jan Kiszka wrote: >>>>> On RT the read_lock() in the timer block, the write blocks, too. So >>>>> every blocker on the lock is scheduled out until the reader is gone. On >>>>> top of that, the reader gets RCU boosted with FIFO-1 by default to get >>>>> out. >>>> >>>> There is no boosting of the active readers on RT as there is no >>>> information recorded about who is currently holding a read lock. This is >>>> the whole point why rwlocks are hairy with RT, I thought. >>> >>> Kind of, yes. PREEMPT_RT has by default RCU boosting enabled with >>> SCHED_FIFO 1. If you acquire a readlock you start a RCU section. If you >>> get stuck in a RCU section for too long then this boosting will take >>> effect by making the task, within the RCU section, the owner of the >>> boost-lock and the boosting task will try to acquire it. This is used to >>> get SCHED_OTHER tasks out of the RCU section. >>> But if a SCHED_FIFO task is on the CPU then this boosting will have to >>> no effect because the scheduler will not switch to a task with lower >>> priority. >> >> Does that boosting happen to need ktimersd or ksoftirqd (which both are >> stalling in our case)? I'm still looking for the reason why it does not >> help in the observed stall scenarios. 
> > Your problem is that you likely have many reader which need to get out > first. That spinlock replacement will help. I'm not sure about the CFS > patch referenced in the thread here. Nope, we only have two readers, one which is scheduled out by CFS and another one - in soft IRQ context - that is getting stuck after the writer promoted the held lock to a write lock. > > That boosting requires a RCU reader that starts the mechanism (on rcu > unlock). But I don't think that it will help. You would also need to > raise the priority above to the writer level (manually) and that will > likely break other things. It is meant to unstuck SCHED_OTHER tasks and > not boost stuck reader as a side effect. Also I am not sure how that > works with multiple tasks. Ok, that is likely why we don't see it coming in to help us out. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-14 14:50 ` K Prateek Nayak 2025-04-14 15:05 ` Sebastian Andrzej Siewior @ 2025-04-14 16:21 ` K Prateek Nayak 1 sibling, 0 replies; 18+ messages in thread From: K Prateek Nayak @ 2025-04-14 16:21 UTC (permalink / raw) To: Jan Kiszka, Aaron Lu Cc: Valentin Schneider, linux-rt-users, Sebastian Andrzej Siewior, linux-kernel, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 4/14/2025 8:20 PM, K Prateek Nayak wrote: >> >> BTW, does anyone already have a test case that produces the lockup issue >> with one or two simple programs and some hectic CFS bandwidth settings? > > This is your cue to grab a brown paper bag since what I'm about to paste > below is probably lifetime without parole in the RT land but I believe > it gets close to the scenario described by Valentin: > > (Based on v6.15-rc1; I haven't yet tested this with Aaron's series yet) I tried this with Aaron's series [1] and I did not run into any rcu stalls yet. 
Following are the dmesg logs: [ 122.853909] sched_cfs_period_timer: Started on CPU248 [ 122.853912] sched_cfs_period_timer: Finished on CPU248 [ 123.726232] dumb_ways_to_die: Started on CPU248 with 50000000 loops [ 123.726574] dumb_ways_to_die: Queuing timer on CPU248 [ 123.726577] dumb_ways_to_die: Waking up RT kthread on CPU248 [ 125.768969] RT kthread: Started on CPU248 [ 125.769050] deadlock_timer: Started on CPU248 # Fair task runs, drops rwlock, is preempted [ 126.666709] RT kthread: Finished on CPU248 # RT kthread finishes [ 126.666737] deadlock_timer: Finished on CPU248 # ktimerd function finishes and unblocks replenish [ 126.666740] sched_cfs_period_timer: Started on CPU248 [ 126.666741] sched_cfs_period_timer: Finished on CPU248 # cfs task runs prctl() to completion and is throttled [ 126.666762] dumb_ways_to_die: Finished on CPU248 with 50000000 loops # cfs_bandwidth continues to catch up on slack accumulated [ 126.851820] sched_cfs_period_timer: Started on CPU248 [ 126.851825] sched_cfs_period_timer: Finished on CPU248 [1] https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/ -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 6:41 ` Jan Kiszka 2025-04-09 9:29 ` K Prateek Nayak @ 2025-04-09 13:21 ` Sebastian Andrzej Siewior 2025-04-09 13:41 ` Jan Kiszka 1 sibling, 1 reply; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-09 13:21 UTC (permalink / raw) To: Jan Kiszka Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-09 08:41:44 [+0200], Jan Kiszka wrote: > We are hunting for quite some time sporadic lock-ups or RT systems, > first only in the field (sigh), now finally also in the lab. Those have > a fairly high overlap with what was described here. Our baselines so > far: 6.1-rt, Debian and vanilla. We are currently preparing experiments > with latest mainline. > > While this thread remained silent afterwards, we have found [1][2][3] as > apparently related. But this means we are still with this RT bug, even > in latest 6.15-rc1? Not sure the commits are related. The problem here is that RW locks are not really real-time friendly. Frederic had a simple fix for it https://lore.kernel.org/all/20210825132754.GA895675@lothringen/ but yeah. The alternative, which I didn't look into, would be to replace the reader side with RCU so we would just have the writer lock. That said, we have the RW lock because of performance… > Jan > > [1] https://lore.kernel.org/lkml/20231030145104.4107573-1-vschneid@redhat.com/ > [2] https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/ > [3] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/ > Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread
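A rough userspace sketch of the shape Sebastian floats here: readers lockless, writers serialized. The rcu_* stubs are stand-ins (real kernel code would use rcu_read_lock()/rcu_read_unlock() and publish updates with RCU-safe stores), so treat this as a diagram rather than an implementation:

```c
/*
 * Sketch of the proposal: drop the rwlock, let readers run lockless
 * under RCU, and keep only a writer-side lock. The rcu_* stubs below
 * are no-op stand-ins for the real RCU primitives.
 */
#include <pthread.h>

static pthread_mutex_t ep_writer_lock = PTHREAD_MUTEX_INITIALIZER;
static int ready_events; /* stand-in for the state the rwlock guards */

static void rcu_read_lock_stub(void) { }   /* rcu_read_lock() in-kernel */
static void rcu_read_unlock_stub(void) { } /* rcu_read_unlock() in-kernel */

/* Reader path: never blocks, so it cannot take part in the cycle. */
static int reader_sees(void)
{
	rcu_read_lock_stub();
	int seen = ready_events;
	rcu_read_unlock_stub();
	return seen;
}

/* Writer path: writers still serialize against each other. */
static void writer_add_event(void)
{
	pthread_mutex_lock(&ep_writer_lock);
	ready_events++;
	pthread_mutex_unlock(&ep_writer_lock);
}
```

The point of the shape is that a throttled reader can no longer hold up a writer, at the cost of readers possibly observing slightly stale state.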
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:21 ` Sebastian Andrzej Siewior @ 2025-04-09 13:41 ` Jan Kiszka 2025-04-09 13:52 ` Jan Kiszka 0 siblings, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-09 13:41 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 09.04.25 15:21, Sebastian Andrzej Siewior wrote: > On 2025-04-09 08:41:44 [+0200], Jan Kiszka wrote: >> We are hunting for quite some time sporadic lock-ups or RT systems, >> first only in the field (sigh), now finally also in the lab. Those have >> a fairly high overlap with what was described here. Our baselines so >> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments >> with latest mainline. >> >> While this thread remained silent afterwards, we have found [1][2][3] as >> apparently related. But this means we are still with this RT bug, even >> in latest 6.15-rc1? > > Not sure the commits are related. The problem here is that RW locks are > not really real time friendly. Frederick had a simple fix to it > https://lore.kernel.org/all/20210825132754.GA895675@lothringen/ > > but yeah. The alternative, which I didn't look into, would be to replace > the reader side with RCU so we would just have the writer lock. That > mean we need to RW lock because of performance… > We know that epoll is not a good idea for RT programs. However, our problem is that already non-RT programs manage to lock up an RT-enabled system. We are currently collecting more data to show what we are seeing, plus will try out the latest patches on the latest kernels. Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:41 ` Jan Kiszka @ 2025-04-09 13:52 ` Jan Kiszka 2025-04-09 13:57 ` Sebastian Andrzej Siewior 0 siblings, 1 reply; 18+ messages in thread From: Jan Kiszka @ 2025-04-09 13:52 UTC (permalink / raw) To: Sebastian Andrzej Siewior Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 09.04.25 15:41, Jan Kiszka wrote: > On 09.04.25 15:21, Sebastian Andrzej Siewior wrote: >> On 2025-04-09 08:41:44 [+0200], Jan Kiszka wrote: >>> We are hunting for quite some time sporadic lock-ups or RT systems, >>> first only in the field (sigh), now finally also in the lab. Those have >>> a fairly high overlap with what was described here. Our baselines so >>> far: 6.1-rt, Debian and vanilla. We are currently preparing experiments >>> with latest mainline. >>> >>> While this thread remained silent afterwards, we have found [1][2][3] as >>> apparently related. But this means we are still with this RT bug, even >>> in latest 6.15-rc1? >> >> Not sure the commits are related. The problem here is that RW locks are >> not really real time friendly. Frederick had a simple fix to it >> https://lore.kernel.org/all/20210825132754.GA895675@lothringen/ >> >> but yeah. The alternative, which I didn't look into, would be to replace >> the reader side with RCU so we would just have the writer lock. That >> mean we need to RW lock because of performance… >> > > We know that epoll is not a good idea for RT programs. However, our > problem is that already non-RT programs manage to lock up an RT-enabled > system. On second glance, Frederic's patch would probably also avoid the issue we are seeing as it should bring PI to the CFS-throttled read-lock holder (which is now a write-lock holder). 
But given how old that proposal is, I assume the performance impact was too much even for the RT kernel, wasn't it? Jan -- Siemens AG, Foundational Technologies Linux Expert Center ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RT BUG] Stall caused by eventpoll, rwlocks and CFS bandwidth controller 2025-04-09 13:52 ` Jan Kiszka @ 2025-04-09 13:57 ` Sebastian Andrzej Siewior 0 siblings, 0 replies; 18+ messages in thread From: Sebastian Andrzej Siewior @ 2025-04-09 13:57 UTC (permalink / raw) To: Jan Kiszka Cc: Valentin Schneider, linux-rt-users, linux-kernel, kprateek.nayak, Thomas Gleixner, Juri Lelli, Clark Williams, Luis Claudio R. Goncalves, Andreas Ziegler, Felix Moessbauer, Florian Bezdeka On 2025-04-09 15:52:19 [+0200], Jan Kiszka wrote: > On second glance, Frederic's patch would probably also avoid the issue > we are seeing as it should bring PI to the CFS-throttled read-lock > holder (which is now a write-lock holder). > > But given how old that proposal is, I assume the performance impact was > even for the RT kernel too much, wasn't it? I don't remember any numbers, nor how bad things would get, just the fear of it. But I guess this is workload related, in terms of how much the RW lock improves the situation. RT will then multiply the CPU and reader resources to the point where it does not scale and presents you with the lockups. > Jan Sebastian ^ permalink raw reply [flat|nested] 18+ messages in thread