* Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
@ 2026-03-17 13:34 Paul E. McKenney
2026-03-18 10:50 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-17 13:34 UTC (permalink / raw)
To: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng
Cc: rcu, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior
Hello!
Kumar Kartikeya Dwivedi (CCed) privately reported a bug in
my implementation of the RCU Tasks Trace API in terms of SRCU-fast.
You see, I forgot to ask what contexts call_rcu_tasks_trace() is called
from, and it turns out that it can in fact be called with the scheduler
pi/rq locks held. This results in a deadlock when SRCU-fast invokes the
scheduler in order to start the SRCU-fast grace period. So RCU needs
a fix to my fix found here:
b540c63cf6e5 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Sebastian, the PREEMPT_RT aspect is that lockdep does not complain
about acquisition of non-raw spinlocks from preemption-disabled regions
of code. This might be intentional, for example, there might be large
bodies of Linux-kernel code that frequently acquire non-raw spinlocks
from preemption-disabled regions of code, but which are never part of
PREEMPT_RT kernels. Otherwise, it might be good for lockdep to diagnose
this sort of thing.
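(To make the idea concrete, here is a toy userspace C model of the sort of wait-context check being wished for. This is not kernel code and not how lockdep is implemented; all names are invented. The idea is just: give each lock class a numeric wait type, where smaller means "stricter context", and flag acquisition of a weaker-typed lock while any stricter-typed lock is held.)

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a wait-context check (NOT kernel code; names invented).
 * Smaller wait type = stricter context: a raw_spinlock_t (0) is
 * stricter than a spinlock_t (1), which can sleep on PREEMPT_RT.
 */
enum wait_type { WAIT_RAW_SPIN = 0, WAIT_CONFIG_SPIN = 1 };

static enum wait_type held[16];
static int nheld;

/* Returns false (think "BUG: Invalid wait context") when a lock with
 * a weaker wait type is acquired inside a stricter one. */
static bool acquire(enum wait_type t)
{
	for (int i = 0; i < nheld; i++)
		if (t > held[i])
			return false;
	held[nheld++] = t;
	return true;
}

static void release(void)
{
	nheld--;
}
```

In this model, taking a spinlock_t while holding an rq/pi-style raw_spinlock_t is flagged, while the reverse nesting (raw inside non-raw) is fine, matching the nesting rule the discussion below turns on.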
Back to the actual bug, that call_srcu() now needs to tolerate being called
with scheduler rq/pi locks held...
The straightforward (but perhaps broken) way to resolve this is to make
srcu_gp_start_if_needed() defer invoking the scheduler, similar to the
way that vanilla RCU's call_rcu_core() function takes an early exit if
interrupts are disabled. Of course, vanilla RCU can rely on things like
the scheduling-clock interrupt to start any needed grace periods [1],
but SRCU will instead need to manually defer this work, perhaps using
workqueues or IRQ work.
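(As a sketch of that deferral idea: the following is a plain-C toy model, not the actual SRCU code, and every name in it is invented. The callback is always enqueued immediately under the raw lock; only the scheduler-visible grace-period kick is recorded as pending and performed later from a safe context, standing in for irq_work or a workqueue.)

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (NOT kernel code) of deferring the grace-period kick. */
static int queued_cbs;
static bool gp_started;
static bool kick_pending;

static void start_gp(void)
{
	gp_started = true;	/* stands in for waking the GP machinery */
}

/* Enqueue always succeeds; if the caller might hold rq/pi locks,
 * take an early exit like call_rcu_core() does and defer the kick. */
static void call_srcu_model(bool scheduler_unsafe)
{
	queued_cbs++;
	if (scheduler_unsafe)
		kick_pending = true;
	else
		start_gp();
}

/* Runs later from a safe context, e.g. an irq_work handler. */
static void deferred_kick(void)
{
	if (kick_pending) {
		kick_pending = false;
		start_gp();
	}
}
```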
In addition, rcutorture needs to be upgraded to sometimes invoke
->call() with the scheduler pi lock held, but this change is not fixing
a regression, so could be deferred. (There is already code in rcutorture
that invokes the readers while holding a scheduler pi lock.)
Given that RCU for this week through the end of March belongs to you guys,
if one of you can get this done by end of day Thursday, London time,
very good! Otherwise, I can put something together.
Please let me know!
Thanx, Paul [2]
[1] The exceptions to this rule being handled by the call to
invoke_rcu_core() when rcu_is_watching() returns false.
[2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
handlers? Or should there be a call_rcu_nmi() for this purpose?
Or should we continue to have its callers check in_nmi() when needed?
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
From: Sebastian Andrzej Siewior @ 2026-03-18 10:50 UTC (permalink / raw)
To: Paul E. McKenney
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On 2026-03-17 06:34:26 [-0700], Paul E. McKenney wrote:
> Hello!
Hi,

> Kumar Kartikeya Dwivedi (CCed) privately reported a bug in
> my implementation of the RCU Tasks Trace API in terms of SRCU-fast.
> You see, I forgot to ask what contexts call_rcu_tasks_trace() is called
> from, and it turns out that it can in fact be called with the scheduler
> pi/rq locks held. This results in a deadlock when SRCU-fast invokes the
> scheduler in order to start the SRCU-fast grace period. So RCU needs
> a fix to my fix found here:
>
> b540c63cf6e5 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")

I can't find it. I looked in next and the rcu tree.

> Sebastian, the PREEMPT_RT aspect is that lockdep does not complain
> about acquisition of non-raw spinlocks from preemption-disabled regions
> of code. This might be intentional, for example, there might be large
> bodies of Linux-kernel code that frequently acquire non-raw spinlocks
> from preemption-disabled regions of code, but which are never part of
> PREEMPT_RT kernels. Otherwise, it might be good for lockdep to diagnose
> this sort of thing.

The point is you don't know where this preempt_disable() is coming from
on !RT. It might be part of spinlock_t, it might be explicit. We only
have the might_sleep() on PREEMPT_RT.
To catch this we would have to iterate over all held locks, compare the
expected preemption level with the current one, and account for possible
corner cases such as in-IRQ being one level higher, and so on…

However, if you hold a raw_spinlock_t (such as rq/pi) and then ask for
a spinlock_t, lockdep should respond with a

| BUG: Invalid wait context

report.

> Back to the actual bug, that call_srcu() now needs to tolerate being called
> with scheduler rq/pi locks held...

This is because it is called from sched_ext BPF callbacks?

> The straightforward (but perhaps broken) way to resolve this is to make
> srcu_gp_start_if_needed() defer invoking the scheduler, similar to the

Quick question. If srcu_gp_start_if_needed() can be invoked from a
preempt-disabled section (due to rq/pi lock) then

	spin_lock_irqsave_sdp_contention(sdp, &flags);

does not work, right?

> way that vanilla RCU's call_rcu_core() function takes an early exit if
> interrupts are disabled. Of course, vanilla RCU can rely on things like
> the scheduling-clock interrupt to start any needed grace periods [1],
> but SRCU will instead need to manually defer this work, perhaps using
> workqueues or IRQ work.
>
> In addition, rcutorture needs to be upgraded to sometimes invoke
> ->call() with the scheduler pi lock held, but this change is not fixing
> a regression, so could be deferred. (There is already code in rcutorture
> that invokes the readers while holding a scheduler pi lock.)
>
> Given that RCU for this week through the end of March belongs to you guys,
> if one of you can get this done by end of day Thursday, London time,
> very good! Otherwise, I can put something together.
>
> Please let me know!

Given that the current locking does allow it and lockdep should have
complained, I am curious if we could rule that out ;)

> Thanx, Paul [2]
>
> [1] The exceptions to this rule being handled by the call to
> invoke_rcu_core() when rcu_is_watching() returns false.
>
> [2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
> handlers? Or should there be a call_rcu_nmi() for this purpose?
> Or should we continue to have its callers check in_nmi() when needed?

Did someone ask for this?

Sebastian
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
From: Paul E. McKenney @ 2026-03-18 11:49 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 11:50:58AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-17 06:34:26 [-0700], Paul E. McKenney wrote:
> > Hello!
> Hi,
>
> > Kumar Kartikeya Dwivedi (CCed) privately reported a bug in
> > my implementation of the RCU Tasks Trace API in terms of SRCU-fast.
> > You see, I forgot to ask what contexts call_rcu_tasks_trace() is called
> > from, and it turns out that it can in fact be called with the scheduler
> > pi/rq locks held. This results in a deadlock when SRCU-fast invokes the
> > scheduler in order to start the SRCU-fast grace period. So RCU needs
> > a fix to my fix found here:
> >
> > b540c63cf6e5 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
>
> I can't find it. I looked in next and the rcu tree.

Ah, it is in my RCU tree, not in the shared one. The patch is at the
end of this email.

> > Sebastian, the PREEMPT_RT aspect is that lockdep does not complain
> > about acquisition of non-raw spinlocks from preemption-disabled regions
> > of code. This might be intentional, for example, there might be large
> > bodies of Linux-kernel code that frequently acquire non-raw spinlocks
> > from preemption-disabled regions of code, but which are never part of
> > PREEMPT_RT kernels. Otherwise, it might be good for lockdep to diagnose
> > this sort of thing.
>
> The point is you don't know where this preempt_disable() is coming from
> on !RT. It might be part of spinlock_t, it might be explicit. We only
> have the might_sleep() on PREEMPT_RT.
> To catch this we would have to iterate over all held locks, compare the
> expected preemption level with the current one, and account for possible
> corner cases such as in-IRQ being one level higher, and so on…
>
> However, if you hold a raw_spinlock_t (such as rq/pi) and then ask for
> a spinlock_t, lockdep should respond with a
>
> | BUG: Invalid wait context
>
> report.

Got it, and thank you for the explanation.

> > Back to the actual bug, that call_srcu() now needs to tolerate being called
> > with scheduler rq/pi locks held...
>
> This is because it is called from sched_ext BPF callbacks?

You got it! We are re-implementing Tasks Trace RCU in terms of
SRCU-fast, and I missed this requirement the first time around. I *did*
make readers able to deal with BPF being invoked from everywhere, so
two out of three?

> > The straightforward (but perhaps broken) way to resolve this is to make
> > srcu_gp_start_if_needed() defer invoking the scheduler, similar to the
>
> Quick question. If srcu_gp_start_if_needed() can be invoked from a
> preempt-disabled section (due to rq/pi lock) then
>
> 	spin_lock_irqsave_sdp_contention(sdp, &flags);
>
> does not work, right?

Agreed, which is why the patch at the end of this email converts this to:

	raw_spin_lock_irqsave_sdp_contention(sdp, &flags)

> > way that vanilla RCU's call_rcu_core() function takes an early exit if
> > interrupts are disabled. Of course, vanilla RCU can rely on things like
> > the scheduling-clock interrupt to start any needed grace periods [1],
> > but SRCU will instead need to manually defer this work, perhaps using
> > workqueues or IRQ work.
> >
> > In addition, rcutorture needs to be upgraded to sometimes invoke
> > ->call() with the scheduler pi lock held, but this change is not fixing
> > a regression, so could be deferred. (There is already code in rcutorture
> > that invokes the readers while holding a scheduler pi lock.)
> >
> > Given that RCU for this week through the end of March belongs to you guys,
> > if one of you can get this done by end of day Thursday, London time,
> > very good! Otherwise, I can put something together.
> >
> > Please let me know!
>
> Given that the current locking does allow it and lockdep should have
> complained, I am curious if we could rule that out ;)

It would be nice, but your point about needing to worry about spinlocks
is compelling. But couldn't lockdep scan the current task's list of held
locks and see whether only raw spinlocks are held (including when no
spinlocks of any type are held), and complain in that case? Or would
that scanning be too high of overhead? (But we need that scan anyway to
check deadlock, don't we?)

> > Thanx, Paul [2]
> >
> > [1] The exceptions to this rule being handled by the call to
> > invoke_rcu_core() when rcu_is_watching() returns false.
> >
> > [2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
> > handlers? Or should there be a call_rcu_nmi() for this purpose?
> > Or should we continue to have its callers check in_nmi() when needed?
>
> Did someone ask for this?

Yes. The BPF guys need to invoke call_srcu() from interrupts-disabled
regions of code. I am way too old and lazy to do this sort of thing
spontaneously. ;-)

							Thanx, Paul

------------------------------------------------------------------------

commit b540c63cf6e500e2f81d20b5de6ea11df8b7d22e
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Sat Mar 14 04:12:58 2026 -0700

    srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()

    Tree SRCU has used non-raw spinlocks for many years, motivated by a
    desire to avoid unnecessary real-time latency and the absence of any
    reason to use raw spinlocks.
    However, the recent use of SRCU in tracing as the underlying
    implementation of RCU Tasks Trace means that call_srcu() is invoked
    from preemption-disabled regions of code, which in turn requires that
    any locks acquired by call_srcu() or its callees must be raw spinlocks.
    This commit therefore converts SRCU's spinlocks to raw spinlocks.

    Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index ad783d4dd3677..b122c560a59cf 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -34,7 +34,7 @@ struct srcu_data {
 					/* Values: SRCU_READ_FLAVOR_.* */
 
 	/* Update-side state. */
-	spinlock_t __private lock ____cacheline_internodealigned_in_smp;
+	raw_spinlock_t __private lock ____cacheline_internodealigned_in_smp;
 	struct rcu_segcblist srcu_cblist;	/* List of callbacks.*/
 	unsigned long srcu_gp_seq_needed;	/* Furthest future GP needed. */
 	unsigned long srcu_gp_seq_needed_exp;	/* Furthest future exp GP. */
@@ -55,7 +55,7 @@ struct srcu_data {
  * Node in SRCU combining tree, similar in function to rcu_data.
  */
 struct srcu_node {
-	spinlock_t __private lock;
+	raw_spinlock_t __private lock;
 	unsigned long srcu_have_cbs[4];		/* GP seq for children having CBs, but only */
 						/* if greater than ->srcu_gp_seq. */
 	unsigned long srcu_data_have_cbs[4];	/* Which srcu_data structs have CBs for given GP? */
@@ -74,7 +74,7 @@ struct srcu_usage {
 						/* First node at each level. */
 	int srcu_size_state;			/* Small-to-big transition state. */
 	struct mutex srcu_cb_mutex;		/* Serialize CB preparation. */
-	spinlock_t __private lock;		/* Protect counters and size state. */
+	raw_spinlock_t __private lock;		/* Protect counters and size state. */
 	struct mutex srcu_gp_mutex;		/* Serialize GP work. */
 	unsigned long srcu_gp_seq;		/* Grace-period seq #. */
 	unsigned long srcu_gp_seq_needed;	/* Latest gp_seq needed.
*/ @@ -156,7 +156,7 @@ struct srcu_struct { #define __SRCU_USAGE_INIT(name) \ { \ - .lock = __SPIN_LOCK_UNLOCKED(name.lock), \ + .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ .srcu_gp_seq = SRCU_GP_SEQ_INITIAL_VAL, \ .srcu_gp_seq_needed = SRCU_GP_SEQ_INITIAL_VAL_WITH_STATE, \ .srcu_gp_seq_needed_exp = SRCU_GP_SEQ_INITIAL_VAL, \ diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 37e7a8a9e375d..fa6d30ce73d1f 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -502,6 +502,15 @@ do { \ ___locked; \ }) +#define raw_spin_trylock_irqsave_rcu_node(p, flags) \ +({ \ + bool ___locked = raw_spin_trylock_irqsave(&ACCESS_PRIVATE(p, lock), flags); \ + \ + if (___locked) \ + smp_mb__after_unlock_lock(); \ + ___locked; \ +}) + #define raw_lockdep_assert_held_rcu_node(p) \ lockdep_assert_held(&ACCESS_PRIVATE(p, lock)) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index aef8e91ad33e4..2328827f8775c 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -77,42 +77,6 @@ static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); static void process_srcu(struct work_struct *work); static void srcu_delay_timer(struct timer_list *t); -/* Wrappers for lock acquisition and release, see raw_spin_lock_rcu_node(). 
*/ -#define spin_lock_rcu_node(p) \ -do { \ - spin_lock(&ACCESS_PRIVATE(p, lock)); \ - smp_mb__after_unlock_lock(); \ -} while (0) - -#define spin_unlock_rcu_node(p) spin_unlock(&ACCESS_PRIVATE(p, lock)) - -#define spin_lock_irq_rcu_node(p) \ -do { \ - spin_lock_irq(&ACCESS_PRIVATE(p, lock)); \ - smp_mb__after_unlock_lock(); \ -} while (0) - -#define spin_unlock_irq_rcu_node(p) \ - spin_unlock_irq(&ACCESS_PRIVATE(p, lock)) - -#define spin_lock_irqsave_rcu_node(p, flags) \ -do { \ - spin_lock_irqsave(&ACCESS_PRIVATE(p, lock), flags); \ - smp_mb__after_unlock_lock(); \ -} while (0) - -#define spin_trylock_irqsave_rcu_node(p, flags) \ -({ \ - bool ___locked = spin_trylock_irqsave(&ACCESS_PRIVATE(p, lock), flags); \ - \ - if (___locked) \ - smp_mb__after_unlock_lock(); \ - ___locked; \ -}) - -#define spin_unlock_irqrestore_rcu_node(p, flags) \ - spin_unlock_irqrestore(&ACCESS_PRIVATE(p, lock), flags) \ - /* * Initialize SRCU per-CPU data. Note that statically allocated * srcu_struct structures might already have srcu_read_lock() and @@ -131,7 +95,7 @@ static void init_srcu_struct_data(struct srcu_struct *ssp) */ for_each_possible_cpu(cpu) { sdp = per_cpu_ptr(ssp->sda, cpu); - spin_lock_init(&ACCESS_PRIVATE(sdp, lock)); + raw_spin_lock_init(&ACCESS_PRIVATE(sdp, lock)); rcu_segcblist_init(&sdp->srcu_cblist); sdp->srcu_cblist_invoking = false; sdp->srcu_gp_seq_needed = ssp->srcu_sup->srcu_gp_seq; @@ -186,7 +150,7 @@ static bool init_srcu_struct_nodes(struct srcu_struct *ssp, gfp_t gfp_flags) /* Each pass through this loop initializes one srcu_node structure. 
*/ srcu_for_each_node_breadth_first(ssp, snp) { - spin_lock_init(&ACCESS_PRIVATE(snp, lock)); + raw_spin_lock_init(&ACCESS_PRIVATE(snp, lock)); BUILD_BUG_ON(ARRAY_SIZE(snp->srcu_have_cbs) != ARRAY_SIZE(snp->srcu_data_have_cbs)); for (i = 0; i < ARRAY_SIZE(snp->srcu_have_cbs); i++) { @@ -242,7 +206,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) if (!ssp->srcu_sup) return -ENOMEM; if (!is_static) - spin_lock_init(&ACCESS_PRIVATE(ssp->srcu_sup, lock)); + raw_spin_lock_init(&ACCESS_PRIVATE(ssp->srcu_sup, lock)); ssp->srcu_sup->srcu_size_state = SRCU_SIZE_SMALL; ssp->srcu_sup->node = NULL; mutex_init(&ssp->srcu_sup->srcu_cb_mutex); @@ -394,20 +358,20 @@ static void srcu_transition_to_big(struct srcu_struct *ssp) /* Double-checked locking on ->srcu_size-state. */ if (smp_load_acquire(&ssp->srcu_sup->srcu_size_state) != SRCU_SIZE_SMALL) return; - spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags); + raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags); if (smp_load_acquire(&ssp->srcu_sup->srcu_size_state) != SRCU_SIZE_SMALL) { - spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); + raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); return; } __srcu_transition_to_big(ssp); - spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); + raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); } /* * Check to see if the just-encountered contention event justifies * a transition to SRCU_SIZE_BIG. */ -static void spin_lock_irqsave_check_contention(struct srcu_struct *ssp) +static void raw_spin_lock_irqsave_check_contention(struct srcu_struct *ssp) { unsigned long j; @@ -429,16 +393,16 @@ static void spin_lock_irqsave_check_contention(struct srcu_struct *ssp) * to SRCU_SIZE_BIG. But only if the srcutree.convert_to_big module * parameter permits this. 
*/ -static void spin_lock_irqsave_sdp_contention(struct srcu_data *sdp, unsigned long *flags) +static void raw_spin_lock_irqsave_sdp_contention(struct srcu_data *sdp, unsigned long *flags) { struct srcu_struct *ssp = sdp->ssp; - if (spin_trylock_irqsave_rcu_node(sdp, *flags)) + if (raw_spin_trylock_irqsave_rcu_node(sdp, *flags)) return; - spin_lock_irqsave_rcu_node(ssp->srcu_sup, *flags); - spin_lock_irqsave_check_contention(ssp); - spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, *flags); - spin_lock_irqsave_rcu_node(sdp, *flags); + raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, *flags); + raw_spin_lock_irqsave_check_contention(ssp); + raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, *flags); + raw_spin_lock_irqsave_rcu_node(sdp, *flags); } /* @@ -447,12 +411,12 @@ static void spin_lock_irqsave_sdp_contention(struct srcu_data *sdp, unsigned lon * to SRCU_SIZE_BIG. But only if the srcutree.convert_to_big module * parameter permits this. */ -static void spin_lock_irqsave_ssp_contention(struct srcu_struct *ssp, unsigned long *flags) +static void raw_spin_lock_irqsave_ssp_contention(struct srcu_struct *ssp, unsigned long *flags) { - if (spin_trylock_irqsave_rcu_node(ssp->srcu_sup, *flags)) + if (raw_spin_trylock_irqsave_rcu_node(ssp->srcu_sup, *flags)) return; - spin_lock_irqsave_rcu_node(ssp->srcu_sup, *flags); - spin_lock_irqsave_check_contention(ssp); + raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, *flags); + raw_spin_lock_irqsave_check_contention(ssp); } /* @@ -470,13 +434,13 @@ static void check_init_srcu_struct(struct srcu_struct *ssp) /* The smp_load_acquire() pairs with the smp_store_release(). */ if (!rcu_seq_state(smp_load_acquire(&ssp->srcu_sup->srcu_gp_seq_needed))) /*^^^*/ return; /* Already initialized. 
*/ - spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags); + raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags); if (!rcu_seq_state(ssp->srcu_sup->srcu_gp_seq_needed)) { - spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); + raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); return; } init_srcu_struct_fields(ssp, true); - spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); + raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); } /* @@ -742,9 +706,9 @@ void cleanup_srcu_struct(struct srcu_struct *ssp) unsigned long delay; struct srcu_usage *sup = ssp->srcu_sup; - spin_lock_irq_rcu_node(ssp->srcu_sup); + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); delay = srcu_get_delay(ssp); - spin_unlock_irq_rcu_node(ssp->srcu_sup); + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); if (WARN_ON(!delay)) return; /* Just leak it! */ if (WARN_ON(srcu_readers_active(ssp))) @@ -960,7 +924,7 @@ static void srcu_gp_end(struct srcu_struct *ssp) mutex_lock(&sup->srcu_cb_mutex); /* End the current grace period. */ - spin_lock_irq_rcu_node(sup); + raw_spin_lock_irq_rcu_node(sup); idx = rcu_seq_state(sup->srcu_gp_seq); WARN_ON_ONCE(idx != SRCU_STATE_SCAN2); if (srcu_gp_is_expedited(ssp)) @@ -971,7 +935,7 @@ static void srcu_gp_end(struct srcu_struct *ssp) gpseq = rcu_seq_current(&sup->srcu_gp_seq); if (ULONG_CMP_LT(sup->srcu_gp_seq_needed_exp, gpseq)) WRITE_ONCE(sup->srcu_gp_seq_needed_exp, gpseq); - spin_unlock_irq_rcu_node(sup); + raw_spin_unlock_irq_rcu_node(sup); mutex_unlock(&sup->srcu_gp_mutex); /* A new grace period can start at this point. But only one. 
*/ @@ -983,7 +947,7 @@ static void srcu_gp_end(struct srcu_struct *ssp) } else { idx = rcu_seq_ctr(gpseq) % ARRAY_SIZE(snp->srcu_have_cbs); srcu_for_each_node_breadth_first(ssp, snp) { - spin_lock_irq_rcu_node(snp); + raw_spin_lock_irq_rcu_node(snp); cbs = false; last_lvl = snp >= sup->level[rcu_num_lvls - 1]; if (last_lvl) @@ -998,7 +962,7 @@ static void srcu_gp_end(struct srcu_struct *ssp) else mask = snp->srcu_data_have_cbs[idx]; snp->srcu_data_have_cbs[idx] = 0; - spin_unlock_irq_rcu_node(snp); + raw_spin_unlock_irq_rcu_node(snp); if (cbs) srcu_schedule_cbs_snp(ssp, snp, mask, cbdelay); } @@ -1008,27 +972,27 @@ static void srcu_gp_end(struct srcu_struct *ssp) if (!(gpseq & counter_wrap_check)) for_each_possible_cpu(cpu) { sdp = per_cpu_ptr(ssp->sda, cpu); - spin_lock_irq_rcu_node(sdp); + raw_spin_lock_irq_rcu_node(sdp); if (ULONG_CMP_GE(gpseq, sdp->srcu_gp_seq_needed + 100)) sdp->srcu_gp_seq_needed = gpseq; if (ULONG_CMP_GE(gpseq, sdp->srcu_gp_seq_needed_exp + 100)) sdp->srcu_gp_seq_needed_exp = gpseq; - spin_unlock_irq_rcu_node(sdp); + raw_spin_unlock_irq_rcu_node(sdp); } /* Callback initiation done, allow grace periods after next. */ mutex_unlock(&sup->srcu_cb_mutex); /* Start a new grace period if needed. */ - spin_lock_irq_rcu_node(sup); + raw_spin_lock_irq_rcu_node(sup); gpseq = rcu_seq_current(&sup->srcu_gp_seq); if (!rcu_seq_state(gpseq) && ULONG_CMP_LT(gpseq, sup->srcu_gp_seq_needed)) { srcu_gp_start(ssp); - spin_unlock_irq_rcu_node(sup); + raw_spin_unlock_irq_rcu_node(sup); srcu_reschedule(ssp, 0); } else { - spin_unlock_irq_rcu_node(sup); + raw_spin_unlock_irq_rcu_node(sup); } /* Transition to big if needed. 
*/ @@ -1059,19 +1023,19 @@ static void srcu_funnel_exp_start(struct srcu_struct *ssp, struct srcu_node *snp if (WARN_ON_ONCE(rcu_seq_done(&ssp->srcu_sup->srcu_gp_seq, s)) || (!srcu_invl_snp_seq(sgsne) && ULONG_CMP_GE(sgsne, s))) return; - spin_lock_irqsave_rcu_node(snp, flags); + raw_spin_lock_irqsave_rcu_node(snp, flags); sgsne = snp->srcu_gp_seq_needed_exp; if (!srcu_invl_snp_seq(sgsne) && ULONG_CMP_GE(sgsne, s)) { - spin_unlock_irqrestore_rcu_node(snp, flags); + raw_spin_unlock_irqrestore_rcu_node(snp, flags); return; } WRITE_ONCE(snp->srcu_gp_seq_needed_exp, s); - spin_unlock_irqrestore_rcu_node(snp, flags); + raw_spin_unlock_irqrestore_rcu_node(snp, flags); } - spin_lock_irqsave_ssp_contention(ssp, &flags); + raw_spin_lock_irqsave_ssp_contention(ssp, &flags); if (ULONG_CMP_LT(ssp->srcu_sup->srcu_gp_seq_needed_exp, s)) WRITE_ONCE(ssp->srcu_sup->srcu_gp_seq_needed_exp, s); - spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); + raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); } /* @@ -1109,12 +1073,12 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, for (snp = snp_leaf; snp != NULL; snp = snp->srcu_parent) { if (WARN_ON_ONCE(rcu_seq_done(&sup->srcu_gp_seq, s)) && snp != snp_leaf) return; /* GP already done and CBs recorded. */ - spin_lock_irqsave_rcu_node(snp, flags); + raw_spin_lock_irqsave_rcu_node(snp, flags); snp_seq = snp->srcu_have_cbs[idx]; if (!srcu_invl_snp_seq(snp_seq) && ULONG_CMP_GE(snp_seq, s)) { if (snp == snp_leaf && snp_seq == s) snp->srcu_data_have_cbs[idx] |= sdp->grpmask; - spin_unlock_irqrestore_rcu_node(snp, flags); + raw_spin_unlock_irqrestore_rcu_node(snp, flags); if (snp == snp_leaf && snp_seq != s) { srcu_schedule_cbs_sdp(sdp, do_norm ? 
SRCU_INTERVAL : 0); return; @@ -1129,11 +1093,11 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, sgsne = snp->srcu_gp_seq_needed_exp; if (!do_norm && (srcu_invl_snp_seq(sgsne) || ULONG_CMP_LT(sgsne, s))) WRITE_ONCE(snp->srcu_gp_seq_needed_exp, s); - spin_unlock_irqrestore_rcu_node(snp, flags); + raw_spin_unlock_irqrestore_rcu_node(snp, flags); } /* Top of tree, must ensure the grace period will be started. */ - spin_lock_irqsave_ssp_contention(ssp, &flags); + raw_spin_lock_irqsave_ssp_contention(ssp, &flags); if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { /* * Record need for grace period s. Pair with load @@ -1160,7 +1124,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, else if (list_empty(&sup->work.work.entry)) list_add(&sup->work.work.entry, &srcu_boot_list); } - spin_unlock_irqrestore_rcu_node(sup, flags); + raw_spin_unlock_irqrestore_rcu_node(sup, flags); } /* @@ -1172,9 +1136,9 @@ static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount) { unsigned long curdelay; - spin_lock_irq_rcu_node(ssp->srcu_sup); + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); curdelay = !srcu_get_delay(ssp); - spin_unlock_irq_rcu_node(ssp->srcu_sup); + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); for (;;) { if (srcu_readers_active_idx_check(ssp, idx)) @@ -1285,12 +1249,12 @@ static bool srcu_should_expedite(struct srcu_struct *ssp) return false; /* If the local srcu_data structure has callbacks, not idle. */ sdp = raw_cpu_ptr(ssp->sda); - spin_lock_irqsave_rcu_node(sdp, flags); + raw_spin_lock_irqsave_rcu_node(sdp, flags); if (rcu_segcblist_pend_cbs(&sdp->srcu_cblist)) { - spin_unlock_irqrestore_rcu_node(sdp, flags); + raw_spin_unlock_irqrestore_rcu_node(sdp, flags); return false; /* Callbacks already present, so not idle. */ } - spin_unlock_irqrestore_rcu_node(sdp, flags); + raw_spin_unlock_irqrestore_rcu_node(sdp, flags); /* * No local callbacks, so probabilistically probe global state. 
@@ -1350,7 +1314,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, sdp = per_cpu_ptr(ssp->sda, get_boot_cpu_id()); else sdp = raw_cpu_ptr(ssp->sda); - spin_lock_irqsave_sdp_contention(sdp, &flags); + raw_spin_lock_irqsave_sdp_contention(sdp, &flags); if (rhp) rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp); /* @@ -1410,7 +1374,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp, sdp->srcu_gp_seq_needed_exp = s; needexp = true; } - spin_unlock_irqrestore_rcu_node(sdp, flags); + raw_spin_unlock_irqrestore_rcu_node(sdp, flags); /* Ensure that snp node tree is fully initialized before traversing it */ if (ss_state < SRCU_SIZE_WAIT_BARRIER) @@ -1522,7 +1486,7 @@ static void __synchronize_srcu(struct srcu_struct *ssp, bool do_norm) /* * Make sure that later code is ordered after the SRCU grace - * period. This pairs with the spin_lock_irq_rcu_node() + * period. This pairs with the raw_spin_lock_irq_rcu_node() * in srcu_invoke_callbacks(). Unlike Tree RCU, this is needed * because the current CPU might have been totally uninvolved with * (and thus unordered against) that grace period. 
@@ -1701,7 +1665,7 @@ static void srcu_barrier_cb(struct rcu_head *rhp)
  */
 static void srcu_barrier_one_cpu(struct srcu_struct *ssp, struct srcu_data *sdp)
 {
-	spin_lock_irq_rcu_node(sdp);
+	raw_spin_lock_irq_rcu_node(sdp);
 	atomic_inc(&ssp->srcu_sup->srcu_barrier_cpu_cnt);
 	sdp->srcu_barrier_head.func = srcu_barrier_cb;
 	debug_rcu_head_queue(&sdp->srcu_barrier_head);
@@ -1710,7 +1674,7 @@ static void srcu_barrier_one_cpu(struct srcu_struct *ssp, struct srcu_data *sdp)
 		debug_rcu_head_unqueue(&sdp->srcu_barrier_head);
 		atomic_dec(&ssp->srcu_sup->srcu_barrier_cpu_cnt);
 	}
-	spin_unlock_irq_rcu_node(sdp);
+	raw_spin_unlock_irq_rcu_node(sdp);
 }
 
 /**
@@ -1761,7 +1725,7 @@ static void srcu_expedite_current_cb(struct rcu_head *rhp)
 	bool needcb = false;
 	struct srcu_data *sdp = container_of(rhp, struct srcu_data, srcu_ec_head);
 
-	spin_lock_irqsave_sdp_contention(sdp, &flags);
+	raw_spin_lock_irqsave_sdp_contention(sdp, &flags);
 	if (sdp->srcu_ec_state == SRCU_EC_IDLE) {
 		WARN_ON_ONCE(1);
 	} else if (sdp->srcu_ec_state == SRCU_EC_PENDING) {
@@ -1771,7 +1735,7 @@ static void srcu_expedite_current_cb(struct rcu_head *rhp)
 		sdp->srcu_ec_state = SRCU_EC_PENDING;
 		needcb = true;
 	}
-	spin_unlock_irqrestore_rcu_node(sdp, flags);
+	raw_spin_unlock_irqrestore_rcu_node(sdp, flags);
 
 	// If needed, requeue ourselves as an expedited SRCU callback.
 	if (needcb)
 		__call_srcu(sdp->ssp, &sdp->srcu_ec_head, srcu_expedite_current_cb, false);
@@ -1795,7 +1759,7 @@ void srcu_expedite_current(struct srcu_struct *ssp)
 	migrate_disable();
 	sdp = this_cpu_ptr(ssp->sda);
-	spin_lock_irqsave_sdp_contention(sdp, &flags);
+	raw_spin_lock_irqsave_sdp_contention(sdp, &flags);
 	if (sdp->srcu_ec_state == SRCU_EC_IDLE) {
 		sdp->srcu_ec_state = SRCU_EC_PENDING;
 		needcb = true;
@@ -1804,7 +1768,7 @@ void srcu_expedite_current(struct srcu_struct *ssp)
 	} else {
 		WARN_ON_ONCE(sdp->srcu_ec_state != SRCU_EC_REPOST);
 	}
-	spin_unlock_irqrestore_rcu_node(sdp, flags);
+	raw_spin_unlock_irqrestore_rcu_node(sdp, flags);
 	// If needed, queue an expedited SRCU callback.
 	if (needcb)
 		__call_srcu(ssp, &sdp->srcu_ec_head, srcu_expedite_current_cb, false);
@@ -1848,17 +1812,17 @@ static void srcu_advance_state(struct srcu_struct *ssp)
 	 */
 	idx = rcu_seq_state(smp_load_acquire(&ssp->srcu_sup->srcu_gp_seq)); /* ^^^ */
 	if (idx == SRCU_STATE_IDLE) {
-		spin_lock_irq_rcu_node(ssp->srcu_sup);
+		raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
 		if (ULONG_CMP_GE(ssp->srcu_sup->srcu_gp_seq, ssp->srcu_sup->srcu_gp_seq_needed)) {
 			WARN_ON_ONCE(rcu_seq_state(ssp->srcu_sup->srcu_gp_seq));
-			spin_unlock_irq_rcu_node(ssp->srcu_sup);
+			raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
 			mutex_unlock(&ssp->srcu_sup->srcu_gp_mutex);
 			return;
 		}
 		idx = rcu_seq_state(READ_ONCE(ssp->srcu_sup->srcu_gp_seq));
 		if (idx == SRCU_STATE_IDLE)
 			srcu_gp_start(ssp);
-		spin_unlock_irq_rcu_node(ssp->srcu_sup);
+		raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
 		if (idx != SRCU_STATE_IDLE) {
 			mutex_unlock(&ssp->srcu_sup->srcu_gp_mutex);
 			return; /* Someone else started the grace period. */
@@ -1872,10 +1836,10 @@ static void srcu_advance_state(struct srcu_struct *ssp)
 			return; /* readers present, retry later. */
 		}
 		srcu_flip(ssp);
-		spin_lock_irq_rcu_node(ssp->srcu_sup);
+		raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
 		rcu_seq_set_state(&ssp->srcu_sup->srcu_gp_seq, SRCU_STATE_SCAN2);
 		ssp->srcu_sup->srcu_n_exp_nodelay = 0;
-		spin_unlock_irq_rcu_node(ssp->srcu_sup);
+		raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
 	}
 
 	if (rcu_seq_state(READ_ONCE(ssp->srcu_sup->srcu_gp_seq)) == SRCU_STATE_SCAN2) {
@@ -1913,7 +1877,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	ssp = sdp->ssp;
 	rcu_cblist_init(&ready_cbs);
-	spin_lock_irq_rcu_node(sdp);
+	raw_spin_lock_irq_rcu_node(sdp);
 	WARN_ON_ONCE(!rcu_segcblist_segempty(&sdp->srcu_cblist, RCU_NEXT_TAIL));
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
@@ -1924,7 +1888,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	 */
 	if (sdp->srcu_cblist_invoking ||
 	    !rcu_segcblist_ready_cbs(&sdp->srcu_cblist)) {
-		spin_unlock_irq_rcu_node(sdp);
+		raw_spin_unlock_irq_rcu_node(sdp);
 		return; /* Someone else on the job or nothing to do. */
 	}
@@ -1932,7 +1896,7 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	sdp->srcu_cblist_invoking = true;
 	rcu_segcblist_extract_done_cbs(&sdp->srcu_cblist, &ready_cbs);
 	len = ready_cbs.len;
-	spin_unlock_irq_rcu_node(sdp);
+	raw_spin_unlock_irq_rcu_node(sdp);
 	rhp = rcu_cblist_dequeue(&ready_cbs);
 	for (; rhp != NULL; rhp = rcu_cblist_dequeue(&ready_cbs)) {
 		debug_rcu_head_unqueue(rhp);
@@ -1947,11 +1911,11 @@ static void srcu_invoke_callbacks(struct work_struct *work)
 	 * Update counts, accelerate new callbacks, and if needed,
 	 * schedule another round of callback invocation.
 	 */
-	spin_lock_irq_rcu_node(sdp);
+	raw_spin_lock_irq_rcu_node(sdp);
 	rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
 	sdp->srcu_cblist_invoking = false;
 	more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
-	spin_unlock_irq_rcu_node(sdp);
+	raw_spin_unlock_irq_rcu_node(sdp);
 	/* An SRCU barrier or callbacks from previous nesting work pending */
 	if (more)
 		srcu_schedule_cbs_sdp(sdp, 0);
@@ -1965,7 +1929,7 @@ static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay)
 {
 	bool pushgp = true;
 
-	spin_lock_irq_rcu_node(ssp->srcu_sup);
+	raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
 	if (ULONG_CMP_GE(ssp->srcu_sup->srcu_gp_seq, ssp->srcu_sup->srcu_gp_seq_needed)) {
 		if (!WARN_ON_ONCE(rcu_seq_state(ssp->srcu_sup->srcu_gp_seq))) {
 			/* All requests fulfilled, time to go idle. */
@@ -1975,7 +1939,7 @@ static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay)
 		/* Outstanding request and no GP.  Start one. */
 		srcu_gp_start(ssp);
 	}
-	spin_unlock_irq_rcu_node(ssp->srcu_sup);
+	raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
 
 	if (pushgp)
 		queue_delayed_work(rcu_gp_wq, &ssp->srcu_sup->work, delay);
@@ -1995,9 +1959,9 @@ static void process_srcu(struct work_struct *work)
 	ssp = sup->srcu_ssp;
 
 	srcu_advance_state(ssp);
-	spin_lock_irq_rcu_node(ssp->srcu_sup);
+	raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
 	curdelay = srcu_get_delay(ssp);
-	spin_unlock_irq_rcu_node(ssp->srcu_sup);
+	raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
 	if (curdelay) {
 		WRITE_ONCE(sup->reschedule_count, 0);
 	} else {

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 11:49 ` Paul E. McKenney
@ 2026-03-18 14:43   ` Sebastian Andrzej Siewior
  2026-03-18 15:43     ` Paul E. McKenney
  2026-03-18 15:51     ` Boqun Feng
  0 siblings, 2 replies; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 14:43 UTC (permalink / raw)
To: Paul E. McKenney
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On 2026-03-18 04:49:52 [-0700], Paul E. McKenney wrote:
> > > Back to the actual bug, that call_srcu() now needs to tolerate being
> > > called with scheduler rq/pi locks held...
> >
> > This is because it is called from sched_ext BPF callbacks?
>
> You got it!  We are re-implementing Tasks Trace RCU in terms of SRCU-fast,
> and I missed this requirement the first time around.  I *did* make readers
> able to deal with BPF being invoked from everywhere, so two out of three?

right ;)

> > > The straightforward (but perhaps broken) way to resolve this is to make
> > > srcu_gp_start_if_needed() defer invoking the scheduler, similar to the
> >
> > Quick question.  If srcu_gp_start_if_needed() can be invoked from a
> > preempt-disabled section (due to rq/pi lock) then
> > 	spin_lock_irqsave_sdp_contention(sdp, &flags);
> > does not work, right?
>
> Agreed, which is why the patch at the end of this email converts this to:
>
> 	raw_spin_lock_irqsave_sdp_contention(sdp, &flags)

I've seen that now.  So the spinlock_t usage in SRCU was short-lived.

> > > way that vanilla RCU's call_rcu_core() function takes an early exit if
> > > interrupts are disabled.  Of course, vanilla RCU can rely on things like
> > > the scheduling-clock interrupt to start any needed grace periods [1],
> > > but SRCU will instead need to manually defer this work, perhaps using
> > > workqueues or IRQ work.
> > >
> > > In addition, rcutorture needs to be upgraded to sometimes invoke
> > > ->call() with the scheduler pi lock held, but this change is not fixing
> > > a regression, so could be deferred.  (There is already code in rcutorture
> > > that invokes the readers while holding a scheduler pi lock.)
> > >
> > > Given that RCU for this week through the end of March belongs to you guys,
> > > if one of you can get this done by end of day Thursday, London time,
> > > very good!  Otherwise, I can put something together.
> > >
> > > Please let me know!
> >
> > Given that the current locking does allow it and lockdep should have
> > complained, I am curious if we could rule that out ;)

Your patch just does s/spinlock_t/raw_spinlock_t/ so that we get the
locking/nesting right.  The wakeup problem remains, right?
But looking at the code, there is just srcu_funnel_gp_start().  If its
srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed,
then there will always be a timer and never a direct wake up of the
worker.  Wouldn't that work?

> It would be nice, but your point about needing to worry about spinlocks
> is compelling.
>
> But couldn't lockdep scan the current task's list of held locks and see
> whether only raw spinlocks are held (including when no spinlocks of any
> type are held), and complain in that case?  Or would that scanning be
> too high of overhead?  (But we need that scan anyway to check deadlock,
> don't we?)

PeterZ didn't like it, and the nesting check identified most of the
problem cases.  It should also catch _this_ one.

Thinking about it further, you don't need to worry about
local_bh_disable(), but RCU will become another corner case.  You would
have to exclude "rcu_read_lock(); spin_lock();" on a !preempt kernel,
which would otherwise lead to false positives.
But as I said, this case as explained is a nesting problem and should be
reported by lockdep with its current features.

> > > 							Thanx, Paul [2]
> > >
> > > [1] The exceptions to this rule being handled by the call to
> > >     invoke_rcu_core() when rcu_is_watching() returns false.
> > >
> > > [2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
> > >     handlers?  Or should there be a call_rcu_nmi() for this purpose?
> > >     Or should we continue to have its callers check in_nmi() when needed?
> >
> > Did someone ask for this?
>
> Yes.  The BPF guys need to invoke call_srcu() from interrupts-disabled
> regions of code.  I am way too old and lazy to do this sort of thing
> spontaneously.  ;-)

IRQ disabled should work, but you asked about call_rcu_nmi(), and NMI is
already complicated because "most" other things don't work there: you
would need irq_work to let the rest of the kernel know that you did
something in NMI, and that would need to be integrated now.  I don't
think regular RCU has call_rcu() from NMI.  But I guess wrapping it via
irq_work would be one way of dealing with it.

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 14:43 ` Sebastian Andrzej Siewior
@ 2026-03-18 15:43   ` Paul E. McKenney
  2026-03-18 16:04     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-18 15:43 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 03:43:05PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 04:49:52 [-0700], Paul E. McKenney wrote:
> > > > Back to the actual bug, that call_srcu() now needs to tolerate being
> > > > called with scheduler rq/pi locks held...
> > >
> > > This is because it is called from sched_ext BPF callbacks?
> >
> > You got it!  We are re-implementing Tasks Trace RCU in terms of SRCU-fast,
> > and I missed this requirement the first time around.  I *did* make readers
> > able to deal with BPF being invoked from everywhere, so two out of three?
>
> right ;)
>
> > > > The straightforward (but perhaps broken) way to resolve this is to make
> > > > srcu_gp_start_if_needed() defer invoking the scheduler, similar to the
> > >
> > > Quick question.  If srcu_gp_start_if_needed() can be invoked from a
> > > preempt-disabled section (due to rq/pi lock) then
> > > 	spin_lock_irqsave_sdp_contention(sdp, &flags);
> > > does not work, right?
> >
> > Agreed, which is why the patch at the end of this email converts this to:
> >
> > 	raw_spin_lock_irqsave_sdp_contention(sdp, &flags)
>
> I've seen that now.  So the spinlock_t usage in SRCU was short-lived.

Almost ten years.  ;-)

> > > > way that vanilla RCU's call_rcu_core() function takes an early exit if
> > > > interrupts are disabled.  Of course, vanilla RCU can rely on things like
> > > > the scheduling-clock interrupt to start any needed grace periods [1],
> > > > but SRCU will instead need to manually defer this work, perhaps using
> > > > workqueues or IRQ work.
> > > >
> > > > In addition, rcutorture needs to be upgraded to sometimes invoke
> > > > ->call() with the scheduler pi lock held, but this change is not fixing
> > > > a regression, so could be deferred.  (There is already code in rcutorture
> > > > that invokes the readers while holding a scheduler pi lock.)
> > > >
> > > > Given that RCU for this week through the end of March belongs to you guys,
> > > > if one of you can get this done by end of day Thursday, London time,
> > > > very good!  Otherwise, I can put something together.
> > > >
> > > > Please let me know!
> > >
> > > Given that the current locking does allow it and lockdep should have
> > > complained, I am curious if we could rule that out ;)
>
> Your patch just does s/spinlock_t/raw_spinlock_t/ so that we get the
> locking/nesting right.  The wakeup problem remains, right?
> But looking at the code, there is just srcu_funnel_gp_start().  If its
> srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed,
> then there will always be a timer and never a direct wake up of the
> worker.  Wouldn't that work?

Right, that patch fixes one lockdep problem, but another remains.

> > It would be nice, but your point about needing to worry about spinlocks
> > is compelling.
> >
> > But couldn't lockdep scan the current task's list of held locks and see
> > whether only raw spinlocks are held (including when no spinlocks of any
> > type are held), and complain in that case?  Or would that scanning be
> > too high of overhead?  (But we need that scan anyway to check deadlock,
> > don't we?)
>
> PeterZ didn't like it, and the nesting check identified most of the
> problem cases.  It should also catch _this_ one.
>
> Thinking about it further, you don't need to worry about
> local_bh_disable(), but RCU will become another corner case.  You would
> have to exclude "rcu_read_lock(); spin_lock();" on a !preempt kernel,
> which would otherwise lead to false positives.
> But as I said, this case as explained is a nesting problem and should be
> reported by lockdep with its current features.

With a raw spinlock held, agreed.

Not a big deal, just working out what to put in rcutorture to avoid
regressions that would otherwise result in being unable to invoke
call_srcu() from non-preemptible contexts.

> > > > 							Thanx, Paul [2]
> > > >
> > > > [1] The exceptions to this rule being handled by the call to
> > > >     invoke_rcu_core() when rcu_is_watching() returns false.
> > > >
> > > > [2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
> > > >     handlers?  Or should there be a call_rcu_nmi() for this purpose?
> > > >     Or should we continue to have its callers check in_nmi() when needed?
> > >
> > > Did someone ask for this?
> >
> > Yes.  The BPF guys need to invoke call_srcu() from interrupts-disabled
> > regions of code.  I am way too old and lazy to do this sort of thing
> > spontaneously.  ;-)
>
> IRQ disabled should work, but you asked about call_rcu_nmi(), and NMI is
> already complicated because "most" other things don't work there: you
> would need irq_work to let the rest of the kernel know that you did
> something in NMI, and that would need to be integrated now.  I don't
> think regular RCU has call_rcu() from NMI.  But I guess wrapping it via
> irq_work would be one way of dealing with it.

Agreed, and as long as there are only a few call_rcu() call sites within
NMI handlers, it is best to let the caller deal with it.  But if this
becomes popular enough, it would be better to have a call_rcu_nmi() or
some such.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 15:43 ` Paul E. McKenney
@ 2026-03-18 16:04   ` Sebastian Andrzej Siewior
  2026-03-18 16:32     ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 16:04 UTC (permalink / raw)
To: Paul E. McKenney
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On 2026-03-18 08:43:32 [-0700], Paul E. McKenney wrote:
> > Your patch just does s/spinlock_t/raw_spinlock_t/ so that we get the
> > locking/nesting right.  The wakeup problem remains, right?
> > But looking at the code, there is just srcu_funnel_gp_start().  If its
> > srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed,
> > then there will always be a timer and never a direct wake up of the
> > worker.  Wouldn't that work?
>
> Right, that patch fixes one lockdep problem, but another remains.

What remains?

> > > It would be nice, but your point about needing to worry about spinlocks
> > > is compelling.
> > >
> > > But couldn't lockdep scan the current task's list of held locks and see
> > > whether only raw spinlocks are held (including when no spinlocks of any
> > > type are held), and complain in that case?  Or would that scanning be
> > > too high of overhead?  (But we need that scan anyway to check deadlock,
> > > don't we?)
> >
> > PeterZ didn't like it, and the nesting check identified most of the
> > problem cases.  It should also catch _this_ one.
> >
> > Thinking about it further, you don't need to worry about
> > local_bh_disable(), but RCU will become another corner case.  You would
> > have to exclude "rcu_read_lock(); spin_lock();" on a !preempt kernel,
> > which would otherwise lead to false positives.
> > But as I said, this case as explained is a nesting problem and should be
> > reported by lockdep with its current features.
>
> With a raw spinlock held, agreed.
>
> Not a big deal, just working out what to put in rcutorture to avoid
> regressions that would otherwise result in being unable to invoke
> call_srcu() from non-preemptible contexts.

Okay.  So take this as _no_ more work items ;)

> > > > > 							Thanx, Paul [2]
> > > > >
> > > > > [1] The exceptions to this rule being handled by the call to
> > > > >     invoke_rcu_core() when rcu_is_watching() returns false.
> > > > >
> > > > > [2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
> > > > >     handlers?  Or should there be a call_rcu_nmi() for this purpose?
> > > > >     Or should we continue to have its callers check in_nmi() when needed?
> > > >
> > > > Did someone ask for this?
> > >
> > > Yes.  The BPF guys need to invoke call_srcu() from interrupts-disabled
> > > regions of code.  I am way too old and lazy to do this sort of thing
> > > spontaneously.  ;-)
> >
> > IRQ disabled should work, but you asked about call_rcu_nmi(), and NMI is
> > already complicated because "most" other things don't work there: you
> > would need irq_work to let the rest of the kernel know that you did
> > something in NMI, and that would need to be integrated now.  I don't
> > think regular RCU has call_rcu() from NMI.  But I guess wrapping it via
> > irq_work would be one way of dealing with it.
>
> Agreed, and as long as there are only a few call_rcu() call sites within
> NMI handlers, it is best to let the caller deal with it.  But if this
> becomes popular enough, it would be better to have a call_rcu_nmi() or
> some such.

Popular?  Okay.  Keep me posted, please.

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 16:04 ` Sebastian Andrzej Siewior
@ 2026-03-18 16:32   ` Paul E. McKenney
  2026-03-18 16:42     ` Boqun Feng
  2026-03-18 16:47     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-18 16:32 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 05:04:45PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 08:43:32 [-0700], Paul E. McKenney wrote:
> > > Your patch just does s/spinlock_t/raw_spinlock_t/ so that we get the
> > > locking/nesting right.  The wakeup problem remains, right?
> > > But looking at the code, there is just srcu_funnel_gp_start().  If its
> > > srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed,
> > > then there will always be a timer and never a direct wake up of the
> > > worker.  Wouldn't that work?
> >
> > Right, that patch fixes one lockdep problem, but another remains.
>
> What remains?

With that patch, we no longer have call_srcu() directly acquiring a
non-raw spinlock, but as you say, we still have the wakeup problem.

> > > > It would be nice, but your point about needing to worry about spinlocks
> > > > is compelling.
> > > >
> > > > But couldn't lockdep scan the current task's list of held locks and see
> > > > whether only raw spinlocks are held (including when no spinlocks of any
> > > > type are held), and complain in that case?  Or would that scanning be
> > > > too high of overhead?  (But we need that scan anyway to check deadlock,
> > > > don't we?)
> > >
> > > PeterZ didn't like it, and the nesting check identified most of the
> > > problem cases.  It should also catch _this_ one.
> > >
> > > Thinking about it further, you don't need to worry about
> > > local_bh_disable(), but RCU will become another corner case.  You would
> > > have to exclude "rcu_read_lock(); spin_lock();" on a !preempt kernel,
> > > which would otherwise lead to false positives.
> > > But as I said, this case as explained is a nesting problem and should be
> > > reported by lockdep with its current features.
> >
> > With a raw spinlock held, agreed.
> >
> > Not a big deal, just working out what to put in rcutorture to avoid
> > regressions that would otherwise result in being unable to invoke
> > call_srcu() from non-preemptible contexts.
>
> Okay.  So take this as _no_ more work items ;)

I agree that the rcutorture work can wait until the next merge window.

> > > > > > 							Thanx, Paul [2]
> > > > > >
> > > > > > [1] The exceptions to this rule being handled by the call to
> > > > > >     invoke_rcu_core() when rcu_is_watching() returns false.
> > > > > >
> > > > > > [2] Ah, and should vanilla RCU's call_rcu() be invokable from NMI
> > > > > >     handlers?  Or should there be a call_rcu_nmi() for this purpose?
> > > > > >     Or should we continue to have its callers check in_nmi() when needed?
> > > > >
> > > > > Did someone ask for this?
> > > >
> > > > Yes.  The BPF guys need to invoke call_srcu() from interrupts-disabled
> > > > regions of code.  I am way too old and lazy to do this sort of thing
> > > > spontaneously.  ;-)
> > >
> > > IRQ disabled should work, but you asked about call_rcu_nmi(), and NMI is
> > > already complicated because "most" other things don't work there: you
> > > would need irq_work to let the rest of the kernel know that you did
> > > something in NMI, and that would need to be integrated now.  I don't
> > > think regular RCU has call_rcu() from NMI.  But I guess wrapping it via
> > > irq_work would be one way of dealing with it.
> >
> > Agreed, and as long as there are only a few call_rcu() call sites within
> > NMI handlers, it is best to let the caller deal with it.  But if this
> > becomes popular enough, it would be better to have a call_rcu_nmi() or
> > some such.
>
> Popular?  Okay.  Keep me posted, please.

Will do.  Just out of curiosity, what are your concerns?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 16:32 ` Paul E. McKenney
@ 2026-03-18 16:42   ` Boqun Feng
  2026-03-18 18:45     ` Paul E. McKenney
  2026-03-18 16:47   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-18 16:42 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	joelagnelf, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 09:32:07AM -0700, Paul E. McKenney wrote:
> On Wed, Mar 18, 2026 at 05:04:45PM +0100, Sebastian Andrzej Siewior wrote:
> > On 2026-03-18 08:43:32 [-0700], Paul E. McKenney wrote:
> > > > Your patch just does s/spinlock_t/raw_spinlock_t/ so that we get the
> > > > locking/nesting right.  The wakeup problem remains, right?
> > > > But looking at the code, there is just srcu_funnel_gp_start().  If its
> > > > srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed,
> > > > then there will always be a timer and never a direct wake up of the
> > > > worker.  Wouldn't that work?
> > >
> > > Right, that patch fixes one lockdep problem, but another remains.
> >
> > What remains?
>
> With that patch, we no longer have call_srcu() directly acquiring a
> non-raw spinlock, but as you say, we still have the wakeup problem.

I don't think we have a wakeup problem since we use workqueue to defer
the wakeup, but maybe I'm missing something here?

Regards,
Boqun

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 16:42 ` Boqun Feng
@ 2026-03-18 18:45   ` Paul E. McKenney
  0 siblings, 0 replies; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-18 18:45 UTC (permalink / raw)
To: Boqun Feng
Cc: Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	joelagnelf, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 09:42:09AM -0700, Boqun Feng wrote:
> On Wed, Mar 18, 2026 at 09:32:07AM -0700, Paul E. McKenney wrote:
> > On Wed, Mar 18, 2026 at 05:04:45PM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2026-03-18 08:43:32 [-0700], Paul E. McKenney wrote:
> > > > > Your patch just does s/spinlock_t/raw_spinlock_t/ so that we get the
> > > > > locking/nesting right.  The wakeup problem remains, right?
> > > > > But looking at the code, there is just srcu_funnel_gp_start().  If its
> > > > > srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed,
> > > > > then there will always be a timer and never a direct wake up of the
> > > > > worker.  Wouldn't that work?
> > > >
> > > > Right, that patch fixes one lockdep problem, but another remains.
> > >
> > > What remains?
> >
> > With that patch, we no longer have call_srcu() directly acquiring a
> > non-raw spinlock, but as you say, we still have the wakeup problem.
>
> I don't think we have a wakeup problem since we use workqueue to defer
> the wakeup, but maybe I'm missing something here?

You are right, I was confused.  We instead have a deadlock problem.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 16:32 ` Paul E. McKenney
  2026-03-18 16:42 ` Boqun Feng
@ 2026-03-18 16:47   ` Sebastian Andrzej Siewior
  2026-03-18 18:48     ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 16:47 UTC (permalink / raw)
To: Paul E. McKenney
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On 2026-03-18 09:32:07 [-0700], Paul E. McKenney wrote:
> > What remains?
>
> With that patch, we no longer have call_srcu() directly acquiring a
> non-raw spinlock, but as you say, we still have the wakeup problem.

Isn't this just srcu_funnel_gp_start()?  And there you could ensure
that it is always delayed work (delay always 1+) that is scheduled, so
that it always sets up a timer and never does a direct wake.

> > Popular?  Okay.  Keep me posted, please.
>
> Will do.  Just out of curiosity, what are your concerns?

We don't have many NMI code paths, and the possibilities are quite
dense.  I would imagine having this possibility would lead to things
that wouldn't be needed otherwise.  I mean, we don't even allow
allocating memory from hardirq (except for the _nolock() variant, which
I think is used by BPF for $reasons), but here we would need to call
rcu_free from NMI.  This would require removing an item from some kind
of data structure without regular locking, unless RCU is used to get a
delayed invocation of some sort.
I mean, I am curious here who needs it and why ;)

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 16:47 ` Sebastian Andrzej Siewior
@ 2026-03-18 18:48   ` Paul E. McKenney
  2026-03-19  8:55     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-18 18:48 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 05:47:10PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 09:32:07 [-0700], Paul E. McKenney wrote:
> > > What remains?
> >
> > With that patch, we no longer have call_srcu() directly acquiring a
> > non-raw spinlock, but as you say, we still have the wakeup problem.
>
> Isn't this just srcu_funnel_gp_start()?  And there you could ensure
> that it is always delayed work (delay always 1+) that is scheduled, so
> that it always sets up a timer and never does a direct wake.
>
> > > Popular?  Okay.  Keep me posted, please.
> >
> > Will do.  Just out of curiosity, what are your concerns?
>
> We don't have many NMI code paths, and the possibilities are quite
> dense.  I would imagine having this possibility would lead to things
> that wouldn't be needed otherwise.  I mean, we don't even allow
> allocating memory from hardirq (except for the _nolock() variant, which
> I think is used by BPF for $reasons), but here we would need to call
> rcu_free from NMI.  This would require removing an item from some kind
> of data structure without regular locking, unless RCU is used to get a
> delayed invocation of some sort.

We would need a lockless enqueue, which we have in llist.h.  The
irq-work to actually do the call_rcu() or similar.  There would also
need to be rcu_barrier() changes, for example, to drain all the llists.

> I mean, I am curious here who needs it and why ;)

You got it right above, BPF.  ;-)

They currently check in_nmi() and do the irq-work step themselves.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 18:48 ` Paul E. McKenney
@ 2026-03-19  8:55   ` Sebastian Andrzej Siewior
  2026-03-19 10:05     ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19  8:55 UTC (permalink / raw)
To: Paul E. McKenney
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On 2026-03-18 11:48:56 [-0700], Paul E. McKenney wrote:
> We would need a lockless enqueue, which we have in llist.h.

Only if CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.  Otherwise you must not
provide call_rcu_nmi().

> The irq-work to actually do the call_rcu() or similar.

reasonable.

> There would also need to be rcu_barrier() changes, for example, to
> drain all the llists.
>
> > I mean, I am curious here who needs it and why ;)
>
> You got it right above, BPF.  ;-)
>
> They currently check in_nmi() and do the irq-work step themselves.

Then it would make sense, since you do actually have users.

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19  8:55 ` Sebastian Andrzej Siewior
@ 2026-03-19 10:05   ` Paul E. McKenney
  2026-03-19 10:43     ` Paul E. McKenney
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-19 10:05 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On Thu, Mar 19, 2026 at 09:55:48AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 11:48:56 [-0700], Paul E. McKenney wrote:
> > We would need a lockless enqueue, which we have in llist.h.
>
> Only if CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.  Otherwise you must not
> provide call_rcu_nmi().

I *think* that BPF is provided only on such architectures, so agreed.

> > The irq-work to actually do the call_rcu() or similar.
>
> reasonable.
>
> > There would also need to be rcu_barrier() changes, for example, to
> > drain all the llists.
> >
> > > I mean, I am curious here who needs it and why ;)
> >
> > You got it right above, BPF.  ;-)
> >
> > They currently check in_nmi() and do the irq-work step themselves.
>
> Then it would make sense, since you do actually have users.

For more fun, they would like to attach BPF programs to call_rcu() and
call_srcu() internals, which would require a per-task recursion check.
Or perhaps some other trick.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 10:05 ` Paul E. McKenney
@ 2026-03-19 10:43   ` Paul E. McKenney
  2026-03-19 10:51     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-19 10:43 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On Thu, Mar 19, 2026 at 03:05:40AM -0700, Paul E. McKenney wrote:
> On Thu, Mar 19, 2026 at 09:55:48AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2026-03-18 11:48:56 [-0700], Paul E. McKenney wrote:
> > > We would need a lockless enqueue, which we have in llist.h.
> >
> > Only if CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.  Otherwise you must not
> > provide call_rcu_nmi().
>
> I *think* that BPF is provided only on such architectures, so agreed.

Plus srcu_read_lock_fast() uses atomic_long_inc(), which is supposed to
be NMI-safe anyway, correct?

							Thanx, Paul

> > > The irq-work to actually do the call_rcu() or similar.
> >
> > reasonable.
> >
> > > There would also need to be rcu_barrier() changes, for example, to
> > > drain all the llists.
> > >
> > > > I mean, I am curious here who needs it and why ;)
> > >
> > > You got it right above, BPF.  ;-)
> > >
> > > They currently check in_nmi() and do the irq-work step themselves.
> >
> > Then it would make sense, since you do actually have users.
>
> For more fun, they would like to attach BPF programs to call_rcu() and
> call_srcu() internals, which would require a per-task recursion check.
> Or perhaps some other trick.
>
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 10:43 ` Paul E. McKenney
@ 2026-03-19 10:51 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19 10:51 UTC (permalink / raw)
To: Paul E. McKenney
Cc: frederic, neeraj.iitr10, urezki, joelagnelf, boqun.feng, rcu,
	Kumar Kartikeya Dwivedi

On 2026-03-19 03:43:17 [-0700], Paul E. McKenney wrote:
> On Thu, Mar 19, 2026 at 03:05:40AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 19, 2026 at 09:55:48AM +0100, Sebastian Andrzej Siewior wrote:
> > > On 2026-03-18 11:48:56 [-0700], Paul E. McKenney wrote:
> > > > We would need a lockless enqueue, which we have in llist.h.
> > >
> > > Only if CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.  Otherwise you must not
> > > provide call_rcu_nmi().
> >
> > I *think* that BPF is provided only on such architectures, so agreed.
>
> Plus srcu_read_lock_fast() uses atomic_long_inc(), which is supposed to
> be NMI-safe anyway, correct?

I am not aware of any NMI restrictions for the atomic ops, just cmpxchg.

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 14:43 ` Sebastian Andrzej Siewior
  2026-03-18 15:43 ` Paul E. McKenney
@ 2026-03-18 15:51 ` Boqun Feng
  2026-03-18 18:42 ` Paul E. McKenney
1 sibling, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-18 15:51 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Paul E. McKenney, frederic, neeraj.iitr10, urezki, joelagnelf,
	boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 03:43:05PM +0100, Sebastian Andrzej Siewior wrote:
[..]
> > > > way that vanilla RCU's call_rcu_core() function takes an early exit if
> > > > interrupts are disabled.  Of course, vanilla RCU can rely on things like
> > > > the scheduling-clock interrupt to start any needed grace periods [1],
> > > > but SRCU will instead need to manually defer this work, perhaps using
> > > > workqueues or IRQ work.
> > > >
> > > > In addition, rcutorture needs to be upgraded to sometimes invoke
> > > > ->call() with the scheduler pi lock held, but this change is not fixing
> > > > a regression, so could be deferred.  (There is already code in rcutorture
> > > > that invokes the readers while holding a scheduler pi lock.)
> > > >
> > > > Given that RCU for this week through the end of March belongs to you guys,
> > > > if one of you can get this done by end of day Thursday, London time,
> > > > very good!  Otherwise, I can put something together.
> > > >
> > > > Please let me know!
> > >
> > > Given that the current locking does allow it and lockdep should have
> > > complained, I am curious if we could rule that out ;)
>
> Your patch just does s/spinlock_t/raw_spinlock_t/ so we get the locking/
> nesting right.  The wakeup problem remains, right?
> But looking at the code, there is just srcu_funnel_gp_start().  If its
> srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed
> then there will always be a timer and never a direct wake up of the
> worker.  Wouldn't that work?

Late to the party, so just to make sure I understand the problem.  The
problem is the wakeup in call_srcu() when it's called with a scheduler
lock held, right?  If so, I think the current code works as you already
explained: we defer the wakeup into a workqueue.

(but Paul, we are not talking about calling call_srcu(), that requires
some more work to get it to work)

> > It would be nice, but your point about needing to worry about spinlocks
> > is compelling.
> >
> > But couldn't lockdep scan the current task's list of held locks and see
> > whether only raw spinlocks are held (including when no spinlocks of any
> > type are held), and complain in that case?  Or would that scanning be
> > too high of overhead?  (But we need that scan anyway to check deadlock,
> > don't we?)
>
> PeterZ didn't like it and the nesting thing identified most of the
> problem cases.  It should also catch _this_ one.
>
> Thinking about it further, you don't need to worry about
> local_bh_disable() but RCU will become another corner case.  You would
> have to exclude "rcu_read_lock(); spin_lock();" on a !preempt kernel
> which would otherwise lead to false positives.
> But as I said, this case as explained is a nesting problem and should be
> reported by lockdep with its current features.

Right, otherwise there is a lockdep bug ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 15:51 ` Boqun Feng
@ 2026-03-18 18:42 ` Paul E. McKenney
  2026-03-18 20:04 ` Joel Fernandes
0 siblings, 1 reply; 100+ messages in thread
From: Paul E. McKenney @ 2026-03-18 18:42 UTC (permalink / raw)
To: Boqun Feng
Cc: Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	joelagnelf, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 08:51:16AM -0700, Boqun Feng wrote:
> On Wed, Mar 18, 2026 at 03:43:05PM +0100, Sebastian Andrzej Siewior wrote:
> [..]
> > Your patch just does s/spinlock_t/raw_spinlock_t/ so we get the locking/
> > nesting right.  The wakeup problem remains, right?
> > But looking at the code, there is just srcu_funnel_gp_start().  If its
> > srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed
> > then there will always be a timer and never a direct wake up of the
> > worker.  Wouldn't that work?
>
> Late to the party, so just to make sure I understand the problem.  The
> problem is the wakeup in call_srcu() when it's called with a scheduler
> lock held, right?  If so, I think the current code works as you already
> explained: we defer the wakeup into a workqueue.

The issue is that call_rcu_tasks() (which is call_srcu() now) is
also invoked with a scheduler pi/rq lock held, which results in a
deadlock cycle.  So the srcu_gp_start_if_needed() function's call to
raw_spin_lock_irqsave_sdp_contention() must be deferred to the workqueue
handler, not just the wake-up.  And that in turn means that the callback
point also needs to be passed to this handler.

See this email thread:

https://lore.kernel.org/all/CAP01T75eKpvw+95NqNWg9P-1+kzVzojpN0NLat+28SF1B9wQQQ@mail.gmail.com/

> (but Paul, we are not talking about calling call_srcu(), that requires
> some more work to get it to work)

Agreed, splitting srcu_gp_start_if_needed() and using a workqueue if
interrupts were already disabled on entry.  Otherwise, directly invoking
the split-out portion of srcu_gp_start_if_needed().

But we might be talking past each other.

							Thanx, Paul

> > > It would be nice, but your point about needing to worry about spinlocks
> > > is compelling.
[..]
> > But as I said, this case as explained is a nesting problem and should be
> > reported by lockdep with its current features.
>
> Right, otherwise there is a lockdep bug ;-)
>
> Regards,
> Boqun

^ permalink raw reply	[flat|nested] 100+ messages in thread
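[Editor's illustration, not part of the thread.] The split that Paul describes -- take an early exit from srcu_gp_start_if_needed() when interrupts were already disabled on entry, and defer the grace-period start to a workqueue -- can be sketched as a tiny userspace simulation. All names here are invented for illustration; the real split also has to hand the callback along and cope with rcu_barrier(), which this ignores.

```c
#include <stdbool.h>

static int gp_started;        /* grace periods actually started */
static int deferred_pending;  /* starts handed off to deferred work */

/* The "expensive" part: in the kernel this may take non-raw locks and
 * wake a kworker, so it must not run under a scheduler rq/pi lock. */
static void start_gp_now(void)
{
	gp_started++;
}

/* Analogue of queue_delayed_work()/irq_work_queue(): record that the
 * grace-period start must happen later, from a safe context.  This
 * path performs no wakeup and takes no sleeping locks. */
static void defer_gp_start(void)
{
	deferred_pending++;
}

/* Sketch of the split srcu_gp_start_if_needed(): take the early exit
 * when interrupts were already disabled on entry, mirroring what
 * vanilla RCU's call_rcu_core() does. */
static void gp_start_if_needed(bool irqs_were_disabled)
{
	if (irqs_were_disabled)
		defer_gp_start();   /* never touches rq->__lock */
	else
		start_gp_now();     /* direct path is still fine */
}

/* Later, from process context (a timer or irq_work handler). */
static void run_deferred(void)
{
	while (deferred_pending) {
		deferred_pending--;
		start_gp_now();
	}
}
```

The design point is the one from the thread: the decision must be made before any wakeup is attempted, because by then the cycle with rq->__lock is already closed.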
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 18:42 ` Paul E. McKenney
@ 2026-03-18 20:04 ` Joel Fernandes
  2026-03-18 20:11 ` Kumar Kartikeya Dwivedi
  2026-03-18 21:52 ` Boqun Feng
0 siblings, 2 replies; 100+ messages in thread
From: Joel Fernandes @ 2026-03-18 20:04 UTC (permalink / raw)
To: paulmck, Boqun Feng
Cc: Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Kumar Kartikeya Dwivedi

On 3/18/2026 2:42 PM, Paul E. McKenney wrote:
> On Wed, Mar 18, 2026 at 08:51:16AM -0700, Boqun Feng wrote:
>> On Wed, Mar 18, 2026 at 03:43:05PM +0100, Sebastian Andrzej Siewior wrote:
>> [..]
>>> Your patch just does s/spinlock_t/raw_spinlock_t/ so we get the locking/
>>> nesting right.  The wakeup problem remains, right?
>>> But looking at the code, there is just srcu_funnel_gp_start().  If its
>>> srcu_schedule_cbs_sdp() / queue_delayed_work() usage is always delayed
>>> then there will always be a timer and never a direct wake up of the
>>> worker.  Wouldn't that work?
>>
>> Late to the party, so just to make sure I understand the problem.  The
>> problem is the wakeup in call_srcu() when it's called with a scheduler
>> lock held, right?  If so, I think the current code works as you already
>> explained: we defer the wakeup into a workqueue.
>
> The issue is that call_rcu_tasks() (which is call_srcu() now) is
> also invoked with a scheduler pi/rq lock held, which results in a
> deadlock cycle.  So the srcu_gp_start_if_needed() function's call to
> raw_spin_lock_irqsave_sdp_contention() must be deferred to the workqueue
> handler, not just the wake-up.  And that in turn means that the callback
> point also needs to be passed to this handler.
>
> See this email thread:
>
> https://lore.kernel.org/all/CAP01T75eKpvw+95NqNWg9P-1+kzVzojpN0NLat+28SF1B9wQQQ@mail.gmail.com/
>
>> (but Paul, we are not talking about calling call_srcu(), that requires
>> some more work to get it to work)
>
> Agreed, splitting srcu_gp_start_if_needed() and using a workqueue if
> interrupts were already disabled on entry.  Otherwise, directly invoking
> the split-out portion of srcu_gp_start_if_needed().
>
> But we might be talking past each other.

Ah so it is an ABBA deadlock, not an ABA self-deadlock.  I guess this is a
different issue from the NMI issue?  It is more of an issue of calling the
call_srcu() API with scheduler locks held.

Something like below I think:

CPU A (BPF tracepoint)                    CPU B (concurrent call_srcu)
----------------------------              ------------------------------------
[1] holds &rq->__lock                     [2]
                                          -> call_srcu
                                          -> srcu_gp_start_if_needed
                                          -> srcu_funnel_gp_start
                                          -> spin_lock_irqsave_ssp_content...
                                          -> holds srcu locks

[4] calls call_rcu_tasks_trace()          [5] srcu_funnel_gp_start (cont..)
                                              -> queue_delayed_work
  -> call_srcu()                              -> __queue_work()
  -> srcu_gp_start_if_needed()                -> wake_up_worker()
  -> srcu_funnel_gp_start()                   -> try_to_wake_up()
  -> spin_lock_irqsave_ssp_contention()       [6] WANTS rq->__lock
  -> WANTS srcu locks

If I understand this, this looks like an issue that can happen independently
of the conversion of the spin locks.

thanks,

-- 
Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 20:04 ` Joel Fernandes
@ 2026-03-18 20:11 ` Kumar Kartikeya Dwivedi
  2026-03-18 20:25 ` Joel Fernandes
0 siblings, 1 reply; 100+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-03-18 20:11 UTC (permalink / raw)
To: Joel Fernandes
Cc: paulmck, Boqun Feng, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu

On Wed, 18 Mar 2026 at 21:04, Joel Fernandes <joelagnelf@nvidia.com> wrote:
>
[..]
> Ah so it is an ABBA deadlock, not an ABA self-deadlock.  I guess this is a
> different issue from the NMI issue?  It is more of an issue of calling the
> call_srcu() API with scheduler locks held.
>
> Something like below I think:
>
> CPU A (BPF tracepoint)                    CPU B (concurrent call_srcu)
> ----------------------------              ------------------------------------
> [1] holds &rq->__lock                     [2]
>                                           -> call_srcu
>                                           -> srcu_gp_start_if_needed
>                                           -> srcu_funnel_gp_start
>                                           -> spin_lock_irqsave_ssp_content...
>                                           -> holds srcu locks
>
> [4] calls call_rcu_tasks_trace()          [5] srcu_funnel_gp_start (cont..)
>                                               -> queue_delayed_work
>   -> call_srcu()                              -> __queue_work()
>   -> srcu_gp_start_if_needed()                -> wake_up_worker()
>   -> srcu_funnel_gp_start()                   -> try_to_wake_up()
>   -> spin_lock_irqsave_ssp_contention()       [6] WANTS rq->__lock
>   -> WANTS srcu locks
>
> If I understand this, this looks like an issue that can happen independently
> of the conversion of the spin locks.

Yes, this is a separate issue, we should make the conversion to raw
spin locks anyway, but lockdep found this once we applied that fix
from Paul.
In sched-ext, we can end up calling call_srcu() while rq->lock is
held, e.g. from exit_task() -> some bpf map that deletes an element ->
call_srcu().
There are other callbacks of course where it can be held, and other
programs that can run tracing the kernel while it is held.

> thanks,
>
> --
> Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 20:11 ` Kumar Kartikeya Dwivedi
@ 2026-03-18 20:25 ` Joel Fernandes
0 siblings, 0 replies; 100+ messages in thread
From: Joel Fernandes @ 2026-03-18 20:25 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: paulmck, Boqun Feng, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu

On 3/18/2026 4:11 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, 18 Mar 2026 at 21:04, Joel Fernandes <joelagnelf@nvidia.com> wrote:
>>
[..]
> Yes, this is a separate issue, we should make the conversion to raw
> spin locks anyway, but lockdep found this once we applied that fix
> from Paul.
> In sched-ext, we can end up calling call_srcu() while rq->lock is
> held, e.g. from exit_task() -> some bpf map that deletes an element ->
> call_srcu().
> There are other callbacks of course where it can be held, and other
> programs that can run tracing the kernel while it is held.

Thanks.  I guess I am also wondering, why didn't lockdep find it without
the conversion to raw spin locks though?  An ABBA deadlock should have
been detected either way.  Is there some difference in lockdep's ability
to find deadlocks depending on whether a spinlock is raw?

Anyway, I am applying the raw lock conversion fix and running some more
tests.

thanks,

-- 
Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 20:04 ` Joel Fernandes
  2026-03-18 20:11 ` Kumar Kartikeya Dwivedi
@ 2026-03-18 21:52 ` Boqun Feng
  2026-03-18 21:55 ` Boqun Feng
1 sibling, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-18 21:52 UTC (permalink / raw)
To: Joel Fernandes
Cc: paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10,
	urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 04:04:05PM -0400, Joel Fernandes wrote:
[..]
> Ah so it is an ABBA deadlock, not an ABA self-deadlock.  I guess this is a
> different issue from the NMI issue?  It is more of an issue of calling the
> call_srcu() API with scheduler locks held.
>
> Something like below I think:
>
> CPU A (BPF tracepoint)                    CPU B (concurrent call_srcu)
> ----------------------------              ------------------------------------
> [1] holds &rq->__lock                     [2]
>                                           -> call_srcu
>                                           -> srcu_gp_start_if_needed
>                                           -> srcu_funnel_gp_start
>                                           -> spin_lock_irqsave_ssp_content...
>                                           -> holds srcu locks
>
> [4] calls call_rcu_tasks_trace()          [5] srcu_funnel_gp_start (cont..)
>                                               -> queue_delayed_work
>   -> call_srcu()                              -> __queue_work()
>   -> srcu_gp_start_if_needed()                -> wake_up_worker()
>   -> srcu_funnel_gp_start()                   -> try_to_wake_up()
>   -> spin_lock_irqsave_ssp_contention()       [6] WANTS rq->__lock
>   -> WANTS srcu locks

I see, we can also have a self-deadlock even without CPU B, when CPU A
is going to try_to_wake_up() a worker on the same CPU.

An interesting observation is that the deadlock can be avoided if
queue_delayed_work() uses a non-zero delay, because that means a timer
will be armed instead of acquiring the rq lock.

(But I guess BPF also wants to run with the timer base lock held, right?
;-) ;-) ;-))

/me going to check Paul's second fix at rcu/dev.

Regards,
Boqun

> If I understand this, this looks like an issue that can happen independently
> of the conversion of the spin locks.
>
> thanks,
>
> --
> Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 21:52 ` Boqun Feng
@ 2026-03-18 21:55 ` Boqun Feng
  2026-03-18 22:15 ` Boqun Feng
0 siblings, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-18 21:55 UTC (permalink / raw)
To: Joel Fernandes
Cc: paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10,
	urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote:
[...]
> I see, we can also have a self-deadlock even without CPU B, when CPU A
> is going to try_to_wake_up() a worker on the same CPU.
>
> An interesting observation is that the deadlock can be avoided if
> queue_delayed_work() uses a non-zero delay, because that means a timer
> will be armed instead of acquiring the rq lock.
>
> (But I guess BPF also wants to run with the timer base lock held, right?
> ;-) ;-) ;-))
>
> /me going to check Paul's second fix at rcu/dev.

Oh I mis-read, there is no second fix, just rcutorture changes.  Let me
see if I can find a quick fix ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 21:55             ` Boqun Feng
@ 2026-03-18 22:15               ` Boqun Feng
  2026-03-18 22:52                 ` Joel Fernandes
  2026-03-18 23:56                 ` Kumar Kartikeya Dwivedi
  0 siblings, 2 replies; 100+ messages in thread
From: Boqun Feng @ 2026-03-18 22:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10,
	urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi

On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote:
> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote:
> [...]
> > > Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a
> > > different issue, from the NMI issue? It is more of an issue of calling
> > > the call_srcu API with scheduler locks held.
> > >
> > > Something like below I think:
> > >
> > > CPU A (BPF tracepoint)                 CPU B (concurrent call_srcu)
> > > ----------------------------           ------------------------------------
> > > [1] holds &rq->__lock
> > >                                        [2]
> > >                                        -> call_srcu
> > >                                        -> srcu_gp_start_if_needed
> > >                                        -> srcu_funnel_gp_start
> > >                                        -> spin_lock_irqsave_ssp_content...
> > >                                        -> holds srcu locks
> > >
> > > [4] calls call_rcu_tasks_trace()       [5] srcu_funnel_gp_start (cont..)
> > >                                        -> queue_delayed_work
> > > -> call_srcu()                         -> __queue_work()
> > > -> srcu_gp_start_if_needed()           -> wake_up_worker()
> > > -> srcu_funnel_gp_start()              -> try_to_wake_up()
> > > -> spin_lock_irqsave_ssp_contention()  [6] WANTS rq->__lock
> > > -> WANTS srcu locks
> >
> > I see, we can also have a self-deadlock even without CPU B, when CPU A
> > is going to try_to_wake_up() a worker on the same CPU.
> >
> > An interesting observation is that the deadlock can be avoided if
> > queue_delayed_work() uses a non-zero delay, since that means a timer will
> > be armed instead of acquiring the rq lock.
> >

If my observation is correct, then this can probably fix the deadlock
issue with the runqueue lock (untested though), but it won't work if a BPF
tracepoint can happen with the timer base lock held.

Regards,
Boqun

------>
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775..a5d67264acb5 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 	struct srcu_node *snp_leaf;
 	unsigned long snp_seq;
 	struct srcu_usage *sup = ssp->srcu_sup;
+	bool irqs_were_disabled;
 
 	/* Ensure that snp node tree is fully initialized before traversing it */
 	if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
@@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 
 	/* Top of tree, must ensure the grace period will be started. */
 	raw_spin_lock_irqsave_ssp_contention(ssp, &flags);
+	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) {
 		/*
 		 * Record need for grace period s.  Pair with load
@@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 	// it isn't.  And it does not have to be.  After all, it
 	// can only be executed during early boot when there is only
 	// the one boot CPU running with interrupts still disabled.
+	//
+	// If irqs were disabled when call_srcu() was called, then we
+	// could be in the scheduler path with a runqueue lock held;
+	// delay the process_srcu() work one more jiffy so we don't go
+	// through the kick_pool() -> wake_up_process() path below, and
+	// we avoid deadlock with the runqueue lock.
 	if (likely(srcu_init_done))
 		queue_delayed_work(rcu_gp_wq, &sup->work,
-				   !!srcu_get_delay(ssp));
+				   !!srcu_get_delay(ssp) +
+				   !!irqs_were_disabled);
 	else if (list_empty(&sup->work.work.entry))
 		list_add(&sup->work.work.entry, &srcu_boot_list);
 }

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-18 22:15 ` Boqun Feng @ 2026-03-18 22:52 ` Joel Fernandes 2026-03-18 23:27 ` Boqun Feng 2026-03-18 23:56 ` Kumar Kartikeya Dwivedi 1 sibling, 1 reply; 100+ messages in thread From: Joel Fernandes @ 2026-03-18 22:52 UTC (permalink / raw) To: Boqun Feng Cc: paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo On 3/18/2026 6:15 PM, Boqun Feng wrote: > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: >> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: >> [...] >>>> Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a >>>> different issue, from the NMI issue? It is more of an issue of calling >>>> call_srcu API with scheduler locks held. >>>> >>>> Something like below I think: >>>> >>>> CPU A (BPF tracepoint) CPU B (concurrent call_srcu) >>>> ---------------------------- ------------------------------------ >>>> [1] holds &rq->__lock >>>> [2] >>>> -> call_srcu >>>> -> srcu_gp_start_if_needed >>>> -> srcu_funnel_gp_start >>>> -> spin_lock_irqsave_ssp_content... >>>> -> holds srcu locks >>>> >>>> [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) >>>> -> queue_delayed_work >>>> -> call_srcu() -> __queue_work() >>>> -> srcu_gp_start_if_needed() -> wake_up_worker() >>>> -> srcu_funnel_gp_start() -> try_to_wake_up() >>>> -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock >>>> -> WANTS srcu locks >>> >>> I see, we can also have a self deadlock even without CPU B, when CPU A >>> is going to try_to_wake_up() the a worker on the same CPU. >>> >>> An interesting observation is that the deadlock can be avoided in >>> queue_delayed_work() uses a non-zero delay, that means a timer will be >>> armed instead of acquiring the rq lock. 
>>> > > If my observation is correct, then this can probably fix the deadlock > issue with runqueue lock (untested though), but it won't work if BPF > tracepoint can happen with timer base lock held. > > Regards, > Boqun > > ------> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 2328827f8775..a5d67264acb5 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > struct srcu_node *snp_leaf; > unsigned long snp_seq; > struct srcu_usage *sup = ssp->srcu_sup; > + bool irqs_were_disabled; > > /* Ensure that snp node tree is fully initialized before traversing it */ > if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > /* Top of tree, must ensure the grace period will be started. */ > raw_spin_lock_irqsave_ssp_contention(ssp, &flags); > + irqs_were_disabled = irqs_disabled_flags(flags); > if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { > /* > * Record need for grace period s. Pair with load > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > // it isn't. And it does not have to be. After all, it > // can only be executed during early boot when there is only > // the one boot CPU running with interrupts still disabled. > + // > + // If irq was disabled when call_srcu() is called, then we > + // could be in the scheduler path with a runqueue lock held, > + // delay the process_srcu() work 1 more jiffies so we don't go > + // through the kick_pool() -> wake_up_process() path below, and > + // we could avoid deadlock with runqueue lock. 
> 	if (likely(srcu_init_done))
> 		queue_delayed_work(rcu_gp_wq, &sup->work,
> -				   !!srcu_get_delay(ssp));
> +				   !!srcu_get_delay(ssp) +
> +				   !!irqs_were_disabled);

Nice, I wonder if it is better to do this in __queue_delayed_work() itself.
Do we have queue_delayed_work() calls with zero delay in irq-disabled
regions that depend on that zero delay for correctness? Even with a delay
of 0, the work item doesn't execute right away anyway; the worker thread
still has to be scheduled, right?

Also, if IRQs are disabled, I'd think this is a critical path that does not
want to run the work item right away anyway, since a workqueue is more of a
bottom-half mechanism than a "run this immediately" one.

IOW, it would be good to make the workqueue layer resilient enough to avoid
waking up the scheduler when a delay would have been totally OK. But maybe
+Tejun can yell if that sounds insane.

thanks,

--
Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-18 22:52 ` Joel Fernandes @ 2026-03-18 23:27 ` Boqun Feng 2026-03-19 1:08 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-18 23:27 UTC (permalink / raw) To: Joel Fernandes Cc: paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo On Wed, Mar 18, 2026 at 06:52:53PM -0400, Joel Fernandes wrote: > > > On 3/18/2026 6:15 PM, Boqun Feng wrote: > > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: > >> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: > >> [...] > >>>> Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a > >>>> different issue, from the NMI issue? It is more of an issue of calling > >>>> call_srcu API with scheduler locks held. > >>>> > >>>> Something like below I think: > >>>> > >>>> CPU A (BPF tracepoint) CPU B (concurrent call_srcu) > >>>> ---------------------------- ------------------------------------ > >>>> [1] holds &rq->__lock > >>>> [2] > >>>> -> call_srcu > >>>> -> srcu_gp_start_if_needed > >>>> -> srcu_funnel_gp_start > >>>> -> spin_lock_irqsave_ssp_content... > >>>> -> holds srcu locks > >>>> > >>>> [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) > >>>> -> queue_delayed_work > >>>> -> call_srcu() -> __queue_work() > >>>> -> srcu_gp_start_if_needed() -> wake_up_worker() > >>>> -> srcu_funnel_gp_start() -> try_to_wake_up() > >>>> -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock > >>>> -> WANTS srcu locks > >>> > >>> I see, we can also have a self deadlock even without CPU B, when CPU A > >>> is going to try_to_wake_up() the a worker on the same CPU. > >>> > >>> An interesting observation is that the deadlock can be avoided in > >>> queue_delayed_work() uses a non-zero delay, that means a timer will be > >>> armed instead of acquiring the rq lock. 
> >>> > > > > If my observation is correct, then this can probably fix the deadlock > > issue with runqueue lock (untested though), but it won't work if BPF > > tracepoint can happen with timer base lock held. > > > > Regards, > > Boqun > > > > ------> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 2328827f8775..a5d67264acb5 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > struct srcu_node *snp_leaf; > > unsigned long snp_seq; > > struct srcu_usage *sup = ssp->srcu_sup; > > + bool irqs_were_disabled; > > > > /* Ensure that snp node tree is fully initialized before traversing it */ > > if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > /* Top of tree, must ensure the grace period will be started. */ > > raw_spin_lock_irqsave_ssp_contention(ssp, &flags); > > + irqs_were_disabled = irqs_disabled_flags(flags); > > if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { > > /* > > * Record need for grace period s. Pair with load > > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > // it isn't. And it does not have to be. After all, it > > // can only be executed during early boot when there is only > > // the one boot CPU running with interrupts still disabled. > > + // > > + // If irq was disabled when call_srcu() is called, then we > > + // could be in the scheduler path with a runqueue lock held, > > + // delay the process_srcu() work 1 more jiffies so we don't go > > + // through the kick_pool() -> wake_up_process() path below, and > > + // we could avoid deadlock with runqueue lock. 
> > 	if (likely(srcu_init_done))
> > 		queue_delayed_work(rcu_gp_wq, &sup->work,
> > -				   !!srcu_get_delay(ssp));
> > +				   !!srcu_get_delay(ssp) +
> > +				   !!irqs_were_disabled);
> Nice, I wonder if it is better to do this in __queue_delayed_work() itself.
> Do we have queue_delayed_work() calls with zero delay in irq-disabled
> regions that depend on that zero delay for correctness? Even with a delay
> of 0, the work item doesn't execute right away anyway; the worker thread
> still has to be scheduled, right?
>
> Also, if IRQs are disabled, I'd think this is a critical path that does not
> want to run the work item right away anyway, since a workqueue is more of a
> bottom-half mechanism than a "run this immediately" one.
>
> IOW, it would be good to make the workqueue layer resilient enough to avoid
> waking up the scheduler when a delay would have been totally OK. But maybe
> +Tejun can yell if that sounds insane.
>

I think all of these are probably good points. However, my fix is not
complete :( It's missing the ABBA case in your example (it obviously
could solve the self-deadlock if my observation is correct), because we
will still build rcu_node::lock -> runqueue::lock in some conditions,
and BPF contributes the runqueue::lock -> rcu_node::lock dependency.
Hence we still have an ABBA deadlock.

To remove the rcu_node::lock -> runqueue::lock dependency entirely, we
need to always delay 1+ jiffies:

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775..86733f7bf637 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1118,9 +1118,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 	// it isn't.  And it does not have to be.  After all, it
 	// can only be executed during early boot when there is only
 	// the one boot CPU running with interrupts still disabled.
+	//
+	// Delay the process_srcu() work one more jiffy so we don't go
+	// through the kick_pool() -> wake_up_process() path below, and
+	// we avoid deadlock with the runqueue lock.
 	if (likely(srcu_init_done))
 		queue_delayed_work(rcu_gp_wq, &sup->work,
-				   !!srcu_get_delay(ssp));
+				   !!srcu_get_delay(ssp) + 1);
 	else if (list_empty(&sup->work.work.entry))
 		list_add(&sup->work.work.entry, &srcu_boot_list);
 }

Paul's suggestion at [1] is basically breaking the other dependency,
runqueue::lock -> rcu_node::lock; I'm investigating how we can do that.

[1]: https://lore.kernel.org/rcu/214fb140-041d-4fd1-8694-658547209b84@paulmck-laptop/

Regards,
Boqun

> thanks,
>
> --
> Joel Fernandes
>

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-18 23:27 ` Boqun Feng @ 2026-03-19 1:08 ` Boqun Feng 2026-03-19 9:03 ` Sebastian Andrzej Siewior 2026-03-19 10:02 ` Paul E. McKenney 0 siblings, 2 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-19 1:08 UTC (permalink / raw) To: Joel Fernandes Cc: paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo On Wed, Mar 18, 2026 at 04:27:23PM -0700, Boqun Feng wrote: > On Wed, Mar 18, 2026 at 06:52:53PM -0400, Joel Fernandes wrote: > > > > > > On 3/18/2026 6:15 PM, Boqun Feng wrote: > > > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: > > >> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: > > >> [...] > > >>>> Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a > > >>>> different issue, from the NMI issue? It is more of an issue of calling > > >>>> call_srcu API with scheduler locks held. > > >>>> > > >>>> Something like below I think: > > >>>> > > >>>> CPU A (BPF tracepoint) CPU B (concurrent call_srcu) > > >>>> ---------------------------- ------------------------------------ > > >>>> [1] holds &rq->__lock > > >>>> [2] > > >>>> -> call_srcu > > >>>> -> srcu_gp_start_if_needed > > >>>> -> srcu_funnel_gp_start > > >>>> -> spin_lock_irqsave_ssp_content... > > >>>> -> holds srcu locks > > >>>> > > >>>> [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) > > >>>> -> queue_delayed_work > > >>>> -> call_srcu() -> __queue_work() > > >>>> -> srcu_gp_start_if_needed() -> wake_up_worker() > > >>>> -> srcu_funnel_gp_start() -> try_to_wake_up() > > >>>> -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock > > >>>> -> WANTS srcu locks > > >>> > > >>> I see, we can also have a self deadlock even without CPU B, when CPU A > > >>> is going to try_to_wake_up() the a worker on the same CPU. 
> > >>> > > >>> An interesting observation is that the deadlock can be avoided in > > >>> queue_delayed_work() uses a non-zero delay, that means a timer will be > > >>> armed instead of acquiring the rq lock. > > >>> > > > > > > If my observation is correct, then this can probably fix the deadlock > > > issue with runqueue lock (untested though), but it won't work if BPF > > > tracepoint can happen with timer base lock held. > > > > > > Regards, > > > Boqun > > > > > > ------> > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 2328827f8775..a5d67264acb5 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > struct srcu_node *snp_leaf; > > > unsigned long snp_seq; > > > struct srcu_usage *sup = ssp->srcu_sup; > > > + bool irqs_were_disabled; > > > > > > /* Ensure that snp node tree is fully initialized before traversing it */ > > > if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > > /* Top of tree, must ensure the grace period will be started. */ > > > raw_spin_lock_irqsave_ssp_contention(ssp, &flags); > > > + irqs_were_disabled = irqs_disabled_flags(flags); > > > if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { > > > /* > > > * Record need for grace period s. Pair with load > > > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > // it isn't. And it does not have to be. After all, it > > > // can only be executed during early boot when there is only > > > // the one boot CPU running with interrupts still disabled. 
> > > + // > > > + // If irq was disabled when call_srcu() is called, then we > > > + // could be in the scheduler path with a runqueue lock held, > > > + // delay the process_srcu() work 1 more jiffies so we don't go > > > + // through the kick_pool() -> wake_up_process() path below, and > > > + // we could avoid deadlock with runqueue lock. > > > if (likely(srcu_init_done)) > > > queue_delayed_work(rcu_gp_wq, &sup->work, > > > - !!srcu_get_delay(ssp)); > > > + !!srcu_get_delay(ssp) + > > > + !!irqs_were_disabled); > > Nice, I wonder if it is better to do this in __queue_delayed_work() itself. > > Do we have queue_delayed_work() with zero delays that are in irq-disabled > > regions, and they depend on that zero-delay for correctness? Even with > > delay of 0 though, the work item doesn't execute right away anyway, the > > worker thread has to also be scheduler right? > > > > Also if IRQ is disabled, I'd think this is a critical path that is not > > wanting to run the work item right-away anyway since workqueue is more a > > bottom-half mechanism, than "run this immediately". > > > > IOW, would be good to make the workqueue-layer more resilient to waking up > > the scheduler when a delay would have been totally ok. But maybe +Tejun can > > yell if that sounds insane. > > > > I think all of these are probably a good point. However my fix is not > complete :( It's missing the ABBA case in your example (it obviously > could solve the self deadlock if my observation is correct), because we > will still build rcu_node::lock -> runqueue::lock in some conditions, > and BPF contributes the runqueue::lock -> rcu_node::lock dependency. > Hence we still have ABBA deadlock. > > To remove the rcu_node::lock -> runqueue::lock entirely, we need to > always delay 1+ jiffies: > Hmm.. or I can do as the old call_rcu_tasks_trace() does: using an irq_work. 
I also pushed it at:

  https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix

(based on Paul's spinlock fix already, but only lightly build-tested).

Regards,
Boqun

-------------------------->8

Subject: [PATCH] rcu: Use an intermediate irq_work to start process_srcu()

Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However, as BPF
instrumentation can happen basically everywhere (including where a
scheduler lock is held), call_srcu() now needs to avoid acquiring
scheduler locks, because otherwise it could cause deadlock [1].

Fix this by following what the previous RCU Tasks Trace implementation
did: use an irq_work to delay the queuing of the work that starts
process_srcu().

Fixes: c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Signed-off-by: Boqun Feng <boqun@kernel.org>
---
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c    | 22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index b122c560a59c..fd1a9270cb9a 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -95,6 +95,7 @@ struct srcu_usage {
 	unsigned long reschedule_jiffies;
 	unsigned long reschedule_count;
 	struct delayed_work work;
+	struct irq_work irq_work;
 	struct srcu_struct *srcu_ssp;
 };
 
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775..57116635e72d 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/preempt.h>
+#include <linux/irq_work.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/sched.h>
 #include <linux/smp.h>
@@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done;
 static void srcu_invoke_callbacks(struct work_struct *work);
 static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay);
 static void process_srcu(struct work_struct *work);
+static void srcu_irq_work(struct irq_work *work);
 static void srcu_delay_timer(struct timer_list *t);
 
 /*
@@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
 	mutex_init(&ssp->srcu_sup->srcu_barrier_mutex);
 	atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0);
 	INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu);
+	init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work);
 	ssp->srcu_sup->sda_is_static = is_static;
 	if (!is_static) {
 		ssp->sda = alloc_percpu(struct srcu_data);
@@ -1118,9 +1121,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 	// it isn't.  And it does not have to be.  After all, it
 	// can only be executed during early boot when there is only
 	// the one boot CPU running with interrupts still disabled.
+	//
+	// Use an irq_work here to avoid acquiring the runqueue lock with
+	// the srcu rcu_node::lock held. BPF instrumentation could
+	// introduce the opposite dependency, hence we need to break the
+	// possible locking dependency here.
 	if (likely(srcu_init_done))
-		queue_delayed_work(rcu_gp_wq, &sup->work,
-				   !!srcu_get_delay(ssp));
+		irq_work_queue(&sup->irq_work);
 	else if (list_empty(&sup->work.work.entry))
 		list_add(&sup->work.work.entry, &srcu_boot_list);
 }
@@ -1979,6 +1986,17 @@ static void process_srcu(struct work_struct *work)
 	srcu_reschedule(ssp, curdelay);
 }
 
+static void srcu_irq_work(struct irq_work *work)
+{
+	struct srcu_struct *ssp;
+	struct srcu_usage *sup;
+
+	sup = container_of(work, struct srcu_usage, irq_work);
+	ssp = sup->srcu_ssp;
+
+	queue_delayed_work(rcu_gp_wq, &sup->work, !!srcu_get_delay(ssp));
+}
+
 void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags,
 			     unsigned long *gp_seq)
 {
--
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19  1:08                   ` Boqun Feng
@ 2026-03-19  9:03                     ` Sebastian Andrzej Siewior
  2026-03-19 16:27                       ` Boqun Feng
  0 siblings, 1 reply; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19 9:03 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo

On 2026-03-18 18:08:21 [-0700], Boqun Feng wrote:
> @@ -1979,6 +1986,17 @@ static void process_srcu(struct work_struct *work)
> 	srcu_reschedule(ssp, curdelay);
> }
> 
> +static void srcu_irq_work(struct irq_work *work)
> +{
> +	struct srcu_struct *ssp;
> +	struct srcu_usage *sup;
> +
> +	sup = container_of(work, struct srcu_usage, irq_work);
> +	ssp = sup->srcu_ssp;
> +
> +	queue_delayed_work(rcu_gp_wq, &sup->work, !!srcu_get_delay(ssp));
> +}
> +

Please just use queue_delayed_work() with a delay > 0.

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19  9:03                     ` Sebastian Andrzej Siewior
@ 2026-03-19 16:27                       ` Boqun Feng
  2026-03-19 16:33                         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-19 16:27 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo

On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-18 18:08:21 [-0700], Boqun Feng wrote:
> > @@ -1979,6 +1986,17 @@ static void process_srcu(struct work_struct *work)
> > 	srcu_reschedule(ssp, curdelay);
> > }
> > 
> > +static void srcu_irq_work(struct irq_work *work)
> > +{
> > +	struct srcu_struct *ssp;
> > +	struct srcu_usage *sup;
> > +
> > +	sup = container_of(work, struct srcu_usage, irq_work);
> > +	ssp = sup->srcu_ssp;
> > +
> > +	queue_delayed_work(rcu_gp_wq, &sup->work, !!srcu_get_delay(ssp));
> > +}
> > +
> 
> Please just use queue_delayed_work() with a delay > 0.
> 

That doesn't work, since queue_delayed_work() with a positive delay will
still acquire the timer base lock, and we can have BPF instrumentation
run with the timer base lock held, i.e., calling call_srcu() under the
timer base lock.

irq_work, on the other hand, doesn't use any locking.

Regards,
Boqun

> Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 16:27                       ` Boqun Feng
@ 2026-03-19 16:33                         ` Sebastian Andrzej Siewior
  2026-03-19 16:48                           ` Boqun Feng
  0 siblings, 1 reply; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19 16:33 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo

On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote:
> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote:
> > Please just use queue_delayed_work() with a delay > 0.
> > 
> 
> That doesn't work, since queue_delayed_work() with a positive delay will
> still acquire the timer base lock, and we can have BPF instrumentation
> run with the timer base lock held, i.e., calling call_srcu() under the
> timer base lock.
> 
> irq_work, on the other hand, doesn't use any locking.

Could we please restrict BPF somehow so it does not roam free? It is
absolutely awful to have an irq_work in call_srcu() just because BPF
might acquire locks.

> Regards,
> Boqun

Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 16:33                         ` Sebastian Andrzej Siewior
@ 2026-03-19 16:48                           ` Boqun Feng
  2026-03-19 16:59                             ` Kumar Kartikeya Dwivedi
  2026-03-19 17:02                             ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 100+ messages in thread
From: Boqun Feng @ 2026-03-19 16:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend

On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote:
> > On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote:
> > > Please just use queue_delayed_work() with a delay > 0.
> > > 
> > 
> > That doesn't work, since queue_delayed_work() with a positive delay will
> > still acquire the timer base lock, and we can have BPF instrumentation
> > run with the timer base lock held, i.e., calling call_srcu() under the
> > timer base lock.
> > 
> > irq_work, on the other hand, doesn't use any locking.
> 
> Could we please restrict BPF somehow so it does not roam free? It is
> absolutely awful to have an irq_work in call_srcu() just because BPF
> might acquire locks.
> 

I agree it's not RCU's fault ;-)

I guess it'll be difficult to restrict BPF; however, maybe BPF can call
call_srcu() in an irq_work instead? Or via a more systematic defer
mechanism that allows BPF to defer any lock-holding functions to a
different context. (We have a similar issue where BPF cannot call
kfree_rcu() in some cases, IIRC.)

But we need to fix this in v7.0, so this short-term fix is still needed.

Regards,
Boqun

> > Regards,
> > Boqun
> 
> Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 16:48 ` Boqun Feng @ 2026-03-19 16:59 ` Kumar Kartikeya Dwivedi 2026-03-19 17:27 ` Boqun Feng 2026-03-19 17:02 ` Sebastian Andrzej Siewior 1 sibling, 1 reply; 100+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-03-19 16:59 UTC (permalink / raw) To: Boqun Feng Cc: Sebastian Andrzej Siewior, Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > > On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > > > Please just use the queue_delayed_work() with a delay >0. > > > > > > > > > > That doesn't work since queue_delayed_work() with a positive delay will > > > still acquire timer base lock, and we can have BPF instrument with timer > > > base lock held i.e. calling call_srcu() with timer base lock. > > > > > > irq_work on the other hand doesn't use any locking. > > > > Could we please restrict BPF somehow so it does roam free? It is > > absolutely awful to have irq_work() in call_srcu() just because it > > might acquire locks. > > > > I agree it's not RCU's fault ;-) > > I guess it'll be difficult to restrict BPF, however maybe BPF can call > call_srcu() in irq_work instead? Or a more systematic defer mechanism > that allows BPF to defer any lock holding functions to a different > context. (We have a similar issue that BPF cannot call kfree_rcu() in > some cases IIRC). > > But we need to fix this in v7.0, so this short-term fix is still needed. > I don't think this is an option, even longer term. We already do it when it's incorrect to invoke call_rcu() or any other API in a specific context (e.g., NMI, where we punt it using irq_work). 
However, the case reported in this thread is different. It was an
existing user that worked fine before but got broken now. We were using
call_rcu_tasks_trace() just fine in scx callbacks where rq->lock is
held, so the conversion to call_srcu() underneath should remain
transparent in this respect.

> Regards,
> Boqun
>
> > > Regards,
> > > Boqun
> >
> > > Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 16:59 ` Kumar Kartikeya Dwivedi @ 2026-03-19 17:27 ` Boqun Feng 2026-03-19 18:41 ` Kumar Kartikeya Dwivedi 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-19 17:27 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: Sebastian Andrzej Siewior, Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > > > On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > > On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > > > On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > > > > Please just use the queue_delayed_work() with a delay >0. > > > > > > > > > > > > > That doesn't work since queue_delayed_work() with a positive delay will > > > > still acquire timer base lock, and we can have BPF instrument with timer > > > > base lock held i.e. calling call_srcu() with timer base lock. > > > > > > > > irq_work on the other hand doesn't use any locking. > > > > > > Could we please restrict BPF somehow so it does roam free? It is > > > absolutely awful to have irq_work() in call_srcu() just because it > > > might acquire locks. > > > > > > > I agree it's not RCU's fault ;-) > > > > I guess it'll be difficult to restrict BPF, however maybe BPF can call > > call_srcu() in irq_work instead? Or a more systematic defer mechanism > > that allows BPF to defer any lock holding functions to a different > > context. (We have a similar issue that BPF cannot call kfree_rcu() in > > some cases IIRC). > > > > But we need to fix this in v7.0, so this short-term fix is still needed. > > > > I don't think this is an option, even longer term. 
We already do it
> when it's incorrect to invoke call_rcu() or any other API in a
> specific context (e.g., NMI, where we punt it using irq_work).
> However, the case reported in this thread is different. It was an
> existing user which worked fine before but got broken now. We were
> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock
> is held before, so the conversion underneath to call_srcu() should
> continue to remain transparent in this respect.
>

I'm not sure that's a real argument here; the kernel doesn't have a
stable internal API, which allows developers to refactor the code in a
saner way. There are currently multiple issues that suggest we may need
a defer mechanism for the BPF core, and if it makes the code easier to
reason about, then why not? Think of it as a process by which we learn
all the defer patterns that BPF currently needs and wrap them in a nice
and maintainable way.

Regards,
Boqun

> > Regars,
> > Boqun
> > > > > > Regards,
> > > > Boqun
> > > > > > > Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
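The observation above that "irq_work on the other hand doesn't use any locking" rests on the kernel's llist-style lock-free push: enqueueing is a single compare-and-swap loop, so it is safe even with the scheduler rq/pi locks or the timer base lock held, and a later, safe context takes the whole list in one atomic exchange. Below is a minimal userspace C model of that pattern only; `defer_push()` and `defer_drain()` are illustrative names, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Model of a lock-less NULL-terminated singly linked list, the pattern
 * irq_work relies on.  Enqueue takes no locks, only a CAS loop, so it
 * is usable from contexts that already hold arbitrary spinlocks. */

struct defer_node {
	struct defer_node *next;
};

static _Atomic(struct defer_node *) defer_head;	/* starts NULL */

/* Safe from any context: no locks, just one compare-and-swap loop. */
static void defer_push(struct defer_node *node)
{
	struct defer_node *old = atomic_load(&defer_head);

	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&defer_head, &old, node));
}

/* Run later, from a safe context: atomically take the whole list
 * (most recently pushed node first). */
static struct defer_node *defer_drain(void)
{
	return atomic_exchange(&defer_head, NULL);
}
```

The drained list comes back in LIFO order; a real consumer would walk it and invoke the deferred work for each node.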
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 17:27 ` Boqun Feng @ 2026-03-19 18:41 ` Kumar Kartikeya Dwivedi 2026-03-19 20:14 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-03-19 18:41 UTC (permalink / raw) To: Boqun Feng Cc: Sebastian Andrzej Siewior, Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: > > On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > > On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > > > > > On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > > > On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > > > > On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > > > > > Please just use the queue_delayed_work() with a delay >0. > > > > > > > > > > > > > > > > That doesn't work since queue_delayed_work() with a positive delay will > > > > > still acquire timer base lock, and we can have BPF instrument with timer > > > > > base lock held i.e. calling call_srcu() with timer base lock. > > > > > > > > > > irq_work on the other hand doesn't use any locking. > > > > > > > > Could we please restrict BPF somehow so it does roam free? It is > > > > absolutely awful to have irq_work() in call_srcu() just because it > > > > might acquire locks. > > > > > > > > > > I agree it's not RCU's fault ;-) > > > > > > I guess it'll be difficult to restrict BPF, however maybe BPF can call > > > call_srcu() in irq_work instead? Or a more systematic defer mechanism > > > that allows BPF to defer any lock holding functions to a different > > > context. (We have a similar issue that BPF cannot call kfree_rcu() in > > > some cases IIRC). > > > > > > But we need to fix this in v7.0, so this short-term fix is still needed. 
> > > > > > > I don't think this is an option, even longer term. We already do it > > when it's incorrect to invoke call_rcu() or any other API in a > > specific context (e.g., NMI, where we punt it using irq_work). > > However, the case reported in this thread is different. It was an > > existing user which worked fine before but got broken now. We were > > using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock > > is held before, so the conversion underneath to call_srcu() should > > continue to remain transparent in this respect. > > > > I'm not sure that's a real argument here, kernel doesn't have a stable > internal API, which allows developers to refactor the code into a saner > way. There are currently multiple issues that suggest we may need a > defer mechanism for BPF core, and if it makes the code more easier to > reason about then why not? Think about it like a process that we learn > about all the defer patterns that BPF currently needs and wrap them in a > nice and maintainable way. This is all right in theory, but I don't understand how your theoretical deferral mechanism for BPF will help here in the case we're discussing, or is even appealing. How do we decide when to defer? Will we annotate all locks that can be held by RCU internals to be able to check if they are held (on the current cpu, which is non-trivial except by maintaining a held lock table, testing the locked bit is too conservative), and then deferring the call_srcu() from the caller in BPF? What if you gain new locks? It doesn't seem practical to me. Plus it pushes the burden of detection and deferral to the caller, making everything more complicated and error-prone. Also, any unconditional deferral in the caller for APIs that can "hold locks" to avoid all this is not without its cost. The implementation of RCU knows and can stay in sync with those conditions for when deferral is needed, and hide all that complexity from the caller. 
The cost should definitely be paid by the caller if it would break the
API's broad contract, e.g., by trying to invoke it from NMI, where it is
not supposed to run yet; in that case we already handle things using
irq_work. Anything more complicated than that is hard to scale.

All of this may also change in the future if we support call_rcu_nolock()
to make it work everywhere, and only defer when we detect reentrancy (in
the same or a different context).

>
> Regards,
> Boqun
> > > > Regars,
> > > Boqun
> > > > > > > > Regards,
> > > > > Boqun
> > > > > > > > > Sebastian

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 18:41 ` Kumar Kartikeya Dwivedi @ 2026-03-19 20:14 ` Boqun Feng 2026-03-19 20:21 ` Joel Fernandes 2026-03-20 16:15 ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng 0 siblings, 2 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-19 20:14 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: Sebastian Andrzej Siewior, Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote: > On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: > > > > On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > > > On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > > > > > > > On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > > > > On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > > > > > On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > > > > > > Please just use the queue_delayed_work() with a delay >0. > > > > > > > > > > > > > > > > > > > That doesn't work since queue_delayed_work() with a positive delay will > > > > > > still acquire timer base lock, and we can have BPF instrument with timer > > > > > > base lock held i.e. calling call_srcu() with timer base lock. > > > > > > > > > > > > irq_work on the other hand doesn't use any locking. > > > > > > > > > > Could we please restrict BPF somehow so it does roam free? It is > > > > > absolutely awful to have irq_work() in call_srcu() just because it > > > > > might acquire locks. > > > > > > > > > > > > > I agree it's not RCU's fault ;-) > > > > > > > > I guess it'll be difficult to restrict BPF, however maybe BPF can call > > > > call_srcu() in irq_work instead? 
Or a more systematic defer mechanism > > > > that allows BPF to defer any lock holding functions to a different > > > > context. (We have a similar issue that BPF cannot call kfree_rcu() in > > > > some cases IIRC). > > > > > > > > But we need to fix this in v7.0, so this short-term fix is still needed. > > > > > > > > > > I don't think this is an option, even longer term. We already do it > > > when it's incorrect to invoke call_rcu() or any other API in a > > > specific context (e.g., NMI, where we punt it using irq_work). > > > However, the case reported in this thread is different. It was an > > > existing user which worked fine before but got broken now. We were > > > using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock > > > is held before, so the conversion underneath to call_srcu() should > > > continue to remain transparent in this respect. > > > > > > > I'm not sure that's a real argument here, kernel doesn't have a stable > > internal API, which allows developers to refactor the code into a saner > > way. There are currently multiple issues that suggest we may need a > > defer mechanism for BPF core, and if it makes the code more easier to > > reason about then why not? Think about it like a process that we learn > > about all the defer patterns that BPF currently needs and wrap them in a > > nice and maintainable way. > > This is all right in theory, but I don't understand how your > theoretical deferral mechanism for BPF will help here in the case > we're discussing, or is even appealing. > > How do we decide when to defer? Will we annotate all locks that can be > held by RCU internals to be able to check if they are held (on the > current cpu, which is non-trivial except by maintaining a held lock > table, testing the locked bit is too conservative), and then deferring > the call_srcu() from the caller in BPF? What if you gain new locks? It > doesn't seem practical to me. 
Plus it pushes the burden of detection
> and deferral to the caller, making everything more complicated and
> error-prone.
>

My suggestion would be: defer all call_srcu()s that are in the BPF core.
For new locks, I think every lock usage in the BPF core should be
carefully audited, because that's very similar to NMI. It's not hard to
treat BPF as a different context and let lockdep detect lock misuse,
similar to what we do for interrupts and NMIs. Basically, if we want to
use some synchronization in the BPF core:

1. If it's reentrant-safe, then go ahead and use it.

2. If it takes a lock but can be deferred, use the general BPF defer
   mechanism to defer the operation.

3. If it cannot be deferred, it has to change, or a new variant has to
   be added that supports either 1 or 2.

I.e., a universal solution.

> Also, any unconditional deferral in the caller for APIs that can "hold
> locks" to avoid all this is not without its cost.
>
> The implementation of RCU knows and can stay in sync with those
> conditions for when deferral is needed, and hide all that complexity
> from the caller. The cost should definitely be paid by the caller if
> we would break the API's broad contract, e.g., by trying to invoke it

The thing is, lots of the synchronization primitives existed before BPF;
they were not designed or implemented with "BPF-safe" in mind, and they
can be dragged into the BPF core code path if we begin to use them. For
example, irq_work may just happen to work here, or there may be a bug
that we are missing. It would be rather easier, and clearer, if we
designed a dedicated defer mechanism with the BPF core in mind, and then
used that for all the deferrable operations.

Regards,
Boqun

> in NMI which it is not supposed to run in yet, in that case we already
> handle things using irq_work. Anything more complicated than that is
> hard to scale.
All of this may also change in the future where we > support call_rcu_nolock() to make it work everywhere, and only defer > when we detect reentrancy (in the same or different context). > > > > > > > Regards, > > Boqun > > > > > > Regars, > > > > Boqun > > > > > > > > > > Regards, > > > > > > Boqun > > > > > > > > > > > Sebastian ^ permalink raw reply [flat|nested] 100+ messages in thread
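The defer mechanism Boqun proposes can be modeled with a hypothetical bpf_defer() that records (function, argument) pairs on a lock-free list from any context and replays them later from a context where taking locks is safe. Nothing below exists in the kernel; bpf_defer(), bpf_defer_flush(), and the node layout are only a sketch of the shape such a mechanism might take.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sketch of a general BPF defer mechanism: a lock-free
 * list of (callback, argument) pairs.  Names are invented here. */

typedef void (*defer_fn_t)(void *arg);

struct bpf_defer_node {
	struct bpf_defer_node *next;
	defer_fn_t fn;
	void *arg;
};

static _Atomic(struct bpf_defer_node *) bpf_defer_head;	/* starts NULL */

/* Lockless: callable with arbitrary locks held (the "can be deferred"
 * case of the scheme discussed above). */
static void bpf_defer(struct bpf_defer_node *node, defer_fn_t fn, void *arg)
{
	struct bpf_defer_node *old = atomic_load(&bpf_defer_head);

	node->fn = fn;
	node->arg = arg;
	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&bpf_defer_head, &old, node));
}

/* Replay from a safe context (in the kernel this might be irq_work
 * or a dedicated kthread); e.g. each fn could be call_srcu(). */
static void bpf_defer_flush(void)
{
	struct bpf_defer_node *node = atomic_exchange(&bpf_defer_head, NULL);

	while (node) {
		struct bpf_defer_node *next = node->next;

		node->fn(node->arg);
		node = next;
	}
}

/* Tiny helper used by the usage example below. */
static void count_up(void *arg)
{
	++*(int *)arg;
}
```

Usage would look like `bpf_defer(&node, (defer_fn_t)some_callback, arg)` at the tracing site, with the flush run from a context that may take locks.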
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 20:14 ` Boqun Feng @ 2026-03-19 20:21 ` Joel Fernandes 2026-03-19 20:39 ` Boqun Feng 2026-03-20 16:15 ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng 1 sibling, 1 reply; 100+ messages in thread From: Joel Fernandes @ 2026-03-19 20:21 UTC (permalink / raw) To: Boqun Feng, Kumar Kartikeya Dwivedi Cc: Sebastian Andrzej Siewior, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On 3/19/2026 4:14 PM, Boqun Feng wrote: > On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote: >> On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: >>> >>> On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: >>>> On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: >>>>> >>>>> On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: >>>>>> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: >>>>>>> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: >>>>>>>> Please just use the queue_delayed_work() with a delay >0. >>>>>>>> >>>>>>> >>>>>>> That doesn't work since queue_delayed_work() with a positive delay will >>>>>>> still acquire timer base lock, and we can have BPF instrument with timer >>>>>>> base lock held i.e. calling call_srcu() with timer base lock. >>>>>>> >>>>>>> irq_work on the other hand doesn't use any locking. >>>>>> >>>>>> Could we please restrict BPF somehow so it does roam free? It is >>>>>> absolutely awful to have irq_work() in call_srcu() just because it >>>>>> might acquire locks. >>>>>> >>>>> >>>>> I agree it's not RCU's fault ;-) >>>>> >>>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call >>>>> call_srcu() in irq_work instead? 
Or a more systematic defer mechanism >>>>> that allows BPF to defer any lock holding functions to a different >>>>> context. (We have a similar issue that BPF cannot call kfree_rcu() in >>>>> some cases IIRC). >>>>> >>>>> But we need to fix this in v7.0, so this short-term fix is still needed. >>>>> >>>> >>>> I don't think this is an option, even longer term. We already do it >>>> when it's incorrect to invoke call_rcu() or any other API in a >>>> specific context (e.g., NMI, where we punt it using irq_work). >>>> However, the case reported in this thread is different. It was an >>>> existing user which worked fine before but got broken now. We were >>>> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock >>>> is held before, so the conversion underneath to call_srcu() should >>>> continue to remain transparent in this respect. >>>> >>> >>> I'm not sure that's a real argument here, kernel doesn't have a stable >>> internal API, which allows developers to refactor the code into a saner >>> way. There are currently multiple issues that suggest we may need a >>> defer mechanism for BPF core, and if it makes the code more easier to >>> reason about then why not? Think about it like a process that we learn >>> about all the defer patterns that BPF currently needs and wrap them in a >>> nice and maintainable way. >> >> This is all right in theory, but I don't understand how your >> theoretical deferral mechanism for BPF will help here in the case >> we're discussing, or is even appealing. >> >> How do we decide when to defer? Will we annotate all locks that can be >> held by RCU internals to be able to check if they are held (on the >> current cpu, which is non-trivial except by maintaining a held lock >> table, testing the locked bit is too conservative), and then deferring >> the call_srcu() from the caller in BPF? What if you gain new locks? It >> doesn't seem practical to me. 
Plus it pushes the burden of detection
>> and deferral to the caller, making everything more complicated and
>> error-prone.
>>
>
> My suggestion would be: deferring all call_srcu()s that in BPF
> core. [...]

isn't one of the issues that BPF is using call_rcu_tasks_trace(), which
is now internally using call_srcu()? So whether other parts of BPF use
call_srcu() or not, the issue still stands AFAICS.

I think we have to fix RCU Tasks Trace, one way or the other.

Or did I miss something?

thanks,

-- 
Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 20:21 ` Joel Fernandes @ 2026-03-19 20:39 ` Boqun Feng 2026-03-20 15:34 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-19 20:39 UTC (permalink / raw) To: Joel Fernandes Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Thu, Mar 19, 2026 at 04:21:45PM -0400, Joel Fernandes wrote: > > > On 3/19/2026 4:14 PM, Boqun Feng wrote: > > On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote: > >> On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: > >>> > >>> On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > >>>> On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > >>>>> > >>>>> On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > >>>>>> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > >>>>>>> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > >>>>>>>> Please just use the queue_delayed_work() with a delay >0. > >>>>>>>> > >>>>>>> > >>>>>>> That doesn't work since queue_delayed_work() with a positive delay will > >>>>>>> still acquire timer base lock, and we can have BPF instrument with timer > >>>>>>> base lock held i.e. calling call_srcu() with timer base lock. > >>>>>>> > >>>>>>> irq_work on the other hand doesn't use any locking. > >>>>>> > >>>>>> Could we please restrict BPF somehow so it does roam free? It is > >>>>>> absolutely awful to have irq_work() in call_srcu() just because it > >>>>>> might acquire locks. > >>>>>> > >>>>> > >>>>> I agree it's not RCU's fault ;-) > >>>>> > >>>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call > >>>>> call_srcu() in irq_work instead? 
Or a more systematic defer mechanism > >>>>> that allows BPF to defer any lock holding functions to a different > >>>>> context. (We have a similar issue that BPF cannot call kfree_rcu() in > >>>>> some cases IIRC). > >>>>> > >>>>> But we need to fix this in v7.0, so this short-term fix is still needed. > >>>>> > >>>> > >>>> I don't think this is an option, even longer term. We already do it > >>>> when it's incorrect to invoke call_rcu() or any other API in a > >>>> specific context (e.g., NMI, where we punt it using irq_work). > >>>> However, the case reported in this thread is different. It was an > >>>> existing user which worked fine before but got broken now. We were > >>>> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock > >>>> is held before, so the conversion underneath to call_srcu() should > >>>> continue to remain transparent in this respect. > >>>> > >>> > >>> I'm not sure that's a real argument here, kernel doesn't have a stable > >>> internal API, which allows developers to refactor the code into a saner > >>> way. There are currently multiple issues that suggest we may need a > >>> defer mechanism for BPF core, and if it makes the code more easier to > >>> reason about then why not? Think about it like a process that we learn > >>> about all the defer patterns that BPF currently needs and wrap them in a > >>> nice and maintainable way. > >> > >> This is all right in theory, but I don't understand how your > >> theoretical deferral mechanism for BPF will help here in the case > >> we're discussing, or is even appealing. > >> > >> How do we decide when to defer? Will we annotate all locks that can be > >> held by RCU internals to be able to check if they are held (on the > >> current cpu, which is non-trivial except by maintaining a held lock > >> table, testing the locked bit is too conservative), and then deferring > >> the call_srcu() from the caller in BPF? What if you gain new locks? It > >> doesn't seem practical to me. 
Plus it pushes the burden of detection
> >> and deferral to the caller, making everything more complicated and
> >> error-prone.
> >>
> >
> > My suggestion would be: deferring all call_srcu()s that in BPF
> > core. [...]
>
> isn't one of the issues that BPF is using call_rcu_tasks_trace(), which is now
> internally using call_srcu()? So whether other parts of BPF use call_srcu() or

I was talking about the long-term solution in that thread ;-)

Short-term, yes, the switch from call_rcu_tasks_trace() to call_srcu()
is the cause of the issue, and we have the lockdep report to prove that.
So in order to continue the process of switching to SRCU for BPF, we
need to restore the behavior of call_rcu_tasks_trace() in call_srcu().

> not, the issue still stands AFAICS.
>

In an alternative universe, BPF has a defer mechanism, and the BPF core
would just call (for example):

	bpf_defer(call_srcu, ...); // <- a lockless defer

so the issue won't happen.

> I think we have to fix RCU tasks trace, one way or the other.
> > Or did I miss something?
>

No, I don't think so ;-)

Regards,
Boqun

> thanks,
>
> -- 
> Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 20:39 ` Boqun Feng @ 2026-03-20 15:34 ` Paul E. McKenney 2026-03-20 15:59 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-20 15:34 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Thu, Mar 19, 2026 at 01:39:33PM -0700, Boqun Feng wrote: > On Thu, Mar 19, 2026 at 04:21:45PM -0400, Joel Fernandes wrote: > > On 3/19/2026 4:14 PM, Boqun Feng wrote: > > > On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote: > > >> On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: > > >>> > > >>> On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > > >>>> On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > >>>>> > > >>>>> On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > >>>>>> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > >>>>>>> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > >>>>>>>> Please just use the queue_delayed_work() with a delay >0. > > >>>>>>>> > > >>>>>>> > > >>>>>>> That doesn't work since queue_delayed_work() with a positive delay will > > >>>>>>> still acquire timer base lock, and we can have BPF instrument with timer > > >>>>>>> base lock held i.e. calling call_srcu() with timer base lock. > > >>>>>>> > > >>>>>>> irq_work on the other hand doesn't use any locking. > > >>>>>> > > >>>>>> Could we please restrict BPF somehow so it does roam free? It is > > >>>>>> absolutely awful to have irq_work() in call_srcu() just because it > > >>>>>> might acquire locks. 
> > >>>>>> > > >>>>> > > >>>>> I agree it's not RCU's fault ;-) > > >>>>> > > >>>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call > > >>>>> call_srcu() in irq_work instead? Or a more systematic defer mechanism > > >>>>> that allows BPF to defer any lock holding functions to a different > > >>>>> context. (We have a similar issue that BPF cannot call kfree_rcu() in > > >>>>> some cases IIRC). > > >>>>> > > >>>>> But we need to fix this in v7.0, so this short-term fix is still needed. > > >>>>> > > >>>> > > >>>> I don't think this is an option, even longer term. We already do it > > >>>> when it's incorrect to invoke call_rcu() or any other API in a > > >>>> specific context (e.g., NMI, where we punt it using irq_work). > > >>>> However, the case reported in this thread is different. It was an > > >>>> existing user which worked fine before but got broken now. We were > > >>>> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock > > >>>> is held before, so the conversion underneath to call_srcu() should > > >>>> continue to remain transparent in this respect. > > >>>> > > >>> > > >>> I'm not sure that's a real argument here, kernel doesn't have a stable > > >>> internal API, which allows developers to refactor the code into a saner > > >>> way. There are currently multiple issues that suggest we may need a > > >>> defer mechanism for BPF core, and if it makes the code more easier to > > >>> reason about then why not? Think about it like a process that we learn > > >>> about all the defer patterns that BPF currently needs and wrap them in a > > >>> nice and maintainable way. > > >> > > >> This is all right in theory, but I don't understand how your > > >> theoretical deferral mechanism for BPF will help here in the case > > >> we're discussing, or is even appealing. > > >> > > >> How do we decide when to defer? 
Will we annotate all locks that can be > > >> held by RCU internals to be able to check if they are held (on the > > >> current cpu, which is non-trivial except by maintaining a held lock > > >> table, testing the locked bit is too conservative), and then deferring > > >> the call_srcu() from the caller in BPF? What if you gain new locks? It > > >> doesn't seem practical to me. Plus it pushes the burden of detection > > >> and deferral to the caller, making everything more complicated and > > >> error-prone. > > >> > > > > > > My suggestion would be: deferring all call_srcu()s that in BPF > > > core. [...] > > > > isn't one of the issues is that BPF is using call_rcu_tasks_trace() which is now > > internally using call_srcu? So whether other parts of BPF use call_srcu() or > > I was talking about the long term solution in that thread ;-) > > Short-term, yes the switching from call_rcu_tasks_trace() to call_srcu() > is the cause the issue, and we have the lockdep report to prove that. So > in order to continue the process of switching to SRCU for BPF, we need > to restore the behavior of call_rcu_tasks_trace() in call_srcu(). > > > not, the issue still stands AFAICS. > > > > In an alternative universe, BPF has a defer mechanism, and BPF core > would just call (for example): > > bpf_defer(call_srcu, ...); // <- a lockless defer > > so the issue won't happen. In theory, this is quite true. In practice, unfortunately for keeping this part of RCU as simple as we might wish, when a BPF program gets attached to some function in the kernel, it does not know whether or not that function holds a given scheduler lock. For example, there are any number of utility functions that can be (and are) called both with and without those scheduler locks held. Worse yet, it might be attached to a function that is *never* invoked with a scheduler lock held -- until some out-of-tree module is loaded. Which means that this module might well be loaded after BPF has JIT-ed the BPF program. 
So we really do need to make some variant of call_srcu() that deals with this. We do have some options. First, we could make call_srcu() deal with it directly, or second, we could create something like call_srcu_lockless() or call_srcu_nolock() or whatever that can safely be invoked from any context, including NMI handlers, and that invokes call_srcu() directly when it determines that it is safe to do so. The advantage of the second approach is that it avoids incurring the overhead of checking in the common case. Thoughts? Thanx, Paul > > I think we have to fix RCU tasks trace, one way or the other. > > > > Or did I miss something? > > > > No I don't think so ;-) > > Regards, > Boqun > > > thanks, > > > > -- > > Joel Fernandes > > > > > > ^ permalink raw reply [flat|nested] 100+ messages in thread
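The second option Paul describes can be sketched as a thin wrapper that invokes the normal path directly when the calling context allows it, and otherwise punts to a lock-free list drained later (e.g., from irq_work), so the common case pays no extra checking cost at the enqueue side beyond one branch. All names below (call_srcu_lockless(), context_is_safe, the fake call_srcu()) are placeholders for illustration, not real kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of a call_srcu_lockless()-style wrapper: direct call
 * when safe, lockless deferral when not.  Everything here is invented
 * for illustration. */

struct srcu_cb {
	struct srcu_cb *next;
};

static _Atomic(struct srcu_cb *) srcu_deferred;	/* starts NULL */
static int srcu_direct_calls;	/* stands in for the real call_srcu() path */

/* Stand-in for call_srcu(): the common, potentially lock-taking path. */
static void fake_call_srcu(struct srcu_cb *cb)
{
	(void)cb;
	srcu_direct_calls++;
}

/* In the kernel this would be some test for raw-spinlock/NMI context;
 * here it is just a flag the caller flips. */
static bool context_is_safe;

static void call_srcu_lockless(struct srcu_cb *cb)
{
	if (context_is_safe) {
		fake_call_srcu(cb);	/* common case: no deferral overhead */
		return;
	}

	/* Unsafe context: lockless push, drained later. */
	struct srcu_cb *old = atomic_load(&srcu_deferred);

	do {
		cb->next = old;
	} while (!atomic_compare_exchange_weak(&srcu_deferred, &old, cb));
}

/* Models the irq_work handler flushing deferred callbacks from a
 * context where calling the real function is safe. */
static void srcu_flush_deferred(void)
{
	struct srcu_cb *cb = atomic_exchange(&srcu_deferred, NULL);

	while (cb) {
		struct srcu_cb *next = cb->next;

		fake_call_srcu(cb);
		cb = next;
	}
}
```

The open question from the thread maps to `context_is_safe`: whether the caller asserts safety (a separate `_lockless()` entry point) or call_srcu() detects it internally.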
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 15:34 ` Paul E. McKenney @ 2026-03-20 15:59 ` Boqun Feng 2026-03-20 16:24 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-20 15:59 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 08:34:41AM -0700, Paul E. McKenney wrote: > On Thu, Mar 19, 2026 at 01:39:33PM -0700, Boqun Feng wrote: > > On Thu, Mar 19, 2026 at 04:21:45PM -0400, Joel Fernandes wrote: > > > On 3/19/2026 4:14 PM, Boqun Feng wrote: > > > > On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote: > > > >> On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: > > > >>> > > > >>> On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > > > >>>> On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > > >>>>> > > > >>>>> On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > > >>>>>> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > > >>>>>>> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > > >>>>>>>> Please just use the queue_delayed_work() with a delay >0. > > > >>>>>>>> > > > >>>>>>> > > > >>>>>>> That doesn't work since queue_delayed_work() with a positive delay will > > > >>>>>>> still acquire timer base lock, and we can have BPF instrument with timer > > > >>>>>>> base lock held i.e. calling call_srcu() with timer base lock. > > > >>>>>>> > > > >>>>>>> irq_work on the other hand doesn't use any locking. > > > >>>>>> > > > >>>>>> Could we please restrict BPF somehow so it does roam free? It is > > > >>>>>> absolutely awful to have irq_work() in call_srcu() just because it > > > >>>>>> might acquire locks. 
> > > >>>>>> > > > >>>>> > > > >>>>> I agree it's not RCU's fault ;-) > > > >>>>> > > > >>>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call > > > >>>>> call_srcu() in irq_work instead? Or a more systematic defer mechanism > > > >>>>> that allows BPF to defer any lock holding functions to a different > > > >>>>> context. (We have a similar issue that BPF cannot call kfree_rcu() in > > > >>>>> some cases IIRC). > > > >>>>> > > > >>>>> But we need to fix this in v7.0, so this short-term fix is still needed. > > > >>>>> > > > >>>> > > > >>>> I don't think this is an option, even longer term. We already do it > > > >>>> when it's incorrect to invoke call_rcu() or any other API in a > > > >>>> specific context (e.g., NMI, where we punt it using irq_work). > > > >>>> However, the case reported in this thread is different. It was an > > > >>>> existing user which worked fine before but got broken now. We were > > > >>>> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock > > > >>>> is held before, so the conversion underneath to call_srcu() should > > > >>>> continue to remain transparent in this respect. > > > >>>> > > > >>> > > > >>> I'm not sure that's a real argument here, kernel doesn't have a stable > > > >>> internal API, which allows developers to refactor the code into a saner > > > >>> way. There are currently multiple issues that suggest we may need a > > > >>> defer mechanism for BPF core, and if it makes the code more easier to > > > >>> reason about then why not? Think about it like a process that we learn > > > >>> about all the defer patterns that BPF currently needs and wrap them in a > > > >>> nice and maintainable way. > > > >> > > > >> This is all right in theory, but I don't understand how your > > > >> theoretical deferral mechanism for BPF will help here in the case > > > >> we're discussing, or is even appealing. > > > >> > > > >> How do we decide when to defer? 
Will we annotate all locks that can be > > > >> held by RCU internals to be able to check if they are held (on the > > > >> current cpu, which is non-trivial except by maintaining a held lock > > > >> table, testing the locked bit is too conservative), and then deferring > > > >> the call_srcu() from the caller in BPF? What if you gain new locks? It > > > >> doesn't seem practical to me. Plus it pushes the burden of detection > > > >> and deferral to the caller, making everything more complicated and > > > >> error-prone. > > > >> > > > > > > > > My suggestion would be: deferring all call_srcu()s that in BPF > > > > core. [...] > > > > > > isn't one of the issues is that BPF is using call_rcu_tasks_trace() which is now > > > internally using call_srcu? So whether other parts of BPF use call_srcu() or > > > > I was talking about the long term solution in that thread ;-) > > > > Short-term, yes the switching from call_rcu_tasks_trace() to call_srcu() > > is the cause the issue, and we have the lockdep report to prove that. So > > in order to continue the process of switching to SRCU for BPF, we need > > to restore the behavior of call_rcu_tasks_trace() in call_srcu(). > > > > > not, the issue still stands AFAICS. > > > > > > > In an alternative universe, BPF has a defer mechanism, and BPF core > > would just call (for example): > > > > bpf_defer(call_srcu, ...); // <- a lockless defer > > > > so the issue won't happen. > > In theory, this is quite true. > > In practice, unfortunately for keeping this part of RCU as simple as > we might wish, when a BPF program gets attached to some function in > the kernel, it does not know whether or not that function holds a given > scheduler lock. For example, there are any number of utility functions > that can be (and are) called both with and without those scheduler > locks held. Worse yet, it might be attached to a function that is > *never* invoked with a scheduler lock held -- until some out-of-tree > module is loaded. 
> Which means that this module might well be loaded
> after BPF has JIT-ed the BPF program.

Hmm.. maybe I failed to make myself clear. I was suggesting we treat
BPF as a special context in which you cannot do everything: if there is
any call_srcu() needed, switch it to bpf_defer(). We should get the
same result as either 1) call_srcu() locklessly deferring itself or
2) a call_srcu_lockless().

Certainly we can make call_srcu() defer locklessly, but if it's only
for BPF, that looks like a whack-a-mole approach to me. Say later on we
want to use call_hazptr() in BPF for some reason (here's hoping!), then
we need to make it defer locklessly as well. Now we have lockless logic
in both call_srcu() and call_hazptr(), and if there is a third
primitive, we need to do that as well. So where's the end?

The lockless defer requirement comes from BPF being special; a proper
way to deal with it IMO would be for BPF to have a general defer
mechanism. Whether call_srcu() or call_srcu_lockless() can do a
lockless defer is orthogonal.

BTW, an example to my point: I think we have a deadlock even with the
old call_rcu_tasks_trace(), because at:

https://elixir.bootlin.com/linux/v6.19.8/source/kernel/rcu/tasks.h#L384

we do a:

	mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp));

which means call_rcu_tasks_trace() may acquire the timer base lock, and
that means if BPF were to trace a point where the timer base lock is
held, then we may have a deadlock. So now I wonder whether you had any
magic to avoid the deadlock pre-7.0 or we are just lucky ;-)

See, without a general defer mechanism, we will have a lot of fun
auditing all the primitives that BPF may use.

> So we really do need to make some variant of call_srcu() that deals
> with this.
>
> We do have some options. First, we could make call_srcu() deal with it
> directly, or second, we could create something like call_srcu_lockless()
> or call_srcu_nolock() or whatever that can safely be invoked from any
> context, including NMI handlers, and that invokes call_srcu() directly
> when it determines that it is safe to do so. The advantage of the second
> approach is that it avoids incurring the overhead of checking in the
> common case.

Within the RCU scope, I prefer the second option.

Regards,
Boqun

> Thoughts?
>
> 							Thanx, Paul
>
> > > I think we have to fix RCU tasks trace, one way or the other.
> > >
> > > Or did I miss something?
> > >
> >
> > No I don't think so ;-)
> >
> > Regards,
> > Boqun
> >
> > > thanks,
> > >
> > > --
> > > Joel Fernandes

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 15:59 ` Boqun Feng @ 2026-03-20 16:24 ` Paul E. McKenney 2026-03-20 16:57 ` Boqun Feng 2026-03-21 17:03 ` [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() Boqun Feng 0 siblings, 2 replies; 100+ messages in thread From: Paul E. McKenney @ 2026-03-20 16:24 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 08:59:52AM -0700, Boqun Feng wrote: > On Fri, Mar 20, 2026 at 08:34:41AM -0700, Paul E. McKenney wrote: > > On Thu, Mar 19, 2026 at 01:39:33PM -0700, Boqun Feng wrote: > > > On Thu, Mar 19, 2026 at 04:21:45PM -0400, Joel Fernandes wrote: > > > > On 3/19/2026 4:14 PM, Boqun Feng wrote: > > > > > On Thu, Mar 19, 2026 at 07:41:06PM +0100, Kumar Kartikeya Dwivedi wrote: > > > > >> On Thu, 19 Mar 2026 at 18:27, Boqun Feng <boqun@kernel.org> wrote: > > > > >>> > > > > >>> On Thu, Mar 19, 2026 at 05:59:40PM +0100, Kumar Kartikeya Dwivedi wrote: > > > > >>>> On Thu, 19 Mar 2026 at 17:48, Boqun Feng <boqun@kernel.org> wrote: > > > > >>>>> > > > > >>>>> On Thu, Mar 19, 2026 at 05:33:50PM +0100, Sebastian Andrzej Siewior wrote: > > > > >>>>>> On 2026-03-19 09:27:59 [-0700], Boqun Feng wrote: > > > > >>>>>>> On Thu, Mar 19, 2026 at 10:03:15AM +0100, Sebastian Andrzej Siewior wrote: > > > > >>>>>>>> Please just use the queue_delayed_work() with a delay >0. > > > > >>>>>>>> > > > > >>>>>>> > > > > >>>>>>> That doesn't work since queue_delayed_work() with a positive delay will > > > > >>>>>>> still acquire timer base lock, and we can have BPF instrument with timer > > > > >>>>>>> base lock held i.e. calling call_srcu() with timer base lock. > > > > >>>>>>> > > > > >>>>>>> irq_work on the other hand doesn't use any locking. 
> > > > >>>>>> > > > > >>>>>> Could we please restrict BPF somehow so it does roam free? It is > > > > >>>>>> absolutely awful to have irq_work() in call_srcu() just because it > > > > >>>>>> might acquire locks. > > > > >>>>>> > > > > >>>>> > > > > >>>>> I agree it's not RCU's fault ;-) > > > > >>>>> > > > > >>>>> I guess it'll be difficult to restrict BPF, however maybe BPF can call > > > > >>>>> call_srcu() in irq_work instead? Or a more systematic defer mechanism > > > > >>>>> that allows BPF to defer any lock holding functions to a different > > > > >>>>> context. (We have a similar issue that BPF cannot call kfree_rcu() in > > > > >>>>> some cases IIRC). > > > > >>>>> > > > > >>>>> But we need to fix this in v7.0, so this short-term fix is still needed. > > > > >>>>> > > > > >>>> > > > > >>>> I don't think this is an option, even longer term. We already do it > > > > >>>> when it's incorrect to invoke call_rcu() or any other API in a > > > > >>>> specific context (e.g., NMI, where we punt it using irq_work). > > > > >>>> However, the case reported in this thread is different. It was an > > > > >>>> existing user which worked fine before but got broken now. We were > > > > >>>> using call_rcu_tasks_trace() just fine in scx callbacks where rq->lock > > > > >>>> is held before, so the conversion underneath to call_srcu() should > > > > >>>> continue to remain transparent in this respect. > > > > >>>> > > > > >>> > > > > >>> I'm not sure that's a real argument here, kernel doesn't have a stable > > > > >>> internal API, which allows developers to refactor the code into a saner > > > > >>> way. There are currently multiple issues that suggest we may need a > > > > >>> defer mechanism for BPF core, and if it makes the code more easier to > > > > >>> reason about then why not? Think about it like a process that we learn > > > > >>> about all the defer patterns that BPF currently needs and wrap them in a > > > > >>> nice and maintainable way. 
> > > > >> > > > > >> This is all right in theory, but I don't understand how your > > > > >> theoretical deferral mechanism for BPF will help here in the case > > > > >> we're discussing, or is even appealing. > > > > >> > > > > >> How do we decide when to defer? Will we annotate all locks that can be > > > > >> held by RCU internals to be able to check if they are held (on the > > > > >> current cpu, which is non-trivial except by maintaining a held lock > > > > >> table, testing the locked bit is too conservative), and then deferring > > > > >> the call_srcu() from the caller in BPF? What if you gain new locks? It > > > > >> doesn't seem practical to me. Plus it pushes the burden of detection > > > > >> and deferral to the caller, making everything more complicated and > > > > >> error-prone. > > > > >> > > > > > > > > > > My suggestion would be: deferring all call_srcu()s that in BPF > > > > > core. [...] > > > > > > > > isn't one of the issues is that BPF is using call_rcu_tasks_trace() which is now > > > > internally using call_srcu? So whether other parts of BPF use call_srcu() or > > > > > > I was talking about the long term solution in that thread ;-) > > > > > > Short-term, yes the switching from call_rcu_tasks_trace() to call_srcu() > > > is the cause the issue, and we have the lockdep report to prove that. So > > > in order to continue the process of switching to SRCU for BPF, we need > > > to restore the behavior of call_rcu_tasks_trace() in call_srcu(). > > > > > > > not, the issue still stands AFAICS. > > > > > > > > > > In an alternative universe, BPF has a defer mechanism, and BPF core > > > would just call (for example): > > > > > > bpf_defer(call_srcu, ...); // <- a lockless defer > > > > > > so the issue won't happen. > > > > In theory, this is quite true. 
> > > > In practice, unfortunately for keeping this part of RCU as simple as > > we might wish, when a BPF program gets attached to some function in > > the kernel, it does not know whether or not that function holds a given > > scheduler lock. For example, there are any number of utility functions > > that can be (and are) called both with and without those scheduler > > locks held. Worse yet, it might be attached to a function that is > > *never* invoked with a scheduler lock held -- until some out-of-tree > > module is loaded. Which means that this module might well be loaded > > after BPF has JIT-ed the BPF program. > > > > Hmm.. maybe I failed to make myself more clear. I was suggesting we > treat BPF as a special context, and you cannot do everything, if there > is any call_srcu() needed, switch it to bpf_defer(). We should have the > same result as either 1) call_srcu() locklessly defer itself or 2) a > call_srcu_lockless(). > > Certainly we can call_srcu() do locklessly defer, but if it's only for > BPF, that looks like a whack-a-mole approach to me. Say later on we want > to use call_hazptr() in BPF for some reason (there is hoping!), then we > need to make it locklessly defer as well. Now we have two lockless logic > in both call_srcu() and call_hazptr(), if there is a third one, we need > to do that as well. So where's the end? Except that by the same line of reasoning, how do the BPF guys figure out exactly which function calls they need to defer and under what conditions they need to defer them? Keeping in mind that the list of functions and corresponding conditions is subject to change as the kernel continues to change. > The lockless defer request comes from BPF being special, a proper way to > deal with it IMO would be BPF has a general defer mechanism. Whether > call_srcu() or call_srcu_lockless() can do lockless defer is > orthogonal. Fair point, and for the general defer mechanism, I hereby nominate the irq_work_queue() function. 
We can use this both for RCU and for hazard pointers. The code to make a call_srcu_lockless() and call_hazptr_lockless() that includes the relevant checks and that does the deferral will not be large, complex, or slow. Especially assuming that we consolidate common checks. > BTW, an example to my point, I think we have a deadlock even with the > old call_rcu_tasks_trace(), because at: > > https://elixir.bootlin.com/linux/v6.19.8/source/kernel/rcu/tasks.h#L384 > > We do a: > > mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); > > which means call_rcu_tasks_trace() may acquire timer base lock, and that > means if BPF was to trace a point where timer base lock is held, then we > may have a deadlock. So Now I wonder whether you had any magic to avoid > the deadlock pre-7.0 or we are just lucky ;-) Test it and see! ;-) > See, without a general defer mechanism, we will have a lot of fun > auditing all the primitives that BPF may use. No, *we* only audit the primitives in our subsystem that BPF actually uses when BPF starts using them. We let the *other* subsystems worry about *their* interactions with BPF. > > So we really do need to make some variant of call_srcu() that deals > > with this. > > > > We do have some options. First, we could make call_srcu() deal with it > > directly, or second, we could create something like call_srcu_lockless() > > or call_srcu_nolock() or whatever that can safely be invoked from any > > context, including NMI handlers, and that invokes call_srcu() directly > > when it determines that it is safe to do so. The advantage of the second > > approach is that it avoids incurring the overhead of checking in the > > common case. > > Within the RCU scope, I prefer the second option. Works for me! Would you guys like to implement this, or would you prefer that I do so? Thanx, Paul > Regards, > Boqun > > > Thoughts? > > > > Thanx, Paul > > > > > > I think we have to fix RCU tasks trace, one way or the other. 
> > > > > > > > Or did I miss something? > > > > > > > > > > No I don't think so ;-) > > > > > > Regards, > > > Boqun > > > > > > > thanks, > > > > > > > > -- > > > > Joel Fernandes > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 16:24 ` Paul E. McKenney @ 2026-03-20 16:57 ` Boqun Feng 2026-03-20 17:54 ` Joel Fernandes 2026-03-20 23:11 ` Paul E. McKenney 2026-03-21 17:03 ` [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() Boqun Feng 1 sibling, 2 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-20 16:57 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 09:24:15AM -0700, Paul E. McKenney wrote: [...] > > > > In an alternative universe, BPF has a defer mechanism, and BPF core > > > > would just call (for example): > > > > > > > > bpf_defer(call_srcu, ...); // <- a lockless defer > > > > > > > > so the issue won't happen. > > > > > > In theory, this is quite true. > > > > > > In practice, unfortunately for keeping this part of RCU as simple as > > > we might wish, when a BPF program gets attached to some function in > > > the kernel, it does not know whether or not that function holds a given > > > scheduler lock. For example, there are any number of utility functions > > > that can be (and are) called both with and without those scheduler > > > locks held. Worse yet, it might be attached to a function that is > > > *never* invoked with a scheduler lock held -- until some out-of-tree > > > module is loaded. Which means that this module might well be loaded > > > after BPF has JIT-ed the BPF program. > > > > > > > Hmm.. maybe I failed to make myself more clear. I was suggesting we > > treat BPF as a special context, and you cannot do everything, if there > > is any call_srcu() needed, switch it to bpf_defer(). We should have the > > same result as either 1) call_srcu() locklessly defer itself or 2) a > > call_srcu_lockless(). 
> > > > Certainly we can call_srcu() do locklessly defer, but if it's only for > > BPF, that looks like a whack-a-mole approach to me. Say later on we want > > to use call_hazptr() in BPF for some reason (there is hoping!), then we > > need to make it locklessly defer as well. Now we have two lockless logic > > in both call_srcu() and call_hazptr(), if there is a third one, we need > > to do that as well. So where's the end? > > Except that by the same line of reasoning, how do the BPF guys figure out > exactly which function calls they need to defer and under what conditions > they need to defer them? Keeping in mind that the list of functions and Can't they just defer anything that is deferrable? If it's deferrable, then what's the actual cost for BPF to defer it? > corresponding conditions is subject to change as the kernel continues > to change. > > > The lockless defer request comes from BPF being special, a proper way to > > deal with it IMO would be BPF has a general defer mechanism. Whether > > call_srcu() or call_srcu_lockless() can do lockless defer is > > orthogonal. > > Fair point, and for the general defer mechanism, I hereby nominate > the irq_work_queue() function. We can use this both for RCU and > for hazard pointers. The code to make a call_srcu_lockless() and > call_hazptr_lockless() that includes the relevant checks and that does > the deferral will not be large, complex, or slow. Especially assuming > that we consolidate common checks. > > > BTW, an example to my point, I think we have a deadlock even with the > > old call_rcu_tasks_trace(), because at: > > > > https://elixir.bootlin.com/linux/v6.19.8/source/kernel/rcu/tasks.h#L384 > > > > We do a: > > > > mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); > > > > which means call_rcu_tasks_trace() may acquire timer base lock, and that > > means if BPF was to trace a point where timer base lock is held, then we > > may have a deadlock. 
So now I wonder whether you had any magic to avoid
> > the deadlock pre-7.0 or we are just lucky ;-)
>
> Test it and see! ;-)

"Program testing can be used to show the presence of bugs, but never to
show their absence!" ;-)

> > See, without a general defer mechanism, we will have a lot of fun
> > auditing all the primitives that BPF may use.
>
> No, *we* only audit the primitives in our subsystem that BPF actually
> uses when BPF starts using them. We let the *other* subsystems worry
> about *their* interactions with BPF.

As an RCU maintainer: fine.

As a LOCKING maintainer: shake my head, because for every primitive that
BPF uses, now there could be a normal version and a _bpf/lockless()
version. That could create more maintenance issues, but only time can
tell.

> > > So we really do need to make some variant of call_srcu() that deals
> > > with this.
> > >
> > > We do have some options. First, we could make call_srcu() deal with it
> > > directly, or second, we could create something like call_srcu_lockless()
> > > or call_srcu_nolock() or whatever that can safely be invoked from any
> > > context, including NMI handlers, and that invokes call_srcu() directly
> > > when it determines that it is safe to do so. The advantage of the second
> > > approach is that it avoids incurring the overhead of checking in the
> > > common case.
> >
> > Within the RCU scope, I prefer the second option.
>
> Works for me!
>
> Would you guys like to implement this, or would you prefer that I do so?

I feel I don't have cycles for it soon, I have a big backlog (including
making preempt_count 64-bit on 64-bit x86). But I will send the fix for
the current call_srcu() for v7.0 and work with Joel to get it into
Linus' tree. I will definitely review it if you beat me to it ;-)

Regards,
Boqun

> 							Thanx, Paul
>
> > Regards,
> > Boqun
>
> [..]

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-20 16:57           ` Boqun Feng
@ 2026-03-20 17:54             ` Joel Fernandes
  2026-03-20 18:14               ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng
  2026-03-20 18:20               ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng
  2026-03-20 23:11             ` Paul E. McKenney
  1 sibling, 2 replies; 100+ messages in thread
From: Joel Fernandes @ 2026-03-20 17:54 UTC (permalink / raw)
To: Boqun Feng, Paul E. McKenney
Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend

On 3/20/2026 12:57 PM, Boqun Feng wrote:
>
>>>> So we really do need to make some variant of call_srcu() that deals
>>>> with this.
>>>>
>>>> We do have some options. First, we could make call_srcu() deal with it
>>>> directly, or second, we could create something like call_srcu_lockless()
>>>> or call_srcu_nolock() or whatever that can safely be invoked from any
>>>> context, including NMI handlers, and that invokes call_srcu() directly
>>>> when it determines that it is safe to do so. The advantage of the second
>>>> approach is that it avoids incurring the overhead of checking in the
>>>> common case.
>>> Within the RCU scope, I prefer the second option.
>> Works for me!
>>
>> Would you guys like to implement this, or would you prefer that I do so?
>>
> I feel I don't have cycles for it soon, I have a big backlog (including
> making preempt_count 64bit on 64bit x86). But I will send the fix in the
> current call_srcu() for v7.0 and work with Joel to get into Linus' tree.

Boqun, I get a splat as below with your irq_work patch on rcutorture:

Maybe the srcu_get_delay call needs:

	raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags);
	delay = srcu_get_delay(ssp);
	raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags);

can you check?
[ 0.459781] ------------[ cut here ]------------ [ 0.460401] WARNING: kernel/rcu/srcutree.c:681 at srcu_get_delay+0xb4/0xd0, CPU#0: swapper/0/1 [ 0.460751] Modules linked in: [ 0.460751] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc3-00020-gc18a9e13ce7f #96 PREEMPTLAZY [ 0.460751] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 [ 0.460751] RIP: 0010:srcu_get_delay+0xb4/0xd0 [ 0.460751] Code: 00 00 00 5b 5d 48 39 d0 48 0f 47 c2 e9 45 b0 0b 01 48 89 fd be ff ff ff ff 48 8d bb d0 00 00 00 e8 e1 86 0a 01 85 c0 75 0d 90 <0f> 0b 90 48 8b 55 40 e9 57 ff ff ff 48 8b 55 40 e9 4e ff ff ff 0f [ 0.460751] RSP: 0000:ffffb4ba80003f80 EFLAGS: 00010046 [ 0.460751] RAX: 0000000000000000 RBX: ffffffffac1604c0 RCX: 0000000000000001 [ 0.460751] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffffac160590 [ 0.460751] RBP: ffffffffac160460 R08: 0000000000000000 R09: 0000000000000000 [ 0.460751] R10: 0000000000000000 R11: ffffb4ba80003ff8 R12: 0000000000000023 [ 0.460751] R13: ffff9de181214b00 R14: 0000000000000000 R15: 0000000000000000 [ 0.460751] FS: 0000000000000000(0000) GS:ffff9de1f2799000(0000) knlGS:0000000000000000 [ 0.460751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.460751] CR2: ffff9de18ddda000 CR3: 000000000c64e000 CR4: 00000000000006f0 [ 0.460751] Call Trace: [ 0.460751] <IRQ> [ 0.460751] srcu_irq_work+0x11/0x40 [ 0.460751] irq_work_single+0x42/0x90 [ 0.460751] irq_work_run_list+0x26/0x40 [ 0.460751] irq_work_run+0x18/0x30 [ 0.460751] __sysvec_irq_work+0x30/0x180 [ 0.460751] sysvec_irq_work+0x6a/0x80 [ 0.460751] </IRQ> [ 0.460751] <TASK> [ 0.460751] asm_sysvec_irq_work+0x1a/0x20 [ 0.460751] RIP: 0010:_raw_spin_unlock_irqrestore+0x34/0x50 [ 0.460751] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 e6 58 f1 fe 48 89 ef e8 2e 92 f1 fe 80 e7 02 74 06 e8 e4 c0 00 ff fb 65 ff 0d fc 63 66 01 <74> 07 5b 5d c3 cc cc cc cc e8 7e 03 df fe 5b 5d e9 57 1b 00 00 0f [ 0.460751] RSP: 0000:ffffb4ba80013d50 EFLAGS: 00000286 [ 0.460751] RAX: 
0000000000001c8b RBX: 0000000000000297 RCX: 0000000000000000 [ 0.460751] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffab614c2c [ 0.460751] RBP: ffffffffac160578 R08: 0000000000000001 R09: 0000000000000000 [ 0.460751] R10: 0000000000000001 R11: 0000000000000000 R12: fffffffffffffe74 [ 0.460751] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [ 0.460751] ? _raw_spin_unlock_irqrestore+0x2c/0x50 [ 0.460751] srcu_gp_start_if_needed+0x354/0x530 [ 0.460751] __synchronize_srcu+0xcc/0x180 [ 0.460751] ? __pfx_wakeme_after_rcu+0x10/0x10 [ 0.460751] ? synchronize_srcu+0x3f/0x170 [ 0.460751] ? __pfx_rcu_init_tasks_generic+0x10/0x10 [ 0.460751] rcu_init_tasks_generic+0x104/0x150 [ 0.460751] do_one_initcall+0x59/0x2e0 [ 0.460751] ? _printk+0x56/0x70 [ 0.460751] kernel_init_freeable+0x227/0x440 [ 0.460751] ? __pfx_kernel_init+0x10/0x10 [ 0.460751] kernel_init+0x15/0x1c0 [ 0.460751] ret_from_fork+0x2ac/0x330 [ 0.460751] ? __pfx_kernel_init+0x10/0x10 [ 0.460751] ret_from_fork_asm+0x1a/0x30 [ 0.460751] </TASK> ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 17:54             ` Joel Fernandes
@ 2026-03-20 18:14               ` Boqun Feng
  2026-03-20 19:18                 ` Joel Fernandes
                                     ` (3 more replies)
  2026-03-20 18:20               ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng
  1 sibling, 4 replies; 100+ messages in thread
From: Boqun Feng @ 2026-03-20 18:14 UTC (permalink / raw)
To: Joel Fernandes, Paul E. McKenney
Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Boqun Feng,
	Andrea Righi, Zqiang

Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However, as BPF
instrumentation can happen basically anywhere (including where a
scheduler lock is held), call_srcu() now needs to avoid acquiring
scheduler locks, because otherwise it could cause a deadlock [1]. Fix
this by following what the previous RCU Tasks Trace implementation did:
use an irq_work to delay the queuing of the work that starts
process_srcu().

[boqun: Apply Joel's feedback]

Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Boqun Feng <boqun@kernel.org>
---
@Zqiang, I put your name as Suggested-by because you proposed the same
idea; let me know if you'd rather not have it.

@Joel, I did two updates (including your test feedback; the other is
calling irq_work_sync() when we clean up the srcu_struct), please give
it a try.
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c    | 29 +++++++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index dfb31d11ff05..be76fa4fc170 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -95,6 +95,7 @@ struct srcu_usage {
 	unsigned long reschedule_jiffies;
 	unsigned long reschedule_count;
 	struct delayed_work work;
+	struct irq_work irq_work;
 	struct srcu_struct *srcu_ssp;
 };
 
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775..73aef361a524 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/preempt.h>
+#include <linux/irq_work.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/sched.h>
 #include <linux/smp.h>
@@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done;
 static void srcu_invoke_callbacks(struct work_struct *work);
 static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay);
 static void process_srcu(struct work_struct *work);
+static void srcu_irq_work(struct irq_work *work);
 static void srcu_delay_timer(struct timer_list *t);
 
 /*
@@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
 	mutex_init(&ssp->srcu_sup->srcu_barrier_mutex);
 	atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0);
 	INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu);
+	init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work);
 	ssp->srcu_sup->sda_is_static = is_static;
 	if (!is_static) {
 		ssp->sda = alloc_percpu(struct srcu_data);
@@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp)
 		return; /* Just leak it! */
 	if (WARN_ON(srcu_readers_active(ssp)))
 		return; /* Just leak it! */
+	/* Wait for irq_work to finish first as it may queue a new work. */
+	irq_work_sync(&sup->irq_work);
 	flush_delayed_work(&sup->work);
 	for_each_possible_cpu(cpu) {
 		struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);
@@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 	// it isn't. And it does not have to be. After all, it
 	// can only be executed during early boot when there is only
 	// the one boot CPU running with interrupts still disabled.
+	//
+	// Use an irq_work here to avoid acquiring runqueue lock with
+	// srcu rcu_node::lock held. BPF instrument could introduce the
+	// opposite dependency, hence we need to break the possible
+	// locking dependency here.
 	if (likely(srcu_init_done))
-		queue_delayed_work(rcu_gp_wq, &sup->work,
-				   !!srcu_get_delay(ssp));
+		irq_work_queue(&sup->irq_work);
 	else if (list_empty(&sup->work.work.entry))
 		list_add(&sup->work.work.entry, &srcu_boot_list);
 }
@@ -1979,6 +1988,22 @@ static void process_srcu(struct work_struct *work)
 	srcu_reschedule(ssp, curdelay);
 }
 
+static void srcu_irq_work(struct irq_work *work)
+{
+	struct srcu_struct *ssp;
+	struct srcu_usage *sup;
+	unsigned long delay;
+
+	sup = container_of(work, struct srcu_usage, irq_work);
+	ssp = sup->srcu_ssp;
+
+	raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
+	delay = srcu_get_delay(ssp);
+	raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
+
+	queue_delayed_work(rcu_gp_wq, &sup->work, !!delay);
+}
+
 void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags,
 			     unsigned long *gp_seq)
 {
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-20 18:14 ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng @ 2026-03-20 19:18 ` Joel Fernandes 2026-03-20 20:47 ` Andrea Righi ` (2 subsequent siblings) 3 siblings, 0 replies; 100+ messages in thread From: Joel Fernandes @ 2026-03-20 19:18 UTC (permalink / raw) To: Boqun Feng, Paul E. McKenney Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On 3/20/2026 2:14 PM, Boqun Feng wrote: > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > happen basically everywhere (including where a scheduler lock is held), > call_srcu() now needs to avoid acquiring scheduler lock because > otherwise it could cause deadlock [1]. Fix this by following what the > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > the work to start process_srcu(). > > [boqun: Apply Joel's feedback] > > Reported-by: Andrea Righi <arighi@nvidia.com> > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > Suggested-by: Zqiang <qiang.zhang@linux.dev> > Signed-off-by: Boqun Feng <boqun@kernel.org> > --- > @Zqiang, I put your name as Suggested-by because you proposed the same > idea, let me know if you rather not have it. > > @Joel, I did two updates (including your test feedback, other one is > call irq_work_sync() when we clean the srcu_struct), please give it a > try. Thanks Boqun, I applied it for testing further. It would be good if Andrea and Kumar can try this patch as well to confirm that the BPF issues are gone. 
thanks, -- Joel Fernandes ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-20 18:14 ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng 2026-03-20 19:18 ` Joel Fernandes @ 2026-03-20 20:47 ` Andrea Righi 2026-03-20 20:54 ` Boqun Feng 2026-03-20 22:29 ` [PATCH v2] " Boqun Feng 2026-03-21 4:27 ` [PATCH] " Zqiang 2026-03-21 10:10 ` Paul E. McKenney 3 siblings, 2 replies; 100+ messages in thread From: Andrea Righi @ 2026-03-20 20:47 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Zqiang Hi Boqun, On Fri, Mar 20, 2026 at 11:14:00AM -0700, Boqun Feng wrote: > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > happen basically everywhere (including where a scheduler lock is held), > call_srcu() now needs to avoid acquiring scheduler lock because > otherwise it could cause deadlock [1]. Fix this by following what the > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > the work to start process_srcu(). > > [boqun: Apply Joel's feedback] > > Reported-by: Andrea Righi <arighi@nvidia.com> > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > Suggested-by: Zqiang <qiang.zhang@linux.dev> > Signed-off-by: Boqun Feng <boqun@kernel.org> > --- > @Zqiang, I put your name as Suggested-by because you proposed the same > idea, let me know if you rather not have it. > > @Joel, I did two updates (including your test feedback, other one is > call irq_work_sync() when we clean the srcu_struct), please give it a > try. 
I'm getting this at boot with this patch applied (testing directly from
Joel's branch rcu/dev):

[ 0.639477] DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context())
[ 0.639479] WARNING: kernel/locking/lockdep.c:4404 at lockdep_hardirqs_on_prepare+0x15e/0x1a0, CPU#0: swapper/0/1
[ 0.639507] Modules linked in:
[ 0.639507] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc3-virtme #5 PREEMPT(full)
[ 0.639507] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 0.639507] RIP: 0010:lockdep_hardirqs_on_prepare+0x165/0x1a0
[ 0.639507] Code: cc 90 e8 0e b1 4e 00 85 c0 74 0a 8b 35 54 08 43 02 85 f6 74 31 90 5b c3 cc cc cc cc 48 8d 3d d2 3c 44 02 48 c7 c6 b9 58 c5 b2 <67> 48 0f b9 3a eb ac 48 8d 3d cd 3c 44 02 48 c7 c6 37 55 c5 b2 67
[ 0.639507] RSP: 0018:ffffd26700003f58 EFLAGS: 00010046
[ 0.639507] RAX: 0000000000000001 RBX: ffffffffb35719f8 RCX: 0000000000000001
[ 0.639507] RDX: 0000000000000000 RSI: ffffffffb2c558b9 RDI: ffffffffb36bf810
[ 0.639507] RBP: ffffffffb35718e0 R08: 0000000000000001 R09: 0000000000000000
[ 0.639507] R10: 0000000000000001 R11: 000000007cb360a8 R12: 0000000000000000
[ 0.639507] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 0.639507] FS: 0000000000000000(0000) GS:ffff8a7e076f6000(0000) knlGS:0000000000000000
[ 0.639507] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.639507] CR2: ffff8a7dbffff000 CR3: 0000000023452000 CR4: 0000000000750ef0
[ 0.639507] PKRU: 55555554
[ 0.639507] Call Trace:
[ 0.639507]  <IRQ>
[ 0.639507]  trace_hardirqs_on+0x18/0x100
[ 0.639507]  _raw_spin_unlock_irq+0x28/0x50
[ 0.639507]  srcu_irq_work+0x63/0x90
[ 0.639507]  irq_work_single+0x69/0x90
[ 0.639507]  irq_work_run_list+0x26/0x40
[ 0.639507]  irq_work_run+0x18/0x30
[ 0.639507]  __sysvec_irq_work+0x35/0x1b0
[ 0.639507]  ? irq_exit_rcu+0xe/0x20
[ 0.639507]  sysvec_irq_work+0x6e/0x80
[ 0.639507]  </IRQ>
[ 0.639507]  <TASK>
[ 0.639507]  asm_sysvec_irq_work+0x1a/0x20
[ 0.639507] RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x70
[ 0.639507] Code: f5 53 48 8b 74 24 10 48 89 fb 48 83 c7 18 e8 b1 3d 26 ff 48 89 df e8 d9 a1 26 ff f7 c5 00 02 00 00 75 17 9c 58 f6 c4 02 75 2b <65> ff 0d 33 62 f0 01 74 16 5b 5d c3 cc cc cc cc e8 f5 83 35 ff 9c
[ 0.639507] RSP: 0018:ffffd26700013d48 EFLAGS: 00000246
[ 0.639507] RAX: 0000000000000092 RBX: ffffffffb35719f8 RCX: ffffffffb2017e0b
[ 0.639507] RDX: ffff8a7d80338000 RSI: 0000000000000000 RDI: ffffffffb2017e0b
[ 0.639507] RBP: 0000000000000282 R08: 0000000000000000 R09: 0000000000000001
[ 0.639507] R10: 0000000000000001 R11: 000000007cb360a8 R12: ffff8a7dbb628a40
[ 0.639507] R13: 0000000000000000 R14: ffffffffb3571940 R15: 0000000000000001
[ 0.639507]  ? _raw_spin_unlock_irqrestore+0x4b/0x70
[ 0.639507]  ? _raw_spin_unlock_irqrestore+0x4b/0x70
[ 0.639507]  srcu_gp_start_if_needed+0x37a/0x520
[ 0.639507]  ? __pfx_rcu_init_tasks_generic+0x10/0x10
[ 0.639507]  __synchronize_srcu+0xf6/0x1b0
[ 0.639507]  ? __pfx_wakeme_after_rcu+0x10/0x10
[ 0.639507]  ? __pfx_rcu_init_tasks_generic+0x10/0x10
[ 0.639507]  rcu_init_tasks_generic+0xfe/0x120
[ 0.639507]  do_one_initcall+0x6f/0x300
[ 0.639507]  kernel_init_freeable+0x24b/0x2b0
[ 0.639507]  ? __pfx_kernel_init+0x10/0x10
[ 0.639507]  kernel_init+0x1a/0x130
[ 0.639507]  ret_from_fork+0x2bd/0x370
[ 0.639507]  ? __pfx_kernel_init+0x10/0x10
[ 0.639507]  ret_from_fork_asm+0x1a/0x30
[ 0.639507]  </TASK>
[ 0.639507] irq event stamp: 6418
[ 0.639507] hardirqs last enabled at (6417): [<ffffffffb2017e0b>] _raw_spin_unlock_irqrestore+0x4b/0x70
[ 0.639507] hardirqs last disabled at (6418): [<ffffffffb20013fe>] sysvec_irq_work+0xe/0x80
[ 0.639507] softirqs last enabled at (6406): [<ffffffffb11ce7b6>] __irq_exit_rcu+0x96/0xc0
[ 0.639507] softirqs last disabled at (6401): [<ffffffffb11ce7b6>] __irq_exit_rcu+0x96/0xc0
[ 0.639507] ---[ end trace 0000000000000000 ]---

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 20:47 ` Andrea Righi
@ 2026-03-20 20:54   ` Boqun Feng
  2026-03-20 21:00     ` Andrea Righi
  2026-03-20 22:29   ` [PATCH v2] " Boqun Feng
  1 sibling, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-20 20:54 UTC (permalink / raw)
To: Andrea Righi
Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi,
	Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Zqiang

On Fri, Mar 20, 2026 at 09:47:51PM +0100, Andrea Righi wrote:
> Hi Boqun,
>
> On Fri, Mar 20, 2026 at 11:14:00AM -0700, Boqun Feng wrote:
> > [..]
>
> I'm getting this at boot with this patch applied (testing directly from
> Joel's branch rcu/dev):
>
> [ 0.639477] DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context())

My bad, this is missing:

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 73aef361a524..e08aaacad695 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1993,13 +1993,14 @@ static void srcu_irq_work(struct irq_work *work)
 	struct srcu_struct *ssp;
 	struct srcu_usage *sup;
 	unsigned long delay;
+	unsigned long flags;
 
 	sup = container_of(work, struct srcu_usage, irq_work);
 	ssp = sup->srcu_ssp;
 
-	raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
+	raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags);
 	delay = srcu_get_delay(ssp);
-	raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
+	raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags);
 
 	queue_delayed_work(rcu_gp_wq, &sup->work, !!delay);
 }

Regards,
Boqun

> [..]
>
> Thanks,
> -Andrea

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 20:54 ` Boqun Feng
@ 2026-03-20 21:00   ` Andrea Righi
  2026-03-20 21:02     ` Andrea Righi
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Righi @ 2026-03-20 21:00 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi,
	Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Zqiang

On Fri, Mar 20, 2026 at 01:54:01PM -0700, Boqun Feng wrote:
> On Fri, Mar 20, 2026 at 09:47:51PM +0100, Andrea Righi wrote:
> > [..]
> >
> > I'm getting this at boot with this patch applied (testing directly from
> > Joel's branch rcu/dev):
> >
> > [ 0.639477] DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context())
>
> My bad, this is missing:
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 73aef361a524..e08aaacad695 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1993,13 +1993,14 @@ static void srcu_irq_work(struct irq_work *work)
>  	struct srcu_struct *ssp;
>  	struct srcu_usage *sup;
>  	unsigned long delay;
> +	unsigned long flags;
>  
>  	sup = container_of(work, struct srcu_usage, irq_work);
>  	ssp = sup->srcu_ssp;
>  
> -	raw_spin_lock_irq_rcu_node(ssp->srcu_sup);
> +	raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags);
>  	delay = srcu_get_delay(ssp);
> -	raw_spin_unlock_irq_rcu_node(ssp->srcu_sup);
> +	raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags);
>  
>  	queue_delayed_work(rcu_gp_wq, &sup->work, !!delay);
>  }

Ah yes, much better with this one. :) And I confirm that it fixes the
initial locking issue that I reported.

Thanks!
-Andrea

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 21:00 ` Andrea Righi
@ 2026-03-20 21:02   ` Andrea Righi
  2026-03-20 21:06     ` Boqun Feng
  0 siblings, 1 reply; 100+ messages in thread
From: Andrea Righi @ 2026-03-20 21:02 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi,
	Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Zqiang

On Fri, Mar 20, 2026 at 10:00:47PM +0100, Andrea Righi wrote:
> On Fri, Mar 20, 2026 at 01:54:01PM -0700, Boqun Feng wrote:
> > [..]
> >
> > My bad, this is missing:
> > [..]
>
> Ah yes, much better with this one. :) And I confirm that it fixes the
> initial locking issue that I reported.

Forgot to add my:

Tested-by: Andrea Righi <arighi@nvidia.com>

-Andrea

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 21:02 ` Andrea Righi
@ 2026-03-20 21:06   ` Boqun Feng
  0 siblings, 0 replies; 100+ messages in thread
From: Boqun Feng @ 2026-03-20 21:06 UTC (permalink / raw)
To: Andrea Righi
Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi,
	Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Zqiang

On Fri, Mar 20, 2026 at 10:02:12PM +0100, Andrea Righi wrote:
> On Fri, Mar 20, 2026 at 10:00:47PM +0100, Andrea Righi wrote:
> > [..]
> >
> > Ah yes, much better with this one. :) And I confirm that it fixes the
> > initial locking issue that I reported.
>
> Forgot to add my:
>
> Tested-by: Andrea Righi <arighi@nvidia.com>
>

Thank you ;-)

Regards,
Boqun

> -Andrea

^ permalink raw reply	[flat|nested] 100+ messages in thread
* [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 20:47 ` Andrea Righi
  2026-03-20 20:54   ` Boqun Feng
@ 2026-03-20 22:29   ` Boqun Feng
  2026-03-23 21:09     ` Joel Fernandes
  2026-03-24 11:27     ` Frederic Weisbecker
  1 sibling, 2 replies; 100+ messages in thread
From: Boqun Feng @ 2026-03-20 22:29 UTC (permalink / raw)
To: Joel Fernandes, Paul E. McKenney
Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Boqun Feng,
	Andrea Righi, Zqiang

Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can
happen basically everywhere (including where a scheduler lock is held),
call_srcu() now needs to avoid acquiring scheduler lock because
otherwise it could cause deadlock [1]. Fix this by following what the
previous RCU Tasks Trace did: using an irq_work to delay the queuing of
the work to start process_srcu().

[boqun: Apply Joel's feedback]
[boqun: Apply Andrea's test feedback]

Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>
---
 include/linux/srcutree.h |  1 +
 kernel/rcu/srcutree.c    | 30 ++++++++++++++++++++++++++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index dfb31d11ff05..be76fa4fc170 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -95,6 +95,7 @@ struct srcu_usage {
 	unsigned long reschedule_jiffies;
 	unsigned long reschedule_count;
 	struct delayed_work work;
+	struct irq_work irq_work;
 	struct srcu_struct *srcu_ssp;
 };
 
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775..e08aaacad695 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -19,6 +19,7 @@
 #include <linux/mutex.h>
 #include <linux/percpu.h>
 #include <linux/preempt.h>
+#include <linux/irq_work.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/sched.h>
 #include <linux/smp.h>
@@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done;
 static void srcu_invoke_callbacks(struct work_struct *work);
 static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay);
 static void process_srcu(struct work_struct *work);
+static void srcu_irq_work(struct irq_work *work);
 static void srcu_delay_timer(struct timer_list *t);
 
 /*
@@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
 	mutex_init(&ssp->srcu_sup->srcu_barrier_mutex);
 	atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0);
 	INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu);
+	init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work);
 	ssp->srcu_sup->sda_is_static = is_static;
 	if (!is_static) {
 		ssp->sda = alloc_percpu(struct srcu_data);
@@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp)
 		return; /* Just leak it! */
 	if (WARN_ON(srcu_readers_active(ssp)))
 		return; /* Just leak it! */
+	/* Wait for irq_work to finish first as it may queue a new work. */
+	irq_work_sync(&sup->irq_work);
 	flush_delayed_work(&sup->work);
 	for_each_possible_cpu(cpu) {
 		struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);
@@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
 		// it isn't. And it does not have to be. After all, it
 		// can only be executed during early boot when there is only
 		// the one boot CPU running with interrupts still disabled.
+		//
+		// Use an irq_work here to avoid acquiring runqueue lock with
+		// srcu rcu_node::lock held. BPF instrument could introduce the
+		// opposite dependency, hence we need to break the possible
+		// locking dependency here.
 		if (likely(srcu_init_done))
-			queue_delayed_work(rcu_gp_wq, &sup->work,
-					   !!srcu_get_delay(ssp));
+			irq_work_queue(&sup->irq_work);
 		else if (list_empty(&sup->work.work.entry))
 			list_add(&sup->work.work.entry, &srcu_boot_list);
 	}
@@ -1979,6 +1988,23 @@ static void process_srcu(struct work_struct *work)
 	srcu_reschedule(ssp, curdelay);
 }
 
+static void srcu_irq_work(struct irq_work *work)
+{
+	struct srcu_struct *ssp;
+	struct srcu_usage *sup;
+	unsigned long delay;
+	unsigned long flags;
+
+	sup = container_of(work, struct srcu_usage, irq_work);
+	ssp = sup->srcu_ssp;
+
+	raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags);
+	delay = srcu_get_delay(ssp);
+	raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags);
+
+	queue_delayed_work(rcu_gp_wq, &sup->work, !!delay);
+}
+
 void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags,
 			     unsigned long *gp_seq)
 {
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-20 22:29 ` [PATCH v2] " Boqun Feng
@ 2026-03-23 21:09   ` Joel Fernandes
  2026-03-23 22:18     ` Boqun Feng
  2026-03-24 11:27   ` Frederic Weisbecker
  1 sibling, 1 reply; 100+ messages in thread
From: Joel Fernandes @ 2026-03-23 21:09 UTC (permalink / raw)
To: Boqun Feng, Paul E. McKenney
Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrea Righi, Zqiang

On 3/20/2026 6:29 PM, Boqun Feng wrote:
> Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms
> of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can
> happen basically everywhere (including where a scheduler lock is held),
> call_srcu() now needs to avoid acquiring scheduler lock because
> otherwise it could cause deadlock [1]. Fix this by following what the
> previous RCU Tasks Trace did: using an irq_work to delay the queuing of
> the work to start process_srcu().
>
> [boqun: Apply Joel's feedback]
> [boqun: Apply Andrea's test feedback]
>
> Reported-by: Andrea Righi <arighi@nvidia.com>
> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
> Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
> Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
> Suggested-by: Zqiang <qiang.zhang@linux.dev>
> Tested-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Boqun Feng <boqun@kernel.org>

Tested-by: Joel Fernandes <joelagnelf@nvidia.com>

I have the following in the shared rcu tree rcu/dev branch, but if you're
sending it for 7.0 -rc cycle I will drop them:

998608c o rcu: Use an intermediate irq_work to start process_srcu()
15d921a o rcutorture: Test call_srcu() with preemption disabled and not

thanks,

-- 
Joel Fernandes

> ---
> [..]

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu()
  2026-03-23 21:09 ` Joel Fernandes
@ 2026-03-23 22:18   ` Boqun Feng
  2026-03-23 22:50     ` Joel Fernandes
  0 siblings, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-23 22:18 UTC (permalink / raw)
To: Joel Fernandes
Cc: Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior,
	frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrea Righi, Zqiang

On Mon, Mar 23, 2026 at 05:09:53PM -0400, Joel Fernandes wrote:
> On 3/20/2026 6:29 PM, Boqun Feng wrote:
> > [..]
>
> Tested-by: Joel Fernandes <joelagnelf@nvidia.com>
>
> I have the following in the shared rcu tree rcu/dev branch, but if you're
> sending it for 7.0 -rc cycle I will drop them:
>
> 998608c o rcu: Use an intermediate irq_work to start process_srcu()
> 15d921a o rcutorture: Test call_srcu() with preemption disabled and not
>

Thank you.
I think Paul has another one we need for v7.0: https://lore.kernel.org/rcu/ad67e723-7f7d-4350-b886-c04e56c0b78d@paulmck-laptop/ I've made a fixes.v7.0-rc4 branch at: git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux.git fixes.v7.0-rc4 with these commits on top of -rc4: d0f816d5e872 rcu: Use an intermediate irq_work to start process_srcu() 6b7fc20ab878 srcu: Push srcu_node allocation to GP when non-preemptible f627bbab8fd0 srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable() I will send a PR on Wednesday if I don't hear any objection. Thanks! Regards, Boqun > thanks, > > -- > Joel Fernandes > [..] ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-23 22:18 ` Boqun Feng @ 2026-03-23 22:50 ` Joel Fernandes 0 siblings, 0 replies; 100+ messages in thread From: Joel Fernandes @ 2026-03-23 22:50 UTC (permalink / raw) To: Boqun Feng Cc: Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On 3/23/2026 6:18 PM, Boqun Feng wrote: > On Mon, Mar 23, 2026 at 05:09:53PM -0400, Joel Fernandes wrote: >> >> >> On 3/20/2026 6:29 PM, Boqun Feng wrote: >>> Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms >>> of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can >>> happen basically everywhere (including where a scheduler lock is held), >>> call_srcu() now needs to avoid acquiring scheduler lock because >>> otherwise it could cause deadlock [1]. Fix this by following what the >>> previous RCU Tasks Trace did: using an irq_work to delay the queuing of >>> the work to start process_srcu(). >>> >>> [boqun: Apply Joel's feedback] >>> [boqun: Apply Andrea's test feedback] >>> >>> Reported-by: Andrea Righi <arighi@nvidia.com> >>> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ >>> Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") >>> Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] >>> Suggested-by: Zqiang <qiang.zhang@linux.dev> >>> Tested-by: Andrea Righi <arighi@nvidia.com> >>> Signed-off-by: Boqun Feng <boqun@kernel.org> >> >> Tested-by: Joel Fernandes <joelagnelf@nvidia.com> >> >> I have the following in the shared rcu tree rcu/dev branch, but if you're >> sending it for 7.0 -rc cycle I will drop them: >> >> 998608c o rcu: Use an intermediate irq_work to start process_srcu() >> 15d921a o rcutorture: Test call_srcu() with preemption disabled and not >> > > Thank you. 
I think Paul has another one we need for v7.0: > > https://lore.kernel.org/rcu/ad67e723-7f7d-4350-b886-c04e56c0b78d@paulmck-laptop/ > > I've made an fixes.v7.0-rc4 branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux.git fixes.v7.0-rc4 > > with these commits on top of -rc4: > > d0f816d5e872 rcu: Use an intermediate irq_work to start process_srcu() > 6b7fc20ab878 srcu: Push srcu_node allocation to GP when non-preemptible > f627bbab8fd0 srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable() > > I will send a PR at Wednesday if I don't hear any objection. Thanks! > Btw, I am getting lockdep splats on SRCU-T now for some reason. rcu_gp_start_if_needed() calls schedule_work(), acquiring pool->lock. I believe this is because of Paul's rcutorture patch. I will debug it more and send a patch, but all other test scenarios are looking good. -- Joel Fernandes ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-20 22:29 ` [PATCH v2] " Boqun Feng 2026-03-23 21:09 ` Joel Fernandes @ 2026-03-24 11:27 ` Frederic Weisbecker 2026-03-24 14:56 ` Joel Fernandes 2026-03-24 14:56 ` Alexei Starovoitov 1 sibling, 2 replies; 100+ messages in thread From: Frederic Weisbecker @ 2026-03-24 11:27 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang Le Fri, Mar 20, 2026 at 03:29:16PM -0700, Boqun Feng a écrit : > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > happen basically everywhere (including where a scheduler lock is held), > call_srcu() now needs to avoid acquiring scheduler lock because > otherwise it could cause deadlock [1]. Fix this by following what the > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > the work to start process_srcu(). > > [boqun: Apply Joel's feedback] > [boqun: Apply Andrea's test feedback] > > Reported-by: Andrea Righi <arighi@nvidia.com> > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > Suggested-by: Zqiang <qiang.zhang@linux.dev> > Tested-by: Andrea Righi <arighi@nvidia.com> > Signed-off-by: Boqun Feng <boqun@kernel.org> I have the feeling that this problem should be solved at the BPF level. Tracepoints can fire at any time, in that sense they are like NMIs, and NMIs shouldn't acquire locks, let alone call call_rcu_*() BPF should arrange for delaying such operations to more appropriate contexts. 
I understand this is a regression triggered by an RCU change, but to me it reveals a hidden design issue rather than an API breakage. Thanks. -- Frederic Weisbecker SUSE Labs ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-24 11:27 ` Frederic Weisbecker @ 2026-03-24 14:56 ` Joel Fernandes 2026-03-24 14:56 ` Alexei Starovoitov 1 sibling, 0 replies; 100+ messages in thread From: Joel Fernandes @ 2026-03-24 14:56 UTC (permalink / raw) To: Frederic Weisbecker, Boqun Feng Cc: Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On 3/24/2026 7:27 AM, Frederic Weisbecker wrote: > Le Fri, Mar 20, 2026 at 03:29:16PM -0700, Boqun Feng a écrit : >> Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms >> of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can >> happen basically everywhere (including where a scheduler lock is held), >> call_srcu() now needs to avoid acquiring scheduler lock because >> otherwise it could cause deadlock [1]. Fix this by following what the >> previous RCU Tasks Trace did: using an irq_work to delay the queuing of >> the work to start process_srcu(). >> >> [boqun: Apply Joel's feedback] >> [boqun: Apply Andrea's test feedback] >> >> Reported-by: Andrea Righi <arighi@nvidia.com> >> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ >> Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") >> Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] >> Suggested-by: Zqiang <qiang.zhang@linux.dev> >> Tested-by: Andrea Righi <arighi@nvidia.com> >> Signed-off-by: Boqun Feng <boqun@kernel.org> > > I have the feeling that this problem should be solved at the BPF > level. Tracepoints can fire at any time, in that sense they are like NMIs, > and NMIs shouldn't acquire locks, let alone call call_rcu_*() > > BPF should arrange for delaying such operations to more appropriate contexts. 
> > I understand this is a regression trigerred by an RCU change but to me it > rather reveals a hidden design issue rather than an API breakage. Sure, but I think those who say "This was working before the RCU change" have a valid point. So in that sense, we ought to have this (possibly short-term) fix. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-24 11:27 ` Frederic Weisbecker 2026-03-24 14:56 ` Joel Fernandes @ 2026-03-24 14:56 ` Alexei Starovoitov 2026-03-24 17:36 ` Boqun Feng 1 sibling, 1 reply; 100+ messages in thread From: Alexei Starovoitov @ 2026-03-24 14:56 UTC (permalink / raw) To: Frederic Weisbecker Cc: Boqun Feng, Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, Neeraj upadhyay, Uladzislau Rezki, Boqun Feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Tue, Mar 24, 2026 at 4:27 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > Le Fri, Mar 20, 2026 at 03:29:16PM -0700, Boqun Feng a écrit : > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > happen basically everywhere (including where a scheduler lock is held), > > call_srcu() now needs to avoid acquiring scheduler lock because > > otherwise it could cause deadlock [1]. Fix this by following what the > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > the work to start process_srcu(). > > > > [boqun: Apply Joel's feedback] > > [boqun: Apply Andrea's test feedback] > > > > Reported-by: Andrea Righi <arighi@nvidia.com> > > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > Suggested-by: Zqiang <qiang.zhang@linux.dev> > > Tested-by: Andrea Righi <arighi@nvidia.com> > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > I have the feeling that this problem should be solved at the BPF > level. 
Tracepoints can fire at any time, in that sense they are like NMIs, > and NMIs shouldn't acquire locks, let alone call call_rcu_*() > > BPF should arrange for delaying such operations to more appropriate contexts. > > I understand this is a regression trigerred by an RCU change but to me it > rather reveals a hidden design issue rather than an API breakage. You all are still missing that rcu_tasks_trace was developed exclusively for bpf with bpf requirements. Then srcu_fast was introduced, and it looked like task_trace could be replaced with srcu_fast, and that's where the problems were discovered. So either task_trace needs to be resurrected or srcu_fast needs to be fixed. "Let's punt to the bpf subsystem" isn't an option. At this rate we will have rcu_bpf. Which rcu_tasks_trace effectively was. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-24 14:56 ` Alexei Starovoitov @ 2026-03-24 17:36 ` Boqun Feng 2026-03-24 18:40 ` Joel Fernandes 2026-03-24 19:23 ` Paul E. McKenney 0 siblings, 2 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-24 17:36 UTC (permalink / raw) To: Alexei Starovoitov Cc: Frederic Weisbecker, Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, Neeraj upadhyay, Uladzislau Rezki, Boqun Feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Tue, Mar 24, 2026 at 07:56:44AM -0700, Alexei Starovoitov wrote: > On Tue, Mar 24, 2026 at 4:27 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > Le Fri, Mar 20, 2026 at 03:29:16PM -0700, Boqun Feng a écrit : > > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > > happen basically everywhere (including where a scheduler lock is held), > > > call_srcu() now needs to avoid acquiring scheduler lock because > > > otherwise it could cause deadlock [1]. Fix this by following what the > > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > > the work to start process_srcu(). > > > > > > [boqun: Apply Joel's feedback] > > > [boqun: Apply Andrea's test feedback] > > > > > > Reported-by: Andrea Righi <arighi@nvidia.com> > > > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > > Suggested-by: Zqiang <qiang.zhang@linux.dev> > > > Tested-by: Andrea Righi <arighi@nvidia.com> > > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > > > I have the feeling that this problem should be solved at the BPF > > level. 
Tracepoints can fire at any time, in that sense they are like NMIs, > > and NMIs shouldn't acquire locks, let alone call call_rcu_*() > > > > BPF should arrange for delaying such operations to more appropriate contexts. > > > > I understand this is a regression trigerred by an RCU change but to me it > > rather reveals a hidden design issue rather than an API breakage. > > You all are still missing that rcu_tasks_trace was developed > exclusively for bpf with bpf requirements. I'm missing this for sure. But what would be a better design? BPF is also calling normal RCU (via call_rcu()) as well, and I don't think we can say normal call_rcu() was designed exclusively for BPF. Plus BPF heavily uses irq_work to avoid deadlocks but irq_work_queue() itself has a tracepoint in it (trace_ipi_send_cpu()), so in theory you could hit a deadlock there too. > Then srcu_fast was introduced and then it looked like that > task_trace can be replaced with srcu_fast and that's where the problems > discovered. So either task_trace need to be resurrected > or srcu_fast needs to be fixed. > Let's punt to bpf subsystem isn't an option. I don't think we are suggesting that, i.e. "BPF should fix its own issue", at least I myself was hoping that we can redesign the APIs that BPF relies on and make it clear that it's for BPF and we can in theory avoid all the deadlocks (we will probably have to make some primitives non-traceable) and BPF can use it combined with other general synchronization primitives. The current approach seems to me to be just whack-a-mole whenever an issue happens, and it's not a systematic solution. (It doesn't have to be a problem, it could be an opportunity ;-)) Regards, Boqun > At this rate we will have rcu_bpf. Which rcu_tasks_trace effectively was. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-24 17:36 ` Boqun Feng @ 2026-03-24 18:40 ` Joel Fernandes 2026-03-24 19:23 ` Paul E. McKenney 1 sibling, 0 replies; 100+ messages in thread From: Joel Fernandes @ 2026-03-24 18:40 UTC (permalink / raw) To: Boqun Feng, Alexei Starovoitov Cc: Frederic Weisbecker, Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, Neeraj upadhyay, Uladzislau Rezki, Boqun Feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On 3/24/2026 1:36 PM, Boqun Feng wrote: > On Tue, Mar 24, 2026 at 07:56:44AM -0700, Alexei Starovoitov wrote: >> On Tue, Mar 24, 2026 at 4:27 AM Frederic Weisbecker <frederic@kernel.org> wrote: >>> >>> Le Fri, Mar 20, 2026 at 03:29:16PM -0700, Boqun Feng a écrit : >>>> Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms >>>> of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can >>>> happen basically everywhere (including where a scheduler lock is held), >>>> call_srcu() now needs to avoid acquiring scheduler lock because >>>> otherwise it could cause deadlock [1]. Fix this by following what the >>>> previous RCU Tasks Trace did: using an irq_work to delay the queuing of >>>> the work to start process_srcu(). >>>> >>>> [boqun: Apply Joel's feedback] >>>> [boqun: Apply Andrea's test feedback] >>>> >>>> Reported-by: Andrea Righi <arighi@nvidia.com> >>>> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ >>>> Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") >>>> Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] >>>> Suggested-by: Zqiang <qiang.zhang@linux.dev> >>>> Tested-by: Andrea Righi <arighi@nvidia.com> >>>> Signed-off-by: Boqun Feng <boqun@kernel.org> >>> >>> I have the feeling that this problem should be solved at the BPF >>> level. 
Tracepoints can fire at any time, in that sense they are like NMIs, >>> and NMIs shouldn't acquire locks, let alone call call_rcu_*() >>> >>> BPF should arrange for delaying such operations to more appropriate contexts. >>> >>> I understand this is a regression trigerred by an RCU change but to me it >>> rather reveals a hidden design issue rather than an API breakage. >> >> You all are still missing that rcu_tasks_trace was developed >> exclusively for bpf with bpf requirements. > > I'm missing this for sure. But what would be a better design? > > BPF is also calling normal RCU (via call_rcu()) as well, and I don't > think we can say normal call_rcu() was designed exclusively for BPF. > Plus BPF heavily uses irq_work to avoid deadlocks but irq_work_queue() > itself has a tracepoint in it (trace_ipi_send_cpu()), so in theory you > could hit a deadlock there too. > >> Then srcu_fast was introduced and then it looked like that >> task_trace can be replaced with srcu_fast and that's where the problems >> discovered. So either task_trace need to be resurrected >> or srcu_fast needs to be fixed. >> Let's punt to bpf subsystem isn't an option. > > I don't think we are suggesting that, i.e. "BPF should fix its own > issue", at least myself was hoping that we can redesign the APIs that > BPF relies on and make it clear that it's for BPF and we can in theory > avoid all the deadlocks (we will probably have to make some primitives > non-traceable) and BPF can use it combined with other general > synchronization primitives. > > The current approach seems to me that we just whack-a-mole when an issue > happens, and it's not a systematic solution. > > (It doesn't have to be a problem, it could be an opportunity ;-)) The advantage of the "whack-a-mole" approach is that perhaps totally benign BPF use cases (not even tracing-related) where not using irq_work is just fine come out ahead (no pointless self-IPIs or other irq_work overhead, low latency). 
For the best trade-offs, we probably need to deal with it on a case-by-case basis rather than irq-work'ing everything indiscriminately. That said, I'm not seeing 20 reports or anything like that of this, so perhaps for now "case by case" is winning. One argument could be that BPF is supposed to be SAFE, so anything it does, like map operations that rely on kernel APIs that take locks, should probably be done with a lot of care and extra precautions... simply because you can insert a BPF program at any tracepoint. Not sure where the line is; BPF is kind of an interesting use case in that it wants to be powerful (run in kernel context) but also safe/verified at the same time. my 2 c, thanks, -- Joel Fernandes ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH v2] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-24 17:36 ` Boqun Feng 2026-03-24 18:40 ` Joel Fernandes @ 2026-03-24 19:23 ` Paul E. McKenney 1 sibling, 0 replies; 100+ messages in thread From: Paul E. McKenney @ 2026-03-24 19:23 UTC (permalink / raw) To: Boqun Feng Cc: Alexei Starovoitov, Frederic Weisbecker, Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, Neeraj upadhyay, Uladzislau Rezki, Boqun Feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Tue, Mar 24, 2026 at 10:36:08AM -0700, Boqun Feng wrote: > On Tue, Mar 24, 2026 at 07:56:44AM -0700, Alexei Starovoitov wrote: > > On Tue, Mar 24, 2026 at 4:27 AM Frederic Weisbecker <frederic@kernel.org> wrote: > > > > > > Le Fri, Mar 20, 2026 at 03:29:16PM -0700, Boqun Feng a écrit : > > > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > > > happen basically everywhere (including where a scheduler lock is held), > > > > call_srcu() now needs to avoid acquiring scheduler lock because > > > > otherwise it could cause deadlock [1]. Fix this by following what the > > > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > > > the work to start process_srcu(). 
> > > > > > > > [boqun: Apply Joel's feedback] > > > > [boqun: Apply Andrea's test feedback] > > > > > > > > Reported-by: Andrea Righi <arighi@nvidia.com> > > > > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > > > Suggested-by: Zqiang <qiang.zhang@linux.dev> > > > > Tested-by: Andrea Righi <arighi@nvidia.com> > > > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > > > > > I have the feeling that this problem should be solved at the BPF > > > level. Tracepoints can fire at any time, in that sense they are like NMIs, > > > and NMIs shouldn't acquire locks, let alone call call_rcu_*() > > > > > > BPF should arrange for delaying such operations to more appropriate contexts. > > > > > > I understand this is a regression trigerred by an RCU change but to me it > > > rather reveals a hidden design issue rather than an API breakage. > > > > You all are still missing that rcu_tasks_trace was developed > > exclusively for bpf with bpf requirements. > > I'm missing this for sure. But what would be a better design? > > BPF is also calling normal RCU (via call_rcu()) as well, and I don't > think we can say normal call_rcu() was designed exclusively for BPF. > Plus BPF heavily uses irq_work to avoid deadlocks but irq_work_queue() > itself has a tracepoint in it (trace_ipi_send_cpu()), so in theory you > could hit a deadlock there too. My fault, I should have seen this one coming. But here I am. > > Then srcu_fast was introduced and then it looked like that > > task_trace can be replaced with srcu_fast and that's where the problems > > discovered. So either task_trace need to be resurrected > > or srcu_fast needs to be fixed. > > Let's punt to bpf subsystem isn't an option. > > I don't think we are suggesting that, i.e. 
"BPF should fix its own > issue", at least myself was hoping that we can redesign the APIs that > BPF relies on and make it clear that it's for BPF and we can in theory > avoid all the deadlocks (we will probably have to make some primitives > non-traceable) and BPF can use it combined with other general > synchronization primitives. > > The current approach seems to me that we just whack-a-mole when an issue > happens, and it's not a systematic solution. > > (It doesn't have to be a problem, it could be an opportunity ;-)) I am hoping that we don't have too many more moles to whack. Famous last words... ;-) Thanx, Paul > Regards, > Boqun > > > At this rate we will have rcu_bpf. Which rcu_tasks_trace effectively was. ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-20 18:14 ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng 2026-03-20 19:18 ` Joel Fernandes 2026-03-20 20:47 ` Andrea Righi @ 2026-03-21 4:27 ` Zqiang 2026-03-21 18:15 ` Boqun Feng 2026-03-21 10:10 ` Paul E. McKenney 3 siblings, 1 reply; 100+ messages in thread From: Zqiang @ 2026-03-21 4:27 UTC (permalink / raw) To: Boqun Feng, Joel Fernandes, Paul E. McKenney Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Boqun Feng, Andrea Righi > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > happen basically everywhere (including where a scheduler lock is held), > call_srcu() now needs to avoid acquiring scheduler lock because > otherwise it could cause deadlock [1]. Fix this by following what the > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > the work to start process_srcu(). > > [boqun: Apply Joel's feedback] > > Reported-by: Andrea Righi <arighi@nvidia.com> > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > Suggested-by: Zqiang <qiang.zhang@linux.dev> > Signed-off-by: Boqun Feng <boqun@kernel.org> > --- > @Zqiang, I put your name as Suggested-by because you proposed the same > idea, let me know if you rather not have it. Thanks Boqun add me to Suggested-by :) . > > @Joel, I did two updates (including your test feedback, other one is > call irq_work_sync() when we clean the srcu_struct), please give it a > try. 
> > include/linux/srcutree.h | 1 + > kernel/rcu/srcutree.c | 29 +++++++++++++++++++++++++++-- > 2 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > index dfb31d11ff05..be76fa4fc170 100644 > --- a/include/linux/srcutree.h > +++ b/include/linux/srcutree.h > @@ -95,6 +95,7 @@ struct srcu_usage { > unsigned long reschedule_jiffies; > unsigned long reschedule_count; > struct delayed_work work; > + struct irq_work irq_work; > struct srcu_struct *srcu_ssp; > }; > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 2328827f8775..73aef361a524 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -19,6 +19,7 @@ > #include <linux/mutex.h> > #include <linux/percpu.h> > #include <linux/preempt.h> > +#include <linux/irq_work.h> > #include <linux/rcupdate_wait.h> > #include <linux/sched.h> > #include <linux/smp.h> > @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done; > static void srcu_invoke_callbacks(struct work_struct *work); > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > static void process_srcu(struct work_struct *work); > +static void srcu_irq_work(struct irq_work *work); > static void srcu_delay_timer(struct timer_list *t); > > /* > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > + init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work); > ssp->srcu_sup->sda_is_static = is_static; > if (!is_static) { > ssp->sda = alloc_percpu(struct srcu_data); > @@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp) > return; /* Just leak it! */ > if (WARN_ON(srcu_readers_active(ssp))) > return; /* Just leak it! */ > + /* Wait for irq_work to finish first as it may queue a new work. 
*/ > + irq_work_sync(&sup->irq_work); > flush_delayed_work(&sup->work); > for_each_possible_cpu(cpu) { > struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu); > @@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, The following should also be replaced, although under normal situation, we wouldn't go here: if (snp == snp_leaf && snp_seq != s) { srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0); return; } Thanks Zqiang > // it isn't. And it does not have to be. After all, it > // can only be executed during early boot when there is only > // the one boot CPU running with interrupts still disabled. > + // > + // Use an irq_work here to avoid acquiring runqueue lock with > + // srcu rcu_node::lock held. BPF instrument could introduce the > + // opposite dependency, hence we need to break the possible > + // locking dependency here. > if (likely(srcu_init_done)) > - queue_delayed_work(rcu_gp_wq, &sup->work, > - !!srcu_get_delay(ssp)); > + irq_work_queue(&sup->irq_work); > else if (list_empty(&sup->work.work.entry)) > list_add(&sup->work.work.entry, &srcu_boot_list); > } > @@ -1979,6 +1988,22 @@ static void process_srcu(struct work_struct *work) > srcu_reschedule(ssp, curdelay); > } > > +static void srcu_irq_work(struct irq_work *work) > +{ > + struct srcu_struct *ssp; > + struct srcu_usage *sup; > + unsigned long delay; > + > + sup = container_of(work, struct srcu_usage, irq_work); > + ssp = sup->srcu_ssp; > + > + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); > + delay = srcu_get_delay(ssp); > + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); > + > + queue_delayed_work(rcu_gp_wq, &sup->work, !!delay); > +} > + > void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags, > unsigned long *gp_seq) > { > -- > 2.50.1 (Apple Git-155) > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 4:27 ` [PATCH] " Zqiang @ 2026-03-21 18:15 ` Boqun Feng 0 siblings, 0 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-21 18:15 UTC (permalink / raw) To: Zqiang Cc: Joel Fernandes, Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi On Sat, Mar 21, 2026 at 04:27:02AM +0000, Zqiang wrote: > > > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > happen basically everywhere (including where a scheduler lock is held), > > call_srcu() now needs to avoid acquiring scheduler lock because > > otherwise it could cause deadlock [1]. Fix this by following what the > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > the work to start process_srcu(). > > > > [boqun: Apply Joel's feedback] > > > > Reported-by: Andrea Righi <arighi@nvidia.com> > > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > Suggested-by: Zqiang <qiang.zhang@linux.dev> > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > --- > > @Zqiang, I put your name as Suggested-by because you proposed the same > > idea, let me know if you rather not have it. > > Thanks Boqun add me to Suggested-by :) . > No problem. > > > > @Joel, I did two updates (including your test feedback, other one is > > call irq_work_sync() when we clean the srcu_struct), please give it a > > try. 
> > > > include/linux/srcutree.h | 1 + > > kernel/rcu/srcutree.c | 29 +++++++++++++++++++++++++++-- > > 2 files changed, 28 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > > index dfb31d11ff05..be76fa4fc170 100644 > > --- a/include/linux/srcutree.h > > +++ b/include/linux/srcutree.h > > @@ -95,6 +95,7 @@ struct srcu_usage { > > unsigned long reschedule_jiffies; > > unsigned long reschedule_count; > > struct delayed_work work; > > + struct irq_work irq_work; > > struct srcu_struct *srcu_ssp; > > }; > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 2328827f8775..73aef361a524 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -19,6 +19,7 @@ > > #include <linux/mutex.h> > > #include <linux/percpu.h> > > #include <linux/preempt.h> > > +#include <linux/irq_work.h> > > #include <linux/rcupdate_wait.h> > > #include <linux/sched.h> > > #include <linux/smp.h> > > @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done; > > static void srcu_invoke_callbacks(struct work_struct *work); > > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > > static void process_srcu(struct work_struct *work); > > +static void srcu_irq_work(struct irq_work *work); > > static void srcu_delay_timer(struct timer_list *t); > > > > /* > > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > > + init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work); > > ssp->srcu_sup->sda_is_static = is_static; > > if (!is_static) { > > ssp->sda = alloc_percpu(struct srcu_data); > > @@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp) > > return; /* Just leak it! */ > > if (WARN_ON(srcu_readers_active(ssp))) > > return; /* Just leak it! 
*/ > > + /* Wait for irq_work to finish first as it may queue a new work. */ > > + irq_work_sync(&sup->irq_work); > > flush_delayed_work(&sup->work); > > for_each_possible_cpu(cpu) { > > struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu); > > @@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > The following should also be replaced, although under normal situations, > we wouldn't go here: > > if (snp == snp_leaf && snp_seq != s) { > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0); > return; > } > Sigh, another mole to whack... this one is less fatal since we don't call it with rcu node lock: raw_spin_unlock_irqrestore_rcu_node(snp, flags); if (snp == snp_leaf && snp_seq != s) { srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0); return; } but the operation is per srcu_data, so we may need a per-srcu_data irq_work (a hacky way is that we compare with rcu_tasks_trace_srcu_struct and have only one percpu irq_work for rcu_tasks_trace_srcu_struct). Note that if we make the delay always > 0, then we can dodge the pi->lock (or the pool->lock as we recently discovered). But we will still have a timer base lock in call_srcu(). It depends on whether this is considered a bug by BPF (we have the issue in v6.19 as well, see [1]). If [1] is not considered a bug, then I think we can just fix the issue with an always-positive delay. Otherwise, bring your mallet, we may have more moles to whack. ;-) [1]: https://lore.kernel.org/rcu/20260321170321.32257-1-boqun@kernel.org/ Regards, Boqun > Thanks > Zqiang > > > > > > // it isn't. And it does not have to be. After all, it > > // can only be executed during early boot when there is only > > // the one boot CPU running with interrupts still disabled. > > + // > > + // Use an irq_work here to avoid acquiring runqueue lock with > > + // srcu rcu_node::lock held. 
BPF instrument could introduce the > > + // opposite dependency, hence we need to break the possible > > + // locking dependency here. > > if (likely(srcu_init_done)) > > - queue_delayed_work(rcu_gp_wq, &sup->work, > > - !!srcu_get_delay(ssp)); > > + irq_work_queue(&sup->irq_work); > > else if (list_empty(&sup->work.work.entry)) > > list_add(&sup->work.work.entry, &srcu_boot_list); > > } > > @@ -1979,6 +1988,22 @@ static void process_srcu(struct work_struct *work) > > srcu_reschedule(ssp, curdelay); > > } > > > > +static void srcu_irq_work(struct irq_work *work) > > +{ > > + struct srcu_struct *ssp; > > + struct srcu_usage *sup; > > + unsigned long delay; > > + > > + sup = container_of(work, struct srcu_usage, irq_work); > > + ssp = sup->srcu_ssp; > > + > > + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); > > + delay = srcu_get_delay(ssp); > > + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); > > + > > + queue_delayed_work(rcu_gp_wq, &sup->work, !!delay); > > +} > > + > > void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags, > > unsigned long *gp_seq) > > { > > -- > > 2.50.1 (Apple Git-155) > > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-20 18:14 ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng ` (2 preceding siblings ...) 2026-03-21 4:27 ` [PATCH] " Zqiang @ 2026-03-21 10:10 ` Paul E. McKenney 2026-03-21 17:15 ` Boqun Feng 3 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-21 10:10 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Fri, Mar 20, 2026 at 11:14:00AM -0700, Boqun Feng wrote: > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > happen basically everywhere (including where a scheduler lock is held), > call_srcu() now needs to avoid acquiring scheduler lock because > otherwise it could cause deadlock [1]. Fix this by following what the > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > the work to start process_srcu(). > > [boqun: Apply Joel's feedback] > > Reported-by: Andrea Righi <arighi@nvidia.com> > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > Suggested-by: Zqiang <qiang.zhang@linux.dev> > Signed-off-by: Boqun Feng <boqun@kernel.org> First, thank you all for putting this together! 
If I enable both early boot RCU testing and lockdep, for example, by running the RUDE01 rcutorture scenario, I get the following splat, which suggests that the raw_spin_unlock_irq_rcu_node() in srcu_irq_work() might need help (see inline below): [ 0.872594] ------------[ cut here ]------------ [ 0.873550] DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context()) [ 0.873550] WARNING: kernel/locking/lockdep.c:4404 at lockdep_hardirqs_on_prepare+0x150/0x190, CPU#0: swapper/0/1 [ 0.873550] Modules linked in: [ 0.873550] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc3-00039-g35d354b6cd0f-dirty #8217 PREEMPT(full) [ 0.873550] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 0.873550] RIP: 0010:lockdep_hardirqs_on_prepare+0x157/0x190 [ 0.873550] Code: 01 90 e8 ec c7 54 00 85 c0 74 0a 8b 35 12 c4 e0 01 85 f6 74 31 90 5d c3 cc cc cc cc 48 8d 3d 20 cf e1 01 48 c7 c6 ec c1 87 9f <67> 48 0f b9 3a eb ac 48 8d 3d 1b cf e1 01 48 c7 c6 87 be 87 9f 67 [ 0.873550] RSP: 0000:ffff9ff3c0003f50 EFLAGS: 00010046 [ 0.873550] RAX: 0000000000000001 RBX: ffffffff9fb608f8 RCX: 0000000000000001 [ 0.873550] RDX: 0000000000000000 RSI: ffffffff9f87c1ec RDI: ffffffff9fd44120 [ 0.873550] RBP: ffffffff9f00bae3 R08: 0000000000000001 R09: 0000000000000000 [ 0.873550] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 0.873550] R13: ffff9e1bc11e4bc0 R14: 0000000000000000 R15: 0000000000000000 [ 0.873550] FS: 0000000000000000(0000) GS:ffff9e1c3ed9b000(0000) knlGS:0000000000000000 [ 0.873550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.873550] CR2: ffff9e1bcf3d8000 CR3: 000000000dc4e000 CR4: 00000000000006f0 [ 0.873550] Call Trace: [ 0.873550] <IRQ> [ 0.873550] ? 
_raw_spin_unlock_irq+0x23/0x40 [ 0.873550] trace_hardirqs_on+0x16/0xe0 [ 0.873550] _raw_spin_unlock_irq+0x23/0x40 [ 0.873550] srcu_irq_work+0x5e/0x90 [ 0.873550] irq_work_single+0x42/0x90 [ 0.873550] irq_work_run_list+0x26/0x40 [ 0.873550] irq_work_run+0x18/0x30 [ 0.873550] __sysvec_irq_work+0x30/0x180 [ 0.873550] sysvec_irq_work+0x6a/0x80 [ 0.873550] </IRQ> [ 0.873550] <TASK> [ 0.873550] asm_sysvec_irq_work+0x1a/0x20 [ 0.873550] RIP: 0010:_raw_spin_unlock_irqrestore+0x34/0x50 [ 0.873550] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 06 d4 f1 fe 48 89 ef e8 2e 0c f2 fe 80 e7 02 74 06 e8 74 86 01 ff fb 65 ff 0d ec d4 66 01 <74> 07 5b 5d e9 53 1b 00 00 e8 5e 94 df fe 5b 5d e9 47 1b 00 00 0f [ 0.873550] RSP: 0000:ffff9ff3c0013d50 EFLAGS: 00000286 [ 0.873550] RAX: 0000000000001417 RBX: 0000000000000297 RCX: 0000000000000000 [ 0.873550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00bb3c [ 0.873550] RBP: ffffffff9fb60630 R08: 0000000000000001 R09: 0000000000000000 [ 0.873550] R10: 0000000000000001 R11: 0000000000000001 R12: fffffffffffffe74 [ 0.873550] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 [ 0.873550] ? _raw_spin_unlock_irqrestore+0x2c/0x50 [ 0.873550] srcu_gp_start_if_needed+0x354/0x530 [ 0.873550] __synchronize_srcu+0xd1/0x180 [ 0.873550] ? __pfx_wakeme_after_rcu+0x10/0x10 [ 0.873550] ? synchronize_srcu+0x3f/0x170 [ 0.873550] ? __pfx_rcu_init_tasks_generic+0x10/0x10 [ 0.873550] rcu_init_tasks_generic+0x10c/0x130 [ 0.873550] do_one_initcall+0x59/0x2e0 [ 0.873550] ? _printk+0x56/0x70 [ 0.873550] kernel_init_freeable+0x227/0x440 [ 0.873550] ? __pfx_kernel_init+0x10/0x10 [ 0.873550] kernel_init+0x15/0x1c0 [ 0.873550] ret_from_fork+0x2ac/0x330 [ 0.873550] ? 
__pfx_kernel_init+0x10/0x10 [ 0.873550] ret_from_fork_asm+0x1a/0x30 [ 0.873550] </TASK> [ 0.873550] irq event stamp: 5144 [ 0.873550] hardirqs last enabled at (5143): [<ffffffff9f00bb3c>] _raw_spin_unlock_irqrestore+0x2c/0x50 [ 0.873550] hardirqs last disabled at (5144): [<ffffffff9eff7c9f>] sysvec_irq_work+0xf/0x80 [ 0.873550] softirqs last enabled at (5132): [<ffffffff9dea2501>] __irq_exit_rcu+0xa1/0xc0 [ 0.873550] softirqs last disabled at (5127): [<ffffffff9dea2501>] __irq_exit_rcu+0xa1/0xc0 [ 0.873550] ---[ end trace 0000000000000000 ]--- [ 0.873574] ------------[ cut here ]------------ > --- > @Zqiang, I put your name as Suggested-by because you proposed the same > idea, let me know if you rather not have it. > > @Joel, I did two updates (including your test feedback, other one is > call irq_work_sync() when we clean the srcu_struct), please give it a > try. > > include/linux/srcutree.h | 1 + > kernel/rcu/srcutree.c | 29 +++++++++++++++++++++++++++-- > 2 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > index dfb31d11ff05..be76fa4fc170 100644 > --- a/include/linux/srcutree.h > +++ b/include/linux/srcutree.h > @@ -95,6 +95,7 @@ struct srcu_usage { > unsigned long reschedule_jiffies; > unsigned long reschedule_count; > struct delayed_work work; > + struct irq_work irq_work; > struct srcu_struct *srcu_ssp; > }; > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 2328827f8775..73aef361a524 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -19,6 +19,7 @@ > #include <linux/mutex.h> > #include <linux/percpu.h> > #include <linux/preempt.h> > +#include <linux/irq_work.h> > #include <linux/rcupdate_wait.h> > #include <linux/sched.h> > #include <linux/smp.h> > @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done; > static void srcu_invoke_callbacks(struct work_struct *work); > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > 
static void process_srcu(struct work_struct *work); > +static void srcu_irq_work(struct irq_work *work); > static void srcu_delay_timer(struct timer_list *t); > > /* > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > + init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work); > ssp->srcu_sup->sda_is_static = is_static; > if (!is_static) { > ssp->sda = alloc_percpu(struct srcu_data); > @@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp) > return; /* Just leak it! */ > if (WARN_ON(srcu_readers_active(ssp))) > return; /* Just leak it! */ > + /* Wait for irq_work to finish first as it may queue a new work. */ > + irq_work_sync(&sup->irq_work); > flush_delayed_work(&sup->work); > for_each_possible_cpu(cpu) { > struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu); > @@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > // it isn't. And it does not have to be. After all, it > // can only be executed during early boot when there is only > // the one boot CPU running with interrupts still disabled. > + // > + // Use an irq_work here to avoid acquiring runqueue lock with > + // srcu rcu_node::lock held. BPF instrument could introduce the > + // opposite dependency, hence we need to break the possible > + // locking dependency here. 
> if (likely(srcu_init_done)) > - queue_delayed_work(rcu_gp_wq, &sup->work, > - !!srcu_get_delay(ssp)); > + irq_work_queue(&sup->irq_work); > else if (list_empty(&sup->work.work.entry)) > list_add(&sup->work.work.entry, &srcu_boot_list); > } > @@ -1979,6 +1988,22 @@ static void process_srcu(struct work_struct *work) > srcu_reschedule(ssp, curdelay); > } > > +static void srcu_irq_work(struct irq_work *work) > +{ > + struct srcu_struct *ssp; > + struct srcu_usage *sup; > + unsigned long delay; > + > + sup = container_of(work, struct srcu_usage, irq_work); > + ssp = sup->srcu_ssp; > + > + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); > + delay = srcu_get_delay(ssp); > + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); Removing the "_irq" from both avoids the lockdep splat in my test setup, which makes sense given that interrupts are disabled in irq_work handlers. Or at least it looks to me that they are. ;-) Like this: + raw_spin_lock_rcu_node(ssp->srcu_sup); + delay = srcu_get_delay(ssp); + raw_spin_unlock_rcu_node(ssp->srcu_sup); Thanx, Paul > + > + queue_delayed_work(rcu_gp_wq, &sup->work, !!delay); > +} > + > void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags, > unsigned long *gp_seq) > { > -- > 2.50.1 (Apple Git-155) > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 10:10 ` Paul E. McKenney @ 2026-03-21 17:15 ` Boqun Feng 2026-03-21 17:41 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-21 17:15 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 03:10:05AM -0700, Paul E. McKenney wrote: > On Fri, Mar 20, 2026 at 11:14:00AM -0700, Boqun Feng wrote: > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > happen basically everywhere (including where a scheduler lock is held), > > call_srcu() now needs to avoid acquiring scheduler lock because > > otherwise it could cause deadlock [1]. Fix this by following what the > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > the work to start process_srcu(). > > > > [boqun: Apply Joel's feedback] > > > > Reported-by: Andrea Righi <arighi@nvidia.com> > > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > Suggested-by: Zqiang <qiang.zhang@linux.dev> > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > First, thank you all for putting this together! 
> > If I enable both early boot RCU testing and lockdep, for example, > by running the RUDE01 rcutorture scenario, I get the following splat, > which suggests that the raw_spin_unlock_irq_rcu_node() in srcu_irq_work() > might need help (see inline below): Yes, Andrea reported a similar issue: https://lore.kernel.org/rcu/ab2yd35rm6OgZUmb@gpd4/ > [ 0.872594] ------------[ cut here ]------------ > [ 0.873550] DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context()) > [ 0.873550] WARNING: kernel/locking/lockdep.c:4404 at lockdep_hardirqs_on_prepare+0x150/0x190, CPU#0: swapper/0/1 > [ 0.873550] Modules linked in: > [ 0.873550] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc3-00039-g35d354b6cd0f-dirty #8217 PREEMPT(full) > [ 0.873550] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > [ 0.873550] RIP: 0010:lockdep_hardirqs_on_prepare+0x157/0x190 > [ 0.873550] Code: 01 90 e8 ec c7 54 00 85 c0 74 0a 8b 35 12 c4 e0 01 85 f6 74 31 90 5d c3 cc cc cc cc 48 8d 3d 20 cf e1 01 48 c7 c6 ec c1 87 9f <67> 48 0f b9 3a eb ac 48 8d 3d 1b cf e1 01 48 c7 c6 87 be 87 9f 67 > [ 0.873550] RSP: 0000:ffff9ff3c0003f50 EFLAGS: 00010046 > [ 0.873550] RAX: 0000000000000001 RBX: ffffffff9fb608f8 RCX: 0000000000000001 > [ 0.873550] RDX: 0000000000000000 RSI: ffffffff9f87c1ec RDI: ffffffff9fd44120 > [ 0.873550] RBP: ffffffff9f00bae3 R08: 0000000000000001 R09: 0000000000000000 > [ 0.873550] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 > [ 0.873550] R13: ffff9e1bc11e4bc0 R14: 0000000000000000 R15: 0000000000000000 > [ 0.873550] FS: 0000000000000000(0000) GS:ffff9e1c3ed9b000(0000) knlGS:0000000000000000 > [ 0.873550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.873550] CR2: ffff9e1bcf3d8000 CR3: 000000000dc4e000 CR4: 00000000000006f0 > [ 0.873550] Call Trace: > [ 0.873550] <IRQ> > [ 0.873550] ? 
_raw_spin_unlock_irq+0x23/0x40 > [ 0.873550] trace_hardirqs_on+0x16/0xe0 > [ 0.873550] _raw_spin_unlock_irq+0x23/0x40 > [ 0.873550] srcu_irq_work+0x5e/0x90 > [ 0.873550] irq_work_single+0x42/0x90 > [ 0.873550] irq_work_run_list+0x26/0x40 > [ 0.873550] irq_work_run+0x18/0x30 > [ 0.873550] __sysvec_irq_work+0x30/0x180 > [ 0.873550] sysvec_irq_work+0x6a/0x80 > [ 0.873550] </IRQ> > [ 0.873550] <TASK> > [ 0.873550] asm_sysvec_irq_work+0x1a/0x20 > [ 0.873550] RIP: 0010:_raw_spin_unlock_irqrestore+0x34/0x50 > [ 0.873550] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 06 d4 f1 fe 48 89 ef e8 2e 0c f2 fe 80 e7 02 74 06 e8 74 86 01 ff fb 65 ff 0d ec d4 66 01 <74> 07 5b 5d e9 53 1b 00 00 e8 5e 94 df fe 5b 5d e9 47 1b 00 00 0f > [ 0.873550] RSP: 0000:ffff9ff3c0013d50 EFLAGS: 00000286 > [ 0.873550] RAX: 0000000000001417 RBX: 0000000000000297 RCX: 0000000000000000 > [ 0.873550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00bb3c > [ 0.873550] RBP: ffffffff9fb60630 R08: 0000000000000001 R09: 0000000000000000 > [ 0.873550] R10: 0000000000000001 R11: 0000000000000001 R12: fffffffffffffe74 > [ 0.873550] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 > [ 0.873550] ? _raw_spin_unlock_irqrestore+0x2c/0x50 > [ 0.873550] srcu_gp_start_if_needed+0x354/0x530 > [ 0.873550] __synchronize_srcu+0xd1/0x180 > [ 0.873550] ? __pfx_wakeme_after_rcu+0x10/0x10 > [ 0.873550] ? synchronize_srcu+0x3f/0x170 > [ 0.873550] ? __pfx_rcu_init_tasks_generic+0x10/0x10 > [ 0.873550] rcu_init_tasks_generic+0x10c/0x130 > [ 0.873550] do_one_initcall+0x59/0x2e0 > [ 0.873550] ? _printk+0x56/0x70 > [ 0.873550] kernel_init_freeable+0x227/0x440 > [ 0.873550] ? __pfx_kernel_init+0x10/0x10 > [ 0.873550] kernel_init+0x15/0x1c0 > [ 0.873550] ret_from_fork+0x2ac/0x330 > [ 0.873550] ? 
__pfx_kernel_init+0x10/0x10 > [ 0.873550] ret_from_fork_asm+0x1a/0x30 > [ 0.873550] </TASK> > [ 0.873550] irq event stamp: 5144 > [ 0.873550] hardirqs last enabled at (5143): [<ffffffff9f00bb3c>] _raw_spin_unlock_irqrestore+0x2c/0x50 > [ 0.873550] hardirqs last disabled at (5144): [<ffffffff9eff7c9f>] sysvec_irq_work+0xf/0x80 > [ 0.873550] softirqs last enabled at (5132): [<ffffffff9dea2501>] __irq_exit_rcu+0xa1/0xc0 > [ 0.873550] softirqs last disabled at (5127): [<ffffffff9dea2501>] __irq_exit_rcu+0xa1/0xc0 > [ 0.873550] ---[ end trace 0000000000000000 ]--- > [ 0.873574] ------------[ cut here ]------------ > > > --- > > @Zqiang, I put your name as Suggested-by because you proposed the same > > idea, let me know if you rather not have it. > > > > @Joel, I did two updates (including your test feedback, other one is > > call irq_work_sync() when we clean the srcu_struct), please give it a > > try. > > > > include/linux/srcutree.h | 1 + > > kernel/rcu/srcutree.c | 29 +++++++++++++++++++++++++++-- > > 2 files changed, 28 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > > index dfb31d11ff05..be76fa4fc170 100644 > > --- a/include/linux/srcutree.h > > +++ b/include/linux/srcutree.h > > @@ -95,6 +95,7 @@ struct srcu_usage { > > unsigned long reschedule_jiffies; > > unsigned long reschedule_count; > > struct delayed_work work; > > + struct irq_work irq_work; > > struct srcu_struct *srcu_ssp; > > }; > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 2328827f8775..73aef361a524 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -19,6 +19,7 @@ > > #include <linux/mutex.h> > > #include <linux/percpu.h> > > #include <linux/preempt.h> > > +#include <linux/irq_work.h> > > #include <linux/rcupdate_wait.h> > > #include <linux/sched.h> > > #include <linux/smp.h> > > @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done; > > static void srcu_invoke_callbacks(struct 
work_struct *work); > > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > > static void process_srcu(struct work_struct *work); > > +static void srcu_irq_work(struct irq_work *work); > > static void srcu_delay_timer(struct timer_list *t); > > > > /* > > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > > + init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work); > > ssp->srcu_sup->sda_is_static = is_static; > > if (!is_static) { > > ssp->sda = alloc_percpu(struct srcu_data); > > @@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp) > > return; /* Just leak it! */ > > if (WARN_ON(srcu_readers_active(ssp))) > > return; /* Just leak it! */ > > + /* Wait for irq_work to finish first as it may queue a new work. */ > > + irq_work_sync(&sup->irq_work); > > flush_delayed_work(&sup->work); > > for_each_possible_cpu(cpu) { > > struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu); > > @@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > // it isn't. And it does not have to be. After all, it > > // can only be executed during early boot when there is only > > // the one boot CPU running with interrupts still disabled. > > + // > > + // Use an irq_work here to avoid acquiring runqueue lock with > > + // srcu rcu_node::lock held. BPF instrument could introduce the > > + // opposite dependency, hence we need to break the possible > > + // locking dependency here. 
> > if (likely(srcu_init_done)) > > - queue_delayed_work(rcu_gp_wq, &sup->work, > > - !!srcu_get_delay(ssp)); > > + irq_work_queue(&sup->irq_work); > > else if (list_empty(&sup->work.work.entry)) > > list_add(&sup->work.work.entry, &srcu_boot_list); > > } > > @@ -1979,6 +1988,22 @@ static void process_srcu(struct work_struct *work) > > srcu_reschedule(ssp, curdelay); > > } > > > > +static void srcu_irq_work(struct irq_work *work) > > +{ > > + struct srcu_struct *ssp; > > + struct srcu_usage *sup; > > + unsigned long delay; > > + > > + sup = container_of(work, struct srcu_usage, irq_work); > > + ssp = sup->srcu_ssp; > > + > > + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); > > + delay = srcu_get_delay(ssp); > > + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); > > Removing the "_irq" from both avoids the lockdep splat in my test setup, > which makes sense given that interrupts are disabled in irq_work handlers. > Or at least it looks to me that they are. ;-) > > Like this: > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > + delay = srcu_get_delay(ssp); > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > It was fixed differently in v2: https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ I used _irqsave/_irqrestore just in case. Given it's an urgent fix, overly careful code is probably fine ;-) Thanks for the testing and feedback. Regards, Boqun > Thanx, Paul > > > + > > + queue_delayed_work(rcu_gp_wq, &sup->work, !!delay); > > +} > > + > > void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags, > > unsigned long *gp_seq) > > { > > -- > > 2.50.1 (Apple Git-155) > > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 17:15 ` Boqun Feng @ 2026-03-21 17:41 ` Paul E. McKenney 2026-03-21 18:06 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-21 17:41 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 10:15:47AM -0700, Boqun Feng wrote: > On Sat, Mar 21, 2026 at 03:10:05AM -0700, Paul E. McKenney wrote: > > On Fri, Mar 20, 2026 at 11:14:00AM -0700, Boqun Feng wrote: > > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > > happen basically everywhere (including where a scheduler lock is held), > > > call_srcu() now needs to avoid acquiring scheduler lock because > > > otherwise it could cause deadlock [1]. Fix this by following what the > > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > > the work to start process_srcu(). > > > > > > [boqun: Apply Joel's feedback] > > > > > > Reported-by: Andrea Righi <arighi@nvidia.com> > > > Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > > Suggested-by: Zqiang <qiang.zhang@linux.dev> > > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > > > First, thank you all for putting this together! 
> > > > If I enable both early boot RCU testing and lockdep, for example, > > by running the RUDE01 rcutorture scenario, I get the following splat, > > which suggests that the raw_spin_unlock_irq_rcu_node() in srcu_irq_work() > > might need help (see inline below): > > Yes, Andrea reported a similar issue: > > https://lore.kernel.org/rcu/ab2yd35rm6OgZUmb@gpd4/ > > > [ 0.872594] ------------[ cut here ]------------ > > [ 0.873550] DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context()) > > [ 0.873550] WARNING: kernel/locking/lockdep.c:4404 at lockdep_hardirqs_on_prepare+0x150/0x190, CPU#0: swapper/0/1 > > [ 0.873550] Modules linked in: > > [ 0.873550] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc3-00039-g35d354b6cd0f-dirty #8217 PREEMPT(full) > > [ 0.873550] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > > [ 0.873550] RIP: 0010:lockdep_hardirqs_on_prepare+0x157/0x190 > > [ 0.873550] Code: 01 90 e8 ec c7 54 00 85 c0 74 0a 8b 35 12 c4 e0 01 85 f6 74 31 90 5d c3 cc cc cc cc 48 8d 3d 20 cf e1 01 48 c7 c6 ec c1 87 9f <67> 48 0f b9 3a eb ac 48 8d 3d 1b cf e1 01 48 c7 c6 87 be 87 9f 67 > > [ 0.873550] RSP: 0000:ffff9ff3c0003f50 EFLAGS: 00010046 > > [ 0.873550] RAX: 0000000000000001 RBX: ffffffff9fb608f8 RCX: 0000000000000001 > > [ 0.873550] RDX: 0000000000000000 RSI: ffffffff9f87c1ec RDI: ffffffff9fd44120 > > [ 0.873550] RBP: ffffffff9f00bae3 R08: 0000000000000001 R09: 0000000000000000 > > [ 0.873550] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 > > [ 0.873550] R13: ffff9e1bc11e4bc0 R14: 0000000000000000 R15: 0000000000000000 > > [ 0.873550] FS: 0000000000000000(0000) GS:ffff9e1c3ed9b000(0000) knlGS:0000000000000000 > > [ 0.873550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 0.873550] CR2: ffff9e1bcf3d8000 CR3: 000000000dc4e000 CR4: 00000000000006f0 > > [ 0.873550] Call Trace: > > [ 0.873550] <IRQ> > > [ 0.873550] ? 
_raw_spin_unlock_irq+0x23/0x40 > > [ 0.873550] trace_hardirqs_on+0x16/0xe0 > > [ 0.873550] _raw_spin_unlock_irq+0x23/0x40 > > [ 0.873550] srcu_irq_work+0x5e/0x90 > > [ 0.873550] irq_work_single+0x42/0x90 > > [ 0.873550] irq_work_run_list+0x26/0x40 > > [ 0.873550] irq_work_run+0x18/0x30 > > [ 0.873550] __sysvec_irq_work+0x30/0x180 > > [ 0.873550] sysvec_irq_work+0x6a/0x80 > > [ 0.873550] </IRQ> > > [ 0.873550] <TASK> > > [ 0.873550] asm_sysvec_irq_work+0x1a/0x20 > > [ 0.873550] RIP: 0010:_raw_spin_unlock_irqrestore+0x34/0x50 > > [ 0.873550] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 06 d4 f1 fe 48 89 ef e8 2e 0c f2 fe 80 e7 02 74 06 e8 74 86 01 ff fb 65 ff 0d ec d4 66 01 <74> 07 5b 5d e9 53 1b 00 00 e8 5e 94 df fe 5b 5d e9 47 1b 00 00 0f > > [ 0.873550] RSP: 0000:ffff9ff3c0013d50 EFLAGS: 00000286 > > [ 0.873550] RAX: 0000000000001417 RBX: 0000000000000297 RCX: 0000000000000000 > > [ 0.873550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9f00bb3c > > [ 0.873550] RBP: ffffffff9fb60630 R08: 0000000000000001 R09: 0000000000000000 > > [ 0.873550] R10: 0000000000000001 R11: 0000000000000001 R12: fffffffffffffe74 > > [ 0.873550] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 > > [ 0.873550] ? _raw_spin_unlock_irqrestore+0x2c/0x50 > > [ 0.873550] srcu_gp_start_if_needed+0x354/0x530 > > [ 0.873550] __synchronize_srcu+0xd1/0x180 > > [ 0.873550] ? __pfx_wakeme_after_rcu+0x10/0x10 > > [ 0.873550] ? synchronize_srcu+0x3f/0x170 > > [ 0.873550] ? __pfx_rcu_init_tasks_generic+0x10/0x10 > > [ 0.873550] rcu_init_tasks_generic+0x10c/0x130 > > [ 0.873550] do_one_initcall+0x59/0x2e0 > > [ 0.873550] ? _printk+0x56/0x70 > > [ 0.873550] kernel_init_freeable+0x227/0x440 > > [ 0.873550] ? __pfx_kernel_init+0x10/0x10 > > [ 0.873550] kernel_init+0x15/0x1c0 > > [ 0.873550] ret_from_fork+0x2ac/0x330 > > [ 0.873550] ? 
__pfx_kernel_init+0x10/0x10 > > [ 0.873550] ret_from_fork_asm+0x1a/0x30 > > [ 0.873550] </TASK> > > [ 0.873550] irq event stamp: 5144 > > [ 0.873550] hardirqs last enabled at (5143): [<ffffffff9f00bb3c>] _raw_spin_unlock_irqrestore+0x2c/0x50 > > [ 0.873550] hardirqs last disabled at (5144): [<ffffffff9eff7c9f>] sysvec_irq_work+0xf/0x80 > > [ 0.873550] softirqs last enabled at (5132): [<ffffffff9dea2501>] __irq_exit_rcu+0xa1/0xc0 > > [ 0.873550] softirqs last disabled at (5127): [<ffffffff9dea2501>] __irq_exit_rcu+0xa1/0xc0 > > [ 0.873550] ---[ end trace 0000000000000000 ]--- > > [ 0.873574] ------------[ cut here ]------------ > > > > > --- > > > @Zqiang, I put your name as Suggested-by because you proposed the same > > > idea, let me know if you rather not have it. > > > > > > @Joel, I did two updates (including your test feedback, other one is > > > call irq_work_sync() when we clean the srcu_struct), please give it a > > > try. > > > > > > include/linux/srcutree.h | 1 + > > > kernel/rcu/srcutree.c | 29 +++++++++++++++++++++++++++-- > > > 2 files changed, 28 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > > > index dfb31d11ff05..be76fa4fc170 100644 > > > --- a/include/linux/srcutree.h > > > +++ b/include/linux/srcutree.h > > > @@ -95,6 +95,7 @@ struct srcu_usage { > > > unsigned long reschedule_jiffies; > > > unsigned long reschedule_count; > > > struct delayed_work work; > > > + struct irq_work irq_work; > > > struct srcu_struct *srcu_ssp; > > > }; > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 2328827f8775..73aef361a524 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -19,6 +19,7 @@ > > > #include <linux/mutex.h> > > > #include <linux/percpu.h> > > > #include <linux/preempt.h> > > > +#include <linux/irq_work.h> > > > #include <linux/rcupdate_wait.h> > > > #include <linux/sched.h> > > > #include <linux/smp.h> > > > @@ -75,6 +76,7 @@ 
static bool __read_mostly srcu_init_done; > > > static void srcu_invoke_callbacks(struct work_struct *work); > > > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > > > static void process_srcu(struct work_struct *work); > > > +static void srcu_irq_work(struct irq_work *work); > > > static void srcu_delay_timer(struct timer_list *t); > > > > > > /* > > > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > > > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > > > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > > > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > > > + init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work); > > > ssp->srcu_sup->sda_is_static = is_static; > > > if (!is_static) { > > > ssp->sda = alloc_percpu(struct srcu_data); > > > @@ -713,6 +716,8 @@ void cleanup_srcu_struct(struct srcu_struct *ssp) > > > return; /* Just leak it! */ > > > if (WARN_ON(srcu_readers_active(ssp))) > > > return; /* Just leak it! */ > > > + /* Wait for irq_work to finish first as it may queue a new work. */ > > > + irq_work_sync(&sup->irq_work); > > > flush_delayed_work(&sup->work); > > > for_each_possible_cpu(cpu) { > > > struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu); > > > @@ -1118,9 +1123,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > // it isn't. And it does not have to be. After all, it > > > // can only be executed during early boot when there is only > > > // the one boot CPU running with interrupts still disabled. > > > + // > > > + // Use an irq_work here to avoid acquiring runqueue lock with > > > + // srcu rcu_node::lock held. BPF instrument could introduce the > > > + // opposite dependency, hence we need to break the possible > > > + // locking dependency here. 
> > > if (likely(srcu_init_done)) > > > - queue_delayed_work(rcu_gp_wq, &sup->work, > > > - !!srcu_get_delay(ssp)); > > > + irq_work_queue(&sup->irq_work); > > > else if (list_empty(&sup->work.work.entry)) > > > list_add(&sup->work.work.entry, &srcu_boot_list); > > > } > > > @@ -1979,6 +1988,22 @@ static void process_srcu(struct work_struct *work) > > > srcu_reschedule(ssp, curdelay); > > > } > > > > > > +static void srcu_irq_work(struct irq_work *work) > > > +{ > > > + struct srcu_struct *ssp; > > > + struct srcu_usage *sup; > > > + unsigned long delay; > > > + > > > + sup = container_of(work, struct srcu_usage, irq_work); > > > + ssp = sup->srcu_ssp; > > > + > > > + raw_spin_lock_irq_rcu_node(ssp->srcu_sup); > > > + delay = srcu_get_delay(ssp); > > > + raw_spin_unlock_irq_rcu_node(ssp->srcu_sup); > > > > Removing the "_irq" from both avoids the lockdep splat in my test setup, > > which makes sense given that interrupts are disabled in irq_work handlers. > > Or at least it looks to me that they are. ;-) > > > > Like this: > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > + delay = srcu_get_delay(ssp); > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > It was fixed differently in v2: > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > overly careful code is probably fine ;-) > > Thanks for the testing and feedback. OK, I will try that one, thank you! FYI, with my change on your earlier version, SRCU-T got deadlocks between the pi-lock and the workqueue pool lock. Which might or might not be particularly urgent. Thanx, Paul > Regards, > Boqun > > > Thanx, Paul > > > > > + > > > + queue_delayed_work(rcu_gp_wq, &sup->work, !!delay); > > > +} > > > + > > > void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags, > > > unsigned long *gp_seq) > > > { > > > -- > > > 2.50.1 (Apple Git-155) > > > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 17:41 ` Paul E. McKenney @ 2026-03-21 18:06 ` Boqun Feng 2026-03-21 19:31 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-21 18:06 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: [...] > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > + delay = srcu_get_delay(ssp); > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > It was fixed differently in v2: > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > overly careful code is probably fine ;-) > > > > Thanks for the testing and feedback. > > OK, I will try that one, thank you! > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > the pi-lock and the workqueue pool lock. Which might or might not be > particularly urgent. > I just checked my run yesterday, I also hit it. It's probably what Zqiang has found: https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ We have a queue_work_on() in srcu_schedule_cbs_sdp(), so srcu_torture_deferred_free(): raw_spin_lock_irqsave(->pi_lock,...); call_srcu(): if (snp == snp_leaf && snp_seq != s) { srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): if (!delay) queue_work_on(...) I was about to reply to Zqiang, fixing that could be a tough design decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work incoming. Regards, Boqun > Thanx, Paul > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 18:06 ` Boqun Feng @ 2026-03-21 19:31 ` Paul E. McKenney 2026-03-21 19:45 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-21 19:31 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > [...] > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > + delay = srcu_get_delay(ssp); > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > It was fixed differently in v2: > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > overly careful code is probably fine ;-) > > > > > > Thanks for the testing and feedback. > > > > OK, I will try that one, thank you! > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > the pi-lock and the workqueue pool lock. Which might or might not be > > particularly urgent. > > > > I just checked my run yesterday, I also hit it. It's probably what > Zqiang has found: > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > srcu_torture_deferred_free(): > raw_spin_lock_irqsave(->pi_lock,...); > call_srcu(): > if (snp == snp_leaf && snp_seq != s) { > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > if (!delay) > queue_work_on(...) > > I was about to reply to Zqiang, fixing that could be a touch design > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > incoming. Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. 
So perhaps lower priority, though perhaps not lower irritation. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 19:31 ` Paul E. McKenney @ 2026-03-21 19:45 ` Boqun Feng 2026-03-21 20:07 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-21 19:45 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > [...] > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > + delay = srcu_get_delay(ssp); > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > overly careful code is probably fine ;-) > > > > > > > > Thanks for the testing and feedback. > > > > > > OK, I will try that one, thank you! > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > particularly urgent. > > > > > > > I just checked my run yesterday, I also hit it. It's probably what > > Zqiang has found: > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > srcu_torture_deferred_free(): > > raw_spin_lock_irqsave(->pi_lock,...); > > call_srcu(): > > if (snp == snp_leaf && snp_seq != s) { > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > if (!delay) > > queue_work_on(...) 
> > > > I was about to reply to Zqiang, fixing that could be a touch design > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > incoming. > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > lower priority, though perhaps not lower irritation. ;-) > I see, there is a schedule_work() in srcutiny's srcu_gp_start_if_needed(). But it couldn't cause a deadlock on UP since locks are (almost) no-ops. Maybe we can make rcutorture test it only on SMP? Regards, Boqun > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 19:45 ` Boqun Feng @ 2026-03-21 20:07 ` Paul E. McKenney 2026-03-21 20:08 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-21 20:07 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > [...] > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > + delay = srcu_get_delay(ssp); > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > overly careful code is probably fine ;-) > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > OK, I will try that one, thank you! > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > particularly urgent. > > > > > > > > > > I just checked my run yesterday, I also hit it. It's probably what > > > Zqiang has found: > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > srcu_torture_deferred_free(): > > > raw_spin_lock_irqsave(->pi_lock,...); > > > call_srcu(): > > > if (snp == snp_leaf && snp_seq != s) { > > > srcu_schedule_cbs_sdp(sdp, do_norm ? 
SRCU_INTERVAL : 0): > > > if (!delay) > > > queue_work_on(...) > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > incoming. > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > lower priority, though perhaps not lower irritation. ;-) > > I see, there is a schedule_work() in srcutiny's > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > locks are (almost) no-op. Maybe we can make RCU torture only test it on > SMP? Like this, you mean? I will give it a shot tomorrow. Thanx, Paul ------------------------------------------------------------------------ diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 3c8e4cd5b83e6..afef343eb8a19 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -843,7 +843,7 @@ static unsigned long srcu_torture_completed(void) static void srcu_torture_deferred_free(struct rcu_torture *rp) { unsigned long flags; - bool lockit = jiffies & 0x1; + bool lockit = IS_ENABLED(CONFIG_SMP) && (jiffies & 0x1); if (lockit) raw_spin_lock_irqsave(¤t->pi_lock, flags); ^ permalink raw reply related [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 20:07 ` Paul E. McKenney @ 2026-03-21 20:08 ` Boqun Feng 2026-03-22 10:09 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-21 20:08 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 01:07:45PM -0700, Paul E. McKenney wrote: > On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > > [...] > > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > > + delay = srcu_get_delay(ssp); > > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > > overly careful code is probably fine ;-) > > > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > > > OK, I will try that one, thank you! > > > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > > particularly urgent. > > > > > > > > > > > > > I just checked my run yesterday, I also hit it. 
It's probably what > > > > Zqiang has found: > > > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > > > srcu_torture_deferred_free(): > > > > raw_spin_lock_irqsave(->pi_lock,...); > > > > call_srcu(): > > > > if (snp == snp_leaf && snp_seq != s) { > > > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > > > if (!delay) > > > > queue_work_on(...) > > > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > > incoming. > > > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > > lower priority, though perhaps not lower irritation. ;-) > > > > I see, there is a schedule_work() in srcutiny's > > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > > locks are (almost) no-op. Maybe we can make RCU torture only test it on > > SMP? > > Like this, you mean? I will give it a shot tomorrow. > Yes, thanks! Regards, Boqun > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c > index 3c8e4cd5b83e6..afef343eb8a19 100644 > --- a/kernel/rcu/rcutorture.c > +++ b/kernel/rcu/rcutorture.c > @@ -843,7 +843,7 @@ static unsigned long srcu_torture_completed(void) > static void srcu_torture_deferred_free(struct rcu_torture *rp) > { > unsigned long flags; > - bool lockit = jiffies & 0x1; > + bool lockit = IS_ENABLED(CONFIG_SMP) && (jiffies & 0x1); > > if (lockit) > raw_spin_lock_irqsave(¤t->pi_lock, flags); ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-21 20:08 ` Boqun Feng @ 2026-03-22 10:09 ` Paul E. McKenney 2026-03-22 16:16 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-22 10:09 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sat, Mar 21, 2026 at 01:08:59PM -0700, Boqun Feng wrote: > On Sat, Mar 21, 2026 at 01:07:45PM -0700, Paul E. McKenney wrote: > > On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > > > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > > > [...] > > > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > > > + delay = srcu_get_delay(ssp); > > > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > > > overly careful code is probably fine ;-) > > > > > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > > > > > OK, I will try that one, thank you! > > > > > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > > > particularly urgent. > > > > > > > > > > > > > > > > I just checked my run yesterday, I also hit it. 
It's probably what > > > > > Zqiang has found: > > > > > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > > > > > srcu_torture_deferred_free(): > > > > > raw_spin_lock_irqsave(->pi_lock,...); > > > > > call_srcu(): > > > > > if (snp == snp_leaf && snp_seq != s) { > > > > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > > > > if (!delay) > > > > > queue_work_on(...) > > > > > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > > > incoming. > > > > > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > > > lower priority, though perhaps not lower irritation. ;-) > > > > > > I see, there is a schedule_work() in srcutiny's > > > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > > > locks are (almost) no-op. Maybe we can make RCU torture only test it on > > > SMP? > > > > Like this, you mean? I will give it a shot tomorrow. > > Yes, thanks! OK, the previous patch did fine on short rcutorture testing aside from the !SMP lockdep splat, so I have started the test without pi_lock. Longer term, shouldn't lockdep take into account the fact that on !SMP, the disabling of preemption (or interrupts or...) is essentially the same as acquiring a global lock? This means that only one task at a time can be acquiring a raw spinlock on !SMP, so that the order of acquisition of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking double acquisitions, perhaps.) In other words, shouldn't lockdep leave raw spinlocks out of lockdep's cycle-detection data structure? Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 10:09 ` Paul E. McKenney @ 2026-03-22 16:16 ` Boqun Feng 2026-03-22 17:09 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-22 16:16 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 03:09:19AM -0700, Paul E. McKenney wrote: > On Sat, Mar 21, 2026 at 01:08:59PM -0700, Boqun Feng wrote: > > On Sat, Mar 21, 2026 at 01:07:45PM -0700, Paul E. McKenney wrote: > > > On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > > > > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > > > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > > > > [...] > > > > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > > > > + delay = srcu_get_delay(ssp); > > > > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > > > > overly careful code is probably fine ;-) > > > > > > > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > > > > > > > OK, I will try that one, thank you! > > > > > > > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > > > > particularly urgent. > > > > > > > > > > > > > > > > > > > I just checked my run yesterday, I also hit it. 
It's probably what > > > > > > Zqiang has found: > > > > > > > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > > > > > > > srcu_torture_deferred_free(): > > > > > > raw_spin_lock_irqsave(->pi_lock,...); > > > > > > call_srcu(): > > > > > > if (snp == snp_leaf && snp_seq != s) { > > > > > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > > > > > if (!delay) > > > > > > queue_work_on(...) > > > > > > > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > > > > incoming. > > > > > > > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > > > > lower priority, though perhaps not lower irritation. ;-) > > > > > > > > I see, there is a schedule_work() in srcutiny's > > > > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > > > > locks are (almost) no-op. Maybe we can make RCU torture only test it on > > > > SMP? > > > > > > Like this, you mean? I will give it a shot tomorrow. > > > > Yes, thanks! > > OK, the previous patch did fine on short rcutorture testing aside from > the !SMP lockdep splat, so I have started the test without pi_lock. > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > the disabling of preemption (or interrupts or...) is essentially the same > as acquiring a global lock? This means that only one task at a time can > be acquiring a raw spinlock on !SMP, so that the order of acquisition > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > raw spinlocks out of lockdep's cycle-detection data structure? > Lockdep doesn't know whether a code path is UP-only, so it'll apply the general locking rule to check. 
Similarly, lockdep still detects PREEMPT_RT locking issues on a !PREEMPT_RT kernel. Maybe we can add a separate kconfig to narrow lockdep detection down to UP-only rules when UP=y. Regards, Boqun > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 16:16 ` Boqun Feng @ 2026-03-22 17:09 ` Paul E. McKenney 2026-03-22 17:31 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-22 17:09 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 09:16:59AM -0700, Boqun Feng wrote: > On Sun, Mar 22, 2026 at 03:09:19AM -0700, Paul E. McKenney wrote: > > On Sat, Mar 21, 2026 at 01:08:59PM -0700, Boqun Feng wrote: > > > On Sat, Mar 21, 2026 at 01:07:45PM -0700, Paul E. McKenney wrote: > > > > On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > > > > > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > > > > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > > > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > > > > > [...] > > > > > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > > > > > + delay = srcu_get_delay(ssp); > > > > > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > > > > > overly careful code is probably fine ;-) > > > > > > > > > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > > > > > > > > > OK, I will try that one, thank you! > > > > > > > > > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > > > > > particularly urgent. 
> > > > > > > > > > > > > > > > > > > > > > I just checked my run yesterday, I also hit it. It's probably what > > > > > > > Zqiang has found: > > > > > > > > > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > > > > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > > > > > > > > > srcu_torture_deferred_free(): > > > > > > > raw_spin_lock_irqsave(->pi_lock,...); > > > > > > > call_srcu(): > > > > > > > if (snp == snp_leaf && snp_seq != s) { > > > > > > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > > > > > > if (!delay) > > > > > > > queue_work_on(...) > > > > > > > > > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > > > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > > > > > incoming. > > > > > > > > > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > > > > > lower priority, though perhaps not lower irritation. ;-) > > > > > > > > > > I see, there is a schedule_work() in srcutiny's > > > > > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > > > > > locks are (almost) no-op. Maybe we can make RCU torture only test it on > > > > > SMP? > > > > > > > > Like this, you mean? I will give it a shot tomorrow. > > > > > > Yes, thanks! > > > > OK, the previous patch did fine on short rcutorture testing aside from > > the !SMP lockdep splat, so I have started the test without pi_lock. > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > the disabling of preemption (or interrupts or...) is essentially the same > > as acquiring a global lock? This means that only one task at a time can > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > double acquisitions, perhaps.) 
In other words, shouldn't lockdep leave > > raw spinlocks out of lockdep's cycle-detection data structure? > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > general locking rule to check. Similar as lockdep still detect > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > separate kconfig to narrow down lockdep detection for UP-only if UP=y. But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, correct? In which case all the code paths are UP-only. The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to detect and fix locking issues that would otherwise only show up in CONFIG_PREEMPT_RT=y kernels. And of course, CONFIG_SMP=n kernels still need deadlock detection for mutexes and the like. Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 17:09 ` Paul E. McKenney @ 2026-03-22 17:31 ` Boqun Feng 2026-03-22 17:44 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-22 17:31 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 10:09:53AM -0700, Paul E. McKenney wrote: > On Sun, Mar 22, 2026 at 09:16:59AM -0700, Boqun Feng wrote: > > On Sun, Mar 22, 2026 at 03:09:19AM -0700, Paul E. McKenney wrote: > > > On Sat, Mar 21, 2026 at 01:08:59PM -0700, Boqun Feng wrote: > > > > On Sat, Mar 21, 2026 at 01:07:45PM -0700, Paul E. McKenney wrote: > > > > > On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > > > > > > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > > > > > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > > > > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > > > > > > [...] > > > > > > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > + delay = srcu_get_delay(ssp); > > > > > > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > > > > > > overly careful code is probably fine ;-) > > > > > > > > > > > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > > > > > > > > > > > OK, I will try that one, thank you! 
> > > > > > > > > > > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > > > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > > > > > > particularly urgent. > > > > > > > > > > > > > > > > > > > > > > > > > I just checked my run yesterday, I also hit it. It's probably what > > > > > > > > Zqiang has found: > > > > > > > > > > > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > > > > > > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > > > > > > > > > > > srcu_torture_deferred_free(): > > > > > > > > raw_spin_lock_irqsave(->pi_lock,...); > > > > > > > > call_srcu(): > > > > > > > > if (snp == snp_leaf && snp_seq != s) { > > > > > > > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > > > > > > > if (!delay) > > > > > > > > queue_work_on(...) > > > > > > > > > > > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > > > > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > > > > > > incoming. > > > > > > > > > > > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > > > > > > lower priority, though perhaps not lower irritation. ;-) > > > > > > > > > > > > I see, there is a schedule_work() in srcutiny's > > > > > > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > > > > > > locks are (almost) no-op. Maybe we can make RCU torture only test it on > > > > > > SMP? > > > > > > > > > > Like this, you mean? I will give it a shot tomorrow. > > > > > > > > Yes, thanks! > > > > > > OK, the previous patch did fine on short rcutorture testing aside from > > > the !SMP lockdep splat, so I have started the test without pi_lock. > > > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > > the disabling of preemption (or interrupts or...) 
is essentially the same > > > as acquiring a global lock? This means that only one task at a time can > > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > > > raw spinlocks out of lockdep's cycle-detection data structure? > > > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > > general locking rule to check. Similar as lockdep still detect > > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > > separate kconfig to narrow down lockdep detection for UP-only if UP=y. > > But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, > correct? In which case all the code paths are UP-only. > Why? UP and SMP can share the same code path, no? Just because lockdep is running in UP and sees a code path doesn't mean that code path is UP-only, right? What I was trying to say is that lockdep detects general locking rule violations, so even when it's running in a UP kernel, it'll use the rules that SMP uses. > The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation > was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to > detect and fix locking issues that would otherwise only show up in > CONFIG_PREEMPT_RT=y kernels. > I'm not sure how that's different from people wanting to detect SMP=y locking issues while running a UP kernel. Of course, I don't know whether that was the intention, because I wasn't there ;-) Regards, Boqun > And of course, CONFIG_SMP=n kernels still need deadlock detection for > mutexes and the like. > > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 17:31 ` Boqun Feng @ 2026-03-22 17:44 ` Paul E. McKenney 2026-03-22 18:17 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-22 17:44 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 10:31:14AM -0700, Boqun Feng wrote: > On Sun, Mar 22, 2026 at 10:09:53AM -0700, Paul E. McKenney wrote: > > On Sun, Mar 22, 2026 at 09:16:59AM -0700, Boqun Feng wrote: > > > On Sun, Mar 22, 2026 at 03:09:19AM -0700, Paul E. McKenney wrote: > > > > On Sat, Mar 21, 2026 at 01:08:59PM -0700, Boqun Feng wrote: > > > > > On Sat, Mar 21, 2026 at 01:07:45PM -0700, Paul E. McKenney wrote: > > > > > > On Sat, Mar 21, 2026 at 12:45:27PM -0700, Boqun Feng wrote: > > > > > > > On Sat, Mar 21, 2026 at 12:31:04PM -0700, Paul E. McKenney wrote: > > > > > > > > On Sat, Mar 21, 2026 at 11:06:59AM -0700, Boqun Feng wrote: > > > > > > > > > On Sat, Mar 21, 2026 at 10:41:47AM -0700, Paul E. McKenney wrote: > > > > > > > > > [...] > > > > > > > > > > > > + raw_spin_lock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > + delay = srcu_get_delay(ssp); > > > > > > > > > > > > + raw_spin_unlock_rcu_node(ssp->srcu_sup); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It was fixed differently in v2: > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/rcu/20260320222916.19987-1-boqun@kernel.org/ > > > > > > > > > > > > > > > > > > > > > > I used _irqsave/_irqrestore just in case. Given it's an urgent fix, > > > > > > > > > > > overly careful code is probably fine ;-) > > > > > > > > > > > > > > > > > > > > > > Thanks for the testing and feedback. > > > > > > > > > > > > > > > > > > > > OK, I will try that one, thank you! 
> > > > > > > > > > > > > > > > > > > > FYI, with my change on your earlier version, SRCU-T got deadlocks between > > > > > > > > > > the pi-lock and the workqueue pool lock. Which might or might not be > > > > > > > > > > particularly urgent. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I just checked my run yesterday, I also hit it. It's probably what > > > > > > > > > Zqiang has found: > > > > > > > > > > > > > > > > > > https://lore.kernel.org/rcu/4c23c66f86a2aff8f2d7b759f9dd257b82147a17@linux.dev/ > > > > > > > > > > > > > > > > > > We have a queue_work_on() in srcu_schedule_cbs_sdp(), so > > > > > > > > > > > > > > > > > > srcu_torture_deferred_free(): > > > > > > > > > raw_spin_lock_irqsave(->pi_lock,...); > > > > > > > > > call_srcu(): > > > > > > > > > if (snp == snp_leaf && snp_seq != s) { > > > > > > > > > srcu_schedule_cbs_sdp(sdp, do_norm ? SRCU_INTERVAL : 0): > > > > > > > > > if (!delay) > > > > > > > > > queue_work_on(...) > > > > > > > > > > > > > > > > > > I was about to reply to Zqiang, fixing that could be a touch design > > > > > > > > > decision. Since it's a per srcu_data work ;-) NR_CPUS x irq_work > > > > > > > > > incoming. > > > > > > > > > > > > > > > > Just to be clear, SRCU-T is Tiny SRCU rather than Tree SRCU. So perhaps > > > > > > > > lower priority, though perhaps not lower irritation. ;-) > > > > > > > > > > > > > > I see, there is a schedule_work() in srcutiny's > > > > > > > srcu_gp_start_if_needed(). But it couldn't cause deadlock on UP since > > > > > > > locks are (almost) no-op. Maybe we can make RCU torture only test it on > > > > > > > SMP? > > > > > > > > > > > > Like this, you mean? I will give it a shot tomorrow. > > > > > > > > > > Yes, thanks! > > > > > > > > OK, the previous patch did fine on short rcutorture testing aside from > > > > the !SMP lockdep splat, so I have started the test without pi_lock. 
> > > > > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > > > the disabling of preemption (or interrupts or...) is essentially the same > > > > as acquiring a global lock? This means that only one task at a time can > > > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > > > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > > > > raw spinlocks out of lockdep's cycle-detection data structure? > > > > > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > > > general locking rule to check. Similar as lockdep still detect > > > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > > > separate kconfig to narrow down lockdep detection for UP-only if UP=y. > > > > But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, > > correct? In which case all the code paths are UP-only. > > Why? UP and SMP can share the same code path, no? Just because lockdep > is running in UP and see a code path, doesn't mean that code path is > UP-only, right? What I was trying to say is Lockdep detects generel > locking rule violation, so even when it's running in UP kernel, it'll > use rules that SMP uses. I know that this is what lockdep is doing. I am instead saying that what lockdep is doing is not a good thing. ;-) > > The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation > > was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to > > detect and fix locking issues that would otherwise only show up in > > CONFIG_PREEMPT_RT=y kernels. > > I'm not sure how it's different than people wanted to detect SMP=y > locking issues when running UP kernel. Of course, I didn't know whether > that's the intention because I wasn't there ;-) It is very different. 
There were very few people developing and running PREEMPT_RT, so it was quite difficult for them to clean up the lock-context issues that everyone else was creating. Making lockdep check for PREEMPT_RT issues in !PREEMPT_RT kernels was therefore very helpful, at least to the PREEMPT_RT guys. In contrast (and unlike 20 years ago), almost everyone runs SMP, so there is no harm in !SMP lockdep ignoring things that would be deadlocks in SMP kernels. The ample SMP lockdep testing will catch those issues, so there is no need for extra help from UP lockdep testing. Thanx, Paul > Regards, > Boqun > > > And of course, CONFIG_SMP=n kernels still need deadlock detection for > > mutexes and the like. > > > > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 17:44 ` Paul E. McKenney @ 2026-03-22 18:17 ` Boqun Feng 2026-03-22 19:47 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-22 18:17 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 10:44:58AM -0700, Paul E. McKenney wrote: [...] > > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > > > > the disabling of preemption (or interrupts or...) is essentially the same > > > > > as acquiring a global lock? This means that only one task at a time can > > > > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > > > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > > > > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > > > > > raw spinlocks out of lockdep's cycle-detection data structure? > > > > > > > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > > > > general locking rule to check. Similar as lockdep still detect > > > > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > > > > separate kconfig to narrow down lockdep detection for UP-only if UP=y. > > > > > > But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, > > > correct? In which case all the code paths are UP-only. > > > > Why? UP and SMP can share the same code path, no? Just because lockdep > > is running in UP and see a code path, doesn't mean that code path is > > UP-only, right? What I was trying to say is Lockdep detects generel > > locking rule violation, so even when it's running in UP kernel, it'll > > use rules that SMP uses. > > I know that this is what lockdep is doing. 
I am instead saying that > what lockdep is doing is not a good thing. ;-) > I see. > > > The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation > > > was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to > > > detect and fix locking issues that would otherwise only show up in > > > CONFIG_PREEMPT_RT=y kernels. > > > > I'm not sure how it's different than people wanted to detect SMP=y > > locking issues when running UP kernel. Of course, I didn't know whether > > that's the intention because I wasn't there ;-) > > It is very different. > > There were very few people developing and running PREEMPT_RT, so it > was quite difficult for them to clean up the lock-context issues that > everyone else was creating. Making lockdep check for PREEMPT_RT issues > in !PREEMPT_RT kernels was therefore very helpful, at least to the > PREEMPT_RT guys. > > In contrast (and unlike 20 years ago), almost everyone runs SMP, so there (It also means the people who cares UP is almost none ;-)) > is no harm in !SMP lockdep ignoring things that would be deadlocks in > SMP kernels. The ample SMP lockdep testing will catch those issues, > so there is no need for extra help from UP lockdep testing. > Fair point. If anyone cares UP enough to submit a patch in lockdep, I'm happy to take a look. Regards, Boqun > Thanx, Paul > > > Regards, > > Boqun > > > > > And of course, CONFIG_SMP=n kernels still need deadlock detection for > > > mutexes and the like. > > > > > > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 18:17 ` Boqun Feng @ 2026-03-22 19:47 ` Paul E. McKenney 2026-03-22 20:26 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-22 19:47 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 11:17:27AM -0700, Boqun Feng wrote: > On Sun, Mar 22, 2026 at 10:44:58AM -0700, Paul E. McKenney wrote: > [...] > > > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > > > > > the disabling of preemption (or interrupts or...) is essentially the same > > > > > > as acquiring a global lock? This means that only one task at a time can > > > > > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > > > > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > > > > > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > > > > > > raw spinlocks out of lockdep's cycle-detection data structure? > > > > > > > > > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > > > > > general locking rule to check. Similar as lockdep still detect > > > > > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > > > > > separate kconfig to narrow down lockdep detection for UP-only if UP=y. > > > > > > > > But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, > > > > correct? In which case all the code paths are UP-only. > > > > > > Why? UP and SMP can share the same code path, no? Just because lockdep > > > is running in UP and see a code path, doesn't mean that code path is > > > UP-only, right? 
What I was trying to say is Lockdep detects generel > > > locking rule violation, so even when it's running in UP kernel, it'll > > > use rules that SMP uses. > > > > I know that this is what lockdep is doing. I am instead saying that > > what lockdep is doing is not a good thing. ;-) > > I see. > > > > > The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation > > > > was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to > > > > detect and fix locking issues that would otherwise only show up in > > > > CONFIG_PREEMPT_RT=y kernels. > > > > > > I'm not sure how it's different than people wanted to detect SMP=y > > > locking issues when running UP kernel. Of course, I didn't know whether > > > that's the intention because I wasn't there ;-) > > > > It is very different. > > > > There were very few people developing and running PREEMPT_RT, so it > > was quite difficult for them to clean up the lock-context issues that > > everyone else was creating. Making lockdep check for PREEMPT_RT issues > > in !PREEMPT_RT kernels was therefore very helpful, at least to the > > PREEMPT_RT guys. > > > > In contrast (and unlike 20 years ago), almost everyone runs SMP, so there > > (It also means the people who cares UP is almost none ;-)) I don't know about that, given that the sets of people who run SMP and who care about UP are not necessarily disjoint. But those who care *only* about UP definitely are no longer the vast majority. ;-) > > is no harm in !SMP lockdep ignoring things that would be deadlocks in > > SMP kernels. The ample SMP lockdep testing will catch those issues, > > so there is no need for extra help from UP lockdep testing. > > Fair point. If anyone cares UP enough to submit a patch in lockdep, I'm > happy to take a look. Fair enough. In other news, the SMP-check hack to rcutorture does suppress this lockdep issue, at least in my testing. 
But I applied that change to my -rcu tree, which reverted me back to v1 of your patch and re-introduced the earlier lockdep issue. So I am restarting the tests. Thanx, Paul > Regards, > Boqun > > > Thanx, Paul > > > > > Regards, > > > Boqun > > > > > > > And of course, CONFIG_SMP=n kernels still need deadlock detection for > > > > mutexes and the like. > > > > > > > > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 19:47 ` Paul E. McKenney @ 2026-03-22 20:26 ` Boqun Feng 2026-03-23 7:50 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-22 20:26 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 12:47:41PM -0700, Paul E. McKenney wrote: > On Sun, Mar 22, 2026 at 11:17:27AM -0700, Boqun Feng wrote: > > On Sun, Mar 22, 2026 at 10:44:58AM -0700, Paul E. McKenney wrote: > > [...] > > > > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > > > > > > the disabling of preemption (or interrupts or...) is essentially the same > > > > > > > as acquiring a global lock? This means that only one task at a time can > > > > > > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > > > > > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > > > > > > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > > > > > > > raw spinlocks out of lockdep's cycle-detection data structure? > > > > > > > > > > > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > > > > > > general locking rule to check. Similar as lockdep still detect > > > > > > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > > > > > > separate kconfig to narrow down lockdep detection for UP-only if UP=y. > > > > > > > > > > But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, > > > > > correct? In which case all the code paths are UP-only. > > > > > > > > Why? UP and SMP can share the same code path, no? Just because lockdep > > > > is running in UP and see a code path, doesn't mean that code path is > > > > UP-only, right? 
What I was trying to say is Lockdep detects generel > > > > locking rule violation, so even when it's running in UP kernel, it'll > > > > use rules that SMP uses. > > > > > > I know that this is what lockdep is doing. I am instead saying that > > > what lockdep is doing is not a good thing. ;-) > > > > I see. > > > > > > > The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation > > > > > was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to > > > > > detect and fix locking issues that would otherwise only show up in > > > > > CONFIG_PREEMPT_RT=y kernels. > > > > > > > > I'm not sure how it's different than people wanted to detect SMP=y > > > > locking issues when running UP kernel. Of course, I didn't know whether > > > > that's the intention because I wasn't there ;-) > > > > > > It is very different. > > > > > > There were very few people developing and running PREEMPT_RT, so it > > > was quite difficult for them to clean up the lock-context issues that > > > everyone else was creating. Making lockdep check for PREEMPT_RT issues > > > in !PREEMPT_RT kernels was therefore very helpful, at least to the > > > PREEMPT_RT guys. > > > > > > In contrast (and unlike 20 years ago), almost everyone runs SMP, so there > > > > (It also means the people who cares UP is almost none ;-)) > > I don't know about that, given that the sets of people who run SMP and > who care about UP are not necessarily disjoint. But those who care *only* > about UP definitely are no longer the vast majority. ;-) > You're right, I should have said "UP-only". And I *think* the reality is that most of the sub-systems share the code path between UP and SMP, and it works fine for them, because UP tests help them find SMP deadlock as well. That's probably why lockdep has been doing this "not a good thing" for a while. > > > is no harm in !SMP lockdep ignoring things that would be deadlocks in > > > SMP kernels. 
The ample SMP lockdep testing will catch those issues, > > > so there is no need for extra help from UP lockdep testing. > > > > Fair point. If anyone cares UP enough to submit a patch in lockdep, I'm > > happy to take a look. > > Fair enough. > > In other news, the SMP-check hack to rcutorture does suppress this lockdep > issue, at least in my testing. But I applied that change to my -rcu tree, > which reverted me back to v1 of your patch and re-introduced the earlier > lockdep issue. > > So I am restarting the tests. > Thank you! Regards, Boqun > Thanx, Paul > > > Regards, > > Boqun > > > > > Thanx, Paul > > > > > > > Regards, > > > > Boqun > > > > > > > > > And of course, CONFIG_SMP=n kernels still need deadlock detection for > > > > > mutexes and the like. > > > > > > > > > > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() 2026-03-22 20:26 ` Boqun Feng @ 2026-03-23 7:50 ` Paul E. McKenney 0 siblings, 0 replies; 100+ messages in thread From: Paul E. McKenney @ 2026-03-23 7:50 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrea Righi, Zqiang On Sun, Mar 22, 2026 at 01:26:02PM -0700, Boqun Feng wrote: > On Sun, Mar 22, 2026 at 12:47:41PM -0700, Paul E. McKenney wrote: > > On Sun, Mar 22, 2026 at 11:17:27AM -0700, Boqun Feng wrote: > > > On Sun, Mar 22, 2026 at 10:44:58AM -0700, Paul E. McKenney wrote: > > > [...] > > > > > > > > Longer term, shouldn't lockdep take into account the fact that on !SMP, > > > > > > > > the disabling of preemption (or interrupts or...) is essentially the same > > > > > > > > as acquiring a global lock? This means that only one task at a time can > > > > > > > > be acquiring a raw spinlock on !SMP, so that the order of acquisition > > > > > > > > of raw spinlocks on !SMP is irrelevant? (Aside from self-deadlocking > > > > > > > > double acquisitions, perhaps.) In other words, shouldn't lockdep leave > > > > > > > > raw spinlocks out of lockdep's cycle-detection data structure? > > > > > > > > > > > > > > Lockdep doesn't know whether a code path is UP-only, so it'll apply the > > > > > > > general locking rule to check. Similar as lockdep still detect > > > > > > > PREEMPT_RT locking issue for !PREEMPT_RT kernel. Maybe we can add a > > > > > > > separate kconfig to narrow down lockdep detection for UP-only if UP=y. > > > > > > > > > > > > But lockdep *does* know when it is running in a CONFIG_SMP=n kernel, > > > > > > correct? In which case all the code paths are UP-only. > > > > > > > > > > Why? UP and SMP can share the same code path, no? 
Just because lockdep > > > > > is running in UP and see a code path, doesn't mean that code path is > > > > > UP-only, right? What I was trying to say is Lockdep detects generel > > > > > locking rule violation, so even when it's running in UP kernel, it'll > > > > > use rules that SMP uses. > > > > > > > > I know that this is what lockdep is doing. I am instead saying that > > > > what lockdep is doing is not a good thing. ;-) > > > > > > I see. > > > > > > > > > The situation with CONFIG_PREEMPT_RT=y is quite different: The motivation > > > > > > was for people running lockdep on CONFIG_PREEMPT_RT=n kernels to > > > > > > detect and fix locking issues that would otherwise only show up in > > > > > > CONFIG_PREEMPT_RT=y kernels. > > > > > > > > > > I'm not sure how it's different than people wanted to detect SMP=y > > > > > locking issues when running UP kernel. Of course, I didn't know whether > > > > > that's the intention because I wasn't there ;-) > > > > > > > > It is very different. > > > > > > > > There were very few people developing and running PREEMPT_RT, so it > > > > was quite difficult for them to clean up the lock-context issues that > > > > everyone else was creating. Making lockdep check for PREEMPT_RT issues > > > > in !PREEMPT_RT kernels was therefore very helpful, at least to the > > > > PREEMPT_RT guys. > > > > > > > > In contrast (and unlike 20 years ago), almost everyone runs SMP, so there > > > > > > (It also means the people who cares UP is almost none ;-)) > > > > I don't know about that, given that the sets of people who run SMP and > > who care about UP are not necessarily disjoint. But those who care *only* > > about UP definitely are no longer the vast majority. ;-) > > You're right, I should have said "UP-only". And I *think* the reality is > that most of the sub-systems share the code path between UP and SMP, and > it works fine for them, because UP tests help them find SMP deadlock as > well. 
That's probably why lockdep has been doing this "not a good thing" > for a while. There are many instances of "#ifdef CONFIG_SMP", too many to avoid activating my laziness. > > > > is no harm in !SMP lockdep ignoring things that would be deadlocks in > > > > SMP kernels. The ample SMP lockdep testing will catch those issues, > > > > so there is no need for extra help from UP lockdep testing. > > > > > > Fair point. If anyone cares UP enough to submit a patch in lockdep, I'm > > > happy to take a look. > > > > Fair enough. > > > > In other news, the SMP-check hack to rcutorture does suppress this lockdep > > issue, at least in my testing. But I applied that change to my -rcu tree, > > which reverted me back to v1 of your patch and re-introduced the earlier > > lockdep issue. > > > > So I am restarting the tests. > > Thank you! With these commits in addition: 79e15e57059e ("rcutorture: Test call_srcu() with pi_lock held only for SMP") (from my -rcu tree) ca174c705db5 ("cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug") (from mainline) Tested-by: Paul E. McKenney <paulmck@kernel.org> Thanx, Paul > Regards, > Boqun > > > Thanx, Paul > > > > > Regards, > > > Boqun > > > > > > > Thanx, Paul > > > > > > > > > Regards, > > > > > Boqun > > > > > > > > > > > And of course, CONFIG_SMP=n kernels still need deadlock detection for > > > > > > mutexes and the like. > > > > > > > > > > > > Thanx, Paul ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 17:54 ` Joel Fernandes 2026-03-20 18:14 ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng @ 2026-03-20 18:20 ` Boqun Feng 1 sibling, 0 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-20 18:20 UTC (permalink / raw) To: Joel Fernandes Cc: Paul E. McKenney, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 01:54:22PM -0400, Joel Fernandes wrote: > > > On 3/20/2026 12:57 PM, Boqun Feng wrote: > > > >>>> So we really do need to make some variant of call_srcu() that deals > >>>> with this. > >>>> > >>>> We do have some options. First, we could make call_srcu() deal with it > >>>> directly, or second, we could create something like call_srcu_lockless() > >>>> or call_srcu_nolock() or whatever that can safely be invoked from any > >>>> context, including NMI handlers, and that invokes call_srcu() directly > >>>> when it determines that it is safe to do so. The advantage of the second > >>>> approach is that it avoids incurring the overhead of checking in the > >>>> common case. > >>> Within the RCU scope, I prefer the second option. > >> Works for me! > >> > >> Would you guys like to implement this, or would you prefer that I do so? > >> > > I feel I don't have cycles for it soon, I have a big backlog (including > > making preempt_count 64bit on 64bit x86). But I will send the fix in the > > current call_srcu() for v7.0 and work with Joel to get into Linus' tree. > > Boqun, I get a splat as below with your irq_work patch on rcutorture: > Thank you, I fixed that in the new version [1]. Please give it a go. 
[1]: https://lore.kernel.org/rcu/20260320181400.15909-1-boqun@kernel.org/ Regards, Boqun > Maybe the srcu_get_delay call needs: > > raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags); > delay = srcu_get_delay(ssp); > raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags); > > can you check? > > [ 0.459781] ------------[ cut here ]------------ > [ 0.460401] WARNING: kernel/rcu/srcutree.c:681 at srcu_get_delay+0xb4/0xd0, > CPU#0: swapper/0/1 > [ 0.460751] Modules linked in: > [ 0.460751] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted > 7.0.0-rc3-00020-gc18a9e13ce7f #96 PREEMPTLAZY > [ 0.460751] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 > 04/01/2014 > [ 0.460751] RIP: 0010:srcu_get_delay+0xb4/0xd0 > [ 0.460751] Code: 00 00 00 5b 5d 48 39 d0 48 0f 47 c2 e9 45 b0 0b 01 48 89 fd > be ff ff ff ff 48 8d bb d0 00 00 00 e8 e1 86 0a 01 85 c0 75 0d 90 <0f> 0b 90 48 > 8b 55 40 e9 57 ff ff ff 48 8b 55 40 e9 4e ff ff ff 0f > [ 0.460751] RSP: 0000:ffffb4ba80003f80 EFLAGS: 00010046 > [ 0.460751] RAX: 0000000000000000 RBX: ffffffffac1604c0 RCX: 0000000000000001 > [ 0.460751] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffffac160590 > [ 0.460751] RBP: ffffffffac160460 R08: 0000000000000000 R09: 0000000000000000 > [ 0.460751] R10: 0000000000000000 R11: ffffb4ba80003ff8 R12: 0000000000000023 > [ 0.460751] R13: ffff9de181214b00 R14: 0000000000000000 R15: 0000000000000000 > [ 0.460751] FS: 0000000000000000(0000) GS:ffff9de1f2799000(0000) > knlGS:0000000000000000 > [ 0.460751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 0.460751] CR2: ffff9de18ddda000 CR3: 000000000c64e000 CR4: 00000000000006f0 > [ 0.460751] Call Trace: > [ 0.460751] <IRQ> > [ 0.460751] srcu_irq_work+0x11/0x40 > [ 0.460751] irq_work_single+0x42/0x90 > [ 0.460751] irq_work_run_list+0x26/0x40 > [ 0.460751] irq_work_run+0x18/0x30 > [ 0.460751] __sysvec_irq_work+0x30/0x180 > [ 0.460751] sysvec_irq_work+0x6a/0x80 > [ 0.460751] </IRQ> > [ 0.460751] <TASK> > [ 0.460751] 
asm_sysvec_irq_work+0x1a/0x20 > [ 0.460751] RIP: 0010:_raw_spin_unlock_irqrestore+0x34/0x50 > [ 0.460751] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 e6 58 f1 fe 48 89 ef e8 > 2e 92 f1 fe 80 e7 02 74 06 e8 e4 c0 00 ff fb 65 ff 0d fc 63 66 01 <74> 07 5b 5d > c3 cc cc cc cc e8 7e 03 df fe 5b 5d e9 57 1b 00 00 0f > [ 0.460751] RSP: 0000:ffffb4ba80013d50 EFLAGS: 00000286 > [ 0.460751] RAX: 0000000000001c8b RBX: 0000000000000297 RCX: 0000000000000000 > [ 0.460751] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffab614c2c > [ 0.460751] RBP: ffffffffac160578 R08: 0000000000000001 R09: 0000000000000000 > [ 0.460751] R10: 0000000000000001 R11: 0000000000000000 R12: fffffffffffffe74 > [ 0.460751] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 > [ 0.460751] ? _raw_spin_unlock_irqrestore+0x2c/0x50 > [ 0.460751] srcu_gp_start_if_needed+0x354/0x530 > [ 0.460751] __synchronize_srcu+0xcc/0x180 > [ 0.460751] ? __pfx_wakeme_after_rcu+0x10/0x10 > [ 0.460751] ? synchronize_srcu+0x3f/0x170 > [ 0.460751] ? __pfx_rcu_init_tasks_generic+0x10/0x10 > [ 0.460751] rcu_init_tasks_generic+0x104/0x150 > [ 0.460751] do_one_initcall+0x59/0x2e0 > [ 0.460751] ? _printk+0x56/0x70 > [ 0.460751] kernel_init_freeable+0x227/0x440 > [ 0.460751] ? __pfx_kernel_init+0x10/0x10 > [ 0.460751] kernel_init+0x15/0x1c0 > [ 0.460751] ret_from_fork+0x2ac/0x330 > [ 0.460751] ? __pfx_kernel_init+0x10/0x10 > [ 0.460751] ret_from_fork_asm+0x1a/0x30 > [ 0.460751] </TASK> ^ permalink raw reply [flat|nested] 100+ messages in thread
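[Editor's note] The splat above fires because srcu_get_delay() asserts that its caller holds the srcu_usage lock, and the irq_work handler called it bare. The shape Joel suggests, and which Boqun's v2 adopts with _irqsave "just in case", is sketched roughly below; the struct layout, field names, and work-queueing call are assumptions for illustration, not the committed patch.

```c
/* Hedged sketch of the lockdep-clean handler shape discussed above.
 * srcu_irq_work() here is hypothetical; only the lock-around-
 * srcu_get_delay() idea is taken from the thread. */
static void srcu_irq_work(struct irq_work *iwp)
{
	struct srcu_usage *sup = container_of(iwp, struct srcu_usage, irq_work);
	struct srcu_struct *ssp = sup->srcu_ssp;
	unsigned long delay, flags;

	raw_spin_lock_irqsave_rcu_node(ssp->srcu_sup, flags);
	delay = srcu_get_delay(ssp);	/* lock held: assertion satisfied */
	raw_spin_unlock_irqrestore_rcu_node(ssp->srcu_sup, flags);

	/* Now queue the grace-period work with no scheduler locks held. */
	queue_delayed_work(rcu_gp_wq, &ssp->work, delay);
}
```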
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 16:57 ` Boqun Feng 2026-03-20 17:54 ` Joel Fernandes @ 2026-03-20 23:11 ` Paul E. McKenney 2026-03-21 3:29 ` Paul E. McKenney 1 sibling, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-20 23:11 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 09:57:21AM -0700, Boqun Feng wrote: > On Fri, Mar 20, 2026 at 09:24:15AM -0700, Paul E. McKenney wrote: > [...] > > > > > In an alternative universe, BPF has a defer mechanism, and BPF core > > > > > would just call (for example): > > > > > > > > > > bpf_defer(call_srcu, ...); // <- a lockless defer > > > > > > > > > > so the issue won't happen. > > > > > > > > In theory, this is quite true. > > > > > > > > In practice, unfortunately for keeping this part of RCU as simple as > > > > we might wish, when a BPF program gets attached to some function in > > > > the kernel, it does not know whether or not that function holds a given > > > > scheduler lock. For example, there are any number of utility functions > > > > that can be (and are) called both with and without those scheduler > > > > locks held. Worse yet, it might be attached to a function that is > > > > *never* invoked with a scheduler lock held -- until some out-of-tree > > > > module is loaded. Which means that this module might well be loaded > > > > after BPF has JIT-ed the BPF program. > > > > > > > > > > Hmm.. maybe I failed to make myself more clear. I was suggesting we > > > treat BPF as a special context, and you cannot do everything, if there > > > is any call_srcu() needed, switch it to bpf_defer(). We should have the > > > same result as either 1) call_srcu() locklessly defer itself or 2) a > > > call_srcu_lockless(). 
> > > > > > Certainly we can call_srcu() do locklessly defer, but if it's only for > > > BPF, that looks like a whack-a-mole approach to me. Say later on we want > > > to use call_hazptr() in BPF for some reason (there is hoping!), then we > > > need to make it locklessly defer as well. Now we have two lockless logic > > > in both call_srcu() and call_hazptr(), if there is a third one, we need > > > to do that as well. So where's the end? > > > > Except that by the same line of reasoning, how do the BPF guys figure out > > exactly which function calls they need to defer and under what conditions > > they need to defer them? Keeping in mind that the list of functions and > > Can't they just defer anything that is deferrable? If it's deferrable, > then what's the actual cost for BPF to defer it? I believe that there are overhead issues with unconditional deferral, but I must defer to the BPF guys. > > corresponding conditions is subject to change as the kernel continues > > to change. > > > > > The lockless defer request comes from BPF being special, a proper way to > > > deal with it IMO would be BPF has a general defer mechanism. Whether > > > call_srcu() or call_srcu_lockless() can do lockless defer is > > > orthogonal. > > > > Fair point, and for the general defer mechanism, I hereby nominate > > the irq_work_queue() function. We can use this both for RCU and > > for hazard pointers. The code to make a call_srcu_lockless() and > > call_hazptr_lockless() that includes the relevant checks and that does > > the deferral will not be large, complex, or slow. Especially assuming > > that we consolidate common checks. 
> > 
> > > BTW, an example to my point, I think we have a deadlock even with the
> > > old call_rcu_tasks_trace(), because at:
> > > 
> > > https://elixir.bootlin.com/linux/v6.19.8/source/kernel/rcu/tasks.h#L384
> > > 
> > > We do a:
> > > 
> > > 	mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp));
> > > 
> > > which means call_rcu_tasks_trace() may acquire the timer base lock, and
> > > that means if BPF were to trace a point where the timer base lock is held,
> > > then we may have a deadlock. So now I wonder whether you had any magic to
> > > avoid the deadlock pre-7.0 or we are just lucky ;-)
> > 
> > Test it and see! ;-)
> 
> "Program testing can be used to show the presence of bugs, but never to
> show their absence!" ;-)

But with the exception of trivial programs, and I assert that the Linux
kernel is non-trivial, there will always be bugs. We therefore need not
concern ourselves with showing their absence. ;-)

> > > See, without a general defer mechanism, we will have a lot of fun
> > > auditing all the primitives that BPF may use.
> > 
> > No, *we* only audit the primitives in our subsystem that BPF actually
> > uses when BPF starts using them.  We let the *other* subsystems worry
> > about *their* interactions with BPF.
> 
> As an RCU maintainer: fine
> 
> As a LOCKING maintainer: shake my head, because for every primitive that
> BPF uses, now there could be a normal version and a _bpf/lockless()
> version. That could create more maintenance issues, but only time can
> tell.

One might be surprised by no-lock locking primitives, but you are right
that there is always trylock and release.

> > > > So we really do need to make some variant of call_srcu() that deals
> > > > with this.
> > > > 
> > > > We do have some options.  First, we could make call_srcu() deal with it
> > > > directly, or second, we could create something like call_srcu_lockless()
> > > > or call_srcu_nolock() or whatever that can safely be invoked from any
> > > > context, including NMI handlers, and that invokes call_srcu() directly
> > > > when it determines that it is safe to do so.  The advantage of the second
> > > > approach is that it avoids incurring the overhead of checking in the
> > > > common case.
> > > 
> > > Within the RCU scope, I prefer the second option.
> > 
> > Works for me!
> > 
> > Would you guys like to implement this, or would you prefer that I do so?
> 
> I don't feel I have the cycles for it soon, as I have a big backlog
> (including making preempt_count 64-bit on 64-bit x86). But I will send the
> fix to the current call_srcu() for v7.0 and work with Joel to get it into
> Linus' tree.
> 
> I will definitely review it if you beat me to it ;-)

I am sure that further adjustments will be required.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 100+ messages in thread
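The second option settled on above, a call_srcu_lockless() that invokes call_srcu() directly only when its context makes that safe, can be sketched as a small userspace model. This is purely illustrative: the names, the `unsafe_context` flag, and the singly linked deferral list are hypothetical stand-ins for the kernel's real context checks and irq_work machinery, not proposed kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: when the caller might hold scheduler/timer locks
 * (modeled by unsafe_context), the callback is pushed onto a deferral
 * list drained later from a safe context; otherwise the normal
 * (possibly lock-taking) enqueue path runs directly.
 */
struct cb {
	void (*func)(struct cb *);
	struct cb *next;
};

static struct cb *deferred_head;	/* models the irq_work-drained list */
static int direct_calls;		/* callbacks that took the normal path */

static void enqueue_normal(struct cb *c)
{
	/* stands in for the real call_srcu(), which may take locks */
	direct_calls++;
	(void)c;
}

void call_srcu_lockless_model(struct cb *c, bool unsafe_context)
{
	if (unsafe_context) {
		/* push without locks; a real version would use cmpxchg
		 * and then irq_work_queue() to schedule the drain */
		c->next = deferred_head;
		deferred_head = c;
		return;
	}
	enqueue_normal(c);
}

int drain_deferred_model(void)
{
	int n = 0;

	while (deferred_head) {		/* runs later, from a safe context */
		struct cb *c = deferred_head;

		deferred_head = c->next;
		enqueue_normal(c);
		n++;
	}
	return n;
}
```

The shape matters more than the details: the common safe-context case pays no deferral cost, which matches the stated advantage of the second approach over making call_srcu() itself check on every call.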
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 23:11 ` Paul E. McKenney @ 2026-03-21 3:29 ` Paul E. McKenney 0 siblings, 0 replies; 100+ messages in thread From: Paul E. McKenney @ 2026-03-21 3:29 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 04:11:26PM -0700, Paul E. McKenney wrote: > On Fri, Mar 20, 2026 at 09:57:21AM -0700, Boqun Feng wrote: > > On Fri, Mar 20, 2026 at 09:24:15AM -0700, Paul E. McKenney wrote: [ . . . ] > > I will definitely review it if you beat me to it ;-) > > I am sure that further adjustments will be required. ;-) And speaking of further adjustments, heavier rcutorture testing found the bug allegedly fixed by the following. I will be testing the stack over the weekend. Thanx, Paul ------------------------------------------------------------------------ commit a717ba6b919d1be6d13e3ce13ca70e2afd3c24d0 Author: Paul E. McKenney <paulmck@kernel.org> Date: Fri Mar 20 20:06:22 2026 -0700 srcu: Push srcu_node allocation to GP when non-preemptible When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot parameters specify initialization-time allocation of the srcu_node tree for statically allocated srcu_struct structures (for example, in DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime), init_srcu_struct_nodes() will attempt to dynamically allocate this tree at the first run-time update-side use of this srcu_struct structure, but while holding a raw spinlock. Because the memory allocator can acquire non-raw spinlocks, this can result in lockdep splats. This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used when the first run-time update-side use of this srcu_struct structure happens before srcu_init() is called. 
The actual allocation then takes place from workqueue context at the
ends of upcoming SRCU grace periods.

Fixes: ce2e65b01ad1 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 2328827f8775c..678bd9a73875b 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -227,9 +227,12 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
 	ssp->srcu_sup->srcu_gp_seq_needed_exp = SRCU_GP_SEQ_INITIAL_VAL;
 	ssp->srcu_sup->srcu_last_gp_end = ktime_get_mono_fast_ns();
 	if (READ_ONCE(ssp->srcu_sup->srcu_size_state) == SRCU_SIZE_SMALL && SRCU_SIZING_IS_INIT()) {
-		if (!init_srcu_struct_nodes(ssp, is_static ? GFP_ATOMIC : GFP_KERNEL))
+		if (!preemptible())
+			WRITE_ONCE(ssp->srcu_sup->srcu_size_state, SRCU_SIZE_ALLOC);
+		else if (init_srcu_struct_nodes(ssp, GFP_KERNEL))
+			WRITE_ONCE(ssp->srcu_sup->srcu_size_state, SRCU_SIZE_BIG);
+		else
 			goto err_free_sda;
-		WRITE_ONCE(ssp->srcu_sup->srcu_size_state, SRCU_SIZE_BIG);
 	}
 	ssp->srcu_sup->srcu_ssp = ssp;
 	smp_store_release(&ssp->srcu_sup->srcu_gp_seq_needed,

^ permalink raw reply related	[flat|nested] 100+ messages in thread
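The handoff in the patch above can be modeled in a few lines: when the init path runs non-preemptibly, it only records a "needs allocation" state, and the allocation itself happens later from a context that is allowed to sleep. The enum values and function names below merely echo the kernel's SRCU_SIZE_* states in spirit; they are an illustrative sketch, not the actual SRCU internals.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: defer srcu_node-tree allocation out of non-preemptible
 * context (where the allocator's non-raw locks are off limits) into
 * later grace-period work running from workqueue-like context.
 */
enum size_state { SIZE_SMALL, SIZE_ALLOC, SIZE_BIG };

static enum size_state state = SIZE_SMALL;
static bool tree_allocated;

void init_path(bool preemptible_ctx)
{
	if (!preemptible_ctx) {
		state = SIZE_ALLOC;	/* defer: no allocation here */
		return;
	}
	tree_allocated = true;		/* models init_srcu_struct_nodes() */
	state = SIZE_BIG;
}

void gp_end_work(void)
{
	if (state == SIZE_ALLOC) {	/* runs where sleeping is legal */
		tree_allocated = true;
		state = SIZE_BIG;
	}
}
```

The same two-phase trick is what the commit message describes for update-side use before srcu_init(): record the desire, act on it later.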
* [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic()
  2026-03-20 16:24         ` Paul E. McKenney
  2026-03-20 16:57           ` Boqun Feng
@ 2026-03-21 17:03           ` Boqun Feng
  2026-03-23 15:17             ` Boqun Feng
  1 sibling, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-21 17:03 UTC (permalink / raw)
  To: Joel Fernandes, Paul E. McKenney
  Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend, Song Liu,
	Boqun Feng, stable

The following deadlock is possible:

__mod_timer()
  lock_timer_base()
    raw_spin_lock_irqsave(&base->lock)      <- base->lock ACQUIRED
    trace_timer_start()                     <- tp_btf/timer_start fires here
      [probe_timer_start BPF program]
        bpf_task_storage_delete()
          bpf_selem_unlink(selem, false)    <- reuse_now=false
            bpf_selem_free(false)
              call_rcu_tasks_trace()
                call_rcu_tasks_generic()
                  raw_spin_trylock(rtpcp)   <- succeeds (different lock)
                  mod_timer(lazy_timer)     <- lazy_timer is on this CPU's base
                    lock_timer_base()
                      raw_spin_lock_irqsave(&base->lock) <- SAME LOCK -> DEADLOCK

because BPF can instrument a place while the timer base lock is held.
Fix it by using an intermediate irq_work.

Further, because a "timer base->lock" to a "rtpcp lock" lock dependency
can be established in this way, we cannot mod_timer() with the rtpcp lock
held. Fix that as well.

Fixes: d119357d0743 ("rcu-tasks: Treat only synchronous grace periods urgently")
Cc: stable@kernel.org
Signed-off-by: Boqun Feng <boqun@kernel.org>
---
This is a follow-up of [1], and yes, we can easily trigger a whole-system
deadlock freeze with (even non-recursive) tracing of the timer_start
tracepoint. I have a reproducer at:

	https://github.com/fbq/rcu_tasks_deadlock

Be very careful, since it'll freeze your system when you run it.

I've tested it on 6.17 and 6.19 and can confirm the deadlock could be
triggered. So this is an old bug, if it's a bug.

It's up to BPF whether this is a bug or not, because it has existed for
a while and nobody seems to get hurt(?).

[1]: https://lore.kernel.org/rcu/2b3848e9-3b11-41b8-8c44-5de28d4a4433@paulmck-laptop/

 kernel/rcu/tasks.h | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 2b55e6acf3c1..d9d5b7d5944e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -46,6 +46,7 @@ struct rcu_tasks_percpu {
 	unsigned int urgent_gp;
 	struct work_struct rtp_work;
 	struct irq_work rtp_irq_work;
+	struct irq_work timer_irq_work;
 	struct rcu_head barrier_q_head;
 	struct list_head rtp_blkd_tasks;
 	struct list_head rtp_exit_list;
@@ -134,6 +135,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
 static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {	\
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),	\
 	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),	\
+	.timer_irq_work = IRQ_WORK_INIT(call_rcu_tasks_mod_timer_lazy),	\
 };	\
 static struct rcu_tasks rt_name =	\
 {	\
@@ -321,11 +323,19 @@ static void call_rcu_tasks_generic_timer(struct timer_list *tlp)
 		if (!rtpcp->urgent_gp)
 			rtpcp->urgent_gp = 1;
 		needwake = true;
-		mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp));
+		/*
+		 * We cannot mod_timer() here because a "timer base lock ->
+		 * rtpcp lock" lock dependency can be established by BPF
+		 * instrumenting timer_start(), and that means deadlock if
+		 * we mod_timer() here. So delay the mod_timer() until we
+		 * are out of the lock critical section.
+		 */
 	}
 	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
-	if (needwake)
+	if (needwake) {
+		mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp));
 		rcuwait_wake_up(&rtp->cbs_wait);
+	}
 }
 
 // IRQ-work handler that does deferred wakeup for call_rcu_tasks_generic().
@@ -338,6 +348,16 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
 	rcuwait_wake_up(&rtp->cbs_wait);
 }
 
+static void call_rcu_tasks_mod_timer_lazy(struct irq_work *iwp)
+{
+	struct rcu_tasks *rtp;
+	struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, timer_irq_work);
+	rtp = rtpcp->rtpp;
+
+	if (!timer_pending(&rtpcp->lazy_timer))
+		mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp));
+}
+
 // Enqueue a callback for the specified flavor of Tasks RCU.
 static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
 				   struct rcu_tasks *rtp)
@@ -377,7 +397,7 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
 		(rcu_segcblist_n_cbs(&rtpcp->cblist) == rcu_task_lazy_lim);
 	if (havekthread && !needwake && !timer_pending(&rtpcp->lazy_timer)) {
 		if (rtp->lazy_jiffies)
-			mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp));
+			irq_work_queue(&rtpcp->timer_irq_work);
 		else
 			needwake = rcu_segcblist_empty(&rtpcp->cblist);
 	}
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related	[flat|nested] 100+ messages in thread
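The core pattern in the RFC patch above is worth isolating: never call mod_timer() (which takes the timer base lock and fires the timer_start tracepoint) while holding the rtpcp lock; instead, queue an irq_work under the lock and arm the timer from the irq_work handler, outside the critical section. Below is a minimal userspace model of that ordering, with booleans standing in for the real locks and irq_work; all names are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the deferred-timer-arming pattern: while "holding" the
 * rtpcp lock we only record a pending request (modeling
 * irq_work_queue(), which is safe there); the timer is armed later,
 * when no lock that a timer_start tracepoint handler could deadlock
 * against is held.
 */
static bool rtpcp_locked;
static bool irq_work_pending;
static bool timer_armed;

static void arm_timer(void)
{
	/* models mod_timer(): must never run with the rtpcp lock held */
	assert(!rtpcp_locked);
	timer_armed = true;
}

void enqueue_cb_model(void)
{
	rtpcp_locked = true;		/* raw_spin_trylock_rcu_node() */
	irq_work_pending = true;	/* irq_work_queue(): lock-safe */
	rtpcp_locked = false;		/* raw_spin_unlock_rcu_node() */
}

void irq_work_run_model(void)
{
	if (irq_work_pending) {
		irq_work_pending = false;
		if (!timer_armed)	/* models the !timer_pending() check */
			arm_timer();
	}
}
```

If enqueue_cb_model() called arm_timer() directly, the assertion would trip, which is the userspace analogue of the "SAME LOCK -> DEADLOCK" trace in the commit message.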
* Re: [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() 2026-03-21 17:03 ` [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() Boqun Feng @ 2026-03-23 15:17 ` Boqun Feng 2026-03-23 20:37 ` Joel Fernandes 2026-03-23 21:50 ` Kumar Kartikeya Dwivedi 0 siblings, 2 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-23 15:17 UTC (permalink / raw) To: Joel Fernandes, Paul E. McKenney Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Song Liu, stable On Sat, Mar 21, 2026 at 10:03:21AM -0700, Boqun Feng wrote: > The following deadlock is possible: > > __mod_timer() > lock_timer_base() > raw_spin_lock_irqsave(&base->lock) <- base->lock ACQUIRED > trace_timer_start() <- tp_btf/timer_start fires here > [probe_timer_start BPF program] > bpf_task_storage_delete() > bpf_selem_unlink(selem, false) <- reuse_now=false > bpf_selem_free(false) > call_rcu_tasks_trace() > call_rcu_tasks_generic() > raw_spin_trylock(rtpcp) <- succeeds (different lock) > mod_timer(lazy_timer) <- lazy_timer is on this CPU's base > lock_timer_base() > raw_spin_lock_irqsave(&base->lock) <- SAME LOCK -> DEADLOCK > > because BPF can instrument a place while the timer base lock is held. > Fix it by using an intermediate irq_work. > > Further, because a "timer base->lock" to a "rtpcp lock" lock dependency > can be establish in this way, we cannot mod_timer() with a rtpcp lock > held. Fix that as well. > > Fixes: d119357d0743 ("rcu-tasks: Treat only synchronous grace periods urgently") > Cc: stable@kernel.org > Signed-off-by: Boqun Feng <boqun@kernel.org> > --- > This is a follow-up of [1], and yes we can trigger a whole system > deadlock freeze easily with (even non-recursively) tracing the > timer_start tracepoint. 
I have a reproduce at: > > https://github.com/fbq/rcu_tasks_deadlock > > Be very careful, since it'll freeze your system when run it. > > I've tested it on 6.17 and 6.19 and can confirm the deadlock could be > triggered. So this is an old bug if it's a bug. > > It's up to BPF whether this is a bug or not, because it has existed for > a while and nobody seems to get hurt(?). > Ping BPF ;-) I know this was sent in a Saturday and only 2 days have passed, but we are at the decision point about how hard/urgent we should fix these "BPF deadlocks": As this patch shows, the deadlocks existed before v7.0 (i.e. before SRCU switches). And yes, ideally we should fix all of them, but given we are close to v7.0 release, I would like to focus the new issue that SRCU introduces, because that one would likely affect SCHED_EXT. Thoughts? Regards, Boqun > [1]: https://lore.kernel.org/rcu/2b3848e9-3b11-41b8-8c44-5de28d4a4433@paulmck-laptop/ > > kernel/rcu/tasks.h | 26 +++++++++++++++++++++++--- > 1 file changed, 23 insertions(+), 3 deletions(-) > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h > index 2b55e6acf3c1..d9d5b7d5944e 100644 > --- a/kernel/rcu/tasks.h > +++ b/kernel/rcu/tasks.h > @@ -46,6 +46,7 @@ struct rcu_tasks_percpu { > unsigned int urgent_gp; > struct work_struct rtp_work; > struct irq_work rtp_irq_work; > + struct irq_work timer_irq_work; > struct rcu_head barrier_q_head; > struct list_head rtp_blkd_tasks; > struct list_head rtp_exit_list; > @@ -134,6 +135,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp); > static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = { \ > .lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock), \ > .rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup), \ > + .timer_irq_work = IRQ_WORK_INIT(call_rcu_tasks_mod_timer_lazy), \ > }; \ > static struct rcu_tasks rt_name = \ > { \ > @@ -321,11 +323,19 @@ static void call_rcu_tasks_generic_timer(struct timer_list *tlp) > if (!rtpcp->urgent_gp) 
> rtpcp->urgent_gp = 1; > needwake = true; > - mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); > + /* > + * We cannot mod_timer() here because "timer base lock -> > + * rcu_node lock" lock dependencies can be establish by having > + * a BPF instrument at timer_start(), and that means deadlock > + * if we mod_timer() here. So delay the mod_timer() when we are > + * out of the lock critical section. > + */ > } > raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags); > - if (needwake) > + if (needwake) { > + mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); > rcuwait_wake_up(&rtp->cbs_wait); > + } > } > > // IRQ-work handler that does deferred wakeup for call_rcu_tasks_generic(). > @@ -338,6 +348,16 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp) > rcuwait_wake_up(&rtp->cbs_wait); > } > > +static void call_rcu_tasks_mod_timer_lazy(struct irq_work *iwp) > +{ > + struct rcu_tasks *rtp; > + struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, timer_irq_work); > + rtp = rtpcp->rtpp; > + > + if (!timer_pending(&rtpcp->lazy_timer)) > + mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); > +} > + > // Enqueue a callback for the specified flavor of Tasks RCU. > static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func, > struct rcu_tasks *rtp) > @@ -377,7 +397,7 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func, > (rcu_segcblist_n_cbs(&rtpcp->cblist) == rcu_task_lazy_lim); > if (havekthread && !needwake && !timer_pending(&rtpcp->lazy_timer)) { > if (rtp->lazy_jiffies) > - mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); > + irq_work_queue(&rtpcp->timer_irq_work); > else > needwake = rcu_segcblist_empty(&rtpcp->cblist); > } > -- > 2.50.1 (Apple Git-155) > ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() 2026-03-23 15:17 ` Boqun Feng @ 2026-03-23 20:37 ` Joel Fernandes 2026-03-23 21:50 ` Kumar Kartikeya Dwivedi 1 sibling, 0 replies; 100+ messages in thread From: Joel Fernandes @ 2026-03-23 20:37 UTC (permalink / raw) To: Boqun Feng, Paul E. McKenney Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Song Liu, stable On 3/23/2026 11:17 AM, Boqun Feng wrote: > On Sat, Mar 21, 2026 at 10:03:21AM -0700, Boqun Feng wrote: >> The following deadlock is possible: >> >> __mod_timer() >> lock_timer_base() >> raw_spin_lock_irqsave(&base->lock) <- base->lock ACQUIRED >> trace_timer_start() <- tp_btf/timer_start fires here >> [probe_timer_start BPF program] >> bpf_task_storage_delete() >> bpf_selem_unlink(selem, false) <- reuse_now=false >> bpf_selem_free(false) >> call_rcu_tasks_trace() >> call_rcu_tasks_generic() >> raw_spin_trylock(rtpcp) <- succeeds (different lock) >> mod_timer(lazy_timer) <- lazy_timer is on this CPU's base >> lock_timer_base() >> raw_spin_lock_irqsave(&base->lock) <- SAME LOCK -> DEADLOCK >> >> because BPF can instrument a place while the timer base lock is held. >> Fix it by using an intermediate irq_work. >> >> Further, because a "timer base->lock" to a "rtpcp lock" lock dependency >> can be establish in this way, we cannot mod_timer() with a rtpcp lock >> held. Fix that as well. >> >> Fixes: d119357d0743 ("rcu-tasks: Treat only synchronous grace periods urgently") >> Cc: stable@kernel.org >> Signed-off-by: Boqun Feng <boqun@kernel.org> >> --- >> This is a follow-up of [1], and yes we can trigger a whole system >> deadlock freeze easily with (even non-recursively) tracing the >> timer_start tracepoint. I have a reproduce at: >> >> https://github.com/fbq/rcu_tasks_deadlock >> >> Be very careful, since it'll freeze your system when run it. 
>> 
>> I've tested it on 6.17 and 6.19 and can confirm the deadlock could be
>> triggered. So this is an old bug if it's a bug.
>> 
>> It's up to BPF whether this is a bug or not, because it has existed for
>> a while and nobody seems to get hurt(?).
>> 
> 
> Ping BPF ;-) I know this was sent in a Saturday and only 2 days have
> passed, but we are at the decision point about how hard/urgent we should
> fix these "BPF deadlocks": As this patch shows, the deadlocks existed
> before v7.0 (i.e. before SRCU switches). And yes, ideally we should fix
> all of them, but given we are close to v7.0 release, I would like to
> focus the new issue that SRCU introduces, because that one would likely
> affect SCHED_EXT. Thoughts?

Yeah, I think this deadlock is "lower priority" in the sense that I believe
it existed even before 7.0. But the new SRCU ones are related to 7.0, where
the RCU Tasks Trace implementation was switched to SRCU. So in that sense,
yes, I think the SRCU fix ideally has to go in for 7.0 itself, with the
potential timer fix above moved to a later time.

I also wonder, could we (programmatically, as a startup self-test) add a
BPF tracepoint program to every tracepoint possible and see what blows up?
I am pretty sure there are dragons in there other than timers and
workqueues. That would also make for a nice selftest, where a test BPF
program is hooked to all tracepoints and then some busy-loop workload is
run. I'd brace for some fireworks, but... ;-)

Btw, I am testing the v2 of the irq_work srcu fix, so far so good.
:-) thanks, -- Joel Fernandes > > Regards, > Boqun > >> [1]: https://lore.kernel.org/rcu/2b3848e9-3b11-41b8-8c44-5de28d4a4433@paulmck-laptop/ >> >> kernel/rcu/tasks.h | 26 +++++++++++++++++++++++--- >> 1 file changed, 23 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h >> index 2b55e6acf3c1..d9d5b7d5944e 100644 >> --- a/kernel/rcu/tasks.h >> +++ b/kernel/rcu/tasks.h >> @@ -46,6 +46,7 @@ struct rcu_tasks_percpu { >> unsigned int urgent_gp; >> struct work_struct rtp_work; >> struct irq_work rtp_irq_work; >> + struct irq_work timer_irq_work; >> struct rcu_head barrier_q_head; >> struct list_head rtp_blkd_tasks; >> struct list_head rtp_exit_list; >> @@ -134,6 +135,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp); >> static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = { \ >> .lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock), \ >> .rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup), \ >> + .timer_irq_work = IRQ_WORK_INIT(call_rcu_tasks_mod_timer_lazy), \ >> }; \ >> static struct rcu_tasks rt_name = \ >> { \ >> @@ -321,11 +323,19 @@ static void call_rcu_tasks_generic_timer(struct timer_list *tlp) >> if (!rtpcp->urgent_gp) >> rtpcp->urgent_gp = 1; >> needwake = true; >> - mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); >> + /* >> + * We cannot mod_timer() here because "timer base lock -> >> + * rcu_node lock" lock dependencies can be establish by having >> + * a BPF instrument at timer_start(), and that means deadlock >> + * if we mod_timer() here. So delay the mod_timer() when we are >> + * out of the lock critical section. >> + */ >> } >> raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags); >> - if (needwake) >> + if (needwake) { >> + mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); >> rcuwait_wake_up(&rtp->cbs_wait); >> + } >> } >> >> // IRQ-work handler that does deferred wakeup for call_rcu_tasks_generic(). 
>> @@ -338,6 +348,16 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp) >> rcuwait_wake_up(&rtp->cbs_wait); >> } >> >> +static void call_rcu_tasks_mod_timer_lazy(struct irq_work *iwp) >> +{ >> + struct rcu_tasks *rtp; >> + struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, timer_irq_work); >> + rtp = rtpcp->rtpp; >> + >> + if (!timer_pending(&rtpcp->lazy_timer)) >> + mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); >> +} >> + >> // Enqueue a callback for the specified flavor of Tasks RCU. >> static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func, >> struct rcu_tasks *rtp) >> @@ -377,7 +397,7 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func, >> (rcu_segcblist_n_cbs(&rtpcp->cblist) == rcu_task_lazy_lim); >> if (havekthread && !needwake && !timer_pending(&rtpcp->lazy_timer)) { >> if (rtp->lazy_jiffies) >> - mod_timer(&rtpcp->lazy_timer, rcu_tasks_lazy_time(rtp)); >> + irq_work_queue(&rtpcp->timer_irq_work); >> else >> needwake = rcu_segcblist_empty(&rtpcp->cblist); >> } >> -- >> 2.50.1 (Apple Git-155) >> -- Joel Fernandes ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() 2026-03-23 15:17 ` Boqun Feng 2026-03-23 20:37 ` Joel Fernandes @ 2026-03-23 21:50 ` Kumar Kartikeya Dwivedi 2026-03-23 22:13 ` Boqun Feng 1 sibling, 1 reply; 100+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-03-23 21:50 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Paul E. McKenney, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Song Liu, stable On Mon, 23 Mar 2026 at 16:17, Boqun Feng <boqun@kernel.org> wrote: > > On Sat, Mar 21, 2026 at 10:03:21AM -0700, Boqun Feng wrote: > > The following deadlock is possible: > > > > __mod_timer() > > lock_timer_base() > > raw_spin_lock_irqsave(&base->lock) <- base->lock ACQUIRED > > trace_timer_start() <- tp_btf/timer_start fires here > > [probe_timer_start BPF program] > > bpf_task_storage_delete() > > bpf_selem_unlink(selem, false) <- reuse_now=false > > bpf_selem_free(false) > > call_rcu_tasks_trace() > > call_rcu_tasks_generic() > > raw_spin_trylock(rtpcp) <- succeeds (different lock) > > mod_timer(lazy_timer) <- lazy_timer is on this CPU's base > > lock_timer_base() > > raw_spin_lock_irqsave(&base->lock) <- SAME LOCK -> DEADLOCK > > > > because BPF can instrument a place while the timer base lock is held. > > Fix it by using an intermediate irq_work. > > > > Further, because a "timer base->lock" to a "rtpcp lock" lock dependency > > can be establish in this way, we cannot mod_timer() with a rtpcp lock > > held. Fix that as well. > > > > Fixes: d119357d0743 ("rcu-tasks: Treat only synchronous grace periods urgently") > > Cc: stable@kernel.org > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > --- > > This is a follow-up of [1], and yes we can trigger a whole system > > deadlock freeze easily with (even non-recursively) tracing the > > timer_start tracepoint. 
> > I have a reproducer at:
> > 
> > https://github.com/fbq/rcu_tasks_deadlock
> > 
> > Be very careful, since it'll freeze your system when you run it.
> > 
> > I've tested it on 6.17 and 6.19 and can confirm the deadlock could be
> > triggered. So this is an old bug if it's a bug.
> > 
> > It's up to BPF whether this is a bug or not, because it has existed for
> > a while and nobody seems to get hurt(?).
> > 
> 
> Ping BPF ;-) I know this was sent in a Saturday and only 2 days have
> passed, but we are at the decision point about how hard/urgent we should
> fix these "BPF deadlocks": As this patch shows, the deadlocks existed
> before v7.0 (i.e. before SRCU switches). And yes, ideally we should fix
> all of them, but given we are close to v7.0 release, I would like to
> focus the new issue that SRCU introduces, because that one would likely
> affect SCHED_EXT. Thoughts?

I tried both of your changes, thanks for working on these. I agree the
one reported by Andrea is more important, so that should be the first
priority and should be sent as a fix for 7.0.

It would be good to make the timer-related fix too, but it doesn't need
to be rushed for 7.0. We're aware of the issue being fixed by this patch
in timer tracepoints [0]. It's a corner case which no one has hit thus
far. We already made similar fixes where we could, e.g. [1], but it's
difficult to make a similar change for local storage. Given Paul said he
plans to look into call_{s,}rcu_nolock() anyway, the guard can be
dropped once call_{s,}rcu_nolock() materializes.

Something like what you did in this patch would be a prerequisite for
call_{s,}rcu_nolock() anyway. The assumption will be that it could be
invoked from NMI, so reentrancy can happen anywhere; it then doesn't
matter much whether it happens in the same context while the lock is
held and we end up invoking the same path through call_srcu(), or when
an NMI prog interrupts while the timer lock is held.

[0]: https://lore.kernel.org/bpf/CAP01T76xUCrDH4G2XikNvhPTn6ZbNTgQH59qt2Q_o0c9uudd8w@mail.gmail.com
[1]: https://lore.kernel.org/bpf/20260204055147.54960-2-alexei.starovoitov@gmail.com

> 
> [...]

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() 2026-03-23 21:50 ` Kumar Kartikeya Dwivedi @ 2026-03-23 22:13 ` Boqun Feng 0 siblings, 0 replies; 100+ messages in thread From: Boqun Feng @ 2026-03-23 22:13 UTC (permalink / raw) To: Kumar Kartikeya Dwivedi Cc: Joel Fernandes, Paul E. McKenney, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Song Liu, stable On Mon, Mar 23, 2026 at 10:50:18PM +0100, Kumar Kartikeya Dwivedi wrote: > On Mon, 23 Mar 2026 at 16:17, Boqun Feng <boqun@kernel.org> wrote: > > > > On Sat, Mar 21, 2026 at 10:03:21AM -0700, Boqun Feng wrote: > > > The following deadlock is possible: > > > > > > __mod_timer() > > > lock_timer_base() > > > raw_spin_lock_irqsave(&base->lock) <- base->lock ACQUIRED > > > trace_timer_start() <- tp_btf/timer_start fires here > > > [probe_timer_start BPF program] > > > bpf_task_storage_delete() > > > bpf_selem_unlink(selem, false) <- reuse_now=false > > > bpf_selem_free(false) > > > call_rcu_tasks_trace() > > > call_rcu_tasks_generic() > > > raw_spin_trylock(rtpcp) <- succeeds (different lock) > > > mod_timer(lazy_timer) <- lazy_timer is on this CPU's base > > > lock_timer_base() > > > raw_spin_lock_irqsave(&base->lock) <- SAME LOCK -> DEADLOCK > > > > > > because BPF can instrument a place while the timer base lock is held. > > > Fix it by using an intermediate irq_work. > > > > > > Further, because a "timer base->lock" to a "rtpcp lock" lock dependency > > > can be establish in this way, we cannot mod_timer() with a rtpcp lock > > > held. Fix that as well. 
> > > > > > Fixes: d119357d0743 ("rcu-tasks: Treat only synchronous grace periods urgently") > > > Cc: stable@kernel.org > > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > > --- > > > This is a follow-up of [1], and yes we can trigger a whole system > > > deadlock freeze easily with (even non-recursively) tracing the > > > timer_start tracepoint. I have a reproduce at: > > > > > > https://github.com/fbq/rcu_tasks_deadlock > > > > > > Be very careful, since it'll freeze your system when run it. > > > > > > I've tested it on 6.17 and 6.19 and can confirm the deadlock could be > > > triggered. So this is an old bug if it's a bug. > > > > > > It's up to BPF whether this is a bug or not, because it has existed for > > > a while and nobody seems to get hurt(?). > > > > > > > Ping BPF ;-) I know this was sent in a Saturday and only 2 days have > > passed, but we are at the decision point about how hard/urgent we should > > fix these "BPF deadlocks": As this patch shows, the deadlocks existed > > before v7.0 (i.e. before SRCU switches). And yes, ideally we should fix > > all of them, but given we are close to v7.0 release, I would like to > > focus the new issue that SRCU introduces, because that one would likely > > affect SCHED_EXT. Thoughts? > > I tried both of your changes, thanks for working on these. I agree the No problem. :) > one reported by Andrea is more important, so that should be the first > priority and should be sent as a fix for 7.0. > Agreed. I will work with Joel to forward these fixes to Linus. > It would be good to make the timer-related fix too, but it doesn't > need to be rushed for 7.0. We're aware of the issue being fixed by > this patch in timer tracepoints [0]. > It's a corner case which no one has hit thus far. We already made > similar fixes where we could, e.g. [1], but it's difficult to make a > similar change for local storage. > Right, we need more time to cover all the cases. 
For example, there is a tracepoint inside irq_work_queue() that BPF can
instrument, which could open up a lot of fun interactions. ;-)

> Given Paul said he plans to look into call_{s,}rcu_nolock()
> anyway, the guard can be dropped once call_{s,}rcu_nolock()
> materializes.
> 
> Something like what you did in this patch would be a prerequisite for
> call_{s,}rcu_nolock() anyway. The assumption will be that it could be
> invoked from NMI, so reentrancy can happen anywhere; it then doesn't
> matter much whether it happens in the same context while the lock is
> held and we end up invoking the same path through call_srcu(), or when
> an NMI prog interrupts while the timer lock is held.
> 

Sounds good, I will work with Paul on these.

Regards,
Boqun

> [0]: https://lore.kernel.org/bpf/CAP01T76xUCrDH4G2XikNvhPTn6ZbNTgQH59qt2Q_o0c9uudd8w@mail.gmail.com
> [1]: https://lore.kernel.org/bpf/20260204055147.54960-2-alexei.starovoitov@gmail.com
> > 
> > [...]

^ permalink raw reply	[flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 20:14 ` Boqun Feng
  2026-03-19 20:21 ` Joel Fernandes
@ 2026-03-20 16:15 ` Boqun Feng
  2026-03-20 16:24 ` Paul E. McKenney
  1 sibling, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-20 16:15 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Sebastian Andrzej Siewior, Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend

On Thu, Mar 19, 2026 at 01:14:08PM -0700, Boqun Feng wrote:
[...]
> > Also, any unconditional deferral in the caller for APIs that can "hold
> > locks" to avoid all this is not without its cost.
> >
> > The implementation of RCU knows and can stay in sync with those
> > conditions for when deferral is needed, and hide all that complexity
> > from the caller. The cost should definitely be paid by the caller if
> > we would break the API's broad contract, e.g., by trying to invoke it
>
> The thing is, lots of the synchronization primitives existed before BPF,
> and they were not designed or implemented with "BPF safe" in mind, and
> they could be dragged into the BPF core code path if we begin to use them.
> For example, irq_work may be just "happen-to-work" here, or there is a

Correction: irq_work is designed to be NMI safe (at least when queueing
on the local CPU), so it's potentially the bpf_defer() I proposed.

Regards,
Boqun

> bug that we are missing. It would be easier and clearer if we
> design a dedicated defer mechanism with the BPF core in mind, and then
> we use that for all the deferrable operations.
>
> Regards,
> Boqun
>
> > in NMI, which it is not supposed to run in yet; in that case we already
> > handle things using irq_work. Anything more complicated than that is
> > hard to scale.
> > All of this may also change in the future where we
> > support call_rcu_nolock() to make it work everywhere, and only defer
> > when we detect reentrancy (in the same or different context).
> >
> > [..]
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-20 16:15 ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng @ 2026-03-20 16:24 ` Paul E. McKenney 0 siblings, 0 replies; 100+ messages in thread From: Paul E. McKenney @ 2026-03-20 16:24 UTC (permalink / raw) To: Boqun Feng Cc: Kumar Kartikeya Dwivedi, Sebastian Andrzej Siewior, Joel Fernandes, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend On Fri, Mar 20, 2026 at 09:15:42AM -0700, Boqun Feng wrote: > On Thu, Mar 19, 2026 at 01:14:08PM -0700, Boqun Feng wrote: > [...] > > > Also, any unconditional deferral in the caller for APIs that can "hold > > > locks" to avoid all this is not without its cost. > > > > > > The implementation of RCU knows and can stay in sync with those > > > conditions for when deferral is needed, and hide all that complexity > > > from the caller. The cost should definitely be paid by the caller if > > > we would break the API's broad contract, e.g., by trying to invoke it > > > > The thing is, lots of the synchronization primitives existed before BPF, > > and they were not designed or implemented with "BPF safe" in mind, and > > they could be dragged into BPF core code path if we begin to use them. > > For example, irq_work may be just "happen-to-work" here, or there is a > > Correction: irq_work is designed to be NMI safe (at least when queueing > on the local CPU), so it's potential the bpf_defer() I proposed. Completely agreed! Thanx, Paul > Regards, > Boqun > > > bug that we are missing. It would be rather easier or clearer if we > > design a dedicate defer mechanism with BPF core in mind, and then we use > > that for all the deferrable operations. > > > > Regards, > > Boqun > > > > > in NMI which it is not supposed to run in yet, in that case we already > > > handle things using irq_work. Anything more complicated than that is > > > hard to scale. 
> > > All of this may also change in the future where we
> > > support call_rcu_nolock() to make it work everywhere, and only defer
> > > when we detect reentrancy (in the same or different context).
> > >
> > > [..]
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 16:48 ` Boqun Feng
  2026-03-19 16:59 ` Kumar Kartikeya Dwivedi
@ 2026-03-19 17:02 ` Sebastian Andrzej Siewior
  2026-03-19 17:44 ` Boqun Feng
  1 sibling, 1 reply; 100+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19 17:02 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend

On 2026-03-19 09:48:16 [-0700], Boqun Feng wrote:
> I agree it's not RCU's fault ;-)

I never claimed it is anyone's fault. I just see that BPF should be able
to do things which kgdb would not be allowed to.

> I guess it'll be difficult to restrict BPF, however maybe BPF can call
> call_srcu() in irq_work instead? Or a more systematic defer mechanism
> that allows BPF to defer any lock-holding functions to a different
> context. (We have a similar issue that BPF cannot call kfree_rcu() in
> some cases, IIRC.)
>
> But we need to fix this in v7.0, so this short-term fix is still needed.

I would prefer something substantial before we rush to get a quick fix
and move on.

If we could get that irq_work() part only for BPF where it is required,
then that would already be a step forward.

Long term it would be nice if we could avoid calling this while locks
are held. I think call_rcu() can't be used under the rq/pi lock, but
timers should be fine.

Is this rq/pi locking originating from "regular" BPF code or sched_ext?

> Regards,
> Boqun
>
> > > Regards,
> > > Boqun

Sebastian
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19 17:02 ` Sebastian Andrzej Siewior
@ 2026-03-19 17:44 ` Boqun Feng
  2026-03-19 18:42 ` Joel Fernandes
  0 siblings, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-19 17:44 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Joel Fernandes, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend

On Thu, Mar 19, 2026 at 06:02:44PM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-19 09:48:16 [-0700], Boqun Feng wrote:
> > I agree it's not RCU's fault ;-)
>
> I never claimed it is anyone's fault. I just see that BPF should be able
> to do things which kgdb would not be allowed to.
>
> > I guess it'll be difficult to restrict BPF, however maybe BPF can call
> > call_srcu() in irq_work instead? Or a more systematic defer mechanism
> > that allows BPF to defer any lock holding functions to a different
> > context. (We have a similar issue that BPF cannot call kfree_rcu() in
> > some cases IIRC).
> >
> > But we need to fix this in v7.0, so this short-term fix is still needed.
>
> I would prefer something substantial before we rush to get a quick fix
> and move on.

The quick fix here is really "restore the previous behavior of
call_rcu_tasks_trace() in call_srcu()", and the future work will
naturally happen: if the extra irq_work layer turns out to cause issues
for other SRCU users, then we need to fix them as well. Otherwise, there
is no real need to avoid the extra irq_work hop. So I *think* it's OK
;-)

Cleaning up all the ad-hoc irq_work usages in BPF is another thing,
which can happen if we learn about all the cases and have a good design.

> If we could get that irq_work() part only for BPF where it is required
> then it would be already a step forward.

I'm happy to include that (i.e. using Qiang's suggestion) if Joel also
agrees.
> Long term it would be nice if we could avoid calling this while locks
> are held. I think call_rcu() can't be used under rq/pi lock, but timers
> should be fine.
>
> Is this rq/pi locking originating from "regular" BPF code or sched_ext?
>

I think if you have any tracepoint (including traceable functions) under
rq/pi locking, then potentially BPF can call call_srcu() there.

The root cause of the issues is that BPF is actually like an NMI unless
the code is noinstr. (There is a rabbit hole about BPF calling
call_srcu() while it's instrumenting call_srcu() itself.) And the right
way to solve all the issues is to have a general defer mechanism for
BPF.

Regards,
Boqun

> > Regards,
> > Boqun
>
> Sebastian
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 17:44 ` Boqun Feng @ 2026-03-19 18:42 ` Joel Fernandes 2026-03-19 20:20 ` Boqun Feng 0 siblings, 1 reply; 100+ messages in thread From: Joel Fernandes @ 2026-03-19 18:42 UTC (permalink / raw) To: Boqun Feng, Sebastian Andrzej Siewior Cc: paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Steven Rostedt, Andrea Righi On 3/19/2026 1:44 PM, Boqun Feng wrote: > On Thu, Mar 19, 2026 at 06:02:44PM +0100, Sebastian Andrzej Siewior wrote: >> On 2026-03-19 09:48:16 [-0700], Boqun Feng wrote: >>> I agree it's not RCU's fault ;-) >> >> I never claimed it is anyone's fault. I just see that BPF should be able >> to do things which kgdb would not be allowed to. >> >>> I guess it'll be difficult to restrict BPF, however maybe BPF can call >>> call_srcu() in irq_work instead? Or a more systematic defer mechanism >>> that allows BPF to defer any lock holding functions to a different >>> context. (We have a similar issue that BPF cannot call kfree_rcu() in >>> some cases IIRC). >>> >>> But we need to fix this in v7.0, so this short-term fix is still needed. >> >> I would prefer something substantial before we rush to get a quick fix >> and move on. >> > > The quick fix here is really "restore the previous behavior of > call_rcu_tasks_trace() in call_srcu()", and the future work will Unfortunately reverting c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") is tricky since the original body of RCU Tasks Trace code is deleted. Perhaps we should have added an easier escape-hatch, lesson learnt:) > naturally happen: if the extra irq_work layer turns out calling issues > to other SRCU users, then we need to fix them as well. Otherwise, there > is no real need to avoid the extra irq_work hop. 
> So I *think* it's OK ;-)
>
> Cleaning up all the ad-hoc irq_work usages in BPF is another thing,
> which can happen if we learn about all the cases and have a good design.
>
>> If we could get that irq_work() part only for BPF where it is required
>> then it would be already a step forward.
>
> I'm happy to include that (i.e. using Qiang's suggestion) if Joel also
> agrees.

Sure, I am OK with this sort of short-term fix, but I worry that it still
does not fix the issues due to the tasks-trace conversion. In particular,
it doesn't fix the issue Andrea reported, AFAICS, because there is a
dependency on pool->lock? See:
https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/

That happens precisely because of the queue_delayed_work() happening from
the SRCU tasks-trace specific BPF, right?

This looks something like this, due to the combination of SRCU, the
scheduler, and WQ:

srcu_usage.lock -> pool->lock -> pi_lock -> rq->__lock
       ^                                        |
       |                                        |
       +----------- DEADLOCK CYCLE ------------+

>> Long term it would be nice if we could avoid calling this while locks
>> are held. I think call_rcu() can't be used under rq/pi lock, but timers
>> should be fine.
>>
>> Is this rq/pi locking originating from "regular" BPF code or sched_ext?
>
> I think if you have any tracepoint (including traceable functions) under
> rq/pi locking, then potentially BPF can call call_srcu() there.
>
> The root cause of the issues is that BPF is actually like an NMI unless
> the code is noinstr (There is a rabbit hole about BPF calling
> call_srcu() while it's instrumenting call_srcu() itself). And the right
> way to solve all the issues is to have a general defer mechanism for
> BPF.

Will that really solve the above-mentioned issue that Andrea reported,
though?

+Andrea, +Steve as well.

thanks,

--
Joel Fernandes
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 18:42 ` Joel Fernandes @ 2026-03-19 20:20 ` Boqun Feng 2026-03-19 20:26 ` Joel Fernandes 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-19 20:20 UTC (permalink / raw) To: Joel Fernandes Cc: Sebastian Andrzej Siewior, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Steven Rostedt, Andrea Righi On Thu, Mar 19, 2026 at 02:42:56PM -0400, Joel Fernandes wrote: > On 3/19/2026 1:44 PM, Boqun Feng wrote: > > On Thu, Mar 19, 2026 at 06:02:44PM +0100, Sebastian Andrzej Siewior wrote: > >> On 2026-03-19 09:48:16 [-0700], Boqun Feng wrote: > >>> I agree it's not RCU's fault ;-) > >> > >> I never claimed it is anyone's fault. I just see that BPF should be able > >> to do things which kgdb would not be allowed to. > >> > >>> I guess it'll be difficult to restrict BPF, however maybe BPF can call > >>> call_srcu() in irq_work instead? Or a more systematic defer mechanism > >>> that allows BPF to defer any lock holding functions to a different > >>> context. (We have a similar issue that BPF cannot call kfree_rcu() in > >>> some cases IIRC). > >>> > >>> But we need to fix this in v7.0, so this short-term fix is still needed. > >> > >> I would prefer something substantial before we rush to get a quick fix > >> and move on. > >> > > > > The quick fix here is really "restore the previous behavior of > > call_rcu_tasks_trace() in call_srcu()", and the future work will > > Unfortunately reverting c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in > terms of SRCU-fast") is tricky since the original body of RCU Tasks Trace code > is deleted. Perhaps we should have added an easier escape-hatch, lesson learnt:) > > > naturally happen: if the extra irq_work layer turns out calling issues > > to other SRCU users, then we need to fix them as well. 
Otherwise, there > > is no real need to avoid the extra irq_work hop. So I *think* it's OK > > ;-) > > > > Cleaning up all the ad-hoc irq_work usages in BPF is another thing, > > which can happen if we learn about all the cases and have a good design. > > > >> If we could get that irq_work() part only for BPF where it is required > >> then it would be already a step forward. > >> > > > > I'm happy to include that (i.e. using Qiang's suggestion) if Joel also > > agrees. > > Sure, I am Ok with sort of short-term fix, but I worry that it still does not > the issues due to the tasks-trace conversion. In particular, it doesn't fix the > issue Andrea reported AFAICS, because there is a dependency on pool->lock? see: > https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ > > That happens precisely because of the queue_delayed_work() happening from the > SRCU tasks-trace specific BPF right? > > This looks something like this, due to combination of SRCU, scheduler and WQ: > > srcu_usage.lock -> pool->lock -> pi_lock -> rq->__lock > ^ | > | | > +----------- DEADLOCK CYCLE ------------+ > > >> Long term it would be nice if we could avoid calling this while locks > >> are held. I think call_rcu() can't be used under rq/pi lock, but timers > >> should be fine. > >> > >> Is this rq/pi locking originating from "regular" BPF code or sched_ext? > >> > > > > I think if you have any tracepoint (include traceable functions) under > > rq/pi locking, then potentially BPF can call call_srcu() there. > > > > > The root cause of the issues is that BPF is actually like a NMI unless > > the code is noinstr (There is a rabit hole about BPF calling > > call_srcu() while it's instrumenting call_srcu() itself). And the right > > way to solve all the issues is to have a general defer mechanism for > > BPF. > Will that really solve the above mentioned issue though that Andrea reported? 
It should, since we use irq_work to call queue_work() instead of calling
queue_work() directly, so we break the srcu_usage.lock -> pool->lock
dependency. But yes, some tests would be good. The code is at:

https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix

The related commits are:

78dcdc35d85f rcu: Use an intermediate irq_work to start process_srcu()
0490fe4b5c39 srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()

One fixes the raw spinlock vs spinlock issue, the other fixes the
deadlock.

Regards,
Boqun

> +Andrea, +Steve as well.
>
> thanks,
>
> --
> Joel Fernandes
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 20:20 ` Boqun Feng @ 2026-03-19 20:26 ` Joel Fernandes 2026-03-19 20:45 ` Joel Fernandes 0 siblings, 1 reply; 100+ messages in thread From: Joel Fernandes @ 2026-03-19 20:26 UTC (permalink / raw) To: Boqun Feng Cc: Sebastian Andrzej Siewior, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Steven Rostedt, Andrea Righi On 3/19/2026 4:20 PM, Boqun Feng wrote: > On Thu, Mar 19, 2026 at 02:42:56PM -0400, Joel Fernandes wrote: [...] >>> naturally happen: if the extra irq_work layer turns out calling issues >>> to other SRCU users, then we need to fix them as well. Otherwise, there >>> is no real need to avoid the extra irq_work hop. So I *think* it's OK >>> ;-) >>> >>> Cleaning up all the ad-hoc irq_work usages in BPF is another thing, >>> which can happen if we learn about all the cases and have a good design. >>> >>>> If we could get that irq_work() part only for BPF where it is required >>>> then it would be already a step forward. >>>> >>> >>> I'm happy to include that (i.e. using Qiang's suggestion) if Joel also >>> agrees. >> >> Sure, I am Ok with sort of short-term fix, but I worry that it still does not >> the issues due to the tasks-trace conversion. In particular, it doesn't fix the >> issue Andrea reported AFAICS, because there is a dependency on pool->lock? see: >> https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ >> >> That happens precisely because of the queue_delayed_work() happening from the >> SRCU tasks-trace specific BPF right? >> >> This looks something like this, due to combination of SRCU, scheduler and WQ: >> >> srcu_usage.lock -> pool->lock -> pi_lock -> rq->__lock >> ^ | >> | | >> +----------- DEADLOCK CYCLE ------------+ >> >>>> Long term it would be nice if we could avoid calling this while locks >>>> are held. 
I think call_rcu() can't be used under rq/pi lock, but timers >>>> should be fine. >>>> >>>> Is this rq/pi locking originating from "regular" BPF code or sched_ext? >>>> >>> >>> I think if you have any tracepoint (include traceable functions) under >>> rq/pi locking, then potentially BPF can call call_srcu() there. >> >>> >>> The root cause of the issues is that BPF is actually like a NMI unless >>> the code is noinstr (There is a rabit hole about BPF calling >>> call_srcu() while it's instrumenting call_srcu() itself). And the right >>> way to solve all the issues is to have a general defer mechanism for >>> BPF. >> Will that really solve the above mentioned issue though that Andrea reported? >> > > It should, since we call irq_work to queue_work instead queue_work > directly, so we break the srcu_usage.lock -> pool->lock dependency. But > yes, some tests would be good, the code is at: > > https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix > > related commits are: > > 78dcdc35d85f rcu: Use an intermediate irq_work to start process_srcu() > 0490fe4b5c39 srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable() > > One fixes the raw spinlock vs spinlock issue, the other fixes the > deadlock. Ah yes, with the irq_work fix, indeed. I'll try to queue the irq_work fix for 7.1 and run some tests. Appreciate if Andrea, Paul and Kumar can also check, thanks, -- Joel Fernandes ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 20:26 ` Joel Fernandes @ 2026-03-19 20:45 ` Joel Fernandes 0 siblings, 0 replies; 100+ messages in thread From: Joel Fernandes @ 2026-03-19 20:45 UTC (permalink / raw) To: Boqun Feng Cc: Sebastian Andrzej Siewior, paulmck, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo, bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Steven Rostedt, Andrea Righi On 3/19/2026 4:26 PM, Joel Fernandes wrote: > > > On 3/19/2026 4:20 PM, Boqun Feng wrote: >> On Thu, Mar 19, 2026 at 02:42:56PM -0400, Joel Fernandes wrote: > [...] > >>>> naturally happen: if the extra irq_work layer turns out calling issues >>>> to other SRCU users, then we need to fix them as well. Otherwise, there >>>> is no real need to avoid the extra irq_work hop. So I *think* it's OK >>>> ;-) >>>> >>>> Cleaning up all the ad-hoc irq_work usages in BPF is another thing, >>>> which can happen if we learn about all the cases and have a good design. >>>> >>>>> If we could get that irq_work() part only for BPF where it is required >>>>> then it would be already a step forward. >>>>> >>>> >>>> I'm happy to include that (i.e. using Qiang's suggestion) if Joel also >>>> agrees. >>> >>> Sure, I am Ok with sort of short-term fix, but I worry that it still does not >>> the issues due to the tasks-trace conversion. In particular, it doesn't fix the >>> issue Andrea reported AFAICS, because there is a dependency on pool->lock? see: >>> https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ >>> >>> That happens precisely because of the queue_delayed_work() happening from the >>> SRCU tasks-trace specific BPF right? 
>>> >>> This looks something like this, due to combination of SRCU, scheduler and WQ: >>> >>> srcu_usage.lock -> pool->lock -> pi_lock -> rq->__lock >>> ^ | >>> | | >>> +----------- DEADLOCK CYCLE ------------+ >>> >>>>> Long term it would be nice if we could avoid calling this while locks >>>>> are held. I think call_rcu() can't be used under rq/pi lock, but timers >>>>> should be fine. >>>>> >>>>> Is this rq/pi locking originating from "regular" BPF code or sched_ext? >>>>> >>>> >>>> I think if you have any tracepoint (include traceable functions) under >>>> rq/pi locking, then potentially BPF can call call_srcu() there. >>> >>>> >>>> The root cause of the issues is that BPF is actually like a NMI unless >>>> the code is noinstr (There is a rabit hole about BPF calling >>>> call_srcu() while it's instrumenting call_srcu() itself). And the right >>>> way to solve all the issues is to have a general defer mechanism for >>>> BPF. >>> Will that really solve the above mentioned issue though that Andrea reported? >>> >> >> It should, since we call irq_work to queue_work instead queue_work >> directly, so we break the srcu_usage.lock -> pool->lock dependency. But >> yes, some tests would be good, the code is at: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix >> >> related commits are: >> >> 78dcdc35d85f rcu: Use an intermediate irq_work to start process_srcu() >> 0490fe4b5c39 srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable() >> >> One fixes the raw spinlock vs spinlock issue, the other fixes the >> deadlock. > Ah yes, with the irq_work fix, indeed. > > I'll try to queue the irq_work fix for 7.1 and run some tests. Appreciate if > Andrea, Paul and Kumar can also check, Ah, but of course these should go through 7.0 (assuming they fix all open issues) since that's when the bug was introduced. thanks, -- Joel Fernandes ^ permalink raw reply [flat|nested] 100+ messages in thread
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 1:08 ` Boqun Feng 2026-03-19 9:03 ` Sebastian Andrzej Siewior @ 2026-03-19 10:02 ` Paul E. McKenney 2026-03-19 14:34 ` Boqun Feng 1 sibling, 1 reply; 100+ messages in thread From: Paul E. McKenney @ 2026-03-19 10:02 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo On Wed, Mar 18, 2026 at 06:08:21PM -0700, Boqun Feng wrote: > On Wed, Mar 18, 2026 at 04:27:23PM -0700, Boqun Feng wrote: > > On Wed, Mar 18, 2026 at 06:52:53PM -0400, Joel Fernandes wrote: > > > > > > > > > On 3/18/2026 6:15 PM, Boqun Feng wrote: > > > > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: > > > >> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: > > > >> [...] > > > >>>> Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a > > > >>>> different issue, from the NMI issue? It is more of an issue of calling > > > >>>> call_srcu API with scheduler locks held. > > > >>>> > > > >>>> Something like below I think: > > > >>>> > > > >>>> CPU A (BPF tracepoint) CPU B (concurrent call_srcu) > > > >>>> ---------------------------- ------------------------------------ > > > >>>> [1] holds &rq->__lock > > > >>>> [2] > > > >>>> -> call_srcu > > > >>>> -> srcu_gp_start_if_needed > > > >>>> -> srcu_funnel_gp_start > > > >>>> -> spin_lock_irqsave_ssp_content... > > > >>>> -> holds srcu locks > > > >>>> > > > >>>> [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) 
> > > >>>> -> queue_delayed_work > > > >>>> -> call_srcu() -> __queue_work() > > > >>>> -> srcu_gp_start_if_needed() -> wake_up_worker() > > > >>>> -> srcu_funnel_gp_start() -> try_to_wake_up() > > > >>>> -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock > > > >>>> -> WANTS srcu locks > > > >>> > > > >>> I see, we can also have a self deadlock even without CPU B, when CPU A > > > >>> is going to try_to_wake_up() the a worker on the same CPU. > > > >>> > > > >>> An interesting observation is that the deadlock can be avoided in > > > >>> queue_delayed_work() uses a non-zero delay, that means a timer will be > > > >>> armed instead of acquiring the rq lock. > > > >>> > > > > > > > > If my observation is correct, then this can probably fix the deadlock > > > > issue with runqueue lock (untested though), but it won't work if BPF > > > > tracepoint can happen with timer base lock held. > > > > > > > > Regards, > > > > Boqun > > > > > > > > ------> > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > index 2328827f8775..a5d67264acb5 100644 > > > > --- a/kernel/rcu/srcutree.c > > > > +++ b/kernel/rcu/srcutree.c > > > > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > struct srcu_node *snp_leaf; > > > > unsigned long snp_seq; > > > > struct srcu_usage *sup = ssp->srcu_sup; > > > > + bool irqs_were_disabled; > > > > > > > > /* Ensure that snp node tree is fully initialized before traversing it */ > > > > if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > > > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > > > > /* Top of tree, must ensure the grace period will be started. */ > > > > raw_spin_lock_irqsave_ssp_contention(ssp, &flags); > > > > + irqs_were_disabled = irqs_disabled_flags(flags); > > > > if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { > > > > /* > > > > * Record need for grace period s. 
Pair with load > > > > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > // it isn't. And it does not have to be. After all, it > > > > // can only be executed during early boot when there is only > > > > // the one boot CPU running with interrupts still disabled. > > > > + // > > > > + // If irq was disabled when call_srcu() is called, then we > > > > + // could be in the scheduler path with a runqueue lock held, > > > > + // delay the process_srcu() work 1 more jiffies so we don't go > > > > + // through the kick_pool() -> wake_up_process() path below, and > > > > + // we could avoid deadlock with runqueue lock. > > > > if (likely(srcu_init_done)) > > > > queue_delayed_work(rcu_gp_wq, &sup->work, > > > > - !!srcu_get_delay(ssp)); > > > > + !!srcu_get_delay(ssp) + > > > > + !!irqs_were_disabled); > > > Nice, I wonder if it is better to do this in __queue_delayed_work() itself. > > > Do we have queue_delayed_work() with zero delays that are in irq-disabled > > > regions, and they depend on that zero-delay for correctness? Even with > > > delay of 0 though, the work item doesn't execute right away anyway, the > > > worker thread has to also be scheduler right? > > > > > > Also if IRQ is disabled, I'd think this is a critical path that is not > > > wanting to run the work item right-away anyway since workqueue is more a > > > bottom-half mechanism, than "run this immediately". > > > > > > IOW, would be good to make the workqueue-layer more resilient to waking up > > > the scheduler when a delay would have been totally ok. But maybe +Tejun can > > > yell if that sounds insane. > > > > > > > I think all of these are probably a good point. 
However my fix is not
> > complete :( It's missing the ABBA case in your example (it obviously
> > could solve the self deadlock if my observation is correct), because we
> > will still build rcu_node::lock -> runqueue::lock in some conditions,
> > and BPF contributes the runqueue::lock -> rcu_node::lock dependency.
> > Hence we still have ABBA deadlock.
> > 
> > To remove the rcu_node::lock -> runqueue::lock entirely, we need to
> > always delay 1+ jiffies:
> > 
> 
> Hmm.. or I can do as the old call_rcu_tasks_trace() does: using an
> irq_work. I also pushed it at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix
> 
> (based on Paul's fix on spinlock already, but only lightly build tested).
> 
> Regards,
> Boqun
> 
> -------------------------->8
> Subject: [PATCH] rcu: Use an intermediate irq_work to start process_srcu()
> 
> Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms
> of SRCU-fast") we switched to SRCU in BPF. However, as BPF instrumentation
> can happen basically anywhere (including where a scheduler lock is held),
> call_srcu() now needs to avoid acquiring scheduler locks, because
> otherwise it could cause a deadlock [1]. Fix this by following what the
> previous RCU Tasks Trace implementation did: use an irq_work to delay the
> queuing of the work that starts process_srcu().
> 
> Fixes: c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
> Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
> Signed-off-by: Boqun Feng <boqun@kernel.org>
> ---
>  include/linux/srcutree.h |  1 +
>  kernel/rcu/srcutree.c    | 22 ++++++++++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index b122c560a59c..fd1a9270cb9a 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -95,6 +95,7 @@ struct srcu_usage {
>  	unsigned long reschedule_jiffies;
>  	unsigned long reschedule_count;
>  	struct delayed_work work;
> +	struct irq_work irq_work;
>  	struct srcu_struct *srcu_ssp;
>  };
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 2328827f8775..57116635e72d 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -19,6 +19,7 @@
>  #include <linux/mutex.h>
>  #include <linux/percpu.h>
>  #include <linux/preempt.h>
> +#include <linux/irq_work.h>
>  #include <linux/rcupdate_wait.h>
>  #include <linux/sched.h>
>  #include <linux/smp.h>
> @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done;
>  static void srcu_invoke_callbacks(struct work_struct *work);
>  static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay);
>  static void process_srcu(struct work_struct *work);
> +static void srcu_irq_work(struct irq_work *work);
>  static void srcu_delay_timer(struct timer_list *t);
> 
>  /*
> @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static)
>  	mutex_init(&ssp->srcu_sup->srcu_barrier_mutex);
>  	atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0);
>  	INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu);
> +	init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work);
>  	ssp->srcu_sup->sda_is_static = is_static;
>  	if (!is_static) {
>  		ssp->sda = alloc_percpu(struct srcu_data);
> @@ -1118,9 +1121,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>  	// it isn't.  And it does not have to be.  After all, it
>  	// can only be executed during early boot when there is only
>  	// the one boot CPU running with interrupts still disabled.
> +	//
> +	// Use an irq_work here to avoid acquiring runqueue lock with
> +	// srcu rcu_node::lock held. BPF instrument could introduce the
> +	// opposite dependency, hence we need to break the possible
> +	// locking dependency here.

If I understand the lockdep splat, you need to bail out earlier on,
prior to the first lock acquisition.

							Thanx, Paul

>  	if (likely(srcu_init_done))
> -		queue_delayed_work(rcu_gp_wq, &sup->work,
> -				   !!srcu_get_delay(ssp));
> +		irq_work_queue(&sup->irq_work);
>  	else if (list_empty(&sup->work.work.entry))
>  		list_add(&sup->work.work.entry, &srcu_boot_list);
>  }
> @@ -1979,6 +1986,17 @@ static void process_srcu(struct work_struct *work)
>  	srcu_reschedule(ssp, curdelay);
>  }
> 
> +static void srcu_irq_work(struct irq_work *work)
> +{
> +	struct srcu_struct *ssp;
> +	struct srcu_usage *sup;
> +
> +	sup = container_of(work, struct srcu_usage, irq_work);
> +	ssp = sup->srcu_ssp;
> +
> +	queue_delayed_work(rcu_gp_wq, &sup->work, !!srcu_get_delay(ssp));
> +}
> +
>  void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags,
>  			     unsigned long *gp_seq)
>  {
> -- 
> 2.50.1 (Apple Git-155)
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 10:02 ` Paul E. McKenney @ 2026-03-19 14:34 ` Boqun Feng 2026-03-19 16:10 ` Paul E. McKenney 0 siblings, 1 reply; 100+ messages in thread From: Boqun Feng @ 2026-03-19 14:34 UTC (permalink / raw) To: Paul E. McKenney Cc: Joel Fernandes, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo On Thu, Mar 19, 2026 at 03:02:57AM -0700, Paul E. McKenney wrote: > On Wed, Mar 18, 2026 at 06:08:21PM -0700, Boqun Feng wrote: > > On Wed, Mar 18, 2026 at 04:27:23PM -0700, Boqun Feng wrote: > > > On Wed, Mar 18, 2026 at 06:52:53PM -0400, Joel Fernandes wrote: > > > > > > > > > > > > On 3/18/2026 6:15 PM, Boqun Feng wrote: > > > > > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: > > > > >> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: > > > > >> [...] > > > > >>>> Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a > > > > >>>> different issue, from the NMI issue? It is more of an issue of calling > > > > >>>> call_srcu API with scheduler locks held. > > > > >>>> > > > > >>>> Something like below I think: > > > > >>>> > > > > >>>> CPU A (BPF tracepoint) CPU B (concurrent call_srcu) > > > > >>>> ---------------------------- ------------------------------------ > > > > >>>> [1] holds &rq->__lock > > > > >>>> [2] > > > > >>>> -> call_srcu > > > > >>>> -> srcu_gp_start_if_needed > > > > >>>> -> srcu_funnel_gp_start > > > > >>>> -> spin_lock_irqsave_ssp_content... > > > > >>>> -> holds srcu locks > > > > >>>> > > > > >>>> [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) 
> > > > >>>> -> queue_delayed_work > > > > >>>> -> call_srcu() -> __queue_work() > > > > >>>> -> srcu_gp_start_if_needed() -> wake_up_worker() > > > > >>>> -> srcu_funnel_gp_start() -> try_to_wake_up() > > > > >>>> -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock > > > > >>>> -> WANTS srcu locks > > > > >>> > > > > >>> I see, we can also have a self deadlock even without CPU B, when CPU A > > > > >>> is going to try_to_wake_up() the a worker on the same CPU. > > > > >>> > > > > >>> An interesting observation is that the deadlock can be avoided in > > > > >>> queue_delayed_work() uses a non-zero delay, that means a timer will be > > > > >>> armed instead of acquiring the rq lock. > > > > >>> > > > > > > > > > > If my observation is correct, then this can probably fix the deadlock > > > > > issue with runqueue lock (untested though), but it won't work if BPF > > > > > tracepoint can happen with timer base lock held. > > > > > > > > > > Regards, > > > > > Boqun > > > > > > > > > > ------> > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > > index 2328827f8775..a5d67264acb5 100644 > > > > > --- a/kernel/rcu/srcutree.c > > > > > +++ b/kernel/rcu/srcutree.c > > > > > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > struct srcu_node *snp_leaf; > > > > > unsigned long snp_seq; > > > > > struct srcu_usage *sup = ssp->srcu_sup; > > > > > + bool irqs_were_disabled; > > > > > > > > > > /* Ensure that snp node tree is fully initialized before traversing it */ > > > > > if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > > > > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > > > > > > /* Top of tree, must ensure the grace period will be started. 
*/ > > > > > raw_spin_lock_irqsave_ssp_contention(ssp, &flags); > > > > > + irqs_were_disabled = irqs_disabled_flags(flags); > > > > > if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { > > > > > /* > > > > > * Record need for grace period s. Pair with load > > > > > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > // it isn't. And it does not have to be. After all, it > > > > > // can only be executed during early boot when there is only > > > > > // the one boot CPU running with interrupts still disabled. > > > > > + // > > > > > + // If irq was disabled when call_srcu() is called, then we > > > > > + // could be in the scheduler path with a runqueue lock held, > > > > > + // delay the process_srcu() work 1 more jiffies so we don't go > > > > > + // through the kick_pool() -> wake_up_process() path below, and > > > > > + // we could avoid deadlock with runqueue lock. > > > > > if (likely(srcu_init_done)) > > > > > queue_delayed_work(rcu_gp_wq, &sup->work, > > > > > - !!srcu_get_delay(ssp)); > > > > > + !!srcu_get_delay(ssp) + > > > > > + !!irqs_were_disabled); > > > > Nice, I wonder if it is better to do this in __queue_delayed_work() itself. > > > > Do we have queue_delayed_work() with zero delays that are in irq-disabled > > > > regions, and they depend on that zero-delay for correctness? Even with > > > > delay of 0 though, the work item doesn't execute right away anyway, the > > > > worker thread has to also be scheduler right? > > > > > > > > Also if IRQ is disabled, I'd think this is a critical path that is not > > > > wanting to run the work item right-away anyway since workqueue is more a > > > > bottom-half mechanism, than "run this immediately". > > > > > > > > IOW, would be good to make the workqueue-layer more resilient to waking up > > > > the scheduler when a delay would have been totally ok. But maybe +Tejun can > > > > yell if that sounds insane. 
> > > > > > > > > > I think all of these are probably a good point. However my fix is not > > > complete :( It's missing the ABBA case in your example (it obviously > > > could solve the self deadlock if my observation is correct), because we > > > will still build rcu_node::lock -> runqueue::lock in some conditions, > > > and BPF contributes the runqueue::lock -> rcu_node::lock dependency. > > > Hence we still have ABBA deadlock. > > > > > > To remove the rcu_node::lock -> runqueue::lock entirely, we need to > > > always delay 1+ jiffies: > > > > > > > Hmm.. or I can do as the old call_rcu_tasks_trace() does: using an > > irq_work. I also pushed it at: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix > > > > (based on Paul's fix on spinlock already, but only lightly build test). > > > > Regards, > > Boqun > > > > -------------------------->8 > > Subject: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() > > > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > happen basically everywhere (including where a scheduler lock is held), > > call_srcu() now needs to avoid acquiring scheduler lock because > > otherwise it could cause deadlock [1]. Fix this by following what the > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > the work to start process_srcu(). 
> > > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > --- > > include/linux/srcutree.h | 1 + > > kernel/rcu/srcutree.c | 22 ++++++++++++++++++++-- > > 2 files changed, 21 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > > index b122c560a59c..fd1a9270cb9a 100644 > > --- a/include/linux/srcutree.h > > +++ b/include/linux/srcutree.h > > @@ -95,6 +95,7 @@ struct srcu_usage { > > unsigned long reschedule_jiffies; > > unsigned long reschedule_count; > > struct delayed_work work; > > + struct irq_work irq_work; > > struct srcu_struct *srcu_ssp; > > }; > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 2328827f8775..57116635e72d 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -19,6 +19,7 @@ > > #include <linux/mutex.h> > > #include <linux/percpu.h> > > #include <linux/preempt.h> > > +#include <linux/irq_work.h> > > #include <linux/rcupdate_wait.h> > > #include <linux/sched.h> > > #include <linux/smp.h> > > @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done; > > static void srcu_invoke_callbacks(struct work_struct *work); > > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > > static void process_srcu(struct work_struct *work); > > +static void srcu_irq_work(struct irq_work *work); > > static void srcu_delay_timer(struct timer_list *t); > > > > /* > > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > > + init_irq_work(&ssp->srcu_sup->irq_work, srcu_irq_work); > > ssp->srcu_sup->sda_is_static = is_static; > > if (!is_static) { > > 
ssp->sda = alloc_percpu(struct srcu_data);
> > @@ -1118,9 +1121,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
> >  	// it isn't.  And it does not have to be.  After all, it
> >  	// can only be executed during early boot when there is only
> >  	// the one boot CPU running with interrupts still disabled.
> > +	//
> > +	// Use an irq_work here to avoid acquiring runqueue lock with
> > +	// srcu rcu_node::lock held. BPF instrument could introduce the
> > +	// opposite dependency, hence we need to break the possible
> > +	// locking dependency here.
> 
> If I understand the lockdep splat, you need to bail out earlier on,
> prior to the first lock acquisition.
> 

I think you're talking about another direction of the dependency ;-)

Joel's example shows both dependencies clearly:

CPU A (BPF tracepoint)                          CPU B (concurrent call_srcu)
----------------------------                    ------------------------------------
[1] holds &rq->__lock
                                                [2]
                                                -> call_srcu
                                                  -> srcu_gp_start_if_needed
                                                    -> srcu_funnel_gp_start
                                                      -> spin_lock_irqsave_ssp_content...
                                                      -> holds srcu locks

[4] calls call_rcu_tasks_trace()                [5] srcu_funnel_gp_start (cont..)
                                                  -> queue_delayed_work
    -> call_srcu()                                  -> __queue_work()
      -> srcu_gp_start_if_needed()                    -> wake_up_worker()
        -> srcu_funnel_gp_start()                       -> try_to_wake_up()
          -> spin_lock_irqsave_ssp_contention()       [6] WANTS rq->__lock
            -> WANTS srcu locks

To remove [1] -> [4], i.e. the dependency "&rq->__lock" -> "srcu locks",
yes, you need to bail out earlier on. But to remove [2] -> [6], i.e. the
dependency "srcu locks" -> "&rq->__lock", you just need to make sure
call_srcu() won't call anything that needs to acquire the rq lock.

In the old version of call_rcu_tasks_trace(), we were also fine because
[2] -> [6] didn't exist ([6] is the rcu node lock in this case). [1] -> [4]
([4] being the rcu node lock) existed in the RCU Tasks Trace version:

// Enqueue a callback for the specified flavor of Tasks RCU.
static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
				   struct rcu_tasks *rtp)
{
	...
	if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
		raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
		...
	}
	...
	rcu_segcblist_enqueue(&rtpcp->cblist, rhp);
	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
	...
	/* We can't create the thread unless interrupts are enabled. */
	if (needwake && READ_ONCE(rtp->kthread_ptr))
		irq_work_queue(&rtpcp->rtp_irq_work);
}

Hope it helps.

Regards,
Boqun

> 							Thanx, Paul
> 
> >  	if (likely(srcu_init_done))
> > -		queue_delayed_work(rcu_gp_wq, &sup->work,
> > -				   !!srcu_get_delay(ssp));
> > +		irq_work_queue(&sup->irq_work);
> >  	else if (list_empty(&sup->work.work.entry))
> >  		list_add(&sup->work.work.entry, &srcu_boot_list);
> >  }
> > @@ -1979,6 +1986,17 @@ static void process_srcu(struct work_struct *work)
> >  	srcu_reschedule(ssp, curdelay);
> >  }
> > 
> > +static void srcu_irq_work(struct irq_work *work)
> > +{
> > +	struct srcu_struct *ssp;
> > +	struct srcu_usage *sup;
> > +
> > +	sup = container_of(work, struct srcu_usage, irq_work);
> > +	ssp = sup->srcu_ssp;
> > +
> > +	queue_delayed_work(rcu_gp_wq, &sup->work, !!srcu_get_delay(ssp));
> > +}
> > +
> >  void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags,
> > 			     unsigned long *gp_seq)
> >  {
> > -- 
> > 2.50.1 (Apple Git-155)
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-19 14:34 ` Boqun Feng @ 2026-03-19 16:10 ` Paul E. McKenney 0 siblings, 0 replies; 100+ messages in thread From: Paul E. McKenney @ 2026-03-19 16:10 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu, Kumar Kartikeya Dwivedi, Tejun Heo On Thu, Mar 19, 2026 at 07:34:31AM -0700, Boqun Feng wrote: > On Thu, Mar 19, 2026 at 03:02:57AM -0700, Paul E. McKenney wrote: > > On Wed, Mar 18, 2026 at 06:08:21PM -0700, Boqun Feng wrote: > > > On Wed, Mar 18, 2026 at 04:27:23PM -0700, Boqun Feng wrote: > > > > On Wed, Mar 18, 2026 at 06:52:53PM -0400, Joel Fernandes wrote: > > > > > > > > > > > > > > > On 3/18/2026 6:15 PM, Boqun Feng wrote: > > > > > > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: > > > > > >> On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: > > > > > >> [...] > > > > > >>>> Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a > > > > > >>>> different issue, from the NMI issue? It is more of an issue of calling > > > > > >>>> call_srcu API with scheduler locks held. > > > > > >>>> > > > > > >>>> Something like below I think: > > > > > >>>> > > > > > >>>> CPU A (BPF tracepoint) CPU B (concurrent call_srcu) > > > > > >>>> ---------------------------- ------------------------------------ > > > > > >>>> [1] holds &rq->__lock > > > > > >>>> [2] > > > > > >>>> -> call_srcu > > > > > >>>> -> srcu_gp_start_if_needed > > > > > >>>> -> srcu_funnel_gp_start > > > > > >>>> -> spin_lock_irqsave_ssp_content... > > > > > >>>> -> holds srcu locks > > > > > >>>> > > > > > >>>> [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) 
> > > > > >>>> -> queue_delayed_work > > > > > >>>> -> call_srcu() -> __queue_work() > > > > > >>>> -> srcu_gp_start_if_needed() -> wake_up_worker() > > > > > >>>> -> srcu_funnel_gp_start() -> try_to_wake_up() > > > > > >>>> -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock > > > > > >>>> -> WANTS srcu locks > > > > > >>> > > > > > >>> I see, we can also have a self deadlock even without CPU B, when CPU A > > > > > >>> is going to try_to_wake_up() the a worker on the same CPU. > > > > > >>> > > > > > >>> An interesting observation is that the deadlock can be avoided in > > > > > >>> queue_delayed_work() uses a non-zero delay, that means a timer will be > > > > > >>> armed instead of acquiring the rq lock. > > > > > >>> > > > > > > > > > > > > If my observation is correct, then this can probably fix the deadlock > > > > > > issue with runqueue lock (untested though), but it won't work if BPF > > > > > > tracepoint can happen with timer base lock held. > > > > > > > > > > > > Regards, > > > > > > Boqun > > > > > > > > > > > > ------> > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > > > > index 2328827f8775..a5d67264acb5 100644 > > > > > > --- a/kernel/rcu/srcutree.c > > > > > > +++ b/kernel/rcu/srcutree.c > > > > > > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > > struct srcu_node *snp_leaf; > > > > > > unsigned long snp_seq; > > > > > > struct srcu_usage *sup = ssp->srcu_sup; > > > > > > + bool irqs_were_disabled; > > > > > > > > > > > > /* Ensure that snp node tree is fully initialized before traversing it */ > > > > > > if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER) > > > > > > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > > > > > > > > /* Top of tree, must ensure the grace period will be started. 
*/ > > > > > > raw_spin_lock_irqsave_ssp_contention(ssp, &flags); > > > > > > + irqs_were_disabled = irqs_disabled_flags(flags); > > > > > > if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) { > > > > > > /* > > > > > > * Record need for grace period s. Pair with load > > > > > > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, > > > > > > // it isn't. And it does not have to be. After all, it > > > > > > // can only be executed during early boot when there is only > > > > > > // the one boot CPU running with interrupts still disabled. > > > > > > + // > > > > > > + // If irq was disabled when call_srcu() is called, then we > > > > > > + // could be in the scheduler path with a runqueue lock held, > > > > > > + // delay the process_srcu() work 1 more jiffies so we don't go > > > > > > + // through the kick_pool() -> wake_up_process() path below, and > > > > > > + // we could avoid deadlock with runqueue lock. > > > > > > if (likely(srcu_init_done)) > > > > > > queue_delayed_work(rcu_gp_wq, &sup->work, > > > > > > - !!srcu_get_delay(ssp)); > > > > > > + !!srcu_get_delay(ssp) + > > > > > > + !!irqs_were_disabled); > > > > > Nice, I wonder if it is better to do this in __queue_delayed_work() itself. > > > > > Do we have queue_delayed_work() with zero delays that are in irq-disabled > > > > > regions, and they depend on that zero-delay for correctness? Even with > > > > > delay of 0 though, the work item doesn't execute right away anyway, the > > > > > worker thread has to also be scheduler right? > > > > > > > > > > Also if IRQ is disabled, I'd think this is a critical path that is not > > > > > wanting to run the work item right-away anyway since workqueue is more a > > > > > bottom-half mechanism, than "run this immediately". > > > > > > > > > > IOW, would be good to make the workqueue-layer more resilient to waking up > > > > > the scheduler when a delay would have been totally ok. 
But maybe +Tejun can > > > > > yell if that sounds insane. > > > > > > > > > > > > > I think all of these are probably a good point. However my fix is not > > > > complete :( It's missing the ABBA case in your example (it obviously > > > > could solve the self deadlock if my observation is correct), because we > > > > will still build rcu_node::lock -> runqueue::lock in some conditions, > > > > and BPF contributes the runqueue::lock -> rcu_node::lock dependency. > > > > Hence we still have ABBA deadlock. > > > > > > > > To remove the rcu_node::lock -> runqueue::lock entirely, we need to > > > > always delay 1+ jiffies: > > > > > > > > > > Hmm.. or I can do as the old call_rcu_tasks_trace() does: using an > > > irq_work. I also pushed it at: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux.git/ srcu-fix > > > > > > (based on Paul's fix on spinlock already, but only lightly build test). > > > > > > Regards, > > > Boqun > > > > > > -------------------------->8 > > > Subject: [PATCH] rcu: Use an intermediate irq_work to start process_srcu() > > > > > > Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms > > > of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can > > > happen basically everywhere (including where a scheduler lock is held), > > > call_srcu() now needs to avoid acquiring scheduler lock because > > > otherwise it could cause deadlock [1]. Fix this by following what the > > > previous RCU Tasks Trace did: using an irq_work to delay the queuing of > > > the work to start process_srcu(). 
> > > > > > Fixes: commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") > > > Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] > > > Signed-off-by: Boqun Feng <boqun@kernel.org> > > > --- > > > include/linux/srcutree.h | 1 + > > > kernel/rcu/srcutree.c | 22 ++++++++++++++++++++-- > > > 2 files changed, 21 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > > > index b122c560a59c..fd1a9270cb9a 100644 > > > --- a/include/linux/srcutree.h > > > +++ b/include/linux/srcutree.h > > > @@ -95,6 +95,7 @@ struct srcu_usage { > > > unsigned long reschedule_jiffies; > > > unsigned long reschedule_count; > > > struct delayed_work work; > > > + struct irq_work irq_work; > > > struct srcu_struct *srcu_ssp; > > > }; > > > > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > > index 2328827f8775..57116635e72d 100644 > > > --- a/kernel/rcu/srcutree.c > > > +++ b/kernel/rcu/srcutree.c > > > @@ -19,6 +19,7 @@ > > > #include <linux/mutex.h> > > > #include <linux/percpu.h> > > > #include <linux/preempt.h> > > > +#include <linux/irq_work.h> > > > #include <linux/rcupdate_wait.h> > > > #include <linux/sched.h> > > > #include <linux/smp.h> > > > @@ -75,6 +76,7 @@ static bool __read_mostly srcu_init_done; > > > static void srcu_invoke_callbacks(struct work_struct *work); > > > static void srcu_reschedule(struct srcu_struct *ssp, unsigned long delay); > > > static void process_srcu(struct work_struct *work); > > > +static void srcu_irq_work(struct irq_work *work); > > > static void srcu_delay_timer(struct timer_list *t); > > > > > > /* > > > @@ -216,6 +218,7 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) > > > mutex_init(&ssp->srcu_sup->srcu_barrier_mutex); > > > atomic_set(&ssp->srcu_sup->srcu_barrier_cpu_cnt, 0); > > > INIT_DELAYED_WORK(&ssp->srcu_sup->work, process_srcu); > > > + init_irq_work(&ssp->srcu_sup->irq_work, 
srcu_irq_work);
> > >  	ssp->srcu_sup->sda_is_static = is_static;
> > >  	if (!is_static) {
> > >  		ssp->sda = alloc_percpu(struct srcu_data);
> > > @@ -1118,9 +1121,13 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
> > >  	// it isn't.  And it does not have to be.  After all, it
> > >  	// can only be executed during early boot when there is only
> > >  	// the one boot CPU running with interrupts still disabled.
> > > +	//
> > > +	// Use an irq_work here to avoid acquiring runqueue lock with
> > > +	// srcu rcu_node::lock held. BPF instrument could introduce the
> > > +	// opposite dependency, hence we need to break the possible
> > > +	// locking dependency here.
> > 
> > If I understand the lockdep splat, you need to bail out earlier on,
> > prior to the first lock acquisition.
> 
> I think you're talking about another direction of the dependency ;-)
> 
> Joel's example shows both dependencies clearly:
> 
> CPU A (BPF tracepoint)                          CPU B (concurrent call_srcu)
> ----------------------------                    ------------------------------------
> [1] holds &rq->__lock
>                                                 [2]
>                                                 -> call_srcu
>                                                   -> srcu_gp_start_if_needed
>                                                     -> srcu_funnel_gp_start
>                                                       -> spin_lock_irqsave_ssp_content...
>                                                       -> holds srcu locks
> 
> [4] calls call_rcu_tasks_trace()                [5] srcu_funnel_gp_start (cont..)
>                                                   -> queue_delayed_work
>     -> call_srcu()                                  -> __queue_work()
>       -> srcu_gp_start_if_needed()                    -> wake_up_worker()
>         -> srcu_funnel_gp_start()                       -> try_to_wake_up()
>           -> spin_lock_irqsave_ssp_contention()       [6] WANTS rq->__lock
>             -> WANTS srcu locks
> 
> To remove [1] -> [4], i.e. the dependency "&rq->__lock" -> "srcu locks",
> yes, you need to bail out earlier on. But to remove [2] -> [6], i.e. the
> dependency "srcu locks" -> "&rq->__lock", you just need to make sure
> call_srcu() won't call anything that needs to acquire the rq lock.
> 
> In the old version of call_rcu_tasks_trace(), we were also fine because
> [2] -> [6] didn't exist ([6] is the rcu node lock in this case). [1] -> [4]
> ([4] being the rcu node lock) existed in the RCU Tasks Trace version:
> 
> // Enqueue a callback for the specified flavor of Tasks RCU.
> static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
> 				   struct rcu_tasks *rtp)
> {
> 	...
> 	if (!raw_spin_trylock_rcu_node(rtpcp)) { // irqs already disabled.
> 		raw_spin_lock_rcu_node(rtpcp); // irqs already disabled.
> 		...
> 	}
> 	...
> 	rcu_segcblist_enqueue(&rtpcp->cblist, rhp);
> 	raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
> 	...
> 	/* We can't create the thread unless interrupts are enabled. */
> 	if (needwake && READ_ONCE(rtp->kthread_ptr))
> 		irq_work_queue(&rtpcp->rtp_irq_work);
> }
> 
> Hope it helps.

Got it, thank you!

							Thanx, Paul

> Regards,
> Boqun
> 
> > 							Thanx, Paul
> > 
> > >  	if (likely(srcu_init_done))
> > > -		queue_delayed_work(rcu_gp_wq, &sup->work,
> > > -				   !!srcu_get_delay(ssp));
> > > +		irq_work_queue(&sup->irq_work);
> > >  	else if (list_empty(&sup->work.work.entry))
> > >  		list_add(&sup->work.work.entry, &srcu_boot_list);
> > >  }
> > > @@ -1979,6 +1986,17 @@ static void process_srcu(struct work_struct *work)
> > >  	srcu_reschedule(ssp, curdelay);
> > >  }
> > > 
> > > +static void srcu_irq_work(struct irq_work *work)
> > > +{
> > > +	struct srcu_struct *ssp;
> > > +	struct srcu_usage *sup;
> > > +
> > > +	sup = container_of(work, struct srcu_usage, irq_work);
> > > +	ssp = sup->srcu_ssp;
> > > +
> > > +	queue_delayed_work(rcu_gp_wq, &sup->work, !!srcu_get_delay(ssp));
> > > +}
> > > +
> > >  void srcutorture_get_gp_data(struct srcu_struct *ssp, int *flags,
> > > 			     unsigned long *gp_seq)
> > >  {
> > > -- 
> > > 2.50.1 (Apple Git-155)
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT 2026-03-18 22:15 ` Boqun Feng 2026-03-18 22:52 ` Joel Fernandes @ 2026-03-18 23:56 ` Kumar Kartikeya Dwivedi 2026-03-19 0:26 ` Zqiang 1 sibling, 1 reply; 100+ messages in thread From: Kumar Kartikeya Dwivedi @ 2026-03-18 23:56 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, paulmck, Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki, boqun.feng, rcu On Wed, 18 Mar 2026 at 23:15, Boqun Feng <boqun@kernel.org> wrote: > > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote: > > On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote: > > [...] > > > > Ah so it is an ABBA deadlock, not a ABA self-deadlock. I guess this is a > > > > different issue, from the NMI issue? It is more of an issue of calling > > > > call_srcu API with scheduler locks held. > > > > > > > > Something like below I think: > > > > > > > > CPU A (BPF tracepoint) CPU B (concurrent call_srcu) > > > > ---------------------------- ------------------------------------ > > > > [1] holds &rq->__lock > > > > [2] > > > > -> call_srcu > > > > -> srcu_gp_start_if_needed > > > > -> srcu_funnel_gp_start > > > > -> spin_lock_irqsave_ssp_content... > > > > -> holds srcu locks > > > > > > > > [4] calls call_rcu_tasks_trace() [5] srcu_funnel_gp_start (cont..) > > > > -> queue_delayed_work > > > > -> call_srcu() -> __queue_work() > > > > -> srcu_gp_start_if_needed() -> wake_up_worker() > > > > -> srcu_funnel_gp_start() -> try_to_wake_up() > > > > -> spin_lock_irqsave_ssp_contention() [6] WANTS rq->__lock > > > > -> WANTS srcu locks > > > > > > I see, we can also have a self deadlock even without CPU B, when CPU A > > > is going to try_to_wake_up() the a worker on the same CPU. > > > > > > An interesting observation is that the deadlock can be avoided in > > > queue_delayed_work() uses a non-zero delay, that means a timer will be > > > armed instead of acquiring the rq lock. 
> If my observation is correct, then this can probably fix the deadlock
> issue with runqueue lock (untested though), but it won't work if BPF
> tracepoint can happen with timer base lock held.

Unfortunately it can be, there is at least one tracepoint that is
invoked with the hrtimer base lock held. Alexei ended up fixing this in
the recent past [0]. So I think this would cause trouble too.

hrtimer_start_range_ns()
  -> __hrtimer_start_range_ns()
    -> remove_timer()
      -> __remove_hrtimer()
        -> debug_deactivate()
          -> trace_hrtimer_cancel()

BPF can attach to such a tracepoint.

[0]: https://lore.kernel.org/bpf/20260204040834.22263-2-alexei.starovoitov@gmail.com

> 
> Regards,
> Boqun
> 
> ------>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 2328827f8775..a5d67264acb5 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>  	struct srcu_node *snp_leaf;
>  	unsigned long snp_seq;
>  	struct srcu_usage *sup = ssp->srcu_sup;
> +	bool irqs_were_disabled;
> 
>  	/* Ensure that snp node tree is fully initialized before traversing it */
>  	if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
> @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
> 
>  	/* Top of tree, must ensure the grace period will be started. */
>  	raw_spin_lock_irqsave_ssp_contention(ssp, &flags);
> +	irqs_were_disabled = irqs_disabled_flags(flags);
>  	if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) {
>  		/*
>  		 * Record need for grace period s.  Pair with load
> @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>  	// it isn't.  And it does not have to be.  After all, it
>  	// can only be executed during early boot when there is only
>  	// the one boot CPU running with interrupts still disabled.
> +	//
> +	// If irq was disabled when call_srcu() is called, then we
> +	// could be in the scheduler path with a runqueue lock held,
> +	// delay the process_srcu() work 1 more jiffies so we don't go
> +	// through the kick_pool() -> wake_up_process() path below, and
> +	// we could avoid deadlock with runqueue lock.
>  	if (likely(srcu_init_done))
>  		queue_delayed_work(rcu_gp_wq, &sup->work,
> -				   !!srcu_get_delay(ssp));
> +				   !!srcu_get_delay(ssp) +
> +				   !!irqs_were_disabled);
>  	else if (list_empty(&sup->work.work.entry))
>  		list_add(&sup->work.work.entry, &srcu_boot_list);
>  }
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-18 23:56 ` Kumar Kartikeya Dwivedi
@ 2026-03-19  0:26   ` Zqiang
  2026-03-19  1:13     ` Boqun Feng
  0 siblings, 1 reply; 100+ messages in thread
From: Zqiang @ 2026-03-19 0:26 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Boqun Feng
Cc: Joel Fernandes, paulmck, Sebastian Andrzej Siewior, frederic,
	neeraj.iitr10, urezki, boqun.feng, rcu

> On Wed, 18 Mar 2026 at 23:15, Boqun Feng <boqun@kernel.org> wrote:
> >
> > On Wed, Mar 18, 2026 at 02:55:48PM -0700, Boqun Feng wrote:
> > > On Wed, Mar 18, 2026 at 02:52:48PM -0700, Boqun Feng wrote:
> > > [...]
> > > > Ah so it is an ABBA deadlock, not an ABA self-deadlock. I guess this is a
> > > > different issue from the NMI issue? It is more an issue of calling the
> > > > call_srcu() API with scheduler locks held.
> > > >
> > > > Something like below I think:
> > > >
> > > > CPU A (BPF tracepoint)                        CPU B (concurrent call_srcu)
> > > > ----------------------------                  ------------------------------------
> > > > [1] holds &rq->__lock
> > > >                                               [2]
> > > >                                               -> call_srcu
> > > >                                                  -> srcu_gp_start_if_needed
> > > >                                                     -> srcu_funnel_gp_start
> > > >                                                        -> spin_lock_irqsave_ssp_content...
> > > >                                                           -> holds srcu locks
> > > >
> > > > [4] calls call_rcu_tasks_trace()              [5] srcu_funnel_gp_start (cont..)
> > > > -> call_srcu()                                   -> queue_delayed_work
> > > >    -> srcu_gp_start_if_needed()                     -> __queue_work()
> > > >       -> srcu_funnel_gp_start()                        -> wake_up_worker()
> > > >          -> spin_lock_irqsave_ssp_contention()            -> try_to_wake_up()
> > > >             -> WANTS srcu locks                     [6] WANTS rq->__lock
> > >
> > > I see, we can also have a self-deadlock even without CPU B, when CPU A
> > > is going to try_to_wake_up() a worker on the same CPU.
> > >
> > > An interesting observation is that the deadlock can be avoided if
> > > queue_delayed_work() uses a non-zero delay, which means a timer will be
> > > armed instead of acquiring the rq lock.
> >
> > If my observation is correct, then this can probably fix the deadlock
> > issue with the runqueue lock (untested though), but it won't work if a BPF
> > tracepoint can happen with the timer base lock held.
>
> Unfortunately it can, there is at least one tracepoint that is
> invoked with the hrtimer base lock held.
> Alexei ended up fixing this in the recent past [0]. So I think this
> would cause trouble too.
>
> hrtimer_start_range_ns() -> __hrtimer_start_range_ns() ->
> remove_timer() -> __remove_hrtimer() -> debug_deactivate() ->
> trace_hrtimer_cancel().
> BPF can attach to such a tracepoint.

Is it possible to use irq_work_queue() to trigger queue_delayed_work()
by checking whether ssp == &rcu_tasks_trace_srcu_struct?

Thanks
Zqiang

> [0]: https://lore.kernel.org/bpf/20260204040834.22263-2-alexei.starovoitov@gmail.com
>
> > Regards,
> > Boqun
> >
> > ------>
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 2328827f8775..a5d67264acb5 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1061,6 +1061,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
> > 	struct srcu_node *snp_leaf;
> > 	unsigned long snp_seq;
> > 	struct srcu_usage *sup = ssp->srcu_sup;
> > +	bool irqs_were_disabled;
> >
> > 	/* Ensure that snp node tree is fully initialized before traversing it */
> > 	if (smp_load_acquire(&sup->srcu_size_state) < SRCU_SIZE_WAIT_BARRIER)
> > @@ -1098,6 +1099,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
> >
> > 	/* Top of tree, must ensure the grace period will be started. */
> > 	raw_spin_lock_irqsave_ssp_contention(ssp, &flags);
> > +	irqs_were_disabled = irqs_disabled_flags(flags);
> > 	if (ULONG_CMP_LT(sup->srcu_gp_seq_needed, s)) {
> > 		/*
> > 		 * Record need for grace period s. Pair with load
> > @@ -1118,9 +1120,16 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
> > 	// it isn't. And it does not have to be. After all, it
> > 	// can only be executed during early boot when there is only
> > 	// the one boot CPU running with interrupts still disabled.
> > +	//
> > +	// If irqs were disabled when call_srcu() was called, then we
> > +	// could be in the scheduler path with a runqueue lock held.
> > +	// Delay the process_srcu() work by one more jiffy so that we
> > +	// do not go through the kick_pool() -> wake_up_process() path
> > +	// below, thus avoiding deadlock on the runqueue lock.
> > 	if (likely(srcu_init_done))
> > 		queue_delayed_work(rcu_gp_wq, &sup->work,
> > -				   !!srcu_get_delay(ssp));
> > +				   !!srcu_get_delay(ssp) +
> > +				   !!irqs_were_disabled);
> > 	else if (list_empty(&sup->work.work.entry))
> > 		list_add(&sup->work.work.entry, &srcu_boot_list);
> > }
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19  0:26   ` Zqiang
@ 2026-03-19  1:13     ` Boqun Feng
  2026-03-19  2:47       ` Joel Fernandes
  0 siblings, 1 reply; 100+ messages in thread
From: Boqun Feng @ 2026-03-19 1:13 UTC (permalink / raw)
To: Zqiang
Cc: Kumar Kartikeya Dwivedi, Joel Fernandes, paulmck,
	Sebastian Andrzej Siewior, frederic, neeraj.iitr10, urezki,
	boqun.feng, rcu

On Thu, Mar 19, 2026 at 12:26:38AM +0000, Zqiang wrote:
> [...]
> Is it possible to use irq_work_queue() to trigger queue_delayed_work()
> by checking whether ssp == &rcu_tasks_trace_srcu_struct?
>

Good call! I didn't do the exact == check, but close:

	https://lore.kernel.org/rcu/abtMhd_LVp3uL_pA@tardis.local/

;-)

Regards,
Boqun

> Thanks
> Zqiang
> [...]
* Re: Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT
  2026-03-19  1:13     ` Boqun Feng
@ 2026-03-19  2:47       ` Joel Fernandes
  0 siblings, 0 replies; 100+ messages in thread
From: Joel Fernandes @ 2026-03-19 2:47 UTC (permalink / raw)
To: Boqun Feng
Cc: Zqiang, Kartikeya Dwivedi Kumar, paulmck@kernel.org,
	Sebastian Andrzej Siewior, frederic@kernel.org,
	neeraj.iitr10@gmail.com, urezki@gmail.com, boqun.feng@gmail.com,
	rcu@vger.kernel.org

> On Mar 18, 2026, at 9:13 PM, Boqun Feng <boqun@kernel.org> wrote:
>
> On Thu, Mar 19, 2026 at 12:26:38AM +0000, Zqiang wrote:
>> [...]
>> Is it possible to use irq_work_queue() to trigger queue_delayed_work()
>> by checking whether ssp == &rcu_tasks_trace_srcu_struct?
>>
>
> Good call! I didn't do the exact == check, but close:
>
>	https://lore.kernel.org/rcu/abtMhd_LVp3uL_pA@tardis.local/
>
> ;-)

IMO I am OK with Boqun's patch, but I do not think the == check should be
used. SRCU internals should not special-case specific SRCU users, that
is a hack :)

Thanks!

> [...]
end of thread, other threads:[~2026-03-24 19:23 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-17 13:34 Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Paul E. McKenney
2026-03-18 10:50 ` Sebastian Andrzej Siewior
2026-03-18 11:49 ` Paul E. McKenney
2026-03-18 14:43 ` Sebastian Andrzej Siewior
2026-03-18 15:43 ` Paul E. McKenney
2026-03-18 16:04 ` Sebastian Andrzej Siewior
2026-03-18 16:32 ` Paul E. McKenney
2026-03-18 16:42 ` Boqun Feng
2026-03-18 18:45 ` Paul E. McKenney
2026-03-18 16:47 ` Sebastian Andrzej Siewior
2026-03-18 18:48 ` Paul E. McKenney
2026-03-19  8:55 ` Sebastian Andrzej Siewior
2026-03-19 10:05 ` Paul E. McKenney
2026-03-19 10:43 ` Paul E. McKenney
2026-03-19 10:51 ` Sebastian Andrzej Siewior
2026-03-18 15:51 ` Boqun Feng
2026-03-18 18:42 ` Paul E. McKenney
2026-03-18 20:04 ` Joel Fernandes
2026-03-18 20:11 ` Kumar Kartikeya Dwivedi
2026-03-18 20:25 ` Joel Fernandes
2026-03-18 21:52 ` Boqun Feng
2026-03-18 21:55 ` Boqun Feng
2026-03-18 22:15 ` Boqun Feng
2026-03-18 22:52 ` Joel Fernandes
2026-03-18 23:27 ` Boqun Feng
2026-03-19  1:08 ` Boqun Feng
2026-03-19  9:03 ` Sebastian Andrzej Siewior
2026-03-19 16:27 ` Boqun Feng
2026-03-19 16:33 ` Sebastian Andrzej Siewior
2026-03-19 16:48 ` Boqun Feng
2026-03-19 16:59 ` Kumar Kartikeya Dwivedi
2026-03-19 17:27 ` Boqun Feng
2026-03-19 18:41 ` Kumar Kartikeya Dwivedi
2026-03-19 20:14 ` Boqun Feng
2026-03-19 20:21 ` Joel Fernandes
2026-03-19 20:39 ` Boqun Feng
2026-03-20 15:34 ` Paul E. McKenney
2026-03-20 15:59 ` Boqun Feng
2026-03-20 16:24 ` Paul E. McKenney
2026-03-20 16:57 ` Boqun Feng
2026-03-20 17:54 ` Joel Fernandes
2026-03-20 18:14 ` [PATCH] rcu: Use an intermediate irq_work to start process_srcu() Boqun Feng
2026-03-20 19:18 ` Joel Fernandes
2026-03-20 20:47 ` Andrea Righi
2026-03-20 20:54 ` Boqun Feng
2026-03-20 21:00 ` Andrea Righi
2026-03-20 21:02 ` Andrea Righi
2026-03-20 21:06 ` Boqun Feng
2026-03-20 22:29 ` [PATCH v2] " Boqun Feng
2026-03-23 21:09 ` Joel Fernandes
2026-03-23 22:18 ` Boqun Feng
2026-03-23 22:50 ` Joel Fernandes
2026-03-24 11:27 ` Frederic Weisbecker
2026-03-24 14:56 ` Joel Fernandes
2026-03-24 14:56 ` Alexei Starovoitov
2026-03-24 17:36 ` Boqun Feng
2026-03-24 18:40 ` Joel Fernandes
2026-03-24 19:23 ` Paul E. McKenney
2026-03-21  4:27 ` [PATCH] " Zqiang
2026-03-21 18:15 ` Boqun Feng
2026-03-21 10:10 ` Paul E. McKenney
2026-03-21 17:15 ` Boqun Feng
2026-03-21 17:41 ` Paul E. McKenney
2026-03-21 18:06 ` Boqun Feng
2026-03-21 19:31 ` Paul E. McKenney
2026-03-21 19:45 ` Boqun Feng
2026-03-21 20:07 ` Paul E. McKenney
2026-03-21 20:08 ` Boqun Feng
2026-03-22 10:09 ` Paul E. McKenney
2026-03-22 16:16 ` Boqun Feng
2026-03-22 17:09 ` Paul E. McKenney
2026-03-22 17:31 ` Boqun Feng
2026-03-22 17:44 ` Paul E. McKenney
2026-03-22 18:17 ` Boqun Feng
2026-03-22 19:47 ` Paul E. McKenney
2026-03-22 20:26 ` Boqun Feng
2026-03-23  7:50 ` Paul E. McKenney
2026-03-20 18:20 ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng
2026-03-20 23:11 ` Paul E. McKenney
2026-03-21  3:29 ` Paul E. McKenney
2026-03-21 17:03 ` [RFC PATCH] rcu-tasks: Avoid using mod_timer() in call_rcu_tasks_generic() Boqun Feng
2026-03-23 15:17 ` Boqun Feng
2026-03-23 20:37 ` Joel Fernandes
2026-03-23 21:50 ` Kumar Kartikeya Dwivedi
2026-03-23 22:13 ` Boqun Feng
2026-03-20 16:15 ` Next-level bug in SRCU implementation of RCU Tasks Trace + PREEMPT_RT Boqun Feng
2026-03-20 16:24 ` Paul E. McKenney
2026-03-19 17:02 ` Sebastian Andrzej Siewior
2026-03-19 17:44 ` Boqun Feng
2026-03-19 18:42 ` Joel Fernandes
2026-03-19 20:20 ` Boqun Feng
2026-03-19 20:26 ` Joel Fernandes
2026-03-19 20:45 ` Joel Fernandes
2026-03-19 10:02 ` Paul E. McKenney
2026-03-19 14:34 ` Boqun Feng
2026-03-19 16:10 ` Paul E. McKenney
2026-03-18 23:56 ` Kumar Kartikeya Dwivedi
2026-03-19  0:26 ` Zqiang
2026-03-19  1:13 ` Boqun Feng
2026-03-19  2:47 ` Joel Fernandes