* [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug
@ 2026-05-23 4:27 Chuyi Zhou
2026-05-23 6:07 ` Muchun Song
2026-06-03 13:21 ` Chuyi Zhou
0 siblings, 2 replies; 5+ messages in thread
From: Chuyi Zhou @ 2026-05-23 4:27 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bigeasy,
clrkwllms
Cc: linux-kernel, Chuyi Zhou
Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
CSD-lock diagnostics") made smp_call_function_single() use the destination
CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
associate a stuck CSD lock with the target CPU, but it also means the CPU
argument is used in per_cpu_ptr() before generic_exec_single() has a chance
to validate it.
This becomes unsafe when smp_call_function_any() cannot find an online CPU
in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
offset array with an invalid CPU number.
Use the destination CPU's csd_data only when the CPU number is within
nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
and let generic_exec_single() perform the existing validation and return
-ENXIO.
Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
kernel/smp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index a0bb56bd8dda..dc6582bb35d0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -380,7 +380,8 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
static call_single_data_t *get_single_csd_data(int cpu)
{
- if (static_branch_unlikely(&csdlock_debug_enabled))
+ if (static_branch_unlikely(&csdlock_debug_enabled) &&
+ (unsigned int)cpu < nr_cpu_ids)
return per_cpu_ptr(&csd_data, cpu);
return this_cpu_ptr(&csd_data);
}
--
2.20.1
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug 2026-05-23 4:27 [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug Chuyi Zhou @ 2026-05-23 6:07 ` Muchun Song 2026-06-03 13:21 ` Chuyi Zhou 1 sibling, 0 replies; 5+ messages in thread From: Muchun Song @ 2026-05-23 6:07 UTC (permalink / raw) To: Chuyi Zhou Cc: tglx, mingo, luto, peterz, paulmck, bigeasy, clrkwllms, linux-kernel > On May 23, 2026, at 12:27, Chuyi Zhou <zhouchuyi@bytedance.com> wrote: > > Commit b0473dcd4b1d ("smp: Improve smp_call_function_single() > CSD-lock diagnostics") made smp_call_function_single() use the destination > CPU's csd_data when CSD lock debugging is enabled. That lets the debug code > associate a stuck CSD lock with the target CPU, but it also means the CPU > argument is used in per_cpu_ptr() before generic_exec_single() has a chance > to validate it. > > This becomes unsafe when smp_call_function_any() cannot find an online CPU > in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and > the !wait path calls get_single_csd_data(cpu) before generic_exec_single() > returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU > offset array with an invalid CPU number. > > Use the destination CPU's csd_data only when the CPU number is within > nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data > and let generic_exec_single() perform the existing validation and return > -ENXIO. > > Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics") > Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Muchun Song <muchun.song@linux.dev> Thanks. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug 2026-05-23 4:27 [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug Chuyi Zhou 2026-05-23 6:07 ` Muchun Song @ 2026-06-03 13:21 ` Chuyi Zhou 2026-06-04 17:46 ` Paul E. McKenney 1 sibling, 1 reply; 5+ messages in thread From: Chuyi Zhou @ 2026-06-03 13:21 UTC (permalink / raw) To: paulmck Cc: linux-kernel, peterz, mingo, luto, tglx, clrkwllms, bigeasy, muchun.song Hi Paul, On 2026-05-23 12:27 p.m., Chuyi Zhou wrote: > Commit b0473dcd4b1d ("smp: Improve smp_call_function_single() > CSD-lock diagnostics") made smp_call_function_single() use the destination > CPU's csd_data when CSD lock debugging is enabled. That lets the debug code > associate a stuck CSD lock with the target CPU, but it also means the CPU > argument is used in per_cpu_ptr() before generic_exec_single() has a chance > to validate it. > > This becomes unsafe when smp_call_function_any() cannot find an online CPU > in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and > the !wait path calls get_single_csd_data(cpu) before generic_exec_single() > returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU > offset array with an invalid CPU number. > > Use the destination CPU's csd_data only when the CPU number is within > nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data > and let generic_exec_single() perform the existing validation and return > -ENXIO. > > Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics") > Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> I hit a soft-lockup panic while running scftorture on a large x86_64 SMP machine with a kernel based on origin/sched/core. The issue appears to be a regression from: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics") The reproduction environment was: base: origin/sched/core machine: large x86_64 SMP system; CONFIG_CSD_LOCK_WAIT_DEBUG=y The workload was: modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100 The visible panic was a secondary soft lockup. CPU 78 was exiting a process and got stuck waiting for CPU 81 to run a synchronous TLB shootdown callback: watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732] Kernel panic - not syncing: softlockup: hung tasks smp: csd: Detected non-responsive CSD lock (#118) on CPU#78, waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...) smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81 CPU 78's stack was: csd_lock_wait_toolong smp_call_function_many_cond on_each_cpu_cond_mask flush_tlb_mm_range tlb_finish_mmu exit_mmap __mmput do_exit do_group_exit __x64_sys_exit_group CPU 81 was running a scftorture invoker and was already inside the call-function-single interrupt path: CPU: 81 PID: 13823 Comm: scftorture_invo RIP: __flush_smp_call_function_queue+0x70/0x4d0 __sysvec_call_function_single sysvec_call_function_single asm_sysvec_call_function_single ktime_get_mono_fast_ns csd_lock_wait_toolong smp_call_function_single scftorture_invoker The dump showed CPU 81 scanning a corrupted async scftorture CSD: ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001 func = scf_handler_1 [scftorture] Decoding that CSD: u_flags = 0x1 -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC src = 112 dst = 81 The first word is the llist next pointer, and it points back to the CSD itself: csd->node.llist.next == csd That self-loop explains why CPU 81 made no progress in __flush_smp_call_function_queue(). There was later work still pending on CPU 81's live call_single_queue: call_single_queue[81].first = ff157dfa7fe79c00 That queued entry was CPU 78's synchronous TLB callback: u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC src = 78 dst = 81 func = flush_tlb_func So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck in an earlier invocation of __flush_smp_call_function_queue(). The root cause seems to be that b0473dcd4b1d changes the !wait path of smp_call_function_single() to use the destination CPU's csd_data when csdlock_debug_enabled is set: if (static_branch_unlikely(&csdlock_debug_enabled)) return per_cpu_ptr(&csd_data, cpu); return this_cpu_ptr(&csd_data); The important invariant is that the internal smp-call CSDs are normally single-writer objects. With the old this_cpu_ptr(&csd_data) allocation, smp_call_function_single(..., wait = 0) used the current CPU's csd_data, so each sending CPU had its own CSD. smp_call_function() goes through smp_call_function_many_cond(), and that path also starts from the current CPU's cfd_data: cfd = this_cpu_ptr(&cfd_data); csd = per_cpu_ptr(cfd->csd, cpu); get_cpu() keeps preemption disabled while the CSD is prepared and queued, so another task on the same CPU cannot concurrently reuse that sender's cfd/csd storage either. This is why the non-atomic csd_lock() worked before: the target CPU could still observe and unlock the CSD from the previous request, but there was only one CPU that could prepare and enqueue that particular CSD object. With the debug-mode destination-CPU allocation, all CPUs sending async single calls to CPU 81 contend for the same object: csd_data[81] == ff157dfa7fe72e40 However, csd_lock() is not an atomic test-and-set lock: csd_lock_wait(csd); csd->node.u_flags |= CSD_FLAG_LOCK; That is safe under the previous single-writer rule, but it is not safe once many remote CPUs can acquire the same destination CPU CSD. Two senders can both observe the lock clear, both set CSD_FLAG_LOCK, overwrite func/info/src/dst and then both enqueue the same llist node: llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81)) If the same node is added again while it is already the queue head, llist_add() writes the old head into new->next. Since old head == new, this produces: csd->node.llist.next = csd That exactly matches the dump. I see two possible fixes: 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data allocation and therefore restores the single-writer property, at the cost of losing the improved destination-CPU CSD lock diagnostics. 2. Keep the destination-CPU csd_data behavior for CSD lock debugging, but make csd_lock() an atomic test-and-set lock when csdlock_debug_enabled is set. One possible implementation is: kernel/smp.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/kernel/smp.c b/kernel/smp.c index 15799f842746..e1d2fd3c9d9f 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -362,6 +362,34 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd) } #endif +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG +static __always_inline void csd_lock(call_single_data_t *csd) +{ + if (static_branch_unlikely(&csdlock_debug_enabled)) { + unsigned int flags; + + for (;;) { + __csd_lock_wait(csd); + flags = READ_ONCE(csd->node.u_flags); + if (!(flags & CSD_FLAG_LOCK) && + try_cmpxchg_acquire(&csd->node.u_flags, &flags, + flags | CSD_FLAG_LOCK)) + break; + cpu_relax(); + } + } else { + csd_lock_wait(csd); + csd->node.u_flags |= CSD_FLAG_LOCK; + } + + /* + * prevent CPU from reordering the above assignment + * to ->flags with any subsequent assignments to other + * fields of the specified call_single_data_t structure: + */ + smp_wmb(); +} +#else static __always_inline void csd_lock(call_single_data_t *csd) { csd_lock_wait(csd); @@ -374,6 +402,7 @@ static __always_inline void csd_lock(call_single_data_t *csd) */ smp_wmb(); } +#endif static __always_inline void csd_unlock(call_single_data_t *csd) { -- ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug 2026-06-03 13:21 ` Chuyi Zhou @ 2026-06-04 17:46 ` Paul E. McKenney 2026-06-05 0:25 ` Chuyi Zhou 0 siblings, 1 reply; 5+ messages in thread From: Paul E. McKenney @ 2026-06-04 17:46 UTC (permalink / raw) To: Chuyi Zhou Cc: linux-kernel, peterz, mingo, luto, tglx, clrkwllms, bigeasy, muchun.song On Wed, Jun 03, 2026 at 09:21:10PM +0800, Chuyi Zhou wrote: > Hi Paul, > > On 2026-05-23 12:27 p.m., Chuyi Zhou wrote: > > Commit b0473dcd4b1d ("smp: Improve smp_call_function_single() > > CSD-lock diagnostics") made smp_call_function_single() use the destination > > CPU's csd_data when CSD lock debugging is enabled. That lets the debug code > > associate a stuck CSD lock with the target CPU, but it also means the CPU > > argument is used in per_cpu_ptr() before generic_exec_single() has a chance > > to validate it. > > > > This becomes unsafe when smp_call_function_any() cannot find an online CPU > > in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and > > the !wait path calls get_single_csd_data(cpu) before generic_exec_single() > > returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU > > offset array with an invalid CPU number. > > > > Use the destination CPU's csd_data only when the CPU number is within > > nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data > > and let generic_exec_single() perform the existing validation and return > > -ENXIO. > > > > Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics") > > Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> > > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > > I hit a soft-lockup panic while running scftorture on a large x86_64 SMP > machine with a kernel based on origin/sched/core. The issue appears to > be a regression from: > > b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock > diagnostics") > > The reproduction environment was: > > base: origin/sched/core > machine: large x86_64 SMP system; > CONFIG_CSD_LOCK_WAIT_DEBUG=y > > The workload was: > > modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100 > > The visible panic was a secondary soft lockup. CPU 78 was exiting a > process and got stuck waiting for CPU 81 to run a synchronous TLB > shootdown callback: > > watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732] > Kernel panic - not syncing: softlockup: hung tasks > > smp: csd: Detected non-responsive CSD lock (#118) on CPU#78, > waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...) > smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81 > > CPU 78's stack was: > > csd_lock_wait_toolong > smp_call_function_many_cond > on_each_cpu_cond_mask > flush_tlb_mm_range > tlb_finish_mmu > exit_mmap > __mmput > do_exit > do_group_exit > __x64_sys_exit_group > > CPU 81 was running a scftorture invoker and was already inside the > call-function-single interrupt path: > > CPU: 81 PID: 13823 Comm: scftorture_invo > RIP: __flush_smp_call_function_queue+0x70/0x4d0 > __sysvec_call_function_single > sysvec_call_function_single > asm_sysvec_call_function_single > ktime_get_mono_fast_ns > csd_lock_wait_toolong > smp_call_function_single > scftorture_invoker > > The dump showed CPU 81 scanning a corrupted async scftorture CSD: > > ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001 > func = scf_handler_1 [scftorture] > > Decoding that CSD: > > u_flags = 0x1 -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC > src = 112 > dst = 81 > > The first word is the llist next pointer, and it points back to the CSD > itself: > > csd->node.llist.next == csd > > That self-loop explains why CPU 81 made no progress in > __flush_smp_call_function_queue(). > > There was later work still pending on CPU 81's live call_single_queue: > > call_single_queue[81].first = ff157dfa7fe79c00 > > That queued entry was CPU 78's synchronous TLB callback: > > u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC > src = 78 > dst = 81 > func = flush_tlb_func > > So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck > in an earlier invocation of __flush_smp_call_function_queue(). > > The root cause seems to be that b0473dcd4b1d changes the !wait path of > smp_call_function_single() to use the destination CPU's csd_data when > csdlock_debug_enabled is set: > > if (static_branch_unlikely(&csdlock_debug_enabled)) > return per_cpu_ptr(&csd_data, cpu); > return this_cpu_ptr(&csd_data); > > The important invariant is that the internal smp-call CSDs are normally > single-writer objects. With the old this_cpu_ptr(&csd_data) allocation, > smp_call_function_single(..., wait = 0) used the current CPU's csd_data, > so each sending CPU had its own CSD. smp_call_function() goes through > smp_call_function_many_cond(), and that path also starts from the > current CPU's cfd_data: > > cfd = this_cpu_ptr(&cfd_data); > csd = per_cpu_ptr(cfd->csd, cpu); > > get_cpu() keeps preemption disabled while the CSD is prepared and > queued, so another task on the same CPU cannot concurrently reuse that > sender's cfd/csd storage either. > > This is why the non-atomic csd_lock() worked before: the target CPU > could still observe and unlock the CSD from the previous request, but > there was only one CPU that could prepare and enqueue that particular > CSD object. With the debug-mode destination-CPU allocation, all CPUs > sending async single calls to CPU 81 contend for the same object: > > csd_data[81] == ff157dfa7fe72e40 > > However, csd_lock() is not an atomic test-and-set lock: > > csd_lock_wait(csd); > csd->node.u_flags |= CSD_FLAG_LOCK; > > That is safe under the previous single-writer rule, but it is not safe > once many remote CPUs can acquire the same destination CPU CSD. Two > senders can both observe the lock clear, both set CSD_FLAG_LOCK, > overwrite func/info/src/dst and then both enqueue the same llist node: > > llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81)) > > If the same node is added again while it is already the queue head, > llist_add() writes the old head into new->next. Since old head == new, > this produces: > > csd->node.llist.next = csd > > That exactly matches the dump. > > I see two possible fixes: > > 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data > allocation and therefore restores the single-writer property, at the > cost of losing the improved destination-CPU CSD lock diagnostics. > > 2. Keep the destination-CPU csd_data behavior for CSD lock debugging, > but make csd_lock() an atomic test-and-set lock when > csdlock_debug_enabled is set. Thank you for the analysis and patch, and apologies for my confusion with b0473dcd4b1d! It looks plausible, but unfortunately, I was not able to find a version to which this patch applies, even after fixing the whitespace breakage noted below. Which might be another broken line or other whitespace error that my eyes missed. But even with --ignore-whitespace, it still does not apply. Please also see below for a way to shrink this a bit. If you update the patch and tell me what commit to apply it to, I will continue testing and review. > One possible implementation is: > > kernel/smp.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/kernel/smp.c b/kernel/smp.c > index 15799f842746..e1d2fd3c9d9f 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -362,6 +362,34 @@ static __always_inline void > csd_lock_wait(call_single_data_t *csd) The above two lines need to be one line. > } > #endif > > +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG > +static __always_inline void csd_lock(call_single_data_t *csd) > +{ > + if (static_branch_unlikely(&csdlock_debug_enabled)) { If you make the above line instead read as follows: if (IS_ENABLED(CONFIG_CSD_LOCK_WAIT_DEBUG) && static_branch_unlikely(&csdlock_debug_enabled)) { Then you can get rid of the #ifdef and the other version of the csd_lock() function. > + unsigned int flags; > + > + for (;;) { > + __csd_lock_wait(csd); > + flags = READ_ONCE(csd->node.u_flags); > + if (!(flags & CSD_FLAG_LOCK) && > + try_cmpxchg_acquire(&csd->node.u_flags, &flags, > + flags | CSD_FLAG_LOCK)) > + break; > + cpu_relax(); > + } > + } else { > + csd_lock_wait(csd); > + csd->node.u_flags |= CSD_FLAG_LOCK; > + } > + > + /* > + * prevent CPU from reordering the above assignment > + * to ->flags with any subsequent assignments to other > + * fields of the specified call_single_data_t structure: > + */ > + smp_wmb(); > +} > +#else > static __always_inline void csd_lock(call_single_data_t *csd) > { > csd_lock_wait(csd); > @@ -374,6 +402,7 @@ static __always_inline void > csd_lock(call_single_data_t *csd) And again, the above two lines need to be one line. Thanx, Paul > */ > smp_wmb(); > } > +#endif > > static __always_inline void csd_unlock(call_single_data_t *csd) > { > -- ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug 2026-06-04 17:46 ` Paul E. McKenney @ 2026-06-05 0:25 ` Chuyi Zhou 0 siblings, 0 replies; 5+ messages in thread From: Chuyi Zhou @ 2026-06-05 0:25 UTC (permalink / raw) To: paulmck Cc: linux-kernel, peterz, mingo, luto, tglx, clrkwllms, bigeasy, muchun.song On 2026-06-05 1:46 a.m., Paul E. McKenney wrote: > On Wed, Jun 03, 2026 at 09:21:10PM +0800, Chuyi Zhou wrote: >> Hi Paul, >> >> On 2026-05-23 12:27 p.m., Chuyi Zhou wrote: >>> Commit b0473dcd4b1d ("smp: Improve smp_call_function_single() >>> CSD-lock diagnostics") made smp_call_function_single() use the destination >>> CPU's csd_data when CSD lock debugging is enabled. That lets the debug code >>> associate a stuck CSD lock with the target CPU, but it also means the CPU >>> argument is used in per_cpu_ptr() before generic_exec_single() has a chance >>> to validate it. >>> >>> This becomes unsafe when smp_call_function_any() cannot find an online CPU >>> in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and >>> the !wait path calls get_single_csd_data(cpu) before generic_exec_single() >>> returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU >>> offset array with an invalid CPU number. >>> >>> Use the destination CPU's csd_data only when the CPU number is within >>> nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data >>> and let generic_exec_single() perform the existing validation and return >>> -ENXIO. >>> >>> Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics") >>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> >>> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> >> >> I hit a soft-lockup panic while running scftorture on a large x86_64 SMP >> machine with a kernel based on origin/sched/core. The issue appears to >> be a regression from: >> >> b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock >> diagnostics") >> >> The reproduction environment was: >> >> base: origin/sched/core >> machine: large x86_64 SMP system; >> CONFIG_CSD_LOCK_WAIT_DEBUG=y >> >> The workload was: >> >> modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100 >> >> The visible panic was a secondary soft lockup. CPU 78 was exiting a >> process and got stuck waiting for CPU 81 to run a synchronous TLB >> shootdown callback: >> >> watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732] >> Kernel panic - not syncing: softlockup: hung tasks >> >> smp: csd: Detected non-responsive CSD lock (#118) on CPU#78, >> waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...) >> smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81 >> >> CPU 78's stack was: >> >> csd_lock_wait_toolong >> smp_call_function_many_cond >> on_each_cpu_cond_mask >> flush_tlb_mm_range >> tlb_finish_mmu >> exit_mmap >> __mmput >> do_exit >> do_group_exit >> __x64_sys_exit_group >> >> CPU 81 was running a scftorture invoker and was already inside the >> call-function-single interrupt path: >> >> CPU: 81 PID: 13823 Comm: scftorture_invo >> RIP: __flush_smp_call_function_queue+0x70/0x4d0 >> __sysvec_call_function_single >> sysvec_call_function_single >> asm_sysvec_call_function_single >> ktime_get_mono_fast_ns >> csd_lock_wait_toolong >> smp_call_function_single >> scftorture_invoker >> >> The dump showed CPU 81 scanning a corrupted async scftorture CSD: >> >> ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001 >> func = scf_handler_1 [scftorture] >> >> Decoding that CSD: >> >> u_flags = 0x1 -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC >> src = 112 >> dst = 81 >> >> The first word is the llist next pointer, and it points back to the CSD >> itself: >> >> csd->node.llist.next == csd >> >> That self-loop explains why CPU 81 made no progress in >> __flush_smp_call_function_queue(). >> >> There was later work still pending on CPU 81's live call_single_queue: >> >> call_single_queue[81].first = ff157dfa7fe79c00 >> >> That queued entry was CPU 78's synchronous TLB callback: >> >> u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC >> src = 78 >> dst = 81 >> func = flush_tlb_func >> >> So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck >> in an earlier invocation of __flush_smp_call_function_queue(). >> >> The root cause seems to be that b0473dcd4b1d changes the !wait path of >> smp_call_function_single() to use the destination CPU's csd_data when >> csdlock_debug_enabled is set: >> >> if (static_branch_unlikely(&csdlock_debug_enabled)) >> return per_cpu_ptr(&csd_data, cpu); >> return this_cpu_ptr(&csd_data); >> >> The important invariant is that the internal smp-call CSDs are normally >> single-writer objects. With the old this_cpu_ptr(&csd_data) allocation, >> smp_call_function_single(..., wait = 0) used the current CPU's csd_data, >> so each sending CPU had its own CSD. smp_call_function() goes through >> smp_call_function_many_cond(), and that path also starts from the >> current CPU's cfd_data: >> >> cfd = this_cpu_ptr(&cfd_data); >> csd = per_cpu_ptr(cfd->csd, cpu); >> >> get_cpu() keeps preemption disabled while the CSD is prepared and >> queued, so another task on the same CPU cannot concurrently reuse that >> sender's cfd/csd storage either. >> >> This is why the non-atomic csd_lock() worked before: the target CPU >> could still observe and unlock the CSD from the previous request, but >> there was only one CPU that could prepare and enqueue that particular >> CSD object. With the debug-mode destination-CPU allocation, all CPUs >> sending async single calls to CPU 81 contend for the same object: >> >> csd_data[81] == ff157dfa7fe72e40 >> >> However, csd_lock() is not an atomic test-and-set lock: >> >> csd_lock_wait(csd); >> csd->node.u_flags |= CSD_FLAG_LOCK; >> >> That is safe under the previous single-writer rule, but it is not safe >> once many remote CPUs can acquire the same destination CPU CSD. Two >> senders can both observe the lock clear, both set CSD_FLAG_LOCK, >> overwrite func/info/src/dst and then both enqueue the same llist node: >> >> llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81)) >> >> If the same node is added again while it is already the queue head, >> llist_add() writes the old head into new->next. Since old head == new, >> this produces: >> >> csd->node.llist.next = csd >> >> That exactly matches the dump. >> >> I see two possible fixes: >> >> 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data >> allocation and therefore restores the single-writer property, at the >> cost of losing the improved destination-CPU CSD lock diagnostics. >> >> 2. Keep the destination-CPU csd_data behavior for CSD lock debugging, >> but make csd_lock() an atomic test-and-set lock when >> csdlock_debug_enabled is set. > > Thank you for the analysis and patch, and apologies for my confusion > with b0473dcd4b1d! > > It looks plausible, but unfortunately, I was not able to find a version > to which this patch applies, even after fixing the whitespace breakage > noted below. Which might be another broken line or other whitespace > error that my eyes missed. But even with --ignore-whitespace, it still > does not apply. > > Please also see below for a way to shrink this a bit. > If you update the patch and tell me what commit to apply it to, I will > continue testing and review. > > Thank you, Paul. Sorry for the broken inline patch. It was meant mostly to confirm the root cause and the possible fix direction, but I should have made that clearer. I will send a proper patch series based on tip/sched/core next week, and I will include the exact base commit in the cover letter. The series will have two patches: 1. smp: Avoid invalid per-CPU CSD lookup with CSD lock debug 2. smp: Make csd_lock() atomic for CSD lock debug destination CSDs This fixes the multi-writer case when CSD lock debugging makes smp_call_function_single(..., wait = 0) use the destination CPU's csd_data. Thanks for looking at this and for the suggestion! ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-06-05 0:26 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-23 4:27 [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug Chuyi Zhou 2026-05-23 6:07 ` Muchun Song 2026-06-03 13:21 ` Chuyi Zhou 2026-06-04 17:46 ` Paul E. McKenney 2026-06-05 0:25 ` Chuyi Zhou
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox