[RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug
@ 2026-05-23  4:27 Chuyi Zhou
  2026-05-23  6:07 ` Muchun Song
  2026-06-03 13:21 ` Chuyi Zhou
  0 siblings, 2 replies; 5+ messages in thread
From: Chuyi Zhou @ 2026-05-23  4:27 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bigeasy,
	clrkwllms
  Cc: linux-kernel, Chuyi Zhou

Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
CSD-lock diagnostics") made smp_call_function_single() use the destination
CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
associate a stuck CSD lock with the target CPU, but it also means the CPU
argument is used in per_cpu_ptr() before generic_exec_single() has a chance
to validate it.

This becomes unsafe when smp_call_function_any() cannot find an online CPU
in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
offset array with an invalid CPU number.

Use the destination CPU's csd_data only when the CPU number is within
nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
and let generic_exec_single() perform the existing validation and return
-ENXIO.

Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/smp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index a0bb56bd8dda..dc6582bb35d0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -380,7 +380,8 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
 #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
 static call_single_data_t *get_single_csd_data(int cpu)
 {
-	if (static_branch_unlikely(&csdlock_debug_enabled))
+	if (static_branch_unlikely(&csdlock_debug_enabled) &&
+	    (unsigned int)cpu < nr_cpu_ids)
 		return per_cpu_ptr(&csd_data, cpu);
 	return this_cpu_ptr(&csd_data);
 }
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug
  2026-05-23  4:27 [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug Chuyi Zhou
@ 2026-05-23  6:07 ` Muchun Song
  2026-06-03 13:21 ` Chuyi Zhou
  1 sibling, 0 replies; 5+ messages in thread
From: Muchun Song @ 2026-05-23  6:07 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, bigeasy, clrkwllms,
	linux-kernel



> On May 23, 2026, at 12:27, Chuyi Zhou <zhouchuyi@bytedance.com> wrote:
> 
> Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
> CSD-lock diagnostics") made smp_call_function_single() use the destination
> CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
> associate a stuck CSD lock with the target CPU, but it also means the CPU
> argument is used in per_cpu_ptr() before generic_exec_single() has a chance
> to validate it.
> 
> This becomes unsafe when smp_call_function_any() cannot find an online CPU
> in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
> the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
> returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
> offset array with an invalid CPU number.
> 
> Use the destination CPU's csd_data only when the CPU number is within
> nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
> and let generic_exec_single() perform the existing validation and return
> -ENXIO.
> 
> Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

Acked-by: Muchun Song <muchun.song@linux.dev>

Thanks.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug
  2026-05-23  4:27 [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug Chuyi Zhou
  2026-05-23  6:07 ` Muchun Song
@ 2026-06-03 13:21 ` Chuyi Zhou
  2026-06-04 17:46   ` Paul E. McKenney
  1 sibling, 1 reply; 5+ messages in thread
From: Chuyi Zhou @ 2026-06-03 13:21 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, peterz, mingo, luto, tglx, clrkwllms, bigeasy,
	muchun.song

Hi Paul,

On 2026-05-23 12:27 p.m., Chuyi Zhou wrote:
> Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
> CSD-lock diagnostics") made smp_call_function_single() use the destination
> CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
> associate a stuck CSD lock with the target CPU, but it also means the CPU
> argument is used in per_cpu_ptr() before generic_exec_single() has a chance
> to validate it.
> 
> This becomes unsafe when smp_call_function_any() cannot find an online CPU
> in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
> the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
> returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
> offset array with an invalid CPU number.
> 
> Use the destination CPU's csd_data only when the CPU number is within
> nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
> and let generic_exec_single() perform the existing validation and return
> -ENXIO.
> 
> Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

I hit a soft-lockup panic while running scftorture on a large x86_64 SMP
machine with a kernel based on origin/sched/core.  The issue appears to 
be a regression from:

	b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock 
diagnostics")

The reproduction environment was:

   base: origin/sched/core
   machine: large x86_64 SMP system;
   CONFIG_CSD_LOCK_WAIT_DEBUG=y

The workload was:

   modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100

The visible panic was a secondary soft lockup. CPU 78 was exiting a 
process and got stuck waiting for CPU 81 to run a synchronous TLB 
shootdown callback:

   watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732]
   Kernel panic - not syncing: softlockup: hung tasks

   smp: csd: Detected non-responsive CSD lock (#118) on CPU#78,
   waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...)
   smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81

CPU 78's stack was:

   csd_lock_wait_toolong
   smp_call_function_many_cond
   on_each_cpu_cond_mask
   flush_tlb_mm_range
   tlb_finish_mmu
   exit_mmap
   __mmput
   do_exit
   do_group_exit
   __x64_sys_exit_group

CPU 81 was running a scftorture invoker and was already inside the 
call-function-single interrupt path:

   CPU: 81 PID: 13823 Comm: scftorture_invo
   RIP: __flush_smp_call_function_queue+0x70/0x4d0
   __sysvec_call_function_single
   sysvec_call_function_single
   asm_sysvec_call_function_single
   ktime_get_mono_fast_ns
   csd_lock_wait_toolong
   smp_call_function_single
   scftorture_invoker

The dump showed CPU 81 scanning a corrupted async scftorture CSD:

   ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001
   func = scf_handler_1 [scftorture]

Decoding that CSD:

   u_flags = 0x1  -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC
   src     = 112
   dst     = 81

The first word is the llist next pointer, and it points back to the CSD
itself:

   csd->node.llist.next == csd

That self-loop explains why CPU 81 made no progress in
__flush_smp_call_function_queue().

There was later work still pending on CPU 81's live call_single_queue:

   call_single_queue[81].first = ff157dfa7fe79c00

That queued entry was CPU 78's synchronous TLB callback:

   u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC
   src     = 78
   dst     = 81
   func    = flush_tlb_func

So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck 
in an earlier invocation of __flush_smp_call_function_queue().

The root cause seems to be that b0473dcd4b1d changes the !wait path of
smp_call_function_single() to use the destination CPU's csd_data when
csdlock_debug_enabled is set:

   if (static_branch_unlikely(&csdlock_debug_enabled))
           return per_cpu_ptr(&csd_data, cpu);
   return this_cpu_ptr(&csd_data);

The important invariant is that the internal smp-call CSDs are normally
single-writer objects.  With the old this_cpu_ptr(&csd_data) allocation,
smp_call_function_single(..., wait = 0) used the current CPU's csd_data, 
so each sending CPU had its own CSD.  smp_call_function() goes through
smp_call_function_many_cond(), and that path also starts from the 
current CPU's cfd_data:

   cfd = this_cpu_ptr(&cfd_data);
   csd = per_cpu_ptr(cfd->csd, cpu);

get_cpu() keeps preemption disabled while the CSD is prepared and 
queued, so another task on the same CPU cannot concurrently reuse that 
sender's cfd/csd storage either.

This is why the non-atomic csd_lock() worked before: the target CPU 
could still observe and unlock the CSD from the previous request, but 
there was only one CPU that could prepare and enqueue that particular 
CSD object. With the debug-mode destination-CPU allocation, all CPUs 
sending async single calls to CPU 81 contend for the same object:

   csd_data[81] == ff157dfa7fe72e40

However, csd_lock() is not an atomic test-and-set lock:

   csd_lock_wait(csd);
   csd->node.u_flags |= CSD_FLAG_LOCK;

That is safe under the previous single-writer rule, but it is not safe 
once many remote CPUs can acquire the same destination CPU CSD.  Two 
senders can both observe the lock clear, both set CSD_FLAG_LOCK, 
overwrite func/info/src/dst and then both enqueue the same llist node:

   llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81))

If the same node is added again while it is already the queue head,
llist_add() writes the old head into new->next.  Since old head == new, 
this produces:

   csd->node.llist.next = csd

That exactly matches the dump.

I see two possible fixes:

1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data 
allocation and therefore restores the single-writer property, at the 
cost of losing the improved destination-CPU CSD lock diagnostics.

2. Keep the destination-CPU csd_data behavior for CSD lock debugging, 
but make csd_lock() an atomic test-and-set lock when 
csdlock_debug_enabled is set.

One possible implementation is:

kernel/smp.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/kernel/smp.c b/kernel/smp.c
index 15799f842746..e1d2fd3c9d9f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -362,6 +362,34 @@ static __always_inline void 
csd_lock_wait(call_single_data_t *csd)
  }
  #endif

+#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
+static __always_inline void csd_lock(call_single_data_t *csd)
+{
+	if (static_branch_unlikely(&csdlock_debug_enabled)) {
+		unsigned int flags;
+
+		for (;;) {
+			__csd_lock_wait(csd);
+			flags = READ_ONCE(csd->node.u_flags);
+			if (!(flags & CSD_FLAG_LOCK) &&
+			    try_cmpxchg_acquire(&csd->node.u_flags, &flags,
+						flags | CSD_FLAG_LOCK))
+				break;
+			cpu_relax();
+		}
+	} else {
+		csd_lock_wait(csd);
+		csd->node.u_flags |= CSD_FLAG_LOCK;
+	}
+
+	/*
+	 * prevent CPU from reordering the above assignment
+	 * to ->flags with any subsequent assignments to other
+	 * fields of the specified call_single_data_t structure:
+	 */
+	smp_wmb();
+}
+#else
  static __always_inline void csd_lock(call_single_data_t *csd)
  {
  	csd_lock_wait(csd);
@@ -374,6 +402,7 @@ static __always_inline void 
csd_lock(call_single_data_t *csd)
  	 */
  	smp_wmb();
  }
+#endif

  static __always_inline void csd_unlock(call_single_data_t *csd)
  {
--

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug
  2026-06-03 13:21 ` Chuyi Zhou
@ 2026-06-04 17:46   ` Paul E. McKenney
  2026-06-05  0:25     ` Chuyi Zhou
  0 siblings, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2026-06-04 17:46 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: linux-kernel, peterz, mingo, luto, tglx, clrkwllms, bigeasy,
	muchun.song

On Wed, Jun 03, 2026 at 09:21:10PM +0800, Chuyi Zhou wrote:
> Hi Paul,
> 
> On 2026-05-23 12:27 p.m., Chuyi Zhou wrote:
> > Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
> > CSD-lock diagnostics") made smp_call_function_single() use the destination
> > CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
> > associate a stuck CSD lock with the target CPU, but it also means the CPU
> > argument is used in per_cpu_ptr() before generic_exec_single() has a chance
> > to validate it.
> > 
> > This becomes unsafe when smp_call_function_any() cannot find an online CPU
> > in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
> > the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
> > returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
> > offset array with an invalid CPU number.
> > 
> > Use the destination CPU's csd_data only when the CPU number is within
> > nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
> > and let generic_exec_single() perform the existing validation and return
> > -ENXIO.
> > 
> > Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
> > Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> 
> I hit a soft-lockup panic while running scftorture on a large x86_64 SMP
> machine with a kernel based on origin/sched/core.  The issue appears to 
> be a regression from:
> 
> 	b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock 
> diagnostics")
> 
> The reproduction environment was:
> 
>    base: origin/sched/core
>    machine: large x86_64 SMP system;
>    CONFIG_CSD_LOCK_WAIT_DEBUG=y
> 
> The workload was:
> 
>    modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100
> 
> The visible panic was a secondary soft lockup. CPU 78 was exiting a 
> process and got stuck waiting for CPU 81 to run a synchronous TLB 
> shootdown callback:
> 
>    watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732]
>    Kernel panic - not syncing: softlockup: hung tasks
> 
>    smp: csd: Detected non-responsive CSD lock (#118) on CPU#78,
>    waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...)
>    smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81
> 
> CPU 78's stack was:
> 
>    csd_lock_wait_toolong
>    smp_call_function_many_cond
>    on_each_cpu_cond_mask
>    flush_tlb_mm_range
>    tlb_finish_mmu
>    exit_mmap
>    __mmput
>    do_exit
>    do_group_exit
>    __x64_sys_exit_group
> 
> CPU 81 was running a scftorture invoker and was already inside the 
> call-function-single interrupt path:
> 
>    CPU: 81 PID: 13823 Comm: scftorture_invo
>    RIP: __flush_smp_call_function_queue+0x70/0x4d0
>    __sysvec_call_function_single
>    sysvec_call_function_single
>    asm_sysvec_call_function_single
>    ktime_get_mono_fast_ns
>    csd_lock_wait_toolong
>    smp_call_function_single
>    scftorture_invoker
> 
> The dump showed CPU 81 scanning a corrupted async scftorture CSD:
> 
>    ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001
>    func = scf_handler_1 [scftorture]
> 
> Decoding that CSD:
> 
>    u_flags = 0x1  -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC
>    src     = 112
>    dst     = 81
> 
> The first word is the llist next pointer, and it points back to the CSD
> itself:
> 
>    csd->node.llist.next == csd
> 
> That self-loop explains why CPU 81 made no progress in
> __flush_smp_call_function_queue().
> 
> There was later work still pending on CPU 81's live call_single_queue:
> 
>    call_single_queue[81].first = ff157dfa7fe79c00
> 
> That queued entry was CPU 78's synchronous TLB callback:
> 
>    u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC
>    src     = 78
>    dst     = 81
>    func    = flush_tlb_func
> 
> So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck 
> in an earlier invocation of __flush_smp_call_function_queue().
> 
> The root cause seems to be that b0473dcd4b1d changes the !wait path of
> smp_call_function_single() to use the destination CPU's csd_data when
> csdlock_debug_enabled is set:
> 
>    if (static_branch_unlikely(&csdlock_debug_enabled))
>            return per_cpu_ptr(&csd_data, cpu);
>    return this_cpu_ptr(&csd_data);
> 
> The important invariant is that the internal smp-call CSDs are normally
> single-writer objects.  With the old this_cpu_ptr(&csd_data) allocation,
> smp_call_function_single(..., wait = 0) used the current CPU's csd_data, 
> so each sending CPU had its own CSD.  smp_call_function() goes through
> smp_call_function_many_cond(), and that path also starts from the 
> current CPU's cfd_data:
> 
>    cfd = this_cpu_ptr(&cfd_data);
>    csd = per_cpu_ptr(cfd->csd, cpu);
> 
> get_cpu() keeps preemption disabled while the CSD is prepared and 
> queued, so another task on the same CPU cannot concurrently reuse that 
> sender's cfd/csd storage either.
> 
> This is why the non-atomic csd_lock() worked before: the target CPU 
> could still observe and unlock the CSD from the previous request, but 
> there was only one CPU that could prepare and enqueue that particular 
> CSD object. With the debug-mode destination-CPU allocation, all CPUs 
> sending async single calls to CPU 81 contend for the same object:
> 
>    csd_data[81] == ff157dfa7fe72e40
> 
> However, csd_lock() is not an atomic test-and-set lock:
> 
>    csd_lock_wait(csd);
>    csd->node.u_flags |= CSD_FLAG_LOCK;
> 
> That is safe under the previous single-writer rule, but it is not safe 
> once many remote CPUs can acquire the same destination CPU CSD.  Two 
> senders can both observe the lock clear, both set CSD_FLAG_LOCK, 
> overwrite func/info/src/dst and then both enqueue the same llist node:
> 
>    llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81))
> 
> If the same node is added again while it is already the queue head,
> llist_add() writes the old head into new->next.  Since old head == new, 
> this produces:
> 
>    csd->node.llist.next = csd
> 
> That exactly matches the dump.
> 
> I see two possible fixes:
> 
> 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data 
> allocation and therefore restores the single-writer property, at the 
> cost of losing the improved destination-CPU CSD lock diagnostics.
> 
> 2. Keep the destination-CPU csd_data behavior for CSD lock debugging, 
> but make csd_lock() an atomic test-and-set lock when 
> csdlock_debug_enabled is set.

Thank you for the analysis and patch, and apologies for my confusion
with b0473dcd4b1d!

It looks plausible, but unfortunately, I was not able to find a version
to which this patch applies, even after fixing the whitespace breakage
noted below.  Which might be another broken line or other whitespace
error that my eyes missed.  But even with --ignore-whitespace, it still
does not apply.

Please also see below for a way to shrink this a bit.
If you update the patch and tell me what commit to apply it to, I will
continue testing and review.

> One possible implementation is:
> 
> kernel/smp.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 15799f842746..e1d2fd3c9d9f 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -362,6 +362,34 @@ static __always_inline void 
> csd_lock_wait(call_single_data_t *csd)

The above two lines need to be one line.

>   }
>   #endif
> 
> +#ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> +static __always_inline void csd_lock(call_single_data_t *csd)
> +{
> +	if (static_branch_unlikely(&csdlock_debug_enabled)) {

If you make the above line instead read as follows:

	if (IS_ENABLED(CONFIG_CSD_LOCK_WAIT_DEBUG) &&
	    static_branch_unlikely(&csdlock_debug_enabled)) {

Then you can get rid of the #ifdef and the other version of the
csd_lock() function.

> +		unsigned int flags;
> +
> +		for (;;) {
> +			__csd_lock_wait(csd);
> +			flags = READ_ONCE(csd->node.u_flags);
> +			if (!(flags & CSD_FLAG_LOCK) &&
> +			    try_cmpxchg_acquire(&csd->node.u_flags, &flags,
> +						flags | CSD_FLAG_LOCK))
> +				break;
> +			cpu_relax();
> +		}
> +	} else {
> +		csd_lock_wait(csd);
> +		csd->node.u_flags |= CSD_FLAG_LOCK;
> +	}
> +
> +	/*
> +	 * prevent CPU from reordering the above assignment
> +	 * to ->flags with any subsequent assignments to other
> +	 * fields of the specified call_single_data_t structure:
> +	 */
> +	smp_wmb();
> +}
> +#else
>   static __always_inline void csd_lock(call_single_data_t *csd)
>   {
>   	csd_lock_wait(csd);
> @@ -374,6 +402,7 @@ static __always_inline void 
> csd_lock(call_single_data_t *csd)

And again, the above two lines need to be one line.

							Thanx, Paul

>   	 */
>   	smp_wmb();
>   }
> +#endif
> 
>   static __always_inline void csd_unlock(call_single_data_t *csd)
>   {
> --

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug
  2026-06-04 17:46   ` Paul E. McKenney
@ 2026-06-05  0:25     ` Chuyi Zhou
  0 siblings, 0 replies; 5+ messages in thread
From: Chuyi Zhou @ 2026-06-05  0:25 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, peterz, mingo, luto, tglx, clrkwllms, bigeasy,
	muchun.song

On 2026-06-05 1:46 a.m., Paul E. McKenney wrote:
> On Wed, Jun 03, 2026 at 09:21:10PM +0800, Chuyi Zhou wrote:
>> Hi Paul,
>>
>> On 2026-05-23 12:27 p.m., Chuyi Zhou wrote:
>>> Commit b0473dcd4b1d ("smp: Improve smp_call_function_single()
>>> CSD-lock diagnostics") made smp_call_function_single() use the destination
>>> CPU's csd_data when CSD lock debugging is enabled. That lets the debug code
>>> associate a stuck CSD lock with the target CPU, but it also means the CPU
>>> argument is used in per_cpu_ptr() before generic_exec_single() has a chance
>>> to validate it.
>>>
>>> This becomes unsafe when smp_call_function_any() cannot find an online CPU
>>> in the supplied mask. In that case the selected CPU can be nr_cpu_ids, and
>>> the !wait path calls get_single_csd_data(cpu) before generic_exec_single()
>>> returns -ENXIO. With csdlock_debug_enabled set, that indexes the per-CPU
>>> offset array with an invalid CPU number.
>>>
>>> Use the destination CPU's csd_data only when the CPU number is within
>>> nr_cpu_ids. For invalid CPU numbers, fall back to the local CPU's csd_data
>>> and let generic_exec_single() perform the existing validation and return
>>> -ENXIO.
>>>
>>> Fixes: b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock diagnostics")
>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>>> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
>>
>> I hit a soft-lockup panic while running scftorture on a large x86_64 SMP
>> machine with a kernel based on origin/sched/core.  The issue appears to
>> be a regression from:
>>
>> 	b0473dcd4b1d ("smp: Improve smp_call_function_single() CSD-lock
>> diagnostics")
>>
>> The reproduction environment was:
>>
>>     base: origin/sched/core
>>     machine: large x86_64 SMP system;
>>     CONFIG_CSD_LOCK_WAIT_DEBUG=y
>>
>> The workload was:
>>
>>     modprobe scftorture nthreads=1024 onoff_holdoff=50 onoff_interval=100
>>
>> The visible panic was a secondary soft lockup. CPU 78 was exiting a
>> process and got stuck waiting for CPU 81 to run a synchronous TLB
>> shootdown callback:
>>
>>     watchdog: BUG: soft lockup - CPU#78 stuck for 30s! [nc:13732]
>>     Kernel panic - not syncing: softlockup: hung tasks
>>
>>     smp: csd: Detected non-responsive CSD lock (#118) on CPU#78,
>>     waiting 5000000032 ns for CPU#81 flush_tlb_func+0x0/0x290(...)
>>     smp: csd: Re-sending CSD lock (#118) IPI from CPU#78 to CPU#81
>>
>> CPU 78's stack was:
>>
>>     csd_lock_wait_toolong
>>     smp_call_function_many_cond
>>     on_each_cpu_cond_mask
>>     flush_tlb_mm_range
>>     tlb_finish_mmu
>>     exit_mmap
>>     __mmput
>>     do_exit
>>     do_group_exit
>>     __x64_sys_exit_group
>>
>> CPU 81 was running a scftorture invoker and was already inside the
>> call-function-single interrupt path:
>>
>>     CPU: 81 PID: 13823 Comm: scftorture_invo
>>     RIP: __flush_smp_call_function_queue+0x70/0x4d0
>>     __sysvec_call_function_single
>>     sysvec_call_function_single
>>     asm_sysvec_call_function_single
>>     ktime_get_mono_fast_ns
>>     csd_lock_wait_toolong
>>     smp_call_function_single
>>     scftorture_invoker
>>
>> The dump showed CPU 81 scanning a corrupted async scftorture CSD:
>>
>>     ff157dfa7fe72e40: ff157dfa7fe72e40 0051007000000001
>>     func = scf_handler_1 [scftorture]
>>
>> Decoding that CSD:
>>
>>     u_flags = 0x1  -> CSD_FLAG_LOCK | CSD_TYPE_ASYNC
>>     src     = 112
>>     dst     = 81
>>
>> The first word is the llist next pointer, and it points back to the CSD
>> itself:
>>
>>     csd->node.llist.next == csd
>>
>> That self-loop explains why CPU 81 made no progress in
>> __flush_smp_call_function_queue().
>>
>> There was later work still pending on CPU 81's live call_single_queue:
>>
>>     call_single_queue[81].first = ff157dfa7fe79c00
>>
>> That queued entry was CPU 78's synchronous TLB callback:
>>
>>     u_flags = 0x11 -> CSD_FLAG_LOCK | CSD_TYPE_SYNC
>>     src     = 78
>>     dst     = 81
>>     func    = flush_tlb_func
>>
>> So CPU 78's TLB shootdown CSD was queued, but CPU 81 was already stuck
>> in an earlier invocation of __flush_smp_call_function_queue().
>>
>> The root cause seems to be that b0473dcd4b1d changes the !wait path of
>> smp_call_function_single() to use the destination CPU's csd_data when
>> csdlock_debug_enabled is set:
>>
>>     if (static_branch_unlikely(&csdlock_debug_enabled))
>>             return per_cpu_ptr(&csd_data, cpu);
>>     return this_cpu_ptr(&csd_data);
>>
>> The important invariant is that the internal smp-call CSDs are normally
>> single-writer objects.  With the old this_cpu_ptr(&csd_data) allocation,
>> smp_call_function_single(..., wait = 0) used the current CPU's csd_data,
>> so each sending CPU had its own CSD.  smp_call_function() goes through
>> smp_call_function_many_cond(), and that path also starts from the
>> current CPU's cfd_data:
>>
>>     cfd = this_cpu_ptr(&cfd_data);
>>     csd = per_cpu_ptr(cfd->csd, cpu);
>>
>> get_cpu() keeps preemption disabled while the CSD is prepared and
>> queued, so another task on the same CPU cannot concurrently reuse that
>> sender's cfd/csd storage either.
>>
>> This is why the non-atomic csd_lock() worked before: the target CPU
>> could still observe and unlock the CSD from the previous request, but
>> there was only one CPU that could prepare and enqueue that particular
>> CSD object. With the debug-mode destination-CPU allocation, all CPUs
>> sending async single calls to CPU 81 contend for the same object:
>>
>>     csd_data[81] == ff157dfa7fe72e40
>>
>> However, csd_lock() is not an atomic test-and-set lock:
>>
>>     csd_lock_wait(csd);
>>     csd->node.u_flags |= CSD_FLAG_LOCK;
>>
>> That is safe under the previous single-writer rule, but it is not safe
>> once many remote CPUs can acquire the same destination CPU CSD.  Two
>> senders can both observe the lock clear, both set CSD_FLAG_LOCK,
>> overwrite func/info/src/dst and then both enqueue the same llist node:
>>
>>     llist_add(&csd->node.llist, &per_cpu(call_single_queue, 81))
>>
>> If the same node is added again while it is already the queue head,
>> llist_add() writes the old head into new->next.  Since old head == new,
>> this produces:
>>
>>     csd->node.llist.next = csd
>>
>> That exactly matches the dump.
>>
>> I see two possible fixes:
>>
>> 1. Revert b0473dcd4b1d. This restores the old source-CPU csd_data
>> allocation and therefore restores the single-writer property, at the
>> cost of losing the improved destination-CPU CSD lock diagnostics.
>>
>> 2. Keep the destination-CPU csd_data behavior for CSD lock debugging,
>> but make csd_lock() an atomic test-and-set lock when
>> csdlock_debug_enabled is set.
> 
> Thank you for the analysis and patch, and apologies for my confusion
> with b0473dcd4b1d!
> 
> It looks plausible, but unfortunately, I was not able to find a version
> to which this patch applies, even after fixing the whitespace breakage
> noted below.  Which might be another broken line or other whitespace
> error that my eyes missed.  But even with --ignore-whitespace, it still
> does not apply.
> 
> Please also see below for a way to shrink this a bit.
> If you update the patch and tell me what commit to apply it to, I will
> continue testing and review.
> 
>

Thank you, Paul.

Sorry for the broken inline patch. It was meant mostly to confirm the
root cause and the possible fix direction, but I should have made that
clearer.

I will send a proper patch series based on tip/sched/core next week, and 
I will include the exact base commit in the cover letter. The series 
will have two patches:

     1. smp: Avoid invalid per-CPU CSD lookup with CSD lock debug

     2. smp: Make csd_lock() atomic for CSD lock debug destination CSDs

        This fixes the multi-writer case when CSD lock debugging makes
        smp_call_function_single(..., wait = 0) use the destination CPU's
        csd_data.


Thanks for looking at this and for the suggestion!

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-05  0:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-23  4:27 [RESEND PATCH] smp: Avoid invalid per-CPU CSD lookup with CSD lock debug Chuyi Zhou
2026-05-23  6:07 ` Muchun Song
2026-06-03 13:21 ` Chuyi Zhou
2026-06-04 17:46   ` Paul E. McKenney
2026-06-05  0:25     ` Chuyi Zhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox