[PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks

Linux-mm Archive on lore.kernel.org

* [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks
@ 2026-06-12 15:16 Breno Leitao
  2026-06-12 16:52 ` Lance Yang
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-12 15:16 UTC
  To: Catalin Marinas, Andrew Morton, lance.yang, Davidlohr Bueso,
	Oleg Nesterov, Qian Cai
  Cc: linux-mm, linux-kernel, kernel-team, stable, Breno Leitao

kmemleak_scan() walks every thread and scans its kernel stack under a
single rcu_read_lock() with no reschedule point. On a host with very
many threads -- amplified by KASAN/lockdep in debug builds -- this loop
can hog a CPU long enough to trip the soft lockup watchdog:

  watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [kmemleak:537]
   scan_block
   kmemleak_scan
   kmemleak_scan_thread
   kthread

A cond_resched() cannot be added directly: the loop runs inside an RCU
read-side critical section.

Borrow the rcu_lock_break() pattern from kernel/hung_task.c: when a
reschedule is needed, pin the two iteration cursors, drop the RCU read
lock, cond_resched(), then re-acquire it and continue only if both
cursors are still hashed.

If a cursor was unhashed while the lock was dropped, the thread list
cannot be walked further, so the round is aborted. Such a round scans
only part of the task stacks, which would make live objects look
unreferenced, so reuse the existing "scan interrupted" path to skip
reporting; the next full scan reports the real leaks.

Fixes: c4b28963fd79 ("mm/kmemleak: rely on rcu for task stack scanning")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Do not create the nasty array, but use the same pattern as
  kernel/hung_task.c.
- Link to v1: https://lore.kernel.org/r/20260611-kmemleak-stack-resched-v1-1-d6248ade5f4a@debian.org
---
 mm/kmemleak.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 7c7ba17ce7af0..d88274dc0c605 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1695,6 +1695,32 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
 	put_object(object);
 }
 
+/*
+ * Briefly drop the RCU read lock to reschedule during the task stack scan.
+ * Both cursors are pinned across the gap; return false if either one was
+ * unhashed meanwhile, so the caller stops this round instead of walking a
+ * stale list.
+ */
+static bool kmemleak_stack_scan_break(struct task_struct *g,
+				      struct task_struct *p)
+{
+	bool can_cont;
+
+	get_task_struct(g);
+	get_task_struct(p);
+
+	rcu_read_unlock();
+	cond_resched();
+	rcu_read_lock();
+
+	can_cont = pid_alive(g) && pid_alive(p);
+
+	put_task_struct(p);
+	put_task_struct(g);
+
+	return can_cont;
+}
+
 /*
  * Print one leak inline. The hex dump is gated on OBJECT_ALLOCATED so it
  * does not touch user memory that was freed concurrently; the rest of the
@@ -1804,6 +1830,7 @@ static void kmemleak_scan(void)
 	int __maybe_unused i;
 	struct xarray dedup;
 	int new_leaks = 0;
+	bool aborted = false;
 
 	jiffies_last_scan = jiffies;
 
@@ -1890,11 +1917,21 @@ static void kmemleak_scan(void)
 		rcu_read_lock();
 		for_each_process_thread(g, p) {
 			void *stack = try_get_task_stack(p);
+
 			if (stack) {
 				scan_block(stack, stack + THREAD_SIZE, NULL);
 				put_task_stack(p);
 			}
+			/*
+			 * This is an expensive loop, we must to call the
+			 * scheduler to avoid lockups
+			 */
+			if (need_resched() && !kmemleak_stack_scan_break(g, p)) {
+				aborted = true;
+				goto unlock;
+			}
 		}
+unlock:
 		rcu_read_unlock();
 	}
 
@@ -1937,9 +1974,10 @@ static void kmemleak_scan(void)
 	scan_gray_list();
 
 	/*
-	 * If scanning was stopped do not report any new unreferenced objects.
+	 * If scanning was stopped or a stack scan round was aborted, do not
+	 * report any new unreferenced objects.
 	 */
-	if (scan_should_stop())
+	if (scan_should_stop() || aborted)
 		return;
 
 	/*

---
base-commit: abe651837cb394f76d738a7a747322fca3bf17ba
change-id: 20260611-kmemleak-stack-resched-01ed72858a7f

Best regards,
-- 
Breno Leitao <leitao@debian.org>




* Re: [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks
  2026-06-12 15:16 [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks Breno Leitao
@ 2026-06-12 16:52 ` Lance Yang
  2026-06-12 17:11 ` Catalin Marinas
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Lance Yang @ 2026-06-12 16:52 UTC
  To: leitao
  Cc: catalin.marinas, akpm, lance.yang, dave, oleg, cai, linux-mm,
	linux-kernel, kernel-team, stable


On Fri, Jun 12, 2026 at 08:16:07AM -0700, Breno Leitao wrote:
>kmemleak_scan() walks every thread and scans its kernel stack under a
>single rcu_read_lock() with no reschedule point. On a host with very
>many threads -- amplified by KASAN/lockdep in debug builds -- this loop
>can hog a CPU long enough to trip the soft lockup watchdog:
>
>  watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [kmemleak:537]
>   scan_block
>   kmemleak_scan
>   kmemleak_scan_thread
>   kthread
>
>A cond_resched() cannot be added directly: the loop runs inside an RCU
>read-side critical section.
>
>Borrow the rcu_lock_break() pattern from kernel/hung_task.c: when a
>reschedule is needed, pin the two iteration cursors, drop the RCU read
>lock, cond_resched(), then re-acquire it and continue only if both
>cursors are still hashed.
>
>If a cursor was unhashed while the lock was dropped, the thread list
>cannot be walked further, so the round is aborted. Such a round scans
>only part of the task stacks, which would make live objects look
>unreferenced, so reuse the existing "scan interrupted" path to skip
>reporting; the next full scan reports the real leaks.

TBH, a bit dense to me as written ...

>Fixes: c4b28963fd79 ("mm/kmemleak: rely on rcu for task stack scanning")
>Cc: stable@vger.kernel.org
>Signed-off-by: Breno Leitao <leitao@debian.org>
>---
>Changes in v2:
>- Do not create the nasty array, but use the same pattern as
>  kernel/hung_task.c.
>- Link to v1: https://lore.kernel.org/r/20260611-kmemleak-stack-resched-v1-1-d6248ade5f4a@debian.org
>---
> mm/kmemleak.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 2 deletions(-)
>
>diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>index 7c7ba17ce7af0..d88274dc0c605 100644
>--- a/mm/kmemleak.c
>+++ b/mm/kmemleak.c
>@@ -1695,6 +1695,32 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
> 	put_object(object);
> }
> 
>+/*
>+ * Briefly drop the RCU read lock to reschedule during the task stack scan.
>+ * Both cursors are pinned across the gap; return false if either one was
>+ * unhashed meanwhile, so the caller stops this round instead of walking a
>+ * stale list.
>+ */

Personally, looks a bit clunky to me with "gap" and "unhashed" ...

Maybe:

"
Drop RCU long enough to reschedule during task stack scanning. Keep both
cursors alive while RCU is dropped; return false if either cursor can no
longer continue the walk.
"

>+static bool kmemleak_stack_scan_break(struct task_struct *g,
>+				      struct task_struct *p)
>+{
>+	bool can_cont;
>+
>+	get_task_struct(g);
>+	get_task_struct(p);
>+
>+	rcu_read_unlock();
>+	cond_resched();
>+	rcu_read_lock();
>+
>+	can_cont = pid_alive(g) && pid_alive(p);
>+
>+	put_task_struct(p);
>+	put_task_struct(g);
>+
>+	return can_cont;
>+}
>+
> /*
>  * Print one leak inline. The hex dump is gated on OBJECT_ALLOCATED so it
>  * does not touch user memory that was freed concurrently; the rest of the
>@@ -1804,6 +1830,7 @@ static void kmemleak_scan(void)
> 	int __maybe_unused i;
> 	struct xarray dedup;
> 	int new_leaks = 0;
>+	bool aborted = false;
> 
> 	jiffies_last_scan = jiffies;
> 
>@@ -1890,11 +1917,21 @@ static void kmemleak_scan(void)
> 		rcu_read_lock();
> 		for_each_process_thread(g, p) {
> 			void *stack = try_get_task_stack(p);
>+
> 			if (stack) {
> 				scan_block(stack, stack + THREAD_SIZE, NULL);
> 				put_task_stack(p);
> 			}
>+			/*
>+			 * This is an expensive loop, we must to call the
>+			 * scheduler to avoid lockups
>+			 */

need_resched() plus the helper name already say most of it. Maybe just:

"
Break the RCU read-side section before rescheduling.
"

>+			if (need_resched() && !kmemleak_stack_scan_break(g, p)) {
>+				aborted = true;
>+				goto unlock;
>+			}
> 		}
>+unlock:
> 		rcu_read_unlock();
> 	}
> 
>@@ -1937,9 +1974,10 @@ static void kmemleak_scan(void)
> 	scan_gray_list();
> 
> 	/*
>-	 * If scanning was stopped do not report any new unreferenced objects.
>+	 * If scanning was stopped or a stack scan round was aborted, do not
>+	 * report any new unreferenced objects.
> 	 */

Maybe just say "stack root scan was incomplete" here? That's the actual
reason we skip reporting.

"
If scanning was stopped or the stack root scan was incomplete, do not
report any new unreferenced objects.
"

>-	if (scan_should_stop())
>+	if (scan_should_stop() || aborted)
> 		return;
> 
> 	/*
>
>---

Apart from that, feel free to add:

Acked-by: Lance Yang <lance.yang@linux.dev>



* Re: [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks
  2026-06-12 15:16 [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks Breno Leitao
  2026-06-12 16:52 ` Lance Yang
@ 2026-06-12 17:11 ` Catalin Marinas
  2026-06-13  0:53 ` SeongJae Park
  2026-06-13 10:45 ` Oleg Nesterov
  3 siblings, 0 replies; 6+ messages in thread
From: Catalin Marinas @ 2026-06-12 17:11 UTC
  To: Breno Leitao
  Cc: Andrew Morton, lance.yang, Davidlohr Bueso, Oleg Nesterov,
	Qian Cai, linux-mm, linux-kernel, kernel-team, stable

Hi Breno,

Thanks for addressing this long-standing soft lockup problem.

On Fri, Jun 12, 2026 at 08:16:07AM -0700, Breno Leitao wrote:
> +/*
> + * Briefly drop the RCU read lock to reschedule during the task stack scan.
> + * Both cursors are pinned across the gap; return false if either one was
> + * unhashed meanwhile, so the caller stops this round instead of walking a
> + * stale list.
> + */
> +static bool kmemleak_stack_scan_break(struct task_struct *g,
> +				      struct task_struct *p)
> +{
> +	bool can_cont;
> +
> +	get_task_struct(g);
> +	get_task_struct(p);
> +
> +	rcu_read_unlock();
> +	cond_resched();
> +	rcu_read_lock();
> +
> +	can_cont = pid_alive(g) && pid_alive(p);
> +
> +	put_task_struct(p);
> +	put_task_struct(g);
> +
> +	return can_cont;
> +}

While this matches rcu_lock_break(), it looks to me like we rely too
much on the internals of kernel/exit.c. Ideally this function should be
provided as an API alongside for_each_process_thread() so that we only
have the idiom in one place in case something changes in the future.

Yet another variant below, untested. Basically, it follows the
next_tgid() or task_seq_get_next() approach (we might as well move this
to a separate function to avoid excessive indentation):

	if (kmemleak_stack_scan) {
		struct pid *pid;
		int nr = 1;

		do {
			struct task_struct *p = NULL;

			rcu_read_lock();
			pid = find_ge_pid(nr, &init_pid_ns);
			if (pid) {
				nr = pid_nr(pid) + 1;
				p = pid_task(pid, PIDTYPE_PID);
				if (p)
					get_task_struct(p);
			}
			rcu_read_unlock();

			if (p) {
				void *stack = try_get_task_stack(p);

				if (stack) {
					scan_block(stack, stack + THREAD_SIZE,
							NULL);
					put_task_stack(p);
				}
				put_task_struct(p);
			}
			cond_resched();
		} while (pid);
	}

-- 
Catalin
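
The resume-by-number idea in the variant above can be modelled in userspace. This is only a toy sketch under stated assumptions: `live_ids`, `find_ge_id()` and `walk_ids()` are hypothetical names, the table is static, and no real locking is involved. It shows why the walk needs no pinned cursor: the only state carried across the unlocked gap is the next number to look up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the pid namespace: a sorted table of live ids. */
static const int live_ids[] = { 1, 2, 7, 8, 42 };
#define NR_LIVE (sizeof(live_ids) / sizeof(live_ids[0]))

/* Smallest live id >= nr, or -1 if none; a toy model of find_ge_pid(). */
static int find_ge_id(int nr)
{
	for (size_t i = 0; i < NR_LIVE; i++)
		if (live_ids[i] >= nr)
			return live_ids[i];
	return -1;
}

/*
 * Visit every live id in ascending order, storing each in out[].
 * Returns the number of ids stored (at most max).
 */
static size_t walk_ids(int *out, size_t max)
{
	size_t n = 0;
	int nr = 1;

	assert(out != NULL);
	while (n < max) {
		/* In the kernel this lookup would sit inside rcu_read_lock(). */
		int id = find_ge_id(nr);

		if (id < 0)
			break;
		nr = id + 1;	/* the only state kept across the unlocked gap */
		out[n++] = id;
		/* Lock dropped here: a safe place for a cond_resched()-like yield. */
	}
	return n;
}
```

In this model an id that disappears between two lookups is simply skipped by the next `find_ge_id()` call, so the walk never has to be aborted; an id that appears below the cursor after the cursor has passed it is not visited in that round.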



* Re: [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks
  2026-06-12 15:16 [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks Breno Leitao
  2026-06-12 16:52 ` Lance Yang
  2026-06-12 17:11 ` Catalin Marinas
@ 2026-06-13  0:53 ` SeongJae Park
  2026-06-13 10:45 ` Oleg Nesterov
  3 siblings, 0 replies; 6+ messages in thread
From: SeongJae Park @ 2026-06-13  0:53 UTC
  To: Breno Leitao
  Cc: SeongJae Park, Catalin Marinas, Andrew Morton, lance.yang,
	Davidlohr Bueso, Oleg Nesterov, Qian Cai, linux-mm, linux-kernel,
	kernel-team, stable

On Fri, 12 Jun 2026 08:16:07 -0700 Breno Leitao <leitao@debian.org> wrote:

> kmemleak_scan() walks every thread and scans its kernel stack under a
> single rcu_read_lock() with no reschedule point. On a host with very
> many threads -- amplified by KASAN/lockdep in debug builds -- this loop
> can hog a CPU long enough to trip the soft lockup watchdog:
> 
>   watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [kmemleak:537]
>    scan_block
>    kmemleak_scan
>    kmemleak_scan_thread
>    kthread
> 
> A cond_resched() cannot be added directly: the loop runs inside an RCU
> read-side critical section.
> 
> Borrow the rcu_lock_break() pattern from kernel/hung_task.c: when a
> reschedule is needed, pin the two iteration cursors, drop the RCU read
> lock, cond_resched(), then re-acquire it and continue only if both
> cursors are still hashed.
> 
> If a cursor was unhashed while the lock was dropped, the thread list
> cannot be walked further, so the round is aborted. Such a round scans
> only part of the task stacks, which would make live objects look
> unreferenced, so reuse the existing "scan interrupted" path to skip
> reporting; the next full scan reports the real leaks.
> 
> Fixes: c4b28963fd79 ("mm/kmemleak: rely on rcu for task stack scanning")
> Cc: stable@vger.kernel.org
> Signed-off-by: Breno Leitao <leitao@debian.org>

Thank you for fixing this, Breno.  Nothing stood out to me while reading the
patch, other than the below tiny and trivial nit.  Regardless of that, please
feel free to add

Reviewed-by: SeongJae Park <sj@kernel.org>

[...]
> @@ -1890,11 +1917,21 @@ static void kmemleak_scan(void)
>  		rcu_read_lock();
>  		for_each_process_thread(g, p) {
>  			void *stack = try_get_task_stack(p);
> +
>  			if (stack) {
>  				scan_block(stack, stack + THREAD_SIZE, NULL);
>  				put_task_stack(p);
>  			}
> +			/*
> +			 * This is an expensive loop, we must to call the
> +			 * scheduler to avoid lockups

s/must to call/must call/ ?

I saw Lance also provided a suggestion for making this comment better.  I think
that's also good and maybe even better than my suggestion. :)


Thanks,
SJ

[...]



* Re: [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks
  2026-06-12 15:16 [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks Breno Leitao
                   ` (2 preceding siblings ...)
  2026-06-13  0:53 ` SeongJae Park
@ 2026-06-13 10:45 ` Oleg Nesterov
  2026-06-13 11:42   ` Lance Yang
  3 siblings, 1 reply; 6+ messages in thread
From: Oleg Nesterov @ 2026-06-13 10:45 UTC
  To: Breno Leitao
  Cc: Catalin Marinas, Andrew Morton, lance.yang, Davidlohr Bueso,
	Qian Cai, linux-mm, linux-kernel, kernel-team, stable

To avoid confusion: I see nothing wrong with this patch, but see
the question at the end.

On 06/12, Breno Leitao wrote:
>
> +/*
> + * Briefly drop the RCU read lock to reschedule during the task stack scan.
> + * Both cursors are pinned across the gap; return false if either one was
> + * unhashed meanwhile, so the caller stops this round instead of walking a
> + * stale list.
> + */
> +static bool kmemleak_stack_scan_break(struct task_struct *g,
> +				      struct task_struct *p)
> +{
> +	bool can_cont;
> +
> +	get_task_struct(g);
> +	get_task_struct(p);
> +
> +	rcu_read_unlock();
> +	cond_resched();
> +	rcu_read_lock();
> +
> +	can_cont = pid_alive(g) && pid_alive(p);
> +
> +	put_task_struct(p);
> +	put_task_struct(g);
> +
> +	return can_cont;
> +}

Perhaps we can rename and export rcu_lock_break() to avoid the duplication...

And, this is slightly off-topic, please ignore, but this reminds me of
[PATCH 1/2] introduce for_each_process_thread_break() and for_each_process_thread_continue()
https://lore.kernel.org/all/20180912163335.GA18748@redhat.com/

> @@ -1890,11 +1917,21 @@ static void kmemleak_scan(void)
>  		rcu_read_lock();
>  		for_each_process_thread(g, p) {
>  			void *stack = try_get_task_stack(p);
> +
>  			if (stack) {
>  				scan_block(stack, stack + THREAD_SIZE, NULL);
>  				put_task_stack(p);
>  			}
> +			/*
> +			 * This is an expensive loop, we must to call the
> +			 * scheduler to avoid lockups
> +			 */
> +			if (need_resched() && !kmemleak_stack_scan_break(g, p)) {
> +				aborted = true;
> +				goto unlock;

Can this need_resched() check actually help if CONFIG_PREEMPTION &&
CONFIG_PREEMPT_RCU ?

In this case (let's ignore PREEMPT_DYNAMIC to simplify) rcu_read_lock()
doesn't disable preemption and cond_resched() is nop, need_resched() is
(almost) never true. Right?

I guess even in this case it makes sense to not abuse rcu_read_lock()
"too much", but perhaps we need something more clever than need_resched() ?

Note that check_hung_uninterruptible_tasks() uses time_after()...

Oleg.




* Re: [PATCH v2] mm/kmemleak: avoid soft lockup when scanning task stacks
  2026-06-13 10:45 ` Oleg Nesterov
@ 2026-06-13 11:42   ` Lance Yang
  0 siblings, 0 replies; 6+ messages in thread
From: Lance Yang @ 2026-06-13 11:42 UTC
  To: oleg
  Cc: leitao, catalin.marinas, akpm, lance.yang, dave, cai, linux-mm,
	linux-kernel, kernel-team, stable


On Sat, Jun 13, 2026 at 12:45:20PM +0200, Oleg Nesterov wrote:
>To avoid the confusion, I see nothing wrong in this patch, but see
>the question at the end.
>
>On 06/12, Breno Leitao wrote:
>>
>> +/*
>> + * Briefly drop the RCU read lock to reschedule during the task stack scan.
>> + * Both cursors are pinned across the gap; return false if either one was
>> + * unhashed meanwhile, so the caller stops this round instead of walking a
>> + * stale list.
>> + */
>> +static bool kmemleak_stack_scan_break(struct task_struct *g,
>> +				      struct task_struct *p)
>> +{
>> +	bool can_cont;
>> +
>> +	get_task_struct(g);
>> +	get_task_struct(p);
>> +
>> +	rcu_read_unlock();
>> +	cond_resched();
>> +	rcu_read_lock();
>> +
>> +	can_cont = pid_alive(g) && pid_alive(p);
>> +
>> +	put_task_struct(p);
>> +	put_task_struct(g);
>> +
>> +	return can_cont;
>> +}
>
>Perhaps we can rename and export rcu_lock_break() to avoid the duplication...
>
>And, this is slightly off-topic, please ignore, but this reminds me about
>[PATCH 1/2] introduce for_each_process_thread_break() and for_each_process_thread_continue()
>https://lore.kernel.org/all/20180912163335.GA18748@redhat.com/
>
>> @@ -1890,11 +1917,21 @@ static void kmemleak_scan(void)
>>  		rcu_read_lock();
>>  		for_each_process_thread(g, p) {
>>  			void *stack = try_get_task_stack(p);
>> +
>>  			if (stack) {
>>  				scan_block(stack, stack + THREAD_SIZE, NULL);
>>  				put_task_stack(p);
>>  			}
>> +			/*
>> +			 * This is an expensive loop, we must to call the
>> +			 * scheduler to avoid lockups
>> +			 */
>> +			if (need_resched() && !kmemleak_stack_scan_break(g, p)) {
>> +				aborted = true;
>> +				goto unlock;
>
>Can this need_resched() check actually help if CONFIG_PREEMPTION &&
>CONFIG_PREEMPT_RCU ?

Well spotted.

>In this case (let's ignore PREEMPT_DYNAMIC to simplify) rcu_read_lock()
>doesn't disable preemption and cond_resched() is nop, need_resched() is
>(almost) never true. Right?
>
>I guess even in this case it makes sense to not abuse rcu_read_lock()
>"too much", but perhaps we need something more clever than need_resched() ?
>
>Note that check_hung_uninterruptible_tasks() uses time_after()...

Ouch, right, I missed that ...

It would be better to trigger the break from time_after(), not
need_resched(). need_resched() may not buy much on PREEMPT_RCU ...

So yeah, a time-based check should address your concern, right?

Cheers, Lance
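
A time-based trigger of the kind discussed above can be sketched in userspace. This is a model under stated assumptions, not the patch: `tick_after()` mirrors the wraparound-safe comparison used by the kernel's time_after(), while `BREAK_INTERVAL` and `count_breaks()` are hypothetical names, and the "break" is only counted rather than actually dropping any lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Wraparound-safe "a is later than b" for a free-running tick counter.
 * Same idea as the kernel's time_after(); the unsigned-to-signed
 * conversion relies on the usual two's-complement behaviour.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical budget: ticks allowed between two lock breaks. */
#define BREAK_INTERVAL 4UL

/*
 * Model a scan of nr_items stacks, each costing `cost` ticks, starting
 * at tick `start`. Returns how many times the loop would break.
 */
static unsigned int count_breaks(unsigned long start, unsigned int nr_items,
				 unsigned long cost)
{
	unsigned long now = start;
	unsigned long last_break = start;
	unsigned int breaks = 0;

	for (unsigned int i = 0; i < nr_items; i++) {
		now += cost;	/* "scan one stack" */
		if (tick_after(now, last_break + BREAK_INTERVAL)) {
			breaks++;	/* here: drop the lock, reschedule, re-take */
			last_break = now;
		}
	}
	return breaks;
}
```

Unlike a need_resched() test, the trigger here depends only on elapsed time, so it fires at the same rate whether or not the scheduler has flagged the task, and it keeps working when the tick counter wraps.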


