linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched()
@ 2025-08-09 11:59 Subrata Nath (Nokia)
  2025-08-09 15:09 ` Matthew Wilcox
  2025-08-10  1:56 ` Hillf Danton
  0 siblings, 2 replies; 5+ messages in thread
From: Subrata Nath (Nokia) @ 2025-08-09 11:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org

From b21478b5333e2bf48391914d109bfd97a50d5203 Mon Sep 17 00:00:00 2001
From: Subrata Nath <subrata.nath@nokia.com>
Date: Sat, 9 Aug 2025 11:08:30 +0000
Subject: [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched()
Based on: v6.1.128

The kswapd0 thread can spend extended time in
page_vma_mapped_walk() -> queued_spin_lock_slowpath() without
yielding the CPU. Even with CONFIG_PREEMPTION=y, the rcu_preempt
kthread cannot preempt kswapd0 because preemption is disabled both
while spinning to acquire the spinlock and while holding it.
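
As an illustration of the mechanism, a minimal sketch (the function
name walk_one_chunk is a placeholder, not code from this patch):

	#include <linux/spinlock.h>

	static void walk_one_chunk(spinlock_t *ptl)
	{
		/*
		 * Under CONFIG_PREEMPTION, spin_lock() disables
		 * preemption before attempting the acquire, so a
		 * contended lock spins in queued_spin_lock_slowpath()
		 * with preemption already off; the task cannot be
		 * preempted until it calls spin_unlock().
		 */
		spin_lock(ptl);
		/* ... examine page table entries under the lock ... */
		spin_unlock(ptl);	/* preemption is re-enabled here */
	}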

Example stall report:
  rcu: INFO: rcu_preempt self-detected stall on CPU
  rcu: rcu_preempt kthread starved for 65939907 jiffies!
  Call trace:
    queued_spin_lock_slowpath
    page_vma_mapped_walk
    folio_referenced_one
    kswapd

Similar stalls occur in shrink_zones(), where long-running loops
prevent CPUs from reporting a quiescent state during the RCU grace
period. Without such reports, RCU emits stall warnings, and a
prolonged stall can escalate to soft lockups or OOM kills.

A quiescent state is reported when a CPU exits an RCU read-side
critical section, enters idle/user mode, performs a context switch,
or voluntarily reschedules.
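
In other words, releasing the lock and then calling cond_resched()
gives the scheduler such an opportunity; a minimal sketch (names and
loop structure are placeholders, not code from this patch):

	#include <linux/sched.h>
	#include <linux/spinlock.h>

	static void scan_sketch(spinlock_t *ptl, int nr_chunks)
	{
		int i;

		for (i = 0; i < nr_chunks; i++) {
			spin_lock(ptl);
			/* ... process one chunk under the lock ... */
			spin_unlock(ptl);
			/*
			 * Outside the critical section: a context
			 * switch here counts as a quiescent state, so
			 * the RCU grace period can advance.
			 */
			cond_resched();
		}
	}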

Fix this by adding cond_resched() after the spinlock release points
in page_vma_mapped_walk() and at the ends of the main loops in
shrink_zones() and do_try_to_free_pages(). These calls sit outside
any spinlock-held section, so they allow voluntary scheduling and
ensure timely quiescent-state reporting, avoiding prolonged RCU
stalls.

Signed-off-by: Subrata Nath <subrata.nath@nokia.com>
---
 mm/page_vma_mapped.c | 3 +++
 mm/vmscan.c          | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17..7775c151f 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -234,6 +234,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			}
 			/* THP pmd was split under us: handle on pte level */
 			spin_unlock(pvmw->ptl);
+			cond_resched();
 			pvmw->ptl = NULL;
 		} else if (!pmd_present(pmde)) {
 			/*
@@ -247,6 +248,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 				spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
 
 				spin_unlock(ptl);
+				cond_resched();
 			}
 			step_forward(pvmw, PMD_SIZE);
 			continue;
@@ -265,6 +267,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
 				if (pvmw->ptl) {
 					spin_unlock(pvmw->ptl);
+					cond_resched();
 					pvmw->ptl = NULL;
 				}
 				pte_unmap(pvmw->pte);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index be863204d..02064b4fe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6415,6 +6415,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
+		cond_resched();
 	}
 
 	if (first_pgdat)
@@ -6490,6 +6491,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (sc->priority < DEF_PRIORITY - 2)
 			sc->may_writepage = 1;
+		cond_resched();
 	} while (--sc->priority >= 0);
 
 	last_pgdat = NULL;
@@ -6508,6 +6510,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 						   zone->zone_pgdat);
 			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
 		}
+		cond_resched();
 	}
 
 	delayacct_freepages_end();
-- 
2.34.1



* Re: [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched()
  2025-08-09 11:59 [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched() Subrata Nath (Nokia)
@ 2025-08-09 15:09 ` Matthew Wilcox
  2025-08-09 17:38   ` Andrew Morton
  2025-08-10  1:56 ` Hillf Danton
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-08-09 15:09 UTC (permalink / raw)
  To: Subrata Nath (Nokia)
  Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Sat, Aug 09, 2025 at 11:59:16AM +0000, Subrata Nath (Nokia) wrote:
> Fix this by adding cond_resched() after the spinlock release points
> in page_vma_mapped_walk() and at the ends of the main loops in
> shrink_zones() and do_try_to_free_pages(). These calls sit outside
> any spinlock-held section, so they allow voluntary scheduling and
> ensure timely quiescent-state reporting, avoiding prolonged RCU
> stalls.

No.  We're removing cond_resched().  See
https://lore.kernel.org/linux-mm/87cyyfxd4k.ffs@tglx/
and many many other emails over the past few years.


* Re: [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched()
  2025-08-09 15:09 ` Matthew Wilcox
@ 2025-08-09 17:38   ` Andrew Morton
  2025-08-09 17:53     ` Matthew Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2025-08-09 17:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Subrata Nath (Nokia), linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Sat, 9 Aug 2025 16:09:12 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Aug 09, 2025 at 11:59:16AM +0000, Subrata Nath (Nokia) wrote:
> > Fix this by adding cond_resched() after the spinlock release points
> > in page_vma_mapped_walk() and at the ends of the main loops in
> > shrink_zones() and do_try_to_free_pages(). These calls sit outside
> > any spinlock-held section, so they allow voluntary scheduling and
> > ensure timely quiescent-state reporting, avoiding prolonged RCU
> > stalls.
> 
> No.  We're removing cond_resched().  See
> https://lore.kernel.org/linux-mm/87cyyfxd4k.ffs@tglx/
> and many many other emails over the past few years.

tglx's email was sent two years ago.

Meanwhile we have shipped kernels which are emitting nasty warning
splats (which are indications of possible other misbehavior).  So I
think we should proceed with Subrata's change and give it a cc:stable
also.

We already have 285 cond_resched()s in mm/.  If Thomas's idea ever gets
implemented then six more won't kill us.


* Re: [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched()
  2025-08-09 17:38   ` Andrew Morton
@ 2025-08-09 17:53     ` Matthew Wilcox
  0 siblings, 0 replies; 5+ messages in thread
From: Matthew Wilcox @ 2025-08-09 17:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Subrata Nath (Nokia), linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Sat, Aug 09, 2025 at 10:38:45AM -0700, Andrew Morton wrote:
> On Sat, 9 Aug 2025 16:09:12 +0100 Matthew Wilcox <willy@infradead.org> wrote:
> 
> > On Sat, Aug 09, 2025 at 11:59:16AM +0000, Subrata Nath (Nokia) wrote:
> > > Fix this by adding cond_resched() after the spinlock release points
> > > in page_vma_mapped_walk() and at the ends of the main loops in
> > > shrink_zones() and do_try_to_free_pages(). These calls sit outside
> > > any spinlock-held section, so they allow voluntary scheduling and
> > > ensure timely quiescent-state reporting, avoiding prolonged RCU
> > > stalls.
> > 
> > No.  We're removing cond_resched().  See
> > https://lore.kernel.org/linux-mm/87cyyfxd4k.ffs@tglx/
> > and many many other emails over the past few years.
> 
> tglx's email was sent two years ago.

... and there has been much progress since then.  Most recently,
https://lore.kernel.org/all/20250225035516.26443-1-boqun.feng@gmail.com/

This report is not from a recent kernel.  Subrata was good enough to
include:

Based on: v6.1.128

and I think it is very much on them to prove that this is still a
problem in 2025.



* Re: [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched()
  2025-08-09 11:59 [PATCH] mm: prevent RCU stalls in kswapd by adding cond_resched() Subrata Nath (Nokia)
  2025-08-09 15:09 ` Matthew Wilcox
@ 2025-08-10  1:56 ` Hillf Danton
  1 sibling, 0 replies; 5+ messages in thread
From: Hillf Danton @ 2025-08-10  1:56 UTC (permalink / raw)
  To: Subrata Nath (Nokia); +Cc: Andrew Morton, MM, LKML

On Sat, 9 Aug 2025 11:59:16 +0000 Subrata Nath (Nokia) wrote:
> The kswapd0 thread can spend extended time in
> page_vma_mapped_walk() -> queued_spin_lock_slowpath() without
> yielding the CPU. Even with CONFIG_PREEMPTION=y, the rcu_preempt
> kthread cannot preempt kswapd0 because preemption is disabled both
> while spinning to acquire the spinlock and while holding it.
> 
> Example stall report:
>   rcu: INFO: rcu_preempt self-detected stall on CPU
>   rcu: rcu_preempt kthread starved for 65939907 jiffies!
>   Call trace:
>     queued_spin_lock_slowpath
>     page_vma_mapped_walk
>     folio_referenced_one
>     kswapd
> 
> Similar stalls occur in shrink_zones(), where long-running loops
> prevent CPUs from reporting a quiescent state during the RCU grace
> period. Without such reports, RCU emits stall warnings, and a
> prolonged stall can escalate to soft lockups or OOM kills.
> 
> A quiescent state is reported when a CPU exits an RCU read-side
> critical section, enters idle/user mode, performs a context switch,
> or voluntarily reschedules.
> 
> Fix this by adding cond_resched() after the spinlock release points
> in page_vma_mapped_walk() and at the ends of the main loops in
> shrink_zones() and do_try_to_free_pages().

Given the spinlock in the calltrace, this fixes nothing at best.
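
That is, the stall is on the acquire side (an illustrative sketch;
only the call site is taken from the calltrace above):

	/*
	 * The CPU reporting the stall is spinning here, inside
	 * queued_spin_lock_slowpath(), with preemption already
	 * disabled:
	 */
	spin_lock(pvmw->ptl);
	/*
	 * It never reaches a cond_resched() placed after some other
	 * task's spin_unlock(), so the added calls cannot end a stall
	 * that happens while waiting for the lock.
	 */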

