linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
@ 2025-08-18 18:58 Joshua Hahn
  2025-08-19  0:13 ` Andrew Morton
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Joshua Hahn @ 2025-08-18 18:58 UTC (permalink / raw)
  To: Johannes Weiner, Chris Mason
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Zi Yan, linux-mm, linux-kernel, kernel-team

While testing workloads with high sustained memory pressure on large machines
(1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
Further investigation showed that the zone lock in free_pcppages_bulk was being
held for a long time, in some cases while 2k+ pages were freed under a single
acquisition.

Instead of holding the lock for the entirety of the freeing, check whether the
zone lock is contended every pcp->batch pages. If there is contention,
relinquish the lock so that other processors have a chance to grab the lock
and perform critical work.

In our fleet, we have seen that this batched freeing with periodic lock
relinquishment has led to significantly lower rates of softlockups, while
incurring only small regressions (relative to the workload and to its
run-to-run variation).

The following are a few synthetic benchmarks:

Test 1: Small machine (30G RAM, 36 CPUs)

stress-ng --vm 30 --vm-bytes 1G -M -t 100
+----------------------+---------------+-----------+
|        Metric        | Variation (%) | Delta (%) |
+----------------------+---------------+-----------+
| bogo ops             |        0.0076 |   -0.0183 |
| bogo ops/s (real)    |        0.0064 |   -0.0207 |
| bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
+----------------------+---------------+-----------+

stress-ng --vm 20 --vm-bytes 3G -M -t 100
+----------------------+---------------+-----------+
|        Metric        | Variation (%) | Delta (%) |
+----------------------+---------------+-----------+
| bogo ops             |        0.0295 |   -0.0078 |
| bogo ops/s (real)    |        0.0267 |   -0.0177 |
| bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
+----------------------+---------------+-----------+

Test 2: Big machine (250G RAM, 176 CPUs)

stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+-----------+
|        Metric        | Variation (%) | Delta (%) |
+----------------------+---------------+-----------+
| bogo ops             |        0.0362 |   -0.0187 |
| bogo ops/s (real)    |        0.0391 |   -0.0220 |
| bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
+----------------------+---------------+-----------+

stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+-----------+
|        Metric        | Variation (%) | Delta (%) |
+----------------------+---------------+-----------+
| bogo ops             |        2.3130 |   -0.0754 |
| bogo ops/s (real)    |        3.3069 |   -0.8579 |
| bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
+----------------------+---------------+-----------+

Suggested-by: Chris Mason <clm@fb.com>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

---
 mm/page_alloc.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8a84c3b5fe5..bd7a8da3e159 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	 * below while (list_empty(list)) loop.
 	 */
 	count = min(pcp->count, count);
+	if (!count)
+		return;
 
 	/* Ensure requested pindex is drained first. */
 	pindex = pindex - 1;
@@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	while (count > 0) {
 		struct list_head *list;
 		int nr_pages;
+		int batch = min(count, pcp->batch);
 
 		/* Remove pages from lists in a round-robin fashion. */
 		do {
@@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 			/* must delete to avoid corrupting pcp list */
 			list_del(&page->pcp_list);
+			batch -= nr_pages;
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
 			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
 			trace_mm_page_pcpu_drain(page, order, mt);
-		} while (count > 0 && !list_empty(list));
+		} while (batch > 0 && !list_empty(list));
+
+		/*
+		 * Prevent starving the lock for other users; every pcp->batch
+		 * pages freed, relinquish the zone lock if it is contended.
+		 */
+		if (count && spin_is_contended(&zone->lock)) {
+			spin_unlock_irqrestore(&zone->lock, flags);
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);

base-commit: 137a6423b60fe0785aada403679d3b086bb83062
-- 
2.47.3


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-18 18:58 [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing Joshua Hahn
@ 2025-08-19  0:13 ` Andrew Morton
  2025-08-19 15:18   ` Joshua Hahn
  2025-08-19  9:15 ` Kiryl Shutsemau
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2025-08-19  0:13 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Chris Mason, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed.
> 
> Instead of holding the lock for the entirety of the freeing, check to see if
> the zone lock is contended every pcp->batch pages. If there is contention,
> relinquish the lock so that other processors have a change to grab the lock
> and perform critical work.
> 
> In our fleet,

who is "our"?

> we have seen that performing batched lock freeing has led to
> significantly lower rates of softlockups, while incurring relatively small
> regressions (relative to the workload and relative to the variation).
> 
> The following are a few synthetic benchmarks:
> 
> Test 1: Small machine (30G RAM, 36 CPUs)
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
>
> ...
>
> @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  
>  			/* must delete to avoid corrupting pcp list */
>  			list_del(&page->pcp_list);
> +			batch -= nr_pages;
>  			count -= nr_pages;
>  			pcp->count -= nr_pages;
>  
>  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
>  			trace_mm_page_pcpu_drain(page, order, mt);
> -		} while (count > 0 && !list_empty(list));
> +		} while (batch > 0 && !list_empty(list));
> +
> +		/*
> +		 * Prevent starving the lock for other users; every pcp->batch
> +		 * pages freed, relinquish the zone lock if it is contended.
> +		 */
> +		if (count && spin_is_contended(&zone->lock)) {
> +			spin_unlock_irqrestore(&zone->lock, flags);
> +			spin_lock_irqsave(&zone->lock, flags);
> +		}
>  	}

Pretty this isn't.

Sigh, we do so much stuff here and in __free_one_page().

What sort of guarantee do we have that the contending task will be able
to get in and grab the spinlock in that tiny time window?


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-18 18:58 [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing Joshua Hahn
  2025-08-19  0:13 ` Andrew Morton
@ 2025-08-19  9:15 ` Kiryl Shutsemau
  2025-08-19 15:28   ` Joshua Hahn
  2025-08-19 17:15   ` Shakeel Butt
  2025-08-19 15:34 ` Joshua Hahn
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-08-19  9:15 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Chris Mason, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
	linux-mm, linux-kernel, kernel-team

On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed.
> 
> Instead of holding the lock for the entirety of the freeing, check to see if
> the zone lock is contended every pcp->batch pages. If there is contention,
> relinquish the lock so that other processors have a change to grab the lock
> and perform critical work.

Hm. It doesn't necessarily have to be contention on the lock; it may just be
that you are holding the lock for too long, so the CPU is not available to the
scheduler.

> In our fleet, we have seen that performing batched lock freeing has led to
> significantly lower rates of softlockups, while incurring relatively small
> regressions (relative to the workload and relative to the variation).
> 
> The following are a few synthetic benchmarks:
> 
> Test 1: Small machine (30G RAM, 36 CPUs)
> 
> stress-ng --vm 30 --vm-bytes 1G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        0.0076 |   -0.0183 |
> | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> +----------------------+---------------+-----------+
> 
> stress-ng --vm 20 --vm-bytes 3G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        0.0295 |   -0.0078 |
> | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> +----------------------+---------------+-----------+
> 
> Test 2: Big machine (250G RAM, 176 CPUs)
> 
> stress-ng --vm 50 --vm-bytes 5G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        0.0362 |   -0.0187 |
> | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> +----------------------+---------------+-----------+
> 
> stress-ng --vm 10 --vm-bytes 30G -M -t 100
> +----------------------+---------------+-----------+
> |        Metric        | Variation (%) | Delta (%) |
> +----------------------+---------------+-----------+
> | bogo ops             |        2.3130 |   -0.0754 |
> | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> +----------------------+---------------+-----------+
> 
> Suggested-by: Chris Mason <clm@fb.com>
> Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> 
> ---
>  mm/page_alloc.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a8a84c3b5fe5..bd7a8da3e159 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	 * below while (list_empty(list)) loop.
>  	 */
>  	count = min(pcp->count, count);
> +	if (!count)
> +		return;
>  
>  	/* Ensure requested pindex is drained first. */
>  	pindex = pindex - 1;
> @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	while (count > 0) {
>  		struct list_head *list;
>  		int nr_pages;
> +		int batch = min(count, pcp->batch);
>  
>  		/* Remove pages from lists in a round-robin fashion. */
>  		do {
> @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  
>  			/* must delete to avoid corrupting pcp list */
>  			list_del(&page->pcp_list);
> +			batch -= nr_pages;
>  			count -= nr_pages;
>  			pcp->count -= nr_pages;
>  
>  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
>  			trace_mm_page_pcpu_drain(page, order, mt);
> -		} while (count > 0 && !list_empty(list));
> +		} while (batch > 0 && !list_empty(list));
> +
> +		/*
> +		 * Prevent starving the lock for other users; every pcp->batch
> +		 * pages freed, relinquish the zone lock if it is contended.
> +		 */
> +		if (count && spin_is_contended(&zone->lock)) {

I would rather drop the count thing and do something like this:

		if (need_resched() || spin_needbreak(&zone->lock)) {
			spin_unlock_irqrestore(&zone->lock, flags);
			cond_resched();
			spin_lock_irqsave(&zone->lock, flags);
		}

> +			spin_unlock_irqrestore(&zone->lock, flags);
> +			spin_lock_irqsave(&zone->lock, flags);
> +		}
>  	}
>  
>  	spin_unlock_irqrestore(&zone->lock, flags);
> 
> base-commit: 137a6423b60fe0785aada403679d3b086bb83062
> -- 
> 2.47.3

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-19  0:13 ` Andrew Morton
@ 2025-08-19 15:18   ` Joshua Hahn
  2025-08-19 21:44     ` Andrew Morton
  0 siblings, 1 reply; 15+ messages in thread
From: Joshua Hahn @ 2025-08-19 15:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Chris Mason, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Mon, 18 Aug 2025 17:13:40 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

Hello Andrew,

Thank you for your time & feedback, as always!

> On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> > 
> > Instead of holding the lock for the entirety of the freeing, check to see if
> > the zone lock is contended every pcp->batch pages. If there is contention,
> > relinquish the lock so that other processors have a change to grab the lock
> > and perform critical work.
> > 
> > In our fleet,
> 
> who is "our"?

Sorry for the ambiguity -- I work for Meta, so I was referring to their
machines. I'll make this clearer in the next version.

> > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  
> >  			/* must delete to avoid corrupting pcp list */
> >  			list_del(&page->pcp_list);
> > +			batch -= nr_pages;
> >  			count -= nr_pages;
> >  			pcp->count -= nr_pages;
> >  
> >  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> >  			trace_mm_page_pcpu_drain(page, order, mt);
> > -		} while (count > 0 && !list_empty(list));
> > +		} while (batch > 0 && !list_empty(list));
> > +
> > +		/*
> > +		 * Prevent starving the lock for other users; every pcp->batch
> > +		 * pages freed, relinquish the zone lock if it is contended.
> > +		 */
> > +		if (count && spin_is_contended(&zone->lock)) {
> > +			spin_unlock_irqrestore(&zone->lock, flags);
> > +			spin_lock_irqsave(&zone->lock, flags);
> > +		}
> >  	}
> 
> Pretty this isn't.
> 
> Sigh, we do so much stuff here and in __free_one_page().
> 
> What sort of guarantee do we have that the contending task will be able
> to get in and grab the spinlock in that tiny time window?

Thank you for pointing this out -- I don't think there is any guarantee.
Kiryl suggested that I put a cond_resched() here, in order to guarantee that
the contending tasks will be able to grab the spinlock. I think that's a great
idea -- I'll make this change in v2.

Thank you for your feedback, have a great day!
Joshua


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-19  9:15 ` Kiryl Shutsemau
@ 2025-08-19 15:28   ` Joshua Hahn
  2025-08-19 17:15   ` Shakeel Butt
  1 sibling, 0 replies; 15+ messages in thread
From: Joshua Hahn @ 2025-08-19 15:28 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Johannes Weiner, Chris Mason, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
	linux-mm, linux-kernel, kernel-team

On Tue, 19 Aug 2025 10:15:13 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:

Hello Kiryl,

Thank you for your review!

> On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> > 
> > Instead of holding the lock for the entirety of the freeing, check to see if
> > the zone lock is contended every pcp->batch pages. If there is contention,
> > relinquish the lock so that other processors have a change to grab the lock
> > and perform critical work.
> 
> Hm. It doesn't necessary to be contention on the lock, but just that you
> holding the lock for too long so the CPU is not available for the scheduler.

I see, I think that also makes sense. So you are suggesting that the issue is
not lock contention per se, but rather that the CPU is held for too long and
cannot be used to perform other critical work?

I can definitely test this idea, and let you know what I see. With my very
limited understanding, I wonder though if 1 busy CPU will make a big difference
on a machine with 316 CPUs. That is, my instinct tells me that the zone lock is
a more hotly contended resource than the CPU is -- but I will try to run some
tests to figure out which of these is more heavily contended.

[...snip...]

> > ---
> >  mm/page_alloc.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a8a84c3b5fe5..bd7a8da3e159 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	 * below while (list_empty(list)) loop.
> >  	 */
> >  	count = min(pcp->count, count);
> > +	if (!count)
> > +		return;
> >  
> >  	/* Ensure requested pindex is drained first. */
> >  	pindex = pindex - 1;
> > @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	while (count > 0) {
> >  		struct list_head *list;
> >  		int nr_pages;
> > +		int batch = min(count, pcp->batch);
> >  
> >  		/* Remove pages from lists in a round-robin fashion. */
> >  		do {
> > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  
> >  			/* must delete to avoid corrupting pcp list */
> >  			list_del(&page->pcp_list);
> > +			batch -= nr_pages;
> >  			count -= nr_pages;
> >  			pcp->count -= nr_pages;
> >  
> >  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> >  			trace_mm_page_pcpu_drain(page, order, mt);
> > -		} while (count > 0 && !list_empty(list));
> > +		} while (batch > 0 && !list_empty(list));
> > +
> > +		/*
> > +		 * Prevent starving the lock for other users; every pcp->batch
> > +		 * pages freed, relinquish the zone lock if it is contended.
> > +		 */
> > +		if (count && spin_is_contended(&zone->lock)) {
> 
> I would rather drop the count thing and do something like this:
> 
> 		if (need_resched() || spin_needbreak(&zone->lock) {
> 			spin_unlock_irqrestore(&zone->lock, flags);
> 			cond_resched();
> 			spin_lock_irqsave(&zone->lock, flags);
> 		}

Thank you for this idea, Kiryl. I think adding the cond_resched() is absolutely
necessary here, as Andrew has also kindly pointed out in his response. I also
like the idea of adding the need_resched() and spin_needbreak() checks here.

I still think having the if (count) check is important here: if there are no
more pages remaining to be freed, there is no reason to stall the exit of this
function -- we can simply spin_unlock_irqrestore() and return. So maybe
something like this?

		if (count && (need_resched() || spin_needbreak(&zone->lock)))
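
Expanded into the full block, here is a sketch of what I have in mind
(untested):

		/*
		 * Sketch only: whether a cond_resched() fits between the
		 * unlock and the re-lock is a separate question, since
		 * callers hold pcp->lock.
		 */
		if (count && (need_resched() || spin_needbreak(&zone->lock))) {
			spin_unlock_irqrestore(&zone->lock, flags);
			spin_lock_irqsave(&zone->lock, flags);
		}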

Thank you again for your review, Kiryl! I hope you have a great day : -)
Joshua


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-18 18:58 [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing Joshua Hahn
  2025-08-19  0:13 ` Andrew Morton
  2025-08-19  9:15 ` Kiryl Shutsemau
@ 2025-08-19 15:34 ` Joshua Hahn
  2025-08-20  1:29 ` Hillf Danton
  2025-08-20  5:41 ` Andrew Morton
  4 siblings, 0 replies; 15+ messages in thread
From: Joshua Hahn @ 2025-08-19 15:34 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Chris Mason, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman, Zi Yan,
	linux-mm, linux-kernel, kernel-team

On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

Cc-ing Suren, who works at Google, not Gogle. Sorry for the typo!
Joshua

> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed.
> 
> Instead of holding the lock for the entirety of the freeing, check to see if
> the zone lock is contended every pcp->batch pages. If there is contention,
> relinquish the lock so that other processors have a change to grab the lock
> and perform critical work.
> 
> In our fleet, we have seen that performing batched lock freeing has led to
> significantly lower rates of softlockups, while incurring relatively small
> regressions (relative to the workload and relative to the variation).


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-19  9:15 ` Kiryl Shutsemau
  2025-08-19 15:28   ` Joshua Hahn
@ 2025-08-19 17:15   ` Shakeel Butt
  2025-08-20 12:58     ` Kiryl Shutsemau
  1 sibling, 1 reply; 15+ messages in thread
From: Shakeel Butt @ 2025-08-19 17:15 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Joshua Hahn, Johannes Weiner, Chris Mason, Andrew Morton,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Zi Yan, linux-mm, linux-kernel, kernel-team

On Tue, Aug 19, 2025 at 10:15:13AM +0100, Kiryl Shutsemau wrote:
> On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> > 
> > Instead of holding the lock for the entirety of the freeing, check to see if
> > the zone lock is contended every pcp->batch pages. If there is contention,
> > relinquish the lock so that other processors have a change to grab the lock
> > and perform critical work.
> 
> Hm. It doesn't necessary to be contention on the lock, but just that you
> holding the lock for too long so the CPU is not available for the scheduler.
> 
> > In our fleet, we have seen that performing batched lock freeing has led to
> > significantly lower rates of softlockups, while incurring relatively small
> > regressions (relative to the workload and relative to the variation).
> > 
> > The following are a few synthetic benchmarks:
> > 
> > Test 1: Small machine (30G RAM, 36 CPUs)
> > 
> > stress-ng --vm 30 --vm-bytes 1G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0076 |   -0.0183 |
> > | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> > | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> > +----------------------+---------------+-----------+
> > 
> > stress-ng --vm 20 --vm-bytes 3G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0295 |   -0.0078 |
> > | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> > | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> > +----------------------+---------------+-----------+
> > 
> > Test 2: Big machine (250G RAM, 176 CPUs)
> > 
> > stress-ng --vm 50 --vm-bytes 5G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0362 |   -0.0187 |
> > | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> > | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> > +----------------------+---------------+-----------+
> > 
> > stress-ng --vm 10 --vm-bytes 30G -M -t 100
> > +----------------------+---------------+-----------+
> > |        Metric        | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        2.3130 |   -0.0754 |
> > | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> > | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> > +----------------------+---------------+-----------+
> > 
> > Suggested-by: Chris Mason <clm@fb.com>
> > Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> > 
> > ---
> >  mm/page_alloc.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a8a84c3b5fe5..bd7a8da3e159 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	 * below while (list_empty(list)) loop.
> >  	 */
> >  	count = min(pcp->count, count);
> > +	if (!count)
> > +		return;
> >  
> >  	/* Ensure requested pindex is drained first. */
> >  	pindex = pindex - 1;
> > @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	while (count > 0) {
> >  		struct list_head *list;
> >  		int nr_pages;
> > +		int batch = min(count, pcp->batch);
> >  
> >  		/* Remove pages from lists in a round-robin fashion. */
> >  		do {
> > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  
> >  			/* must delete to avoid corrupting pcp list */
> >  			list_del(&page->pcp_list);
> > +			batch -= nr_pages;
> >  			count -= nr_pages;
> >  			pcp->count -= nr_pages;
> >  
> >  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> >  			trace_mm_page_pcpu_drain(page, order, mt);
> > -		} while (count > 0 && !list_empty(list));
> > +		} while (batch > 0 && !list_empty(list));
> > +
> > +		/*
> > +		 * Prevent starving the lock for other users; every pcp->batch
> > +		 * pages freed, relinquish the zone lock if it is contended.
> > +		 */
> > +		if (count && spin_is_contended(&zone->lock)) {
> 
> I would rather drop the count thing and do something like this:
> 
> 		if (need_resched() || spin_needbreak(&zone->lock) {
> 			spin_unlock_irqrestore(&zone->lock, flags);
> 			cond_resched();

Can this function be called from non-sleepable context?

> 			spin_lock_irqsave(&zone->lock, flags);
> 		}
> 
> > +			spin_unlock_irqrestore(&zone->lock, flags);
> > +			spin_lock_irqsave(&zone->lock, flags);
> > +		}
> >  	}
> >  
> >  	spin_unlock_irqrestore(&zone->lock, flags);
> > 
> > base-commit: 137a6423b60fe0785aada403679d3b086bb83062
> > -- 
> > 2.47.3
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-19 15:18   ` Joshua Hahn
@ 2025-08-19 21:44     ` Andrew Morton
  2025-08-20 13:20       ` Joshua Hahn
  0 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2025-08-19 21:44 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Chris Mason, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Tue, 19 Aug 2025 08:18:45 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> > Pretty this isn't.
> > 
> > Sigh, we do so much stuff here and in __free_one_page().
> > 
> > What sort of guarantee do we have that the contending task will be able
> > to get in and grab the spinlock in that tiny time window?
> 
> Thank you for pointing this out -- I don't think there is any guarantee.
> Kiryl suggested that I put a cond_resched() here, in order to guarantee that
> the contending tasks will be able to grab the spinlock. I think that's a great
> idea -- I'll make this change in v2.

cond_resched() might help because it takes more CPU cycles and expands
the window. A udelay() would of course do this more nicely.
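
Something like this, say (sketch only; assumes linux/delay.h and an arbitrary
delay value):

	if (count && spin_is_contended(&zone->lock)) {
		spin_unlock_irqrestore(&zone->lock, flags);
		udelay(1);	/* widen the window for the queued waiter */
		spin_lock_irqsave(&zone->lock, flags);
	}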

But the contending task is already in state TASK_RUNNING so a
cond_resched() won't have any effect on it?

Also, callers hold pcp->lock, so cond_resched() cannot be called.

Sigh, I dunno, it's all very nasty.  I have vague memories of there
being a way of relinquishing a lock to some other task which is
spinning on that lock.  Or at least, a proposal.  Or I dreamed it. 
peterz would be a good person to ask.


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-18 18:58 [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing Joshua Hahn
                   ` (2 preceding siblings ...)
  2025-08-19 15:34 ` Joshua Hahn
@ 2025-08-20  1:29 ` Hillf Danton
  2025-08-20 15:13   ` Joshua Hahn
  2025-08-20  5:41 ` Andrew Morton
  4 siblings, 1 reply; 15+ messages in thread
From: Hillf Danton @ 2025-08-20  1:29 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Michal Hocko,
	linux-mm, linux-kernel

On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn wrote:
> 
> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed.
> 
> Instead of holding the lock for the entirety of the freeing, check to see if
> the zone lock is contended every pcp->batch pages. If there is contention,
> relinquish the lock so that other processors have a change to grab the lock
> and perform critical work.
> 
Instead of the unlock/lock game, simply return in case of lock contention,
with the rest left to a workqueue. But a workqueue is still unable to kill the
soft lockup if the number of contending CPUs is large enough.
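
Roughly (just a sketch; the per-zone work item and its handler are
hypothetical, nothing like them exists today):

	if (spin_is_contended(&zone->lock)) {
		spin_unlock_irqrestore(&zone->lock, flags);
		/*
		 * Hypothetical: zone->deferred_free_work does not exist
		 * today; the remaining count/pindex would have to be handed
		 * to the work item somehow.
		 */
		schedule_work(&zone->deferred_free_work);
		return;
	}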


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-18 18:58 [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing Joshua Hahn
                   ` (3 preceding siblings ...)
  2025-08-20  1:29 ` Hillf Danton
@ 2025-08-20  5:41 ` Andrew Morton
  2025-08-20 15:48   ` Joshua Hahn
  4 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2025-08-20  5:41 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Johannes Weiner, Chris Mason, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> While testing workloads with high sustained memory pressure on large machines
> (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> Further investigation showed that the lock in free_pcppages_bulk was being held
> for a long time, even being held while 2k+ pages were being freed.

It would be interesting to share some of those softlockup traces.

We have this CONFIG_PCP_BATCH_SCALE_MAX which appears to exist to
address precisely this issue.  But only about half of the
free_pcppages_bulk() callers actually honor it.

So perhaps the fix is to fix the callers which forgot to implement this?

- decay_pcp_high() tried to implement CONFIG_PCP_BATCH_SCALE_MAX, but
  that code hurts my brain.

- drain_pages_zone() implements it but, regrettably, doesn't use it
  to periodically release pcp->lock.  Room for improvement there.




* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-19 17:15   ` Shakeel Butt
@ 2025-08-20 12:58     ` Kiryl Shutsemau
  0 siblings, 0 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-08-20 12:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Joshua Hahn, Johannes Weiner, Chris Mason, Andrew Morton,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Zi Yan, linux-mm, linux-kernel, kernel-team

On Tue, Aug 19, 2025 at 10:15:39AM -0700, Shakeel Butt wrote:
> On Tue, Aug 19, 2025 at 10:15:13AM +0100, Kiryl Shutsemau wrote:
> > On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> > > While testing workloads with high sustained memory pressure on large machines
> > > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > > Further investigation showed that the lock in free_pcppages_bulk was being held
> > > for a long time, even being held while 2k+ pages were being freed.
> > > 
> > > Instead of holding the lock for the entirety of the freeing, check to see if
> > > the zone lock is contended every pcp->batch pages. If there is contention,
> > > relinquish the lock so that other processors have a change to grab the lock
> > > and perform critical work.
> > 
> > Hm. It doesn't necessary to be contention on the lock, but just that you
> > holding the lock for too long so the CPU is not available for the scheduler.
> > 
> > > In our fleet, we have seen that performing batched lock freeing has led to
> > > significantly lower rates of softlockups, while incurring relatively small
> > > regressions (relative to the workload and relative to the variation).
> > > 
> > > The following are a few synthetic benchmarks:
> > > 
> > > Test 1: Small machine (30G RAM, 36 CPUs)
> > > 
> > > stress-ng --vm 30 --vm-bytes 1G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        0.0076 |   -0.0183 |
> > > | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> > > | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> > > +----------------------+---------------+-----------+
> > > 
> > > stress-ng --vm 20 --vm-bytes 3G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        0.0295 |   -0.0078 |
> > > | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> > > | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> > > +----------------------+---------------+-----------+
> > > 
> > > Test 2: Big machine (250G RAM, 176 CPUs)
> > > 
> > > stress-ng --vm 50 --vm-bytes 5G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        0.0362 |   -0.0187 |
> > > | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> > > | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> > > +----------------------+---------------+-----------+
> > > 
> > > stress-ng --vm 10 --vm-bytes 30G -M -t 100
> > > +----------------------+---------------+-----------+
> > > |        Metric        | Variation (%) | Delta (%) |
> > > +----------------------+---------------+-----------+
> > > | bogo ops             |        2.3130 |   -0.0754 |
> > > | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> > > | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> > > +----------------------+---------------+-----------+
> > > 
> > > Suggested-by: Chris Mason <clm@fb.com>
> > > Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> > > 
> > > ---
> > >  mm/page_alloc.c | 15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index a8a84c3b5fe5..bd7a8da3e159 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  	 * below while (list_empty(list)) loop.
> > >  	 */
> > >  	count = min(pcp->count, count);
> > > +	if (!count)
> > > +		return;
> > >  
> > >  	/* Ensure requested pindex is drained first. */
> > >  	pindex = pindex - 1;
> > > @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  	while (count > 0) {
> > >  		struct list_head *list;
> > >  		int nr_pages;
> > > +		int batch = min(count, pcp->batch);
> > >  
> > >  		/* Remove pages from lists in a round-robin fashion. */
> > >  		do {
> > > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  
> > >  			/* must delete to avoid corrupting pcp list */
> > >  			list_del(&page->pcp_list);
> > > +			batch -= nr_pages;
> > >  			count -= nr_pages;
> > >  			pcp->count -= nr_pages;
> > >  
> > >  			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> > >  			trace_mm_page_pcpu_drain(page, order, mt);
> > > -		} while (count > 0 && !list_empty(list));
> > > +		} while (batch > 0 && !list_empty(list));
> > > +
> > > +		/*
> > > +		 * Prevent starving the lock for other users; every pcp->batch
> > > +		 * pages freed, relinquish the zone lock if it is contended.
> > > +		 */
> > > +		if (count && spin_is_contended(&zone->lock)) {
> > 
> > I would rather drop the count thing and do something like this:
> > 
> > 		if (need_resched() || spin_needbreak(&zone->lock) {
> > 			spin_unlock_irqrestore(&zone->lock, flags);
> > 			cond_resched();
> 
> Can this function be called from non-sleepable context?

No, it cannot.

And looking at the locking context -- caller holds pcp->lock -- looks
like my proposal with need_resched()/cond_resched() doesn't work.

We need to either push for a wider rework and make cond_resched() happen
higher up the stack, or ignore it and have cpu_relax() called on the lock
drop.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-19 21:44     ` Andrew Morton
@ 2025-08-20 13:20       ` Joshua Hahn
  0 siblings, 0 replies; 15+ messages in thread
From: Joshua Hahn @ 2025-08-20 13:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Chris Mason, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-mm, linux-kernel,
	kernel-team, Peter Zijlstra

On Tue, 19 Aug 2025 14:44:21 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 19 Aug 2025 08:18:45 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > > Pretty this isn't.
> > > 
> > > Sigh, we do so much stuff here and in __free_one_page().
> > > 
> > > What sort of guarantee do we have that the contending task will be able
> > > to get in and grab the spinlock in that tiny time window?
> > 
> > Thank you for pointing this out -- I don't think there is any guarantee.
> > Kiryl suggested that I put a cond_resched() here, in order to guarantee that
> > the contending tasks will be able to grab the spinlock. I think that's a great
> > idea -- I'll make this change in v2.

Hello Andrew, thank you for your review!

> cond_resched() might help because it takes more CPU cycles and expands
> the window. A udelay() would of course do this more nicely.

I was wondering if we could rely on the spinlock implementation here to
provide some fairness in who grabs the lock. From what I have gathered, on
a lot of architectures the default spinlock implementation is a queued
spinlock (on x86 and arm64, among others, judging by a quick grep for
"select ARCH_USE_QUEUED_SPINLOCKS"). This means that whoever has been waiting
the longest for the spinlock will be the one to grab it, guaranteeing that
this function doesn't immediately re-acquire the lock ahead of the waiters.

With that said, I understand that the solution should be generic and work for
all architectures. I wonder if it would make sense to change the zone lock
into an explicit queued spin lock?

> But the contending task is already in state TASK_RUNNING so a
> cond_resched() won't have any effect on it?
> 
> Also, callers hold pcp->lock, so cond_resched() cannot be called.

Ah yes, that makes sense.

> Sigh, I dunno, it's all very nasty.  I have vague memories of there
> being a way of relinquishing a lock to some other task which is
> spinning on that lock.  Or at least, a proposal.  Or I dreamed it. 
> peterz would be a good person to ask.

Cc-ing Peter, please let us know if you have any thoughts about all of this!

Thank you Andrew, I hope you have a great day!
Joshua


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-20  1:29 ` Hillf Danton
@ 2025-08-20 15:13   ` Joshua Hahn
  2025-08-21  1:03     ` Hillf Danton
  0 siblings, 1 reply; 15+ messages in thread
From: Joshua Hahn @ 2025-08-20 15:13 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Michal Hocko,
	linux-mm, linux-kernel

On Wed, 20 Aug 2025 09:29:00 +0800 Hillf Danton <hdanton@sina.com> wrote:

Hello Hillf, thank you for your review!

> On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn wrote:
> > 
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> > 
> > Instead of holding the lock for the entirety of the freeing, check to see if
> > the zone lock is contended every pcp->batch pages. If there is contention,
> > relinquish the lock so that other processors have a change to grab the lock
> > and perform critical work.
> > 
> Instead of the unlock/lock game, simply return with the rest left to workqueue
> in case of lock contension. But workqueue is still unable to kill soft lockup
> if the number of contending CPUs is large enough.

Thank you for the idea. One concern I have is that sometimes we do expect
free_pcppages_bulk to actually free all of the pages that it has promised to
free. One example is when it is called from drain_zone_pages. Of course, we
could have a while loop that calls free_pcppages_bulk until it returns 0, but
I think that would just reduce to unlocking / locking over and over again.

As for the number of contending CPUs -- I'm not really sure what the number
looks like. In my testing, I have just done some spot checks to see that the
zone lock is indeed contended, but I'm not entirely sure how hotly it is
contended. I can run some tests before sending out the next version to see if
it is higher / lower than expected.

Thank you, I hope you have a great day!
Joshua


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-20  5:41 ` Andrew Morton
@ 2025-08-20 15:48   ` Joshua Hahn
  0 siblings, 0 replies; 15+ messages in thread
From: Joshua Hahn @ 2025-08-20 15:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Chris Mason, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Zi Yan, linux-mm, linux-kernel,
	kernel-team

On Tue, 19 Aug 2025 22:41:11 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

Hello Andrew, thank you again for your input : -)

> On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> 
> It would be interesting to share some of those softlockup traces.

Unfortunately it has been a long time since these softlockups were detected
in our fleet, so the records of them have disappeared : -(

What I do have is an example trace of a rcu stall warning:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 	20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: 	         hardirqs   softirqs   csw/system
[ 4512.638793] rcu: 	 number:        0        145            0
[ 4512.651177] rcu: 	cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu: 	(t=21077 jiffies g=783665 q=1242213 ncpus=316)

And here is the trace that accompanies it:

[ 4512.666815] RIP: 0010:free_unref_folios+0x47d/0xd80
[ 4512.666818] Code: 00 00 31 ff 40 80 ce 01 41 88 76 18 e9 a8 fe ff ff 40 84 ff 0f 84 d6 00 00 00 39 f0 0f 4c f0 4c 89 ff 4c 89 f2 e8 13 f2 fe ff <49> f7 87 88 05 00 00 04 00 00 00 0f 84 00 ff ff ff 49 8b 47 20 49
[ 4512.666820] RSP: 0018:ffffc900a62f3878 EFLAGS: 00000206
[ 4512.666822] RAX: 000000000005ae80 RBX: 000000000000087a RCX: 0000000000000001
[ 4512.666824] RDX: 000000000000007d RSI: 0000000000000282 RDI: ffff89404c8ba310
[ 4512.666825] RBP: 0000000000000001 R08: ffff89404c8b9d80 R09: 0000000000000001
[ 4512.666826] R10: 0000000000000010 R11: 00000000000130de R12: ffff89404c8b9d80
[ 4512.666827] R13: ffffea01cf3c0000 R14: ffff893d3ac5aec0 R15: ffff89404c8b9d80
[ 4512.666833]  ? free_unref_folios+0x47d/0xd80
[ 4512.666836]  free_pages_and_swap_cache+0xcd/0x1a0
[ 4512.666847]  tlb_finish_mmu+0x11c/0x350
[ 4512.666850]  vms_clear_ptes+0xf9/0x120
[ 4512.666855]  __mmap_region+0x29a/0xc00
[ 4512.666867]  do_mmap+0x34e/0x910
[ 4512.666873]  vm_mmap_pgoff+0xbb/0x200
[ 4512.666877]  ? hrtimer_interrupt+0x337/0x5c0
[ 4512.666879]  ? sched_clock+0x5/0x10
[ 4512.666882]  ? sched_clock_cpu+0xc/0x170
[ 4512.666885]  ? irqtime_account_irq+0x2b/0xa0
[ 4512.666888]  do_syscall_64+0x68/0x130
[ 4512.666892]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 4512.666896] RIP: 0033:0x7f1afe9257e2

> We have this CONFIG_PCP_BATCH_SCALE_MAX which appears to exist to
> address precisely this issue.  But only about half of the
> free_pcppages_bulk() callers actually honor it.

I see. I think this makes sense, and I also agree that there should probably be
some guardrails on the callers of this function, especially since
free_pcppages_bulk is unaware of how the pcp lock is acquired / released.

Functions like drain_zone_pages, which explicitly enforce this by setting
"to_drain" to min(pcp->batch, pcp->count), seem like a smart way to do this.

> So perhaps the fix is to fix the callers which forgot to implement this?
> 
> - decay_pcp_high() tried to implement CONFIG_PCP_BATCH_SCALE_MAX, but
>   that code hurts my brain.

To be honest, I don't fully understand decay_pcp_high() either : -)
From what I can tell, it seems like CONFIG_PCP_BATCH_SCALE_MAX doesn't directly
provide a value that limits how many pages are freed at once by the bulk
freer, but rather tunes the parameter pcp->high (except for drain_pages_zone,
which you have pointed out below).

> - drain_pages_zone() implements it but, regrettably, doesn't use it
>   to periodically release pcp->lock.  Room for improvement there.

From what I can see, it seems like drain_pages_zone() does release pcp->lock
every pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX pages (simplified pseudocode
below):

do {
	spin_lock(&pcp->lock);
	to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
	free_pcppages_bulk(zone, to_drain, ...);
	count -= to_drain;
	spin_unlock(&pcp->lock);
} while (count);

Although, the concern you raised earlier -- whether another thread can
realistically grab pcp->lock in the short window between the unlock and the
re-lock -- is still valid here (and that concern is likewise relieved if the
machine is x86, arm64, or any other arch that defaults to queued spinlocks).

With all of this said, I think adding the periodic unlocking / locking of the
zone lock within free_pcppages_bulk still makes sense; if the caller enforces
count to be <= pcp->batch, then the check is essentially a no-op; otherwise, it
adds some locking safety, which would protect us in case any new callers in
the future forget to add the check as well.

Thank you for your thoughtful review, Andrew. I hope you have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)


* Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing
  2025-08-20 15:13   ` Joshua Hahn
@ 2025-08-21  1:03     ` Hillf Danton
  0 siblings, 0 replies; 15+ messages in thread
From: Hillf Danton @ 2025-08-21  1:03 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Michal Hocko,
	linux-mm, linux-kernel

On Wed, 20 Aug 2025 08:13:07 -0700 Joshua Hahn wrote:
> On Wed, 20 Aug 2025 09:29:00 +0800 Hillf Danton <hdanton@sina.com> wrote:
> > On Mon, 18 Aug 2025 11:58:03 -0700 Joshua Hahn wrote:
> > > 
> > > While testing workloads with high sustained memory pressure on large machines
> > > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > > Further investigation showed that the lock in free_pcppages_bulk was being held
> > > for a long time, even being held while 2k+ pages were being freed.
> > > 
> > > Instead of holding the lock for the entirety of the freeing, check to see if
> > > the zone lock is contended every pcp->batch pages. If there is contention,
> > > relinquish the lock so that other processors have a change to grab the lock
> > > and perform critical work.
> > > 
> > Instead of the unlock/lock game, simply return with the rest left to workqueue
> > in case of lock contension. But workqueue is still unable to kill soft lockup
> > if the number of contending CPUs is large enough.
> 
> Thank you for the idea. One concern that I have is that sometimes, we do expect
> free_pcppages_bulk to actually free all of the pages that it has promised to
> do. One example is when it is called from drain_zone_pages. Of course, we can
> have a while loop that would call free_pcppages_bulk until it returns 0, but
> I think that would be reduced to unlocking / locking over and over again.
> 
In the case of drain_zone_pages(), I think adding something like the pcpu_drain_mutex
to the path that updates the zone counters would be a cure.

