Subject: [PATCH v3 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk
From: Joshua Hahn @ 2025-10-02 20:46 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
	David Hildenbrand, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, Hillf Danton,
	linux-kernel, linux-mm, kernel-team

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large machines
in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly high number
of softlockups. Further investigation showed that the zone lock in
free_pcppages_bulk was being held for long stretches, and that the function
was called to free 2k+ pages more than 100 times during boot alone.

This starves other processes of the zone lock, and the system can stall as
multiple threads fail to make progress without the locks. These issues
manifest as warnings like:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu:     20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:              hardirqs   softirqs   csw/system
[ 4512.638793] rcu:      number:        0        145            0
[ 4512.651177] rcu:     cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu:     (t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings are benign, they point to the underlying issue of lock
contention. To prevent starvation on both the pcp and zone locks, batch the
freeing of pages using pcp->batch as the chunk size.

Because free_pcppages_bulk is called with both the pcp and zone locks held,
relinquishing and reacquiring the locks is only effective when both are
broken together (unless the system was built with queued spinlocks). Thus,
rather than modifying free_pcppages_bulk to break both locks, batch the
freeing from its callers; a sketch of the idea follows.
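
To make that concrete, below is a minimal sketch of the caller-side
batching, not the actual patch: free_batch_from_caller is a made-up name
and the plain spin_lock on pcp->lock elides the real pcp_spin_lock
helpers, while free_pcppages_bulk, struct per_cpu_pages, and pcp->batch
match mm/page_alloc.c.

/*
 * Illustrative sketch only -- not the actual patch. A caller that
 * wants to free `count` pages frees at most pcp->batch of them per
 * critical section and drops the pcp lock (and with it the zone lock
 * taken inside free_pcppages_bulk) between batches, so other CPUs
 * waiting on either lock can make progress.
 */
static void free_batch_from_caller(struct zone *zone,
				   struct per_cpu_pages *pcp,
				   int count, int pindex)
{
	while (count > 0) {
		int todo = min(count, pcp->batch);

		spin_lock(&pcp->lock);
		free_pcppages_bulk(zone, todo, pcp, pindex);
		spin_unlock(&pcp->lock);

		count -= todo;
	}
}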

A similar fix has been deployed in the Meta fleet, and we have seen
significantly fewer softlockups.

Testing
=======
The following are a few synthetic benchmarks, run on three machines. The
first is a large, single-node machine with 754GiB memory and 316 processors.
The second is a relatively smaller single-node machine with 251GiB memory
and 176 processors. The third is a small machine with 62GiB memory and
36 processors.

On all three machines, I kick off a kernel build with -j$(nproc).
A lower delta is better (faster compilation).

Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       |        0.4627 |  -1.2627 |
| user       |        0.2281 |  +0.2680 |
| sys        |        4.6345 |  -7.5425 |
+------------+---------------+----------+

Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       |        0.2321 |  +0.0888 |
| user       |        0.1730 |  -0.1182 |
| sys        |        0.7680 |  +1.2067 |
+------------+---------------+----------+

Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real       |        0.1920 |  -0.1270 |
| user       |        0.1730 |  -0.0358 |
| sys        |        0.7680 |  +0.9143 |
+------------+---------------+----------+

Here, variation is the coefficient of variation, i.e. standard deviation / mean.
For example, the large machine's real-time delta of -1.2627% is roughly 2.7x
its coefficient of variation (0.4627%), i.e. well outside run-to-run noise,
while the medium machine's +0.0888% real-time delta is well within its
0.2321% variation.

Based on these results, there is a clear gain to be had when lock contention
is observed on a larger machine, especially one running on a single node:
the patch yields a measurable decrease in both real and system time
(i.e. it relieves lock contention).

For the medium machine, there is a negligible regression in real time
(<< coefficient of variation), although the patch leads to a measurable
increase in system time.

For the small machine, there is a negligible performance gain in real time,
alongside a system-time regression similar to the medium machine's.

Despite the ~1% system-time regressions on the smaller machines, the cost
(1) is not observable in real time and (2) is much smaller than the gain
made on the large machine.

Changelog
=========
v2 --> v3:
- Refactored on top of mm-new
- Wordsmithing the cover letter & commit messages to clarify which lock
  is contended, as suggested by Hillf Danton.
- Ran new tests for the cover letter: instead of running stress-ng, I decided
  to compile the kernel, which I think is more reflective of a "default"
  workload. Also ran on smaller machines to show the expected behavior of
  this patchset under higher vs. lower lock contention.
- Removed patch 2/4, which would have batched page freeing for
  drain_pages_zone. It is not a good candidate for this series since it is
  called on each CPU in __drain_all_pages.
- Small change in 1/4 to initialize todo, as suggested by Christoph Lameter.
- Small change in 1/4 to avoid bit manipulation, as suggested by SeongJae Park.
- Change in 4/4 to handle the case where the thread gets migrated to a
  different CPU during the window between unlocking & reacquiring the pcp
  lock, as suggested by Vlastimil Babka (see the sketch after this list).
- Small change in 4/4 to handle the case when pcp lock could not be acquired
  within the loop in free_unref_folios.
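
For reference, the CPU-migration case in 4/4 follows roughly the pattern
below. This is an illustrative sketch, not the actual diff: the bail-out
recovery is an assumption, while pcp_spin_unlock/pcp_spin_trylock are the
pcp-lock helpers in mm/page_alloc.c.

	/*
	 * Sketch only: once the pcp lock is dropped to break the
	 * critical section, this task may migrate to another CPU, so
	 * the per-CPU pointer must be re-derived rather than reused,
	 * and the trylock may fail.
	 */
	pcp_spin_unlock(pcp);
	/* ... a batch of pages is freed, relieving the zone lock ... */
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (!pcp)
		return;	/* pcp lock not reacquired; stop early */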

v1 --> v2:
- Reworded cover letter to be more explicit about what kinds of issues
  running processes might face as a result of the existing lock starvation
- Reworded cover letter to be in sections to make it easier to read
- Fixed patch 4/4 to properly store & restore UP flags.
- Re-ran tests, updated the testing results and interpretation

Joshua Hahn (3):
  mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  mm/page_alloc: Batch page freeing in decay_pcp_high
  mm/page_alloc: Batch page freeing in free_frozen_page_commit

 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     | 83 ++++++++++++++++++++++++++++++++++++---------
 mm/vmstat.c         | 28 ++++++++-------
 3 files changed, 83 insertions(+), 30 deletions(-)


base-commit: 71fffcaf9c5c5eb17f90a8db478586091cd300c5
-- 
2.47.3

