* + mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch added to mm-new branch
@ 2025-10-08 0:47 Andrew Morton
From: Andrew Morton @ 2025-10-08 0:47 UTC (permalink / raw)
To: mm-commits, ziy, vbabka, surenb, rppt, mhocko, lorenzo.stoakes,
liam.howlett, kirill, jackmanb, hdanton, hannes, david, clm,
joshua.hahnjy, akpm
The patch titled
Subject: mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
has been added to the -mm mm-new branch. Its filename is
mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Joshua Hahn <joshua.hahnjy@gmail.com>
Subject: mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
Date: Thu, 2 Oct 2025 13:46:31 -0700
Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v3.
Motivation & Approach
=====================
While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly
high number of softlockups. Further investigation showed that the zone
lock in free_pcppages_bulk was being held for a long time, and that
free_pcppages_bulk was called to free 2k+ pages over 100 times during
boot alone.
This starves other processes of the zone lock, which can lead to the
system stalling, as multiple threads cannot make progress without the
locks. These issues manifest as warnings:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: hardirqs softirqs csw/system
[ 4512.638793] rcu: number: 0 145 0
[ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
While these warnings are benign, they do point to the underlying issue of
lock contention. To prevent starvation in both locks, batch the freeing
of pages using pcp->batch.
Because free_pcppages_bulk is called with both the pcp and zone lock,
relinquishing and reacquiring the locks is only effective when both of
them are broken together (unless the system was built with queued
spinlocks). Thus, instead of modifying free_pcppages_bulk to break both
locks, batch the freeing from its callers instead.
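The caller-side batching described above can be sketched in user space as
follows. This is a minimal illustration, not the code from this series:
fake_pcp, fake_spin_lock, fake_spin_unlock, free_bulk_locked and
drain_in_batches are hypothetical names, and a single C11 atomic_flag
stands in for the pcp and zone locks.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct per_cpu_pages; not the kernel type. */
struct fake_pcp {
	atomic_flag lock;       /* stands in for the pcp/zone locks */
	int count;              /* pages currently on the per-CPU list */
	int batch;              /* chunk size, analogous to pcp->batch */
	int lock_acquisitions;  /* bookkeeping for the demonstration only */
};

/* Toy spinlock built on atomic_flag; the kernel uses spin_lock(). */
static void fake_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* busy-wait until the holder clears the flag */
}

static void fake_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Free up to nr pages; the caller must hold pcp->lock. */
static int free_bulk_locked(struct fake_pcp *pcp, int nr)
{
	if (nr > pcp->count)
		nr = pcp->count;
	pcp->count -= nr;
	return nr;
}

/*
 * Free to_drain pages in chunks of at most pcp->batch, dropping the lock
 * between chunks so that other waiters get a chance to take it.
 * Returns the number of pages actually freed.
 */
static int drain_in_batches(struct fake_pcp *pcp, int to_drain)
{
	int freed = 0;

	assert(pcp->batch > 0);
	while (to_drain > 0) {
		int chunk = to_drain < pcp->batch ? to_drain : pcp->batch;
		int done;

		fake_spin_lock(&pcp->lock);
		pcp->lock_acquisitions++;
		done = free_bulk_locked(pcp, chunk);
		fake_spin_unlock(&pcp->lock);

		if (done == 0)
			break;	/* list is empty; nothing left to free */
		freed += done;
		to_drain -= done;
	}
	return freed;
}
```

With a batch of 63 (an arbitrary value chosen for the example), draining
2048 pages takes the lock 33 times for short periods instead of once for
the whole run, which is the trade-off the series makes.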
A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.
Testing
=======
The following are a few synthetic benchmarks, made on three machines. The
first is a large, single-node machine with 754GiB memory and 316
processors. The second is a relatively smaller single-node machine with
251GiB memory and 176 processors. The third is the smallest of the three,
with 62GiB memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc). Negative
delta is better (faster compilation).
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.4627 | -1.2627 |
| user | 0.2281 | +0.2680 |
| sys | 4.6345 | -7.5425 |
+------------+---------------+----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2321 | +0.0888 |
| user | 0.1730 | -0.1182 |
| sys | 0.7680 | +1.2067 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1920 | -0.1270 |
| user | 0.1730 | -0.0358 |
| sys | 0.7680 | +0.9143 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.
Based on these results, there is a clear gain when lock contention is
observed on a larger machine, especially if it is running on one node.
Both the real and system compilation times decrease measurably (i.e. lock
contention is relieved).
For the medium machine, there is a negligible regression in real time (<<
coefficient of variation), although there is a measurable increase in
system time.
For the small machine, there is a negligible performance gain in real
time, but a regression in system time similar to that of the medium
machine.
Despite the regressions (~1%) in system time on the smaller machines, they
seem to be (1) not observable in real time and (2) much smaller than the
gain made on the large machine.
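The comparison of each delta against the coefficient of variation can be
written down as a small helper. This is only a sketch: within_noise is a
hypothetical name, and using one coefficient of variation as the cut-off
is an assumption made for illustration, not a statistical test used by the
series.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true when the magnitude of a delta (in percent) is smaller than
 * the run-to-run coefficient of variation (also in percent), i.e. when
 * the change cannot be told apart from noise under this simple rule.
 */
static bool within_noise(double delta_pct, double cov_pct)
{
	double magnitude = delta_pct < 0 ? -delta_pct : delta_pct;

	assert(cov_pct >= 0);
	return magnitude < cov_pct;
}
```

Applied to the tables above, the medium machine's real-time delta
(+0.0888 against a variation of 0.2321) falls within noise, while the
large machine's system-time delta (-7.5425 against 4.6345) does not.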
This patch (of 3):
Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.
However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.
Simplify the code by returning a bool instead to indicate if updates
were made.
In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
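The pattern can be illustrated outside the kernel with a minimal sketch.
All names here (fake_global_stat, fake_worker, fold_fake_diff, fake_update)
are hypothetical, and plain longs replace the kernel's atomic counters. The
point is only that a caller which tests the result for truthiness loses
nothing when the count becomes a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_FAKE_STATS 4

/* Hypothetical global counters; the kernel uses atomic_long_t arrays. */
static long fake_global_stat[NR_FAKE_STATS];

/* Hypothetical worker; requeued counts how often more work was scheduled. */
struct fake_worker {
	int requeued;
};

/* Fold deltas into the global counters; report whether any were non-zero. */
static bool fold_fake_diff(const int *diff, size_t n)
{
	bool changed = false;
	size_t i;

	assert(n <= NR_FAKE_STATS);
	for (i = 0; i < n; i++) {
		if (diff[i]) {
			fake_global_stat[i] += diff[i];
			changed = true;
		}
	}
	return changed;
}

/* The caller only asks "did anything change?", never "how much?". */
static void fake_update(struct fake_worker *w, const int *diff, size_t n)
{
	if (fold_fake_diff(diff, n))
		w->requeued++;	/* stands in for scheduling another pass */
}
```

Because the caller branches on the result alone, the number of updated
counters never influences its behavior.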
Link: https://lkml.kernel.org/r/20251002204636.4016712-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251002204636.4016712-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 8 ++++----
mm/vmstat.c | 28 +++++++++++++++-------------
3 files changed, 20 insertions(+), 18 deletions(-)
--- a/include/linux/gfp.h~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/include/linux/gfp.h
@@ -386,7 +386,7 @@ extern void free_pages(unsigned long add
#define free_page(addr) free_pages((addr), 0)
void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
--- a/mm/page_alloc.c~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/mm/page_alloc.c
@@ -2557,10 +2557,10 @@ static int rmqueue_bulk(struct zone *zon
* Called from the vmstat counter updater to decay the PCP high.
* Return whether there are addition works to do.
*/
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, batch;
- int todo = 0;
+ bool todo = false;
high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
@@ -2573,7 +2573,7 @@ int decay_pcp_high(struct zone *zone, st
pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
pcp->high - (pcp->high >> 3), high_min);
if (pcp->high > high_min)
- todo++;
+ todo = true;
}
to_drain = pcp->count - pcp->high;
@@ -2581,7 +2581,7 @@ int decay_pcp_high(struct zone *zone, st
spin_lock(&pcp->lock);
free_pcppages_bulk(zone, to_drain, pcp, 0);
spin_unlock(&pcp->lock);
- todo++;
+ todo = true;
}
return todo;
--- a/mm/vmstat.c~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/mm/vmstat.c
@@ -771,25 +771,25 @@ EXPORT_SYMBOL(dec_node_page_state);
/*
* Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
*/
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
- int changes = 0;
+ bool changed = false;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (zone_diff[i]) {
atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
+ changed = true;
}
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
if (node_diff[i]) {
atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
+ changed = true;
}
- return changes;
+ return changed;
}
/*
@@ -806,16 +806,16 @@ static int fold_diff(int *zone_diff, int
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
*/
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
{
struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
- int changes = 0;
+ bool changed = false;
for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -839,7 +839,8 @@ static int refresh_cpu_vm_stats(bool do_
if (do_pagesets) {
cond_resched();
- changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+ if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+ changed = true;
#ifdef CONFIG_NUMA
/*
* Deal with draining the remote pageset of this
@@ -861,13 +862,13 @@ static int refresh_cpu_vm_stats(bool do_
}
if (__this_cpu_dec_return(pcp->expire)) {
- changes++;
+ changed = true;
continue;
}
if (__this_cpu_read(pcp->count)) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
- changes++;
+ changed = true;
}
#endif
}
@@ -887,8 +888,9 @@ static int refresh_cpu_vm_stats(bool do_
}
}
- changes += fold_diff(global_zone_diff, global_node_diff);
- return changes;
+ if (fold_diff(global_zone_diff, global_node_diff))
+ changed = true;
+ return changed;
}
/*
_
Patches currently in -mm which might be from joshua.hahnjy@gmail.com are
mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
mm-page_alloc-batch-page-freeing-in-decay_pcp_high.patch
mm-page_alloc-batch-page-freeing-in-free_frozen_page_commit.patch
* + mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch added to mm-new branch
@ 2025-10-13 22:40 Andrew Morton
From: Andrew Morton @ 2025-10-13 22:40 UTC (permalink / raw)
To: mm-commits, ziy, vbabka, surenb, sj, rppt, mhocko,
lorenzo.stoakes, liam.howlett, kirill, jackmanb, hannes, david,
clm, joshua.hahnjy, akpm
The patch titled
Subject: mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
has been added to the -mm mm-new branch. Its filename is
mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Joshua Hahn <joshua.hahnjy@gmail.com>
Subject: mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
Date: Mon, 13 Oct 2025 12:08:09 -0700
Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v4.
Motivation & Approach
=====================
While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly
high number of softlockups. Further investigation showed that the zone
lock in free_pcppages_bulk was being held for a long time, and that
free_pcppages_bulk was called to free 2k+ pages over 100 times during
boot alone.
This starves other processes of the zone lock, which can lead to the
system stalling, as multiple threads cannot make progress without the
locks. These issues manifest as warnings:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: hardirqs softirqs csw/system
[ 4512.638793] rcu: number: 0 145 0
[ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
While these warnings are benign, they do point to the underlying issue of
lock contention. To prevent starvation in both locks, batch the freeing
of pages using pcp->batch.
Because free_pcppages_bulk is called with the pcp lock and acquires the
zone lock, relinquishing and reacquiring the locks is only effective when
both of them are broken together (unless the system was built with queued
spinlocks). Thus, instead of modifying free_pcppages_bulk to break both
locks, batch the freeing from its callers instead.
A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.
Testing
=======
The following are a few synthetic benchmarks, made on three machines. The
first is a large machine with 754GiB memory and 316 processors.
The second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, which has 62GiB
memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc).
Negative delta is better (faster compilation).
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+-----------+
| real | 0.8070 | - 1.4865 |
| user | 0.2823 | + 0.4081 |
| sys | 5.0267 | -11.8737 |
+------------+---------------+-----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2806 | +0.0351 |
| user | 0.0994 | +0.3170 |
| sys | 0.6229 | -0.6277 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1503 | -2.6585 |
| user | 0.0431 | -2.2984 |
| sys | 0.1870 | -3.2013 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.
Based on these results, the degree to which this reduces lock contention
varies. For the largest and smallest machines that I ran the tests on,
there seems to be a significant reduction. There are also some
performance increases visible from userspace.
Interestingly, the performance gains don't scale with the size of the
machine; rather, the gain dips for the medium-sized machine.
This patch (of 3):
Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.
However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.
Simplify the code by returning a bool instead to indicate if updates were
made.
In addition, simplify fold_diff and decay_pcp_high to return a bool for
the same reason.
Link: https://lkml.kernel.org/r/20251013190812.787205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251013190812.787205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 8 ++++----
mm/vmstat.c | 28 +++++++++++++++-------------
3 files changed, 20 insertions(+), 18 deletions(-)
--- a/include/linux/gfp.h~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/include/linux/gfp.h
@@ -386,7 +386,7 @@ extern void free_pages(unsigned long add
#define free_page(addr) free_pages((addr), 0)
void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
--- a/mm/page_alloc.c~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/mm/page_alloc.c
@@ -2557,10 +2557,10 @@ static int rmqueue_bulk(struct zone *zon
* Called from the vmstat counter updater to decay the PCP high.
* Return whether there are addition works to do.
*/
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, batch;
- int todo = 0;
+ bool todo = false;
high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
@@ -2573,7 +2573,7 @@ int decay_pcp_high(struct zone *zone, st
pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
pcp->high - (pcp->high >> 3), high_min);
if (pcp->high > high_min)
- todo++;
+ todo = true;
}
to_drain = pcp->count - pcp->high;
@@ -2581,7 +2581,7 @@ int decay_pcp_high(struct zone *zone, st
spin_lock(&pcp->lock);
free_pcppages_bulk(zone, to_drain, pcp, 0);
spin_unlock(&pcp->lock);
- todo++;
+ todo = true;
}
return todo;
--- a/mm/vmstat.c~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/mm/vmstat.c
@@ -771,25 +771,25 @@ EXPORT_SYMBOL(dec_node_page_state);
/*
* Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
*/
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
- int changes = 0;
+ bool changed = false;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (zone_diff[i]) {
atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
+ changed = true;
}
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
if (node_diff[i]) {
atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
+ changed = true;
}
- return changes;
+ return changed;
}
/*
@@ -806,16 +806,16 @@ static int fold_diff(int *zone_diff, int
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
*/
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
{
struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
- int changes = 0;
+ bool changed = false;
for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -839,7 +839,8 @@ static int refresh_cpu_vm_stats(bool do_
if (do_pagesets) {
cond_resched();
- changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+ if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+ changed = true;
#ifdef CONFIG_NUMA
/*
* Deal with draining the remote pageset of this
@@ -861,13 +862,13 @@ static int refresh_cpu_vm_stats(bool do_
}
if (__this_cpu_dec_return(pcp->expire)) {
- changes++;
+ changed = true;
continue;
}
if (__this_cpu_read(pcp->count)) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
- changes++;
+ changed = true;
}
#endif
}
@@ -887,8 +888,9 @@ static int refresh_cpu_vm_stats(bool do_
}
}
- changes += fold_diff(global_zone_diff, global_node_diff);
- return changes;
+ if (fold_diff(global_zone_diff, global_node_diff))
+ changed = true;
+ return changed;
}
/*
_
Patches currently in -mm which might be from joshua.hahnjy@gmail.com are
mm-page_alloc-clarify-batch-tuning-in-zone_batchsize.patch
mm-page_alloc-prevent-reporting-pcp-batch-=-0.patch
mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
mm-page_alloc-batch-page-freeing-in-decay_pcp_high.patch
mm-page_alloc-batch-page-freeing-in-free_frozen_page_commit.patch
* + mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch added to mm-new branch
@ 2025-10-14 22:12 Andrew Morton
From: Andrew Morton @ 2025-10-14 22:12 UTC (permalink / raw)
To: mm-commits, ziy, vbabka, surenb, sj, mhocko, kirill, jackmanb,
hannes, clm, joshua.hahnjy, akpm
The patch titled
Subject: mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
has been added to the -mm mm-new branch. Its filename is
mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Joshua Hahn <joshua.hahnjy@gmail.com>
Subject: mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
Date: Tue, 14 Oct 2025 07:50:08 -0700
Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.
Motivation & Approach
=====================
While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly
high number of softlockups. Further investigation showed that the zone
lock in free_pcppages_bulk was being held for a long time, and that
free_pcppages_bulk was called to free 2k+ pages over 100 times during
boot alone.
This starves other processes of the zone lock, which can lead to the
system stalling, as multiple threads cannot make progress without the
locks. These issues manifest as warnings:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: hardirqs softirqs csw/system
[ 4512.638793] rcu: number: 0 145 0
[ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
While these warnings don't indicate a crash or a kernel panic, they do
point to the underlying issue of lock contention. To prevent starvation
in both locks, batch the freeing of pages using pcp->batch.
Because free_pcppages_bulk is called with the pcp lock and acquires the
zone lock, relinquishing and reacquiring the locks is only effective when
both of them are broken together (unless the system was built with queued
spinlocks). Thus, instead of modifying free_pcppages_bulk to break both
locks, batch the freeing from its callers instead.
A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.
Testing
=======
The following are a few synthetic benchmarks, made on three machines. The
first is a large machine with 754GiB memory and 316 processors.
The second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, which has 62GiB
memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc).
Negative delta is better (faster compilation).
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+-----------+
| real | 0.8070 | - 1.4865 |
| user | 0.2823 | + 0.4081 |
| sys | 5.0267 | -11.8737 |
+------------+---------------+-----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2806 | +0.0351 |
| user | 0.0994 | +0.3170 |
| sys | 0.6229 | -0.6277 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1503 | -2.6585 |
| user | 0.0431 | -2.2984 |
| sys | 0.1870 | -3.2013 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.
Based on these results, the degree to which this reduces lock contention
varies. For the largest and smallest machines that I ran the tests on,
there seems to be a significant reduction. There are also some
performance increases visible from userspace.
Interestingly, the performance gains don't scale with the size of the
machine; rather, the gain dips for the medium-sized machine. One possible
theory is that because the high watermark depends on both memory and the
number of local CPUs, what impacts zone contention the most is not these
individual values, but rather the ratio of mem:processors.
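The ratio in question is easy to compute for the three test machines. The
sketch below uses only the memory and processor counts quoted in the
benchmark headings; struct machine and mem_per_cpu are hypothetical names
introduced for the example.

```c
#include <assert.h>

/* Hypothetical description of a test machine. */
struct machine {
	const char *name;
	double mem_gib;   /* memory in GiB, from the benchmark headings */
	int cpus;         /* processor count, from the benchmark headings */
};

/* Memory per processor in GiB. */
static double mem_per_cpu(const struct machine *m)
{
	assert(m->cpus > 0);
	return m->mem_gib / m->cpus;
}
```

This gives roughly 2.39 GiB per processor for the large machine (754/316),
1.43 for the medium machine (251/176) and 1.72 for the small machine
(62/36). The medium machine has the lowest ratio of the three, which is at
least consistent with the dip, although three data points cannot confirm
the theory.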
This patch (of 5):
Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.
However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.
Simplify the code by returning a bool instead to indicate if updates
were made.
In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 8 ++++----
mm/vmstat.c | 28 +++++++++++++++-------------
3 files changed, 20 insertions(+), 18 deletions(-)
--- a/include/linux/gfp.h~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/include/linux/gfp.h
@@ -386,7 +386,7 @@ extern void free_pages(unsigned long add
#define free_page(addr) free_pages((addr), 0)
void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
--- a/mm/page_alloc.c~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/mm/page_alloc.c
@@ -2557,10 +2557,10 @@ static int rmqueue_bulk(struct zone *zon
* Called from the vmstat counter updater to decay the PCP high.
* Return whether there are addition works to do.
*/
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, batch;
- int todo = 0;
+ bool todo = false;
high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
@@ -2573,7 +2573,7 @@ int decay_pcp_high(struct zone *zone, st
pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
pcp->high - (pcp->high >> 3), high_min);
if (pcp->high > high_min)
- todo++;
+ todo = true;
}
to_drain = pcp->count - pcp->high;
@@ -2581,7 +2581,7 @@ int decay_pcp_high(struct zone *zone, st
spin_lock(&pcp->lock);
free_pcppages_bulk(zone, to_drain, pcp, 0);
spin_unlock(&pcp->lock);
- todo++;
+ todo = true;
}
return todo;
--- a/mm/vmstat.c~mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection
+++ a/mm/vmstat.c
@@ -771,25 +771,25 @@ EXPORT_SYMBOL(dec_node_page_state);
/*
* Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
*/
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
- int changes = 0;
+ bool changed = false;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (zone_diff[i]) {
atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
+ changed = true;
}
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
if (node_diff[i]) {
atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
+ changed = true;
}
- return changes;
+ return changed;
}
/*
@@ -806,16 +806,16 @@ static int fold_diff(int *zone_diff, int
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
*/
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
{
struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
- int changes = 0;
+ bool changed = false;
for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -839,7 +839,8 @@ static int refresh_cpu_vm_stats(bool do_
if (do_pagesets) {
cond_resched();
- changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+ if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+ changed = true;
#ifdef CONFIG_NUMA
/*
* Deal with draining the remote pageset of this
@@ -861,13 +862,13 @@ static int refresh_cpu_vm_stats(bool do_
}
if (__this_cpu_dec_return(pcp->expire)) {
- changes++;
+ changed = true;
continue;
}
if (__this_cpu_read(pcp->count)) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
- changes++;
+ changed = true;
}
#endif
}
@@ -887,8 +888,9 @@ static int refresh_cpu_vm_stats(bool do_
}
}
- changes += fold_diff(global_zone_diff, global_node_diff);
- return changes;
+ if (fold_diff(global_zone_diff, global_node_diff))
+ changed = true;
+ return changed;
}
/*
_
Patches currently in -mm which might be from joshua.hahnjy@gmail.com are
mm-page_alloc-clarify-batch-tuning-in-zone_batchsize.patch
mm-page_alloc-prevent-reporting-pcp-batch-=-0.patch
mm-page_alloc-vmstat-simplify-refresh_cpu_vm_stats-change-detection.patch
mm-page_alloc-batch-page-freeing-in-decay_pcp_high.patch
mm-page_alloc-batch-page-freeing-in-free_frozen_page_commit.patch
mm-page_alloc-batch-page-freeing-in-free_frozen_page_commit-fix.patch