* FAILED: patch "[PATCH] mm/page_alloc: prevent pcp corruption with SMP=n" failed to apply to 6.12-stable tree
@ 2026-01-20 13:08 gregkh
2026-01-21 11:28 ` [PATCH 6.12.y 1/3] mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection Sasha Levin
0 siblings, 1 reply; 4+ messages in thread
From: gregkh @ 2026-01-20 13:08 UTC (permalink / raw)
To: vbabka, akpm, bigeasy, hannes, jackmanb, mgorman, mhocko,
oliver.sang, rostedt, stable, surenb, willy, ziy
Cc: stable
The patch below does not apply to the 6.12-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.12.y
git checkout FETCH_HEAD
git cherry-pick -x 038a102535eb49e10e93eafac54352fcc5d78847
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2026012041-wilder-jalapeno-0398@gregkh' --subject-prefix 'PATCH 6.12.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 038a102535eb49e10e93eafac54352fcc5d78847 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 5 Jan 2026 16:08:56 +0100
Subject: [PATCH] mm/page_alloc: prevent pcp corruption with SMP=n
The kernel test robot has reported:
BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28
lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0
CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470
Call Trace:
<IRQ>
__dump_stack (lib/dump_stack.c:95)
dump_stack_lvl (lib/dump_stack.c:123)
dump_stack (lib/dump_stack.c:130)
spin_dump (kernel/locking/spinlock_debug.c:71)
do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?)
_raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138)
__free_frozen_pages (mm/page_alloc.c:2973)
___free_pages (mm/page_alloc.c:5295)
__free_pages (mm/page_alloc.c:5334)
tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290)
? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289)
? rcu_core (kernel/rcu/tree.c:?)
rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861)
rcu_core_si (kernel/rcu/tree.c:2879)
handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623)
__irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725)
irq_exit_rcu (kernel/softirq.c:741)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052)
</IRQ>
<TASK>
RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
free_pcppages_bulk (mm/page_alloc.c:1494)
drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632)
__drain_all_pages (mm/page_alloc.c:2731)
drain_all_pages (mm/page_alloc.c:2747)
kcompactd (mm/compaction.c:3115)
kthread (kernel/kthread.c:465)
? __cfi_kcompactd (mm/compaction.c:3166)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork (arch/x86/kernel/process.c:164)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
</TASK>
Matthew has analyzed the report and identified that in drain_page_zone()
we are in a section protected by spin_lock(&pcp->lock) and then get an
interrupt that attempts spin_trylock() on the same lock. The code is
designed to work this way without disabling IRQs and occasionally fail the
trylock with a fallback. However, the SMP=n spinlock implementation
assumes spin_trylock() will always succeed, and thus it's normally a
no-op. Here the enabled lock debugging catches the problem, but otherwise
it could cause a corruption of the pcp structure.
The problem has been introduced by commit 574907741599 ("mm/page_alloc:
leave IRQs enabled for per-cpu page allocations"). The pcp locking scheme
recognizes the need for disabling IRQs to prevent nesting spin_trylock()
sections on SMP=n, but the need to prevent the nesting in spin_lock() has
not been recognized. Fix it by introducing local wrappers that change the
spin_lock() to spin_lock_iqsave() with SMP=n and use them in all places
that do spin_lock(&pcp->lock).
[vbabka@suse.cz: add pcp_ prefix to the spin_lock_irqsave wrappers, per Steven]
Link: https://lkml.kernel.org/r/20260105-fix-pcp-up-v1-1-5579662d2071@suse.cz
Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3dce96f3be49..f65c4edf199d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -167,6 +167,33 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
pcp_trylock_finish(UP_flags); \
})
+/*
+ * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e.
+ * a potentially remote cpu drain) and get interrupted by an operation that
+ * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
+ * spinlock assumptions making the trylock a no-op. So we have to turn that
+ * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
+ * remote cpu's so we can only be locking the only existing local one.
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+static inline void __flags_noop(unsigned long *flags) { }
+#define pcp_spin_lock_maybe_irqsave(ptr, flags) \
+({ \
+ __flags_noop(&(flags)); \
+ spin_lock(&(ptr)->lock); \
+})
+#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \
+({ \
+ spin_unlock(&(ptr)->lock); \
+ __flags_noop(&(flags)); \
+})
+#else
+#define pcp_spin_lock_maybe_irqsave(ptr, flags) \
+ spin_lock_irqsave(&(ptr)->lock, flags)
+#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \
+ spin_unlock_irqrestore(&(ptr)->lock, flags)
+#endif
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -2556,6 +2583,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, to_drain_batched, batch;
+ unsigned long UP_flags;
bool todo = false;
high_min = READ_ONCE(pcp->high_min);
@@ -2575,9 +2603,9 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->count - pcp->high;
while (to_drain > 0) {
to_drain_batched = min(to_drain, batch);
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
todo = true;
to_drain -= to_drain_batched;
@@ -2594,14 +2622,15 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
+ unsigned long UP_flags;
int to_drain, batch;
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
free_pcppages_bulk(zone, to_drain, pcp, 0);
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
}
}
#endif
@@ -2612,10 +2641,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ unsigned long UP_flags;
int count;
do {
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
count = pcp->count;
if (count) {
int to_drain = min(count,
@@ -2624,7 +2654,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
free_pcppages_bulk(zone, to_drain, pcp, 0);
count -= to_drain;
}
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
} while (count);
}
@@ -6109,6 +6139,7 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
{
struct per_cpu_pages *pcp;
struct cpu_cacheinfo *cci;
+ unsigned long UP_flags;
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
cci = get_cpu_cacheinfo(cpu);
@@ -6119,12 +6150,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
* This can reduce zone lock contention without hurting
* cache-hot pages sharing.
*/
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
pcp->flags |= PCPF_FREE_HIGH_BATCH;
else
pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
}
void setup_pcp_cacheinfo(unsigned int cpu)
^ permalink raw reply related [flat|nested] 4+ messages in thread* [PATCH 6.12.y 1/3] mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
2026-01-20 13:08 FAILED: patch "[PATCH] mm/page_alloc: prevent pcp corruption with SMP=n" failed to apply to 6.12-stable tree gregkh
@ 2026-01-21 11:28 ` Sasha Levin
2026-01-21 11:28 ` [PATCH 6.12.y 2/3] mm/page_alloc: batch page freeing in decay_pcp_high Sasha Levin
2026-01-21 11:28 ` [PATCH 6.12.y 3/3] mm/page_alloc: prevent pcp corruption with SMP=n Sasha Levin
0 siblings, 2 replies; 4+ messages in thread
From: Sasha Levin @ 2026-01-21 11:28 UTC (permalink / raw)
To: stable
Cc: Joshua Hahn, Vlastimil Babka, SeongJae Park, Brendan Jackman,
Chris Mason, Johannes Weiner, Kirill A. Shutemov, Michal Hocko,
Suren Baghdasaryan, Zi Yan, Andrew Morton, Sasha Levin
From: Joshua Hahn <joshua.hahnjy@gmail.com>
[ Upstream commit 0acc67c4030c39f39ac90413cc5d0abddd3a9527 ]
Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.
Motivation & Approach
=====================
While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1Tb memory, 316 CPUs), we saw an unexpectedly
high number of softlockups. Further investigation showed that the zone
lock in free_pcppages_bulk was being held for a long time, and was called
to free 2k+ pages over 100 times just during boot.
This causes starvation in other processes for the zone lock, which can
lead to the system stalling as multiple threads cannot make progress
without the locks. We can see these issues manifesting as warnings:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: hardirqs softirqs csw/system
[ 4512.638793] rcu: number: 0 145 0
[ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
While these warnings don't indicate a crash or a kernel panic, they do
point to the underlying issue of lock contention. To prevent starvation
in both locks, batch the freeing of pages using pcp->batch.
Because free_pcppages_bulk is called with the pcp lock and acquires the
zone lock, relinquishing and reacquiring the locks are only effective when
both of them are broken together (unless the system was built with queued
spinlocks). Thus, instead of modifying free_pcppages_bulk to break both
locks, batch the freeing from its callers instead.
A similar fix has been implemented in the Meta fleet, and we have seen
significantly less softlockups.
Testing
=======
The following are a few synthetic benchmarks, made on three machines. The
first is a large machine with 754GiB memory and 316 processors.
The second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, which has 62GiB
memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc).
Negative delta is better (faster compilation).
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+-----------+
| real | 0.8070 | - 1.4865 |
| user | 0.2823 | + 0.4081 |
| sys | 5.0267 | -11.8737 |
+------------+---------------+-----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2806 | +0.0351 |
| user | 0.0994 | +0.3170 |
| sys | 0.6229 | -0.6277 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1503 | -2.6585 |
| user | 0.0431 | -2.2984 |
| sys | 0.1870 | -3.2013 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.
Based on these results, it seems like there are varying degrees to how
much lock contention this reduces. For the largest and smallest machines
that I ran the tests on, it seems like there is quite some significant
reduction. There is also some performance increases visible from
userspace.
Interestingly, the performance gains don't scale with the size of the
machine, but rather there seems to be a dip in the gain there is for the
medium-sized machine. One possible theory is that because the high
watermark depends on both memory and the number of local CPUs, what
impacts zone contention the most is not these individual values, but
rather the ratio of mem:processors.
This patch (of 5):
Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.
However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.
Simplify the code by returning a bool instead to indicate if updates
were made.
In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 8 ++++----
mm/vmstat.c | 28 +++++++++++++++-------------
3 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a951de920e208..bc59016743fb7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -397,7 +397,7 @@ extern void page_frag_free(void *addr);
#define free_page(addr) free_pages((addr), 0)
void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d43bd47da263..6e1669a562946 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2363,10 +2363,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* Called from the vmstat counter updater to decay the PCP high.
* Return whether there are addition works to do.
*/
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, batch;
- int todo = 0;
+ bool todo = false;
high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
@@ -2379,7 +2379,7 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
pcp->high - (pcp->high >> 3), high_min);
if (pcp->high > high_min)
- todo++;
+ todo = true;
}
to_drain = pcp->count - pcp->high;
@@ -2387,7 +2387,7 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
spin_lock(&pcp->lock);
free_pcppages_bulk(zone, to_drain, pcp, 0);
spin_unlock(&pcp->lock);
- todo++;
+ todo = true;
}
return todo;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3f41344239126..3ca572cbeaf1c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -768,25 +768,25 @@ EXPORT_SYMBOL(dec_node_page_state);
/*
* Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
*/
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
- int changes = 0;
+ bool changed = false;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (zone_diff[i]) {
atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
+ changed = true;
}
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
if (node_diff[i]) {
atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
+ changed = true;
}
- return changes;
+ return changed;
}
/*
@@ -803,16 +803,16 @@ static int fold_diff(int *zone_diff, int *node_diff)
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
*/
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
{
struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
- int changes = 0;
+ bool changed = false;
for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -836,7 +836,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
if (do_pagesets) {
cond_resched();
- changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+ if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+ changed = true;
#ifdef CONFIG_NUMA
/*
* Deal with draining the remote pageset of this
@@ -858,13 +859,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
if (__this_cpu_dec_return(pcp->expire)) {
- changes++;
+ changed = true;
continue;
}
if (__this_cpu_read(pcp->count)) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
- changes++;
+ changed = true;
}
#endif
}
@@ -884,8 +885,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}
- changes += fold_diff(global_zone_diff, global_node_diff);
- return changes;
+ if (fold_diff(global_zone_diff, global_node_diff))
+ changed = true;
+ return changed;
}
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 4+ messages in thread* [PATCH 6.12.y 2/3] mm/page_alloc: batch page freeing in decay_pcp_high
2026-01-21 11:28 ` [PATCH 6.12.y 1/3] mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection Sasha Levin
@ 2026-01-21 11:28 ` Sasha Levin
2026-01-21 11:28 ` [PATCH 6.12.y 3/3] mm/page_alloc: prevent pcp corruption with SMP=n Sasha Levin
1 sibling, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-01-21 11:28 UTC (permalink / raw)
To: stable
Cc: Joshua Hahn, Chris Mason, Andrew Morton, Johannes Weiner,
Vlastimil Babka, Brendan Jackman, Kirill A. Shutemov,
Michal Hocko, SeongJae Park, Suren Baghdasaryan, Zi Yan,
Sasha Levin
From: Joshua Hahn <joshua.hahnjy@gmail.com>
[ Upstream commit fc4b909c368f3a7b08c895dd5926476b58e85312 ]
It is possible for pcp->count - pcp->high to exceed pcp->batch by a lot.
When this happens, we should perform batching to ensure that
free_pcppages_bulk isn't called with too many pages to free at once and
starve out other threads that need the pcp or zone lock.
Since we are still only freeing the difference between the initial
pcp->count and pcp->high values, there should be no change to how many
pages are freed.
Link: https://lkml.kernel.org/r/20251014145011.3427205-3-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Chris Mason <clm@fb.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/page_alloc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1669a562946..23ad33020f312 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2365,7 +2365,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
- int high_min, to_drain, batch;
+ int high_min, to_drain, to_drain_batched, batch;
bool todo = false;
high_min = READ_ONCE(pcp->high_min);
@@ -2383,11 +2383,14 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
}
to_drain = pcp->count - pcp->high;
- if (to_drain > 0) {
+ while (to_drain > 0) {
+ to_drain_batched = min(to_drain, batch);
spin_lock(&pcp->lock);
- free_pcppages_bulk(zone, to_drain, pcp, 0);
+ free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
spin_unlock(&pcp->lock);
todo = true;
+
+ to_drain -= to_drain_batched;
}
return todo;
--
2.51.0
^ permalink raw reply related [flat|nested] 4+ messages in thread* [PATCH 6.12.y 3/3] mm/page_alloc: prevent pcp corruption with SMP=n
2026-01-21 11:28 ` [PATCH 6.12.y 1/3] mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection Sasha Levin
2026-01-21 11:28 ` [PATCH 6.12.y 2/3] mm/page_alloc: batch page freeing in decay_pcp_high Sasha Levin
@ 2026-01-21 11:28 ` Sasha Levin
1 sibling, 0 replies; 4+ messages in thread
From: Sasha Levin @ 2026-01-21 11:28 UTC (permalink / raw)
To: stable
Cc: Vlastimil Babka, kernel test robot, Matthew Wilcox, Mel Gorman,
Brendan Jackman, Johannes Weiner, Michal Hocko,
Sebastian Andrzej Siewior, Steven Rostedt, Suren Baghdasaryan,
Zi Yan, Andrew Morton, Sasha Levin
From: Vlastimil Babka <vbabka@suse.cz>
[ Upstream commit 038a102535eb49e10e93eafac54352fcc5d78847 ]
The kernel test robot has reported:
BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28
lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0
CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470
Call Trace:
<IRQ>
__dump_stack (lib/dump_stack.c:95)
dump_stack_lvl (lib/dump_stack.c:123)
dump_stack (lib/dump_stack.c:130)
spin_dump (kernel/locking/spinlock_debug.c:71)
do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?)
_raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138)
__free_frozen_pages (mm/page_alloc.c:2973)
___free_pages (mm/page_alloc.c:5295)
__free_pages (mm/page_alloc.c:5334)
tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290)
? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289)
? rcu_core (kernel/rcu/tree.c:?)
rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861)
rcu_core_si (kernel/rcu/tree.c:2879)
handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623)
__irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725)
irq_exit_rcu (kernel/softirq.c:741)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052)
</IRQ>
<TASK>
RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
free_pcppages_bulk (mm/page_alloc.c:1494)
drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632)
__drain_all_pages (mm/page_alloc.c:2731)
drain_all_pages (mm/page_alloc.c:2747)
kcompactd (mm/compaction.c:3115)
kthread (kernel/kthread.c:465)
? __cfi_kcompactd (mm/compaction.c:3166)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork (arch/x86/kernel/process.c:164)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
</TASK>
Matthew has analyzed the report and identified that in drain_page_zone()
we are in a section protected by spin_lock(&pcp->lock) and then get an
interrupt that attempts spin_trylock() on the same lock. The code is
designed to work this way without disabling IRQs and occasionally fail the
trylock with a fallback. However, the SMP=n spinlock implementation
assumes spin_trylock() will always succeed, and thus it's normally a
no-op. Here the enabled lock debugging catches the problem, but otherwise
it could cause a corruption of the pcp structure.
The problem has been introduced by commit 574907741599 ("mm/page_alloc:
leave IRQs enabled for per-cpu page allocations"). The pcp locking scheme
recognizes the need for disabling IRQs to prevent nesting spin_trylock()
sections on SMP=n, but the need to prevent the nesting in spin_lock() has
not been recognized. Fix it by introducing local wrappers that change the
spin_lock() to spin_lock_iqsave() with SMP=n and use them in all places
that do spin_lock(&pcp->lock).
[vbabka@suse.cz: add pcp_ prefix to the spin_lock_irqsave wrappers, per Steven]
Link: https://lkml.kernel.org/r/20260105-fix-pcp-up-v1-1-5579662d2071@suse.cz
Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/page_alloc.c | 47 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 39 insertions(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23ad33020f312..a49043798bbf0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -163,6 +163,33 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
#define pcp_spin_unlock(ptr) \
pcpu_spin_unlock(lock, ptr)
+/*
+ * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e.
+ * a potentially remote cpu drain) and get interrupted by an operation that
+ * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
+ * spinlock assumptions making the trylock a no-op. So we have to turn that
+ * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
+ * remote cpu's so we can only be locking the only existing local one.
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+static inline void __flags_noop(unsigned long *flags) { }
+#define pcp_spin_lock_maybe_irqsave(ptr, flags) \
+({ \
+ __flags_noop(&(flags)); \
+ spin_lock(&(ptr)->lock); \
+})
+#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \
+({ \
+ spin_unlock(&(ptr)->lock); \
+ __flags_noop(&(flags)); \
+})
+#else
+#define pcp_spin_lock_maybe_irqsave(ptr, flags) \
+ spin_lock_irqsave(&(ptr)->lock, flags)
+#define pcp_spin_unlock_maybe_irqrestore(ptr, flags) \
+ spin_unlock_irqrestore(&(ptr)->lock, flags)
+#endif
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -2366,6 +2393,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, to_drain_batched, batch;
+ unsigned long UP_flags;
bool todo = false;
high_min = READ_ONCE(pcp->high_min);
@@ -2385,9 +2413,9 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->count - pcp->high;
while (to_drain > 0) {
to_drain_batched = min(to_drain, batch);
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
todo = true;
to_drain -= to_drain_batched;
@@ -2404,14 +2432,15 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
+ unsigned long UP_flags;
int to_drain, batch;
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
free_pcppages_bulk(zone, to_drain, pcp, 0);
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
}
}
#endif
@@ -2422,10 +2451,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ unsigned long UP_flags;
int count;
do {
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
count = pcp->count;
if (count) {
int to_drain = min(count,
@@ -2434,7 +2464,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
free_pcppages_bulk(zone, to_drain, pcp, 0);
count -= to_drain;
}
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
} while (count);
}
@@ -5795,6 +5825,7 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
{
struct per_cpu_pages *pcp;
struct cpu_cacheinfo *cci;
+ unsigned long UP_flags;
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
cci = get_cpu_cacheinfo(cpu);
@@ -5805,12 +5836,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
* This can reduce zone lock contention without hurting
* cache-hot pages sharing.
*/
- spin_lock(&pcp->lock);
+ pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
pcp->flags |= PCPF_FREE_HIGH_BATCH;
else
pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
- spin_unlock(&pcp->lock);
+ pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
}
void setup_pcp_cacheinfo(unsigned int cpu)
--
2.51.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-01-21 11:28 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-20 13:08 FAILED: patch "[PATCH] mm/page_alloc: prevent pcp corruption with SMP=n" failed to apply to 6.12-stable tree gregkh
2026-01-21 11:28 ` [PATCH 6.12.y 1/3] mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection Sasha Levin
2026-01-21 11:28 ` [PATCH 6.12.y 2/3] mm/page_alloc: batch page freeing in decay_pcp_high Sasha Levin
2026-01-21 11:28 ` [PATCH 6.12.y 3/3] mm/page_alloc: prevent pcp corruption with SMP=n Sasha Levin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox