* [PATCH] x86/mm/pat: take cpa_lock around large-page collapse
@ 2026-06-26 16:32 Denis V. Lunev
2026-07-02 17:47 ` Denis V. Lunev
2026-07-03 13:01 ` Kiryl Shutsemau
0 siblings, 2 replies; 3+ messages in thread
From: Denis V. Lunev @ 2026-06-26 16:32 UTC
To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86
Cc: den, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
H. Peter Anvin, Mike Rapoport (Microsoft), Juergen Gross,
Kiryl Shutsemau, linux-kernel
Loading and unloading modules concurrently on several CPUs on a KASAN
build, with a short delay injected at the CPA page-table lookup to
widen the window, faults within minutes:

  BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
  Write of size 8 at addr ffff888181139718 by task modprobe
  ...
  The buggy address belongs to the physical page:
  pfn:0x181139 ... page_type: f2(table)

cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
frees the old PTE-table pages, while __change_page_attr() fetches a
PTE pointer from a lockless lookup_address_in_pgd_attr() and writes
it with set_pte_atomic() only later. When module text is served from
a shared large ROX mapping the two run on the same PMD:

  CPU A (module load)              CPU B (module finalize)
  -------------------              -----------------------
  execmem_make_temp_rw
    set_memory_nx
      __change_page_attr
        split 2M -> 4K table P
        kpte = &P[i]  (lockless)
                                   execmem_restore_rox
                                     set_memory_rox  (CPA_COLLAPSE)
                                       cpa_collapse_large_pages
                                         rebuild leaf PMD
                                         flush_tlb_all
                                         pagetable_free(P)
        set_pte_atomic(kpte, ...)
          -> writes into freed P

P is a page-table page (page_type: table), reused at once, so the
write corrupts whatever got the page next: a bad-pte or bad-page
splat, or a fatal fault once P has been turned into read-only text.

The flush_tlb_all() before the free does not close this: its IPI only
serializes against page-table walkers that run with interrupts off
(e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
on, so nothing stops it from holding a stale pointer into P.

Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
free - under cpa_lock, the lock __change_page_attr() takes for the
split path, so a concurrent walker can no longer hold a pointer into
a table the collapse is about to free.

debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
map is 4K then, with no large pages to serialize), so the lock cannot
order the two there. Skip the collapse in that config: it is only an
optimization, and not freeing the tables leaves the unserialized walk
nothing to race.

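
The locking rule described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not
the kernel code: struct pmd_slot, model_lock, model_set_entry() and
model_collapse() are hypothetical names, a pthread mutex stands in for
cpa_lock, and the walkers_take_lock flag stands in for the
!debug_pagealloc_enabled() case. The point it tries to show is that
the walker's lookup and write, and the collapser's rebuild and free,
each sit inside one critical section of the same lock, and that the
collapse is skipped when walkers do not take that lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 512	/* small entries per table, as with 4K PTEs under a 2M PMD */

/* Hypothetical stand-in for one PMD slot: a leaf value or a split table. */
struct pmd_slot {
	uint64_t *table;	/* non-NULL while the slot is split */
	uint64_t leaf;		/* meaningful only while table == NULL */
};

/* Plays the role of cpa_lock in this model. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Walker side: split on demand, then look up and write the entry.
 * The pointer into the table never outlives the critical section.
 */
bool model_set_entry(struct pmd_slot *slot, unsigned int idx, uint64_t val)
{
	bool ok = false;

	assert(slot != NULL);
	if (idx >= ENTRIES)
		return false;

	pthread_mutex_lock(&model_lock);
	if (!slot->table) {
		slot->table = malloc(ENTRIES * sizeof(*slot->table));
		if (slot->table)
			for (size_t i = 0; i < ENTRIES; i++)
				slot->table[i] = slot->leaf;
	}
	if (slot->table) {
		slot->table[idx] = val;
		ok = true;
	}
	pthread_mutex_unlock(&model_lock);
	return ok;
}

/*
 * Collapser side: rebuild the leaf and free the table under the same
 * lock. If walkers do not take the lock, the lock cannot order the
 * two sides, so the (optional) collapse is skipped entirely.
 */
bool model_collapse(struct pmd_slot *slot, bool walkers_take_lock)
{
	bool collapsed = false;

	assert(slot != NULL);
	if (!walkers_take_lock)
		return false;

	pthread_mutex_lock(&model_lock);
	if (slot->table) {
		size_t i;

		/* Only uniform tables can be represented by a single leaf. */
		for (i = 1; i < ENTRIES; i++)
			if (slot->table[i] != slot->table[0])
				break;
		if (i == ENTRIES) {
			slot->leaf = slot->table[0];
			free(slot->table);
			slot->table = NULL;
			collapsed = true;
		}
	}
	pthread_mutex_unlock(&model_lock);
	return collapsed;
}
```

In this model a caller starts from a leaf slot, e.g.
struct pmd_slot s = { .table = NULL, .leaf = 7 }, modifies entries
with model_set_entry(&s, idx, val), and later calls
model_collapse(&s, true). The model leaves out the TLB flush and the
real page-table format; it only mirrors which steps share a lock.
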
Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
arch/x86/mm/pat/set_memory.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d023a40a1e03..ff6e3f612986 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -418,6 +418,16 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 	int collapsed = 0;
 	int i;

+	/*
+	 * debug_pagealloc bypasses cpa_lock, so __change_page_attr() walks
+	 * unserialized and freeing collapsed PTE-tables could race it; skip
+	 * the optional merge there.
+	 */
+	if (debug_pagealloc_enabled())
+		return;
+
+	spin_lock(&cpa_lock);
+
 	if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
 		for (i = 0; i < cpa->numpages; i++)
 			collapsed += collapse_large_pages(__cpa_addr(cpa, i),
@@ -431,8 +441,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 			collapsed += collapse_large_pages(addr, &pgtables);
 	}

-	if (!collapsed)
+	if (!collapsed) {
+		spin_unlock(&cpa_lock);
 		return;
+	}

 	flush_tlb_all();

@@ -440,6 +452,8 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 		list_del(&ptdesc->pt_list);
 		pagetable_free(ptdesc);
 	}
+
+	spin_unlock(&cpa_lock);
 }

 static void cpa_flush(struct cpa_data *cpa, int cache)

base-commit: b81d185839fade27f7c4e885856696cf497d53c1
--
2.53.0
* Re: [PATCH] x86/mm/pat: take cpa_lock around large-page collapse
2026-06-26 16:32 [PATCH] x86/mm/pat: take cpa_lock around large-page collapse Denis V. Lunev
@ 2026-07-02 17:47 ` Denis V. Lunev
2026-07-03 13:01 ` Kiryl Shutsemau
1 sibling, 0 replies; 3+ messages in thread
From: Denis V. Lunev @ 2026-07-02 17:47 UTC
To: Denis V. Lunev, Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Mike Rapoport (Microsoft), Juergen Gross, Kiryl Shutsemau,
linux-kernel
On 6/26/26 18:32, Denis V. Lunev wrote:
> Loading and unloading modules concurrently on several CPUs on a KASAN
> build, with a short delay injected at the CPA page-table lookup to
> widen the window, faults within minutes:
>
> BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
> Write of size 8 at addr ffff888181139718 by task modprobe
> ...
> The buggy address belongs to the physical page:
> pfn:0x181139 ... page_type: f2(table)
>
> cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
> frees the old PTE-table pages, while __change_page_attr() fetches a
> PTE pointer from a lockless lookup_address_in_pgd_attr() and writes
> it with set_pte_atomic() only later. When module text is served from
> a shared large ROX mapping the two run on the same PMD:
>
>   CPU A (module load)              CPU B (module finalize)
>   -------------------              -----------------------
>   execmem_make_temp_rw
>     set_memory_nx
>       __change_page_attr
>         split 2M -> 4K table P
>         kpte = &P[i]  (lockless)
>                                    execmem_restore_rox
>                                      set_memory_rox  (CPA_COLLAPSE)
>                                        cpa_collapse_large_pages
>                                          rebuild leaf PMD
>                                          flush_tlb_all
>                                          pagetable_free(P)
>         set_pte_atomic(kpte, ...)
>           -> writes into freed P
>
> P is a page-table page (page_type: table), reused at once, so the
> write corrupts whatever got the page next: a bad-pte or bad-page
> splat, or a fatal fault once P has been turned into read-only text.
>
> The flush_tlb_all() before the free does not close this: its IPI only
> serializes against page-table walkers that run with interrupts off
> (e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
> on, so nothing stops it from holding a stale pointer into P.
>
> Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
> free - under cpa_lock, the lock __change_page_attr() takes for the
> split path, so a concurrent walker can no longer hold a pointer into
> a table the collapse is about to free.
>
> debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
> map is 4K then, with no large pages to serialize), so the lock cannot
> order the two there. Skip the collapse in that config: it is only an
> optimization, and not freeing the tables leaves the unserialized walk
> nothing to race.
>
> Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
> arch/x86/mm/pat/set_memory.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index d023a40a1e03..ff6e3f612986 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -418,6 +418,16 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
>  	int collapsed = 0;
>  	int i;
>
> +	/*
> +	 * debug_pagealloc bypasses cpa_lock, so __change_page_attr() walks
> +	 * unserialized and freeing collapsed PTE-tables could race it; skip
> +	 * the optional merge there.
> +	 */
> +	if (debug_pagealloc_enabled())
> +		return;
> +
> +	spin_lock(&cpa_lock);
> +
>  	if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
>  		for (i = 0; i < cpa->numpages; i++)
>  			collapsed += collapse_large_pages(__cpa_addr(cpa, i),
> @@ -431,8 +441,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
>  			collapsed += collapse_large_pages(addr, &pgtables);
>  	}
>
> -	if (!collapsed)
> +	if (!collapsed) {
> +		spin_unlock(&cpa_lock);
>  		return;
> +	}
>
>  	flush_tlb_all();
>
> @@ -440,6 +452,8 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
>  		list_del(&ptdesc->pt_list);
>  		pagetable_free(ptdesc);
>  	}
> +
> +	spin_unlock(&cpa_lock);
>  }
>
>  static void cpa_flush(struct cpa_data *cpa, int cache)
>
> base-commit: b81d185839fade27f7c4e885856696cf497d53c1
ping
* Re: [PATCH] x86/mm/pat: take cpa_lock around large-page collapse
2026-06-26 16:32 [PATCH] x86/mm/pat: take cpa_lock around large-page collapse Denis V. Lunev
2026-07-02 17:47 ` Denis V. Lunev
@ 2026-07-03 13:01 ` Kiryl Shutsemau
1 sibling, 0 replies; 3+ messages in thread
From: Kiryl Shutsemau @ 2026-07-03 13:01 UTC
To: Denis V. Lunev
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
Mike Rapoport (Microsoft), Juergen Gross, linux-kernel
On Fri, Jun 26, 2026 at 06:32:11PM +0200, Denis V. Lunev wrote:
> Loading and unloading modules concurrently on several CPUs on a KASAN
> build, with a short delay injected at the CPA page-table lookup to
> widen the window, faults within minutes:
>
> BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
> Write of size 8 at addr ffff888181139718 by task modprobe
> ...
> The buggy address belongs to the physical page:
> pfn:0x181139 ... page_type: f2(table)
>
> cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
> frees the old PTE-table pages, while __change_page_attr() fetches a
> PTE pointer from a lockless lookup_address_in_pgd_attr() and writes
> it with set_pte_atomic() only later. When module text is served from
> a shared large ROX mapping the two run on the same PMD:
>
>   CPU A (module load)              CPU B (module finalize)
>   -------------------              -----------------------
>   execmem_make_temp_rw
>     set_memory_nx
>       __change_page_attr
>         split 2M -> 4K table P
>         kpte = &P[i]  (lockless)
>                                    execmem_restore_rox
>                                      set_memory_rox  (CPA_COLLAPSE)
>                                        cpa_collapse_large_pages
>                                          rebuild leaf PMD
>                                          flush_tlb_all
>                                          pagetable_free(P)
>         set_pte_atomic(kpte, ...)
>           -> writes into freed P
>
> P is a page-table page (page_type: table), reused at once, so the
> write corrupts whatever got the page next: a bad-pte or bad-page
> splat, or a fatal fault once P has been turned into read-only text.
>
> The flush_tlb_all() before the free does not close this: its IPI only
> serializes against page-table walkers that run with interrupts off
> (e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
> on, so nothing stops it from holding a stale pointer into P.
>
> Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
> free - under cpa_lock, the lock __change_page_attr() takes for the
> split path, so a concurrent walker can no longer hold a pointer into
> a table the collapse is about to free.
>
> debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
> map is 4K then, with no large pages to serialize), so the lock cannot
> order the two there. Skip the collapse in that config: it is only an
> optimization, and not freeing the tables leaves the unserialized walk
> nothing to race.
>
> Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
> Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
--
Kiryl Shutsemau / Kirill A. Shutemov