[PATCH] x86/mm/pat: take cpa_lock around large-page collapse

* [PATCH] x86/mm/pat: take cpa_lock around large-page collapse
@ 2026-06-26 16:32 Denis V. Lunev
  2026-07-02 17:47 ` Denis V. Lunev
  2026-07-03 13:01 ` Kiryl Shutsemau
  0 siblings, 2 replies; 3+ messages in thread
From: Denis V. Lunev @ 2026-06-26 16:32 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86
  Cc: den, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Mike Rapoport (Microsoft), Juergen Gross,
	Kiryl Shutsemau, linux-kernel

Loading and unloading modules concurrently on several CPUs on a KASAN
build, with a short delay injected at the CPA page-table lookup to
widen the window, hits a use-after-free within minutes:

  BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
  Write of size 8 at addr ffff888181139718 by task modprobe
  ...
  The buggy address belongs to the physical page:
   pfn:0x181139 ... page_type: f2(table)

cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
frees the old PTE-table pages, while __change_page_attr() fetches a
PTE pointer from a lockless lookup_address_in_pgd_attr() and writes
through it with set_pte_atomic() only later. When module text is
served from a shared large ROX mapping, the two can run on the same
PMD:

  CPU A (module load)              CPU B (module finalize)
  -------------------              -----------------------
  execmem_make_temp_rw
   set_memory_nx
    __change_page_attr
     split 2M -> 4K table P
     kpte = &P[i]  (lockless)
                                   execmem_restore_rox
                                    set_memory_rox (CPA_COLLAPSE)
                                     cpa_collapse_large_pages
                                      rebuild leaf PMD
                                      flush_tlb_all
                                      pagetable_free(P)
     set_pte_atomic(kpte, ...)
       -> writes into freed P

P is a page-table page (page_type: table), reused at once, so the
write corrupts whatever got the page next: a bad-pte or bad-page
splat, or a fatal fault once P has been turned into read-only text.

The flush_tlb_all() before the free does not close this: its IPI only
serializes against page-table walkers that run with interrupts off
(e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
on, so nothing stops it from holding a stale pointer into P.

Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
free - under cpa_lock, the lock __change_page_attr() takes for the
split path, so a concurrent walker can no longer hold a pointer into
a table the collapse is about to free.
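
As a rough illustration of the ordering this gives, here is a
single-threaded userspace replay of the interleaving from the diagram.
It is a sketch, not kernel code: a boolean stands in for cpa_lock, a
heap buffer stands in for the PTE table P, and every name in it
(pmd_model, trylock_model, try_collapse, replay_interleaving) is
hypothetical. A real spin_lock() would spin where the model's trylock
returns false.

```c
/*
 * Single-threaded replay of the interleaving, NOT kernel code.
 * lock_model stands in for cpa_lock; every name here is hypothetical.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NPTES 512

struct pmd_model {
	unsigned long *table;	/* non-NULL while split into a 4K table */
};

static bool lock_model;		/* true while the modelled lock is held */

static bool trylock_model(void)
{
	if (lock_model)
		return false;
	lock_model = true;
	return true;
}

static void unlock_model(void)
{
	assert(lock_model);
	lock_model = false;
}

/* Collapse step: detach and free the table, but only under the lock. */
static bool try_collapse(struct pmd_model *pmd)
{
	if (!trylock_model())
		return false;	/* a real spin_lock() would wait here */
	free(pmd->table);
	pmd->table = NULL;
	unlock_model();
	return true;
}

/* Returns 0 if the replay behaves as described above, -1 otherwise. */
int replay_interleaving(void)
{
	struct pmd_model pmd = { calloc(NPTES, sizeof(unsigned long)) };
	unsigned long *kpte;

	if (!pmd.table)
		return -1;

	/* CPU A: take the lock, then look up the PTE pointer. */
	if (!trylock_model())
		return -1;
	kpte = &pmd.table[3];

	/* CPU B: the collapse cannot proceed while A holds the lock. */
	if (try_collapse(&pmd))
		return -1;

	/* CPU A: the delayed write still lands in a live table. */
	*kpte = 0x163UL;
	if (pmd.table[3] != 0x163UL)
		return -1;
	unlock_model();

	/* CPU B: with the lock released, the collapse frees the table. */
	if (!try_collapse(&pmd) || pmd.table != NULL)
		return -1;

	return 0;
}
```

The point of the model is only the ordering: the free cannot slip in
between the lookup and the write once both sides take the same lock.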

debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
map is 4K then, with no large pages to serialize), so the lock cannot
order the two there. Skip the collapse in that config: it is only an
optimization, and not freeing the tables leaves the unserialized walk
nothing to race.
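
The early return can be pictured with a small userspace sketch, again
not kernel code. The flag lock_bypassed is a hypothetical stand-in for
debug_pagealloc_enabled(), and pmd_model, collapse_model and demo_skip
are made-up names. It shows only that a pointer held by an unserialized
walker stays valid as long as the collapse never frees the table.

```c
/*
 * Userspace sketch of the skip, NOT kernel code.
 * lock_bypassed is a hypothetical stand-in for debug_pagealloc_enabled().
 */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NPTES 512

struct pmd_model {
	unsigned long *table;	/* non-NULL while split into a 4K table */
};

static bool lock_bypassed;

/* Returns the number of tables freed by this call. */
static int collapse_model(struct pmd_model *pmd)
{
	if (lock_bypassed)
		return 0;	/* walkers are unserialized: free nothing */
	if (!pmd->table)
		return 0;
	free(pmd->table);
	pmd->table = NULL;
	return 1;
}

/* Returns 0 if the sketch behaves as described, -1 otherwise. */
int demo_skip(void)
{
	struct pmd_model pmd = { calloc(NPTES, sizeof(unsigned long)) };
	unsigned long *kpte;

	if (!pmd.table)
		return -1;

	/* An unserialized walker holds a pointer into the table. */
	kpte = &pmd.table[7];

	/* With the bypass active the collapse is skipped entirely. */
	lock_bypassed = true;
	if (collapse_model(&pmd) != 0)
		return -1;

	/* The walker's late write still hits live memory. */
	*kpte = 0x1234UL;
	if (pmd.table[7] != 0x1234UL)
		return -1;

	/* Without the bypass the table is freed as usual. */
	lock_bypassed = false;
	if (collapse_model(&pmd) != 1 || pmd.table != NULL)
		return -1;

	return 0;
}
```

The cost in the bypass case is that the mapping stays split, which
matches the changelog's point that the merge is only an optimization.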

Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 arch/x86/mm/pat/set_memory.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d023a40a1e03..ff6e3f612986 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -418,6 +418,16 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 	int collapsed = 0;
 	int i;

+	/*
+	 * debug_pagealloc bypasses cpa_lock, so __change_page_attr() walks
+	 * unserialized and freeing collapsed PTE-tables could race it; skip
+	 * the optional merge there.
+	 */
+	if (debug_pagealloc_enabled())
+		return;
+
+	spin_lock(&cpa_lock);
+
 	if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
 		for (i = 0; i < cpa->numpages; i++)
 			collapsed += collapse_large_pages(__cpa_addr(cpa, i),
@@ -431,8 +441,10 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 			collapsed += collapse_large_pages(addr, &pgtables);
 	}

-	if (!collapsed)
+	if (!collapsed) {
+		spin_unlock(&cpa_lock);
 		return;
+	}

 	flush_tlb_all();

@@ -440,6 +452,8 @@ static void cpa_collapse_large_pages(struct cpa_data *cpa)
 		list_del(&ptdesc->pt_list);
 		pagetable_free(ptdesc);
 	}
+
+	spin_unlock(&cpa_lock);
 }

 static void cpa_flush(struct cpa_data *cpa, int cache)

base-commit: b81d185839fade27f7c4e885856696cf497d53c1
-- 
2.53.0

* Re: [PATCH] x86/mm/pat: take cpa_lock around large-page collapse
  2026-06-26 16:32 [PATCH] x86/mm/pat: take cpa_lock around large-page collapse Denis V. Lunev
@ 2026-07-02 17:47 ` Denis V. Lunev
  2026-07-03 13:01 ` Kiryl Shutsemau
  1 sibling, 0 replies; 3+ messages in thread
From: Denis V. Lunev @ 2026-07-02 17:47 UTC (permalink / raw)
  To: Denis V. Lunev, Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Mike Rapoport (Microsoft), Juergen Gross, Kiryl Shutsemau,
	linux-kernel

On 6/26/26 18:32, Denis V. Lunev wrote:
ping

* Re: [PATCH] x86/mm/pat: take cpa_lock around large-page collapse
  2026-06-26 16:32 [PATCH] x86/mm/pat: take cpa_lock around large-page collapse Denis V. Lunev
  2026-07-02 17:47 ` Denis V. Lunev
@ 2026-07-03 13:01 ` Kiryl Shutsemau
  1 sibling, 0 replies; 3+ messages in thread
From: Kiryl Shutsemau @ 2026-07-03 13:01 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Mike Rapoport (Microsoft), Juergen Gross, linux-kernel

On Fri, Jun 26, 2026 at 06:32:11PM +0200, Denis V. Lunev wrote:
> Loading and unloading modules concurrently on several CPUs on a KASAN
> build, with a short delay injected at the CPA page-table lookup to
> widen the window, faults within minutes:
> 
>   BUG: KASAN: use-after-free in __change_page_attr+0x7cc/0x7e0
>   Write of size 8 at addr ffff888181139718 by task modprobe
>   ...
>   The buggy address belongs to the physical page:
>    pfn:0x181139 ... page_type: f2(table)
> 
> cpa_collapse_large_pages() rebuilds a leaf PMD from its 4K PTEs and
> frees the old PTE-table pages, while __change_page_attr() fetches a
> PTE pointer from a lockless lookup_address_in_pgd_attr() and writes
> it with set_pte_atomic() only later. When module text is served from
> a shared large ROX mapping the two run on the same PMD:
> 
>   CPU A (module load)              CPU B (module finalize)
>   -------------------              -----------------------
>   execmem_make_temp_rw
>    set_memory_nx
>     __change_page_attr
>      split 2M -> 4K table P
>      kpte = &P[i]  (lockless)
>                                    execmem_restore_rox
>                                     set_memory_rox (CPA_COLLAPSE)
>                                      cpa_collapse_large_pages
>                                       rebuild leaf PMD
>                                       flush_tlb_all
>                                       pagetable_free(P)
>      set_pte_atomic(kpte, ...)
>        -> writes into freed P
> 
> P is a page-table page (page_type: table), reused at once, so the
> write corrupts whatever got the page next: a bad-pte or bad-page
> splat, or a fatal fault once P has been turned into read-only text.
> 
> The flush_tlb_all() before the free does not close this: its IPI only
> serializes against page-table walkers that run with interrupts off
> (e.g. GUP-fast); the walk in __change_page_attr() runs with interrupts
> on, so nothing stops it from holding a stale pointer into P.
> 
> Serialize the collapse - the PMD rebuild, TLB flush and PTE-table
> free - under cpa_lock, the lock __change_page_attr() takes for the
> split path, so a concurrent walker can no longer hold a pointer into
> a table the collapse is about to free.
> 
> debug_pagealloc bypasses cpa_lock in __change_page_attr() (the direct
> map is 4K then, with no large pages to serialize), so the lock cannot
> order the two there. Skip the collapse in that config: it is only an
> optimization, and not freeing the tables leaves the unserialized walk
> nothing to race.
> 
> Fixes: 41d88484c71c ("x86/mm/pat: restore large ROX pages after fragmentation")
> Signed-off-by: Denis V. Lunev <den@openvz.org>

Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
