* [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU
@ 2026-06-12  4:38 David Carlier
  2026-06-12  4:52 ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: David Carlier @ 2026-06-12  4:38 UTC (permalink / raw)
  To: akpm
  Cc: syzbot+fd95a72470f5a44e464c, David Carlier, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Lu Baolu, linux-mm, linux-kernel

ptdump walks the kernel page tables locklessly through
walk_kernel_page_table_range_lockless().  It only holds the init_mm
mmap lock and the memory hotplug lock, and neither excludes
vmalloc/ioremap teardown from freeing kernel PTE pages via
pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
use-after-free in ptdump_pte_entry() reading a PTE page that was freed
underneath the walk.

Deferring the kernel page table free only batches the TLB flush; it does
not wait for lockless walkers.  Mirror the user page table walk, where
pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
across the lockless kernel walk and wait for a grace period in the kernel
page table free worker before releasing the pages.  A walker then either
observes the cleared PMD and skips the page, or keeps it alive until it
drops the RCU read lock.

Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/pagewalk.c        | 15 ++++++++++++++-
 mm/pgtable-generic.c |  8 ++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..6d9f14f86784 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -655,13 +655,26 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
 		.private	= private,
 		.no_vma		= true
 	};
+	int err;
 
 	if (start >= end)
 		return -EINVAL;
 	if (!check_ops_safe(ops))
 		return -EINVAL;
 
-	return walk_pgd_range(start, end, &walk);
+	/*
+	 * Kernel intermediate page tables can be freed concurrently by
+	 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which routes
+	 * the freed pages through pagetable_free_kernel(). That path defers
+	 * the free past an RCU grace period, so hold the RCU read lock across
+	 * the lockless walk to prevent a page table from being freed while we
+	 * are still dereferencing it.
+	 */
+	rcu_read_lock();
+	err = walk_pgd_range(start, end, &walk);
+	rcu_read_unlock();
+
+	return err;
 }
 
 /**
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c..59e1315185b4 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -434,6 +434,14 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	spin_unlock(&kernel_pgtable_work.lock);
 
 	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
+
+	/*
+	 * Lockless kernel page table walkers (ptdump, and any other user of
+	 * walk_kernel_page_table_range_lockless()) dereference these pages
+	 * under rcu_read_lock(). Wait for a grace period so no walker can
+	 * still be reading a page we are about to free.
+	 */
+	synchronize_rcu();
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
 		__pagetable_free(pt);
 }
-- 
2.53.0



* Re: [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12  4:38 [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU David Carlier
@ 2026-06-12  4:52 ` Matthew Wilcox
  2026-06-12  4:59   ` David CARLIER
  2026-06-12  5:05   ` [PATCH v2] " David Carlier
  0 siblings, 2 replies; 11+ messages in thread
From: Matthew Wilcox @ 2026-06-12  4:52 UTC (permalink / raw)
  To: David Carlier
  Cc: akpm, syzbot+fd95a72470f5a44e464c, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Lu Baolu, linux-mm, linux-kernel

On Fri, Jun 12, 2026 at 05:38:27AM +0100, David Carlier wrote:
> @@ -434,6 +434,14 @@ static void kernel_pgtable_work_func(struct work_struct *work)
>  	spin_unlock(&kernel_pgtable_work.lock);
>  
>  	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
> +
> +	/*
> +	 * Lockless kernel page table walkers (ptdump, and any other user of
> +	 * walk_kernel_page_table_range_lockless()) dereference these pages
> +	 * under rcu_read_lock(). Wait for a grace period so no walker can
> +	 * still be reading a page we are about to free.
> +	 */
> +	synchronize_rcu();
>  	list_for_each_entry_safe(pt, next, &page_list, pt_list)
>  		__pagetable_free(pt);

synchronize_rcu() is rather expensive.  Can't you rcu-free the page
tables instead?  There's an rcu head in struct page.
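
Editorial illustration of the suggestion above: instead of blocking the worker until a grace period ends, each page table is queued with a callback that frees it later. The following is a minimal single-threaded userspace sketch, not kernel code. Every toy_* name is hypothetical, the "grace period" is simply an explicit function call, and the embedded head only stands in for the pt_rcu_head field used in the later revisions of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel's rcu_head and ptdesc; all names are hypothetical. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

struct toy_ptdesc {
	int id;
	struct toy_rcu_head rcu;	/* embedded head, recovered via container_of */
};

#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_rcu_head *toy_pending;	/* callbacks waiting for a grace period */
static int toy_freed;				/* number of descriptors actually released */

/* Queue a callback and return immediately; nothing is freed yet. */
void toy_call_rcu(struct toy_rcu_head *head,
		  void (*func)(struct toy_rcu_head *head))
{
	assert(head && func);
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* Model the end of a grace period: run every queued callback. */
void toy_grace_period_elapsed(void)
{
	struct toy_rcu_head *head = toy_pending;

	toy_pending = NULL;
	while (head) {
		/* Read the link first: the callback frees the memory holding it. */
		struct toy_rcu_head *next = head->next;

		head->func(head);
		head = next;
	}
}

/* Callback: recover the enclosing descriptor from its embedded head, then free it. */
static void toy_pgtable_free_rcu(struct toy_rcu_head *head)
{
	struct toy_ptdesc *pt = toy_container_of(head, struct toy_ptdesc, rcu);

	free(pt);
	toy_freed++;
}

/* Allocate n descriptors and queue each one for deferred freeing. */
int toy_queue_tables(int n)
{
	int queued = 0;

	for (int i = 0; i < n; i++) {
		struct toy_ptdesc *pt = malloc(sizeof(*pt));

		if (!pt)
			break;
		pt->id = i;
		toy_call_rcu(&pt->rcu, toy_pgtable_free_rcu);
		queued++;
	}
	return queued;
}
```

In this model toy_queue_tables() returns without freeing anything, and the memory is released only when toy_grace_period_elapsed() runs. That is the property the suggestion relies on: the caller never blocks, while a reader that started before the queueing would still find the descriptor intact.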



* Re: [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12  4:52 ` Matthew Wilcox
@ 2026-06-12  4:59   ` David CARLIER
  2026-06-12  5:05   ` [PATCH v2] " David Carlier
  1 sibling, 0 replies; 11+ messages in thread
From: David CARLIER @ 2026-06-12  4:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, syzbot+fd95a72470f5a44e464c, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Lu Baolu, linux-mm, linux-kernel

Sure thing. Thanks!

On Fri, 12 Jun 2026 at 05:52, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jun 12, 2026 at 05:38:27AM +0100, David Carlier wrote:
> > @@ -434,6 +434,14 @@ static void kernel_pgtable_work_func(struct work_struct *work)
> >       spin_unlock(&kernel_pgtable_work.lock);
> >
> >       iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
> > +
> > +     /*
> > +      * Lockless kernel page table walkers (ptdump, and any other user of
> > +      * walk_kernel_page_table_range_lockless()) dereference these pages
> > +      * under rcu_read_lock(). Wait for a grace period so no walker can
> > +      * still be reading a page we are about to free.
> > +      */
> > +     synchronize_rcu();
> >       list_for_each_entry_safe(pt, next, &page_list, pt_list)
> >               __pagetable_free(pt);
>
> synchronize_rcu() is rather expensive.  Can't you rcu-free the page
> tables instead?  There's an rcu head in struct page.



* [PATCH v2] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12  4:52 ` Matthew Wilcox
  2026-06-12  4:59   ` David CARLIER
@ 2026-06-12  5:05   ` David Carlier
  2026-06-12 16:12     ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: David Carlier @ 2026-06-12  5:05 UTC (permalink / raw)
  To: akpm
  Cc: syzbot+fd95a72470f5a44e464c, David Carlier, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Lu Baolu, Dave Hansen, linux-mm,
	linux-kernel

ptdump walks the kernel page tables locklessly through
walk_kernel_page_table_range_lockless().  It only holds the init_mm
mmap lock and the memory hotplug lock, and neither excludes
vmalloc/ioremap teardown from freeing kernel PTE pages via
pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
use-after-free in ptdump_pte_entry() reading a PTE page that was freed
underneath the walk.

Deferring the kernel page table free only batches the TLB flush; it does
not wait for lockless walkers.  Mirror the user page table walk, where
pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
across the lockless kernel walk and rcu-free the page tables in the
kernel page table free worker, after the batched TLB flush.  A walker
then either observes the cleared PMD and skips the page, or keeps it
alive until it drops the RCU read lock.

Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/pagewalk.c        | 15 ++++++++++++++-
 mm/pgtable-generic.c | 16 +++++++++++++++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..6d9f14f86784 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -655,13 +655,26 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
 		.private	= private,
 		.no_vma		= true
 	};
+	int err;
 
 	if (start >= end)
 		return -EINVAL;
 	if (!check_ops_safe(ops))
 		return -EINVAL;
 
-	return walk_pgd_range(start, end, &walk);
+	/*
+	 * Kernel intermediate page tables can be freed concurrently by
+	 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which routes
+	 * the freed pages through pagetable_free_kernel(). That path defers
+	 * the free past an RCU grace period, so hold the RCU read lock across
+	 * the lockless walk to prevent a page table from being freed while we
+	 * are still dereferencing it.
+	 */
+	rcu_read_lock();
+	err = walk_pgd_range(start, end, &walk);
+	rcu_read_unlock();
+
+	return err;
 }
 
 /**
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c..5b53e9a5b7f8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -424,6 +424,13 @@ static struct {
 	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
 };
 
+static void kernel_pgtable_free_rcu(struct rcu_head *head)
+{
+	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
+
+	__pagetable_free(pt);
+}
+
 static void kernel_pgtable_work_func(struct work_struct *work)
 {
 	struct ptdesc *pt, *next;
@@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	spin_unlock(&kernel_pgtable_work.lock);
 
 	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
+
+	/*
+	 * Lockless kernel page table walkers (ptdump, and any other user of
+	 * walk_kernel_page_table_range_lockless()) dereference these pages
+	 * under rcu_read_lock(). Free them after a grace period so a walker
+	 * cannot still be reading a page we release.
+	 */
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
-		__pagetable_free(pt);
+		call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
 }
 
 void pagetable_free_kernel(struct ptdesc *pt)
-- 
2.53.0




* Re: [PATCH v2] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12  5:05   ` [PATCH v2] " David Carlier
@ 2026-06-12 16:12     ` Andrew Morton
  2026-06-12 17:21       ` David CARLIER
  2026-06-12 17:23       ` [PATCH v3] " David Carlier
  0 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2026-06-12 16:12 UTC (permalink / raw)
  To: David Carlier
  Cc: syzbot+fd95a72470f5a44e464c, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Lu Baolu, Dave Hansen, linux-mm,
	linux-kernel

On Fri, 12 Jun 2026 06:05:40 +0100 David Carlier <devnexen@gmail.com> wrote:

> ptdump walks the kernel page tables locklessly through
> walk_kernel_page_table_range_lockless().  It only holds the init_mm
> mmap lock and the memory hotplug lock, and neither excludes
> vmalloc/ioremap teardown from freeing kernel PTE pages via
> pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
> use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> underneath the walk.
> 
> Deferring the kernel page table free only batches the TLB flush; it does
> not wait for lockless walkers.  Mirror the user page table walk, where
> pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
> across the lockless kernel walk and rcu-free the page tables in the
> kernel page table free worker, after the batched TLB flush.  A walker
> then either observes the cleared PMD and skips the page, or keeps it
> alive until it drops the RCU read lock.
> 
> ...
>
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -655,13 +655,26 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
>  		.private	= private,
>  		.no_vma		= true
>  	};
> +	int err;
>  
>  	if (start >= end)
>  		return -EINVAL;
>  	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
> -	return walk_pgd_range(start, end, &walk);
> +	/*
> +	 * Kernel intermediate page tables can be freed concurrently by
> +	 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which routes
> +	 * the freed pages through pagetable_free_kernel(). That path defers
> +	 * the free past an RCU grace period, so hold the RCU read lock across
> +	 * the lockless walk to prevent a page table from being freed while we
> +	 * are still dereferencing it.
> +	 */
> +	rcu_read_lock();
> +	err = walk_pgd_range(start, end, &walk);
> +	rcu_read_unlock();
> +
> +	return err;
>  }

Adding a lock to a function which is advertised to "walk the kernel
page tables locklessly" is a bit of a head-spinner.

Sashiko claims that some callback functions can perform sleeping
allocations:

	https://sashiko.dev/#/patchset/20260612050540.31594-1-devnexen@gmail.com





* Re: [PATCH v2] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12 16:12     ` Andrew Morton
@ 2026-06-12 17:21       ` David CARLIER
  2026-06-12 17:23       ` [PATCH v3] " David Carlier
  1 sibling, 0 replies; 11+ messages in thread
From: David CARLIER @ 2026-06-12 17:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: syzbot+fd95a72470f5a44e464c, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Lu Baolu, Dave Hansen, linux-mm,
	linux-kernel

On Fri, 12 Jun 2026 at 17:12, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 12 Jun 2026 06:05:40 +0100 David Carlier <devnexen@gmail.com> wrote:
>
> > ptdump walks the kernel page tables locklessly through
> > walk_kernel_page_table_range_lockless().  It only holds the init_mm
> > mmap lock and the memory hotplug lock, and neither excludes
> > vmalloc/ioremap teardown from freeing kernel PTE pages via
> > pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
> > use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> > underneath the walk.
> >
> > Deferring the kernel page table free only batches the TLB flush; it does
> > not wait for lockless walkers.  Mirror the user page table walk, where
> > pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
> > across the lockless kernel walk and rcu-free the page tables in the
> > kernel page table free worker, after the batched TLB flush.  A walker
> > then either observes the cleared PMD and skips the page, or keeps it
> > alive until it drops the RCU read lock.
> >
> > ...
> >
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -655,13 +655,26 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
> >               .private        = private,
> >               .no_vma         = true
> >       };
> > +     int err;
> >
> >       if (start >= end)
> >               return -EINVAL;
> >       if (!check_ops_safe(ops))
> >               return -EINVAL;
> >
> > -     return walk_pgd_range(start, end, &walk);
> > +     /*
> > +      * Kernel intermediate page tables can be freed concurrently by
> > +      * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which routes
> > +      * the freed pages through pagetable_free_kernel(). That path defers
> > +      * the free past an RCU grace period, so hold the RCU read lock across
> > +      * the lockless walk to prevent a page table from being freed while we
> > +      * are still dereferencing it.
> > +      */
> > +     rcu_read_lock();
> > +     err = walk_pgd_range(start, end, &walk);
> > +     rcu_read_unlock();
> > +
> > +     return err;
> >  }
>
> Adding a lock to a function which is advertised to "walk the kernel
> page tables locklessly" is a bit of a head-spinner.
>
> Sashiko claims that some callback functions can perform sleeping
> allocations:
>
>         https://sashiko.dev/#/patchset/20260612050540.31594-1-devnexen@gmail.com
>
>

Sashiko's right, and it's the same issue you flagged about the name.
arm64's range_split_to_ptes() also goes through
walk_kernel_page_table_range_lockless() and passes GFP_PGTABLE_KERNEL
into split_pmd()/split_pud(), which can sleep, so rcu_read_lock() inside
the lockless helper is wrong on both counts.

v3 leaves that helper lockless and takes rcu_read_lock() only in the
init_mm branch of walk_page_range_debug(), whose sole caller is ptdump.
Its callbacks don't sleep. The arm64 splitters keep relying on their
existing exclusive access guarantee, untouched.

Cheers.



* [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12 16:12     ` Andrew Morton
  2026-06-12 17:21       ` David CARLIER
@ 2026-06-12 17:23       ` David Carlier
  2026-06-12 17:39         ` Matthew Wilcox
  2026-06-12 18:29         ` Lorenzo Stoakes
  1 sibling, 2 replies; 11+ messages in thread
From: David Carlier @ 2026-06-12 17:23 UTC (permalink / raw)
  To: akpm
  Cc: syzbot+fd95a72470f5a44e464c, David Carlier, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Dave Hansen, linux-mm, linux-kernel

ptdump walks the kernel page tables locklessly through
walk_kernel_page_table_range_lockless().  It only holds the init_mm
mmap lock and the memory hotplug lock, and neither excludes
vmalloc/ioremap teardown from freeing kernel PTE pages via
pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
use-after-free in ptdump_pte_entry() reading a PTE page that was freed
underneath the walk.

Deferring the kernel page table free only batches the TLB flush; it does
not wait for lockless walkers.  Mirror the user page table walk, where
pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
across the kernel walk in the init_mm branch of walk_page_range_debug()
and rcu-free the page tables in the kernel page table free worker, after
the batched TLB flush.  ptdump is the only walker that races with these
frees and its callbacks do not sleep, so the lockless walker itself stays
lockless for its other, exclusive-access callers (e.g. the arm64 page
table split paths, which allocate with GFP_PGTABLE_KERNEL and may sleep).
A walker then either observes the cleared PMD and skips the page, or
keeps it alive until it drops the RCU read lock.

Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: David Carlier <devnexen@gmail.com>
---
v3: take rcu_read_lock() only in the init_mm branch of
    walk_page_range_debug() instead of inside
    walk_kernel_page_table_range_lockless().  The lockless helper is also
    reached by the arm64 split paths, which allocate page tables with
    GFP_PGTABLE_KERNEL and can sleep, so it must stay lockless (Andrew,
    Sashiko).
v2: rcu-free the page tables with call_rcu() instead of synchronize_rcu()
    (Matthew Wilcox).
---
 mm/pagewalk.c        | 21 ++++++++++++++++++---
 mm/pgtable-generic.c | 16 +++++++++++++++-
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..dbb443c72353 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -692,9 +692,24 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 	};
 
 	/* For convenience, we allow traversal of kernel mappings. */
-	if (mm == &init_mm)
-		return walk_kernel_page_table_range(start, end, ops,
-						    pgd, private);
+	if (mm == &init_mm) {
+		int err;
+
+		/*
+		 * Kernel intermediate page tables can be freed concurrently by
+		 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which
+		 * routes the freed pages through pagetable_free_kernel(). That
+		 * path defers the free past an RCU grace period, so hold the RCU
+		 * read lock across the walk to prevent a page table from being
+		 * freed while we are still dereferencing it. ptdump is the only
+		 * caller here and its callbacks do not sleep, so this is safe.
+		 */
+		rcu_read_lock();
+		err = walk_kernel_page_table_range(start, end, ops, pgd, private);
+		rcu_read_unlock();
+		return err;
+	}
+
 	if (start >= end || !walk.mm)
 		return -EINVAL;
 	if (!check_ops_safe(ops))
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c..5b53e9a5b7f8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -424,6 +424,13 @@ static struct {
 	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
 };
 
+static void kernel_pgtable_free_rcu(struct rcu_head *head)
+{
+	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
+
+	__pagetable_free(pt);
+}
+
 static void kernel_pgtable_work_func(struct work_struct *work)
 {
 	struct ptdesc *pt, *next;
@@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	spin_unlock(&kernel_pgtable_work.lock);
 
 	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
+
+	/*
+	 * Lockless kernel page table walkers (ptdump, and any other user of
+	 * walk_kernel_page_table_range_lockless()) dereference these pages
+	 * under rcu_read_lock(). Free them after a grace period so a walker
+	 * cannot still be reading a page we release.
+	 */
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
-		__pagetable_free(pt);
+		call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
 }
 
 void pagetable_free_kernel(struct ptdesc *pt)
-- 
2.53.0




* Re: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12 17:23       ` [PATCH v3] " David Carlier
@ 2026-06-12 17:39         ` Matthew Wilcox
  2026-06-12 18:10           ` David CARLIER
  2026-06-12 18:29         ` Lorenzo Stoakes
  1 sibling, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2026-06-12 17:39 UTC (permalink / raw)
  To: David Carlier
  Cc: akpm, syzbot+fd95a72470f5a44e464c, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Dave Hansen, linux-mm, linux-kernel

On Fri, Jun 12, 2026 at 06:23:55PM +0100, David Carlier wrote:
> ptdump walks the kernel page tables locklessly through
> walk_kernel_page_table_range_lockless().  It only holds the init_mm
> mmap lock and the memory hotplug lock, and neither excludes
> vmalloc/ioremap teardown from freeing kernel PTE pages via
> pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
> use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> underneath the walk.

Does it make sense to walk the iomap / vmap ranges in ptdump?  I can't
really tell if this is something that's useful, or something that nobody
thought to exclude.


* Re: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12 17:39         ` Matthew Wilcox
@ 2026-06-12 18:10           ` David CARLIER
  0 siblings, 0 replies; 11+ messages in thread
From: David CARLIER @ 2026-06-12 18:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, syzbot+fd95a72470f5a44e464c, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Dave Hansen, linux-mm, linux-kernel

On Fri, 12 Jun 2026 at 18:39, Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jun 12, 2026 at 06:23:55PM +0100, David Carlier wrote:
> > ptdump walks the kernel page tables locklessly through
> > walk_kernel_page_table_range_lockless().  It only holds the init_mm
> > mmap lock and the memory hotplug lock, and neither excludes
> > vmalloc/ioremap teardown from freeing kernel PTE pages via
> > pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
> > use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> > underneath the walk.
>
> Does it make sense to walk the iomap / vmap ranges in ptdump?  I can't
> really tell if this is something that's useful, or something that nobody
> thought to exclude.

Yes, it's intentional. ptdump_check_wx() walks the whole kernel half
(_PAGE_OFFSET..~0UL on arm64, equivalent on x86) precisely to audit
W+X mappings, and those live in the module/vmalloc/BPF JIT ranges. The
debugfs dump labels the vmalloc and modules markers for the same reason.
Skipping those ranges would defeat the W^X check, so the walk has to
cover them, which is why it needs the RCU protection rather than an
exclusion.
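
Editorial illustration of the W^X audit described above: the check amounts to flagging any mapping that is writable and executable at the same time. The following is a minimal userspace sketch under stated assumptions. It does not reproduce ptdump_check_wx(); the toy_mapping structure, the TOY_PROT_* bits, and toy_count_wx() are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical permission bits for a toy mapping descriptor. */
#define TOY_PROT_WRITE	0x1u
#define TOY_PROT_EXEC	0x2u

struct toy_mapping {
	unsigned long start;
	unsigned long end;	/* exclusive */
	unsigned int prot;
};

/* Count mappings that are writable and executable at the same time. */
size_t toy_count_wx(const struct toy_mapping *maps, size_t n)
{
	const unsigned int wx = TOY_PROT_WRITE | TOY_PROT_EXEC;
	size_t bad = 0;

	assert(maps || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Both bits must be set; either one alone is acceptable. */
		if ((maps[i].prot & wx) == wx)
			bad++;
	}
	return bad;
}
```

An audit of this kind is only meaningful if it sees every range where such a mapping could exist, which is the argument made above for covering the vmalloc and module ranges rather than skipping them.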



* Re: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12 17:23       ` [PATCH v3] " David Carlier
  2026-06-12 17:39         ` Matthew Wilcox
@ 2026-06-12 18:29         ` Lorenzo Stoakes
  2026-06-12 18:48           ` David CARLIER
  1 sibling, 1 reply; 11+ messages in thread
From: Lorenzo Stoakes @ 2026-06-12 18:29 UTC (permalink / raw)
  To: David Carlier
  Cc: akpm, syzbot+fd95a72470f5a44e464c, David Hildenbrand,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Dave Hansen, linux-mm, linux-kernel

Please stop sending new versions of this patch in reply to random
emails...!  Just send respins to the mailing list, not in-reply-to
anything.

Also stop sending respins this quickly. Generally leave a day between
them, especially as a new contributor.

On Fri, Jun 12, 2026 at 06:23:55PM +0100, David Carlier wrote:
> ptdump walks the kernel page tables locklessly through
> walk_kernel_page_table_range_lockless().  It only holds the init_mm

No, through walk_page_range_debug(). ptdump doesn't call
walk_kernel_page_table_range_lockless() directly. There's no need to point
out the specific function it calls in turn.

> mmap lock and the memory hotplug lock, and neither excludes
> vmalloc/ioremap teardown from freeing kernel PTE pages via
> pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a

It's not teardown is it? The only caller is vmap_try_huge_pmd(). So it's
actually trying to install a leaf PMD and freeing the PTE there.

> use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> underneath the walk.

Yep that's a problem.

>
> Deferring the kernel page table free only batches the TLB flush; it does
> not wait for lockless walkers.  Mirror the user page table walk, where

This feels like useless information re: TLB? I'm not sure why you're
talking about it.

Succinctness matters too.

> pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()

Referencing pte_offset_map() seems bizarrely irrelevant here?

> across the kernel walk in the init_mm branch of walk_page_range_debug()
> and rcu-free the page tables in the kernel page table free worker, after
> the batched TLB flush.  ptdump is the only walker that races with these

But you're having to actually make the page table freeing wait for a grace
period to do this, so what has it got to do with mirroring pte_offset_map()?

And I'm not sure we really need to talk about the ordering of freeing
pagetables and TLB flush here at all.

What's happening is you're actually _changing_ the kernel code to wait for
a grace period and then relying on that for the sake of ptdump.

> frees and its callbacks do not sleep, so the lockless walker itself stays
> lockless for its other, exclusive-access callers (e.g. the arm64 page
> table split paths, which allocate with GFP_PGTABLE_KERNEL and may sleep).
> A walker then either observes the cleared PMD and skips the page, or
> keeps it alive until it drops the RCU read lock.

I don't really understand from this why the arm64 page table split path is
safe from the page table being freed?

In any case, this is an utterly unreadable wall of text. Please separate
things into paragraphs so a human being can read them... Don't let the LLM
do this for you.

Make sure you understand what you're doing and write the commit message _in
your own words_.

As I note below you also need to explain why other callers to
walk_kernel_page_table_range_lockless() and walk_kernel_page_table_range()
are safe but ptdump is not?

And why exactly this is the page walker's problem?

>
> Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
> Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
> Assisted-by: Claude:claude-opus-4-8

The commit message + comments all feel generated, please edit them before
sending them, LLMs are _terrible_ at this.

> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
> v3: take rcu_read_lock() only in the init_mm branch of
>     walk_page_range_debug() instead of inside
>     walk_kernel_page_table_range_lockless().  The lockless helper is also
>     reached by the arm64 split paths, which allocate page tables with
>     GFP_PGTABLE_KERNEL and can sleep, so it must stay lockless (Andrew,
>     Sashiko).
> v2: rcu-free the page tables with call_rcu() instead of synchronize_rcu()
>     (Matthew Wilcox).
> ---
>  mm/pagewalk.c        | 21 ++++++++++++++++++---
>  mm/pgtable-generic.c | 16 +++++++++++++++-
>  2 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..dbb443c72353 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -692,9 +692,24 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
>  	};
>
>  	/* For convenience, we allow traversal of kernel mappings. */
> -	if (mm == &init_mm)
> -		return walk_kernel_page_table_range(start, end, ops,
> -						    pgd, private);
> +	if (mm == &init_mm) {
> +		int err;
> +
> +		/*
> +		 * Kernel intermediate page tables can be freed concurrently by
> +		 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which
> +		 * routes the freed pages through pagetable_free_kernel(). That
> +		 * path defers the free past an RCU grace period, so hold the RCU
> +		 * read lock across the walk to prevent a page table from being
> +		 * freed while we are still dereferencing it. ptdump is the only
> +		 * caller here and its callbacks do not sleep, so this is safe.
> +		 */

This is unreadable garbage. Again, please don't let LLMs write comments,
they're awful at it.

And I don't love that you're explicitly saying 'X is the only thing that calls
this and it's safe!'

I mean, other things might call this right? And then this comment just bitrots?

If you go and read the comment in walk_kernel_page_table_range(), you'll see:

"
to prevent the intermediate kernel pages tables belonging to the
specified address range from being freed. The caller should take
other actions to prevent this race.
"

Your patch should explain why other callers don't need to do this and why
ptdump can't itself do something to avoid this?

Why are we doing this just for ptdump?


> +		rcu_read_lock();
> +		err = walk_kernel_page_table_range(start, end, ops, pgd, private);

> +		rcu_read_unlock();
> +		return err;
> +	}

If we do keep this approach we're essentially adding a whole new function
open coded in another function...

I'd rather you add walk_kernel_page_table_range_rcu() or something in that case.
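
Editorial illustration of the shape such a helper could take: a named wrapper that holds the read-side lock for the whole walk, rather than open-coding the lock in the caller. The following is a userspace sketch only. Nothing in the thread defines walk_kernel_page_table_range_rcu(), so every toy_* name here is hypothetical, the lock functions are stubs that merely track nesting depth, and the walker visits one fake entry per 4096-byte step.

```c
#include <assert.h>

/* Userspace stubs: track read-side nesting instead of real RCU. Hypothetical names. */
static int toy_rcu_depth;

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

typedef int (*toy_entry_fn)(unsigned long addr, void *private);

/* Stand-in for the unprotected walker: call fn once per 4096-byte step. */
static int toy_walk_range(unsigned long start, unsigned long end,
			  toy_entry_fn fn, void *private)
{
	if (start >= end)
		return -1;
	for (unsigned long addr = start; addr < end; addr += 4096) {
		int err = fn(addr, private);

		if (err)
			return err;
	}
	return 0;
}

/* Named wrapper: hold the read-side lock across the walk. Callbacks must not sleep. */
int toy_walk_range_rcu(unsigned long start, unsigned long end,
		       toy_entry_fn fn, void *private)
{
	int err;

	toy_rcu_read_lock();
	err = toy_walk_range(start, end, fn, private);
	toy_rcu_read_unlock();
	return err;
}

/* Example callback: count entries visited while the read-side lock is held. */
int toy_count_locked(unsigned long addr, void *private)
{
	int *count = private;

	(void)addr;
	if (toy_rcu_depth > 0)
		(*count)++;
	return 0;
}
```

Keeping the lock inside a separately named function leaves the unprotected walker unchanged for callers that rely on exclusive access, and puts the "callbacks must not sleep" constraint in one documented place.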

> +
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
>  	if (!check_ops_safe(ops))
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index b91b1a98029c..5b53e9a5b7f8 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -424,6 +424,13 @@ static struct {
>  	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
>  };
>
> +static void kernel_pgtable_free_rcu(struct rcu_head *head)
> +{
> +	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
> +
> +	__pagetable_free(pt);
> +}
> +
>  static void kernel_pgtable_work_func(struct work_struct *work)
>  {
>  	struct ptdesc *pt, *next;
> @@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
>  	spin_unlock(&kernel_pgtable_work.lock);
>
>  	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
> +
> +	/*
> +	 * Lockless kernel page table walkers (ptdump, and any other user of
> +	 * walk_kernel_page_table_range_lockless()) dereference these pages

Err wait what? Which other users? And why are you referencing the function
you didn't change?

Note again that the comment for walk_kernel_page_table_range_lockless()
mentions that you need to guarantee 'exclusive access' over the range...

> +	 * under rcu_read_lock(). Free them after a grace period so a walker
> +	 * cannot still be reading a page we release.
> +	 */

In any case this comment is really useless and begging for bit rot.

>  	list_for_each_entry_safe(pt, next, &page_list, pt_list)
> -		__pagetable_free(pt);
> +		call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);

Hm now we're unconditionally waiting a grace period for page table freeing
here just to account for ptdump, this seems... silly?

But this is only used when CONFIG_ASYNC_KERNEL_PGTABLE_FREE is defined, and
nothing in your patch references that?

And if !CONFIG_ASYNC_KERNEL_PGTABLE_FREE then pagetable_free_kernel() is
declared in mm.h as:

static inline void pagetable_free_kernel(struct ptdesc *pt)
{
	__pagetable_free(pt);
}

And everything's broken again isn't it?
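To make the point concrete, here is a small userspace model, not kernel code. It assumes a single global reader count and treats "grace period" as "no readers remain", which is much cruder than real RCU (real grace periods only wait for readers that started before the free was queued). All names (`model_read_lock()`, `free_immediate()`, `free_after_grace_period()`, `struct ptdesc_model`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page table descriptor. */
struct ptdesc_model {
	bool freed;
	struct ptdesc_model *next;
};

static int readers;				/* active read-side sections */
static struct ptdesc_model *pending;		/* frees waiting for readers */

void model_read_lock(void)
{
	readers++;
}

/* Release everything that was deferred while readers were active. */
static void model_drain(void)
{
	while (pending) {
		struct ptdesc_model *pt = pending;

		pending = pt->next;
		pt->freed = true;
	}
}

void model_read_unlock(void)
{
	assert(readers > 0);
	if (--readers == 0)
		model_drain();
}

/* Models the inline variant quoted above: the page is gone at once. */
void free_immediate(struct ptdesc_model *pt)
{
	pt->freed = true;
}

/* Models a free that waits until no reader can still hold the page. */
void free_after_grace_period(struct ptdesc_model *pt)
{
	if (readers == 0) {
		pt->freed = true;
		return;
	}
	pt->next = pending;
	pending = pt;
}
```

With a reader active, `free_immediate()` marks the page freed right away, so the read lock protects nothing, while `free_after_grace_period()` leaves it intact until the reader unlocks. Any fix that relies on the read lock would need the deferred behaviour in both configurations.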

>  }
>
>  void pagetable_free_kernel(struct ptdesc *pt)
> --
> 2.53.0
>

Overall it seems like a ptdump bug that ought to be fixed there? If it's a
broader 'use RCU' change the patch should more broadly alter things to
reflect that.

So far this seems like a hack?

Thanks, Lorenzo



* Re: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
  2026-06-12 18:29         ` Lorenzo Stoakes
@ 2026-06-12 18:48           ` David CARLIER
  0 siblings, 0 replies; 11+ messages in thread
From: David CARLIER @ 2026-06-12 18:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, syzbot+fd95a72470f5a44e464c, David Hildenbrand,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Kevin Tian, Jason Gunthorpe,
	Dave Hansen, linux-mm, linux-kernel

  > It's not teardown is it? The only caller is vmap_try_huge_pmd().

  Right, it's the huge-pmd install collapsing the pte table, not teardown.

  > And if !CONFIG_ASYNC_KERNEL_PGTABLE_FREE then ... everything's broken
  > again isn't it?

  Yeah, that's the real bug. Non-async pagetable_free_kernel() frees
  immediately, so the rcu_read_lock() does nothing. v3 only ever helped the
  async config.

  > now we're unconditionally waiting a grace period ... this seems... silly?

  It only fires for kernel pte tables freed by the huge collapse and by
  hot-remove, not for every page table free. User pte tables are already
  RCU-freed; this just does the same for the kernel ones.
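  As a rough userspace illustration of the deferred-free pattern described here (a callback head embedded in the descriptor and recovered with container_of()), under the assumption that "end of grace period" can be modelled by draining a simple queue. The names `struct cb_head`, `struct pt_model`, `model_call_rcu()` and `model_grace_period_end()` are hypothetical and are not the kernel's `struct rcu_head` or `call_rcu()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for an RCU callback head. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *head);
};

/* Hypothetical stand-in for a page table descriptor with an embedded head. */
struct pt_model {
	int id;
	bool freed;
	struct cb_head rcu;
};

static struct cb_head *queue;	/* callbacks waiting for the grace period */

/* Queue a callback instead of freeing immediately. */
void model_call_rcu(struct cb_head *head, void (*func)(struct cb_head *head))
{
	assert(func != NULL);
	head->func = func;
	head->next = queue;
	queue = head;
}

/* Run every queued callback; returns how many were invoked. */
int model_grace_period_end(void)
{
	int n = 0;

	while (queue) {
		struct cb_head *head = queue;

		queue = head->next;
		head->func(head);
		n++;
	}
	return n;
}

/* Callback: map the head back to its descriptor and release it. */
void pt_model_free_cb(struct cb_head *head)
{
	struct pt_model *pt = container_of(head, struct pt_model, rcu);

	pt->freed = true;
}
```

  Nothing is released at queue time in this model; the descriptors are only marked freed once the queue is drained, which is the property a lockless walker depends on.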

  > why the arm64 page table split path is safe

  It only walks the range it's actively splitting, so nothing frees a table
  under it. Same for the other walk_kernel_page_table_range() users. ptdump
  walks the whole kernel range for the W+X check and can't take exclusive
  access, so it's the one that races.

  > I'd rather you add walk_kernel_page_table_range_rcu()

  Will do.

  I'll resend it as a standalone patch in a day or two.



Thread overview: 11+ messages
2026-06-12  4:38 [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU David Carlier
2026-06-12  4:52 ` Matthew Wilcox
2026-06-12  4:59   ` David CARLIER
2026-06-12  5:05   ` [PATCH v2] " David Carlier
2026-06-12 16:12     ` Andrew Morton
2026-06-12 17:21       ` David CARLIER
2026-06-12 17:23       ` [PATCH v3] " David Carlier
2026-06-12 17:39         ` Matthew Wilcox
2026-06-12 18:10           ` David CARLIER
2026-06-12 18:29         ` Lorenzo Stoakes
2026-06-12 18:48           ` David CARLIER
