Linux-RISC-V Archive on lore.kernel.org
* [PATCH v7] mm: pgtable: free kernel page tables via RCU to fix ptdump UAF
@ 2026-07-02  9:30 David Carlier
  0 siblings, 0 replies; 2+ messages in thread
From: David Carlier @ 2026-07-02  9:30 UTC
  To: Andrew Morton
  Cc: Dev Jain, David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Dave Hansen, Lu Baolu, syzbot+fd95a72470f5a44e464c, linux-mm,
	linux-kernel, linux-riscv, David Carlier

ptdump_walk_pgd() walks the kernel page tables under get_online_mems().
That does not stop vmalloc from freeing a kernel PTE page underneath the
walk.

When vmap_try_huge_pmd() promotes a range to a huge PMD it collapses the
existing PTE table and frees it via pmd_free_pte_page(). On x86, riscv and
powerpc this runs without the init_mm mmap lock; only arm64 takes it, and
not on the block-split path. So ptdump can dereference a just-freed PTE
page, which is the use-after-free syzbot hit in ptdump_pte_entry().

The race is not new. ptdump walks the whole kernel address space, including
ranges other code is actively mapping, so it reads page tables it does not
own. 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
only widened the window; the Fixes tag points there for that reason.

Every other walker works on a range it owns and is the only one mutating
it: set_memory() on arm64/riscv/loongarch, the arm64 block-split path, the
openrisc DMA path and the hugetlb_vmemmap remap. Nothing frees those ranges
concurrently, so they cannot race and do not need RCU. ptdump is the only
walker that traverses ranges it does not own.

Defer the free by an RCU grace period. pagetable_free_kernel() now frees
via call_rcu() in both the async and non-async configs. The async path
still flushes the TLB first, then queues the per-page RCU free. The page
stays valid until any walk that may have observed it drops its RCU read
lock.
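
The ordering guarantee can be illustrated with a single-threaded userspace
toy. This is only a sketch of the idea, not kernel RCU: every toy_* name is
hypothetical, the "grace period" is modelled as "no reader is inside a read
section", and the callback sets a flag instead of freeing memory so the
ordering can be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Hypothetical stand-in for struct ptdesc with an embedded callback head. */
struct toy_ptdesc {
	int freed;			/* set instead of really freeing */
	struct toy_rcu_head rcu;
};

static int toy_readers;			/* readers inside a read section */
static struct toy_rcu_head *toy_pending;	/* callbacks awaiting a grace period */

/* Queue a callback; it must not run while any reader is active. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

static void toy_read_lock(void)
{
	toy_readers++;
}

/* Leaving the last read section ends the toy grace period. */
static void toy_read_unlock(void)
{
	assert(toy_readers > 0);
	if (--toy_readers > 0)
		return;
	while (toy_pending) {
		struct toy_rcu_head *head = toy_pending;

		toy_pending = head->next;
		head->func(head);
	}
}

/* Mirrors the shape of an RCU free callback: recover the object, release it. */
static void toy_pgtable_free_rcu(struct toy_rcu_head *head)
{
	struct toy_ptdesc *pt = (struct toy_ptdesc *)
		((char *)head - offsetof(struct toy_ptdesc, rcu));

	pt->freed = 1;
}
```

A reader that entered its read section before the free was queued still
sees the object intact until it unlocks, which is the property the walk
relies on. Real RCU only waits for readers that began before call_rcu();
the toy is more conservative and waits for all of them.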

On the read side walk_page_range_debug() walks the init_mm range in bounded
chunks, taking rcu_read_lock() around each chunk and calling cond_resched()
between them. A walker either sees the cleared PMD and skips, or keeps the
page alive until it drops the lock; chunking keeps the read section short
on a large kernel address space instead of holding RCU across the whole
walk. The owned-range walkers are unchanged.
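
The chunk boundaries can be checked in userspace. The sketch below copies
only the address arithmetic of the loop; SKETCH_PGDIR_SIZE (1 GiB here),
SKETCH_ALIGN, walk_in_chunks() and the recording callback are hypothetical
names chosen for illustration, and the wrap-around guard is an addition of
this sketch rather than part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 1 GiB per top-level entry, standing in for PGDIR_SIZE. */
#define SKETCH_PGDIR_SIZE	(1UL << 30)

/* Round x up to a multiple of a; a must be a power of two. */
#define SKETCH_ALIGN(x, a)	(((x) + ((a) - 1)) & ~((a) - 1))

typedef int (*chunk_fn)(unsigned long start, unsigned long end, void *priv);

/* Call fn once per chunk; each chunk ends at the next top-level boundary. */
static int walk_in_chunks(unsigned long start, unsigned long end,
			  chunk_fn fn, void *priv)
{
	unsigned long addr = start;

	while (addr < end) {
		unsigned long next = SKETCH_ALIGN(addr + 1, SKETCH_PGDIR_SIZE);
		int err;

		/* Clamp to end; also covers wrap-around (sketch-only guard). */
		if (next > end || next <= addr)
			next = end;

		/* The kernel loop takes rcu_read_lock() here ... */
		err = fn(addr, next, priv);
		/* ... and drops it here, before cond_resched(). */
		if (err)
			return err;

		addr = next;
	}
	return 0;
}

/* Records the chunks it is handed so the boundaries can be inspected. */
struct chunk_log {
	size_t count;
	unsigned long start[8];
	unsigned long end[8];
};

static int record_chunk(unsigned long start, unsigned long end, void *priv)
{
	struct chunk_log *seen = priv;

	if (seen->count == 8)
		return -1;
	seen->start[seen->count] = start;
	seen->end[seen->count] = end;
	seen->count++;
	return 0;
}
```

With these assumptions, a range starting just below the first 1 GiB
boundary and ending just above the second splits into three chunks: a short
head, one full top-level span, and a short tail.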

Drop the mmap_write_lock() in ptdump_walk_pgd(). It never guarded against
this free -- most architectures free the collapsed PTE table without it --
and RCU now provides the synchronisation.

ptdump callbacks run under RCU within a chunk, so they must not sleep. The
arch note_page() and effective_prot() callbacks only format into the
preallocated seq_file buffer; the only GFP_KERNEL marker setup runs before
the walk, and cond_resched() happens between chunks, outside the read lock.

Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: David Carlier <devnexen@gmail.com>
---
v7: no code change; add version tag and per-revision changelog (Dev).
v6: chunk the init_mm walk in walk_page_range_debug() and take
    rcu_read_lock() per chunk (reverting v5's caller-side lock + assert)
    so the read section stays bounded on large kernel address spaces and
    can cond_resched() between chunks; drop the now-redundant
    mmap_write_lock() in ptdump_walk_pgd().
v5: reframe changelog around the pre-existing race and range ownership;
    correct the mmap-lock description (arm64 is the exception, not x86);
    move rcu_read_lock() into ptdump_walk_pgd() and assert it in
    walk_page_range_debug(); drop walk_kernel_page_table_range_rcu(); fix
    the pgtable-generic.c comment; document the no-sleep audit of the
    callbacks.
v4: defer the free in both the async and non-async configs, not just the
    async one; add a walk_kernel_page_table_range_rcu() helper.
v3: take rcu_read_lock() in the init_mm branch of walk_page_range_debug().
v2: use call_rcu() instead of synchronize_rcu().
 include/linux/mm.h   |  7 -------
 mm/pagewalk.c        | 35 ++++++++++++++++++++++++++++++-----
 mm/pgtable-generic.c | 22 +++++++++++++++++++++-
 mm/ptdump.c          |  2 --
 4 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbdd..79408a17a1b0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3695,14 +3695,7 @@ static inline void __pagetable_free(struct ptdesc *pt)
 	__free_pages(page, compound_order(page));
 }
 
-#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
 void pagetable_free_kernel(struct ptdesc *pt);
-#else
-static inline void pagetable_free_kernel(struct ptdesc *pt)
-{
-	__pagetable_free(pt);
-}
-#endif
 /**
  * pagetable_free - Free pagetables
  * @pt:	The page table descriptor
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ee7250ad0571 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -620,7 +620,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
  * Note: Be careful to walk the kernel pages tables, the caller may be need to
  * take other effective approaches (mmap lock may be insufficient) to prevent
  * the intermediate kernel page tables belonging to the specified address range
- * from being freed (e.g. memory hot-remove).
+ * from being freed (e.g. memory hot-remove, vmap huge page promotion).
  */
 int walk_kernel_page_table_range(unsigned long start, unsigned long end,
 		const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
@@ -643,7 +643,7 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
  * Use this function to walk the kernel page tables locklessly. It should be
  * guaranteed that the caller has exclusive access over the range they are
  * operating on - that there should be no concurrent access, for example,
- * changing permissions for vmalloc objects.
+ * changing permissions for vmalloc objects, or vmap huge page promotion.
  */
 int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
 		const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
@@ -692,9 +692,34 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 	};
 
 	/* For convenience, we allow traversal of kernel mappings. */
-	if (mm == &init_mm)
-		return walk_kernel_page_table_range(start, end, ops,
-						    pgd, private);
+	if (mm == &init_mm) {
+		unsigned long addr = start;
+
+		/*
+		 * Walk in bounded chunks so the RCU read lock is never held
+		 * across the whole kernel address space.  A kernel page table
+		 * freed via pagetable_free_kernel() stays valid until the walk
+		 * that may have observed it drops the lock; releasing the lock
+		 * between chunks is safe as no page table pointer is held
+		 * across the gap.
+		 */
+		while (addr < end) {
+			unsigned long next = min(end, ALIGN(addr + 1, PGDIR_SIZE));
+			int err;
+
+			rcu_read_lock();
+			err = walk_kernel_page_table_range(addr, next, ops,
+							   pgd, private);
+			rcu_read_unlock();
+			if (err)
+				return err;
+
+			addr = next;
+			cond_resched();
+		}
+		return 0;
+	}
+
 	if (start >= end || !walk.mm)
 		return -EINVAL;
 	if (!check_ops_safe(ops))
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c..7a32e4821957 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -410,6 +410,13 @@ pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	goto again;
 }
 
+static void kernel_pgtable_free_rcu(struct rcu_head *head)
+{
+	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
+
+	__pagetable_free(pt);
+}
+
 #ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
 static void kernel_pgtable_work_func(struct work_struct *work);
 
@@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	spin_unlock(&kernel_pgtable_work.lock);
 
 	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
+
+	/*
+	 * Debug walkers (ptdump) may walk ranges they do not own and race this
+	 * free, so they walk under rcu_read_lock(). Free after a grace period:
+	 * a walker either already saw the cleared PMD, or keeps the page alive
+	 * until it drops the RCU lock.
+	 */
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
-		__pagetable_free(pt);
+		call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
 }
 
 void pagetable_free_kernel(struct ptdesc *pt)
@@ -446,4 +460,10 @@ void pagetable_free_kernel(struct ptdesc *pt)
 
 	schedule_work(&kernel_pgtable_work.work);
 }
+#else
+void pagetable_free_kernel(struct ptdesc *pt)
+{
+	/* Defer the free by a grace period; see kernel_pgtable_work_func(). */
+	call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
+}
 #endif
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 973020000096..4fc6313f290c 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -177,13 +177,11 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
 	const struct ptdump_range *range = st->range;
 
 	get_online_mems();
-	mmap_write_lock(mm);
 	while (range->start != range->end) {
 		walk_page_range_debug(mm, range->start, range->end,
 				      &ptdump_ops, pgd, st);
 		range++;
 	}
-	mmap_write_unlock(mm);
 	put_online_mems();
 
 	/* Flush out the last page */
-- 
2.53.0


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

* [PATCH v7] mm: pgtable: free kernel page tables via RCU to fix ptdump UAF
@ 2026-07-02  9:28 David Carlier
  0 siblings, 0 replies; 2+ messages in thread
From: David Carlier @ 2026-07-02  9:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dev Jain, David Hildenbrand, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	Dave Hansen, Lu Baolu, syzbot+fd95a72470f5a44e464c, linux-mm,
	linux-kernel, linux-riscv, David Carlier


end of thread, other threads:[~2026-07-02  9:30 UTC | newest]

Thread overview: 2+ messages
2026-07-02  9:30 [PATCH v7] mm: pgtable: free kernel page tables via RCU to fix ptdump UAF David Carlier
  -- strict thread matches above, loose matches on Subject: below --
2026-07-02  9:28 David Carlier
