All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Carlier <devnexen@gmail.com>
To: akpm@linux-foundation.org
Cc: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com,
	David Carlier <devnexen@gmail.com>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Kevin Tian <kevin.tian@intel.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
Date: Fri, 12 Jun 2026 18:23:55 +0100	[thread overview]
Message-ID: <20260612172356.356894-1-devnexen@gmail.com> (raw)
In-Reply-To: <20260612091215.b06dc7dc9dc894a5bfc75429@linux-foundation.org>

ptdump walks the kernel page tables locklessly through
walk_kernel_page_table_range_lockless().  It only holds the init_mm
mmap lock and the memory hotplug lock, and neither excludes
vmalloc/ioremap teardown from freeing kernel PTE pages via
pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a
use-after-free in ptdump_pte_entry() reading a PTE page that was freed
underneath the walk.

Deferring the kernel page table free only batches the TLB flush; it does
not wait for lockless walkers.  Mirror the user page table walk, where
pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()
across the kernel walk in the init_mm branch of walk_page_range_debug()
and rcu-free the page tables in the kernel page table free worker, after
the batched TLB flush.  ptdump is the only walker that races with these
frees and its callbacks do not sleep, so the lockless walker itself stays
lockless for its other, exclusive-access callers (e.g. the arm64 page
table split paths, which allocate with GFP_PGTABLE_KERNEL and may sleep).
A walker then either observes the cleared PMD and skips the page, or
keeps it alive until it drops the RCU read lock.

Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: David Carlier <devnexen@gmail.com>
---
v3: take rcu_read_lock() only in the init_mm branch of
    walk_page_range_debug() instead of inside
    walk_kernel_page_table_range_lockless().  The lockless helper is also
    reached by the arm64 split paths, which allocate page tables with
    GFP_PGTABLE_KERNEL and can sleep, so it must stay lockless (Andrew,
    Sashiko).
v2: rcu-free the page tables with call_rcu() instead of synchronize_rcu()
    (Matthew Wilcox).
---
 mm/pagewalk.c        | 21 ++++++++++++++++++---
 mm/pgtable-generic.c | 16 +++++++++++++++-
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..dbb443c72353 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -692,9 +692,24 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 	};
 
 	/* For convenience, we allow traversal of kernel mappings. */
-	if (mm == &init_mm)
-		return walk_kernel_page_table_range(start, end, ops,
-						    pgd, private);
+	if (mm == &init_mm) {
+		int err;
+
+		/*
+		 * Kernel intermediate page tables can be freed concurrently by
+		 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which
+		 * routes the freed pages through pagetable_free_kernel(). That
+		 * path defers the free past an RCU grace period, so hold the RCU
+		 * read lock across the walk to prevent a page table from being
+		 * freed while we are still dereferencing it. ptdump is the only
+		 * caller here and its callbacks do not sleep, so this is safe.
+		 */
+		rcu_read_lock();
+		err = walk_kernel_page_table_range(start, end, ops, pgd, private);
+		rcu_read_unlock();
+		return err;
+	}
+
 	if (start >= end || !walk.mm)
 		return -EINVAL;
 	if (!check_ops_safe(ops))
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b91b1a98029c..5b53e9a5b7f8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -424,6 +424,13 @@ static struct {
 	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
 };
 
+static void kernel_pgtable_free_rcu(struct rcu_head *head)
+{
+	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
+
+	__pagetable_free(pt);
+}
+
 static void kernel_pgtable_work_func(struct work_struct *work)
 {
 	struct ptdesc *pt, *next;
@@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
 	spin_unlock(&kernel_pgtable_work.lock);
 
 	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
+
+	/*
+	 * Lockless kernel page table walkers (ptdump, and any other user of
+	 * walk_kernel_page_table_range_lockless()) dereference these pages
+	 * under rcu_read_lock(). Free them after a grace period so a walker
+	 * cannot still be reading a page we release.
+	 */
 	list_for_each_entry_safe(pt, next, &page_list, pt_list)
-		__pagetable_free(pt);
+		call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
 }
 
 void pagetable_free_kernel(struct ptdesc *pt)
-- 
2.53.0



  parent reply	other threads:[~2026-06-12 17:24 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12  4:38 [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU David Carlier
2026-06-12  4:52 ` Matthew Wilcox
2026-06-12  4:59   ` David CARLIER
2026-06-12  5:05   ` [PATCH v2] " David Carlier
2026-06-12 16:12     ` Andrew Morton
2026-06-12 17:21       ` David CARLIER
2026-06-12 17:23       ` David Carlier [this message]
2026-06-12 17:39         ` [PATCH v3] " Matthew Wilcox
2026-06-12 18:10           ` David CARLIER
2026-06-12 18:29         ` Lorenzo Stoakes
2026-06-12 18:48           ` David CARLIER

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260612172356.356894-1-devnexen@gmail.com \
    --to=devnexen@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=jgg@ziepe.ca \
    --cc=kevin.tian@intel.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@suse.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.