From: "David Hildenbrand (Arm)" <david@kernel.org>
To: David Carlier <devnexen@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>,
"Liam R . Howlett" <liam@infradead.org>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Paul Walmsley <pjw@kernel.org>,
Palmer Dabbelt <palmer@dabbelt.com>,
Albert Ou <aou@eecs.berkeley.edu>,
Alexandre Ghiti <alex@ghiti.fr>,
Dave Hansen <dave.hansen@linux.intel.com>,
Lu Baolu <baolu.lu@linux.intel.com>,
syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-riscv@lists.infradead.org
Subject: Re: [PATCH] mm: pgtable: free kernel page tables via RCU to fix ptdump UAF
Date: Wed, 1 Jul 2026 17:58:35 +0200 [thread overview]
Message-ID: <560ce7db-1afb-421a-bac3-3710f8f1966e@kernel.org> (raw)
In-Reply-To: <20260630041012.5975-1-devnexen@gmail.com>
On 6/30/26 06:10, David Carlier wrote:
> ptdump_walk_pgd() walks the kernel page tables under get_online_mems() and
> mmap_write_lock(&init_mm). Neither lock stops vmalloc from freeing a kernel
> PTE page underneath the walk.
I guess other directmap modifications would similarly be problematic (assuming
we'd collapse into PMDs for reducing directmap fragmentation, I think some
patches are on the list for that).
Not sure if all of these would go through pagetable_free_kernel(), given that we
have memblock-allocated page tables that are freed entirely differently.
>
> When vmap_try_huge_pmd() promotes a range to a huge PMD it collapses the
> existing PTE table and frees it via pmd_free_pte_page(). On x86, riscv and
> powerpc this runs without the init_mm mmap lock; only arm64 takes it, and not
> on the block-split path. So ptdump can dereference a just-freed PTE page,
> which is the use after free syzbot hit in ptdump_pte_entry().
>
> The race is not new. ptdump walks the whole kernel address space, including
> ranges other code is actively mapping, so it reads page tables it does not
> own. 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
> only widened the window; the Fixes tag points there for that reason.
>
> Every other walker works on a range it owns and is the only one mutating it:
> set_memory() on arm64/riscv/loongarch, the arm64 block-split path, the
> openrisc DMA path and the hugetlb_vmemmap remap. Nothing frees those ranges
> concurrently, so they cannot race and do not need RCU. ptdump is the only
> walker that traverses ranges it does not own.
>
> Defer the free by an RCU grace period. pagetable_free_kernel() now frees via
> call_rcu() in both the async and non-async configs. The async path still
> flushes the TLB first, then queues the per-page RCU free. The page stays
> valid until any walk that may have observed it drops its RCU read lock.
>
> On the read side ptdump_walk_pgd() takes the RCU read lock around the walk,
> and walk_page_range_debug() asserts it with RCU_LOCKDEP_WARN() for the
> init_mm case rather than taking it, matching pagewalk.c convention. A walker
> either sees the cleared PMD and skips, or keeps the page alive until it drops
> the lock. The owned-range walkers are unchanged.
>
> ptdump callbacks now run under RCU, so they must not sleep. The arch
> note_page() and effective_prot() callbacks only format into the preallocated
> seq_file buffer, and the walker does not call cond_resched(); the only
> GFP_KERNEL marker setup runs before the walk.
>
> Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
> Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
> Assisted-by: Claude:claude-opus-4-8
> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
> v5: reframe changelog around the pre-existing race and range ownership;
> correct the mmap-lock description (arm64 is the exception, not x86);
> move rcu_read_lock() into ptdump_walk_pgd() and assert it in
> walk_page_range_debug(); drop walk_kernel_page_table_range_rcu(); fix the
> pgtable-generic.c comment; document the no-sleep audit of the callbacks.
> v4: defer the free in both the async and non async configs, not just
> the async one. Move the walk under a named
> walk_kernel_page_table_range_rcu() helper instead of open coding
> rcu_read_lock() in walk_page_range_debug().
> v3: take rcu_read_lock() in the init_mm branch of
> walk_page_range_debug() rather than inside the lockless walker,
> which the arm64 split paths also use with GFP_PGTABLE_KERNEL and
> can sleep.
> v2: use call_rcu() instead of synchronize_rcu().
> ---
> include/linux/mm.h | 7 -------
> mm/pagewalk.c | 14 +++++++++-----
> mm/pgtable-generic.c | 22 +++++++++++++++++++++-
> mm/ptdump.c | 2 ++
> 4 files changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 485df9c2dbdd..79408a17a1b0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3695,14 +3695,7 @@ static inline void __pagetable_free(struct ptdesc *pt)
> __free_pages(page, compound_order(page));
> }
>
> -#ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> void pagetable_free_kernel(struct ptdesc *pt);
> -#else
> -static inline void pagetable_free_kernel(struct ptdesc *pt)
> -{
> - __pagetable_free(pt);
> -}
> -#endif
Okay, now we always incur a function call overhead. I guess it will be fine,
just saying.
> /**
> * pagetable_free - Free pagetables
> * @pt: The page table descriptor
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..c0be87580989 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -620,7 +620,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> * Note: Be careful to walk the kernel pages tables, the caller may be need to
> * take other effective approaches (mmap lock may be insufficient) to prevent
> * the intermediate kernel page tables belonging to the specified address range
> - * from being freed (e.g. memory hot-remove).
> + * from being freed (e.g. memory hot-remove, vmap huge page promotion).
> */
> int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
> @@ -643,7 +643,7 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> * Use this function to walk the kernel page tables locklessly. It should be
> * guaranteed that the caller has exclusive access over the range they are
> * operating on - that there should be no concurrent access, for example,
> - * changing permissions for vmalloc objects.
> + * changing permissions for vmalloc objects, or vmap huge page promotion.
> */
> int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
> const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
> @@ -692,9 +692,13 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> };
>
> /* For convenience, we allow traversal of kernel mappings. */
> - if (mm == &init_mm)
> - return walk_kernel_page_table_range(start, end, ops,
> - pgd, private);
> + if (mm == &init_mm) {
> + RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
> + "RCU read lock must be held across kernel page table walk");
> + return walk_kernel_page_table_range(start, end, ops, pgd,
> + private);
> + }
> +
> if (start >= end || !walk.mm)
> return -EINVAL;
> if (!check_ops_safe(ops))
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index b91b1a98029c..7a32e4821957 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -410,6 +410,13 @@ pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> goto again;
> }
>
> +static void kernel_pgtable_free_rcu(struct rcu_head *head)
> +{
> + struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
> +
> + __pagetable_free(pt);
> +}
> +
> #ifdef CONFIG_ASYNC_KERNEL_PGTABLE_FREE
> static void kernel_pgtable_work_func(struct work_struct *work);
>
> @@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
> spin_unlock(&kernel_pgtable_work.lock);
>
> iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
> +
> + /*
> + * Debug walkers (ptdump) may walk ranges they do not own and race this
> + * free, so they walk under rcu_read_lock(). Free after a grace period:
> + * a walker either already saw the cleared PMD, or keeps the page alive
> + * until it drops the RCU lock.
> + */
> list_for_each_entry_safe(pt, next, &page_list, pt_list)
> - __pagetable_free(pt);
> + call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
> }
>
> void pagetable_free_kernel(struct ptdesc *pt)
> @@ -446,4 +460,10 @@ void pagetable_free_kernel(struct ptdesc *pt)
>
> schedule_work(&kernel_pgtable_work.work);
> }
> +#else
> +void pagetable_free_kernel(struct ptdesc *pt)
> +{
> + /* Defer the free by a grace period; see kernel_pgtable_work_func(). */
> + call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);
> +}
> #endif
> diff --git a/mm/ptdump.c b/mm/ptdump.c
> index 973020000096..50cd96a33dfd 100644
> --- a/mm/ptdump.c
> +++ b/mm/ptdump.c
> @@ -178,11 +178,13 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
>
> get_online_mems();
> mmap_write_lock(mm);
> + rcu_read_lock();
> while (range->start != range->end) {
> walk_page_range_debug(mm, range->start, range->end,
> &ptdump_ops, pgd, st);
> range++;
> }
> + rcu_read_unlock();
Any reason that is not done inside walk_page_range_debug() ? Dropping the lock
in between should not be a concern?
--
Cheers,
David
prev parent reply other threads:[~2026-07-01 15:58 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-30 4:10 [PATCH] mm: pgtable: free kernel page tables via RCU to fix ptdump UAF David Carlier
2026-07-01 5:42 ` Dev Jain
2026-07-01 5:57 ` David CARLIER
2026-07-01 12:26 ` Dev Jain
2026-07-01 15:58 ` David Hildenbrand (Arm) [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=560ce7db-1afb-421a-bac3-3710f8f1966e@kernel.org \
--to=david@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=alex@ghiti.fr \
--cc=aou@eecs.berkeley.edu \
--cc=baolu.lu@linux.intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=devnexen@gmail.com \
--cc=liam@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-riscv@lists.infradead.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=palmer@dabbelt.com \
--cc=pjw@kernel.org \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox