Re: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Lorenzo Stoakes <ljs@kernel.org>
To: David Carlier <devnexen@gmail.com>
Cc: akpm@linux-foundation.org,
	 syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com,
	David Hildenbrand <david@kernel.org>,
	 "Liam R. Howlett" <liam@infradead.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	 Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	 Michal Hocko <mhocko@suse.com>,
	Kevin Tian <kevin.tian@intel.com>,
	 Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] mm: pgtable: protect lockless kernel page table walks with RCU
Date: Fri, 12 Jun 2026 19:29:27 +0100	[thread overview]
Message-ID: <aixGczWMAWoi0e5U@lucifer> (raw)
In-Reply-To: <20260612172356.356894-1-devnexen@gmail.com>

Please stop sending new versions of this patch in reply to random
emails...!  Just send respins to the mailing list, not in-reply-to
anything.

Also stop sending respins this quickly. Generally leave a day between
especially as a new contributor.

On Fri, Jun 12, 2026 at 06:23:55PM +0100, David Carlier wrote:
> ptdump walks the kernel page tables locklessly through
> walk_kernel_page_table_range_lockless().  It only holds the init_mm

No, through walk_page_range_debug(). ptdump doesn't call
walk_kernel_page_table_range_lockless() directly. There's no need to point
out the specifically function it calls in turn.

> mmap lock and the memory hotplug lock, and neither excludes
> vmalloc/ioremap teardown from freeing kernel PTE pages via
> pmd_free_pte_page() -> pagetable_free_kernel().  syzbot hit a

It's not teardown is it? The only caller is vmap_try_huge_pmd(). So it's
actually trying to install a leaf PMD and freeing the PTE there.

> use-after-free in ptdump_pte_entry() reading a PTE page that was freed
> underneath the walk.

Yep that's a problem.

>
> Deferring the kernel page table free only batches the TLB flush; it does
> not wait for lockless walkers.  Mirror the user page table walk, where

This feels like useless information re: TLB? I'm not sure why you're
talking about it.

Succinctness matters too.

> pte_offset_map() already takes the RCU read lock: hold rcu_read_lock()

Referencing pte_offset_map() seems bizarrely irrelevant here?

> across the kernel walk in the init_mm branch of walk_page_range_debug()
> and rcu-free the page tables in the kernel page table free worker, after
> the batched TLB flush.  ptdump is the only walker that races with these

But you're having to actually make the page table freeing wait for a grace
period to do this so what has it got to do with mirorring pte_offset_map()?

And I'm not sure we really need to talk about the ordering of freeing
pagetables and TLB flush here at all.

What's happening is you're actually _changing_ the kernel code to wait for
a grace period and then relying on that for the sake of ptdump.

> frees and its callbacks do not sleep, so the lockless walker itself stays
> lockless for its other, exclusive-access callers (e.g. the arm64 page
> table split paths, which allocate with GFP_PGTABLE_KERNEL and may sleep).
> A walker then either observes the cleared PMD and skips the page, or
> keeps it alive until it drops the RCU read lock.

I don't really understand from this why the arm64 page table split path is
safe from the apge table being freed?

In any case, this is an utterly unreadable wall of text. Please separate
things into paragraphs so a human being can read them... Don't let the LLM
do this for you.

Make sure you understand what you're doing and write the commit message _in
your own words_.

As I note below you also need to explain why other callers to
walk_kernel_page_table_range_lockless() and walk_kernel_page_table_range()
are safe but ptdump is not?

And why exactly this is the page walker's problem?

>
> Fixes: 5ba2f0a15564 ("mm: introduce deferred freeing for kernel page tables")
> Reported-by: syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/6a287988.39669fcc.33b062.00a0.GAE@google.com/T/
> Assisted-by: Claude:claude-opus-4-8

The commit message + comments all feel generated, please edit them before
sending them, LLMs are _terrible_ at this.

> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
> v3: take rcu_read_lock() only in the init_mm branch of
>     walk_page_range_debug() instead of inside
>     walk_kernel_page_table_range_lockless().  The lockless helper is also
>     reached by the arm64 split paths, which allocate page tables with
>     GFP_PGTABLE_KERNEL and can sleep, so it must stay lockless (Andrew,
>     Sashiko).
> v2: rcu-free the page tables with call_rcu() instead of synchronize_rcu()
>     (Matthew Wilcox).
> ---
>  mm/pagewalk.c        | 21 ++++++++++++++++++---
>  mm/pgtable-generic.c | 16 +++++++++++++++-
>  2 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 3ae2586ff45b..dbb443c72353 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -692,9 +692,24 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
>  	};
>
>  	/* For convenience, we allow traversal of kernel mappings. */
> -	if (mm == &init_mm)
> -		return walk_kernel_page_table_range(start, end, ops,
> -						    pgd, private);
> +	if (mm == &init_mm) {
> +		int err;
> +
> +		/*
> +		 * Kernel intermediate page tables can be freed concurrently by
> +		 * vmalloc/ioremap teardown (e.g. pmd_free_pte_page()), which
> +		 * routes the freed pages through pagetable_free_kernel(). That
> +		 * path defers the free past an RCU grace period, so hold the RCU
> +		 * read lock across the walk to prevent a page table from being
> +		 * freed while we are still dereferencing it. ptdump is the only
> +		 * caller here and its callbacks do not sleep, so this is safe.
> +		 */

This is unreadable garbage. Again, please don't let LLMs write comments,
they're awful at it.

And I don't love that you're explicitly saying 'X is the only thing that calls
this and it's safe!'

I mean, other things might call this right? And then this comment just bitrots?

If you go and read the comment in walk_kernel_page_table_range(), you'll see:

"
to prevent the intermediate kernel pages tables belonging to the
specified address range from being freed. The caller should take
other actions to prevent this race.
"

Your patch should explain why other callers don't need to do this and why
ptdump can't itself do something to avoid this?

Why are we doing this just for ptdump?

> +		rcu_read_lock();
> +		err = walk_kernel_page_table_range(start, end, ops, pgd, private);

> +		rcu_read_unlock();
> +		return err;
> +	}

If we do keep this approach we're essentially adding a whole new function
open coded in another function...

I'd rather you add walk_kernel_page_table_range_rcu() or something in that case.

> +
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
>  	if (!check_ops_safe(ops))
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index b91b1a98029c..5b53e9a5b7f8 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -424,6 +424,13 @@ static struct {
>  	.work = __WORK_INITIALIZER(kernel_pgtable_work.work, kernel_pgtable_work_func),
>  };
>
> +static void kernel_pgtable_free_rcu(struct rcu_head *head)
> +{
> +	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);
> +
> +	__pagetable_free(pt);
> +}
> +
>  static void kernel_pgtable_work_func(struct work_struct *work)
>  {
>  	struct ptdesc *pt, *next;
> @@ -434,8 +441,15 @@ static void kernel_pgtable_work_func(struct work_struct *work)
>  	spin_unlock(&kernel_pgtable_work.lock);
>
>  	iommu_sva_invalidate_kva_range(PAGE_OFFSET, TLB_FLUSH_ALL);
> +
> +	/*
> +	 * Lockless kernel page table walkers (ptdump, and any other user of
> +	 * walk_kernel_page_table_range_lockless()) dereference these pages

Err wait what? Which other users? And why are you referencing the function
you didn't change?

Note again that the comment for walk_kernel_page_table_range_lockless()
mentions that you need to guarantee 'exclusive access' over the range...

> +	 * under rcu_read_lock(). Free them after a grace period so a walker
> +	 * cannot still be reading a page we release.
> +	 */

In any case this comment is really useless and begging for bit rot.

>  	list_for_each_entry_safe(pt, next, &page_list, pt_list)
> -		__pagetable_free(pt);
> +		call_rcu(&pt->pt_rcu_head, kernel_pgtable_free_rcu);

Hm now we're unconditionally waiting a grace period for page table freeing
here just to account for ptdump, this seems... silly?

But this is only used when CONFIG_ASYNC_KERNEL_PGTABLE_FREE is defined, and
nothing in your patch references that?

And if !CONFIG_ASYNC_KERNEL_PGTABLE_FREE then pagetable_free_kernel() is
declared in mm.h as:

static inline void pagetable_free_kernel(struct ptdesc *pt)
{
	__pagetable_free(pt);
}

And everything's broken again isn't it?

>  }
>
>  void pagetable_free_kernel(struct ptdesc *pt)
> --
> 2.53.0
>

Overall it seems like a ptdump bug that ought to be fixed there? If it's a
broader 'use RCU' change the patch should more broadly alter things to
reflect that.

So far this seems like a hack?

Thanks, Lorenzo

next prev parent reply	other threads:[~2026-06-12 18:29 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12  4:38 [PATCH] mm: pgtable: protect lockless kernel page table walks with RCU David Carlier
2026-06-12  4:52 ` Matthew Wilcox
2026-06-12  4:59   ` David CARLIER
2026-06-12  5:05   ` [PATCH v2] " David Carlier
2026-06-12 16:12     ` Andrew Morton
2026-06-12 17:21       ` David CARLIER
2026-06-12 17:23       ` [PATCH v3] " David Carlier
2026-06-12 17:39         ` Matthew Wilcox
2026-06-12 18:10           ` David CARLIER
2026-06-12 18:29         ` Lorenzo Stoakes [this message]
2026-06-12 18:48           ` David CARLIER

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aixGczWMAWoi0e5U@lucifer \
    --to=ljs@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@kernel.org \
    --cc=devnexen@gmail.com \
    --cc=jgg@ziepe.ca \
    --cc=kevin.tian@intel.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=syzbot+fd95a72470f5a44e464c@syzkaller.appspotmail.com \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox