Re: mm: opaque hardware page-table entry handles

Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Usama Anjum <usama.anjum@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <liam@infradead.org>,
	Mike Rapoport <rppt@kernel.org>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Samuel Holland <samuel.holland@sifive.com>
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: mm: opaque hardware page-table entry handles
Date: Wed, 1 Jul 2026 22:56:58 +0200	[thread overview]
Message-ID: <31d36023-d728-4eee-90f8-158c7066f565@kernel.org> (raw)
In-Reply-To: <74182e50-b54f-4d2d-a27f-3a59a538d6bc@arm.com>

On 6/24/26 16:09, Usama Anjum wrote:
> Hi all,

Hi!

> 
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
> 
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
> 
>   1. a page-table entry value (pte_t, pmd_t, ...)
>   2. a pointer to an entry value, e.g. a pXX_t on the stack
>   3. a pointer to a live entry in the hardware page table
> 
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.

Yes, I just stumbled over that myself while working on Levi on some folded page
table optimizations for pdgp_get() and friends.

The stack usage is nasty. Calling ptep_get() on stack values makes no sense.
Reading actual page table values without ptep_get() is suboptimal. Punching
stack pointers into functions that don't expect the, is shaky.

> 
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
> 
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
> 
>     /* opaque handle to a HW page-table entry; not dereferenceable */
>     typedef struct {
> 	pte_t *ptr;
>     } hw_ptep;

I guess the proper way of doing it would really be for hw_ptes to have a
distinct type, to completely decouple both concepts.

That's where the fun begins :(

We'd need hw_ptep++ to jump to the next entry in the page table. Assuming we're
on 32bit and have 64bit entries, would that work with the hw_ptep? hw_pte_next()
is rather nasty.

So, similar to what Pedro says

typedef struct {
	pte_t __pte;
} hw_pte_t;

And then simply use

hw_pte_t *hptep;


> 
> With this:
> 
>   - a stack value can no longer masquerade as a hardware table entry,

Right. What we don't care about is if someone deliberately would instantiate a
hw_pte_t above on the stack. We can catch that more easily.

>   - a hardware handle can no longer be raw-dereferenced,

That's the important part, yes.

>   - cases that genuinely operate on a value can be refactored to pass the value
>     and let the caller, which knows whether it holds a handle or a stack copy,
>     read it once.

The question is if these cases really just support one type of pointer (I assume
so).

> 
> The overload becomes a compile-time type error instead of a silent runtime bug,
> and converting the tree forces every such site to be made explicit. This gives
> us a framework where the architecture can completely virtualize the pgtable if
> it likes; and the compiler can enforce that higher level code can't accidentally
> work around it.
> 
> It is opt-in by architectures and incremental. The generic definition is
> just an alias, so arches that do not care build unchanged:
> 
>     typedef pte_t *hw_ptep;

Like Pedro says, pointer typedefs are really nasty.

> 
> An arch flips to the strong struct type when it is ready, and only then does
> it get the stronger checking. This lets the conversion land gradually.
> 
> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
> features that need tighter control over how page tables are accessed and
> manipulated.
> 
> Getter flavours
> ---------------
> While converting, it is useful to have two accessor flavours at each level:
> 
>   - pXXp_get(hw_ptep)        plain C dereference (compiler may optimize)

That's just what we have. Defaults to READ_ONCE().

>   - pXXp_get_once(hw_ptep)   single-copy-atomic, not torn, elided or
>                              duplicated by the compiler

Why do we need this and what would we use it for?

> 
> Keeping them distinct simplifies the conversion and avoids re-introducing the
> class of lockless-read bugs seen on 32-bit.
> 
> Example conversion
> ------------------
> Most of the conversion is mechanical.
> 
>   -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>   -		pte_t *ptep, pte_t pte, unsigned int nr)
>   +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>   +		hw_ptep ptep, pte_t pte, unsigned int nr)

hw_pte_t *ptep, pte_t pte, unsigned int nr)

or (with sw ptep)

pte_t *ptep, pte_t pte, unsigned int nr)

>    {
>    	page_table_check_ptes_set(mm, addr, ptep, pte, nr);
>    	for (;;) {
>    		set_pte(ptep, pte);
>    		if (--nr == 0)
>    			break;
>   -		ptep++;
>   +		ptep = hw_pte_next(ptep);

We should really just let ptep++ work as before.

>    		pte = pte_next_pfn(pte);
>    	}
>    }
> 
> The bulk of work is this kind of rote substitution. The genuine work is the
> handful of sites that turn out to be operating on a stack copy rather than a
> live entry - those are exactly the ones the new type forces us to surface and 
> fix.
> 
> Estimated churn:
> ----------------
> Half way through the prototyping converting only PTE and PMD levels:
>   77 files changed, +1801 / -1425
>   ~57 files reference the new types
> 
> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
> converted; expect meaningfully more churn than the numbers above.
> 
> Introduce the type as an alias, convert one helper family per patch, and flip
> an arch to the strong type last - with non-opted arches building unchanged at
> every step.
> 
> Open questions
> --------------
>   - Is the type-safety + future-feature enablement worth the churn?

We have to minimize the churn. But yes, we really have to find a way to stop
ptep_get() and friends getting used on stack variables, or *ptep getting used
without ptep_get().

We have object_is_on_stack(), but that doesn't really allow for compile-time
checks ... and I don't know how safe it is in general.

>   - Naming: hw_ptep/hw_pmdp vs something else?

Really avoid ptep typedefs.

>   - Should all five levels be converted before merging anything, or is a staged
>     PTE-and-PMD then landing others acceptable?
>   - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
>     level?

I'm still not sure about the _once() really, and if we need that right now. We
survived without is so far, why do we need it now?


-- 
Cheers,

David

     prev parent reply	other threads:[~2026-07-01 20:57 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-24 14:09 mm: opaque hardware page-table entry handles Usama Anjum
2026-06-24 15:52 ` Zi Yan
2026-06-24 22:39   ` Muhammad Usama Anjum
2026-06-24 19:25 ` Pedro Falcato
2026-06-25 10:50   ` Muhammad Usama Anjum
2026-06-25 11:08     ` Pedro Falcato
2026-06-25 12:15       ` Muhammad Usama Anjum
2026-07-01 20:56 ` David Hildenbrand (Arm) [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=31d36023-d728-4eee-90f8-158c7066f565@kernel.org \
    --to=david@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=catalin.marinas@arm.com \
    --cc=liam@infradead.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=samuel.holland@sifive.com \
    --cc=usama.anjum@arm.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox