* mm: opaque hardware page-table entry handles
@ 2026-06-24 14:09 Usama Anjum
  2026-06-24 15:52 ` Zi Yan
  2026-06-24 19:25 ` Pedro Falcato
  0 siblings, 2 replies; 3+ messages in thread
From: Usama Anjum @ 2026-06-24 14:09 UTC (permalink / raw)
  To: Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
	Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
	Catalin Marinas, Will Deacon, Samuel Holland
  Cc: usama.anjum, linux-mm, linux-arm-kernel, linux-kernel

Hi all,

This is a direction check with the wider community before I spend time on
development. It picks up the idea that was raised and broadly agreed on in the
earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].

The problem
-----------
Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
pmd_t *, pud_t *, ...) in places, implicitly assuming a single, uniform
representation. Sprinkling getters wouldn't solve the problem entirely. The
problem is one level up: the *pointer type* itself is overloaded. At each level
there are really three distinct things:

  1. a page-table entry value (pte_t, pmd_t, ...)
  2. a pointer to an entry value, e.g. a pXX_t on the stack
  3. a pointer to a live entry in the hardware page table

Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on - so the
compiler cannot distinguish a pointer into a live table from a pointer to a
stack copy. Passing the stack pointer to an arch helper that expects a
hardware-entry pointer compiles fine, but is wrong - a bug class the type
system makes invisible. It also blocks evolution: an arch helper may need to
read beyond the addressed entry (e.g. adjacent or contiguous entries), which
only makes sense for a real page-table pointer, not a stack copy.
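
To make the bug class concrete, here is a minimal userspace sketch. It is not
kernel code: pte_t is a stand-in struct, and demo_entry_and_next_present() is a
hypothetical arch-style helper invented for illustration. The point is only
that both call sites have the same signature, so the compiler accepts either.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's pte_t; illustration only. */
typedef struct { uint64_t val; } pte_t;

/*
 * Hypothetical arch-style helper: it assumes ptep points INTO a page
 * table, so it also looks at the neighbouring entry.
 */
static int demo_entry_and_next_present(pte_t *ptep)
{
	return (ptep[0].val & 1) && (ptep[1].val & 1);
}

/* Correct use: the pointer really does point into a table. */
static int demo_live_table_call(void)
{
	pte_t table[4] = { {1}, {1}, {0}, {0} };

	return demo_entry_and_next_present(&table[0]);
}

/*
 * The problematic use has exactly the same type and compiles cleanly:
 *
 *	pte_t copy = table[0];
 *	demo_entry_and_next_present(&copy);	// reads past 'copy'
 *
 * It is left as a comment because executing it is undefined behaviour.
 */
```

Nothing in the type of the argument tells the helper which of the two cases it
was handed, which is the overload described above.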

The idea
--------
Give (3) its own opaque type that cannot be dereferenced:

    /* opaque handle to a HW page-table entry; not dereferenceable */
    typedef struct {
	pte_t *ptr;
    } hw_ptep;

With this:

  - a stack value can no longer masquerade as a hardware table entry,
  - a hardware handle can no longer be raw-dereferenced,
  - cases that genuinely operate on a value can be refactored to pass the value
    and let the caller, which knows whether it holds a handle or a stack copy,
    read it once.

The overload becomes a compile-time type error instead of a silent runtime bug,
and converting the tree forces every such site to be made explicit. This gives
us a framework where the architecture can completely virtualize the pgtable if
it likes; and the compiler can enforce that higher level code can't accidentally
work around it.
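
A rough userspace sketch of how the strong type changes call sites, assuming
stand-in types and hypothetical demo_* accessors (none of these names are
proposed API apart from hw_ptep itself). By convention only arch code would
touch the .ptr member; everything else goes through accessors or works on
values.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;		/* userspace stand-in */
typedef struct { pte_t *ptr; } hw_ptep;		/* strong handle, as proposed */

/* Hypothetical constructor: the only way generic code obtains a handle. */
static inline hw_ptep demo_hw_ptep_from_table(pte_t *table, unsigned int idx)
{
	return (hw_ptep){ .ptr = &table[idx] };
}

/* Hypothetical accessors that take the handle, not a raw pointer. */
static inline pte_t demo_ptep_get(hw_ptep p) { return *p.ptr; }
static inline void demo_set_pte(hw_ptep p, pte_t pte) { *p.ptr = pte; }

/* Value-based helper: works on a copy, wherever that copy came from. */
static inline int demo_pte_present(pte_t pte) { return pte.val & 1; }

static int demo_roundtrip(void)
{
	pte_t table[2] = { {0}, {0} };
	hw_ptep h = demo_hw_ptep_from_table(table, 1);

	demo_set_pte(h, (pte_t){ .val = 0x1001 });
	pte_t copy = demo_ptep_get(h);	/* read once, then use the value */

	/* demo_ptep_get(&copy);  would not compile: pte_t * is not hw_ptep */
	/* *h;                    would not compile: h is a struct          */
	return demo_pte_present(copy) && table[1].val == 0x1001;
}
```

The two commented-out lines are the cases from the list above: a stack copy
passed as a handle, and a raw dereference of a handle.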

The scheme is opt-in per architecture and incremental. The generic definition
is just an alias, so arches that do not care build unchanged:

    typedef pte_t *hw_ptep;

An arch flips to the strong struct type when it is ready, and only then does
it get the stronger checking. This lets the conversion land gradually.
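
One way the opt-in could be wired up, sketched in userspace C. The
DEMO_STRONG_HW_PTEP switch and the demo_* macros are assumptions made for this
example, not names from the prototype. The same generic code builds in both
modes; only the strong mode rejects raw pte_t * arguments.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;		/* userspace stand-in */

/* DEMO_STRONG_HW_PTEP is a hypothetical per-arch opt-in switch. */
#ifdef DEMO_STRONG_HW_PTEP
typedef struct { pte_t *ptr; } hw_ptep;		/* strong: distinct type */
#define demo_hw_ptep_raw(h)	((h).ptr)
#define demo_mk_hw_ptep(p)	((hw_ptep){ .ptr = (p) })
#else
typedef pte_t *hw_ptep;				/* generic: plain alias */
#define demo_hw_ptep_raw(h)	(h)
#define demo_mk_hw_ptep(p)	(p)
#endif

/* Generic code is written once against the handle type. */
static inline pte_t demo_ptep_get(hw_ptep h)
{
	return *demo_hw_ptep_raw(h);
}

static uint64_t demo_read_first(void)
{
	pte_t table[1] = { { 0x42 } };
	hw_ptep h = demo_mk_hw_ptep(&table[0]);

	return demo_ptep_get(h).val;
}
```

Building with and without -DDEMO_STRONG_HW_PTEP gives the same behaviour here,
which is the property that lets non-opted arches build unchanged.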

Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
features that need tighter control over how page tables are accessed and
manipulated.

Getter flavours
---------------
While converting, it is useful to have two accessor flavours at each level:

  - pXXp_get(hw_ptep)        plain C dereference (compiler may optimize)
  - pXXp_get_once(hw_ptep)   single-copy-atomic, not torn, elided or
                             duplicated by the compiler

Keeping them distinct simplifies the conversion and avoids re-introducing the
class of lockless-read bugs seen on 32-bit.
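
A sketch of the difference between the two flavours, again in userspace with
stand-in types. The volatile access only approximates what the kernel's
READ_ONCE() provides, and it is single-copy-atomic only where the entry fits a
native machine word; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;		/* userspace stand-in */
typedef struct { pte_t *ptr; } hw_ptep;

/* Plain flavour: an ordinary load the compiler may merge, hoist or repeat. */
static inline pte_t demo_ptep_get(hw_ptep h)
{
	return *h.ptr;
}

/*
 * "Once" flavour: a volatile access, so the compiler emits exactly one
 * load and may not elide or duplicate it. Whether the load can tear
 * still depends on the entry being a native word on the target.
 */
static inline pte_t demo_ptep_get_once(hw_ptep h)
{
	pte_t pte;

	pte.val = *(const volatile uint64_t *)&h.ptr->val;
	return pte;
}

static int demo_getters_agree(void)
{
	pte_t table[1] = { { 0xabc } };
	hw_ptep h = { .ptr = &table[0] };

	return demo_ptep_get(h).val == demo_ptep_get_once(h).val;
}
```

With no concurrent writer the two return the same value; they differ only in
what the compiler is allowed to do with the load.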

Example conversion
------------------
Most of the conversion is mechanical.

  -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
  -		pte_t *ptep, pte_t pte, unsigned int nr)
  +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
  +		hw_ptep ptep, pte_t pte, unsigned int nr)
   {
   	page_table_check_ptes_set(mm, addr, ptep, pte, nr);
   	for (;;) {
   		set_pte(ptep, pte);
   		if (--nr == 0)
   			break;
  -		ptep++;
  +		ptep = hw_pte_next(ptep);
   		pte = pte_next_pfn(pte);
   	}
   }

The bulk of the work is this kind of rote substitution. The genuine work is the
handful of sites that turn out to be operating on a stack copy rather than a
live entry - those are exactly the ones the new type forces us to surface and
fix.
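
For reference, a userspace sketch of what the converted loop relies on. The
body of hw_pte_next() is an assumption (the diff above only shows its use), and
the demo_* helpers and DEMO_PAGE_SHIFT are stand-ins for the kernel
equivalents.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;		/* userspace stand-in */
typedef struct { pte_t *ptr; } hw_ptep;

#define DEMO_PAGE_SHIFT	12			/* assumed 4K pages */

/* Assumed implementation: step the handle to the next entry in the table. */
static inline hw_ptep hw_pte_next(hw_ptep h)
{
	h.ptr++;
	return h;
}

/* Stand-in for pte_next_pfn(): advance the PFN field by one page. */
static inline pte_t demo_pte_next_pfn(pte_t pte)
{
	pte.val += 1ULL << DEMO_PAGE_SHIFT;
	return pte;
}

static inline void demo_set_pte(hw_ptep h, pte_t pte) { *h.ptr = pte; }

/* Same loop shape as the converted set_ptes(); nr must be at least 1. */
static void demo_set_ptes(hw_ptep ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		demo_set_pte(ptep, pte);
		if (--nr == 0)
			break;
		ptep = hw_pte_next(ptep);
		pte = demo_pte_next_pfn(pte);
	}
}

static uint64_t demo_fill_three(void)
{
	pte_t table[3] = { {0}, {0}, {0} };

	demo_set_ptes((hw_ptep){ .ptr = table }, (pte_t){ .val = 0x1001 }, 3);
	return table[2].val;	/* 0x1001 advanced by two pages */
}
```

Because the handle is a struct, ptep++ no longer compiles, so every pointer
increment has to become an explicit hw_pte_next() call.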

Estimated churn
---------------
Halfway through prototyping, with only the PTE and PMD levels converted:
  77 files changed, +1801 / -1425
  ~57 files reference the new types

So the line count will grow once PUD/P4D/PGD and the remaining call sites are
converted; expect meaningfully more churn than the numbers above.

The proposed sequencing is to introduce the type as an alias, convert one
helper family per patch, and flip an arch to the strong type last - with
non-opted arches building unchanged at every step.

Open questions
--------------
  - Is the type-safety + future-feature enablement worth the churn?
  - Naming: hw_ptep/hw_pmdp vs something else?
  - Should all five levels be converted before merging anything, or is a staged
    approach (PTE and PMD first, the other levels later) acceptable?
  - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
    level?

[1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com

Thanks,
Usama



* Re: mm: opaque hardware page-table entry handles
From: Zi Yan @ 2026-06-24 15:52 UTC (permalink / raw)
  To: Usama Anjum, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
	Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
	Catalin Marinas, Will Deacon, Samuel Holland
  Cc: linux-mm, linux-arm-kernel, linux-kernel

On Wed Jun 24, 2026 at 10:09 AM EDT, Usama Anjum wrote:
> Hi all,
>
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
>
>   1. a page-table entry value (pte_t, pmd_t, ...)
>   2. a pointer to an entry value, e.g. a pXX_t on the stack
>   3. a pointer to a live entry in the hardware page table

This sounds good to me, but can you clarify the situation below?

Does a live entry mean one that hardware can access while the code is
manipulating it? What type should we use when pre-populating PTEs in a PMD
page before it is established as a HW page table? __split_huge_pmd_locked()
does that: a PMD page is first withdrawn and filled with the after-split
PTEs, using pmd_populate() and pte_offset_map() on this not-yet-HW page
table. Later, pmd_populate() is used to make the page table visible to HW.
Should we have two versions of pmd_populate() and pte_offset_map()? If we
are pedantic, the first pmd_populate() would accept pmd_t *, but the second
one would accept hw_pmdp. Of course, we can be flexible and have
pmd_populate() accept hw_pmdp for both, since the PMD page table we are
modifying will be visible to HW soon. But I think we should have clear
definitions of where these types are used and document them well.

You can probably ask LLMs to check for these ambiguous/vague uses throughout
the code base.





-- 
Best Regards,
Yan, Zi




* Re: mm: opaque hardware page-table entry handles
From: Pedro Falcato @ 2026-06-24 19:25 UTC (permalink / raw)
  To: Usama Anjum
  Cc: Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
	Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
	Catalin Marinas, Will Deacon, Samuel Holland, linux-mm,
	linux-arm-kernel, linux-kernel

On Wed, Jun 24, 2026 at 03:09:08PM +0100, Usama Anjum wrote:
> Hi all,
> 
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
> 
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
> 
>   1. a page-table entry value (pte_t, pmd_t, ...)
>   2. a pointer to an entry value, e.g. a pXX_t on the stack
>   3. a pointer to a live entry in the hardware page table
> 
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.
> 
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
> 
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
> 
>     /* opaque handle to a HW page-table entry; not dereferenceable */
>     typedef struct {
> 	pte_t *ptr;
>     } hw_ptep;

I don't love typedefs that hide pointers.

> 
> With this:
> 
>   - a stack value can no longer masquerade as a hardware table entry,
>   - a hardware handle can no longer be raw-dereferenced,
>   - cases that genuinely operate on a value can be refactored to pass the value
>     and let the caller, which knows whether it holds a handle or a stack copy,
>     read it once.

Just a small passing comment: how about doing it differently? like

typedef struct {
	pte_t *ptep;
} sw_ptep_t;

or something like that. Were I to guess, referring to a pte_t on the stack
is much rarer than all the pte_t references to actual page tables. But maybe
reality doesn't match up with my guess :)

> Estimated churn:
> ----------------
> Half way through the prototyping converting only PTE and PMD levels:
>   77 files changed, +1801 / -1425
>   ~57 files reference the new types

Right, the churn would be very unfortunate.


-- 
Pedro
