* Re: mm: opaque hardware page-table entry handles
2026-06-24 14:09 mm: opaque hardware page-table entry handles Usama Anjum
@ 2026-06-24 15:52 ` Zi Yan
2026-06-24 22:39 ` Muhammad Usama Anjum
2026-06-24 19:25 ` Pedro Falcato
2026-07-01 20:56 ` David Hildenbrand (Arm)
2 siblings, 1 reply; 8+ messages in thread
From: Zi Yan @ 2026-06-24 15:52 UTC (permalink / raw)
To: Usama Anjum, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
Catalin Marinas, Will Deacon, Samuel Holland
Cc: linux-mm, linux-arm-kernel, linux-kernel
On Wed Jun 24, 2026 at 10:09 AM EDT, Usama Anjum wrote:
> Hi all,
>
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
>
> 1. a page-table entry value (pte_t, pmd_t, ...)
> 2. a pointer to an entry value, e.g. a pXX_t on the stack
> 3. a pointer to a live entry in the hardware page table
This sounds good to me, but can you clarify the situation below?
A live entry means the entry can be accessed by hardware when the code
is manipulating it? What type should we use if we are pre-populating
PTEs in a PMD page before we establish the PMD page as a HW page table?
In __split_huge_pmd_locked(), we do that. A PMD page is first withdrawn
and filled with after-split PTEs, pmd_populate() and pte_offset_map()
are used for this not-yet-HW page table. Later, pmd_populate() is used
to make this page table visible to HW. Should we have two versions of
pmd_populate() and pte_offset_map()? Since the first pmd_populate()
would accept pmd_t*, but the second one would accept hw_pmdp, if we are
pedantic. Of course, we can be flexible here to use pmd_populate()
accepting hw_pmdp for both, since the PMD page table we are modifying
is going to be visible to HW soon. But I think we should have clear
definitions for where these types are used and document them well.
You probably can ask LLMs to check these ambiguous/vague uses throughout
the code base.
>
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.
>
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
>
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
>
> /* opaque handle to a HW page-table entry; not dereferenceable */
> typedef struct {
> pte_t *ptr;
> } hw_ptep;
>
> With this:
>
> - a stack value can no longer masquerade as a hardware table entry,
> - a hardware handle can no longer be raw-dereferenced,
> - cases that genuinely operate on a value can be refactored to pass the value
> and let the caller, which knows whether it holds a handle or a stack copy,
> read it once.
>
> The overload becomes a compile-time type error instead of a silent runtime bug,
> and converting the tree forces every such site to be made explicit. This gives
> us a framework where the architecture can completely virtualize the pgtable if
> it likes; and the compiler can enforce that higher level code can't accidentally
> work around it.
>
> It is opt-in by architectures and incremental. The generic definition is
> just an alias, so arches that do not care build unchanged:
>
> typedef pte_t *hw_ptep;
>
> An arch flips to the strong struct type when it is ready, and only then does
> it get the stronger checking. This lets the conversion land gradually.
>
> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
> features that need tighter control over how page tables are accessed and
> manipulated.
>
> Getter flavours
> ---------------
> While converting, it is useful to have two accessor flavours at each level:
>
> - pXXp_get(hw_ptep) plain C dereference (compiler may optimize)
> - pXXp_get_once(hw_ptep) single-copy-atomic, not torn, elided or
> duplicated by the compiler
>
> Keeping them distinct simplifies the conversion and avoids re-introducing the
> class of lockless-read bugs seen on 32-bit.
>
> Example conversion
> ------------------
> Most of the conversion is mechanical.
>
> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> + hw_ptep ptep, pte_t pte, unsigned int nr)
> {
> page_table_check_ptes_set(mm, addr, ptep, pte, nr);
> for (;;) {
> set_pte(ptep, pte);
> if (--nr == 0)
> break;
> - ptep++;
> + ptep = hw_pte_next(ptep);
> pte = pte_next_pfn(pte);
> }
> }
>
> The bulk of work is this kind of rote substitution. The genuine work is the
> handful of sites that turn out to be operating on a stack copy rather than a
> live entry - those are exactly the ones the new type forces us to surface and
> fix.
>
> Estimated churn:
> ----------------
> Half way through the prototyping converting only PTE and PMD levels:
> 77 files changed, +1801 / -1425
> ~57 files reference the new types
>
> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
> converted; expect meaningfully more churn than the numbers above.
>
> Introduce the type as an alias, convert one helper family per patch, and flip
> an arch to the strong type last - with non-opted arches building unchanged at
> every step.
>
> Open questions
> --------------
> - Is the type-safety + future-feature enablement worth the churn?
> - Naming: hw_ptep/hw_pmdp vs something else?
> - Should all five levels be converted before merging anything, or is a staged
> PTE-and-PMD then landing others acceptable?
> - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
> level?
>
> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>
> Thanks,
> Usama
--
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mm: opaque hardware page-table entry handles
2026-06-24 15:52 ` Zi Yan
@ 2026-06-24 22:39 ` Muhammad Usama Anjum
0 siblings, 0 replies; 8+ messages in thread
From: Muhammad Usama Anjum @ 2026-06-24 22:39 UTC (permalink / raw)
To: Zi Yan, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
Catalin Marinas, Will Deacon, Samuel Holland
Cc: usama.anjum, linux-mm, linux-arm-kernel, linux-kernel
On 24/06/2026 4:52 pm, Zi Yan wrote:
> On Wed Jun 24, 2026 at 10:09 AM EDT, Usama Anjum wrote:
>> Hi all,
>>
>> This is a direction-check with the wider community before spending time on the
>> development. This picks up the idea that was raised and broadly agreed in the
>> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>>
>> The problem
>> -----------
>> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
>> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
>> representation. Sprinkling getters wouldn't solve the problem entirely. The
>> problem is one level up: the *pointer type* itself is overloaded. At each level
>> there are really three distinct things:
>>
>> 1. a page-table entry value (pte_t, pmd_t, ...)
>> 2. a pointer to an entry value, e.g. a pXX_t on the stack
>> 3. a pointer to a live entry in the hardware page table
>
> This sounds good to me, but can you clarify the situation below?
>
> A live entry means the entry can be accessed by hardware when the code
> is manipulating it?
I think "live" is the wrong word to choose here. It's a mistake on my end. (3) means
the pointer points into real page-table memory and that table is complete,
whether or not it's linked in yet. A withdrawn-but-not-yet-installed table is
still a hardware table and can be represented by hw_pXXp type.
> What type should we use if we are pre-populating
> PTEs in a PMD page before we establish the PMD page as a HW page table?
> In __split_huge_pmd_locked(), we do that. A PMD page is first withdrawn
> and filled with after-split PTEs, pmd_populate() and pte_offset_map()
> are used for this not-yet-HW page table. Later, pmd_populate() is used
> to make this page table visible to HW. Should we have two versions of
> pmd_populate() and pte_offset_map()? Since the first pmd_populate()
> would accept pmd_t*, but the second one would accept hw_pmdp, if we are
> pedantic. Of course, we can be flexible here to use pmd_populate()
> accepting hw_pmdp for both, since the PMD page table we are modifying
> is going to be visible to HW soon. But I think we should have clear
> definitions for where these types are used and document them well.
This is exactly the example that causes the confusion. Following the definition
above, the pmd is on the stack while the PTEs are being prepared, and the PTE
table is complete — so the pmd pointer should be pmd_t * and the PTE table
hw_ptep. I'd keep the two APIs distinct rather than overloading hw_pmdp for
both: that's what enforces the rule that no stack pointer reaches a
table-writing API, and what lets the *_stack path drop the synchronization.
(One thing I still need to chase: there are cases where we convert between pmd
and pte. I need to understand how often that happens — if it's common, a
hw_ptep could get converted into a pmd and bring the confusion back, and if we
have to account for that, definition (3) may need to change.)
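
As a rough illustration of keeping the two paths distinct, here is a minimal
user-space sketch. The types are simplified stand-ins for the kernel ones, and
pmd_populate_stack(), hw_pmd_populate() and hw_pmdp_wrap() are hypothetical
names invented for this example, not existing kernel APIs. The release store
only stands in for whatever ordering an architecture would actually require.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's pmd_t (assumption: one 64-bit word). */
typedef struct { uint64_t val; } pmd_t;

/* Opaque handle to an entry that lives in real page-table memory. */
typedef struct { pmd_t *ptr; } hw_pmdp;

/* Hypothetical constructor; ideally only table-walking code would call it. */
static inline hw_pmdp hw_pmdp_wrap(pmd_t *table_entry)
{
	return (hw_pmdp){ .ptr = table_entry };
}

/* Made-up "valid table" bits, for illustration only. */
#define DEMO_PMD_TABLE_BITS 0x3ULL

/* Stack path: a plain store, since nothing else can observe the copy. */
static inline void pmd_populate_stack(pmd_t *pmdp, uint64_t pte_table_pa)
{
	pmdp->val = pte_table_pa | DEMO_PMD_TABLE_BITS;
}

/*
 * Table path: takes only the handle, so passing &stack_pmd here is a
 * compile-time type error. The store uses release ordering as a stand-in
 * for the synchronization the stack path is allowed to drop.
 */
static inline void hw_pmd_populate(hw_pmdp pmdp, uint64_t pte_table_pa)
{
	__atomic_store_n(&pmdp.ptr->val, pte_table_pa | DEMO_PMD_TABLE_BITS,
			 __ATOMIC_RELEASE);
}
```

With this split, the caller decides once which kind of pointer it holds, and
the compiler rejects any attempt to hand a stack copy to the table-writing
helper.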
>
> You probably can ask LLMs to check these ambiguous/vague uses throughout
> the code base.
>
>>
>> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
>> distinguishes a pointer into a live table from a pointer to a stack copy.
>>
>> A pointer to an on-stack entry value and a pointer to a live hardware entry have
>> the same type, so the compiler cannot distinguish them. Passing the stack
>> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
>> but is wrong - a bug class the type system makes invisible. It also blocks
>> evolution: an arch helper may need to read beyond the addressed entry (e.g.
>> adjacent or contiguous entries), which only makes sense for a real page-table
>> pointer, not a stack copy.
>>
>> The idea
>> --------
>> Give (3) its own opaque type that cannot be dereferenced:
>>
>> /* opaque handle to a HW page-table entry; not dereferenceable */
>> typedef struct {
>> pte_t *ptr;
>> } hw_ptep;
>>
>> With this:
>>
>> - a stack value can no longer masquerade as a hardware table entry,
>> - a hardware handle can no longer be raw-dereferenced,
>> - cases that genuinely operate on a value can be refactored to pass the value
>> and let the caller, which knows whether it holds a handle or a stack copy,
>> read it once.
>>
>> The overload becomes a compile-time type error instead of a silent runtime bug,
>> and converting the tree forces every such site to be made explicit. This gives
>> us a framework where the architecture can completely virtualize the pgtable if
>> it likes; and the compiler can enforce that higher level code can't accidentally
>> work around it.
>>
>> It is opt-in by architectures and incremental. The generic definition is
>> just an alias, so arches that do not care build unchanged:
>>
>> typedef pte_t *hw_ptep;
>>
>> An arch flips to the strong struct type when it is ready, and only then does
>> it get the stronger checking. This lets the conversion land gradually.
>>
>> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
>> features that need tighter control over how page tables are accessed and
>> manipulated.
>>
>> Getter flavours
>> ---------------
>> While converting, it is useful to have two accessor flavours at each level:
>>
>> - pXXp_get(hw_ptep) plain C dereference (compiler may optimize)
>> - pXXp_get_once(hw_ptep) single-copy-atomic, not torn, elided or
>> duplicated by the compiler
>>
>> Keeping them distinct simplifies the conversion and avoids re-introducing the
>> class of lockless-read bugs seen on 32-bit.
>>
>> Example conversion
>> ------------------
>> Most of the conversion is mechanical.
>>
>> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> - pte_t *ptep, pte_t pte, unsigned int nr)
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> + hw_ptep ptep, pte_t pte, unsigned int nr)
>> {
>> page_table_check_ptes_set(mm, addr, ptep, pte, nr);
>> for (;;) {
>> set_pte(ptep, pte);
>> if (--nr == 0)
>> break;
>> - ptep++;
>> + ptep = hw_pte_next(ptep);
>> pte = pte_next_pfn(pte);
>> }
>> }
>>
>> The bulk of work is this kind of rote substitution. The genuine work is the
>> handful of sites that turn out to be operating on a stack copy rather than a
>> live entry - those are exactly the ones the new type forces us to surface and
>> fix.
>>
>> Estimated churn:
>> ----------------
>> Half way through the prototyping converting only PTE and PMD levels:
>> 77 files changed, +1801 / -1425
>> ~57 files reference the new types
>>
>> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
>> converted; expect meaningfully more churn than the numbers above.
>>
>> Introduce the type as an alias, convert one helper family per patch, and flip
>> an arch to the strong type last - with non-opted arches building unchanged at
>> every step.
>>
>> Open questions
>> --------------
>> - Is the type-safety + future-feature enablement worth the churn?
>> - Naming: hw_ptep/hw_pmdp vs something else?
>> - Should all five levels be converted before merging anything, or is a staged
>> PTE-and-PMD then landing others acceptable?
>> - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
>> level?
>>
>> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>>
>> Thanks,
>> Usama
>
>
>
>
--
Thanks,
Usama
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mm: opaque hardware page-table entry handles
2026-06-24 14:09 mm: opaque hardware page-table entry handles Usama Anjum
2026-06-24 15:52 ` Zi Yan
@ 2026-06-24 19:25 ` Pedro Falcato
2026-06-25 10:50 ` Muhammad Usama Anjum
2026-07-01 20:56 ` David Hildenbrand (Arm)
2 siblings, 1 reply; 8+ messages in thread
From: Pedro Falcato @ 2026-06-24 19:25 UTC (permalink / raw)
To: Usama Anjum
Cc: Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
Catalin Marinas, Will Deacon, Samuel Holland, linux-mm,
linux-arm-kernel, linux-kernel
On Wed, Jun 24, 2026 at 03:09:08PM +0100, Usama Anjum wrote:
> Hi all,
>
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
>
> 1. a page-table entry value (pte_t, pmd_t, ...)
> 2. a pointer to an entry value, e.g. a pXX_t on the stack
> 3. a pointer to a live entry in the hardware page table
>
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.
>
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
>
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
>
> /* opaque handle to a HW page-table entry; not dereferenceable */
> typedef struct {
> pte_t *ptr;
> } hw_ptep;
I don't love typedefs that hide pointers.
>
> With this:
>
> - a stack value can no longer masquerade as a hardware table entry,
> - a hardware handle can no longer be raw-dereferenced,
> - cases that genuinely operate on a value can be refactored to pass the value
> and let the caller, which knows whether it holds a handle or a stack copy,
> read it once.
Just a small passing comment: how about doing it differently? like
typedef struct {
pte_t *ptep;
} sw_ptep_t;
or something like that. Were I to guess, referring to a pte_t on the stack
is much rarer than all the pte_t references to actual page tables. But maybe
reality doesn't match up with my guess :)
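
To show what this inverse approach could look like, here is a small user-space
sketch. It is only an assumption about how the suggestion might be fleshed
out: pte_t is a simplified stand-in, and sw_ptep(), sw_pte_set() and
sw_pte_get() are hypothetical helper names. Table-facing helpers keep the
pte_t * signature they have today, so only the rarer stack-copy sites change.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's pte_t (assumption: one 64-bit word). */
typedef struct { uint64_t val; } pte_t;

/* Wrapper for the rarer case: a pointer to a PTE copy that is not in a table. */
typedef struct {
	pte_t *ptep;
} sw_ptep_t;

/* Hypothetical constructor for a pointer to an on-stack copy. */
static inline sw_ptep_t sw_ptep(pte_t *copy)
{
	return (sw_ptep_t){ .ptep = copy };
}

/* Helpers that only make sense on a copy take the wrapper type. */
static inline void sw_pte_set(sw_ptep_t p, pte_t pte)
{
	*p.ptep = pte;
}

static inline pte_t sw_pte_get(sw_ptep_t p)
{
	return *p.ptep;
}

/*
 * Table helpers keep the plain pte_t * signature, so existing callers are
 * untouched; passing a sw_ptep_t here fails to compile.
 */
static inline void set_pte(pte_t *ptep, pte_t pte)
{
	__atomic_store_n(&ptep->val, pte.val, __ATOMIC_RELAXED);
}
```

The trade-off is that pte_t * stays dereferenceable, so this variant
separates the two pointer kinds without hiding the table pointer.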
>
> The overload becomes a compile-time type error instead of a silent runtime bug,
> and converting the tree forces every such site to be made explicit. This gives
> us a framework where the architecture can completely virtualize the pgtable if
> it likes; and the compiler can enforce that higher level code can't accidentally
> work around it.
>
> It is opt-in by architectures and incremental. The generic definition is
> just an alias, so arches that do not care build unchanged:
>
> typedef pte_t *hw_ptep;
>
> An arch flips to the strong struct type when it is ready, and only then does
> it get the stronger checking. This lets the conversion land gradually.
>
> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
> features that need tighter control over how page tables are accessed and
> manipulated.
>
> Getter flavours
> ---------------
> While converting, it is useful to have two accessor flavours at each level:
>
> - pXXp_get(hw_ptep) plain C dereference (compiler may optimize)
> - pXXp_get_once(hw_ptep) single-copy-atomic, not torn, elided or
> duplicated by the compiler
>
> Keeping them distinct simplifies the conversion and avoids re-introducing the
> class of lockless-read bugs seen on 32-bit.
>
> Example conversion
> ------------------
> Most of the conversion is mechanical.
>
> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> + hw_ptep ptep, pte_t pte, unsigned int nr)
> {
> page_table_check_ptes_set(mm, addr, ptep, pte, nr);
> for (;;) {
> set_pte(ptep, pte);
> if (--nr == 0)
> break;
> - ptep++;
> + ptep = hw_pte_next(ptep);
> pte = pte_next_pfn(pte);
> }
> }
>
> The bulk of work is this kind of rote substitution. The genuine work is the
> handful of sites that turn out to be operating on a stack copy rather than a
> live entry - those are exactly the ones the new type forces us to surface and
> fix.
>
> Estimated churn:
> ----------------
> Half way through the prototyping converting only PTE and PMD levels:
> 77 files changed, +1801 / -1425
> ~57 files reference the new types
Right, the churn would be very unfortunate.
>
> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
> converted; expect meaningfully more churn than the numbers above.
>
> Introduce the type as an alias, convert one helper family per patch, and flip
> an arch to the strong type last - with non-opted arches building unchanged at
> every step.
>
> Open questions
> --------------
> - Is the type-safety + future-feature enablement worth the churn?
> - Naming: hw_ptep/hw_pmdp vs something else?
> - Should all five levels be converted before merging anything, or is a staged
> PTE-and-PMD then landing others acceptable?
> - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
> level?
>
> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>
> Thanks,
> Usama
>
--
Pedro
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mm: opaque hardware page-table entry handles
2026-06-24 19:25 ` Pedro Falcato
@ 2026-06-25 10:50 ` Muhammad Usama Anjum
2026-06-25 11:08 ` Pedro Falcato
0 siblings, 1 reply; 8+ messages in thread
From: Muhammad Usama Anjum @ 2026-06-25 10:50 UTC (permalink / raw)
To: Pedro Falcato
Cc: usama.anjum, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
Catalin Marinas, Will Deacon, Samuel Holland, linux-mm,
linux-arm-kernel, linux-kernel
On 24/06/2026 8:25 pm, Pedro Falcato wrote:
> On Wed, Jun 24, 2026 at 03:09:08PM +0100, Usama Anjum wrote:
>> Hi all,
>>
>> This is a direction-check with the wider community before spending time on the
>> development. This picks up the idea that was raised and broadly agreed in the
>> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>>
>> The problem
>> -----------
>> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
>> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
>> representation. Sprinkling getters wouldn't solve the problem entirely. The
>> problem is one level up: the *pointer type* itself is overloaded. At each level
>> there are really three distinct things:
>>
>> 1. a page-table entry value (pte_t, pmd_t, ...)
>> 2. a pointer to an entry value, e.g. a pXX_t on the stack
>> 3. a pointer to a live entry in the hardware page table
>>
>> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
>> distinguishes a pointer into a live table from a pointer to a stack copy.
>>
>> A pointer to an on-stack entry value and a pointer to a live hardware entry have
>> the same type, so the compiler cannot distinguish them. Passing the stack
>> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
>> but is wrong - a bug class the type system makes invisible. It also blocks
>> evolution: an arch helper may need to read beyond the addressed entry (e.g.
>> adjacent or contiguous entries), which only makes sense for a real page-table
>> pointer, not a stack copy.
>>
>> The idea
>> --------
>> Give (3) its own opaque type that cannot be dereferenced:
>>
>> /* opaque handle to a HW page-table entry; not dereferenceable */
>> typedef struct {
>> pte_t *ptr;
>> } hw_ptep;
>
> I don't love typedefs that hide pointers.
Nobody likes them. This is the only way to ensure that stack pointers don't
get reintroduced by mistake. It's also hard to catch such cases during review.
>
>>
>> With this:
>>
>> - a stack value can no longer masquerade as a hardware table entry,
>> - a hardware handle can no longer be raw-dereferenced,
>> - cases that genuinely operate on a value can be refactored to pass the value
>> and let the caller, which knows whether it holds a handle or a stack copy,
>> read it once.
>
> Just a small passing comment: how about doing it differently? like
>
> typedef struct {
> pte_t *ptep;
> } sw_ptep_t;
>
> or something like that. Were I to guess, referring to a pte_t on the stack
> is much rarer than all the pte_t references to actual page tables. But maybe
> reality doesn't match up with my guess :)
We want to fix the current usages and future usages as well. sw_ptep_t can work
for current usages, but it'll not force the new code to be written using correct
notation. Apart from the distinct types, another benefit of hw_pXXp is that
it becomes an opaque object which only the architecture can manipulate. Hence
the architecture can decide however it wants to manage them in certain cases.
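
A minimal user-space sketch of that opacity, covering both the alias and the
strong definition, might look like this. HW_PTEP_STRONG, hw_ptep_wrap() and
hw_ptep_raw() are hypothetical names made up for the example; pte_t is a
simplified stand-in, and the relaxed atomic load is only an approximation of
a single-copy-atomic read.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's pte_t (assumption: one 64-bit word). */
typedef struct { uint64_t val; } pte_t;

#ifdef HW_PTEP_STRONG	/* hypothetical per-arch opt-in switch */
/* Strong flavour: a struct, so generic code cannot write *ptep or ptep++. */
typedef struct { pte_t *ptr; } hw_ptep;
#define hw_ptep_raw(h)	((h).ptr)
static inline hw_ptep hw_ptep_wrap(pte_t *p)
{
	return (hw_ptep){ .ptr = p };
}
#else
/* Generic flavour: just an alias, so unconverted arches build unchanged. */
typedef pte_t *hw_ptep;
#define hw_ptep_raw(h)	(h)
static inline hw_ptep hw_ptep_wrap(pte_t *p)
{
	return p;
}
#endif

/* Advance to the next entry without exposing pointer arithmetic to callers. */
static inline hw_ptep hw_pte_next(hw_ptep h)
{
	return hw_ptep_wrap(hw_ptep_raw(h) + 1);
}

/* Plain read: an ordinary C load that the compiler is free to optimize. */
static inline pte_t ptep_get(hw_ptep h)
{
	return *hw_ptep_raw(h);
}

/* "Once" read: exactly one load of the entry word, not torn or repeated. */
static inline pte_t ptep_get_once(hw_ptep h)
{
	pte_t pte = { __atomic_load_n(&hw_ptep_raw(h)->val, __ATOMIC_RELAXED) };

	return pte;
}

/* Plain store through the handle; arch code could substitute anything here. */
static inline void set_pte(hw_ptep h, pte_t pte)
{
	hw_ptep_raw(h)->val = pte.val;
}
```

Callers written against these helpers compile identically in both modes;
only code that reaches around them with a raw dereference breaks once the
strong type is enabled.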
>
>>
>> The overload becomes a compile-time type error instead of a silent runtime bug,
>> and converting the tree forces every such site to be made explicit. This gives
>> us a framework where the architecture can completely virtualize the pgtable if
>> it likes; and the compiler can enforce that higher level code can't accidentally
>> work around it.
>>
>> It is opt-in by architectures and incremental. The generic definition is
>> just an alias, so arches that do not care build unchanged:
>>
>> typedef pte_t *hw_ptep;
>>
>> An arch flips to the strong struct type when it is ready, and only then does
>> it get the stronger checking. This lets the conversion land gradually.
>>
>> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
>> features that need tighter control over how page tables are accessed and
>> manipulated.
>>
>> Getter flavours
>> ---------------
>> While converting, it is useful to have two accessor flavours at each level:
>>
>> - pXXp_get(hw_ptep) plain C dereference (compiler may optimize)
>> - pXXp_get_once(hw_ptep) single-copy-atomic, not torn, elided or
>> duplicated by the compiler
>>
>> Keeping them distinct simplifies the conversion and avoids re-introducing the
>> class of lockless-read bugs seen on 32-bit.
>>
>> Example conversion
>> ------------------
>> Most of the conversion is mechanical.
>>
>> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> - pte_t *ptep, pte_t pte, unsigned int nr)
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> + hw_ptep ptep, pte_t pte, unsigned int nr)
>> {
>> page_table_check_ptes_set(mm, addr, ptep, pte, nr);
>> for (;;) {
>> set_pte(ptep, pte);
>> if (--nr == 0)
>> break;
>> - ptep++;
>> + ptep = hw_pte_next(ptep);
>> pte = pte_next_pfn(pte);
>> }
>> }
>>
>> The bulk of work is this kind of rote substitution. The genuine work is the
>> handful of sites that turn out to be operating on a stack copy rather than a
>> live entry - those are exactly the ones the new type forces us to surface and
>> fix.
>>
>> Estimated churn:
>> ----------------
>> Half way through the prototyping converting only PTE and PMD levels:
>> 77 files changed, +1801 / -1425
>> ~57 files reference the new types
>
> Right, the churn would be very unfortunate.
>
>>
>> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
>> converted; expect meaningfully more churn than the numbers above.
>>
>> Introduce the type as an alias, convert one helper family per patch, and flip
>> an arch to the strong type last - with non-opted arches building unchanged at
>> every step.
>>
>> Open questions
>> --------------
>> - Is the type-safety + future-feature enablement worth the churn?
>> - Naming: hw_ptep/hw_pmdp vs something else?
>> - Should all five levels be converted before merging anything, or is a staged
>> PTE-and-PMD then landing others acceptable?
>> - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
>> level?
>>
>> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>>
>> Thanks,
>> Usama
>>
>
--
Thanks,
Usama
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mm: opaque hardware page-table entry handles
2026-06-25 10:50 ` Muhammad Usama Anjum
@ 2026-06-25 11:08 ` Pedro Falcato
2026-06-25 12:15 ` Muhammad Usama Anjum
0 siblings, 1 reply; 8+ messages in thread
From: Pedro Falcato @ 2026-06-25 11:08 UTC (permalink / raw)
To: Muhammad Usama Anjum
Cc: Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
Catalin Marinas, Will Deacon, Samuel Holland, linux-mm,
linux-arm-kernel, linux-kernel
On Thu, Jun 25, 2026 at 11:50:28AM +0100, Muhammad Usama Anjum wrote:
> On 24/06/2026 8:25 pm, Pedro Falcato wrote:
> > On Wed, Jun 24, 2026 at 03:09:08PM +0100, Usama Anjum wrote:
> >> Hi all,
> >>
> >> This is a direction-check with the wider community before spending time on the
> >> development. This picks up the idea that was raised and broadly agreed in the
> >> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
> >>
> >> The problem
> >> -----------
> >> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> >> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> >> representation. Sprinkling getters wouldn't solve the problem entirely. The
> >> problem is one level up: the *pointer type* itself is overloaded. At each level
> >> there are really three distinct things:
> >>
> >> 1. a page-table entry value (pte_t, pmd_t, ...)
> >> 2. a pointer to an entry value, e.g. a pXX_t on the stack
> >> 3. a pointer to a live entry in the hardware page table
> >>
> >> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> >> distinguishes a pointer into a live table from a pointer to a stack copy.
> >>
> >> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> >> the same type, so the compiler cannot distinguish them. Passing the stack
> >> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> >> but is wrong - a bug class the type system makes invisible. It also blocks
> >> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> >> adjacent or contiguous entries), which only makes sense for a real page-table
> >> pointer, not a stack copy.
> >>
> >> The idea
> >> --------
> >> Give (3) its own opaque type that cannot be dereferenced:
> >>
> >> /* opaque handle to a HW page-table entry; not dereferenceable */
> >> typedef struct {
> >> pte_t *ptr;
> >> } hw_ptep;
> >
> > I don't love typedefs that hide pointers.
> Nobody likes them. This is the only way so that by mistake stack pointers
> don't get reintroduced. Its also hard to catch such cases during review.
That's not true, you could have:
typedef struct { pteval_t pte; } sw_pte_t;
and
/* only usable by arch code and whoever wants to interpret these
* types */
static inline pte_t *sw_to_ptep(sw_pte_t *swptep)
{
return (pte_t *) swptep;
}
and so on... Also, see Documentation/process/coding-style.rst 5) typedefs, it
explicitly warns against pointer typedefs.
>
> >
> >>
> >> With this:
> >>
> >> - a stack value can no longer masquerade as a hardware table entry,
> >> - a hardware handle can no longer be raw-dereferenced,
> >> - cases that genuinely operate on a value can be refactored to pass the value
> >> and let the caller, which knows whether it holds a handle or a stack copy,
> >> read it once.
> >
> > Just a small passing comment: how about doing it differently? like
> >
> > typedef struct {
> > pte_t *ptep;
> > } sw_ptep_t;
> >
> > or something like that. Were I to guess, referring to a pte_t on the stack
> > is much rarer than all the pte_t references to actual page tables. But maybe
> > reality doesn't match up with my guess :)
> We want to fix the current usages and future usages as well. sw_ptep_t can work
> for current usages, but it'll not force the new code to be written using correct
> notations.
I don't understand what you mean. pte_t is a perfectly correct notation,
it's just currently maybe too ambiguously overloaded.
> Apart from different types, another benefit of hw_pXXp would be that
> it'll become an opaque object which only architecture can manipulate. Hence
> architecture can decide however it wants to manage them in certain cases.
That's already the case. pte_t is fully opaque apart from the little fact
that you can declare one on your stack. Introducing a different sw_pte_t
would further reinforce that. And if you want ways to find raw derefs on
pointers, we can simply slap on __attribute__((noderef)) (available in
sparse and clang) on those types after sw_pte_t is introduced and pte_t
is unambiguously a "hardware" PTE.
I dunno, I'm not convinced that changing around ~450 files is worth it, and
_if_ we want to do something like this I would strongly prefer the way that
is less churny.
--
Pedro
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mm: opaque hardware page-table entry handles
2026-06-25 11:08 ` Pedro Falcato
@ 2026-06-25 12:15 ` Muhammad Usama Anjum
0 siblings, 0 replies; 8+ messages in thread
From: Muhammad Usama Anjum @ 2026-06-25 12:15 UTC (permalink / raw)
To: Pedro Falcato
Cc: usama.anjum, Andrew Morton, Lorenzo Stoakes, David Hildenbrand,
Liam R. Howlett, Mike Rapoport, Ryan Roberts, Anshuman Khandual,
Catalin Marinas, Will Deacon, Samuel Holland, linux-mm,
linux-arm-kernel, linux-kernel
On 25/06/2026 12:08 pm, Pedro Falcato wrote:
> On Thu, Jun 25, 2026 at 11:50:28AM +0100, Muhammad Usama Anjum wrote:
>> On 24/06/2026 8:25 pm, Pedro Falcato wrote:
>>> On Wed, Jun 24, 2026 at 03:09:08PM +0100, Usama Anjum wrote:
>>>> Hi all,
>>>>
>>>> This is a direction-check with the wider community before spending time on the
>>>> development. This picks up the idea that was raised and broadly agreed in the
>>>> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>>>>
>>>> The problem
>>>> -----------
>>>> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
>>>> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
>>>> representation. Sprinkling getters wouldn't solve the problem entirely. The
>>>> problem is one level up: the *pointer type* itself is overloaded. At each level
>>>> there are really three distinct things:
>>>>
>>>> 1. a page-table entry value (pte_t, pmd_t, ...)
>>>> 2. a pointer to an entry value, e.g. a pXX_t on the stack
>>>> 3. a pointer to a live entry in the hardware page table
>>>>
>>>> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
>>>> distinguishes a pointer into a live table from a pointer to a stack copy.
>>>>
>>>> A pointer to an on-stack entry value and a pointer to a live hardware entry have
>>>> the same type, so the compiler cannot distinguish them. Passing the stack
>>>> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
>>>> but is wrong - a bug class the type system makes invisible. It also blocks
>>>> evolution: an arch helper may need to read beyond the addressed entry (e.g.
>>>> adjacent or contiguous entries), which only makes sense for a real page-table
>>>> pointer, not a stack copy.
>>>>
>>>> The idea
>>>> --------
>>>> Give (3) its own opaque type that cannot be dereferenced:
>>>>
>>>> /* opaque handle to a HW page-table entry; not dereferenceable */
>>>> typedef struct {
>>>> pte_t *ptr;
>>>> } hw_ptep;
>>>
>>> I don't love typedefs that hide pointers.
>> Nobody likes them. This is the only way so that by mistake stack pointers
>> don't get reintroduced. It's also hard to catch such cases during review.
>
> That's not true, you could have:
>
> typedef struct { pteval_t pte; } sw_pte_t;
>
> and
>
> /* only usable by arch code and whoever wants to interpret these
> * types */
> static inline pte_t *sw_to_ptep(sw_pte_t *swptep)
> {
> return (pte_t *) swptep;
> }
>
> and so on... Also, see Documentation/process/coding-style.rst 5) typedefs, it
> explicitly warns against pointer typedefs.
I hear your concern, but I think the sw_pte_t inversion solves the small problem
and gives up the big one. Let me make the case for keeping the opaque hardware type.
The narrow goal is "no stack pointer reaches a table-writing API". Both schemes catch
that. But the actual reason for this idea is a broader one:
* Making core code independent of how a real page table entry is represented and
accessed. That only works if the hardware type is the abstract one.
* As you may have noted with pmdp_get(): on some arches the read is not a pure load
(folding, lockless ordering, kmap of a highmem table page). pte_t * lets callers
bypass all of that with *ptep. The handle makes the accessor the only door, so the
barriers/folds can't be skipped by accident.
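To make the "only door" point concrete, here is a minimal userspace sketch, not kernel code. The pte_t layout, hw_ptep_at(), the arch_reads counter and read_entry() are made up for illustration; only the hw_ptep struct shape comes from the proposal. In plain C the .ptr member is still reachable, so the assumption is that only arch code touches it by convention.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins; the real kernel types are arch-defined. */
typedef struct { uint64_t val; } pte_t;   /* entry value */
typedef struct { pte_t *ptr; } hw_ptep;   /* opaque handle from the proposal */

/* Hypothetical counter standing in for arch-side work (barriers, folding). */
static unsigned long arch_reads;

/* Hypothetical constructor: only table-walking code would create handles. */
static inline hw_ptep hw_ptep_at(pte_t *table, unsigned int idx)
{
	return (hw_ptep){ .ptr = &table[idx] };
}

/* The accessor: the one place that is supposed to look inside the handle. */
static inline pte_t hw_ptep_get(hw_ptep p)
{
	assert(p.ptr);
	arch_reads++;
	/* Read through a volatile lvalue so the access is not elided or duplicated. */
	return (pte_t){ .val = *(volatile uint64_t *)&p.ptr->val };
}

/* Core-code style caller: it goes through the accessor. */
static uint64_t read_entry(pte_t *table, unsigned int idx)
{
	hw_ptep p = hw_ptep_at(table, idx);

	/* "*p" or "p->val" would not compile: p is a struct, not a pointer. */
	return hw_ptep_get(p).val;
}
```

Every read in read_entry() bumps arch_reads, which is the stand-in for whatever the architecture needs to do on each access.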
>>
>>>
>>>>
>>>> With this:
>>>>
>>>> - a stack value can no longer masquerade as a hardware table entry,
>>>> - a hardware handle can no longer be raw-dereferenced,
>>>> - cases that genuinely operate on a value can be refactored to pass the value
>>>> and let the caller, which knows whether it holds a handle or a stack copy,
>>>> read it once.
>>>
>>> Just a small passing comment: how about doing it differently? like
>>>
>>> typedef struct {
>>> pte_t *ptep;
>>> } sw_ptep_t;
>>>
>>> or something like that. Were I to guess, referring to a pte_t on the stack
>>> is much rarer than all the pte_t references to actual page tables. But maybe
>>> reality doesn't match up with my guess :)
>> We want to fix the current usages and future usages as well. sw_ptep_t can work
>> for current usages, but it'll not force the new code to be written using correct
>> notations.
>
> I don't understand what you mean. pte_t is a perfectly correct notation,
> it's just currently maybe too ambiguously overloaded.
Yes, this overload is what needs fixing.
>
>> Apart from different types, another benefit of hw_pXXp would be that
>> it'll become an opaque object which only architecture can manipulate. Hence
>> architecture can decide however it wants to manage them in certain cases.
>
> That's already the case. pte_t is fully opaque apart from the little fact
> that you can declare one on your stack. Introducing a different sw_pte_t
> would further reinforce that. And if you want ways to find raw derefs on
> pointers, we can simply slap on __attribute__((noderef)) (available in
> sparse and clang) on those types after sw_pte_t is introduced and pte_t
> is unambiguously a "hardware" PTE.
The pte_t iterator loops in core code prove that it isn't opaque enough.
The pointer arithmetic (ptep++) is done in several places in the core.
The sw_pte_t + deref protection only catches misuses under sparse, while the
hw_ptep type is enforced by every compiler for every build.
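A rough userspace sketch of what I mean, under stated assumptions: the types are stand-ins, PAGE_SHIFT is assumed to be 12, pte_next_pfn() here simply adds one page to the value, and the mm/addr arguments of the real set_ptes() are dropped. The only point is that the iterator step has to go through a helper.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types. */
typedef struct { uint64_t val; } pte_t;
typedef struct { pte_t *ptr; } hw_ptep;

#define PAGE_SHIFT 12	/* assumed 4K pages for this sketch */

/* Step to the next entry; the arch decides what "next" means. */
static inline hw_ptep hw_pte_next(hw_ptep p)
{
	return (hw_ptep){ .ptr = p.ptr + 1 };
}

/* Write one entry through the handle. */
static inline void set_pte(hw_ptep p, pte_t pte)
{
	assert(p.ptr);
	*(volatile uint64_t *)&p.ptr->val = pte.val;
}

/* Simplified stand-in: advance the value by one page. */
static inline pte_t pte_next_pfn(pte_t pte)
{
	return (pte_t){ .val = pte.val + (1ULL << PAGE_SHIFT) };
}

/* Simplified set_ptes(): nr must be at least 1, as in the loop quoted earlier. */
static void set_ptes(hw_ptep ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		set_pte(ptep, pte);
		if (--nr == 0)
			break;
		/* "ptep++" is rejected by any C compiler: ptep is a struct. */
		ptep = hw_pte_next(ptep);
		pte = pte_next_pfn(pte);
	}
}
```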
>
> I dunno, I'm not convinced that changing around ~450 files is worth it, and
> _if_ we want to do something like this I would strongly prefer the way that
> is less churny.
Probably you grepped these types to come up with 450 files? But we aren't going
to update all files. Only the generic code would be converted with one or two
architectures. It's architecture opt-in. It'll be transparent to non-converted
architectures. So if arch/ is excluded, the number of files would become a
quarter?
This type change is going to localize the future churn. It is a one-time cost;
after that, every future representation change lives behind the accessors.
If pte_t * stays the live type, each such change is another N-file audit.
The struct buys us one choke point to evolve.
--
Thanks,
Usama
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: mm: opaque hardware page-table entry handles
2026-06-24 14:09 mm: opaque hardware page-table entry handles Usama Anjum
2026-06-24 15:52 ` Zi Yan
2026-06-24 19:25 ` Pedro Falcato
@ 2026-07-01 20:56 ` David Hildenbrand (Arm)
2 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-07-01 20:56 UTC (permalink / raw)
To: Usama Anjum, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
Mike Rapoport, Ryan Roberts, Anshuman Khandual, Catalin Marinas,
Will Deacon, Samuel Holland
Cc: linux-mm, linux-arm-kernel, linux-kernel
On 6/24/26 16:09, Usama Anjum wrote:
> Hi all,
Hi!
>
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, *pud, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
>
> 1. a page-table entry value (pte_t, pmd_t, ...)
> 2. a pointer to an entry value, e.g. a pXX_t on the stack
> 3. a pointer to a live entry in the hardware page table
>
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.
Yes, I just stumbled over that myself while working with Levi on some folded page
table optimizations for pgdp_get() and friends.
The stack usage is nasty. Calling ptep_get() on stack values makes no sense.
Reading actual page table values without ptep_get() is suboptimal. Punching
stack pointers into functions that don't expect them is shaky.
>
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
>
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
>
> /* opaque handle to a HW page-table entry; not dereferenceable */
> typedef struct {
> pte_t *ptr;
> } hw_ptep;
I guess the proper way of doing it would really be for hw_ptes to have a
distinct type, to completely decouple both concepts.
That's where the fun begins :(
We'd need hw_ptep++ to jump to the next entry in the page table. Assuming we're
on 32bit and have 64bit entries, would that work with the hw_ptep? hw_pte_next()
is rather nasty.
So, similar to what Pedro says
typedef struct {
pte_t __pte;
} hw_pte_t;
And then simply use
hw_pte_t *hptep;
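As a quick userspace sketch of that variant (not kernel code): pte_t is assumed to be a 64-bit value wrapped in a struct, and ptep_get() and sum_entries() below are simplified stand-ins. Because hw_pte_t has the same size as pte_t, hptep++ steps exactly one entry whatever the pointer width is, and a pte_t * to a stack copy is a different pointer type that the compiler diagnoses when passed.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins; a 64-bit entry value is assumed. */
typedef struct { uint64_t pte; } pte_t;

/* Distinct type for entries living in a page table, as in the mail. */
typedef struct { pte_t __pte; } hw_pte_t;

/* Same size, so pointer arithmetic on hw_pte_t * steps one full entry. */
_Static_assert(sizeof(hw_pte_t) == sizeof(pte_t), "hw_pte_t must not add padding");

/* Simplified accessor: only takes pointers to table entries. */
static inline pte_t ptep_get(hw_pte_t *hptep)
{
	return (pte_t){ .pte = *(volatile uint64_t *)&hptep->__pte.pte };
}

/* Walk nr consecutive entries with plain pointer arithmetic. */
static uint64_t sum_entries(hw_pte_t *hptep, unsigned int nr)
{
	uint64_t sum = 0;

	while (nr--) {
		sum += ptep_get(hptep).pte;
		hptep++;	/* advances by sizeof(hw_pte_t) bytes */
	}
	return sum;
}

/*
 * pte_t copy; ptep_get(&copy); is diagnosed as an incompatible pointer
 * type, since pte_t * does not convert to hw_pte_t *.
 */
```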
>
> With this:
>
> - a stack value can no longer masquerade as a hardware table entry,
Right. What we don't care about is if someone deliberately would instantiate a
hw_pte_t above on the stack. We can catch that more easily.
> - a hardware handle can no longer be raw-dereferenced,
That's the important part, yes.
> - cases that genuinely operate on a value can be refactored to pass the value
> and let the caller, which knows whether it holds a handle or a stack copy,
> read it once.
The question is if these cases really just support one type of pointer (I assume
so).
>
> The overload becomes a compile-time type error instead of a silent runtime bug,
> and converting the tree forces every such site to be made explicit. This gives
> us a framework where the architecture can completely virtualize the pgtable if
> it likes; and the compiler can enforce that higher level code can't accidentally
> work around it.
>
> It is opt-in by architectures and incremental. The generic definition is
> just an alias, so arches that do not care build unchanged:
>
> typedef pte_t *hw_ptep;
Like Pedro says, pointer typedefs are really nasty.
>
> An arch flips to the strong struct type when it is ready, and only then does
> it get the stronger checking. This lets the conversion land gradually.
>
> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
> features that need tighter control over how page tables are accessed and
> manipulated.
>
> Getter flavours
> ---------------
> While converting, it is useful to have two accessor flavours at each level:
>
> - pXXp_get(hw_ptep) plain C dereference (compiler may optimize)
That's just what we have. Defaults to READ_ONCE().
> - pXXp_get_once(hw_ptep) single-copy-atomic, not torn, elided or
> duplicated by the compiler
Why do we need this and what would we use it for?
>
> Keeping them distinct simplifies the conversion and avoids re-introducing the
> class of lockless-read bugs seen on 32-bit.
>
> Example conversion
> ------------------
> Most of the conversion is mechanical.
>
> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> - pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> + hw_ptep ptep, pte_t pte, unsigned int nr)
hw_pte_t *ptep, pte_t pte, unsigned int nr)
or (with sw ptep)
pte_t *ptep, pte_t pte, unsigned int nr)
> {
> page_table_check_ptes_set(mm, addr, ptep, pte, nr);
> for (;;) {
> set_pte(ptep, pte);
> if (--nr == 0)
> break;
> - ptep++;
> + ptep = hw_pte_next(ptep);
We should really just let ptep++ work as before.
> pte = pte_next_pfn(pte);
> }
> }
>
> The bulk of work is this kind of rote substitution. The genuine work is the
> handful of sites that turn out to be operating on a stack copy rather than a
> live entry - those are exactly the ones the new type forces us to surface and
> fix.
>
> Estimated churn:
> ----------------
> Half way through the prototyping converting only PTE and PMD levels:
> 77 files changed, +1801 / -1425
> ~57 files reference the new types
>
> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
> converted; expect meaningfully more churn than the numbers above.
>
> Introduce the type as an alias, convert one helper family per patch, and flip
> an arch to the strong type last - with non-opted arches building unchanged at
> every step.
>
> Open questions
> --------------
> - Is the type-safety + future-feature enablement worth the churn?
We have to minimize the churn. But yes, we really have to find a way to stop
ptep_get() and friends getting used on stack variables, or *ptep getting used
without ptep_get().
We have object_is_on_stack(), but that doesn't really allow for compile-time
checks ... and I don't know how safe it is in general.
> - Naming: hw_ptep/hw_pmdp vs something else?
Really avoid ptep typedefs.
> - Should all five levels be converted before merging anything, or is a staged
> PTE-and-PMD then landing others acceptable?
> - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
> level?
I'm still not sure about the _once() really, and if we need that right now. We
survived without it so far, why do we need it now?
--
Cheers,
David
^ permalink raw reply [flat|nested] 8+ messages in thread