From: "Zi Yan" <ziy@nvidia.com>
To: "Usama Anjum" <usama.anjum@arm.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Lorenzo Stoakes" <ljs@kernel.org>,
"David Hildenbrand" <david@kernel.org>,
"Liam R. Howlett" <liam@infradead.org>,
"Mike Rapoport" <rppt@kernel.org>,
"Ryan Roberts" <ryan.roberts@arm.com>,
"Anshuman Khandual" <anshuman.khandual@arm.com>,
"Catalin Marinas" <catalin.marinas@arm.com>,
"Will Deacon" <will@kernel.org>,
"Samuel Holland" <samuel.holland@sifive.com>
Cc: <linux-mm@kvack.org>, <linux-arm-kernel@lists.infradead.org>,
<linux-kernel@vger.kernel.org>
Subject: Re: mm: opaque hardware page-table entry handles
Date: Wed, 24 Jun 2026 11:52:49 -0400
Message-ID: <DJHEF8MWTZSC.3KMTACZH7KWP@nvidia.com>
In-Reply-To: <74182e50-b54f-4d2d-a27f-3a59a538d6bc@arm.com>

On Wed Jun 24, 2026 at 10:09 AM EDT, Usama Anjum wrote:
> Hi all,
>
> This is a direction-check with the wider community before spending time on the
> development. This picks up the idea that was raised and broadly agreed in the
> earlier thread (Ryan Roberts, Lorenzo Stoakes, David Hildenbrand) [1].
>
> The problem
> -----------
> Core MM code reaches page-table entries by raw pointer dereference (pte_t *,
> pmd_t *, pud_t *, ...) in places, implicitly assuming a single, uniform
> representation. Sprinkling getters wouldn't solve the problem entirely. The
> problem is one level up: the *pointer type* itself is overloaded. At each level
> there are really three distinct things:
>
> 1. a page-table entry value (pte_t, pmd_t, ...)
> 2. a pointer to an entry value, e.g. a pXX_t on the stack
> 3. a pointer to a live entry in the hardware page table

This sounds good to me, but can you clarify the situation below?

Does "a live entry" mean an entry that hardware can access while the
code is manipulating it? If so, what type should we use when we
pre-populate PTEs in a PMD page before establishing that page as a HW
page table?

__split_huge_pmd_locked() does exactly that. A PMD page is first
withdrawn and filled with the after-split PTEs; pmd_populate() and
pte_offset_map() are used on this not-yet-HW page table. Later,
pmd_populate() is used again to make the page table visible to HW.
Should we have two versions of pmd_populate() and pte_offset_map()? If
we are pedantic, the first pmd_populate() would accept a pmd_t *, but
the second one would accept a hw_pmdp. Of course, we can be flexible and
use a pmd_populate() that accepts hw_pmdp for both, since the PMD page
table we are modifying is going to be visible to HW soon. But I think we
should have clear definitions of where these types are used and document
them well.

You can probably ask LLMs to check for these ambiguous/vague uses
throughout the code base.
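
To make the question concrete, here is a minimal userspace sketch of the
split pattern described above, written as if the value path and the
handle path were separate helpers. Everything in it is an assumption for
illustration: pmd_mk_table(), pmd_table(), hw_pmd_populate() and
split_demo() are hypothetical names, the entry encoding is a toy, and
locking and barriers are left out. It is not a proposal for the actual
API.

```c
#include <stdint.h>

#define PTRS_PER_PTE 512

typedef struct { uint64_t val; } pte_t;
typedef struct { uint64_t val; } pmd_t;

/* Hypothetical opaque handle: a PMD slot that hardware can walk. */
typedef struct { pmd_t *ptr; } hw_pmdp;

/* Value flavour (hypothetical): build a PMD value pointing at a PTE table. */
static pmd_t pmd_mk_table(pte_t *table)
{
	return (pmd_t){ .val = (uint64_t)(uintptr_t)table };
}

/* Value flavour (hypothetical): find the PTE table a PMD value points at. */
static pte_t *pmd_table(pmd_t pmd)
{
	return (pte_t *)(uintptr_t)pmd.val;
}

/* Handle flavour (hypothetical): publish the table into a live slot. */
static void hw_pmd_populate(hw_pmdp pmdp, pte_t *table)
{
	/* A real implementation needs ordering barriers; omitted here. */
	*pmdp.ptr = pmd_mk_table(table);
}

static void split_demo(pmd_t *live_slot, pte_t *table, uint64_t base)
{
	/* Step 1: stack copy only, hardware cannot see any of this yet. */
	pmd_t _pmd = pmd_mk_table(table);
	pte_t *pte = pmd_table(_pmd);
	unsigned int i;

	for (i = 0; i < PTRS_PER_PTE; i++)
		pte[i].val = base + i;

	/* Step 2: only now does the table become reachable by hardware. */
	hw_pmd_populate((hw_pmdp){ .ptr = live_slot }, table);
}
```

In this sketch the pre-population step never touches a handle, so the
compiler would reject passing &_pmd where a hw_pmdp is expected. Whether
the PTE pointers used while filling the table should count as hardware
handles is exactly the definition I would like to see written down.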
>
> Today (2) and (3) share the same type - pte_t *, pmd_t *, and so on. Nothing
> distinguishes a pointer into a live table from a pointer to a stack copy.
>
> A pointer to an on-stack entry value and a pointer to a live hardware entry have
> the same type, so the compiler cannot distinguish them. Passing the stack
> pointer to an arch helper that expects a hardware-entry pointer compiles fine,
> but is wrong - a bug class the type system makes invisible. It also blocks
> evolution: an arch helper may need to read beyond the addressed entry (e.g.
> adjacent or contiguous entries), which only makes sense for a real page-table
> pointer, not a stack copy.
>
> The idea
> --------
> Give (3) its own opaque type that cannot be dereferenced:
>
>   /* opaque handle to a HW page-table entry; not dereferenceable */
>   typedef struct {
>           pte_t *ptr;
>   } hw_ptep;
>
> With this:
>
> - a stack value can no longer masquerade as a hardware table entry,
> - a hardware handle can no longer be raw-dereferenced,
> - cases that genuinely operate on a value can be refactored to pass the value
> and let the caller, which knows whether it holds a handle or a stack copy,
> read it once.
>
> The overload becomes a compile-time type error instead of a silent runtime bug,
> and converting the tree forces every such site to be made explicit. This gives
> us a framework where the architecture can completely virtualize the pgtable if
> it likes; and the compiler can enforce that higher level code can't accidentally
> work around it.
>
> It is opt-in by architectures and incremental. The generic definition is
> just an alias, so arches that do not care build unchanged:
>
> typedef pte_t *hw_ptep;
>
> An arch flips to the strong struct type when it is ready, and only then does
> it get the stronger checking. This lets the conversion land gradually.
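
A minimal sketch of how the alias and the strong type could sit behind
one set of helpers, assuming a hypothetical opt-in macro
ARCH_HAS_STRONG_HW_PTEP and hypothetical wrap/unwrap helpers
hw_ptep_wrap() and hw_ptep_raw(). None of these names come from the
proposal; the accessors are simplified userspace stand-ins.

```c
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;

#ifdef ARCH_HAS_STRONG_HW_PTEP	/* hypothetical per-arch opt-in switch */
/* Strong flavour: a distinct type, so "*ptep" no longer compiles. */
typedef struct { pte_t *ptr; } hw_ptep;

static inline hw_ptep hw_ptep_wrap(pte_t *p) { return (hw_ptep){ .ptr = p }; }
static inline pte_t *hw_ptep_raw(hw_ptep h) { return h.ptr; }
#else
/* Generic flavour: a plain alias, so existing code builds unchanged. */
typedef pte_t *hw_ptep;

static inline hw_ptep hw_ptep_wrap(pte_t *p) { return p; }
static inline pte_t *hw_ptep_raw(hw_ptep h) { return h; }
#endif

/* Callers written against the helpers compile with either definition. */
static inline pte_t ptep_get(hw_ptep ptep)
{
	return *hw_ptep_raw(ptep);
}

static inline void set_pte(hw_ptep ptep, pte_t pte)
{
	*hw_ptep_raw(ptep) = pte;
}
```

The point of the sketch is that only the two small helpers differ
between the flavours, so a caller that still dereferences ptep directly
keeps building on alias arches and only fails once an arch defines the
switch.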
>
> Beyond fixing the latent bug class, this abstraction is an enabler for upcoming
> features that need tighter control over how page tables are accessed and
> manipulated.
>
> Getter flavours
> ---------------
> While converting, it is useful to have two accessor flavours at each level:
>
> - pXXp_get(hw_ptep)        plain C dereference (compiler may optimize)
> - pXXp_get_once(hw_ptep)   single-copy-atomic, not torn, elided or
>                            duplicated by the compiler
>
> Keeping them distinct simplifies the conversion and avoids re-introducing the
> class of lockless-read bugs seen on 32-bit.
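
A rough userspace sketch of the two flavours at the PTE level, assuming
the strong struct type and a toy 64-bit entry. The volatile access stands
in for what READ_ONCE() does in the kernel; it only stops the compiler
from eliding or duplicating the load. Whether the load is really
single-copy-atomic still depends on the entry being no wider than the
architecture's native access size, which this sketch does not address.

```c
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;
typedef struct { pte_t *ptr; } hw_ptep;

/* Plain flavour: the compiler may merge, repeat or reorder this load. */
static inline pte_t ptep_get(hw_ptep ptep)
{
	return *ptep.ptr;
}

/* "Once" flavour: exactly one load is emitted for each call. */
static inline pte_t ptep_get_once(hw_ptep ptep)
{
	pte_t pte;

	/* Volatile access as a stand-in for the kernel's READ_ONCE(). */
	pte.val = *(const volatile uint64_t *)&ptep.ptr->val;
	return pte;
}
```

A lockless walker would use the "once" flavour and then work only on the
returned value, while code that holds the page-table lock could keep the
plain flavour.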
>
> Example conversion
> ------------------
> Most of the conversion is mechanical.
>
> -static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> -                            pte_t *ptep, pte_t pte, unsigned int nr)
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +                            hw_ptep ptep, pte_t pte, unsigned int nr)
>  {
>          page_table_check_ptes_set(mm, addr, ptep, pte, nr);
>          for (;;) {
>                  set_pte(ptep, pte);
>                  if (--nr == 0)
>                          break;
> -                ptep++;
> +                ptep = hw_pte_next(ptep);
>                  pte = pte_next_pfn(pte);
>          }
>  }
>
> The bulk of the work is this kind of rote substitution. The genuine work is the
> handful of sites that turn out to be operating on a stack copy rather than a
> live entry - those are exactly the ones the new type forces us to surface and
> fix.
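
For reference, a self-contained userspace version of the converted loop,
assuming the strong struct type. hw_pte_next() is named in the diff but
its body is not shown there, so the one below is a guess; the PFN
encoding (PFN stored above PAGE_SHIFT) is a toy, and the mm, addr and
page_table_check arguments are dropped to keep it standalone.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

typedef struct { uint64_t val; } pte_t;
typedef struct { pte_t *ptr; } hw_ptep;

/* Assumed body: advance the handle to the next entry in the same table. */
static inline hw_ptep hw_pte_next(hw_ptep ptep)
{
	return (hw_ptep){ .ptr = ptep.ptr + 1 };
}

/* Toy encoding: bump the PFN by one and keep the low flag bits. */
static inline pte_t pte_next_pfn(pte_t pte)
{
	return (pte_t){ .val = pte.val + (1ULL << PAGE_SHIFT) };
}

static inline void set_pte(hw_ptep ptep, pte_t pte)
{
	*ptep.ptr = pte;
}

/* Same loop shape as the converted helper; nr must be at least 1. */
static void set_ptes(hw_ptep ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		set_pte(ptep, pte);
		if (--nr == 0)
			break;
		ptep = hw_pte_next(ptep);
		pte = pte_next_pfn(pte);
	}
}
```

With the struct type, the old "ptep++" is a compile error, which is what
makes the substitution mechanical to find.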
>
> Estimated churn:
> ----------------
> Halfway through prototyping, converting only the PTE and PMD levels:
> 77 files changed, +1801 / -1425
> ~57 files reference the new types
>
> So the line count will grow once PUD/P4D/PGD and the remaining call sites are
> converted; expect meaningfully more churn than the numbers above.
>
> Introduce the type as an alias, convert one helper family per patch, and flip
> an arch to the strong type last - with non-opted arches building unchanged at
> every step.
>
> Open questions
> --------------
> - Is the type-safety + future-feature enablement worth the churn?
> - Naming: hw_ptep/hw_pmdp vs something else?
> - Should all five levels be converted before merging anything, or is a staged
> approach (PTE and PMD first, the other levels later) acceptable?
> - Do we want the two getter flavours (pXXp_get / pXXp_get_once) at every
> level?
>
> [1] https://lore.kernel.org/all/a063f6c5-2785-4a9f-8079-25edb3e54cef@arm.com
>
> Thanks,
> Usama
--
Best Regards,
Yan, Zi