From: Yang Shi <yang@os.amperecomputing.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
"David Hildenbrand (Arm)" <david@kernel.org>,
Dev Jain <dev.jain@arm.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Jinjiang Tu <tujinjiang@huawei.com>,
Kevin Brodsky <kevin.brodsky@arm.com>
Cc: linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH v1 2/3] arm64: mm: Handle invalid large leaf mappings correctly
Date: Tue, 24 Mar 2026 11:20:07 -0700 [thread overview]
Message-ID: <401073fd-3438-419d-8287-35eea61919b0@os.amperecomputing.com> (raw)
In-Reply-To: <20260323130317.1737522-3-ryan.roberts@arm.com>
On 3/23/26 6:03 AM, Ryan Roberts wrote:
> It has been possible for a long time to mark ptes in the linear map as
> invalid. This is done for secretmem, kfence, realm dma memory un/share,
> and others, by simply clearing the PTE_VALID bit. But until commit
> a166563e7ec37 ("arm64: mm: support large block mapping when
> rodata=full") large leaf mappings were never made invalid in this way.
>
> It turns out various parts of the code base are not equipped to handle
> invalid large leaf mappings (in the way they are currently encoded) and
> I've observed a kernel panic while booting a realm guest on a
> BBML2_NOABORT system as a result:
>
> [ 15.432706] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
> [ 15.476896] Unable to handle kernel paging request at virtual address ffff000019600000
> [ 15.513762] Mem abort info:
> [ 15.527245] ESR = 0x0000000096000046
> [ 15.548553] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 15.572146] SET = 0, FnV = 0
> [ 15.592141] EA = 0, S1PTW = 0
> [ 15.612694] FSC = 0x06: level 2 translation fault
> [ 15.640644] Data abort info:
> [ 15.661983] ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
> [ 15.694875] CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [ 15.723740] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 15.755776] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081f3f000
> [ 15.800410] [ffff000019600000] pgd=0000000000000000, p4d=180000009ffff403, pud=180000009fffe403, pmd=00e8000199600704
> [ 15.855046] Internal error: Oops: 0000000096000046 [#1] SMP
> [ 15.886394] Modules linked in:
> [ 15.900029] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-dirty #4 PREEMPT
> [ 15.935258] Hardware name: linux,dummy-virt (DT)
> [ 15.955612] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [ 15.986009] pc : __pi_memcpy_generic+0x128/0x22c
> [ 16.006163] lr : swiotlb_bounce+0xf4/0x158
> [ 16.024145] sp : ffff80008000b8f0
> [ 16.038896] x29: ffff80008000b8f0 x28: 0000000000000000 x27: 0000000000000000
> [ 16.069953] x26: ffffb3976d261ba8 x25: 0000000000000000 x24: ffff000019600000
> [ 16.100876] x23: 0000000000000001 x22: ffff0000043430d0 x21: 0000000000007ff0
> [ 16.131946] x20: 0000000084570010 x19: 0000000000000000 x18: ffff00001ffe3fcc
> [ 16.163073] x17: 0000000000000000 x16: 00000000003fffff x15: 646e612065766974
> [ 16.194131] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> [ 16.225059] x11: 0000000000000000 x10: 0000000000000010 x9 : 0000000000000018
> [ 16.256113] x8 : 0000000000000018 x7 : 0000000000000000 x6 : 0000000000000000
> [ 16.287203] x5 : ffff000019607ff0 x4 : ffff000004578000 x3 : ffff000019600000
> [ 16.318145] x2 : 0000000000007ff0 x1 : ffff000004570010 x0 : ffff000019600000
> [ 16.349071] Call trace:
> [ 16.360143] __pi_memcpy_generic+0x128/0x22c (P)
> [ 16.380310] swiotlb_tbl_map_single+0x154/0x2b4
> [ 16.400282] swiotlb_map+0x5c/0x228
> [ 16.415984] dma_map_phys+0x244/0x2b8
> [ 16.432199] dma_map_page_attrs+0x44/0x58
> [ 16.449782] virtqueue_map_page_attrs+0x38/0x44
> [ 16.469596] virtqueue_map_single_attrs+0xc0/0x130
> [ 16.490509] virtnet_rq_alloc.isra.0+0xa4/0x1fc
> [ 16.510355] try_fill_recv+0x2a4/0x584
> [ 16.526989] virtnet_open+0xd4/0x238
> [ 16.542775] __dev_open+0x110/0x24c
> [ 16.558280] __dev_change_flags+0x194/0x20c
> [ 16.576879] netif_change_flags+0x24/0x6c
> [ 16.594489] dev_change_flags+0x48/0x7c
> [ 16.611462] ip_auto_config+0x258/0x1114
> [ 16.628727] do_one_initcall+0x80/0x1c8
> [ 16.645590] kernel_init_freeable+0x208/0x2f0
> [ 16.664917] kernel_init+0x24/0x1e0
> [ 16.680295] ret_from_fork+0x10/0x20
> [ 16.696369] Code: 927cec03 cb0e0021 8b0e0042 a9411c26 (a900340c)
> [ 16.723106] ---[ end trace 0000000000000000 ]---
> [ 16.752866] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [ 16.792556] Kernel Offset: 0x3396ea200000 from 0xffff800080000000
> [ 16.818966] PHYS_OFFSET: 0xfff1000080000000
> [ 16.837237] CPU features: 0x0000000,00060005,13e38581,957e772f
> [ 16.862904] Memory Limit: none
> [ 16.876526] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
>
> This panic occurs because the swiotlb memory was previously shared to
> the host (__set_memory_enc_dec()), which involves transitioning the
> (large) leaf mappings to invalid, sharing to the host, then marking the
> mappings valid again. But pageattr_p[mu]d_entry() would only update the
> entry if it was a section mapping, since otherwise it concluded it must
> be a table entry and so shouldn't be modified. But p[mu]d_sect() only
> returns true if the entry is valid. So the result was that the large
> leaf entry was made invalid in the first pass then ignored in the second
> pass. It remains invalid until the above code tries to access it and
> blows up.
Good catch. I recall I ran into this problem when I worked on a very early
PoC of the large block mapping patch, which took a totally different
approach than BBML2_NOABORT. I didn't hit it when I implemented
BBML2_NOABORT because nothing actually changed the valid/invalid attribute
on a large block mapping granule, so I forgot about it. But I definitely
missed the realm use case.
>
> The simple fix would be to update pageattr_pmd_entry() to use
> !pmd_table() instead of pmd_sect(). That would solve this problem.
Yes, I agree.
>
> But the ptdump code also suffers from a similar issue. It checks
> pmd_leaf() and doesn't call into the arch-specific note_page() machinery
> if it returns false. As a result of this, ptdump wasn't even able to
> show the invalid large leaf mappings; it looked like they were valid
> which made this super fun to debug. The ptdump code is core-mm and
> pmd_table() is arm64-specific, so we can't use the same trick to solve
> that.
I don't quite get why we need to show invalid mappings in ptdump. IIUC,
ptdump is not supposed to show invalid mappings, even transient ones.
Thanks,
Yang
>
> But we already support the concept of "present-invalid" for user space
> entries. And even better, pmd_leaf() will return true for a leaf mapping
> that is marked present-invalid. So let's just use that encoding for
> present-invalid kernel mappings too. Then we can use pmd_leaf() where we
> previously used pmd_sect() and everything is magically fixed.
>
> Additionally, from inspection kernel_page_present() was broken in a
> similar way, so I'm also updating that to use pmd_leaf().
>
> I haven't spotted any other issues of this shape but plan to do a follow
> up patch to remove pmd_sect() and pud_sect() in favour of the more
> sophisticated pmd_leaf()/pud_leaf() which are core-mm APIs and will
> simplify arm64 code a bit.
>
> Fixes: a166563e7ec37 ("arm64: mm: support large block mapping when rodata=full")
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/pageattr.c | 50 ++++++++++++++++++++++------------------
> 1 file changed, 28 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 358d1dc9a576f..87dfe4c82fa92 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -25,6 +25,11 @@ static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
> {
> struct page_change_data *masks = walk->private;
>
> + /*
> + * Some users clear and set bits which alias each other (e.g. PTE_NG and
> + * PTE_PRESENT_INVALID). It is therefore important that we always clear
> + * first, then set.
> + */
> val &= ~(pgprot_val(masks->clear_mask));
> val |= (pgprot_val(masks->set_mask));
>
> @@ -36,7 +41,7 @@ static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> {
> pud_t val = pudp_get(pud);
>
> - if (pud_sect(val)) {
> + if (pud_leaf(val)) {
> if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> return -EINVAL;
> val = __pud(set_pageattr_masks(pud_val(val), walk));
> @@ -52,7 +57,7 @@ static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> {
> pmd_t val = pmdp_get(pmd);
>
> - if (pmd_sect(val)) {
> + if (pmd_leaf(val)) {
> if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> return -EINVAL;
> val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> @@ -132,11 +137,12 @@ static int __change_memory_common(unsigned long start, unsigned long size,
> ret = update_range_prot(start, size, set_mask, clear_mask);
>
> /*
> - * If the memory is being made valid without changing any other bits
> - * then a TLBI isn't required as a non-valid entry cannot be cached in
> - * the TLB.
> + * If the memory is being switched from present-invalid to valid without
> + * changing any other bits then a TLBI isn't required as a non-valid
> + * entry cannot be cached in the TLB.
> */
> - if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
> + if (pgprot_val(set_mask) != (PTE_MAYBE_NG | PTE_VALID) ||
> + pgprot_val(clear_mask) != PTE_PRESENT_INVALID)
> flush_tlb_kernel_range(start, start + size);
> return ret;
> }
> @@ -237,18 +243,18 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
> {
> if (enable)
> return __change_memory_common(addr, PAGE_SIZE * numpages,
> - __pgprot(PTE_VALID),
> - __pgprot(0));
> + __pgprot(PTE_MAYBE_NG | PTE_VALID),
> + __pgprot(PTE_PRESENT_INVALID));
> else
> return __change_memory_common(addr, PAGE_SIZE * numpages,
> - __pgprot(0),
> - __pgprot(PTE_VALID));
> + __pgprot(PTE_PRESENT_INVALID),
> + __pgprot(PTE_MAYBE_NG | PTE_VALID));
> }
>
> int set_direct_map_invalid_noflush(struct page *page)
> {
> - pgprot_t clear_mask = __pgprot(PTE_VALID);
> - pgprot_t set_mask = __pgprot(0);
> + pgprot_t clear_mask = __pgprot(PTE_MAYBE_NG | PTE_VALID);
> + pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID);
>
> if (!can_set_direct_map())
> return 0;
> @@ -259,8 +265,8 @@ int set_direct_map_invalid_noflush(struct page *page)
>
> int set_direct_map_default_noflush(struct page *page)
> {
> - pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
> - pgprot_t clear_mask = __pgprot(PTE_RDONLY);
> + pgprot_t set_mask = __pgprot(PTE_MAYBE_NG | PTE_VALID | PTE_WRITE);
> + pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY);
>
> if (!can_set_direct_map())
> return 0;
> @@ -296,8 +302,8 @@ static int __set_memory_enc_dec(unsigned long addr,
> * entries or Synchronous External Aborts caused by RIPAS_EMPTY
> */
> ret = __change_memory_common(addr, PAGE_SIZE * numpages,
> - __pgprot(set_prot),
> - __pgprot(clear_prot | PTE_VALID));
> + __pgprot(set_prot | PTE_PRESENT_INVALID),
> + __pgprot(clear_prot | PTE_MAYBE_NG | PTE_VALID));
>
> if (ret)
> return ret;
> @@ -311,8 +317,8 @@ static int __set_memory_enc_dec(unsigned long addr,
> return ret;
>
> return __change_memory_common(addr, PAGE_SIZE * numpages,
> - __pgprot(PTE_VALID),
> - __pgprot(0));
> + __pgprot(PTE_MAYBE_NG | PTE_VALID),
> + __pgprot(PTE_PRESENT_INVALID));
> }
>
> static int realm_set_memory_encrypted(unsigned long addr, int numpages)
> @@ -404,15 +410,15 @@ bool kernel_page_present(struct page *page)
> pud = READ_ONCE(*pudp);
> if (pud_none(pud))
> return false;
> - if (pud_sect(pud))
> - return true;
> + if (pud_leaf(pud))
> + return pud_valid(pud);
>
> pmdp = pmd_offset(pudp, addr);
> pmd = READ_ONCE(*pmdp);
> if (pmd_none(pmd))
> return false;
> - if (pmd_sect(pmd))
> - return true;
> + if (pmd_leaf(pmd))
> + return pmd_valid(pmd);
>
> ptep = pte_offset_kernel(pmdp, addr);
> return pte_valid(__ptep_get(ptep));