public inbox for linux-arm-kernel@lists.infradead.org
* [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT
@ 2026-03-30 16:17 Ryan Roberts
  2026-03-30 16:17 ` [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests Ryan Roberts
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Ryan Roberts @ 2026-03-30 16:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, David Hildenbrand (Arm), Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

Hi All,

This fixes a couple of bugs in the "large block mappings for linear map when we
have BBML2_NOABORT" feature when used in conjunction with a CCA realm guest.
While investigating I found and fixed some more general issues too. See commit
logs for full explanations.

Applies on top of v7.0-rc4.

Changes since v1 [1]
====================

Patch 1:
  - Moved page_alloc_available declaration to asm/mmu.h (per Kevin)
  - Added PTE_PRESENT_VALID_KERNEL macro to hide VALID|NG confusion (per Kevin)
  - Improved logic in split_kernel_leaf_mapping() to avoid warning for
    DEBUG_PAGEALLOC (per Sashiko)
Patch 2:
  - Fixed transitional pgtables to handle present-invalid large leaves (per
    Sashiko)
  - Hardened split_pXd() for present-invalid leaves (per Sashiko)
Patch 3:
  - Converted pXd_leaf() to function to avoid multi-eval of READ_ONCE() (per
    Sashiko)

[1] https://lore.kernel.org/all/20260323130317.1737522-1-ryan.roberts@arm.com/

Thanks,
Ryan


Ryan Roberts (3):
  arm64: mm: Fix rodata=full block mapping support for realm guests
  arm64: mm: Handle invalid large leaf mappings correctly
  arm64: mm: Remove pmd_sect() and pud_sect()

 arch/arm64/include/asm/mmu.h          |  2 +
 arch/arm64/include/asm/pgtable-prot.h |  2 +
 arch/arm64/include/asm/pgtable.h      | 28 +++++++----
 arch/arm64/mm/init.c                  |  9 +++-
 arch/arm64/mm/mmu.c                   | 67 ++++++++++++++++++---------
 arch/arm64/mm/pageattr.c              | 50 +++++++++++---------
 arch/arm64/mm/trans_pgd.c             | 42 +++--------------
 7 files changed, 111 insertions(+), 89 deletions(-)

--
2.43.0



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
  2026-03-30 16:17 [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Ryan Roberts
@ 2026-03-30 16:17 ` Ryan Roberts
  2026-03-31 14:35   ` Suzuki K Poulose
  2026-04-02 20:43   ` Catalin Marinas
  2026-03-30 16:17 ` [PATCH v2 2/3] arm64: mm: Handle invalid large leaf mappings correctly Ryan Roberts
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 8+ messages in thread
From: Ryan Roberts @ 2026-03-30 16:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, David Hildenbrand (Arm), Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, stable

Commit a166563e7ec37 ("arm64: mm: support large block mapping when
rodata=full") enabled the linear map to be mapped by block/cont while
still allowing granular permission changes on BBML2_NOABORT systems by
lazily splitting the live mappings. This mechanism was intended to be
usable by realm guests since they need to dynamically share dma buffers
with the host by "decrypting" them - which for Arm CCA, means marking
them as shared in the page tables.

However, it turns out that the mechanism was failing for realm guests
because realms need to share their dma buffers (via
__set_memory_enc_dec()) much earlier during boot than
split_kernel_leaf_mapping() was able to handle. The report linked below
showed that GIC's ITS was one such user. But during the investigation I
found other callsites that could not meet the
split_kernel_leaf_mapping() constraints.

The problem is that we block map the linear map based on the boot CPU
supporting BBML2_NOABORT, then check that all the other CPUs support it
too when finalizing the caps. If they don't, then we stop_machine() and
split to ptes. For safety, split_kernel_leaf_mapping() previously
wouldn't permit splitting until after the caps were finalized. That
ensured that if any secondary cpus were running that didn't support
BBML2_NOABORT, we wouldn't risk breaking them.

I've fixed this problem by reducing the black-out window where we refuse
to split; there are now 2 windows. The first is from T0 until the page
allocator is initialized. Splitting allocates memory from the page
allocator, so it must be available. The second covers the period from
starting to online the secondary cpus until the system caps are
finalized (this is a very small window).

All of the problematic callers are calling __set_memory_enc_dec() before
the secondary cpus come online, so this solves the problem. However, one
of these callers, swiotlb_update_mem_attributes(), was trying to split
before the page allocator was initialized. So I have moved this call
from arch_mm_preinit() to mem_init(), which solves the ordering issue.

I've added warnings and now return an error if any attempt is made to
split inside either black-out window.

Note there are other issues which prevent booting all the way to user
space, which will be fixed in subsequent patches.

Reported-by: Jinjiang Tu <tujinjiang@huawei.com>
Closes: https://lore.kernel.org/all/0b2a4ae5-fc51-4d77-b177-b2e9db74f11d@huawei.com/
Fixes: a166563e7ec37 ("arm64: mm: support large block mapping when rodata=full")
Cc: stable@vger.kernel.org
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/mmu.h |  2 ++
 arch/arm64/mm/init.c         |  9 +++++++-
 arch/arm64/mm/mmu.c          | 45 +++++++++++++++++++++++++-----------
 3 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 137a173df1ff8..472610433aaea 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -112,5 +112,7 @@ void kpti_install_ng_mappings(void);
 static inline void kpti_install_ng_mappings(void) {}
 #endif
 
+extern bool page_alloc_available;
+
 #endif	/* !__ASSEMBLER__ */
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 96711b8578fd0..b9b248d24fd10 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -350,7 +350,6 @@ void __init arch_mm_preinit(void)
 	}
 
 	swiotlb_init(swiotlb, flags);
-	swiotlb_update_mem_attributes();
 
 	/*
 	 * Check boundaries twice: Some fundamental inconsistencies can be
@@ -377,6 +376,14 @@ void __init arch_mm_preinit(void)
 	}
 }
 
+bool page_alloc_available __ro_after_init;
+
+void __init mem_init(void)
+{
+	page_alloc_available = true;
+	swiotlb_update_mem_attributes();
+}
+
 void free_initmem(void)
 {
 	void *lm_init_begin = lm_alias(__init_begin);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a6a00accf4f93..223947487a223 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -768,30 +768,51 @@ static inline bool force_pte_mapping(void)
 }
 
 static DEFINE_MUTEX(pgtable_split_lock);
+static bool linear_map_requires_bbml2;
 
 int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 {
 	int ret;
 
-	/*
-	 * !BBML2_NOABORT systems should not be trying to change permissions on
-	 * anything that is not pte-mapped in the first place. Just return early
-	 * and let the permission change code raise a warning if not already
-	 * pte-mapped.
-	 */
-	if (!system_supports_bbml2_noabort())
-		return 0;
-
 	/*
 	 * If the region is within a pte-mapped area, there is no need to try to
 	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
 	 * change permissions from atomic context so for those cases (which are
 	 * always pte-mapped), we must not go any further because taking the
-	 * mutex below may sleep.
+	 * mutex below may sleep. Do not call force_pte_mapping() here because
+	 * it could return a confusing result if called from a secondary cpu
+	 * prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us
+	 * what we need.
 	 */
-	if (force_pte_mapping() || is_kfence_address((void *)start))
+	if (!linear_map_requires_bbml2 || is_kfence_address((void *)start))
 		return 0;
 
+	if (!system_supports_bbml2_noabort()) {
+		/*
+		 * !BBML2_NOABORT systems should not be trying to change
+		 * permissions on anything that is not pte-mapped in the first
+		 * place. Just return early and let the permission change code
+		 * raise a warning if not already pte-mapped.
+		 */
+		if (system_capabilities_finalized())
+			return 0;
+
+		/*
+		 * Boot-time: split_kernel_leaf_mapping_locked() allocates from
+		 * page allocator. Can't split until it's available.
+		 */
+		if (WARN_ON(!page_alloc_available))
+			return -EBUSY;
+
+		/*
+		 * Boot-time: Started secondary cpus but don't know if they
+		 * support BBML2_NOABORT yet. Can't allow splitting in this
+		 * window in case they don't.
+		 */
+		if (WARN_ON(num_online_cpus() > 1))
+			return -EBUSY;
+	}
+
 	/*
 	 * Ensure start and end are at least page-aligned since this is the
 	 * finest granularity we can split to.
@@ -891,8 +912,6 @@ static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp
 	return ret;
 }
 
-static bool linear_map_requires_bbml2 __initdata;
-
 u32 idmap_kpti_bbml2_flag;
 
 static void __init init_idmap_kpti_bbml2_flag(void)
-- 
2.43.0




* [PATCH v2 2/3] arm64: mm: Handle invalid large leaf mappings correctly
  2026-03-30 16:17 [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Ryan Roberts
  2026-03-30 16:17 ` [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests Ryan Roberts
@ 2026-03-30 16:17 ` Ryan Roberts
  2026-03-30 16:17 ` [PATCH v2 3/3] arm64: mm: Remove pmd_sect() and pud_sect() Ryan Roberts
  2026-04-02 21:11 ` [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Catalin Marinas
  3 siblings, 0 replies; 8+ messages in thread
From: Ryan Roberts @ 2026-03-30 16:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, David Hildenbrand (Arm), Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel, stable

It has been possible for a long time to mark ptes in the linear map as
invalid. This is done for secretmem, kfence, realm dma memory un/share,
and others, by simply clearing the PTE_VALID bit. But until commit
a166563e7ec37 ("arm64: mm: support large block mapping when
rodata=full") large leaf mappings were never made invalid in this way.

It turns out various parts of the code base are not equipped to handle
invalid large leaf mappings (in the way they are currently encoded) and
I've observed a kernel panic while booting a realm guest on a
BBML2_NOABORT system as a result:

[   15.432706] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
[   15.476896] Unable to handle kernel paging request at virtual address ffff000019600000
[   15.513762] Mem abort info:
[   15.527245]   ESR = 0x0000000096000046
[   15.548553]   EC = 0x25: DABT (current EL), IL = 32 bits
[   15.572146]   SET = 0, FnV = 0
[   15.592141]   EA = 0, S1PTW = 0
[   15.612694]   FSC = 0x06: level 2 translation fault
[   15.640644] Data abort info:
[   15.661983]   ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
[   15.694875]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
[   15.723740]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   15.755776] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000081f3f000
[   15.800410] [ffff000019600000] pgd=0000000000000000, p4d=180000009ffff403, pud=180000009fffe403, pmd=00e8000199600704
[   15.855046] Internal error: Oops: 0000000096000046 [#1]  SMP
[   15.886394] Modules linked in:
[   15.900029] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-dirty #4 PREEMPT
[   15.935258] Hardware name: linux,dummy-virt (DT)
[   15.955612] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   15.986009] pc : __pi_memcpy_generic+0x128/0x22c
[   16.006163] lr : swiotlb_bounce+0xf4/0x158
[   16.024145] sp : ffff80008000b8f0
[   16.038896] x29: ffff80008000b8f0 x28: 0000000000000000 x27: 0000000000000000
[   16.069953] x26: ffffb3976d261ba8 x25: 0000000000000000 x24: ffff000019600000
[   16.100876] x23: 0000000000000001 x22: ffff0000043430d0 x21: 0000000000007ff0
[   16.131946] x20: 0000000084570010 x19: 0000000000000000 x18: ffff00001ffe3fcc
[   16.163073] x17: 0000000000000000 x16: 00000000003fffff x15: 646e612065766974
[   16.194131] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[   16.225059] x11: 0000000000000000 x10: 0000000000000010 x9 : 0000000000000018
[   16.256113] x8 : 0000000000000018 x7 : 0000000000000000 x6 : 0000000000000000
[   16.287203] x5 : ffff000019607ff0 x4 : ffff000004578000 x3 : ffff000019600000
[   16.318145] x2 : 0000000000007ff0 x1 : ffff000004570010 x0 : ffff000019600000
[   16.349071] Call trace:
[   16.360143]  __pi_memcpy_generic+0x128/0x22c (P)
[   16.380310]  swiotlb_tbl_map_single+0x154/0x2b4
[   16.400282]  swiotlb_map+0x5c/0x228
[   16.415984]  dma_map_phys+0x244/0x2b8
[   16.432199]  dma_map_page_attrs+0x44/0x58
[   16.449782]  virtqueue_map_page_attrs+0x38/0x44
[   16.469596]  virtqueue_map_single_attrs+0xc0/0x130
[   16.490509]  virtnet_rq_alloc.isra.0+0xa4/0x1fc
[   16.510355]  try_fill_recv+0x2a4/0x584
[   16.526989]  virtnet_open+0xd4/0x238
[   16.542775]  __dev_open+0x110/0x24c
[   16.558280]  __dev_change_flags+0x194/0x20c
[   16.576879]  netif_change_flags+0x24/0x6c
[   16.594489]  dev_change_flags+0x48/0x7c
[   16.611462]  ip_auto_config+0x258/0x1114
[   16.628727]  do_one_initcall+0x80/0x1c8
[   16.645590]  kernel_init_freeable+0x208/0x2f0
[   16.664917]  kernel_init+0x24/0x1e0
[   16.680295]  ret_from_fork+0x10/0x20
[   16.696369] Code: 927cec03 cb0e0021 8b0e0042 a9411c26 (a900340c)
[   16.723106] ---[ end trace 0000000000000000 ]---
[   16.752866] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   16.792556] Kernel Offset: 0x3396ea200000 from 0xffff800080000000
[   16.818966] PHYS_OFFSET: 0xfff1000080000000
[   16.837237] CPU features: 0x0000000,00060005,13e38581,957e772f
[   16.862904] Memory Limit: none
[   16.876526] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

This panic occurs because the swiotlb memory was previously shared to
the host (__set_memory_enc_dec()), which involves transitioning the
(large) leaf mappings to invalid, sharing to the host, then marking the
mappings valid again. But pageattr_p[mu]d_entry() would only update the
entry if it is a section mapping, since otherwise it concluded it must
be a table entry so shouldn't be modified. But p[mu]d_sect() only
returns true if the entry is valid. So the result was that the large
leaf entry was made invalid in the first pass then ignored in the second
pass. It remains invalid until the above code tries to access it and
blows up.

The simple fix would be to update pageattr_pmd_entry() to use
!pmd_table() instead of pmd_sect(). That would solve this problem.

But the ptdump code also suffers from a similar issue. It checks
pmd_leaf() and doesn't call into the arch-specific note_page() machinery
if it returns false. As a result of this, ptdump wasn't even able to
show the invalid large leaf mappings; it looked like they were valid
which made this super fun to debug. the ptdump code is core-mm and
pmd_table() is arm64-specific so we can't use the same trick to solve
that.

But we already support the concept of "present-invalid" for user space
entries. And even better, pmd_leaf() will return true for a leaf mapping
that is marked present-invalid. So let's just use that encoding for
present-invalid kernel mappings too. Then we can use pmd_leaf() where we
previously used pmd_sect() and everything is magically fixed.

Additionally, from inspection, kernel_page_present() was broken in a
similar way, so I'm also updating that to use p[mu]d_leaf().

The transitional page tables component was also similarly broken; it
creates a copy of the kernel page tables, making RO leaf mappings RW in
the process. It also makes invalid (but-not-none) pte mappings valid.
But it was not doing this for large leaf mappings. This could have
resulted in crashes at kexec- or hibernate-time. This code is fixed to
flip "present-invalid" mappings back to "present-valid" at all levels.

Finally, I have hardened split_pmd()/split_pud() so that if it is passed
a "present-invalid" leaf, it will maintain that property in the split
leaves, since I wasn't able to convince myself that it would only ever
be called for "present-valid" leaves.

Fixes: a166563e7ec37 ("arm64: mm: support large block mapping when rodata=full")
Cc: stable@vger.kernel.org
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable-prot.h |  2 ++
 arch/arm64/include/asm/pgtable.h      |  9 +++--
 arch/arm64/mm/mmu.c                   |  4 +++
 arch/arm64/mm/pageattr.c              | 50 +++++++++++++++------------
 arch/arm64/mm/trans_pgd.c             | 42 ++++------------------
 5 files changed, 48 insertions(+), 59 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index f560e64202674..212ce1b02e15e 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -25,6 +25,8 @@
  */
 #define PTE_PRESENT_INVALID	(PTE_NG)		 /* only when !PTE_VALID */
 
+#define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG)
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define PTE_UFFD_WP		(_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
 #define PTE_SWP_UFFD_WP		(_AT(pteval_t, 1) << 3)	 /* only for swp ptes */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49bd..dd062179b9b66 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -322,9 +322,11 @@ static inline pte_t pte_mknoncont(pte_t pte)
 	return clear_pte_bit(pte, __pgprot(PTE_CONT));
 }
 
-static inline pte_t pte_mkvalid(pte_t pte)
+static inline pte_t pte_mkvalid_k(pte_t pte)
 {
-	return set_pte_bit(pte, __pgprot(PTE_VALID));
+	pte = clear_pte_bit(pte, __pgprot(PTE_PRESENT_INVALID));
+	pte = set_pte_bit(pte, __pgprot(PTE_PRESENT_VALID_KERNEL));
+	return pte;
 }
 
 static inline pte_t pte_mkinvalid(pte_t pte)
@@ -594,6 +596,7 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+#define pmd_mkvalid_k(pmd)	pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
 #define pmd_mkinvalid(pmd)	pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 #define pmd_uffd_wp(pmd)	pte_uffd_wp(pmd_pte(pmd))
@@ -635,6 +638,8 @@ static inline pmd_t pmd_mkspecial(pmd_t pmd)
 
 #define pud_young(pud)		pte_young(pud_pte(pud))
 #define pud_mkyoung(pud)	pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite_novma(pud)	pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkvalid_k(pud)	pte_pud(pte_mkvalid_k(pud_pte(pud)))
 #define pud_write(pud)		pte_write(pud_pte(pud))
 
 static inline pud_t pud_mkhuge(pud_t pud)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 223947487a223..1575680675d8d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -602,6 +602,8 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
 		tableprot |= PMD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PTE_TYPE_MASK) | PTE_TYPE_PAGE);
+	if (!pmd_valid(pmd))
+		prot = pte_pgprot(pte_mkinvalid(pfn_pte(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
@@ -647,6 +649,8 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
 		tableprot |= PUD_TABLE_PXN;
 
 	prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PMD_TYPE_SECT);
+	if (!pud_valid(pud))
+		prot = pmd_pgprot(pmd_mkinvalid(pfn_pmd(0, prot)));
 	prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 	if (to_cont)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 358d1dc9a576f..ce035e1b4eaf6 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -25,6 +25,11 @@ static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
 {
 	struct page_change_data *masks = walk->private;
 
+	/*
+	 * Some users clear and set bits which alias each other (e.g. PTE_NG and
+	 * PTE_PRESENT_INVALID). It is therefore important that we always clear
+	 * first then set.
+	 */
 	val &= ~(pgprot_val(masks->clear_mask));
 	val |= (pgprot_val(masks->set_mask));
 
@@ -36,7 +41,7 @@ static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
 {
 	pud_t val = pudp_get(pud);
 
-	if (pud_sect(val)) {
+	if (pud_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
 			return -EINVAL;
 		val = __pud(set_pageattr_masks(pud_val(val), walk));
@@ -52,7 +57,7 @@ static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
 {
 	pmd_t val = pmdp_get(pmd);
 
-	if (pmd_sect(val)) {
+	if (pmd_leaf(val)) {
 		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
 			return -EINVAL;
 		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
@@ -132,11 +137,12 @@ static int __change_memory_common(unsigned long start, unsigned long size,
 	ret = update_range_prot(start, size, set_mask, clear_mask);
 
 	/*
-	 * If the memory is being made valid without changing any other bits
-	 * then a TLBI isn't required as a non-valid entry cannot be cached in
-	 * the TLB.
+	 * If the memory is being switched from present-invalid to valid without
+	 * changing any other bits then a TLBI isn't required as a non-valid
+	 * entry cannot be cached in the TLB.
 	 */
-	if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
+	if (pgprot_val(set_mask) != PTE_PRESENT_VALID_KERNEL ||
+	    pgprot_val(clear_mask) != PTE_PRESENT_INVALID)
 		flush_tlb_kernel_range(start, start + size);
 	return ret;
 }
@@ -237,18 +243,18 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 {
 	if (enable)
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					__pgprot(PTE_VALID),
-					__pgprot(0));
+					__pgprot(PTE_PRESENT_VALID_KERNEL),
+					__pgprot(PTE_PRESENT_INVALID));
 	else
 		return __change_memory_common(addr, PAGE_SIZE * numpages,
-					__pgprot(0),
-					__pgprot(PTE_VALID));
+					__pgprot(PTE_PRESENT_INVALID),
+					__pgprot(PTE_PRESENT_VALID_KERNEL));
 }
 
 int set_direct_map_invalid_noflush(struct page *page)
 {
-	pgprot_t clear_mask = __pgprot(PTE_VALID);
-	pgprot_t set_mask = __pgprot(0);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_VALID_KERNEL);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_INVALID);
 
 	if (!can_set_direct_map())
 		return 0;
@@ -259,8 +265,8 @@ int set_direct_map_invalid_noflush(struct page *page)
 
 int set_direct_map_default_noflush(struct page *page)
 {
-	pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
-	pgprot_t clear_mask = __pgprot(PTE_RDONLY);
+	pgprot_t set_mask = __pgprot(PTE_PRESENT_VALID_KERNEL | PTE_WRITE);
+	pgprot_t clear_mask = __pgprot(PTE_PRESENT_INVALID | PTE_RDONLY);
 
 	if (!can_set_direct_map())
 		return 0;
@@ -296,8 +302,8 @@ static int __set_memory_enc_dec(unsigned long addr,
 	 * entries or Synchronous External Aborts caused by RIPAS_EMPTY
 	 */
 	ret = __change_memory_common(addr, PAGE_SIZE * numpages,
-				     __pgprot(set_prot),
-				     __pgprot(clear_prot | PTE_VALID));
+				     __pgprot(set_prot | PTE_PRESENT_INVALID),
+				     __pgprot(clear_prot | PTE_PRESENT_VALID_KERNEL));
 
 	if (ret)
 		return ret;
@@ -311,8 +317,8 @@ static int __set_memory_enc_dec(unsigned long addr,
 		return ret;
 
 	return __change_memory_common(addr, PAGE_SIZE * numpages,
-				      __pgprot(PTE_VALID),
-				      __pgprot(0));
+				      __pgprot(PTE_PRESENT_VALID_KERNEL),
+				      __pgprot(PTE_PRESENT_INVALID));
 }
 
 static int realm_set_memory_encrypted(unsigned long addr, int numpages)
@@ -404,15 +410,15 @@ bool kernel_page_present(struct page *page)
 	pud = READ_ONCE(*pudp);
 	if (pud_none(pud))
 		return false;
-	if (pud_sect(pud))
-		return true;
+	if (pud_leaf(pud))
+		return pud_valid(pud);
 
 	pmdp = pmd_offset(pudp, addr);
 	pmd = READ_ONCE(*pmdp);
 	if (pmd_none(pmd))
 		return false;
-	if (pmd_sect(pmd))
-		return true;
+	if (pmd_leaf(pmd))
+		return pmd_valid(pmd);
 
 	ptep = pte_offset_kernel(pmdp, addr);
 	return pte_valid(__ptep_get(ptep));
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 18543b603c77b..cca9706a875c3 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -31,36 +31,6 @@ static void *trans_alloc(struct trans_pgd_info *info)
 	return info->trans_alloc_page(info->trans_alloc_arg);
 }
 
-static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
-{
-	pte_t pte = __ptep_get(src_ptep);
-
-	if (pte_valid(pte)) {
-		/*
-		 * Resume will overwrite areas that may be marked
-		 * read only (code, rodata). Clear the RDONLY bit from
-		 * the temporary mappings we use during restore.
-		 */
-		__set_pte(dst_ptep, pte_mkwrite_novma(pte));
-	} else if (!pte_none(pte)) {
-		/*
-		 * debug_pagealloc will removed the PTE_VALID bit if
-		 * the page isn't in use by the resume kernel. It may have
-		 * been in use by the original kernel, in which case we need
-		 * to put it back in our copy to do the restore.
-		 *
-		 * Other cases include kfence / vmalloc / memfd_secret which
-		 * may call `set_direct_map_invalid_noflush()`.
-		 *
-		 * Before marking this entry valid, check the pfn should
-		 * be mapped.
-		 */
-		BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-		__set_pte(dst_ptep, pte_mkvalid(pte_mkwrite_novma(pte)));
-	}
-}
-
 static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 		    pmd_t *src_pmdp, unsigned long start, unsigned long end)
 {
@@ -76,7 +46,11 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t *dst_pmdp,
 
 	src_ptep = pte_offset_kernel(src_pmdp, start);
 	do {
-		_copy_pte(dst_ptep, src_ptep, addr);
+		pte_t pte = __ptep_get(src_ptep);
+
+		if (pte_none(pte))
+			continue;
+		__set_pte(dst_ptep, pte_mkvalid_k(pte_mkwrite_novma(pte)));
 	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
 
 	return 0;
@@ -109,8 +83,7 @@ static int copy_pmd(struct trans_pgd_info *info, pud_t *dst_pudp,
 			if (copy_pte(info, dst_pmdp, src_pmdp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pmd(dst_pmdp,
-				__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
+			set_pmd(dst_pmdp, pmd_mkvalid_k(pmd_mkwrite_novma(pmd)));
 		}
 	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
 
@@ -145,8 +118,7 @@ static int copy_pud(struct trans_pgd_info *info, p4d_t *dst_p4dp,
 			if (copy_pmd(info, dst_pudp, src_pudp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pud(dst_pudp,
-				__pud(pud_val(pud) & ~PUD_SECT_RDONLY));
+			set_pud(dst_pudp, pud_mkvalid_k(pud_mkwrite_novma(pud)));
 		}
 	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
 
-- 
2.43.0




* [PATCH v2 3/3] arm64: mm: Remove pmd_sect() and pud_sect()
  2026-03-30 16:17 [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Ryan Roberts
  2026-03-30 16:17 ` [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests Ryan Roberts
  2026-03-30 16:17 ` [PATCH v2 2/3] arm64: mm: Handle invalid large leaf mappings correctly Ryan Roberts
@ 2026-03-30 16:17 ` Ryan Roberts
  2026-04-02 21:11 ` [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Catalin Marinas
  3 siblings, 0 replies; 8+ messages in thread
From: Ryan Roberts @ 2026-03-30 16:17 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, David Hildenbrand (Arm), Dev Jain,
	Yang Shi, Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky
  Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

The semantics of pXd_leaf() are very similar to those of pXd_sect(). The only
difference is that pXd_sect() only considers it a section if PTE_VALID
is set, whereas pXd_leaf() permits both "valid" and "present-invalid"
types.

Using pXd_sect() has caused issues now that large leaf entries can be
present-invalid since commit a166563e7ec37 ("arm64: mm: support large
block mapping when rodata=full"), so let's just remove the API and
standardize on pXd_leaf().

There are a few callsites of the form pXd_leaf(READ_ONCE(*pXdp)). This
was previously fine for the pXd_sect() macro because it only evaluated
its argument once. But pXd_leaf() evaluates its argument multiple times.
So let's avoid unintended side effects by reimplementing pXd_leaf() as
an inline function.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 19 ++++++++++++-------
 arch/arm64/mm/mmu.c              | 18 +++++++++---------
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index dd062179b9b66..5bc42b85acfc0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -784,9 +784,13 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 
 #define pmd_table(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
 				 PMD_TYPE_TABLE)
-#define pmd_sect(pmd)		((pmd_val(pmd) & PMD_TYPE_MASK) == \
-				 PMD_TYPE_SECT)
-#define pmd_leaf(pmd)		(pmd_present(pmd) && !pmd_table(pmd))
+
+#define pmd_leaf pmd_leaf
+static inline bool pmd_leaf(pmd_t pmd)
+{
+	return pmd_present(pmd) && !pmd_table(pmd);
+}
+
 #define pmd_bad(pmd)		(!pmd_table(pmd))
 
 #define pmd_leaf_size(pmd)	(pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
@@ -804,11 +808,8 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
-static inline bool pud_sect(pud_t pud) { return false; }
 static inline bool pud_table(pud_t pud) { return true; }
 #else
-#define pud_sect(pud)		((pud_val(pud) & PUD_TYPE_MASK) == \
-				 PUD_TYPE_SECT)
 #define pud_table(pud)		((pud_val(pud) & PUD_TYPE_MASK) == \
 				 PUD_TYPE_TABLE)
 #endif
@@ -878,7 +879,11 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 				 PUD_TYPE_TABLE)
 #define pud_present(pud)	pte_present(pud_pte(pud))
 #ifndef __PAGETABLE_PMD_FOLDED
-#define pud_leaf(pud)		(pud_present(pud) && !pud_table(pud))
+#define pud_leaf pud_leaf
+static inline bool pud_leaf(pud_t pud)
+{
+	return pud_present(pud) && !pud_table(pud);
+}
 #else
 #define pud_leaf(pud)		false
 #endif
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1575680675d8d..dcee56bb622ad 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -204,7 +204,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	pmd_t pmd = READ_ONCE(*pmdp);
 	pte_t *ptep;
 
-	BUG_ON(pmd_sect(pmd));
+	BUG_ON(pmd_leaf(pmd));
 	if (pmd_none(pmd)) {
 		pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
 		phys_addr_t pte_phys;
@@ -303,7 +303,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 	/*
 	 * Check for initial section mappings in the pgd/pud.
 	 */
-	BUG_ON(pud_sect(pud));
+	BUG_ON(pud_leaf(pud));
 	if (pud_none(pud)) {
 		pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
 		phys_addr_t pmd_phys;
@@ -1503,7 +1503,7 @@ static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pmd_present(pmd));
-		if (pmd_sect(pmd)) {
+		if (pmd_leaf(pmd)) {
 			pmd_clear(pmdp);
 
 			/*
@@ -1536,7 +1536,7 @@ static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
 			continue;
 
 		WARN_ON(!pud_present(pud));
-		if (pud_sect(pud)) {
+		if (pud_leaf(pud)) {
 			pud_clear(pudp);
 
 			/*
@@ -1650,7 +1650,7 @@ static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
 		if (pmd_none(pmd))
 			continue;
 
-		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+		WARN_ON(!pmd_present(pmd) || !pmd_table(pmd));
 		free_empty_pte_table(pmdp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 
@@ -1690,7 +1690,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
 		if (pud_none(pud))
 			continue;
 
-		WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+		WARN_ON(!pud_present(pud) || !pud_table(pud));
 		free_empty_pmd_table(pudp, addr, next, floor, ceiling);
 	} while (addr = next, addr < end);
 
@@ -1786,7 +1786,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 {
 	vmemmap_verify((pte_t *)pmdp, node, addr, next);
 
-	return pmd_sect(READ_ONCE(*pmdp));
+	return pmd_leaf(READ_ONCE(*pmdp));
 }
 
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -1850,7 +1850,7 @@ void p4d_clear_huge(p4d_t *p4dp)
 
 int pud_clear_huge(pud_t *pudp)
 {
-	if (!pud_sect(READ_ONCE(*pudp)))
+	if (!pud_leaf(READ_ONCE(*pudp)))
 		return 0;
 	pud_clear(pudp);
 	return 1;
@@ -1858,7 +1858,7 @@ int pud_clear_huge(pud_t *pudp)
 
 int pmd_clear_huge(pmd_t *pmdp)
 {
-	if (!pmd_sect(READ_ONCE(*pmdp)))
+	if (!pmd_leaf(READ_ONCE(*pmdp)))
 		return 0;
 	pmd_clear(pmdp);
 	return 1;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
  2026-03-30 16:17 ` [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests Ryan Roberts
@ 2026-03-31 14:35   ` Suzuki K Poulose
  2026-04-02 20:43   ` Catalin Marinas
  1 sibling, 0 replies; 8+ messages in thread
From: Suzuki K Poulose @ 2026-03-31 14:35 UTC (permalink / raw)
  To: Ryan Roberts, Catalin Marinas, Will Deacon,
	David Hildenbrand (Arm), Dev Jain, Yang Shi, Jinjiang Tu,
	Kevin Brodsky
  Cc: linux-arm-kernel, linux-kernel, stable

On 30/03/2026 17:17, Ryan Roberts wrote:
> Commit a166563e7ec37 ("arm64: mm: support large block mapping when
> rodata=full") enabled the linear map to be mapped by block/cont while
> still allowing granular permission changes on BBML2_NOABORT systems by
> lazily splitting the live mappings. This mechanism was intended to be
> usable by realm guests since they need to dynamically share dma buffers
> with the host by "decrypting" them - which for Arm CCA, means marking
> them as shared in the page tables.
> 
> However, it turns out that the mechanism was failing for realm guests
> because realms need to share their dma buffers (via
> __set_memory_enc_dec()) much earlier during boot than
> split_kernel_leaf_mapping() was able to handle. The report linked below
> showed that GIC's ITS was one such user. But during the investigation I
> found other callsites that could not meet the
> split_kernel_leaf_mapping() constraints.
> 
> The problem is that we block map the linear map based on the boot CPU
> supporting BBML2_NOABORT, then check that all the other CPUs support it
> too when finalizing the caps. If they don't, then we stop_machine() and
> split to ptes. For safety, split_kernel_leaf_mapping() previously
> wouldn't permit splitting until after the caps were finalized. That
> ensured that if any secondary cpus were running that didn't support
> BBML2_NOABORT, we wouldn't risk breaking them.
> 
> I've fixed this problem by reducing the black-out window where we refuse
> to split; there are now 2 windows. The first is from T0 until the page
> allocator is initialized; splitting allocates memory from the page
> allocator, so it must be available. The second covers the period between
> starting to online the secondary cpus until the system caps are
> finalized (this is a very small window).
> 
> All of the problematic callers are calling __set_memory_enc_dec() before
> the secondary cpus come online, so this solves the problem. However, one
> of these callers, swiotlb_update_mem_attributes(), was trying to split
> before the page allocator was initialized. So I have moved this call
> from arch_mm_preinit() to mem_init(), which solves the ordering issue.
> 
> I've added warnings, and an error is returned if any attempt is made to
> split in the black-out windows.
> 
> Note there are other issues which prevent booting all the way to user
> space, which will be fixed in subsequent patches.
> 
> Reported-by: Jinjiang Tu <tujinjiang@huawei.com>
> Closes: https://lore.kernel.org/all/0b2a4ae5-fc51-4d77-b177-b2e9db74f11d@huawei.com/
> Fixes: a166563e7ec37 ("arm64: mm: support large block mapping when rodata=full")
> Cc: stable@vger.kernel.org
> Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>

I have tested with hacked cpufeature code to enable BBML2_NOABORT
for FVP MIDRs.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Tested-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Suzuki

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/include/asm/mmu.h |  2 ++
>   arch/arm64/mm/init.c         |  9 +++++++-
>   arch/arm64/mm/mmu.c          | 45 +++++++++++++++++++++++++-----------
>   3 files changed, 42 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 137a173df1ff8..472610433aaea 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -112,5 +112,7 @@ void kpti_install_ng_mappings(void);
>   static inline void kpti_install_ng_mappings(void) {}
>   #endif
>   
> +extern bool page_alloc_available;
> +
>   #endif	/* !__ASSEMBLER__ */
>   #endif
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 96711b8578fd0..b9b248d24fd10 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -350,7 +350,6 @@ void __init arch_mm_preinit(void)
>   	}
>   
>   	swiotlb_init(swiotlb, flags);
> -	swiotlb_update_mem_attributes();
>   
>   	/*
>   	 * Check boundaries twice: Some fundamental inconsistencies can be
> @@ -377,6 +376,14 @@ void __init arch_mm_preinit(void)
>   	}
>   }
>   
> +bool page_alloc_available __ro_after_init;
> +
> +void __init mem_init(void)
> +{
> +	page_alloc_available = true;
> +	swiotlb_update_mem_attributes();
> +}
> +
>   void free_initmem(void)
>   {
>   	void *lm_init_begin = lm_alias(__init_begin);
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index a6a00accf4f93..223947487a223 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -768,30 +768,51 @@ static inline bool force_pte_mapping(void)
>   }
>   
>   static DEFINE_MUTEX(pgtable_split_lock);
> +static bool linear_map_requires_bbml2;
>   
>   int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>   {
>   	int ret;
>   
> -	/*
> -	 * !BBML2_NOABORT systems should not be trying to change permissions on
> -	 * anything that is not pte-mapped in the first place. Just return early
> -	 * and let the permission change code raise a warning if not already
> -	 * pte-mapped.
> -	 */
> -	if (!system_supports_bbml2_noabort())
> -		return 0;
> -
>   	/*
>   	 * If the region is within a pte-mapped area, there is no need to try to
>   	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>   	 * change permissions from atomic context so for those cases (which are
>   	 * always pte-mapped), we must not go any further because taking the
> -	 * mutex below may sleep.
> +	 * mutex below may sleep. Do not call force_pte_mapping() here because
> +	 * it could return a confusing result if called from a secondary cpu
> +	 * prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us
> +	 * what we need.
>   	 */
> -	if (force_pte_mapping() || is_kfence_address((void *)start))
> +	if (!linear_map_requires_bbml2 || is_kfence_address((void *)start))
>   		return 0;
>   
> +	if (!system_supports_bbml2_noabort()) {
> +		/*
> +		 * !BBML2_NOABORT systems should not be trying to change
> +		 * permissions on anything that is not pte-mapped in the first
> +		 * place. Just return early and let the permission change code
> +		 * raise a warning if not already pte-mapped.
> +		 */
> +		if (system_capabilities_finalized())
> +			return 0;
> +
> +		/*
> +		 * Boot-time: split_kernel_leaf_mapping_locked() allocates from
> +		 * page allocator. Can't split until it's available.
> +		 */
> +		if (WARN_ON(!page_alloc_available))
> +			return -EBUSY;
> +
> +		/*
> +		 * Boot-time: Started secondary cpus but don't know if they
> +		 * support BBML2_NOABORT yet. Can't allow splitting in this
> +		 * window in case they don't.
> +		 */
> +		if (WARN_ON(num_online_cpus() > 1))
> +			return -EBUSY;
> +	}
> +
>   	/*
>   	 * Ensure start and end are at least page-aligned since this is the
>   	 * finest granularity we can split to.
> @@ -891,8 +912,6 @@ static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp
>   	return ret;
>   }
>   
> -static bool linear_map_requires_bbml2 __initdata;
> -
>   u32 idmap_kpti_bbml2_flag;
>   
>   static void __init init_idmap_kpti_bbml2_flag(void)




* Re: [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
  2026-03-30 16:17 ` [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests Ryan Roberts
  2026-03-31 14:35   ` Suzuki K Poulose
@ 2026-04-02 20:43   ` Catalin Marinas
  2026-04-03 10:31     ` Catalin Marinas
  1 sibling, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2026-04-02 20:43 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Will Deacon, David Hildenbrand (Arm), Dev Jain, Yang Shi,
	Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky, linux-arm-kernel,
	linux-kernel, stable

On Mon, Mar 30, 2026 at 05:17:02PM +0100, Ryan Roberts wrote:
>  int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>  {
>  	int ret;
>  
> -	/*
> -	 * !BBML2_NOABORT systems should not be trying to change permissions on
> -	 * anything that is not pte-mapped in the first place. Just return early
> -	 * and let the permission change code raise a warning if not already
> -	 * pte-mapped.
> -	 */
> -	if (!system_supports_bbml2_noabort())
> -		return 0;
> -
>  	/*
>  	 * If the region is within a pte-mapped area, there is no need to try to
>  	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
>  	 * change permissions from atomic context so for those cases (which are
>  	 * always pte-mapped), we must not go any further because taking the
> -	 * mutex below may sleep.
> +	 * mutex below may sleep. Do not call force_pte_mapping() here because
> +	 * it could return a confusing result if called from a secondary cpu
> +	 * prior to finalizing caps. Instead, linear_map_requires_bbml2 gives us
> +	 * what we need.
>  	 */
> -	if (force_pte_mapping() || is_kfence_address((void *)start))
> +	if (!linear_map_requires_bbml2 || is_kfence_address((void *)start))
>  		return 0;
>  
> +	if (!system_supports_bbml2_noabort()) {
> +		/*
> +		 * !BBML2_NOABORT systems should not be trying to change
> +		 * permissions on anything that is not pte-mapped in the first
> +		 * place. Just return early and let the permission change code
> +		 * raise a warning if not already pte-mapped.
> +		 */
> +		if (system_capabilities_finalized())
> +			return 0;
> +
> +		/*
> +		 * Boot-time: split_kernel_leaf_mapping_locked() allocates from
> +		 * page allocator. Can't split until it's available.
> +		 */
> +		if (WARN_ON(!page_alloc_available))
> +			return -EBUSY;
> +
> +		/*
> +		 * Boot-time: Started secondary cpus but don't know if they
> +		 * support BBML2_NOABORT yet. Can't allow splitting in this
> +		 * window in case they don't.
> +		 */
> +		if (WARN_ON(num_online_cpus() > 1))
> +			return -EBUSY;
> +	}

I think sashiko is overcautious here
(https://sashiko.dev/#/patchset/20260330161705.3349825-1-ryan.roberts@arm.com)
but it has a somewhat valid point from the perspective of
num_online_cpus() semantics. We can have num_online_cpus() == 1 while
a secondary CPU has just booted with its MMU enabled. I don't
think we can have any asynchronous tasks running at that point to
trigger a split though. Even async_init() is called after smp_init().

An option may be to attempt cpus_read_trylock() as this lock is taken by
_cpu_up(). If it fails, return -EBUSY, otherwise check num_online_cpus()
and unlock (and return -EBUSY if secondaries already started).

Another thing I couldn't get my head around - IIUC is_realm_world()
won't return true for map_mem() yet (if in a realm). Can we have realms
on hardware that does not support BBML2_NOABORT? We may not have
configuration with rodata_full set (it should be complementary to realm
support).

I'll add the patches to for-next/core to give them a bit of time in
-next but let's see next week if we ignore this (with an updated
comment) or we try to avoid the issue altogether.

-- 
Catalin



* Re: [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT
  2026-03-30 16:17 [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Ryan Roberts
                   ` (2 preceding siblings ...)
  2026-03-30 16:17 ` [PATCH v2 3/3] arm64: mm: Remove pmd_sect() and pud_sect() Ryan Roberts
@ 2026-04-02 21:11 ` Catalin Marinas
  3 siblings, 0 replies; 8+ messages in thread
From: Catalin Marinas @ 2026-04-02 21:11 UTC (permalink / raw)
  To: Will Deacon, David Hildenbrand (Arm), Dev Jain, Yang Shi,
	Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky, Ryan Roberts
  Cc: linux-arm-kernel, linux-kernel

On Mon, 30 Mar 2026 17:17:01 +0100, Ryan Roberts wrote:
> This fixes a couple of bugs in the "large block mappings for linear map when we
> have BBML2_NOABORT" feature when used in conjunction with a CCA realm guest.
> While investigating I found and fixed some more general issues too. See commit
> logs for full explanations.
> 
> Applies on top of v7.0-rc4.
> 
> [...]

Applied to arm64 (for-next/bbml2-fixes), thanks! I had some comments on
the first patch, so I may rebase it or add something on top.

[1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
      https://git.kernel.org/arm64/c/f12b435de2f2
[2/3] arm64: mm: Handle invalid large leaf mappings correctly
      https://git.kernel.org/arm64/c/15bfba1ad77f
[3/3] arm64: mm: Remove pmd_sect() and pud_sect()
      https://git.kernel.org/arm64/c/1d37713fa837



* Re: [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests
  2026-04-02 20:43   ` Catalin Marinas
@ 2026-04-03 10:31     ` Catalin Marinas
  0 siblings, 0 replies; 8+ messages in thread
From: Catalin Marinas @ 2026-04-03 10:31 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Will Deacon, David Hildenbrand (Arm), Dev Jain, Yang Shi,
	Suzuki K Poulose, Jinjiang Tu, Kevin Brodsky, linux-arm-kernel,
	linux-kernel, stable

On Thu, Apr 02, 2026 at 09:43:59PM +0100, Catalin Marinas wrote:
> Another thing I couldn't get my head around - IIUC is_realm_world()
> won't return true for map_mem() yet (if in a realm). Can we have realms
> on hardware that does not support BBML2_NOABORT? We may not have
> configuration with rodata_full set (it should be complementary to realm
> support).

With rodata_full==false, can_set_direct_map() returns false initially
but after arm64_rsi_init() it starts returning true if is_realm_world().
The side-effect is that map_mem() goes for block mappings and
linear_map_requires_bbml2 is left set to false. Later on,
linear_map_maybe_split_to_ptes() will skip the splitting.

Unless I'm missing something, the is_realm_world() calls in
force_pte_mapping() and can_set_direct_map() are useless. I'd remove
them and either require BBML2_NOABORT with CCA or get the user to force
rodata_full when running in realms. Or move arm64_rsi_init() even
earlier?

-- 
Catalin



end of thread, other threads:[~2026-04-03 10:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-30 16:17 [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Ryan Roberts
2026-03-30 16:17 ` [PATCH v2 1/3] arm64: mm: Fix rodata=full block mapping support for realm guests Ryan Roberts
2026-03-31 14:35   ` Suzuki K Poulose
2026-04-02 20:43   ` Catalin Marinas
2026-04-03 10:31     ` Catalin Marinas
2026-03-30 16:17 ` [PATCH v2 2/3] arm64: mm: Handle invalid large leaf mappings correctly Ryan Roberts
2026-03-30 16:17 ` [PATCH v2 3/3] arm64: mm: Remove pmd_sect() and pud_sect() Ryan Roberts
2026-04-02 21:11 ` [PATCH v2 0/3] Fix bugs for realm guest plus BBML2_NOABORT Catalin Marinas
