public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition
@ 2025-10-22 22:06 Usama Arif
  2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Usama Arif @ 2025-10-22 22:06 UTC (permalink / raw)
  To: dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Usama Arif, Michael van der Westhuizen, Tobias Fleig

This series addresses critical bugs in the kexec path when transitioning
from a kernel using 5-level page tables to one using 4-level page tables.

The root cause is improper handling of the PGD entry value during the page
level transition. Specifically, the PGD entry value is masked with PAGE_MASK
instead of PTE_PFN_MASK, failing to account for high-order software bits like
_PAGE_BIT_NOPTISHADOW (bit 58).

When bit 58 (_PAGE_BIT_NOPTISHADOW) is set in the source kernel, the target
4-level kernel doesn't recognize it and fails to mask it properly, leading
to kexec failure.
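
To make the masking difference concrete, here is a minimal userspace sketch.
It assumes 4 KiB pages and a 52-bit physical mask without SME; the DEMO_*
constants and the helper functions are made-up names for illustration, not
the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: 4 KiB pages, 52-bit physical mask, no SME. */
#define DEMO_PAGE_SHIFT        12
#define DEMO_PAGE_MASK         (~((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1))
#define DEMO_PHYSICAL_MASK     ((UINT64_C(1) << 52) - 1)
#define DEMO_PTE_PFN_MASK      (DEMO_PAGE_MASK & DEMO_PHYSICAL_MASK)
#define DEMO_PAGE_NOPTISHADOW  (UINT64_C(1) << 58)

/* Build a PGD-style entry from a page-aligned address plus flag bits. */
static uint64_t make_entry(uint64_t phys, uint64_t flags)
{
	/* The address must not overlap any flag position. */
	assert((phys & ~DEMO_PTE_PFN_MASK) == 0);
	return phys | flags;
}

/* Old behaviour: only the low 12 flag bits are cleared. */
static uint64_t table_addr_page_mask(uint64_t entry)
{
	return entry & DEMO_PAGE_MASK;
}

/* Fixed behaviour: the low flag bits and bits 52..63 are cleared. */
static uint64_t table_addr_pfn_mask(uint64_t entry)
{
	return entry & DEMO_PTE_PFN_MASK;
}
```

For an entry built from address 0x12345000 with bit 58 set,
table_addr_page_mask() returns a value that still has bit 58 set, while
table_addr_pfn_mask() returns the bare address.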

This series fixes the issue in three parts:

Patch 1: Fixes the x86 boot compressed code path by replacing direct CR3
dereferencing with read_cr3_pa() and using PTE_PFN_MASK instead
of PAGE_MASK.

Patch 2: Applies the same fix to the EFI stub code path. (Done in a
separate patch as the Fixes tag is different.)

Patch 3: Moves _PAGE_BIT_NOPTISHADOW from bit 58 (_PAGE_BIT_SOFTW5) to
bit 9 (_PAGE_BIT_SOFTW1), which is already properly masked by
older kernels. This provides backward compatibility without
requiring patches 1 and 2 to be applied to all existing kernel versions,
which is not feasible for production systems or live patching.

Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>

The patches are based on aaa9c3550b60d6259d6ea8b1175ade8d1242444e (next-20251022)
 
Usama Arif (3):
  x86/boot: Fix page table access in 5-level to 4-level paging
    transition
  efi/libstub: Fix page table access in 5-level to 4-level paging
    transition
  x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9

 arch/x86/boot/compressed/pgtable_64.c   | 8 +++++---
 arch/x86/include/asm/pgtable_types.h    | 4 ++--
 drivers/firmware/efi/libstub/x86-5lvl.c | 5 ++++-
 3 files changed, 11 insertions(+), 6 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 22:06 [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
@ 2025-10-22 22:06 ` Usama Arif
  2025-10-22 23:16   ` Dave Hansen
                     ` (2 more replies)
  2025-10-22 22:06 ` [PATCH 2/3] efi/libstub: " Usama Arif
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 19+ messages in thread
From: Usama Arif @ 2025-10-22 22:06 UTC (permalink / raw)
  To: dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Usama Arif, Michael van der Westhuizen, Tobias Fleig

When transitioning from 5-level to 4-level paging, the existing code
incorrectly accesses page table entries by directly dereferencing CR3
and applying PAGE_MASK. This approach has several issues:

- __native_read_cr3() returns the raw CR3 register value, which on
  x86_64 includes not just the physical address but also flags. Bits
  above the physical address width of the system (i.e. above
  __PHYSICAL_MASK_SHIFT) are also not masked.
- The PGD entry is masked by PAGE_MASK, which doesn't take into account
  the higher bits such as _PAGE_BIT_NOPTISHADOW.

Replace this with proper accessor functions:
- read_cr3_pa(): Uses CR3_ADDR_MASK properly clearing SME encryption bit
  and extracting only the physical address portion.
- mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting for
  flags above physical address (_PAGE_BIT_NOPTISHADOW in particular).
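
As a rough illustration of what read_cr3_pa() is relied on for here, the
sketch below mimics the CR3 mask in userspace. It assumes 4 KiB pages and a
52-bit physical mask, and it takes the SME encryption mask as a parameter;
demo_cr3_to_pa() and the DEMO_* constants are hypothetical names, not kernel
API.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: 4 KiB pages, 52-bit physical mask. */
#define DEMO_PAGE_MASK           (~UINT64_C(0xfff))
#define DEMO_PHYSICAL_MASK       ((UINT64_C(1) << 52) - 1)
#define DEMO_PHYSICAL_PAGE_MASK  (DEMO_PAGE_MASK & DEMO_PHYSICAL_MASK)

/*
 * Extract the page table address from a raw CR3 value by dropping the
 * low flag bits, anything above bit 51, and the encryption bit (if any).
 */
static uint64_t demo_cr3_to_pa(uint64_t raw_cr3, uint64_t sme_mask)
{
	/* The encryption mask is assumed to be a single bit, or zero. */
	assert((sme_mask & (sme_mask - 1)) == 0);
	return raw_cr3 & DEMO_PHYSICAL_PAGE_MASK & ~sme_mask;
}
```

With an assumed encryption bit at position 47, a raw value of
0x7654000 | 0x005 | (1 << 47) reduces to 0x7654000; with a zero mask the
encryption bit would be left in the result.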

Fixes: e9d0e6330eb8 ("x86/boot/compressed/64: Prepare new top-level page table for trampoline")
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
---
 arch/x86/boot/compressed/pgtable_64.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index bdd26050dff77..a56449938b7ec 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -170,7 +170,8 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 		 */
 		*trampoline_32bit = __native_read_cr3() | _PAGE_TABLE_NOENC;
 	} else {
-		unsigned long src;
+		u64 *new_cr3;
+		pgd_t *pgdp;
 
 		/*
 		 * For 5- to 4-level paging transition, copy page table pointed
@@ -180,8 +181,9 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 		 * We cannot just point to the page table from trampoline as it
 		 * may be above 4G.
 		 */
-		src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
-		memcpy(trampoline_32bit, (void *)src, PAGE_SIZE);
+		pgdp = (pgd_t *)read_cr3_pa();
+		new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
+		memcpy(trampoline_32bit, new_cr3, PAGE_SIZE);
 	}
 
 	toggle_la57(trampoline_32bit);
-- 
2.47.3



* [PATCH 2/3] efi/libstub: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 22:06 [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
  2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
@ 2025-10-22 22:06 ` Usama Arif
  2025-10-23 14:13   ` Ard Biesheuvel
  2025-10-22 22:06 ` [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9 Usama Arif
  2025-10-22 22:25 ` [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
  3 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2025-10-22 22:06 UTC (permalink / raw)
  To: dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Usama Arif, Michael van der Westhuizen, Tobias Fleig

When transitioning from 5-level to 4-level paging, the existing code
incorrectly accesses page table entries by directly dereferencing CR3
and applying PAGE_MASK. This approach has several issues:

- __native_read_cr3() returns the raw CR3 register value, which on
  x86_64 includes not just the physical address but also flags. Bits
  above the physical address width of the system (i.e. above
  __PHYSICAL_MASK_SHIFT) are also not masked.
- The pgd value is masked by PAGE_MASK, which doesn't take into account
  the higher bits such as _PAGE_BIT_NOPTISHADOW.

Replace this with proper accessor functions:
- read_cr3_pa(): Uses CR3_ADDR_MASK properly clearing SME encryption bit
  and extracting only the physical address portion.
- mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting for
  flags above physical address (_PAGE_BIT_NOPTISHADOW in particular).

Fixes: cb1c9e02b0c1 ("x86/efistub: Perform 4/5 level paging switch from the stub")
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
---
 drivers/firmware/efi/libstub/x86-5lvl.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/x86-5lvl.c b/drivers/firmware/efi/libstub/x86-5lvl.c
index f1c5fb45d5f7c..34b72da457487 100644
--- a/drivers/firmware/efi/libstub/x86-5lvl.c
+++ b/drivers/firmware/efi/libstub/x86-5lvl.c
@@ -81,8 +81,11 @@ void efi_5level_switch(void)
 		new_cr3 = memset(pgt, 0, PAGE_SIZE);
 		new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
 	} else {
+		pgd_t *pgdp;
+
+		pgdp = (pgd_t *)read_cr3_pa();
 		/* take the new root table pointer from the current entry #0 */
-		new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
+		new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
 
 		/* copy the new root table if it is not 32-bit addressable */
 		if ((u64)new_cr3 > U32_MAX)
-- 
2.47.3



* [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-22 22:06 [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
  2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
  2025-10-22 22:06 ` [PATCH 2/3] efi/libstub: " Usama Arif
@ 2025-10-22 22:06 ` Usama Arif
  2025-10-22 23:35   ` Dave Hansen
  2025-10-22 22:25 ` [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
  3 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2025-10-22 22:06 UTC (permalink / raw)
  To: dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Usama Arif, Michael van der Westhuizen, Tobias Fleig

Kexec from a kernel with 5-level page tables to one with 4-level page
tables is broken because bits above the physical address width are not
properly masked by the target kernel. This issue was particularly triggered
by _PAGE_BIT_NOPTISHADOW, which uses _PAGE_BIT_SOFTW5 (bit 58).

The ideal fix would be to mask the upper bits properly in all kernels.
However, this is not feasible due to:
- The logistical challenge of patching all older kernels in production
- The patch not being applicable for live patching

Instead, move _PAGE_BIT_NOPTISHADOW to use _PAGE_BIT_SOFTW1 (bit 9),
which is already masked by older kernels using PAGE_MASK. This is safe
as the other users of _PAGE_BIT_SOFTW1 (_PAGE_BIT_SPECIAL and
_PAGE_BIT_CPA_TEST) are only used for leaf entries, while
_PAGE_BIT_NOPTISHADOW is used for PGD and P4D entries only.
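
The reasoning about bit positions can be checked with a small userspace
sketch. It assumes 4 KiB pages; survives_page_mask() and the DEMO_* names
are hypothetical and only model the PAGE_MASK-only masking that older
kernels apply to the entry.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_MASK   (~UINT64_C(0xfff))	/* 4 KiB pages assumed */
#define DEMO_SOFTW1_BIT  9
#define DEMO_SOFTW5_BIT  58

/*
 * Return nonzero if a flag at the given bit position is still set after
 * an entry is masked with PAGE_MASK alone.
 */
static int survives_page_mask(unsigned int bit)
{
	assert(bit < 64);
	return ((UINT64_C(1) << bit) & DEMO_PAGE_MASK) != 0;
}
```

Under these assumptions a flag at bit 9 is stripped by PAGE_MASK, while a
flag at bit 58 is carried over into the value used as the table address.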

Fixes: d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables")
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
---
 arch/x86/include/asm/pgtable_types.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 2ec250ba467e2..616e928d87973 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -29,6 +29,8 @@
 #define _PAGE_BIT_PKEY_BIT3	62	/* Protection Keys, bit 4/4 */
 #define _PAGE_BIT_NX		63	/* No execute: only valid after cpuid check */
 
+/* _PAGE_BIT_SPECIAL and _PAGE_BIT_CPA_TEST only used for leaf entries */
+#define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
@@ -37,11 +39,9 @@
 
 #ifdef CONFIG_X86_64
 #define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
-#define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW5 /* No PTI shadow (root PGD) */
 #else
 /* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */
 #define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW2 /* Saved Dirty bit (leaf) */
-#define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW2 /* No PTI shadow (root PGD) */
 #endif
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
-- 
2.47.3



* Re: [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition
  2025-10-22 22:06 [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
                   ` (2 preceding siblings ...)
  2025-10-22 22:06 ` [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9 Usama Arif
@ 2025-10-22 22:25 ` Usama Arif
  3 siblings, 0 replies; 19+ messages in thread
From: Usama Arif @ 2025-10-22 22:25 UTC (permalink / raw)
  To: dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig



On 22/10/2025 23:06, Usama Arif wrote:
> This series addresses critical bugs in the kexec path when transitioning
> from a kernel using 5-level page tables to one using 4-level page tables.
> 
> The root cause is improper handling of PGD entry value during the page level
> transition. Specifically p4d value is masked with PAGE_MASK instead of
> PTE_PFN_MASK, failing to account for high-order software bits like
> _PAGE_BIT_NOPTISHADOW (bit 58).

s/p4d value/PGD entry value/ for consistency.


* Re: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
@ 2025-10-22 23:16   ` Dave Hansen
  2025-10-22 23:49     ` Usama Arif
  2025-10-25 21:50     ` H. Peter Anvin
  2025-10-23 17:43   ` kernel test robot
  2025-10-24  8:07   ` kernel test robot
  2 siblings, 2 replies; 19+ messages in thread
From: Dave Hansen @ 2025-10-22 23:16 UTC (permalink / raw)
  To: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig

On 10/22/25 15:06, Usama Arif wrote:
> +		pgdp = (pgd_t *)read_cr3_pa();
> +		new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
> +		memcpy(trampoline_32bit, new_cr3, PAGE_SIZE);

Heh, somebody likes casting, I see!

But seriously, read_cr3_pa() should be returning a physical address. No?
Today it does:

static inline unsigned long read_cr3_pa(void)
{
        return __read_cr3() & CR3_ADDR_MASK;
}

So shouldn't CR3_ADDR_MASK be masking out any naughty non-address bits?
Shouldn't we fix read_cr3_pa() and not do this in its caller?


* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-22 22:06 ` [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9 Usama Arif
@ 2025-10-22 23:35   ` Dave Hansen
  2025-10-22 23:58     ` Usama Arif
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2025-10-22 23:35 UTC (permalink / raw)
  To: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig

On 10/22/25 15:06, Usama Arif wrote:
> Instead, move _PAGE_BIT_NOPTISHADOW to use _PAGE_BIT_SOFTW1 (bit 9),

Wait a sec, though...

This isn't necessary once the previous 2 patches are applied, right?


* Re: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 23:16   ` Dave Hansen
@ 2025-10-22 23:49     ` Usama Arif
  2025-10-25 21:50     ` H. Peter Anvin
  1 sibling, 0 replies; 19+ messages in thread
From: Usama Arif @ 2025-10-22 23:49 UTC (permalink / raw)
  To: Dave Hansen, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig



On 23/10/2025 00:16, Dave Hansen wrote:
> On 10/22/25 15:06, Usama Arif wrote:
>> +		pgdp = (pgd_t *)read_cr3_pa();
>> +		new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
>> +		memcpy(trampoline_32bit, new_cr3, PAGE_SIZE);
> 
> Heh, somebody likes casting, I see!

haha yeah, it's a lot here.
> 
> But seriously, read_cr3_pa() should be returning a physical address. No?
> Today it does:
> 
> static inline unsigned long read_cr3_pa(void)
> {
>         return __read_cr3() & CR3_ADDR_MASK;
> }
> 
> So shouldn't CR3_ADDR_MASK be masking out any naughty non-address bits?
> Shouldn't we fix read_cr3_pa() and not do this in its caller?

So we need to mask 2 things here:
- cr3, which is done by read_cr3_pa() using CR3_ADDR_MASK (__sme_clr(PHYSICAL_PAGE_MASK)),
  as you pointed out.
- pgdp[0] (the dereferenced value), i.e. the p4d table pointer (this was previously
  *(unsigned long *)__native_read_cr3()). This needs to be masked by PTE_PFN_MASK
  and not PAGE_MASK, as was done previously, in order to take care of _PAGE_BIT_NOPTISHADOW.
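
The two masking steps can be sketched in userspace with a toy "physical
memory". Everything here is an assumption made for illustration: 4 KiB
pages, a 52-bit physical mask, no SME, and the demo_* helpers, which are
not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ENTRIES      512
#define DEMO_NR_PAGES     4
#define DEMO_PAGE_MASK    (~UINT64_C(0xfff))
#define DEMO_ADDR_MASK    (DEMO_PAGE_MASK & ((UINT64_C(1) << 52) - 1))
#define DEMO_TABLE_FLAGS  UINT64_C(0x067)
#define DEMO_NOPTISHADOW  (UINT64_C(1) << 58)

/* Toy "physical memory": page N lives at physical address N << 12. */
static uint64_t demo_mem[DEMO_NR_PAGES][DEMO_ENTRIES];

/* Map a toy physical address to its table, or NULL if it is not valid. */
static uint64_t *demo_phys_to_table(uint64_t phys)
{
	if ((phys & ~DEMO_PAGE_MASK) || (phys >> 12) >= DEMO_NR_PAGES)
		return NULL;
	return demo_mem[phys >> 12];
}

/* Populate a PGD (page 1) whose entry 0 points at a P4D (page 2). */
static uint64_t demo_setup(void)
{
	memset(demo_mem, 0, sizeof(demo_mem));
	demo_mem[2][0] = UINT64_C(0xabc);	/* marker inside the P4D */
	demo_mem[1][0] = (UINT64_C(2) << 12) | DEMO_TABLE_FLAGS |
			 DEMO_NOPTISHADOW;
	return (UINT64_C(1) << 12) | UINT64_C(0x005);	/* raw CR3 with low bits set */
}

/*
 * Step 1: mask the raw CR3 value to find the PGD.
 * Step 2: mask PGD entry 0 with entry_mask to find the 4-level root.
 */
static uint64_t *demo_find_4level_root(uint64_t raw_cr3, uint64_t entry_mask)
{
	uint64_t *pgd = demo_phys_to_table(raw_cr3 & DEMO_ADDR_MASK);

	assert(pgd != NULL);
	return demo_phys_to_table(pgd[0] & entry_mask);
}
```

Passing DEMO_ADDR_MASK as entry_mask finds the P4D page; passing
DEMO_PAGE_MASK leaves bit 58 in the address, so the lookup returns NULL in
this toy model.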





* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-22 23:35   ` Dave Hansen
@ 2025-10-22 23:58     ` Usama Arif
  2025-10-23 14:05       ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Usama Arif @ 2025-10-22 23:58 UTC (permalink / raw)
  To: Dave Hansen, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig



On 23/10/2025 00:35, Dave Hansen wrote:
> On 10/22/25 15:06, Usama Arif wrote:
>> Instead, move _PAGE_BIT_NOPTISHADOW to use _PAGE_BIT_SOFTW1 (bit 9),
> 
> Wait a sec, though...
> 
> This isn't necessary once the previous 2 patches are applied, right?

In kexec if the target kernels have patch 1 and 2, then this patch
is not needed. Unfortunately, patches 1 and 2 are not livepatchable.
Also backporting patches 1 and 2 to all previous kernels running in
production in a large fleet is not very scalable.

So if we want to run a kernel with 5-level page tables in production
(with the ability to kexec into a 4-level kernel that doesn't have the first
2 patches), then this patch would solve the problem, i.e. patches 1 and 2
solve the problem from the target kernel's perspective, while patch 3 solves
it from the source kernel (if the target kernel doesn't have patches 1
and 2 applied).
I mentioned this in the commit message as:

"
- The logistical challenge of patching all older kernels in production
- The patch not being applicable for live patching
"

I can try and make the commit message clearer in the next revision.
 


* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-22 23:58     ` Usama Arif
@ 2025-10-23 14:05       ` Dave Hansen
  2025-10-23 14:24         ` Kiryl Shutsemau
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2025-10-23 14:05 UTC (permalink / raw)
  To: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig

On 10/22/25 16:58, Usama Arif wrote:
>> This isn't necessary once the previous 2 patches are applied, right?
> In kexec if the target kernels have patch 1 and 2, then this patch
> is not needed. Unfortunately, patches 1 and 2 are not livepatchable.
> Also backporting patches 1 and 2 to all previous kernels running in
> production in a large fleet is not very scalable.

I don't think I've ever been asked to apply a patch to make livepatching
easier. I'm not sure that's something we want to pollute mainline with.



* Re: [PATCH 2/3] efi/libstub: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 22:06 ` [PATCH 2/3] efi/libstub: " Usama Arif
@ 2025-10-23 14:13   ` Ard Biesheuvel
  2025-10-23 14:28     ` Kiryl Shutsemau
  0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2025-10-23 14:13 UTC (permalink / raw)
  To: Usama Arif
  Cc: dwmw, tglx, mingo, bp, dave.hansen, hpa, x86, apopple, thuth,
	nik.borisov, kas, linux-kernel, linux-efi, kernel-team,
	Michael van der Westhuizen, Tobias Fleig

On Thu, 23 Oct 2025 at 00:08, Usama Arif <usamaarif642@gmail.com> wrote:
>
> When transitioning from 5-level to 4-level paging, the existing code
> incorrectly accesses page table entries by directly dereferencing CR3
> and applying PAGE_MASK. This approach has several issues:
>
> - __native_read_cr3() returns the raw CR3 register value, which on
>   x86_64 includes not just the physical address but also flags. Bits
>   above the physical address width of the system (i.e. above
>   __PHYSICAL_MASK_SHIFT) are also not masked.
> - The pgd value is masked by PAGE_MASK, which doesn't take into account
>   the higher bits such as _PAGE_BIT_NOPTISHADOW.
>
> Replace this with proper accessor functions:
> - read_cr3_pa(): Uses CR3_ADDR_MASK properly clearing SME encryption bit
>   and extracting only the physical address portion.
> - mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting for
>   flags above physical address (_PAGE_BIT_NOPTISHADOW in particular).
>
> Fixes: cb1c9e02b0c1 ("x86/efistub: Perform 4/5 level paging switch from the stub")
> Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> Reported-by: Michael van der Westhuizen <rmikey@meta.com>
> Reported-by: Tobias Fleig <tfleig@meta.com>
> ---
>  drivers/firmware/efi/libstub/x86-5lvl.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/firmware/efi/libstub/x86-5lvl.c b/drivers/firmware/efi/libstub/x86-5lvl.c
> index f1c5fb45d5f7c..34b72da457487 100644
> --- a/drivers/firmware/efi/libstub/x86-5lvl.c
> +++ b/drivers/firmware/efi/libstub/x86-5lvl.c
> @@ -81,8 +81,11 @@ void efi_5level_switch(void)
>                 new_cr3 = memset(pgt, 0, PAGE_SIZE);
>                 new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
>         } else {
> +               pgd_t *pgdp;
> +
> +               pgdp = (pgd_t *)read_cr3_pa();

Shouldn't this be using native_read_cr3_pa()? And is there any reason
to re-read CR3 here, rather than update the code that populates the
cr3 variable? The preceding other branch of the if() should probably
use the same sanitised value of CR3, no?


>                 /* take the new root table pointer from the current entry #0 */
> -               new_cr3 = (u64 *)(cr3[0] & PAGE_MASK);
> +               new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
>
>                 /* copy the new root table if it is not 32-bit addressable */
>                 if ((u64)new_cr3 > U32_MAX)
> --
> 2.47.3
>


* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-23 14:05       ` Dave Hansen
@ 2025-10-23 14:24         ` Kiryl Shutsemau
  2025-10-23 15:12           ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23 14:24 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa, x86,
	apopple, thuth, nik.borisov, linux-kernel, linux-efi, kernel-team,
	Michael van der Westhuizen, Tobias Fleig

On Thu, Oct 23, 2025 at 07:05:24AM -0700, Dave Hansen wrote:
> On 10/22/25 16:58, Usama Arif wrote:
> >> This isn't necessary once the previous 2 patches are applied, right?
> > In kexec if the target kernels have patch 1 and 2, then this patch
> > is not needed. Unfortunately, patches 1 and 2 are not livepatchable.
> > Also backporting patches 1 and 2 to all previous kernels running in
> > production in a large fleet is not very scalable.
> 
> I don't think I've ever been asked to apply a patch to make livepatching
> easier. I'm not sure that's something we want to pollute mainline with.

It is not about assisting livepatching.

Machines in our fleet may switch between kernel versions using kexec.

We recently introduced a kernel in the fleet that enables 5-level
paging.

Kexecing into an older kernel requires switching from 5- to 4-level
paging, which is broken because the target kernel doesn't expect
_PAGE_NOPTISHADOW.

The first two patches fix the problem for the target kernel. If we only
apply them upstream, we would need to backport them to all kernels we
use to address the problem.

The last patch allows us to only update the kernel that has 5-level
paging enabled, making it much easier logistically.

The fix seems trivial, and I don't see any downsides.

Ultimately, it helps with interoperability between different kernel
versions and/or configurations.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH 2/3] efi/libstub: Fix page table access in 5-level to 4-level paging transition
  2025-10-23 14:13   ` Ard Biesheuvel
@ 2025-10-23 14:28     ` Kiryl Shutsemau
  0 siblings, 0 replies; 19+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23 14:28 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, hpa, x86, apopple,
	thuth, nik.borisov, linux-kernel, linux-efi, kernel-team,
	Michael van der Westhuizen, Tobias Fleig

On Thu, Oct 23, 2025 at 04:13:26PM +0200, Ard Biesheuvel wrote:
> On Thu, 23 Oct 2025 at 00:08, Usama Arif <usamaarif642@gmail.com> wrote:
> >
> > When transitioning from 5-level to 4-level paging, the existing code
> > incorrectly accesses page table entries by directly dereferencing CR3
> > and applying PAGE_MASK. This approach has several issues:
> >
> > - __native_read_cr3() returns the raw CR3 register value, which on
> >   x86_64 includes not just the physical address but also flags. Bits
> >   above the physical address width of the system (i.e. above
> >   __PHYSICAL_MASK_SHIFT) are also not masked.
> > - The pgd value is masked by PAGE_MASK, which doesn't take into account
> >   the higher bits such as _PAGE_BIT_NOPTISHADOW.
> >
> > Replace this with proper accessor functions:
> > - read_cr3_pa(): Uses CR3_ADDR_MASK properly clearing SME encryption bit
> >   and extracting only the physical address portion.
> > - mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting for
> >   flags above physical address (_PAGE_BIT_NOPTISHADOW in particular).
> >
> > Fixes: cb1c9e02b0c1 ("x86/efistub: Perform 4/5 level paging switch from the stub")
> > Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > Reported-by: Michael van der Westhuizen <rmikey@meta.com>
> > Reported-by: Tobias Fleig <tfleig@meta.com>
> > ---
> >  drivers/firmware/efi/libstub/x86-5lvl.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/firmware/efi/libstub/x86-5lvl.c b/drivers/firmware/efi/libstub/x86-5lvl.c
> > index f1c5fb45d5f7c..34b72da457487 100644
> > --- a/drivers/firmware/efi/libstub/x86-5lvl.c
> > +++ b/drivers/firmware/efi/libstub/x86-5lvl.c
> > @@ -81,8 +81,11 @@ void efi_5level_switch(void)
> >                 new_cr3 = memset(pgt, 0, PAGE_SIZE);
> >                 new_cr3[0] = (u64)cr3 | _PAGE_TABLE_NOENC;
> >         } else {
> > +               pgd_t *pgdp;
> > +
> > +               pgdp = (pgd_t *)read_cr3_pa();
> 
> Shouldn't this be using native_read_cr3_pa()?

Perhaps. But I don't think it makes a difference.

We don't have paravirt in stub/decompressor, do we?

> And is there any reason
> to re-read CR3 here, rather than update the code that populates the
> cr3 variable? The preceding other branch of the if() should probably
> use the same sanitised value of CR3, no?

Good point.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-23 14:24         ` Kiryl Shutsemau
@ 2025-10-23 15:12           ` Dave Hansen
  2025-10-23 15:25             ` Kiryl Shutsemau
  2025-10-23 22:15             ` Usama Arif
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Hansen @ 2025-10-23 15:12 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa, x86,
	apopple, thuth, nik.borisov, linux-kernel, linux-efi, kernel-team,
	Michael van der Westhuizen, Tobias Fleig

On 10/23/25 07:24, Kiryl Shutsemau wrote:
> The last patch allows us to only update the kernel that has 5-level
> paging enabled, making it much easier logistically.
> 
> The fix seems trivial, and I don't see any downsides.

What I'm hearing is: Please change mainline so $COMPANY can do fewer
backports.

Yeah, it's pretty trivial. But I'm worried about the precedent, and I'm
worried that the change doesn't do a thing for mainline. It's pure
churn. Churn has inherent downsides.

I'd urge you to kick this out of the series and focus on the bug fixes
that are unambiguously good for everyone. Let's have a nice big flamewar
in another thread.


* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-23 15:12           ` Dave Hansen
@ 2025-10-23 15:25             ` Kiryl Shutsemau
  2025-10-23 22:15             ` Usama Arif
  1 sibling, 0 replies; 19+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23 15:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa, x86,
	apopple, thuth, nik.borisov, linux-kernel, linux-efi, kernel-team,
	Michael van der Westhuizen, Tobias Fleig

On Thu, Oct 23, 2025 at 08:12:32AM -0700, Dave Hansen wrote:
> On 10/23/25 07:24, Kiryl Shutsemau wrote:
> > The last patch allows us to only update the kernel that has 5-level
> > paging enabled, making it much easier logistically.
> > 
> > The fix seems trivial, and I don't see any downsides.
> 
> What I'm hearing is: Please change mainline so $COMPANY can do fewer
> backports.

Or you can read it as: without the fix 5-level paging deployment is
harder.

One other point is that crashkernels tend to be older and update less
frequently than the main kernel. And one would only discover that
crashdump doesn't work when the crash happens.

> Yeah, it's pretty trivial. But I'm worried about the precedent, and I'm
> worried that the change doesn't do a thing for mainline. It's pure
> churn. Churn has inherent downsides.

You don't consider kexec to older kernels useful for mainline?

> I'd urge you to kick this out of the series and focus on the bug fixes
> that are unambiguously good for everyone. Let's have a nice big flamewar
> in another thread.

Oh, well... Okay.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
  2025-10-22 23:16   ` Dave Hansen
@ 2025-10-23 17:43   ` kernel test robot
  2025-10-24  8:07   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-10-23 17:43 UTC (permalink / raw)
  To: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: llvm, oe-kbuild-all, x86, apopple, thuth, nik.borisov, kas,
	linux-kernel, linux-efi, kernel-team, Usama Arif,
	Michael van der Westhuizen, Tobias Fleig

Hi Usama,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/master efi/next linus/master v6.18-rc2 next-20251023]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/x86-boot-Fix-page-table-access-in-5-level-to-4-level-paging-transition/20251023-061048
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20251022220755.1026144-2-usamaarif642%40gmail.com
patch subject: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20251024/202510240106.1aff6SIM-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251024/202510240106.1aff6SIM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510240106.1aff6SIM-lkp@intel.com/

All errors (new ones prefixed by >>):

>> arch/x86/boot/compressed/pgtable_64.c:185:21: error: call to undeclared function 'pgd_val'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     185 |                 new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
         |                                   ^
   1 error generated.


vim +/pgd_val +185 arch/x86/boot/compressed/pgtable_64.c

   101	
   102	asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
   103	{
   104		void (*toggle_la57)(void *cr3);
   105		bool l5_required = false;
   106	
   107		/* Initialize boot_params. Required for cmdline_find_option_bool(). */
   108		sanitize_boot_params(bp);
   109		boot_params_ptr = bp;
   110	
   111		/*
   112		 * Check if LA57 is desired and supported.
   113		 *
   114		 * There are several parts to the check:
   115		 *   - if user asked to disable 5-level paging: no5lvl in cmdline
   116		 *   - if the machine supports 5-level paging:
   117		 *     + CPUID leaf 7 is supported
   118		 *     + the leaf has the feature bit set
   119		 */
   120		if (!cmdline_find_option_bool("no5lvl") &&
   121		    native_cpuid_eax(0) >= 7 && (native_cpuid_ecx(7) & BIT(16))) {
   122			l5_required = true;
   123	
   124			/* Initialize variables for 5-level paging */
   125			__pgtable_l5_enabled = 1;
   126			pgdir_shift = 48;
   127			ptrs_per_p4d = 512;
   128		}
   129	
   130		/*
   131		 * The trampoline will not be used if the paging mode is already set to
   132		 * the desired one.
   133		 */
   134		if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
   135			return;
   136	
   137		trampoline_32bit = (unsigned long *)find_trampoline_placement();
   138	
   139		/* Preserve trampoline memory */
   140		memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
   141	
   142		/* Clear trampoline memory first */
   143		memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
   144	
   145		/* Copy trampoline code in place */
   146		toggle_la57 = memcpy(trampoline_32bit +
   147				TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
   148				&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
   149	
   150		/*
   151		 * Avoid the need for a stack in the 32-bit trampoline code, by using
   152		 * LJMP rather than LRET to return back to long mode. LJMP takes an
   153		 * immediate absolute address, which needs to be adjusted based on the
   154		 * placement of the trampoline.
   155		 */
   156		*(u32 *)((u8 *)toggle_la57 + trampoline_ljmp_imm_offset) +=
   157							(unsigned long)toggle_la57;
   158	
   159		/*
   160		 * The code below prepares page table in trampoline memory.
   161		 *
   162		 * The new page table will be used by trampoline code for switching
   163		 * from 4- to 5-level paging or vice versa.
   164		 */
   165	
   166		if (l5_required) {
   167			/*
   168			 * For 4- to 5-level paging transition, set up current CR3 as
   169			 * the first and the only entry in a new top-level page table.
   170			 */
   171			*trampoline_32bit = __native_read_cr3() | _PAGE_TABLE_NOENC;
   172		} else {
   173			u64 *new_cr3;
   174			pgd_t *pgdp;
   175	
   176			/*
   177			 * For 5- to 4-level paging transition, copy page table pointed
   178			 * by first entry in the current top-level page table as our
   179			 * new top-level page table.
   180			 *
   181			 * We cannot just point to the page table from trampoline as it
   182			 * may be above 4G.
   183			 */
   184			pgdp = (pgd_t *)read_cr3_pa();
 > 185			new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
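
The error above means that pgd_val() has no visible declaration in this
decompressor build. The pgd_t type, read_cr3_pa() and PTE_PFN_MASK are not
flagged, so they appear to be visible. One possible way around it, sketched
below as a user-space mock-up and not necessarily the fix that was adopted,
is to read the entry's raw 64-bit value directly. The pgd_t layout mirrors
the kernel's single-member struct, the PTE_PFN_MASK value assumes 4 KiB pages
and a 52-bit physical mask, and entry_to_table_pa() is a hypothetical helper
name used only for illustration.

```c
#include <stdint.h>

/* Mirrors the kernel's single-member wrapper type for top-level entries. */
typedef struct { uint64_t pgd; } pgd_t;

/* Assumption: 4 KiB pages and a 52-bit physical address mask. */
#define PTE_PFN_MASK 0x000ffffffffff000ULL

/*
 * Hypothetical helper: extract the physical address of the next-level
 * table from the first top-level entry without calling pgd_val().
 */
static inline uint64_t entry_to_table_pa(const pgd_t *pgdp)
{
	/* Natively, pgd_val() just returns this struct member. */
	return pgdp[0].pgd & PTE_PFN_MASK;
}
```

In the real function the result would still have to be cast to a pointer
before the memcpy() into the trampoline page, as in the original line 185.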

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9
  2025-10-23 15:12           ` Dave Hansen
  2025-10-23 15:25             ` Kiryl Shutsemau
@ 2025-10-23 22:15             ` Usama Arif
  1 sibling, 0 replies; 19+ messages in thread
From: Usama Arif @ 2025-10-23 22:15 UTC (permalink / raw)
  To: Dave Hansen, Kiryl Shutsemau
  Cc: dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa, x86, apopple,
	thuth, nik.borisov, linux-kernel, linux-efi, kernel-team,
	Michael van der Westhuizen, Tobias Fleig, Breno Leitao



On 23/10/2025 16:12, Dave Hansen wrote:
> On 10/23/25 07:24, Kiryl Shutsemau wrote:
>> The last patch allows us to only update the kernel that has 5-level
>> paging enabled, making it much easier logistically.
>>
>> The fix seems trivial, and I don't see any downsides.
> 
> What I'm hearing is: Please change mainline so $COMPANY can do fewer
> backports.
> 

Not at all! Very happy to do the backports (will probably end up doing them anyway).
They apply very cleanly and are easy to do.

The issue is trying to deploy a kernel with 5-level page tables. This problem would be
encountered by anyone who has a medium to large number of machines to manage.
Kiryl made a good point about crash kernels, but medium to large fleets are also very
dynamic. Old kernels remain for some time for a variety of reasons. And once you have
to kexec into an older kernel that doesn't have patches 1 and 2, it just doesn't work.

The only reason I mentioned live patching is that it is the only way I know of to
fix a problem like this without patch 3. But even if the fixes were live-patchable,
not everyone uses live patching.

It would be nice to have patch 3 upstream, as I would imagine it would make
life easier for a lot of people when they upgrade their kernel past 6.15 (when the defconfig
option to switch to 4-level was removed). We know of the problem, so we can mitigate it,
but I would imagine a lot of people won't. The bug was found when we tried upgrading
to 6.16, and kexec broke when downgrading. It took quite a while to find the bug
because prints don't work in this part of the code, so I think this patch might save others
the trouble of going through the whole debugging process.

If there is a strong preference to drop patch 3, I will remove it in the next revision.
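
To make the failure mode concrete, here is a minimal user-space sketch (not
kernel code) of why an older 4-level kernel that masks the entry with
PAGE_MASK trips over a flag at bit 58 but not over one at bit 9. The constants
assume 4 KiB pages and a 52-bit physical mask. old_kernel_mask() and
fixed_kernel_mask() are hypothetical names standing in for the behaviour
without and with patches 1 and 2.

```c
#include <stdint.h>

#define PAGE_SHIFT   12
/* Clears only the low 12 bits; high software bits survive. */
#define PAGE_MASK    (~((1ULL << PAGE_SHIFT) - 1))
/* Keeps only bits 12-51, i.e. the physical address of the next table. */
#define PTE_PFN_MASK (PAGE_MASK & ((1ULL << 52) - 1))

#define SOFTW1_BIT 9   /* where patch 3 proposes to put _PAGE_BIT_NOPTISHADOW */
#define SOFTW5_BIT 58  /* where _PAGE_BIT_NOPTISHADOW currently lives */

/* Masking as done by a kernel without patches 1 and 2. */
static inline uint64_t old_kernel_mask(uint64_t entry)
{
	return entry & PAGE_MASK;
}

/* Masking as done once PTE_PFN_MASK is used instead. */
static inline uint64_t fixed_kernel_mask(uint64_t entry)
{
	return entry & PTE_PFN_MASK;
}
```

For a table at physical address 0x1234000, old_kernel_mask() applied to the
entry with bit 58 set still carries bit 58, so the result is not a usable
address. With the flag at bit 9 instead, both functions return 0x1234000.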


* Re: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
  2025-10-22 23:16   ` Dave Hansen
  2025-10-23 17:43   ` kernel test robot
@ 2025-10-24  8:07   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-10-24  8:07 UTC (permalink / raw)
  To: Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb, hpa
  Cc: oe-kbuild-all, x86, apopple, thuth, nik.borisov, kas,
	linux-kernel, linux-efi, kernel-team, Usama Arif,
	Michael van der Westhuizen, Tobias Fleig

Hi Usama,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/master efi/next linus/master v6.18-rc2 next-20251024]
[cannot apply to tip/auto-latest]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/x86-boot-Fix-page-table-access-in-5-level-to-4-level-paging-transition/20251023-061048
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20251022220755.1026144-2-usamaarif642%40gmail.com
patch subject: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
config: x86_64-buildonly-randconfig-004-20251024 (https://download.01.org/0day-ci/archive/20251024/202510241522.uU9W0Xbv-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251024/202510241522.uU9W0Xbv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510241522.uU9W0Xbv-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/boot/compressed/pgtable_64.c: In function 'configure_5level_paging':
>> arch/x86/boot/compressed/pgtable_64.c:185:35: error: implicit declaration of function 'pgd_val' [-Wimplicit-function-declaration]
     185 |                 new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
         |                                   ^~~~~~~


vim +/pgd_val +185 arch/x86/boot/compressed/pgtable_64.c

   101	
   102	asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
   103	{
   104		void (*toggle_la57)(void *cr3);
   105		bool l5_required = false;
   106	
   107		/* Initialize boot_params. Required for cmdline_find_option_bool(). */
   108		sanitize_boot_params(bp);
   109		boot_params_ptr = bp;
   110	
   111		/*
   112		 * Check if LA57 is desired and supported.
   113		 *
   114		 * There are several parts to the check:
   115		 *   - if user asked to disable 5-level paging: no5lvl in cmdline
   116		 *   - if the machine supports 5-level paging:
   117		 *     + CPUID leaf 7 is supported
   118		 *     + the leaf has the feature bit set
   119		 */
   120		if (!cmdline_find_option_bool("no5lvl") &&
   121		    native_cpuid_eax(0) >= 7 && (native_cpuid_ecx(7) & BIT(16))) {
   122			l5_required = true;
   123	
   124			/* Initialize variables for 5-level paging */
   125			__pgtable_l5_enabled = 1;
   126			pgdir_shift = 48;
   127			ptrs_per_p4d = 512;
   128		}
   129	
   130		/*
   131		 * The trampoline will not be used if the paging mode is already set to
   132		 * the desired one.
   133		 */
   134		if (l5_required == !!(native_read_cr4() & X86_CR4_LA57))
   135			return;
   136	
   137		trampoline_32bit = (unsigned long *)find_trampoline_placement();
   138	
   139		/* Preserve trampoline memory */
   140		memcpy(trampoline_save, trampoline_32bit, TRAMPOLINE_32BIT_SIZE);
   141	
   142		/* Clear trampoline memory first */
   143		memset(trampoline_32bit, 0, TRAMPOLINE_32BIT_SIZE);
   144	
   145		/* Copy trampoline code in place */
   146		toggle_la57 = memcpy(trampoline_32bit +
   147				TRAMPOLINE_32BIT_CODE_OFFSET / sizeof(unsigned long),
   148				&trampoline_32bit_src, TRAMPOLINE_32BIT_CODE_SIZE);
   149	
   150		/*
   151		 * Avoid the need for a stack in the 32-bit trampoline code, by using
   152		 * LJMP rather than LRET to return back to long mode. LJMP takes an
   153		 * immediate absolute address, which needs to be adjusted based on the
   154		 * placement of the trampoline.
   155		 */
   156		*(u32 *)((u8 *)toggle_la57 + trampoline_ljmp_imm_offset) +=
   157							(unsigned long)toggle_la57;
   158	
   159		/*
   160		 * The code below prepares page table in trampoline memory.
   161		 *
   162		 * The new page table will be used by trampoline code for switching
   163		 * from 4- to 5-level paging or vice versa.
   164		 */
   165	
   166		if (l5_required) {
   167			/*
   168			 * For 4- to 5-level paging transition, set up current CR3 as
   169			 * the first and the only entry in a new top-level page table.
   170			 */
   171			*trampoline_32bit = __native_read_cr3() | _PAGE_TABLE_NOENC;
   172		} else {
   173			u64 *new_cr3;
   174			pgd_t *pgdp;
   175	
   176			/*
   177			 * For 5- to 4-level paging transition, copy page table pointed
   178			 * by first entry in the current top-level page table as our
   179			 * new top-level page table.
   180			 *
   181			 * We cannot just point to the page table from trampoline as it
   182			 * may be above 4G.
   183			 */
   184			pgdp = (pgd_t *)read_cr3_pa();
 > 185			new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 1/3] x86/boot: Fix page table access in 5-level to 4-level paging transition
  2025-10-22 23:16   ` Dave Hansen
  2025-10-22 23:49     ` Usama Arif
@ 2025-10-25 21:50     ` H. Peter Anvin
  1 sibling, 0 replies; 19+ messages in thread
From: H. Peter Anvin @ 2025-10-25 21:50 UTC (permalink / raw)
  To: Dave Hansen, Usama Arif, dwmw, tglx, mingo, bp, dave.hansen, ardb
  Cc: x86, apopple, thuth, nik.borisov, kas, linux-kernel, linux-efi,
	kernel-team, Michael van der Westhuizen, Tobias Fleig

On October 22, 2025 4:16:34 PM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>On 10/22/25 15:06, Usama Arif wrote:
>> +		pgdp = (pgd_t *)read_cr3_pa();
>> +		new_cr3 = (u64 *)(pgd_val(pgdp[0]) & PTE_PFN_MASK);
>> +		memcpy(trampoline_32bit, new_cr3, PAGE_SIZE);
>
>Heh, somebody likes casting, I see!
>
>But seriously, read_cr3_pa() should be returning a physical address. No?
>Today it does:
>
>static inline unsigned long read_cr3_pa(void)
>{
>        return __read_cr3() & CR3_ADDR_MASK;
>}
>
>So shouldn't CR3_ADDR_MASK be masking out any naughty non-address bits?
>Shouldn't we fix read_cr3_pa() and not do this in its caller?

Ah, the times when one can wish for C++.

Too bad they still haven't figured out tagged initializers.


end of thread, other threads:[~2025-10-25 21:50 UTC | newest]

Thread overview: 19+ messages
2025-10-22 22:06 [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
2025-10-22 22:06 ` [PATCH 1/3] x86/boot: Fix page table access in " Usama Arif
2025-10-22 23:16   ` Dave Hansen
2025-10-22 23:49     ` Usama Arif
2025-10-25 21:50     ` H. Peter Anvin
2025-10-23 17:43   ` kernel test robot
2025-10-24  8:07   ` kernel test robot
2025-10-22 22:06 ` [PATCH 2/3] efi/libstub: " Usama Arif
2025-10-23 14:13   ` Ard Biesheuvel
2025-10-23 14:28     ` Kiryl Shutsemau
2025-10-22 22:06 ` [PATCH 3/3] x86/mm: Move _PAGE_BIT_NOPTISHADOW from bit 58 to bit 9 Usama Arif
2025-10-22 23:35   ` Dave Hansen
2025-10-22 23:58     ` Usama Arif
2025-10-23 14:05       ` Dave Hansen
2025-10-23 14:24         ` Kiryl Shutsemau
2025-10-23 15:12           ` Dave Hansen
2025-10-23 15:25             ` Kiryl Shutsemau
2025-10-23 22:15             ` Usama Arif
2025-10-22 22:25 ` [PATCH 0/3] x86: Fix kexec 5-level to 4-level paging transition Usama Arif
