public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled()
@ 2025-05-20 10:41 Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

This is a follow-up to the discussion at [0], broken out of that series
so we can progress while the SEV changes are being reviewed and tested.

The current implementation of pgtable_l5_enabled() is problematic
because it has two implementations, and source files need to opt into
the correct one if they contain code that might be called very early.
Other related global pseudo-constants exist that assume different values
based on the number of paging levels, and it is hard to reason about
whether all memory mapping and page table code is guaranteed to
observe consistent values of all of these at all times during boot.
Case in point: currently, KASAN needs to be disabled during alternatives
patching because otherwise, it will reliably produce false positive
reports due to such inconsistencies.

This revision of the series still provides a single implementation of
pgtable_l5_enabled(), but no longer based on cpu_feature_enabled(), for
a number of reasons:
- fiddling with the early CPU feature detection code is not risk-free,
  and may cause regressions that are difficult to debug;
- Boris objected to the use of a separate capability flag, and using the
  existing one is trickier, as it gets set and cleared a couple of times
  during boot by the feature detection code, even if 5-level paging is
  not in use;
- by their very nature, manipulations of level 4 and level 5 page
  tables occur rarely compared to lower levels, so it is not obvious
  that the code patching in cpu_feature_enabled() is needed.

So instead, collapse the various 5-level paging related global variables
into a single byte wide pgdir_shift variable, and move it into the cache
hot per-CPU section where it can be accessed cheaply. Set it from asm
code so C will always see the same value, and derive
pgtable_l5_enabled() and PTRS_PER_P4D from it directly, ensuring that
all these quantities are always mutually consistent.
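
In rough pseudocode, the invariants this establishes look like the
following (a simplified standalone model with illustrative names, not
the actual kernel implementation - in the kernel, __pgdir_shift is a
cache hot per-CPU byte set from asm before C code runs):

```c
/* Plain global stands in for the per-CPU variable, for demonstration */
static unsigned char pgdir_shift_val = 39;	/* 48 when LA57 is enabled */

static unsigned int pgdir_shift(void)
{
	return pgdir_shift_val;
}

/* 39 is odd and 48 is even, so the low bit distinguishes 4- from 5-level */
static int pgtable_l5_enabled(void)
{
	return !(pgdir_shift() & 1);
}

/* the p4d level is folded (1 entry) without LA57, 512 entries with it */
static unsigned int ptrs_per_p4d(void)
{
	return pgtable_l5_enabled() ? 512 : 1;
}

/* __VIRTUAL_MASK_SHIFT: 47 for 4-level paging, 56 for 5-level */
static unsigned int virtual_mask_shift(void)
{
	return pgdir_shift() + 8;
}
```

All derived quantities change together when the single byte changes,
which is the consistency property argued for above.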

If pgtable_l5_enabled() requires more optimization, we can consider
alternatives, runtime constants, etc. but whether this is actually
necessary is TBD. Suggestions welcome for (micro-)benchmarks that
illustrate the perf delta.

Build and boot tested using QEMU with LA57 emulation.

Changes since v4:
- Add patch to fix MAX_PHYSMEM_BITS (and drop an occurrence of
  pgtable_l5_enabled())
- Re-order the changes and split across more patches so any potential
  performance hit is bisectable.

Changes since v3:
- Drop asm-offsets patch which has been merged already
- Rebase onto tip/x86/core which now carries some related changes by
  Kirill
- Avoid adding new instances of '#ifdef CONFIG_X86_5LEVEL' where
  possible, as it is going to be removed soon
- Move cap override arrays straight to __ro_after_init
- Drop KVM changes entirely - they were wrong and unnecessary
- Drop the new "la57_hw" capability flag for now - we can always add it
  later if there is a need.

Changes since v2:
- Drop first patch which has been merged
- Rename existing "la57" CPU flag to "la57_hw" and use "la57" to
  indicate that 5 level paging is being used
- Move memset() out of identify_cpu()
- Make set/clear cap override arrays ro_after_init
- Split off asm-offsets update

[0] https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Borislav Petkov <bp@alien8.de>

Ard Biesheuvel (7):
  x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
  x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
  x86/mm: Derive pgtable_l5_enabled() from pgdir_shift()
  x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions
  x86/boot: Drop 5-level paging related global variable
  x86/boot: Remove KASAN workaround for 4/5 level paging switch

 arch/x86/boot/compressed/misc.h         |  8 +++---
 arch/x86/boot/compressed/pgtable_64.c   | 10 --------
 arch/x86/boot/startup/map_kernel.c      | 18 +------------
 arch/x86/boot/startup/sme.c             |  9 -------
 arch/x86/include/asm/page_64_types.h    |  2 +-
 arch/x86/include/asm/pgtable_64_types.h | 27 ++++++++------------
 arch/x86/include/asm/sparsemem.h        |  2 +-
 arch/x86/kernel/alternative.c           | 12 ---------
 arch/x86/kernel/cpu/common.c            |  3 ---
 arch/x86/kernel/head64.c                |  9 -------
 arch/x86/kernel/head_64.S               |  5 ++++
 arch/x86/mm/kasan_init_64.c             |  3 ---
 arch/x86/mm/pgtable.c                   |  4 +++
 13 files changed, 26 insertions(+), 86 deletions(-)


base-commit: 54c2c688cd9305bdbab4883b9da6ff63f4deca5d
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  2025-05-20 10:59   ` Kirill A. Shutemov
  2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

As the Intel SDM states, MAXPHYADDR is up to 52 bits when running in
long mode, and this is independent of the number of paging levels.
I.e., it is permitted for a 4-level hierarchy to use 52-bit output
addresses in the descriptors, both for next-level tables and for the
mappings themselves.

So set MAX_PHYSMEM_BITS to 52 in all cases for x86_64, and drop the
MAX_POSSIBLE_PHYSMEM_BITS definition, which becomes redundant as a
result.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h | 2 --
 arch/x86/include/asm/sparsemem.h        | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 4604f924d8b8..1481b234465a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -56,8 +56,6 @@ extern unsigned int ptrs_per_p4d;
 #define P4D_SIZE		(_AC(1, UL) << P4D_SHIFT)
 #define P4D_MASK		(~(P4D_SIZE - 1))
 
-#define MAX_POSSIBLE_PHYSMEM_BITS	52
-
 /*
  * 3rd level page
  */
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 3918c7a434f5..550b6d73ae22 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -26,7 +26,7 @@
 # endif
 #else /* CONFIG_X86_32 */
 # define SECTION_SIZE_BITS	27 /* matt - 128 is convenient right now */
-# define MAX_PHYSMEM_BITS	(pgtable_l5_enabled() ? 52 : 46)
+# define MAX_PHYSMEM_BITS	52
 #endif
 
 #endif /* CONFIG_SPARSEMEM */
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  2025-05-20 11:03   ` Kirill A. Shutemov
  2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

Shrink the global variable 'pgdir_shift' to a single byte, and move it
to the cache hot per-CPU variable section so that it can be accessed
virtually for free.

Set the variable from asm code before entering C so that C code is
guaranteed to always consistently observe the correct value.

For the decompressor, provide an alternative that tests the CR4.LA57
control bit directly - this is more costly in principle, but that should
not matter for the early boot code, and the LA57 control bit is
guaranteed to be in sync with the number of paging levels in use.
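
As a standalone illustration of the decompressor fallback (hypothetical
helper name; the actual macro in misc.h reads CR4 directly via
native_read_cr4()):

```c
#include <stdint.h>

#define X86_CR4_LA57	(1ULL << 12)	/* CR4.LA57 is bit 12 */

/* Derive the top-level page table shift from the LA57 control bit,
 * which is architecturally in sync with the number of paging levels. */
static unsigned int pgdir_shift_from_cr4(uint64_t cr4)
{
	return (cr4 & X86_CR4_LA57) ? 48 : 39;
}
```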

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/misc.h         |  6 +++++-
 arch/x86/boot/compressed/pgtable_64.c   |  2 --
 arch/x86/boot/startup/map_kernel.c      |  1 -
 arch/x86/include/asm/page_64_types.h    |  2 +-
 arch/x86/include/asm/pgtable_64_types.h | 13 +++++++++++--
 arch/x86/kernel/head64.c                |  2 --
 arch/x86/kernel/head_64.S               |  5 +++++
 arch/x86/mm/pgtable.c                   |  4 ++++
 8 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index db1048621ea2..97b80d08f03c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,10 @@
 #define __pa(x)  ((unsigned long)(x))
 #define __va(x)  ((void *)((unsigned long)(x)))
 
+#ifdef CONFIG_X86_64
+#define pgdir_shift()	(native_read_cr4() & X86_CR4_LA57 ? 48 : 39)
+#endif
+
 #include <linux/linkage.h>
 #include <linux/screen_info.h>
 #include <linux/elf.h>
@@ -189,7 +193,7 @@ static inline int count_immovable_mem_regions(void) { return 0; }
 #endif
 
 /* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
+extern unsigned int __pgtable_l5_enabled, ptrs_per_p4d;
 extern void kernel_add_identity_map(unsigned long start, unsigned long end);
 
 /* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index bdd26050dff7..898a4e66e401 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -12,7 +12,6 @@
 
 /* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
 unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
 unsigned int __section(".data") ptrs_per_p4d = 1;
 
 /* Buffer to preserve trampoline memory */
@@ -123,7 +122,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 
 		/* Initialize variables for 5-level paging */
 		__pgtable_l5_enabled = 1;
-		pgdir_shift = 48;
 		ptrs_per_p4d = 512;
 	}
 
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 332dbe6688c4..06306c5d016f 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -24,7 +24,6 @@ static inline bool check_la57_support(void)
 		return false;
 
 	__pgtable_l5_enabled	= 1;
-	pgdir_shift		= 48;
 	ptrs_per_p4d		= 512;
 
 	return true;
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 7400dab373fe..e1fccd9116c3 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -48,7 +48,7 @@
 /* See Documentation/arch/x86/x86_64/mm.rst for a description of the memory map. */
 
 #define __PHYSICAL_MASK_SHIFT	52
-#define __VIRTUAL_MASK_SHIFT	(pgtable_l5_enabled() ? 56 : 47)
+#define __VIRTUAL_MASK_SHIFT	(pgdir_shift() + 8)
 
 #define TASK_SIZE_MAX		task_size_max()
 #define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 1481b234465a..3ee747f596e3 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -7,6 +7,7 @@
 #ifndef __ASSEMBLER__
 #include <linux/types.h>
 #include <asm/kaslr.h>
+#include <asm/percpu.h>
 
 /*
  * These are used to make use of C type-checking..
@@ -23,6 +24,15 @@ typedef struct { pmdval_t pmd; } pmd_t;
 
 extern unsigned int __pgtable_l5_enabled;
 
+#ifndef pgdir_shift
+DECLARE_PER_CPU_CACHE_HOT(u8, __pgdir_shift);
+
+static __always_inline __attribute_const__ u8 pgdir_shift(void)
+{
+	return this_cpu_read_stable(__pgdir_shift);
+}
+#endif /* pgdir_shift */
+
 #ifdef USE_EARLY_PGTABLE_L5
 /*
  * cpu_feature_enabled() is not available in early boot code.
@@ -36,7 +46,6 @@ static inline bool pgtable_l5_enabled(void)
 #define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
 #endif /* USE_EARLY_PGTABLE_L5 */
 
-extern unsigned int pgdir_shift;
 extern unsigned int ptrs_per_p4d;
 
 #endif	/* !__ASSEMBLER__ */
@@ -44,7 +53,7 @@ extern unsigned int ptrs_per_p4d;
 /*
  * PGDIR_SHIFT determines what a top-level page table entry can map
  */
-#define PGDIR_SHIFT	pgdir_shift
+#define PGDIR_SHIFT	pgdir_shift()
 #define PTRS_PER_PGD	512
 
 /*
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 533fcf5636fc..e2d9e709ec01 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -52,8 +52,6 @@ SYM_PIC_ALIAS(next_early_pgt);
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
 unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
 unsigned int ptrs_per_p4d __ro_after_init = 1;
 EXPORT_SYMBOL(ptrs_per_p4d);
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3e9b3a3bd039..d35b75a5f892 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -71,6 +71,11 @@ SYM_CODE_START_NOALIGN(startup_64)
 	xorl	%edx, %edx
 	wrmsr
 
+	movq	%cr4, %rax
+	btl	$X86_CR4_LA57_BIT, %eax
+	jnc	0f
+	movb	$48, PER_CPU_VAR(__pgdir_shift)
+0:
 	call	startup_64_setup_gdt_idt
 
 	/* Now switch to __KERNEL_CS so IRET works reliably */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 62777ba4de1a..a7eaeaa595f8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -46,6 +46,10 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
+DEFINE_PER_CPU_CACHE_HOT(u8, __pgdir_shift) = 39;
+EXPORT_SYMBOL_GPL(__pgdir_shift);
+SYM_PIC_ALIAS(__pgdir_shift);
+
 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 {
 	paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  2025-05-20 11:08   ` Kirill A. Shutemov
  2025-05-20 10:41 ` [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift() Ard Biesheuvel
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

Define the value of PTRS_PER_P4D in terms of pgdir_shift(), which can be
accessed cheaply, removing the need for a global variable that needs to
be kept in sync manually.
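
The new definition relies on GCC's two-operand conditional (`a ?: b`
yields a if a is nonzero, else b): pgdir_shift() is 39 (odd) without
LA57 and 48 (even) with it, so `pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D`
evaluates to 1 or 512 respectively. A standalone model of the
expression:

```c
#define MAX_PTRS_PER_P4D	512

/* Mirrors the PTRS_PER_P4D definition in the patch; the two-operand
 * ?: form is a GNU C extension. */
static unsigned int ptrs_per_p4d(unsigned int pgdir_shift)
{
	return (pgdir_shift & 1) ?: MAX_PTRS_PER_P4D;
}
```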

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/misc.h         | 2 +-
 arch/x86/boot/compressed/pgtable_64.c   | 2 --
 arch/x86/boot/startup/map_kernel.c      | 1 -
 arch/x86/include/asm/pgtable_64_types.h | 4 +---
 arch/x86/kernel/head64.c                | 2 --
 5 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 97b80d08f03c..3157f2fbc593 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -193,7 +193,7 @@ static inline int count_immovable_mem_regions(void) { return 0; }
 #endif
 
 /* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, ptrs_per_p4d;
+extern unsigned int __pgtable_l5_enabled;
 extern void kernel_add_identity_map(unsigned long start, unsigned long end);
 
 /* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 898a4e66e401..965fca150e68 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -12,7 +12,6 @@
 
 /* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
 unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") ptrs_per_p4d = 1;
 
 /* Buffer to preserve trampoline memory */
 static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
@@ -122,7 +121,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 
 		/* Initialize variables for 5-level paging */
 		__pgtable_l5_enabled = 1;
-		ptrs_per_p4d = 512;
 	}
 
 	/*
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 06306c5d016f..5d3c6108f1c3 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -24,7 +24,6 @@ static inline bool check_la57_support(void)
 		return false;
 
 	__pgtable_l5_enabled	= 1;
-	ptrs_per_p4d		= 512;
 
 	return true;
 }
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 3ee747f596e3..d8e39c479387 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -46,8 +46,6 @@ static inline bool pgtable_l5_enabled(void)
 #define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
 #endif /* USE_EARLY_PGTABLE_L5 */
 
-extern unsigned int ptrs_per_p4d;
-
 #endif	/* !__ASSEMBLER__ */
 
 /*
@@ -61,7 +59,7 @@ extern unsigned int ptrs_per_p4d;
  */
 #define P4D_SHIFT		39
 #define MAX_PTRS_PER_P4D	512
-#define PTRS_PER_P4D		ptrs_per_p4d
+#define PTRS_PER_P4D		(pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D)
 #define P4D_SIZE		(_AC(1, UL) << P4D_SHIFT)
 #define P4D_MASK		(~(P4D_SIZE - 1))
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index e2d9e709ec01..fe0770f468c3 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -52,8 +52,6 @@ SYM_PIC_ALIAS(next_early_pgt);
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
 unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
 
 unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
 EXPORT_SYMBOL(page_offset_base);
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift()
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions Ard Biesheuvel
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

Instead of using two versions of pgtable_l5_enabled() which go out of
sync a few times during boot, use a single implementation, and base it
on pgdir_shift(), which can be accessed cheaply when running in the core
kernel.

This is the most efficient way to obtain this information without
relying on code patching, on which the 'late' implementation depends
today (via the ternary alternative in cpu_feature_enabled()).
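
The parity trick works because the only two possible shift values, 39
and 48, differ in their low bit; sketched standalone:

```c
/* pgdir_shift is 39 with 4-level paging and 48 with 5-level paging;
 * only the 5-level value is even. */
static int pgtable_l5_enabled(unsigned int pgdir_shift)
{
	return !(pgdir_shift & 1);
}
```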

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/include/asm/pgtable_64_types.h | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index d8e39c479387..ed847a90cf4f 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -33,18 +33,8 @@ static __always_inline __attribute_const__ u8 pgdir_shift(void)
 }
 #endif /* pgdir_shift */
 
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
-{
-	return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+/* PGDIR shift == 39 -> L5 disabled, 48 -> L5 enabled */
+#define pgtable_l5_enabled() !(pgdir_shift() & 1)
 
 #endif	/* !__ASSEMBLER__ */
 
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
                   ` (3 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift() Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch Ard Biesheuvel
  6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

Drop USE_EARLY_PGTABLE_L5 definitions, which no longer have any effect.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/misc.h | 3 ---
 arch/x86/boot/startup/sme.c     | 9 ---------
 arch/x86/kernel/cpu/common.c    | 3 ---
 arch/x86/kernel/head64.c        | 3 ---
 arch/x86/mm/kasan_init_64.c     | 3 ---
 5 files changed, 21 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 3157f2fbc593..7f79c426f542 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@
 
 #define __NO_FORTIFY
 
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 /*
  * Boot stub deals with identity mappings, physical and virtual addresses are
  * the same, so override these defines.
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 70ea1748c0a7..a6c25d005991 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -25,15 +25,6 @@
 #undef CONFIG_PARAVIRT_XXL
 #undef CONFIG_PARAVIRT_SPINLOCKS
 
-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/mem_encrypt.h>
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8feb8fd2957a..256613b24bd4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,7 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/memblock.h>
 #include <linux/linkage.h>
 #include <linux/bitops.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index fe0770f468c3..5e5da6574a83 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -5,9 +5,6 @@
  *  Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
  */
 
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/init.h>
 #include <linux/linkage.h>
 #include <linux/types.h>
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0539efd0d216..7c4fafbd52cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -1,9 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #define pr_fmt(fmt) "kasan: " fmt
 
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
 #include <linux/memblock.h>
 #include <linux/kasan.h>
 #include <linux/kdebug.h>
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
                   ` (4 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  2025-05-20 10:41 ` [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch Ard Biesheuvel
  6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

The variable __pgtable_l5_enabled is no longer used, so it can be
dropped.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/boot/compressed/misc.h         |  1 -
 arch/x86/boot/compressed/pgtable_64.c   |  6 ------
 arch/x86/boot/startup/map_kernel.c      | 16 +---------------
 arch/x86/include/asm/pgtable_64_types.h |  2 --
 arch/x86/kernel/head64.c                |  2 --
 5 files changed, 1 insertion(+), 26 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 7f79c426f542..d6860b50afaf 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -190,7 +190,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
 #endif
 
 /* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled;
 extern void kernel_add_identity_map(unsigned long start, unsigned long end);
 
 /* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 965fca150e68..4772919411a9 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -10,9 +10,6 @@
 #define BIOS_START_MIN		0x20000U	/* 128K, less than this is insane */
 #define BIOS_START_MAX		0x9f000U	/* 640K, absolute maximum */
 
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-
 /* Buffer to preserve trampoline memory */
 static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
 
@@ -118,9 +115,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
 	if (!cmdline_find_option_bool("no5lvl") &&
 	    native_cpuid_eax(0) >= 7 && (native_cpuid_ecx(7) & BIT(16))) {
 		l5_required = true;
-
-		/* Initialize variables for 5-level paging */
-		__pgtable_l5_enabled = 1;
 	}
 
 	/*
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 5d3c6108f1c3..328a343bc355 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -14,20 +14,6 @@
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 extern unsigned int next_early_pgt;
 
-static inline bool check_la57_support(void)
-{
-	/*
-	 * 5-level paging is detected and enabled at kernel decompression
-	 * stage. Only check if it has been enabled there.
-	 */
-	if (!(native_read_cr4() & X86_CR4_LA57))
-		return false;
-
-	__pgtable_l5_enabled	= 1;
-
-	return true;
-}
-
 static unsigned long __head sme_postprocess_startup(struct boot_params *bp,
 						    pmdval_t *pmd,
 						    unsigned long p2v_offset)
@@ -97,7 +83,7 @@ unsigned long __head __startup_64(unsigned long p2v_offset,
 	bool la57;
 	int i;
 
-	la57 = check_la57_support();
+	la57 = pgtable_l5_enabled();
 
 	/* Is the address too large? */
 	if (physaddr >> MAX_PHYSMEM_BITS)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index ed847a90cf4f..3c56b98b87b0 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -22,8 +22,6 @@ typedef unsigned long	pgprotval_t;
 typedef struct { pteval_t pte; } pte_t;
 typedef struct { pmdval_t pmd; } pmd_t;
 
-extern unsigned int __pgtable_l5_enabled;
-
 #ifndef pgdir_shift
 DECLARE_PER_CPU_CACHE_HOT(u8, __pgdir_shift);
 
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e5da6574a83..137c93498601 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -48,8 +48,6 @@ unsigned int __initdata next_early_pgt;
 SYM_PIC_ALIAS(next_early_pgt);
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
 
-unsigned int __pgtable_l5_enabled __ro_after_init;
-
 unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
 EXPORT_SYMBOL(page_offset_base);
 unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
-- 
2.49.0.1101.gccaa498523-goog


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch
  2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
                   ` (5 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
  6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
	Kirill A. Shutemov, Borislav Petkov

From: Ard Biesheuvel <ardb@kernel.org>

pgtable_l5_enabled() is no longer based on a CPU feature bit that might
change values and confuse KASAN, so we no longer need to disable it
temporarily.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/kernel/alternative.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f3c68b586a95..e39823d8d1ae 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -589,16 +589,6 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 
 	DPRINTK(ALT, "alt table %px, -> %px", start, end);
 
-	/*
-	 * KASAN_SHADOW_START is defined using
-	 * cpu_feature_enabled(X86_FEATURE_LA57) and is therefore patched here.
-	 * During the process, KASAN becomes confused seeing partial LA57
-	 * conversion and triggers a false-positive out-of-bound report.
-	 *
-	 * Disable KASAN until the patching is complete.
-	 */
-	kasan_disable_current();
-
 	/*
 	 * The scan order should be from start to end. A later scanned
 	 * alternative code can overwrite previously scanned alternative code.
@@ -666,8 +656,6 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
 
 		text_poke_early(instr, insn_buff, insn_buff_sz);
 	}
-
-	kasan_enable_current();
 }
 
 static inline bool is_jcc32(struct insn *insn)
-- 
2.49.0.1101.gccaa498523-goog



* Re: [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
  2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
@ 2025-05-20 10:59   ` Kirill A. Shutemov
  2025-05-20 11:27     ` Ard Biesheuvel
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-05-20 10:59 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
	Brian Gerst, Borislav Petkov

On Tue, May 20, 2025 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> As the Intel SDM states, MAXPHYADDR is up to 52 bits when running in
> long mode, and this is independent from the number of levels of paging.
> I.e., it is permitted for a 4-level hierarchy to use 52-bit output
> addresses in the descriptors, both for next-level tables and for the
> mappings themselves.
> 
> So set MAX_PHYSMEM_BITS to 52 in all cases for x86_64, and drop the
> MAX_POSSIBLE_PHYSMEM_BITS definition, which becomes redundant as a
> result.

I think it will backfire.

We only have a 46-bit window in memory layout if 4-level paging is
enabled. Currently, we truncate PA to whatever fits into 46 bits.

I expect to see weird failures if you try to boot with this patch in
4-level paging mode on a machine with > 64 TiB of memory.

If we want to go down this path, it might be useful to refuse to boot
altogether in 4-level paging mode if there's anything in the memory map
above 46 bits.
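
Purely as a sketch of what such a refusal check could look like (the
memory map structure and helper below are hypothetical stand-ins; the
real kernel would walk the e820 table during early boot):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory map entry; the kernel would use the e820 table. */
struct mem_range {
	uint64_t start;
	uint64_t size;
};

/* With 4-level paging, the direct map covers a 46-bit physical window. */
#define PA_LIMIT_4LEVEL		(1ULL << 46)

/*
 * Return true if any usable memory lies above what a 4-level direct map
 * can reach. In that case booting in 4-level mode should be refused
 * outright rather than silently truncating physical addresses.
 */
static bool memmap_exceeds_4level_limit(const struct mem_range *map, int nr)
{
	for (int i = 0; i < nr; i++)
		if (map[i].start + map[i].size > PA_LIMIT_4LEVEL)
			return true;
	return false;
}
```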

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
@ 2025-05-20 11:03   ` Kirill A. Shutemov
  2025-05-20 11:28     ` Ard Biesheuvel
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-05-20 11:03 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
	Brian Gerst, Borislav Petkov

On Tue, May 20, 2025 at 12:41:41PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Shrink the global variable 'pgdir_shift' to a single byte, and move it
> to the cache hot per-CPU variable section so that it can be accessed
> virtually for free.

Hm. I am not convinced about per-CPU being useful here. Read-only cache
lines can be shared between cores anyway.

Putting it in __ro_after_init should do the trick, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
  2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
@ 2025-05-20 11:08   ` Kirill A. Shutemov
  2025-05-20 11:29     ` Ard Biesheuvel
  0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-05-20 11:08 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
	Brian Gerst, Borislav Petkov

On Tue, May 20, 2025 at 12:41:42PM +0200, Ard Biesheuvel wrote:
> @@ -61,7 +59,7 @@ extern unsigned int ptrs_per_p4d;
>   */
>  #define P4D_SHIFT		39
>  #define MAX_PTRS_PER_P4D	512
> -#define PTRS_PER_P4D		ptrs_per_p4d
> +#define PTRS_PER_P4D		(pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D)

"& 1" is a hack and at least deserves a comment.
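
For reference, the trick relies on PGDIR_SHIFT being 39 (odd) with
4-level paging and 48 (even) with 5-level paging, so the low bit alone
distinguishes the two. A minimal standalone illustration (not the
kernel macro itself):

```c
#include <assert.h>

#define MAX_PTRS_PER_P4D	512

/*
 * pgdir_shift() is 39 with 4-level paging and 48 with 5-level paging.
 * 39 is odd and 48 is even, so (shift & 1) is 1 exactly when the p4d
 * level is folded; the GNU "?:" extension then yields 1 (a single,
 * folded p4d entry), or falls through to the full 512 entries.
 */
static int ptrs_per_p4d(int pgdir_shift)
{
	return (pgdir_shift & 1) ?: MAX_PTRS_PER_P4D;
}
```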

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
  2025-05-20 10:59   ` Kirill A. Shutemov
@ 2025-05-20 11:27     ` Ard Biesheuvel
  0 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 11:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
	Brian Gerst, Borislav Petkov

On Tue, 20 May 2025 at 12:59, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, May 20, 2025 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > As the Intel SDM states, MAXPHYADDR is up to 52 bits when running in
> > long mode, and this is independent from the number of levels of paging.
> > I.e., it is permitted for a 4-level hierarchy to use 52-bit output
> > addresses in the descriptors, both for next-level tables and for the
> > mappings themselves.
> >
> > So set MAX_PHYSMEM_BITS to 52 in all cases for x86_64, and drop the
> > MAX_POSSIBLE_PHYSMEM_BITS definition, which becomes redundant as a
> > result.
>
> I think it will backfire.
>
> We only have a 46-bit window in memory layout if 4-level paging is
> enabled. Currently, we truncate PA to whatever fits into 46 bits.
>

This is the linear map, right?

I assumed that this affected MMIO mappings too, but it seems x86 does
not rely on MAX_PHYSMEM_BITS for that.

> I expect to see weird failures if you try to boot with this patch in
> 4-level paging mode on machine with > 64 TiB of memory.
>
> If we want to go this path, it might be useful to refuse to boot
> altogether in 4-level paging mode if there's anything in memory map above
> 46-bit.
>

Agreed - if RAM does not fit, it makes no sense to limp on. I assumed
this limit applied to any physical address.

I'll withdraw the patch - it was just an unrelated thing I spotted, so
that shouldn't affect the rest of the series.


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 11:03   ` Kirill A. Shutemov
@ 2025-05-20 11:28     ` Ard Biesheuvel
  2025-05-20 14:35       ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 11:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
	Brian Gerst, Borislav Petkov

On Tue, 20 May 2025 at 13:03, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, May 20, 2025 at 12:41:41PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Shrink the global variable 'pgdir_shift' to a single byte, and move it
> > to the cache hot per-CPU variable section so that it can be accessed
> > virtually for free.
>
> Hm. I am not convinced about per-CPU being useful here. Read-only cache
> lines can be shared between cores anyway.
>
> Putting it to __ro_after_init should do the trick, no?
>

Are you saying we shouldn't optimize prematurely? You must be new here :-)

But seriously, whatever works is fine with me, as long as the
observable return value of pgtable_l5_enabled() never changes.


* Re: [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
  2025-05-20 11:08   ` Kirill A. Shutemov
@ 2025-05-20 11:29     ` Ard Biesheuvel
  0 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 11:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
	Brian Gerst, Borislav Petkov

On Tue, 20 May 2025 at 13:08, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, May 20, 2025 at 12:41:42PM +0200, Ard Biesheuvel wrote:
> > @@ -61,7 +59,7 @@ extern unsigned int ptrs_per_p4d;
> >   */
> >  #define P4D_SHIFT            39
> >  #define MAX_PTRS_PER_P4D     512
> > -#define PTRS_PER_P4D         ptrs_per_p4d
> > +#define PTRS_PER_P4D         (pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D)
>
> "& 1" is a hack and at least deserves a comment.
>

Fair enough.


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 11:28     ` Ard Biesheuvel
@ 2025-05-20 14:35       ` Borislav Petkov
  2025-05-20 17:03         ` Ard Biesheuvel
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 14:35 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
	Ingo Molnar, Linus Torvalds, Brian Gerst

On Tue, May 20, 2025 at 01:28:52PM +0200, Ard Biesheuvel wrote:
> Are you saying we shouldn't optimize prematurely? You must be new here :-)

Right, but do you see __pgdir_shift on a seriously hot path to warrant this?

I'd expect this to happen the other way around, really: "Hey, pgdir_shift is
getting accessed a lot, let's stick it somewhere where we can access it
quickly..."

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 14:35       ` Borislav Petkov
@ 2025-05-20 17:03         ` Ard Biesheuvel
  2025-05-20 17:38           ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 17:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
	Ingo Molnar, Linus Torvalds, Brian Gerst

On Tue, 20 May 2025 at 16:35, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, May 20, 2025 at 01:28:52PM +0200, Ard Biesheuvel wrote:
> > Are you saying we shouldn't optimize prematurely? You must be new here :-)
>
> Right, but do you see __pgdir_shift on a seriously hot path to warrant this?
>

No. But if you had read the next couple of patches, you would have
noticed that PGDIR_SHIFT, PTRS_PER_P4D and pgtable_l5_enabled() will
all be derived from this variable, and the latter currently uses code
patching (in cpu_feature_enabled()).

This is also explained in the cover letter, btw.


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 17:03         ` Ard Biesheuvel
@ 2025-05-20 17:38           ` Borislav Petkov
  2025-05-20 17:46             ` Ard Biesheuvel
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 17:38 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
	Ingo Molnar, Linus Torvalds, Brian Gerst

On Tue, May 20, 2025 at 07:03:37PM +0200, Ard Biesheuvel wrote:
> No. But if you had read the next couple of patches, you would have
> noticed that PGDIR_SHIFT, PTRS_PER_P4D and pgtable_l5_enabled() will
> all be derived from this variable, and the latter currently uses code
> patching (in cpu_feature_enabled())
> 
> This is also explained in the cover letter btw

Yes, I saw that.

The question remains: are the *users* - PGDIR_SHIFT, etc - on some hot path
which I'm not seeing?

For example pgd_index() is called in a bunch of places and I guess that adds
up. But without measuring that, we won't know for sure.

Looking at an example:

# ./arch/x86/include/asm/pgtable_64_types.h:32:         return this_cpu_read_stable(__pgdir_shift);
#APP
# 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
        movb %gs:__pgdir_shift(%rip), %cl       #, pfo_val__
# 0 "" 2

...

        movq    40(%rdi), %rax  # ppd_20(D)->vaddr, tmp128
        shrq    %cl, %rax       # pfo_val__, tmp128
# arch/x86/boot/startup/sme.c:105:      pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
        movq    8(%rdi), %rcx   # ppd_20(D)->pgd, ppd_20(D)->pgd
# arch/x86/boot/startup/sme.c:105:      pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
        andl    $511, %eax      #, tmp130
# arch/x86/boot/startup/sme.c:105:      pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
        leaq    (%rcx,%rax,8), %rsi     #, pgd_p

that looks like two insns to me: the RIP-relative mov to %cl and then the
shift.

If you use a "normal" variable, that would be also two insns, no?

Or am I way off?

Because if not, the percpu thing doesn't buy you anything...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 17:38           ` Borislav Petkov
@ 2025-05-20 17:46             ` Ard Biesheuvel
  2025-05-20 18:01               ` Borislav Petkov
  0 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 17:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
	Ingo Molnar, Linus Torvalds, Brian Gerst

On Tue, 20 May 2025 at 19:38, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, May 20, 2025 at 07:03:37PM +0200, Ard Biesheuvel wrote:
> > No. But if you had read the next couple of patches, you would have
> > noticed that PGDIR_SHIFT, PTRS_PER_P4D and pgtable_l5_enabled() will
> > all be derived from this variable, and the latter currently uses code
> > patching (in cpu_feature_enabled())
> >
> > This is also explained in the cover letter btw
>
> Yes, I saw that.
>
> The question remains: are the *users* - PGDIR_SHIFT, etc - on some hot path
> which I'm not seeing?
>
> For example pgd_index() is called in a bunch of places and I guess that adds
> up. But without measuring that, we won't know for sure.
>

Look at pgtable_l5_enabled() please, that is the important one.

> Looking at an example:
>
> # ./arch/x86/include/asm/pgtable_64_types.h:32:         return this_cpu_read_stable(__pgdir_shift);
> #APP
> # 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
>         movb %gs:__pgdir_shift(%rip), %cl       #, pfo_val__
> # 0 "" 2
>
> ...
>
>         movq    40(%rdi), %rax  # ppd_20(D)->vaddr, tmp128
>         shrq    %cl, %rax       # pfo_val__, tmp128
> # arch/x86/boot/startup/sme.c:105:      pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
>         movq    8(%rdi), %rcx   # ppd_20(D)->pgd, ppd_20(D)->pgd
> # arch/x86/boot/startup/sme.c:105:      pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
>         andl    $511, %eax      #, tmp130
> # arch/x86/boot/startup/sme.c:105:      pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
>         leaq    (%rcx,%rax,8), %rsi     #, pgd_p
>
> that looks like two insns to me: the RIP-relative mov to %cl and then the
> shift.
>
> If you use a "normal" variable, that would be also two insns, no?
>
> Or am I way off?
>
> Because if not, the percpu thing doesn't buy you anything...
>

The variable access is identical in terms of instructions; the only
difference is the %gs offset being applied, and the fact that using
cache-hot data is guaranteed not to increase the number of cachelines
covering the working set of any existing workload (the region is
bounded to a fixed number of cachelines).

Happy to keep this as a simple __ro_after_init variable if there is
consensus between the tip maintainers that we don't need this perf
advantage, or we can use some kind of code patching that is not CPU
feature based or ternary alternative based to short circuit these
things (and pgtable_l5_enabled() in particular) after boot.


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 17:46             ` Ard Biesheuvel
@ 2025-05-20 18:01               ` Borislav Petkov
  2025-05-20 18:28                 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 18:01 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
	Ingo Molnar, Linus Torvalds, Brian Gerst

On Tue, May 20, 2025 at 07:46:33PM +0200, Ard Biesheuvel wrote:
> Look at pgtable_l5_enabled() please, that is the important one.

OMG. :-)

# 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
        movb %gs:__pgdir_shift(%rip), %al       #, pfo_val__
# 0 "" 2
# ./arch/x86/include/asm/pgtable.h:1178:        if (!pgtable_l5_enabled())
#NO_APP
        testb   $1, %al #, pfo_val__

I hadn't seen this fun yet:

!(pgdir_shift() & 1)

> The variable access is identical in terms of instructions, the only
> difference is the %gs offset being applied, and the fact that using
> cache hot data is guaranteed not to increase the number of cachelines
> covering the working set of any existing workload (the region is
> bounded to a fixed number of cachelines)

Yes, but look at all the callers. I hardly see any serious hot paths to care
about the %gs offset being applied.

And I'll be really surprised if that *shows* in any sensible benchmark...

Also, it doesn't make any sense for a *global* system property - what paging
level the machine uses - to be in a *per-CPU* variable. That's a global value
which is the same everywhere. So why waste space?

> Happy to keep this as a simple __ro_after_init variable if there is
> consensus between the tip maintainers that we don't need this perf

Yeah, IMO, no need. But let's see what the others say.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 18:01               ` Borislav Petkov
@ 2025-05-20 18:28                 ` Linus Torvalds
  2025-05-20 18:35                   ` Borislav Petkov
  2025-05-20 19:49                   ` Ard Biesheuvel
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2025-05-20 18:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ard Biesheuvel, Kirill A. Shutemov, Ard Biesheuvel, linux-kernel,
	x86, Ingo Molnar, Brian Gerst

On Tue, 20 May 2025 at 11:01, Borislav Petkov <bp@alien8.de> wrote:
>
> OMG. :-)
>
> # 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
>         movb %gs:__pgdir_shift(%rip), %al       #, pfo_val__
> # 0 "" 2
> # ./arch/x86/include/asm/pgtable.h:1178:        if (!pgtable_l5_enabled())
> #NO_APP
>         testb   $1, %al #, pfo_val__

That's garbage.

Gcc should be able to turn it into just a

        testb $1,%gs:__pgdir_shift(%rip)

What happens if pgtable_l5_enabled() is made to use __raw_cpu_read()?
With a compiler that is new enough to support USE_X86_SEG_SUPPORT?

Oh, and it looks like we messed up __raw_cpu_read_stable(), because
that *always* uses the inline asm, so it doesn't allow the compiler to
just DTRT and do things like the above.

                Linus


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 18:28                 ` Linus Torvalds
@ 2025-05-20 18:35                   ` Borislav Petkov
  2025-05-20 19:49                   ` Ard Biesheuvel
  1 sibling, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ard Biesheuvel, Kirill A. Shutemov, Ard Biesheuvel, linux-kernel,
	x86, Ingo Molnar, Brian Gerst

On Tue, May 20, 2025 at 11:28:28AM -0700, Linus Torvalds wrote:
> What happens if pgtable_l5_enabled() is made to use  __raw_cpu_read()?
> With a compiler that is new enough to support USE_X86_SEG_SUPPORT?

Before we go play with this more, I see the same thing with gcc-13 and gcc-14
on debian.

Those should be fairly new, I'd think...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
  2025-05-20 18:28                 ` Linus Torvalds
  2025-05-20 18:35                   ` Borislav Petkov
@ 2025-05-20 19:49                   ` Ard Biesheuvel
  1 sibling, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 19:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Kirill A. Shutemov, Ard Biesheuvel, linux-kernel,
	x86, Ingo Molnar, Brian Gerst

On Tue, 20 May 2025 at 20:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 20 May 2025 at 11:01, Borislav Petkov <bp@alien8.de> wrote:
> >
> > OMG. :-)
> >
> > # 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
> >         movb %gs:__pgdir_shift(%rip), %al       #, pfo_val__
> > # 0 "" 2
> > # ./arch/x86/include/asm/pgtable.h:1178:        if (!pgtable_l5_enabled())
> > #NO_APP
> >         testb   $1, %al #, pfo_val__
>
> That's garbage.
>
> Gcc should be able to turn it into just a
>
>         testb $1,%gs:__pgdir_shift(%rip)
>
> What happens if pgtable_l5_enabled() is made to use  __raw_cpu_read()?
> With a compiler that is new enough to support USE_X86_SEG_SUPPORT?
>
> Oh, and it looks like we messed up __raw_cpu_read_stable(), because
> that *always* uses the inline asm, so it doesn't allow the compiler to
> just DTRT and do things like the above.
>

I suppose that is what this_cpu_read_const() is for, but that
requires a const alias declaration and an entry in the linker script.

If we decide to go with a per-cpu var, I can add this
USE_X86_SEG_SUPPORT codepath like we have in other places - that
should give the compiler sufficient insight into the fact that this
value never changes, and generate optimal code as a result.

For an ordinary variable, I suppose we can still declare it as const,
__attribute__((const)) etc if it is defined in and set from the asm
startup code, resulting in the same potential for optimization
(without alternatives or runtime const hacks).
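
A sketch of that last idea, with hypothetical names, and the variable
initialized at its definition to stand in for "assigned only from early
startup code before any C caller runs":

```c
#include <assert.h>

/* Hypothetical: set once by early startup code, never changed after. */
unsigned int __pgdir_shift_val = 48;

/*
 * __attribute__((const)) tells the compiler the result depends on
 * nothing that can change between calls, so repeated calls may be
 * folded into a single load. This is only safe because the value is
 * fixed before any C code that calls the accessor can run.
 */
static __attribute__((const)) unsigned int pgdir_shift(void)
{
	return __pgdir_shift_val;
}

static _Bool pgtable_l5_enabled(void)
{
	return pgdir_shift() == 48;
}
```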


end of thread, other threads:[~2025-05-20 19:49 UTC | newest]

Thread overview: 22+ messages
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
2025-05-20 10:59   ` Kirill A. Shutemov
2025-05-20 11:27     ` Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
2025-05-20 11:03   ` Kirill A. Shutemov
2025-05-20 11:28     ` Ard Biesheuvel
2025-05-20 14:35       ` Borislav Petkov
2025-05-20 17:03         ` Ard Biesheuvel
2025-05-20 17:38           ` Borislav Petkov
2025-05-20 17:46             ` Ard Biesheuvel
2025-05-20 18:01               ` Borislav Petkov
2025-05-20 18:28                 ` Linus Torvalds
2025-05-20 18:35                   ` Borislav Petkov
2025-05-20 19:49                   ` Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
2025-05-20 11:08   ` Kirill A. Shutemov
2025-05-20 11:29     ` Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift() Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch Ard Biesheuvel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox