* [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
2025-05-20 10:59 ` Kirill A. Shutemov
2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
` (5 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
As the Intel SDM states, MAXPHYADDR is up to 52 bits when running in
long mode, and this is independent from the number of levels of paging.
I.e., it is permitted for a 4-level hierarchy to use 52-bit output
addresses in the descriptors, both for next-level tables and for the
mappings themselves.
So set MAX_PHYSMEM_BITS to 52 in all cases for x86_64, and drop the
MAX_POSSIBLE_PHYSMEM_BITS definition, which becomes redundant as a
result.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/include/asm/pgtable_64_types.h | 2 --
arch/x86/include/asm/sparsemem.h | 2 +-
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 4604f924d8b8..1481b234465a 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -56,8 +56,6 @@ extern unsigned int ptrs_per_p4d;
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))
-#define MAX_POSSIBLE_PHYSMEM_BITS 52
-
/*
* 3rd level page
*/
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 3918c7a434f5..550b6d73ae22 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -26,7 +26,7 @@
# endif
#else /* CONFIG_X86_32 */
# define SECTION_SIZE_BITS 27 /* matt - 128 is convenient right now */
-# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
+# define MAX_PHYSMEM_BITS 52
#endif
#endif /* CONFIG_SPARSEMEM */
--
2.49.0.1101.gccaa498523-goog
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
@ 2025-05-20 10:59 ` Kirill A. Shutemov
2025-05-20 11:27 ` Ard Biesheuvel
0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-05-20 10:59 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst, Borislav Petkov
On Tue, May 20, 2025 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> As the Intel SDM states, MAXPHYADDR is up to 52 bits when running in
> long mode, and this is independent from the number of levels of paging.
> I.e., it is permitted for a 4-level hierarchy to use 52-bit output
> addresses in the descriptors, both for next-level tables and for the
> mappings themselves.
>
> So set MAX_PHYSMEM_BITS to 52 in all cases for x86_64, and drop the
> MAX_POSSIBLE_PHYSMEM_BITS definition, which becomes redundant as a
> result.
I think it will backfire.
We only have a 46-bit window in the memory layout if 4-level paging is
enabled. Currently, we truncate PA to whatever fits into 46 bits.
I expect to see weird failures if you try to boot with this patch in
4-level paging mode on a machine with > 64 TiB of memory.
If we want to go down this path, it might be useful to refuse to boot
altogether in 4-level paging mode if there's anything in the memory map
above 46 bits.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state
2025-05-20 10:59 ` Kirill A. Shutemov
@ 2025-05-20 11:27 ` Ard Biesheuvel
0 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 11:27 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst, Borislav Petkov
On Tue, 20 May 2025 at 12:59, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, May 20, 2025 at 12:41:40PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > As the Intel SDM states, MAXPHYADDR is up to 52 bits when running in
> > long mode, and this is independent from the number of levels of paging.
> > I.e., it is permitted for a 4-level hierarchy to use 52-bit output
> > addresses in the descriptors, both for next-level tables and for the
> > mappings themselves.
> >
> > So set MAX_PHYSMEM_BITS to 52 in all cases for x86_64, and drop the
> > MAX_POSSIBLE_PHYSMEM_BITS definition, which becomes redundant as a
> > result.
>
> I think it will backfire.
>
> We only have a 46-bit window in the memory layout if 4-level paging is
> enabled. Currently, we truncate PA to whatever fits into 46 bits.
>
This is the linear map, right?
I assumed that this affected MMIO mappings too, but it seems x86 does
not rely on MAX_PHYSMEM_BITS for that.
> I expect to see weird failures if you try to boot with this patch in
> 4-level paging mode on a machine with > 64 TiB of memory.
>
> If we want to go down this path, it might be useful to refuse to boot
> altogether in 4-level paging mode if there's anything in the memory map
> above 46 bits.
>
Agreed - if RAM does not fit, it makes no sense to limp on. I assumed
this limit applied to any physical address.
I'll withdraw the patch - it was just an unrelated thing I spotted, so
that shouldn't affect the rest of the series.
* [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
2025-05-20 11:03 ` Kirill A. Shutemov
2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
Shrink the global variable 'pgdir_shift' to a single byte, and move it
to the cache hot per-CPU variable section so that it can be accessed
virtually for free.
Set the variable from asm code before entering C so that C code is
guaranteed to always consistently observe the correct value.
For the decompressor, provide an alternative that tests the CR4.LA57
control bit directly - this is more costly in principle but this should
not matter for the early boot code, and the LA57 control bit is
guaranteed to be in sync with the number of paging levels in use.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 6 +++++-
arch/x86/boot/compressed/pgtable_64.c | 2 --
arch/x86/boot/startup/map_kernel.c | 1 -
arch/x86/include/asm/page_64_types.h | 2 +-
arch/x86/include/asm/pgtable_64_types.h | 13 +++++++++++--
arch/x86/kernel/head64.c | 2 --
arch/x86/kernel/head_64.S | 5 +++++
arch/x86/mm/pgtable.c | 4 ++++
8 files changed, 26 insertions(+), 9 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index db1048621ea2..97b80d08f03c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -28,6 +28,10 @@
#define __pa(x) ((unsigned long)(x))
#define __va(x) ((void *)((unsigned long)(x)))
+#ifdef CONFIG_X86_64
+#define pgdir_shift() (native_read_cr4() & X86_CR4_LA57 ? 48 : 39)
+#endif
+
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
@@ -189,7 +193,7 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif
/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
+extern unsigned int __pgtable_l5_enabled, ptrs_per_p4d;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);
/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index bdd26050dff7..898a4e66e401 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -12,7 +12,6 @@
/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
unsigned int __section(".data") ptrs_per_p4d = 1;
/* Buffer to preserve trampoline memory */
@@ -123,7 +122,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
/* Initialize variables for 5-level paging */
__pgtable_l5_enabled = 1;
- pgdir_shift = 48;
ptrs_per_p4d = 512;
}
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 332dbe6688c4..06306c5d016f 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -24,7 +24,6 @@ static inline bool check_la57_support(void)
return false;
__pgtable_l5_enabled = 1;
- pgdir_shift = 48;
ptrs_per_p4d = 512;
return true;
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 7400dab373fe..e1fccd9116c3 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -48,7 +48,7 @@
/* See Documentation/arch/x86/x86_64/mm.rst for a description of the memory map. */
#define __PHYSICAL_MASK_SHIFT 52
-#define __VIRTUAL_MASK_SHIFT (pgtable_l5_enabled() ? 56 : 47)
+#define __VIRTUAL_MASK_SHIFT (pgdir_shift() + 8)
#define TASK_SIZE_MAX task_size_max()
#define DEFAULT_MAP_WINDOW ((1UL << 47) - PAGE_SIZE)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 1481b234465a..3ee747f596e3 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -7,6 +7,7 @@
#ifndef __ASSEMBLER__
#include <linux/types.h>
#include <asm/kaslr.h>
+#include <asm/percpu.h>
/*
* These are used to make use of C type-checking..
@@ -23,6 +24,15 @@ typedef struct { pmdval_t pmd; } pmd_t;
extern unsigned int __pgtable_l5_enabled;
+#ifndef pgdir_shift
+DECLARE_PER_CPU_CACHE_HOT(u8, __pgdir_shift);
+
+static __always_inline __attribute_const__ u8 pgdir_shift(void)
+{
+ return this_cpu_read_stable(__pgdir_shift);
+}
+#endif /* pgdir_shift */
+
#ifdef USE_EARLY_PGTABLE_L5
/*
* cpu_feature_enabled() is not available in early boot code.
@@ -36,7 +46,6 @@ static inline bool pgtable_l5_enabled(void)
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
#endif /* USE_EARLY_PGTABLE_L5 */
-extern unsigned int pgdir_shift;
extern unsigned int ptrs_per_p4d;
#endif /* !__ASSEMBLER__ */
@@ -44,7 +53,7 @@ extern unsigned int ptrs_per_p4d;
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
-#define PGDIR_SHIFT pgdir_shift
+#define PGDIR_SHIFT pgdir_shift()
#define PTRS_PER_PGD 512
/*
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 533fcf5636fc..e2d9e709ec01 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -52,8 +52,6 @@ SYM_PIC_ALIAS(next_early_pgt);
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
unsigned int ptrs_per_p4d __ro_after_init = 1;
EXPORT_SYMBOL(ptrs_per_p4d);
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3e9b3a3bd039..d35b75a5f892 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -71,6 +71,11 @@ SYM_CODE_START_NOALIGN(startup_64)
xorl %edx, %edx
wrmsr
+ movq %cr4, %rax
+ btl $X86_CR4_LA57_BIT, %eax
+ jnc 0f
+ movb $48, PER_CPU_VAR(__pgdir_shift)
+0:
call startup_64_setup_gdt_idt
/* Now switch to __KERNEL_CS so IRET works reliably */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 62777ba4de1a..a7eaeaa595f8 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -46,6 +46,10 @@ void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
}
#if CONFIG_PGTABLE_LEVELS > 4
+DEFINE_PER_CPU_CACHE_HOT(u8, __pgdir_shift) = 39;
+EXPORT_SYMBOL_GPL(__pgdir_shift);
+SYM_PIC_ALIAS(__pgdir_shift);
+
void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
{
paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
--
2.49.0.1101.gccaa498523-goog
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
@ 2025-05-20 11:03 ` Kirill A. Shutemov
2025-05-20 11:28 ` Ard Biesheuvel
0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-05-20 11:03 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst, Borislav Petkov
On Tue, May 20, 2025 at 12:41:41PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Shrink the global variable 'pgdir_shift' to a single byte, and move it
> to the cache hot per-CPU variable section so that it can be accessed
> virtually for free.
Hm. I am not convinced about per-CPU being useful here. Read-only cache
lines can be shared between cores anyway.
Putting it to __ro_after_init should do the trick, no?
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 11:03 ` Kirill A. Shutemov
@ 2025-05-20 11:28 ` Ard Biesheuvel
2025-05-20 14:35 ` Borislav Petkov
0 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 11:28 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst, Borislav Petkov
On Tue, 20 May 2025 at 13:03, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, May 20, 2025 at 12:41:41PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Shrink the global variable 'pgdir_shift' to a single byte, and move it
> > to the cache hot per-CPU variable section so that it can be accessed
> > virtually for free.
>
> Hm. I am not convinced about per-CPU being useful here. Read-only cache
> lines can be shared between cores anyway.
>
> Putting it to __ro_after_init should do the trick, no?
>
Are you saying we shouldn't optimize prematurely? You must be new here :-)
But seriously, whatever works is fine with me, as long as the
observable return value of pgtable_l5_enabled() never changes.
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 11:28 ` Ard Biesheuvel
@ 2025-05-20 14:35 ` Borislav Petkov
2025-05-20 17:03 ` Ard Biesheuvel
0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 14:35 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Ingo Molnar, Linus Torvalds, Brian Gerst
On Tue, May 20, 2025 at 01:28:52PM +0200, Ard Biesheuvel wrote:
> Are you saying we shouldn't optimize prematurely? You must be new here :-)
Right, but do you see __pgdir_shift on a seriously hot path to warrant this?
I'd expect this to happen the other way around, really: "Hey, pgdir_shift is
getting accessed a lot, let's stick it somewhere where we can access it
quickly..."
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 14:35 ` Borislav Petkov
@ 2025-05-20 17:03 ` Ard Biesheuvel
2025-05-20 17:38 ` Borislav Petkov
0 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 17:03 UTC (permalink / raw)
To: Borislav Petkov
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Ingo Molnar, Linus Torvalds, Brian Gerst
On Tue, 20 May 2025 at 16:35, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, May 20, 2025 at 01:28:52PM +0200, Ard Biesheuvel wrote:
> > Are you saying we shouldn't optimize prematurely? You must be new here :-)
>
> Right, but do you see __pgdir_shift on a seriously hot path to warrant this?
>
No. But if you had read the next couple of patches, you would have
noticed that PGDIR_SHIFT, PTRS_PER_P4D and pgtable_l5_enabled() will
all be derived from this variable, and the latter currently uses code
patching (in cpu_feature_enabled())
This is also explained in the cover letter btw
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 17:03 ` Ard Biesheuvel
@ 2025-05-20 17:38 ` Borislav Petkov
2025-05-20 17:46 ` Ard Biesheuvel
0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 17:38 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Ingo Molnar, Linus Torvalds, Brian Gerst
On Tue, May 20, 2025 at 07:03:37PM +0200, Ard Biesheuvel wrote:
> No. But if you had read the next couple of patches, you would have
> noticed that PGDIR_SHIFT, PTRS_PER_P4D and pgtable_l5_enabled() will
> all be derived from this variable, and the latter currently uses code
> patching (in cpu_feature_enabled())
>
> This is also explained in the cover letter btw
Yes, I saw that.
The question remains: are the *users* - PGDIR_SHIFT, etc - on some hot path
which I'm not seeing?
For example pgd_index() is called in a bunch of places and I guess that adds
up. But without measuring that, we won't know for sure.
Looking at an example:
# ./arch/x86/include/asm/pgtable_64_types.h:32: return this_cpu_read_stable(__pgdir_shift);
#APP
# 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
movb %gs:__pgdir_shift(%rip), %cl #, pfo_val__
# 0 "" 2
...
movq 40(%rdi), %rax # ppd_20(D)->vaddr, tmp128
shrq %cl, %rax # pfo_val__, tmp128
# arch/x86/boot/startup/sme.c:105: pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
movq 8(%rdi), %rcx # ppd_20(D)->pgd, ppd_20(D)->pgd
# arch/x86/boot/startup/sme.c:105: pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
andl $511, %eax #, tmp130
# arch/x86/boot/startup/sme.c:105: pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
leaq (%rcx,%rax,8), %rsi #, pgd_p
that looks like two insns to me: the RIP-relative mov to %cl and then the
shift.
If you use a "normal" variable, that would be also two insns, no?
Or am I way off?
Because if not, the percpu thing doesn't buy you anything...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 17:38 ` Borislav Petkov
@ 2025-05-20 17:46 ` Ard Biesheuvel
2025-05-20 18:01 ` Borislav Petkov
0 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 17:46 UTC (permalink / raw)
To: Borislav Petkov
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Ingo Molnar, Linus Torvalds, Brian Gerst
On Tue, 20 May 2025 at 19:38, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, May 20, 2025 at 07:03:37PM +0200, Ard Biesheuvel wrote:
> > No. But if you had read the next couple of patches, you would have
> > noticed that PGDIR_SHIFT, PTRS_PER_P4D and pgtable_l5_enabled() will
> > all be derived from this variable, and the latter currently uses code
> > patching (in cpu_feature_enabled())
> >
> > This is also explained in the cover letter btw
>
> Yes, I saw that.
>
> The question remains: are the *users* - PGDIR_SHIFT, etc - on some hot path
> which I'm not seeing?
>
> For example pgd_index() is called in a bunch of places and I guess that adds
> up. But without measuring that, we won't know for sure.
>
Look at pgtable_l5_enabled() please, that is the important one.
> Looking at an example:
>
> # ./arch/x86/include/asm/pgtable_64_types.h:32: return this_cpu_read_stable(__pgdir_shift);
> #APP
> # 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
> movb %gs:__pgdir_shift(%rip), %cl #, pfo_val__
> # 0 "" 2
>
> ...
>
> movq 40(%rdi), %rax # ppd_20(D)->vaddr, tmp128
> shrq %cl, %rax # pfo_val__, tmp128
> # arch/x86/boot/startup/sme.c:105: pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
> movq 8(%rdi), %rcx # ppd_20(D)->pgd, ppd_20(D)->pgd
> # arch/x86/boot/startup/sme.c:105: pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
> andl $511, %eax #, tmp130
> # arch/x86/boot/startup/sme.c:105: pgd_p = ppd->pgd + pgd_index(ppd->vaddr);
> leaq (%rcx,%rax,8), %rsi #, pgd_p
>
> that looks like two insns to me: the RIP-relative mov to %cl and then the
> shift.
>
> If you use a "normal" variable, that would be also two insns, no?
>
> Or am I way off?
>
> Because if not, the percpu thing doesn't buy you anything...
>
The variable access is identical in terms of instructions; the only
difference is the %gs offset being applied, and the fact that using
cache hot data is guaranteed not to increase the number of cachelines
covering the working set of any existing workload (the region is
bounded to a fixed number of cachelines).
Happy to keep this as a simple __ro_after_init variable if there is
consensus between the tip maintainers that we don't need this perf
advantage, or we can use some kind of code patching that is not CPU
feature based or ternary alternative based to short circuit these
things (and pgtable_l5_enabled() in particular) after boot.
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 17:46 ` Ard Biesheuvel
@ 2025-05-20 18:01 ` Borislav Petkov
2025-05-20 18:28 ` Linus Torvalds
0 siblings, 1 reply; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 18:01 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Ingo Molnar, Linus Torvalds, Brian Gerst
On Tue, May 20, 2025 at 07:46:33PM +0200, Ard Biesheuvel wrote:
> Look at pgtable_l5_enabled() please, that is the important one.
OMG. :-)
# 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
movb %gs:__pgdir_shift(%rip), %al #, pfo_val__
# 0 "" 2
# ./arch/x86/include/asm/pgtable.h:1178: if (!pgtable_l5_enabled())
#NO_APP
testb $1, %al #, pfo_val__
I hadn't seen this fun yet:
!(pgdir_shift() & 1)
> The variable access is identical in terms of instructions, the only
> difference is the %gs offset being applied, and the fact that using
> cache hot data is guaranteed not to increase the number of cachelines
> covering the working set of any existing workload (the region is
> bounded to a fixed number of cachelines)
Yes, but look at all the callers. I hardly see any serious hot paths to care
about the %gs offset being applied.
And I'll be really surprised if that *shows* in any sensible benchmark...
Also, it doesn't make any sense for a *global* system property - what paging
level the machine uses - to be in a *per-CPU* variable. That's a global value
which is the same everywhere. So why waste space?
> Happy to keep this as a simple __ro_after_init variable if there is
> consensus between the tip maintainers that we don't need this perf
Yeah, IMO, no need. But let's see what the others say.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 18:01 ` Borislav Petkov
@ 2025-05-20 18:28 ` Linus Torvalds
2025-05-20 18:35 ` Borislav Petkov
2025-05-20 19:49 ` Ard Biesheuvel
0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2025-05-20 18:28 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ard Biesheuvel, Kirill A. Shutemov, Ard Biesheuvel, linux-kernel,
x86, Ingo Molnar, Brian Gerst
On Tue, 20 May 2025 at 11:01, Borislav Petkov <bp@alien8.de> wrote:
>
> OMG. :-)
>
> # 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
> movb %gs:__pgdir_shift(%rip), %al #, pfo_val__
> # 0 "" 2
> # ./arch/x86/include/asm/pgtable.h:1178: if (!pgtable_l5_enabled())
> #NO_APP
> testb $1, %al #, pfo_val__
That's garbage.
Gcc should be able to turn it into just a
testb $1,%gs:__pgdir_shift(%rip)
What happens if pgtable_l5_enabled() is made to use __raw_cpu_read()?
With a compiler that is new enough to support USE_X86_SEG_SUPPORT?
Oh, and it looks like we messed up __raw_cpu_read_stable(), because
that *always* uses the inline asm, so it doesn't allow the compiler to
just DTRT and do things like the above.
Linus
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 18:28 ` Linus Torvalds
@ 2025-05-20 18:35 ` Borislav Petkov
2025-05-20 19:49 ` Ard Biesheuvel
1 sibling, 0 replies; 22+ messages in thread
From: Borislav Petkov @ 2025-05-20 18:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ard Biesheuvel, Kirill A. Shutemov, Ard Biesheuvel, linux-kernel,
x86, Ingo Molnar, Brian Gerst
On Tue, May 20, 2025 at 11:28:28AM -0700, Linus Torvalds wrote:
> What happens if pgtable_l5_enabled() is made to use __raw_cpu_read()?
> With a compiler that is new enough to support USE_X86_SEG_SUPPORT?
Before we go play with this more, I see the same thing with gcc-13 and gcc-14
on debian.
Those should be fairly new, I'd think...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift
2025-05-20 18:28 ` Linus Torvalds
2025-05-20 18:35 ` Borislav Petkov
@ 2025-05-20 19:49 ` Ard Biesheuvel
1 sibling, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 19:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Borislav Petkov, Kirill A. Shutemov, Ard Biesheuvel, linux-kernel,
x86, Ingo Molnar, Brian Gerst
On Tue, 20 May 2025 at 20:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 20 May 2025 at 11:01, Borislav Petkov <bp@alien8.de> wrote:
> >
> > OMG. :-)
> >
> > # 32 "./arch/x86/include/asm/pgtable_64_types.h" 1
> > movb %gs:__pgdir_shift(%rip), %al #, pfo_val__
> > # 0 "" 2
> > # ./arch/x86/include/asm/pgtable.h:1178: if (!pgtable_l5_enabled())
> > #NO_APP
> > testb $1, %al #, pfo_val__
>
> That's garbage.
>
> Gcc should be able to turn it into just a
>
> testb $1,%gs:__pgdir_shift(%rip)
>
> What happens if pgtable_l5_enabled() is made to use __raw_cpu_read()?
> With a compiler that is new enough to support USE_X86_SEG_SUPPORT?
>
> Oh, and it looks like we messed up __raw_cpu_read_stable(), because
> that *always* uses the inline asm, so it doesn't allow the compiler to
> just DTRT and do things like the above.
>
I suppose that is what this_cpu_read_const() is for, but that
requires a const alias declaration and an entry in the linker script.
If we decide to go with a per-cpu var, I can add this
USE_X86_SEG_SUPPORT codepath like we have in other places - that
should give the compiler sufficient insight into the fact that this
value never changes, and generate optimal code as a result.
For an ordinary variable, I suppose we can still declare it as const,
__attribute__((const)) etc if it is defined in and set from the asm
startup code, resulting in the same potential for optimization
(without alternatives or runtime const hacks).
* [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 1/7] x86/mm: Decouple MAX_PHYSMEM_BITS from LA57 state Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 2/7] x86/mm: Use a single cache hot per-CPU variable to record pgdir_shift Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
2025-05-20 11:08 ` Kirill A. Shutemov
2025-05-20 10:41 ` [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift() Ard Biesheuvel
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
Define the value of PTRS_PER_P4D in terms of pgdir_shift(), which can be
accessed cheaply, removing the need for a global variable that needs to
be kept in sync manually.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 2 +-
arch/x86/boot/compressed/pgtable_64.c | 2 --
arch/x86/boot/startup/map_kernel.c | 1 -
arch/x86/include/asm/pgtable_64_types.h | 4 +---
arch/x86/kernel/head64.c | 2 --
5 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 97b80d08f03c..3157f2fbc593 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -193,7 +193,7 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif
/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, ptrs_per_p4d;
+extern unsigned int __pgtable_l5_enabled;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);
/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 898a4e66e401..965fca150e68 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -12,7 +12,6 @@
/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") ptrs_per_p4d = 1;
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
@@ -122,7 +121,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
/* Initialize variables for 5-level paging */
__pgtable_l5_enabled = 1;
- ptrs_per_p4d = 512;
}
/*
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 06306c5d016f..5d3c6108f1c3 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -24,7 +24,6 @@ static inline bool check_la57_support(void)
return false;
__pgtable_l5_enabled = 1;
- ptrs_per_p4d = 512;
return true;
}
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 3ee747f596e3..d8e39c479387 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -46,8 +46,6 @@ static inline bool pgtable_l5_enabled(void)
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
#endif /* USE_EARLY_PGTABLE_L5 */
-extern unsigned int ptrs_per_p4d;
-
#endif /* !__ASSEMBLER__ */
/*
@@ -61,7 +59,7 @@ extern unsigned int ptrs_per_p4d;
*/
#define P4D_SHIFT 39
#define MAX_PTRS_PER_P4D 512
-#define PTRS_PER_P4D ptrs_per_p4d
+#define PTRS_PER_P4D (pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D)
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index e2d9e709ec01..fe0770f468c3 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -52,8 +52,6 @@ SYM_PIC_ALIAS(next_early_pgt);
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
--
2.49.0.1101.gccaa498523-goog
^ permalink raw reply related [flat|nested] 22+ messages in thread

* Re: [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
@ 2025-05-20 11:08 ` Kirill A. Shutemov
2025-05-20 11:29 ` Ard Biesheuvel
0 siblings, 1 reply; 22+ messages in thread
From: Kirill A. Shutemov @ 2025-05-20 11:08 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst, Borislav Petkov
On Tue, May 20, 2025 at 12:41:42PM +0200, Ard Biesheuvel wrote:
> @@ -61,7 +59,7 @@ extern unsigned int ptrs_per_p4d;
> */
> #define P4D_SHIFT 39
> #define MAX_PTRS_PER_P4D 512
> -#define PTRS_PER_P4D ptrs_per_p4d
> +#define PTRS_PER_P4D (pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D)
"& 1" is a hack and at least deserves a comment.
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift()
2025-05-20 11:08 ` Kirill A. Shutemov
@ 2025-05-20 11:29 ` Ard Biesheuvel
0 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 11:29 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst, Borislav Petkov
On Tue, 20 May 2025 at 13:08, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, May 20, 2025 at 12:41:42PM +0200, Ard Biesheuvel wrote:
> > @@ -61,7 +59,7 @@ extern unsigned int ptrs_per_p4d;
> > */
> > #define P4D_SHIFT 39
> > #define MAX_PTRS_PER_P4D 512
> > -#define PTRS_PER_P4D ptrs_per_p4d
> > +#define PTRS_PER_P4D (pgdir_shift() & 1 ?: MAX_PTRS_PER_P4D)
>
> "& 1" is a hack and at least deserves a comment.
>
Fair enough.
* [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift()
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (2 preceding siblings ...)
2025-05-20 10:41 ` [PATCH v5 3/7] x86/mm: Define PTRS_PER_P4D in terms of pgdir_shift() Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions Ard Biesheuvel
` (2 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
Instead of using two versions of pgtable_l5_enabled(), which go out of
sync a few times during boot, use a single implementation based on
pgdir_shift(), which can be accessed cheaply when running in the core
kernel.
This is the most efficient way to obtain this information without
relying on code patching, on which the 'late' implementation depends
today (via the ternary alternative in cpu_feature_enabled()).
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/include/asm/pgtable_64_types.h | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index d8e39c479387..ed847a90cf4f 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -33,18 +33,8 @@ static __always_inline __attribute_const__ u8 pgdir_shift(void)
}
#endif /* pgdir_shift */
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
-{
- return __pgtable_l5_enabled;
-}
-#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
-#endif /* USE_EARLY_PGTABLE_L5 */
+/* PGDIR shift == 39 -> L5 disabled, 48 -> L5 enabled */
+#define pgtable_l5_enabled() !(pgdir_shift() & 1)
#endif /* !__ASSEMBLER__ */
--
2.49.0.1101.gccaa498523-goog
* [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (3 preceding siblings ...)
2025-05-20 10:41 ` [PATCH v5 4/7] x86/mm: Derive pgtable_l5_enabled() from pgdir_shift() Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch Ard Biesheuvel
6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
Drop USE_EARLY_PGTABLE_L5 definitions, which no longer have any effect.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 3 ---
arch/x86/boot/startup/sme.c | 9 ---------
arch/x86/kernel/cpu/common.c | 3 ---
arch/x86/kernel/head64.c | 3 ---
arch/x86/mm/kasan_init_64.c | 3 ---
5 files changed, 21 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 3157f2fbc593..7f79c426f542 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@
#define __NO_FORTIFY
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
/*
* Boot stub deals with identity mappings, physical and virtual addresses are
* the same, so override these defines.
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 70ea1748c0a7..a6c25d005991 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -25,15 +25,6 @@
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8feb8fd2957a..256613b24bd4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,7 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/linkage.h>
#include <linux/bitops.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index fe0770f468c3..5e5da6574a83 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -5,9 +5,6 @@
* Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
*/
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0539efd0d216..7c4fafbd52cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -1,9 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#define pr_fmt(fmt) "kasan: " fmt
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/kasan.h>
#include <linux/kdebug.h>
--
2.49.0.1101.gccaa498523-goog
* [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (4 preceding siblings ...)
2025-05-20 10:41 ` [PATCH v5 5/7] x86/boot: Drop USE_EARLY_PGTABLE_L5 definitions Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
2025-05-20 10:41 ` [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch Ard Biesheuvel
6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
The variable __pgtable_l5_enabled is no longer used, so it can be
dropped.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 1 -
arch/x86/boot/compressed/pgtable_64.c | 6 ------
arch/x86/boot/startup/map_kernel.c | 16 +---------------
arch/x86/include/asm/pgtable_64_types.h | 2 --
arch/x86/kernel/head64.c | 2 --
5 files changed, 1 insertion(+), 26 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 7f79c426f542..d6860b50afaf 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -190,7 +190,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif
/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);
/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 965fca150e68..4772919411a9 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -10,9 +10,6 @@
#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
@@ -118,9 +115,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
if (!cmdline_find_option_bool("no5lvl") &&
native_cpuid_eax(0) >= 7 && (native_cpuid_ecx(7) & BIT(16))) {
l5_required = true;
-
- /* Initialize variables for 5-level paging */
- __pgtable_l5_enabled = 1;
}
/*
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 5d3c6108f1c3..328a343bc355 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -14,20 +14,6 @@
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
extern unsigned int next_early_pgt;
-static inline bool check_la57_support(void)
-{
- /*
- * 5-level paging is detected and enabled at kernel decompression
- * stage. Only check if it has been enabled there.
- */
- if (!(native_read_cr4() & X86_CR4_LA57))
- return false;
-
- __pgtable_l5_enabled = 1;
-
- return true;
-}
-
static unsigned long __head sme_postprocess_startup(struct boot_params *bp,
pmdval_t *pmd,
unsigned long p2v_offset)
@@ -97,7 +83,7 @@ unsigned long __head __startup_64(unsigned long p2v_offset,
bool la57;
int i;
- la57 = check_la57_support();
+ la57 = pgtable_l5_enabled();
/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index ed847a90cf4f..3c56b98b87b0 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -22,8 +22,6 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;
-extern unsigned int __pgtable_l5_enabled;
-
#ifndef pgdir_shift
DECLARE_PER_CPU_CACHE_HOT(u8, __pgdir_shift);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 5e5da6574a83..137c93498601 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -48,8 +48,6 @@ unsigned int __initdata next_early_pgt;
SYM_PIC_ALIAS(next_early_pgt);
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
-unsigned int __pgtable_l5_enabled __ro_after_init;
-
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
--
2.49.0.1101.gccaa498523-goog
* [PATCH v5 7/7] x86/boot: Remove KASAN workaround for 4/5 level paging switch
2025-05-20 10:41 [PATCH v5 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (5 preceding siblings ...)
2025-05-20 10:41 ` [PATCH v5 6/7] x86/boot: Drop 5-level paging related global variable Ard Biesheuvel
@ 2025-05-20 10:41 ` Ard Biesheuvel
6 siblings, 0 replies; 22+ messages in thread
From: Ard Biesheuvel @ 2025-05-20 10:41 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst,
Kirill A. Shutemov, Borislav Petkov
From: Ard Biesheuvel <ardb@kernel.org>
pgtable_l5_enabled() is no longer based on a CPU feature bit whose value
might change mid-patching and confuse KASAN, so KASAN no longer needs to
be disabled temporarily while alternatives are applied.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/kernel/alternative.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f3c68b586a95..e39823d8d1ae 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -589,16 +589,6 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
DPRINTK(ALT, "alt table %px, -> %px", start, end);
- /*
- * KASAN_SHADOW_START is defined using
- * cpu_feature_enabled(X86_FEATURE_LA57) and is therefore patched here.
- * During the process, KASAN becomes confused seeing partial LA57
- * conversion and triggers a false-positive out-of-bound report.
- *
- * Disable KASAN until the patching is complete.
- */
- kasan_disable_current();
-
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite previously scanned alternative code.
@@ -666,8 +656,6 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
text_poke_early(instr, insn_buff, insn_buff_sz);
}
-
- kasan_enable_current();
}
static inline bool is_jcc32(struct insn *insn)
--
2.49.0.1101.gccaa498523-goog