* [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled()
@ 2025-05-14 10:42 Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
` (6 more replies)
0 siblings, 7 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
This is a follow-up to the discussion at [0], broken out of that series
so we can progress while the SEV changes are being reviewed and tested.
pgtable_l5_enabled() is currently problematic because it has two
implementations, and source files need to opt into the correct one if they
contain code that might be called very early.
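Concretely, the two definitions that currently coexist in
pgtable_64_types.h (visible in the hunk removed by patch 5) are:

#ifdef USE_EARLY_PGTABLE_L5
/* cpu_feature_enabled() is not available in early boot code. */
static inline bool pgtable_l5_enabled(void)
{
    return __pgtable_l5_enabled;
}
#else
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
#endif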
Other related global pseudo-constants assume different values depending on
the number of paging levels, and it is hard to reason about whether all
memory mapping and page table code is guaranteed to observe consistent
values of all of them at all times during boot.
Case in point: currently, KASAN needs to be disabled during alternatives
patching because otherwise it will reliably produce false-positive reports
due to such inconsistencies.
This series drops the early variant entirely, and makes the existing late
variant, which is based on cpu_feature_enabled(), work as expected in
all cases by tweaking the CPU capability code so that it permits setting
the 5-level paging capability from assembler before calling the C
entrypoint of the core kernel.
Runtime constants were considered for PGDIR_SHIFT and PTRS_PER_P4D but
were found unsuitable as they do not support loadable modules, and so
they are replaced with expressions based on pgtable_l5_enabled(). Earlier
patching of alternatives based on CPU capabilities may be feasible, but
whether or not this improves performance is TBD. In any case, doing so
from the startup code is unlikely to be worth the added complexity.
Build and boot tested using QEMU with LA57 emulation.
Changes since v2:
- Drop first patch which has been merged
- Rename existing "la57" CPU flag to "la57_hw" and use "la57" to
indicate that 5 level paging is being used
- Move memset() out of identify_cpu()
- Make set/clear cap override arrays ro_after_init
- Split off asm-offsets update
[0] https://lore.kernel.org/all/20250504095230.2932860-28-ardb+git@google.com/
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Ard Biesheuvel (7):
x86/cpu: Use a new feature flag for 5 level paging
x86/cpu: Allow caps to be set arbitrarily early
x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use
x86/boot: Set 5-level paging CPU cap before entering C code
x86/boot: Drop the early variant of pgtable_l5_enabled()
x86/boot: Drop 5-level paging related variables and early updates
x86/cpu: Make CPU capability overrides __ro_after_init
arch/x86/boot/compressed/misc.h | 8 +++---
arch/x86/boot/compressed/pgtable_64.c | 12 ---------
arch/x86/boot/startup/map_kernel.c | 21 +---------------
arch/x86/boot/startup/sme.c | 9 -------
arch/x86/include/asm/cpufeature.h | 12 ++++++---
arch/x86/include/asm/cpufeatures.h | 3 ++-
arch/x86/include/asm/page_64.h | 2 +-
arch/x86/include/asm/pgtable_64_types.h | 25 ++++---------------
arch/x86/kernel/alternative.c | 12 ---------
arch/x86/kernel/asm-offsets.c | 8 ++++++
arch/x86/kernel/asm-offsets_32.c | 9 -------
arch/x86/kernel/cpu/common.c | 26 +++-----------------
arch/x86/kernel/head64.c | 11 ---------
arch/x86/kernel/head_64.S | 15 +++++++++++
arch/x86/kvm/x86.h | 4 +--
arch/x86/mm/kasan_init_64.c | 3 ---
drivers/iommu/amd/init.c | 4 +--
drivers/iommu/intel/svm.c | 4 +--
tools/testing/selftests/kvm/x86/set_sregs_test.c | 2 +-
19 files changed, 55 insertions(+), 135 deletions(-)
base-commit: 64797551baec252f953fa8234051f88b0c368ed5
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
2025-05-15 7:06 ` Ingo Molnar
` (3 more replies)
2025-05-14 10:42 ` [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early Ard Biesheuvel
` (5 subsequent siblings)
6 siblings, 4 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
Currently, the LA57 CPU feature flag is taken to mean two different
things at once:
- whether the CPU implements the LA57 extension, and is therefore
capable of supporting 5 level paging;
- whether 5 level paging is currently in use.
This means the LA57 capability of the hardware is hidden when a LA57
capable CPU is forced to run with 4 levels of paging. It also means that
the ordinary CPU capability detection code will happily set the LA57
capability and it needs to be cleared explicitly afterwards to avoid
inconsistencies.
Separate the two so that the CPU hardware capability can be identified
unambiguously in all cases.
To avoid breaking existing users that might assume that 5 level paging
is being used when the "la57" string is visible in /proc/cpuinfo,
repurpose that string to mean that 5-level paging is in use, and add a
new string la57_capable to indicate that the CPU feature is implemented
by the hardware.
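To make the user-visible effect concrete: a small stand-alone checker of
the two /proc/cpuinfo flags, purely illustrative (flag names taken from the
cpufeatures.h hunk below, which uses "la57_hw" for the hardware capability
and "la57" for 5-level paging being in use):

#include <stdio.h>
#include <string.h>

/* Look for a flag token on the first "flags" line of /proc/cpuinfo. */
static int has_flag(const char *flag)
{
    char line[4096];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int found = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "flags", 5))
            continue;
        for (char *tok = strtok(line, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
            if (!strcmp(tok, flag))
                found = 1;
        break;
    }
    fclose(f);
    return found;
}

int main(void)
{
    printf("la57_hw (CPU implements LA57):   %d\n", has_flag("la57_hw"));
    printf("la57    (5-level paging in use): %d\n", has_flag("la57"));
    return 0;
}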
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/include/asm/cpufeatures.h | 3 ++-
arch/x86/include/asm/page_64.h | 2 +-
arch/x86/include/asm/pgtable_64_types.h | 2 +-
arch/x86/kernel/cpu/common.c | 16 ++--------------
arch/x86/kvm/x86.h | 4 ++--
drivers/iommu/amd/init.c | 4 ++--
drivers/iommu/intel/svm.c | 4 ++--
tools/testing/selftests/kvm/x86/set_sregs_test.c | 2 +-
8 files changed, 13 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index f67a93fc9391..d59bee5907e7 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -395,7 +395,7 @@
#define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
#define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
#define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
-#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
+#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
#define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
#define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
#define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
@@ -483,6 +483,7 @@
#define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
#define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
#define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
+#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
/*
* BUG word(s)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index d3aab6f4e59a..acfa61ad0725 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -86,7 +86,7 @@ static __always_inline unsigned long task_size_max(void)
unsigned long ret;
alternative_io("movq %[small],%0","movq %[large],%0",
- X86_FEATURE_LA57,
+ X86_FEATURE_5LEVEL_PAGING,
"=r" (ret),
[small] "i" ((1ul << 47)-PAGE_SIZE),
[large] "i" ((1ul << 56)-PAGE_SIZE));
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index e83721db18c9..d31ae12663bb 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -34,7 +34,7 @@ static inline bool pgtable_l5_enabled(void)
return __pgtable_l5_enabled;
}
#else
-#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
+#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING)
#endif /* USE_EARLY_PGTABLE_L5 */
#else
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 114aaaf6ae8a..6f7827015834 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1755,20 +1755,8 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
setup_clear_cpu_cap(X86_FEATURE_PCID);
#endif
- /*
- * Later in the boot process pgtable_l5_enabled() relies on
- * cpu_feature_enabled(X86_FEATURE_LA57). If 5-level paging is not
- * enabled by this point we need to clear the feature bit to avoid
- * false-positives at the later stage.
- *
- * pgtable_l5_enabled() can be false here for several reasons:
- * - 5-level paging is disabled compile-time;
- * - it's 32-bit kernel;
- * - machine doesn't support 5-level paging;
- * - user specified 'no5lvl' in kernel command line.
- */
- if (!pgtable_l5_enabled())
- setup_clear_cpu_cap(X86_FEATURE_LA57);
+ if (IS_ENABLED(CONFIG_X86_5LEVEL) && (native_read_cr4() & X86_CR4_LA57))
+ setup_force_cpu_cap(X86_FEATURE_5LEVEL_PAGING);
detect_nopl();
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9dc32a409076..d2c093f17ae5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -243,7 +243,7 @@ static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
static inline u8 max_host_virt_addr_bits(void)
{
- return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
+ return kvm_cpu_cap_has(X86_FEATURE_5LEVEL_PAGING) ? 57 : 48;
}
/*
@@ -603,7 +603,7 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
__reserved_bits |= X86_CR4_FSGSBASE; \
if (!__cpu_has(__c, X86_FEATURE_PKU)) \
__reserved_bits |= X86_CR4_PKE; \
- if (!__cpu_has(__c, X86_FEATURE_LA57)) \
+ if (!__cpu_has(__c, X86_FEATURE_5LEVEL_PAGING)) \
__reserved_bits |= X86_CR4_LA57; \
if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
__reserved_bits |= X86_CR4_UMIP; \
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 14aa0d77df26..083fca8f8b97 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -3084,7 +3084,7 @@ static int __init early_amd_iommu_init(void)
goto out;
/* 5 level guest page table */
- if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+ if (cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING) &&
FIELD_GET(FEATURE_GATS, amd_iommu_efr) == GUEST_PGTABLE_5_LEVEL)
amd_iommu_gpt_level = PAGE_MODE_5_LEVEL;
@@ -3691,7 +3691,7 @@ __setup("ivrs_acpihid", parse_ivrs_acpihid);
bool amd_iommu_pasid_supported(void)
{
/* CPU page table size should match IOMMU guest page table size */
- if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+ if (cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING) &&
amd_iommu_gpt_level != PAGE_MODE_5_LEVEL)
return false;
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index ba93123cb4eb..1f615e6d06ec 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -37,7 +37,7 @@ void intel_svm_check(struct intel_iommu *iommu)
return;
}
- if (cpu_feature_enabled(X86_FEATURE_LA57) &&
+ if (cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING) &&
!cap_fl5lp_support(iommu->cap)) {
pr_err("%s SVM disabled, incompatible paging mode\n",
iommu->name);
@@ -165,7 +165,7 @@ static int intel_svm_set_dev_pasid(struct iommu_domain *domain,
return PTR_ERR(dev_pasid);
/* Setup the pasid table: */
- sflags = cpu_feature_enabled(X86_FEATURE_LA57) ? PASID_FLAG_FL5LP : 0;
+ sflags = cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING) ? PASID_FLAG_FL5LP : 0;
ret = __domain_setup_first_level(iommu, dev, pasid,
FLPT_DEFAULT_DID, mm->pgd,
sflags, old);
diff --git a/tools/testing/selftests/kvm/x86/set_sregs_test.c b/tools/testing/selftests/kvm/x86/set_sregs_test.c
index f4095a3d1278..de78665fa675 100644
--- a/tools/testing/selftests/kvm/x86/set_sregs_test.c
+++ b/tools/testing/selftests/kvm/x86/set_sregs_test.c
@@ -52,7 +52,7 @@ static uint64_t calc_supported_cr4_feature_bits(void)
if (kvm_cpu_has(X86_FEATURE_UMIP))
cr4 |= X86_CR4_UMIP;
- if (kvm_cpu_has(X86_FEATURE_LA57))
+ if (kvm_cpu_has(X86_FEATURE_5LEVEL_PAGING))
cr4 |= X86_CR4_LA57;
if (kvm_cpu_has(X86_FEATURE_VMX))
cr4 |= X86_CR4_VMXE;
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
2025-05-15 6:56 ` Ingo Molnar
2025-05-14 10:42 ` [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use Ard Biesheuvel
` (4 subsequent siblings)
6 siblings, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
cpu_feature_enabled() uses a ternary alternative, where the late variant
is based on code patching and the early variant accesses the capability
field in boot_cpu_data directly.
This allows cpu_feature_enabled() to be called quite early, but it still
requires that the CPU feature detection code runs before being able to
rely on the return value of cpu_feature_enabled().
This is a problem for the implementation of pgtable_l5_enabled(), which
is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
called extremely early. Currently, there is a hacky workaround where
some source files that may execute before (but also after) CPU feature
detection have a different version of pgtable_l5_enabled(), based on the
USE_EARLY_PGTABLE_L5 preprocessor macro.
Instead, let's make it possible to set CPU features arbitrarily early, so
that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
entering C code.
This involves relying on static initialization of boot_cpu_data and the
cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
.data. This ensures that they won't be cleared along with the rest of
BSS.
Note that forcing a capability involves setting it in both
boot_cpu_data.x86_capability[] and cpu_caps_set[].
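Schematically, forcing a capability amounts to the following dual write (a
minimal user-space sketch; the NCAPINTS value is assumed for illustration,
the feature number is the one introduced in the previous patch):

#include <stdio.h>
#include <stdint.h>

#define NCAPINTS    22            /* illustrative word count */
#define X86_FEATURE_5LEVEL_PAGING (21*32 + 11)

/* User-space analogues of the two arrays a forced capability must land in. */
static uint32_t x86_capability[NCAPINTS];    /* boot_cpu_data.x86_capability[] */
static uint32_t cpu_caps_set[NCAPINTS];      /* survives later re-detection    */

static void force_cpu_cap(unsigned int cap)
{
    x86_capability[cap / 32] |= 1u << (cap % 32);
    cpu_caps_set[cap / 32]   |= 1u << (cap % 32);
}

int main(void)
{
    force_cpu_cap(X86_FEATURE_5LEVEL_PAGING);
    printf("word 21 is now 0x%08x\n", (unsigned)x86_capability[21]);
    return 0;
}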
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/kernel/cpu/common.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6f7827015834..f6f206743d6a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
}
/* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
-__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
-__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
+__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
+__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
#ifdef CONFIG_X86_32
/* The 32-bit entry code needs to find cpu_entry_area. */
@@ -1708,9 +1708,6 @@ static void __init cpu_parse_early_param(void)
*/
static void __init early_identify_cpu(struct cpuinfo_x86 *c)
{
- memset(&c->x86_capability, 0, sizeof(c->x86_capability));
- c->extended_cpuid_level = 0;
-
if (!have_cpuid_p())
identify_cpu_without_cpuid(c);
@@ -1922,7 +1919,6 @@ static void identify_cpu(struct cpuinfo_x86 *c)
c->x86_virt_bits = 32;
#endif
c->x86_cache_alignment = c->x86_clflush_size;
- memset(&c->x86_capability, 0, sizeof(c->x86_capability));
#ifdef CONFIG_X86_VMX_FEATURE_NAMES
memset(&c->vmx_capability, 0, sizeof(c->vmx_capability));
#endif
@@ -2084,6 +2080,7 @@ void identify_secondary_cpu(unsigned int cpu)
*c = boot_cpu_data;
c->cpu_index = cpu;
+ memset(&c->x86_capability, 0, sizeof(c->x86_capability));
identify_cpu(c);
#ifdef CONFIG_X86_32
enable_sep_cpu();
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
2025-05-15 7:10 ` Ingo Molnar
2025-05-15 7:58 ` [tip: x86/core] x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too tip-bot2 for Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code Ard Biesheuvel
` (3 subsequent siblings)
6 siblings, 2 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
Expose struct cpuinfo_x86 via asm-offsets for x86_64 too so that it will
be possible to set CPU capabilities from asm code.
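As background, each OFFSET() entry simply emits a structure member offset
that asm code can consume as a #define; a user-space analogue using
offsetof() (the struct layout below is made up purely for illustration, not
the real struct cpuinfo_x86):

#include <stdio.h>
#include <stddef.h>

struct cpuinfo_demo {
    unsigned char   x86;
    unsigned char   x86_vendor;
    unsigned char   x86_model;
    unsigned char   x86_stepping;
    int             cpuid_level;
    unsigned int    x86_capability[22];
};

int main(void)
{
    printf("#define CPUINFO_x86_capability %zu\n",
           offsetof(struct cpuinfo_demo, x86_capability));
    return 0;
}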
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/kernel/asm-offsets.c | 8 ++++++++
arch/x86/kernel/asm-offsets_32.c | 9 ---------
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index ad4ea6fb3b6c..6259b474073b 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -33,6 +33,14 @@
static void __used common(void)
{
+ OFFSET(CPUINFO_x86, cpuinfo_x86, x86);
+ OFFSET(CPUINFO_x86_vendor, cpuinfo_x86, x86_vendor);
+ OFFSET(CPUINFO_x86_model, cpuinfo_x86, x86_model);
+ OFFSET(CPUINFO_x86_stepping, cpuinfo_x86, x86_stepping);
+ OFFSET(CPUINFO_cpuid_level, cpuinfo_x86, cpuid_level);
+ OFFSET(CPUINFO_x86_capability, cpuinfo_x86, x86_capability);
+ OFFSET(CPUINFO_x86_vendor_id, cpuinfo_x86, x86_vendor_id);
+
BLANK();
OFFSET(TASK_threadsp, task_struct, thread.sp);
#ifdef CONFIG_STACKPROTECTOR
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 2b411cd00a4e..e0a292db97b2 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -12,15 +12,6 @@ void foo(void);
void foo(void)
{
- OFFSET(CPUINFO_x86, cpuinfo_x86, x86);
- OFFSET(CPUINFO_x86_vendor, cpuinfo_x86, x86_vendor);
- OFFSET(CPUINFO_x86_model, cpuinfo_x86, x86_model);
- OFFSET(CPUINFO_x86_stepping, cpuinfo_x86, x86_stepping);
- OFFSET(CPUINFO_cpuid_level, cpuinfo_x86, cpuid_level);
- OFFSET(CPUINFO_x86_capability, cpuinfo_x86, x86_capability);
- OFFSET(CPUINFO_x86_vendor_id, cpuinfo_x86, x86_vendor_id);
- BLANK();
-
OFFSET(PT_EBX, pt_regs, bx);
OFFSET(PT_ECX, pt_regs, cx);
OFFSET(PT_EDX, pt_regs, dx);
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (2 preceding siblings ...)
2025-05-14 10:42 ` [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
2025-05-15 8:00 ` Kirill A. Shutemov
2025-05-14 10:42 ` [PATCH v3 5/7] x86/boot: Drop the early variant of pgtable_l5_enabled() Ard Biesheuvel
` (2 subsequent siblings)
6 siblings, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
In order for pgtable_l5_enabled() to be reliable wherever it is used and
however early, set the associated CPU capability from asm code before
entering the startup C code.
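The setup_force_cpu_cap asm macro added below addresses x86_capability[] as
an array of 32-bit words; a stand-alone sketch of the operand arithmetic it
relies on (feature value as defined earlier in this series):

#include <stdio.h>

#define X86_FEATURE_5LEVEL_PAGING (21*32 + 11)

int main(void)
{
    unsigned int cap = X86_FEATURE_5LEVEL_PAGING;

    /* btsl $(cap % 32), base + 4*(cap / 32): bit within word, byte offset of word */
    printf("capability word: %u\n", cap / 32);
    printf("bit in word:     %u\n", cap % 32);
    printf("byte offset:     %u\n", 4 * (cap / 32));
    return 0;
}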
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/include/asm/cpufeature.h | 12 +++++++++---
arch/x86/kernel/cpu/common.c | 3 ---
arch/x86/kernel/head_64.S | 15 +++++++++++++++
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 893cbca37fe9..1b5de40e7bf7 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -2,10 +2,10 @@
#ifndef _ASM_X86_CPUFEATURE_H
#define _ASM_X86_CPUFEATURE_H
+#ifdef __KERNEL__
+#ifndef __ASSEMBLER__
#include <asm/processor.h>
-#if defined(__KERNEL__) && !defined(__ASSEMBLER__)
-
#include <asm/asm.h>
#include <linux/bitops.h>
#include <asm/alternative.h>
@@ -137,5 +137,11 @@ static __always_inline bool _static_cpu_has(u16 bit)
#define CPU_FEATURE_TYPEVAL boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
boot_cpu_data.x86_model
-#endif /* defined(__KERNEL__) && !defined(__ASSEMBLER__) */
+#else /* !defined(__ASSEMBLER__) */
+ .macro setup_force_cpu_cap, cap:req
+ btsl $\cap % 32, boot_cpu_data+CPUINFO_x86_capability+4*(\cap / 32)(%rip)
+ btsl $\cap % 32, cpu_caps_set+4*(\cap / 32)(%rip)
+ .endm
+#endif /* !defined(__ASSEMBLER__) */
+#endif /* defined(__KERNEL__) */
#endif /* _ASM_X86_CPUFEATURE_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f6f206743d6a..c8954dc2fb26 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1752,9 +1752,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
setup_clear_cpu_cap(X86_FEATURE_PCID);
#endif
- if (IS_ENABLED(CONFIG_X86_5LEVEL) && (native_read_cr4() & X86_CR4_LA57))
- setup_force_cpu_cap(X86_FEATURE_5LEVEL_PAGING);
-
detect_nopl();
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 069420853304..191d5947a762 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -27,6 +27,7 @@
#include <asm/fixmap.h>
#include <asm/smp.h>
#include <asm/thread_info.h>
+#include <asm/cpufeature.h>
/*
* We are not able to switch in one step to the final KERNEL ADDRESS SPACE
@@ -58,6 +59,20 @@ SYM_CODE_START_NOALIGN(startup_64)
*/
mov %rsi, %r15
+#ifdef CONFIG_X86_5LEVEL
+ /*
+ * Set the X86_FEATURE_5LEVEL_PAGING capability before calling into the
+ * C code, to give it a consistent view of any global pseudo-constants
+ * that are derived from pgtable_l5_enabled().
+ */
+ mov %cr4, %rax
+ btl $X86_CR4_LA57_BIT, %eax
+ jnc 0f
+
+ setup_force_cpu_cap X86_FEATURE_5LEVEL_PAGING
+0:
+#endif
+
/* Set up the stack for verify_cpu() */
leaq __top_init_kernel_stack(%rip), %rsp
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 5/7] x86/boot: Drop the early variant of pgtable_l5_enabled()
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (3 preceding siblings ...)
2025-05-14 10:42 ` [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 6/7] x86/boot: Drop 5-level paging related variables and early updates Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 7/7] x86/cpu: Make CPU capability overrides __ro_after_init Ard Biesheuvel
6 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
Now that cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING) is guaranteed to
produce the correct value even during early boot, there is no longer a
need for an early variant and so it can be dropped.
For the decompressor, fall back to testing the CR4.LA57 control register
bit directly.
Note that this removes the need to disable KASAN temporarily while
applying alternatives, given that any constant or VA space dimension
derived from pgtable_l5_enabled() will now always produce a consistent
value.
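For reference, the decompressor fallback below just tests CR4.LA57 (bit 12);
a minimal stand-alone sketch of that test (reading CR4 itself is of course
kernel-only, so the value is passed in here):

#include <stdbool.h>
#include <stdio.h>

#define X86_CR4_LA57    (1UL << 12)    /* CR4.LA57: 57-bit linear addresses enabled */

static bool l5_enabled_from_cr4(unsigned long cr4)
{
    return cr4 & X86_CR4_LA57;
}

int main(void)
{
    /* Example CR4 image with the LA57 bit set. */
    printf("%d\n", l5_enabled_from_cr4(0x1000UL));
    return 0;
}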
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 7 ++++---
arch/x86/boot/startup/sme.c | 9 ---------
arch/x86/include/asm/pgtable_64_types.h | 14 ++------------
arch/x86/kernel/alternative.c | 12 ------------
arch/x86/kernel/cpu/common.c | 2 --
arch/x86/kernel/head64.c | 3 ---
arch/x86/mm/kasan_init_64.c | 3 ---
7 files changed, 6 insertions(+), 44 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index db1048621ea2..72b830b8a69c 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -16,9 +16,6 @@
#define __NO_FORTIFY
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
/*
* Boot stub deals with identity mappings, physical and virtual addresses are
* the same, so override these defines.
@@ -28,6 +25,10 @@
#define __pa(x) ((unsigned long)(x))
#define __va(x) ((void *)((unsigned long)(x)))
+#ifdef CONFIG_X86_5LEVEL
+#define pgtable_l5_enabled() (native_read_cr4() & X86_CR4_LA57)
+#endif
+
#include <linux/linkage.h>
#include <linux/screen_info.h>
#include <linux/elf.h>
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index 70ea1748c0a7..a6c25d005991 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -25,15 +25,6 @@
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
-/*
- * This code runs before CPU feature bits are set. By default, the
- * pgtable_l5_enabled() function uses bit X86_FEATURE_LA57 to determine if
- * 5-level paging is active, so that won't work here. USE_EARLY_PGTABLE_L5
- * is provided to handle this situation and, instead, use a variable that
- * has been set by the early boot code.
- */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mem_encrypt.h>
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index d31ae12663bb..3e94da790cb7 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -24,19 +24,9 @@ typedef struct { pmdval_t pmd; } pmd_t;
extern unsigned int __pgtable_l5_enabled;
#ifdef CONFIG_X86_5LEVEL
-#ifdef USE_EARLY_PGTABLE_L5
-/*
- * cpu_feature_enabled() is not available in early boot code.
- * Use variable instead.
- */
-static inline bool pgtable_l5_enabled(void)
-{
- return __pgtable_l5_enabled;
-}
-#else
+#ifndef pgtable_l5_enabled
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING)
-#endif /* USE_EARLY_PGTABLE_L5 */
-
+#endif
#else
#define pgtable_l5_enabled() 0
#endif /* CONFIG_X86_5LEVEL */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 2385528792b2..e39823d8d1ae 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -589,16 +589,6 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
DPRINTK(ALT, "alt table %px, -> %px", start, end);
- /*
- * In the case CONFIG_X86_5LEVEL=y, KASAN_SHADOW_START is defined using
- * cpu_feature_enabled(X86_FEATURE_LA57) and is therefore patched here.
- * During the process, KASAN becomes confused seeing partial LA57
- * conversion and triggers a false-positive out-of-bound report.
- *
- * Disable KASAN until the patching is complete.
- */
- kasan_disable_current();
-
/*
* The scan order should be from start to end. A later scanned
* alternative code can overwrite previously scanned alternative code.
@@ -666,8 +656,6 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
text_poke_early(instr, insn_buff, insn_buff_sz);
}
-
- kasan_enable_current();
}
static inline bool is_jcc32(struct insn *insn)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8954dc2fb26..08a586606e24 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,6 +1,4 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
#include <linux/memblock.h>
#include <linux/linkage.h>
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 14f7dda20954..84b5df539a94 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -5,9 +5,6 @@
* Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
*/
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 0539efd0d216..7c4fafbd52cc 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -1,9 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#define pr_fmt(fmt) "kasan: " fmt
-/* cpu_feature_enabled() cannot be used this early */
-#define USE_EARLY_PGTABLE_L5
-
#include <linux/memblock.h>
#include <linux/kasan.h>
#include <linux/kdebug.h>
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 6/7] x86/boot: Drop 5-level paging related variables and early updates
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (4 preceding siblings ...)
2025-05-14 10:42 ` [PATCH v3 5/7] x86/boot: Drop the early variant of pgtable_l5_enabled() Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 7/7] x86/cpu: Make CPU capability overrides __ro_after_init Ard Biesheuvel
6 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
The variable __pgtable_l5_enabled is no longer used so it can be
dropped.
Along with it, drop ptrs_per_p4d and pgdir_shift, and replace any
references to those with expressions based on pgtable_l5_enabled(). This
ensures that all observers see values that are mutually consistent.
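A small stand-alone sketch of the resulting definitions and the
address-space widths they imply (constants as in the pgtable_64_types.h
hunk below):

#include <stdio.h>
#include <stdbool.h>

static bool l5;        /* stand-in for pgtable_l5_enabled() */

#define PGDIR_SHIFT     (l5 ? 48 : 39)
#define PTRS_PER_PGD    512
#define PTRS_PER_P4D    (l5 ? 512 : 1)

int main(void)
{
    for (int i = 0; i < 2; i++) {
        l5 = i;
        /* 512 (2^9) PGD entries, each covering 1 << PGDIR_SHIFT bytes */
        printf("%d-level: PGDIR_SHIFT=%d PTRS_PER_P4D=%d -> %d VA bits\n",
               l5 ? 5 : 4, PGDIR_SHIFT, PTRS_PER_P4D, PGDIR_SHIFT + 9);
    }
    return 0;
}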
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/boot/compressed/misc.h | 1 -
arch/x86/boot/compressed/pgtable_64.c | 12 -----------
arch/x86/boot/startup/map_kernel.c | 21 +-------------------
arch/x86/include/asm/pgtable_64_types.h | 9 ++-------
arch/x86/kernel/head64.c | 8 --------
5 files changed, 3 insertions(+), 48 deletions(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index 72b830b8a69c..3d5c6322def8 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -190,7 +190,6 @@ static inline int count_immovable_mem_regions(void) { return 0; }
#endif
/* ident_map_64.c */
-extern unsigned int __pgtable_l5_enabled, pgdir_shift, ptrs_per_p4d;
extern void kernel_add_identity_map(unsigned long start, unsigned long end);
/* Used by PAGE_KERN* macros: */
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index 5a6c7a190e5b..591d28f2feb6 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -10,13 +10,6 @@
#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */
-#ifdef CONFIG_X86_5LEVEL
-/* __pgtable_l5_enabled needs to be in .data to avoid being cleared along with .bss */
-unsigned int __section(".data") __pgtable_l5_enabled;
-unsigned int __section(".data") pgdir_shift = 39;
-unsigned int __section(".data") ptrs_per_p4d = 1;
-#endif
-
/* Buffer to preserve trampoline memory */
static char trampoline_save[TRAMPOLINE_32BIT_SIZE];
@@ -127,11 +120,6 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
native_cpuid_eax(0) >= 7 &&
(native_cpuid_ecx(7) & (1 << (X86_FEATURE_LA57 & 31)))) {
l5_required = true;
-
- /* Initialize variables for 5-level paging */
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
}
/*
diff --git a/arch/x86/boot/startup/map_kernel.c b/arch/x86/boot/startup/map_kernel.c
index 905e8734b5a3..056de4766006 100644
--- a/arch/x86/boot/startup/map_kernel.c
+++ b/arch/x86/boot/startup/map_kernel.c
@@ -14,25 +14,6 @@
extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
extern unsigned int next_early_pgt;
-static inline bool check_la57_support(void)
-{
- if (!IS_ENABLED(CONFIG_X86_5LEVEL))
- return false;
-
- /*
- * 5-level paging is detected and enabled at kernel decompression
- * stage. Only check if it has been enabled there.
- */
- if (!(native_read_cr4() & X86_CR4_LA57))
- return false;
-
- __pgtable_l5_enabled = 1;
- pgdir_shift = 48;
- ptrs_per_p4d = 512;
-
- return true;
-}
-
static unsigned long __head sme_postprocess_startup(struct boot_params *bp,
pmdval_t *pmd,
unsigned long p2v_offset)
@@ -102,7 +83,7 @@ unsigned long __head __startup_64(unsigned long p2v_offset,
bool la57;
int i;
- la57 = check_la57_support();
+ la57 = pgtable_l5_enabled();
/* Is the address too large? */
if (physaddr >> MAX_PHYSMEM_BITS)
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 3e94da790cb7..dc3a08b539d8 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -21,8 +21,6 @@ typedef unsigned long pgprotval_t;
typedef struct { pteval_t pte; } pte_t;
typedef struct { pmdval_t pmd; } pmd_t;
-extern unsigned int __pgtable_l5_enabled;
-
#ifdef CONFIG_X86_5LEVEL
#ifndef pgtable_l5_enabled
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING)
@@ -31,9 +29,6 @@ extern unsigned int __pgtable_l5_enabled;
#define pgtable_l5_enabled() 0
#endif /* CONFIG_X86_5LEVEL */
-extern unsigned int pgdir_shift;
-extern unsigned int ptrs_per_p4d;
-
#endif /* !__ASSEMBLER__ */
#ifdef CONFIG_X86_5LEVEL
@@ -41,7 +36,7 @@ extern unsigned int ptrs_per_p4d;
/*
* PGDIR_SHIFT determines what a top-level page table entry can map
*/
-#define PGDIR_SHIFT pgdir_shift
+#define PGDIR_SHIFT (pgtable_l5_enabled() ? 48 : 39)
#define PTRS_PER_PGD 512
/*
@@ -49,7 +44,7 @@ extern unsigned int ptrs_per_p4d;
*/
#define P4D_SHIFT 39
#define MAX_PTRS_PER_P4D 512
-#define PTRS_PER_P4D ptrs_per_p4d
+#define PTRS_PER_P4D (pgtable_l5_enabled() ? MAX_PTRS_PER_P4D : 1)
#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
#define P4D_MASK (~(P4D_SIZE - 1))
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 84b5df539a94..68f6a31b4d8e 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -48,14 +48,6 @@ unsigned int __initdata next_early_pgt;
SYM_PIC_ALIAS(next_early_pgt);
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
-#ifdef CONFIG_X86_5LEVEL
-unsigned int __pgtable_l5_enabled __ro_after_init;
-unsigned int pgdir_shift __ro_after_init = 39;
-EXPORT_SYMBOL(pgdir_shift);
-unsigned int ptrs_per_p4d __ro_after_init = 1;
-EXPORT_SYMBOL(ptrs_per_p4d);
-#endif
-
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
--
2.49.0.1101.gccaa498523-goog
* [PATCH v3 7/7] x86/cpu: Make CPU capability overrides __ro_after_init
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
` (5 preceding siblings ...)
2025-05-14 10:42 ` [PATCH v3 6/7] x86/boot: Drop 5-level paging related variables and early updates Ard Biesheuvel
@ 2025-05-14 10:42 ` Ard Biesheuvel
6 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-14 10:42 UTC (permalink / raw)
To: linux-kernel
Cc: x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds, Brian Gerst
From: Ard Biesheuvel <ardb@kernel.org>
The CPU capability override arrays are only written at init time, so they
can be made read-only once boot completes.
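For reference, a minimal sketch of what the annotation amounts to (assumed
definition; in the kernel the named section is write-protected once init
completes):

#include <stdio.h>

/* Assumed to match the kernel's definition of the attribute. */
#define __ro_after_init __attribute__((__section__(".data..ro_after_init")))

static unsigned int cpu_caps_demo[22] __ro_after_init;

int main(void)
{
    printf("%zu capability words placed via __ro_after_init\n",
           sizeof(cpu_caps_demo) / sizeof(cpu_caps_demo[0]));
    return 0;
}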
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
arch/x86/kernel/cpu/common.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 08a586606e24..19b310bd55ae 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -702,8 +702,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
}
/* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
-__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
-__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
+__u32 __ro_after_init cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
+__u32 __ro_after_init cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
#ifdef CONFIG_X86_32
/* The 32-bit entry code needs to find cpu_entry_area. */
--
2.49.0.1101.gccaa498523-goog
* Re: [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-14 10:42 ` [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early Ard Biesheuvel
@ 2025-05-15 6:56 ` Ingo Molnar
2025-05-15 7:50 ` Ingo Molnar
2025-05-15 7:55 ` Kirill A. Shutemov
0 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 6:56 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Linus Torvalds, Brian Gerst
* Ard Biesheuvel <ardb+git@google.com> wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> cpu_feature_enabled() uses a ternary alternative, where the late variant
> is based on code patching and the early variant accesses the capability
> field in boot_cpu_data directly.
>
> This allows cpu_feature_enabled() to be called quite early, but it still
> requires that the CPU feature detection code runs before being able to
> rely on the return value of cpu_feature_enabled().
>
> This is a problem for the implementation of pgtable_l5_enabled(), which
> is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
> called extremely early. Currently, there is a hacky workaround where
> some source files that may execute before (but also after) CPU feature
> detection have a different version of pgtable_l5_enabled(), based on the
> USE_EARLY_PGTABLE_L5 preprocessor macro.
>
> Instead, let's make it possible to set CPU features arbitrarily early, so
> that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
> entering C code.
>
> This involves relying on static initialization of boot_cpu_data and the
> cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
> .data. This ensures that they won't be cleared along with the rest of
> BSS.
>
> Note that forcing a capability involves setting it in both
> boot_cpu_data.x86_capability[] and cpu_caps_set[].
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> arch/x86/kernel/cpu/common.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 6f7827015834..f6f206743d6a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
> }
>
> /* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
> -__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> -__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> +__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> +__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
This change is not mentioned in the changelog AFAICS, but it should be
in a separate patch anyway.
Thanks,
Ingo
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
@ 2025-05-15 7:06 ` Ingo Molnar
2025-05-15 7:45 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 7:06 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Linus Torvalds, Brian Gerst
* Ard Biesheuvel <ardb+git@google.com> wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Currently, the LA57 CPU feature flag is taken to mean two different
> things at once:
> - whether the CPU implements the LA57 extension, and is therefore
> capable of supporting 5 level paging;
> - whether 5 level paging is currently in use.
>
> This means the LA57 capability of the hardware is hidden when a LA57
> capable CPU is forced to run with 4 levels of paging. It also means that
> the ordinary CPU capability detection code will happily set the LA57
> capability and it needs to be cleared explicitly afterwards to avoid
> inconsistencies.
>
> Separate the two so that the CPU hardware capability can be identified
> unambiguously in all cases.
>
> To avoid breaking existing users that might assume that 5 level paging
> is being used when the "la57" string is visible in /proc/cpuinfo,
> repurpose that string to mean that 5-level paging is in use, and add a
> new string la57_capable to indicate that the CPU feature is implemented
> by the hardware.
So the new string ended up being "la57_hw", not "la57_capable". :-)
> -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
I fixed the changelog and kept la57_hw.
BTW., I too was considering these variants for the new flag:
la57_support
la57_cap
la57_capable
- These are easy to confuse with 5-level paging software
support in the kernel, ie. the name doesn't sufficiently
disambiguate that this flag is about hardware support.
la57_cpu
- While this makes it clear that it's about the CPU, the _cpu
postfix often denotes something related to a specific CPU,
so it's a tiny bit confusing in this context.
... and each had disadvantages, as listed, and "la57_hw" seemed the
least ambiguous in this context.
Thanks,
Ingo
* Re: [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use
2025-05-14 10:42 ` [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use Ard Biesheuvel
@ 2025-05-15 7:10 ` Ingo Molnar
2025-05-15 7:58 ` [tip: x86/core] x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too tip-bot2 for Ard Biesheuvel
1 sibling, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 7:10 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Linus Torvalds, Brian Gerst
* Ard Biesheuvel <ardb+git@google.com> wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Expose struct cpuinfo_x86 via asm-offsets for x86_64 too so that it will
> be possible to set CPU capabilities from asm code.
Note that this title and changelog are slightly inaccurate: we don't
really expose 'struct cpuinfo_x86', only certain fields of it. Most of
the fields are not exported, only those that are used (or will be used)
by asm code.
I fixed up the changelog.
Thanks,
Ingo
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
2025-05-15 7:06 ` Ingo Molnar
@ 2025-05-15 7:45 ` Ingo Molnar
2025-05-15 8:07 ` Kirill A. Shutemov
2025-05-15 10:12 ` Ard Biesheuvel
2025-05-15 9:51 ` Borislav Petkov
2025-05-15 13:11 ` Borislav Petkov
3 siblings, 2 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 7:45 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Linus Torvalds, Brian Gerst
* Ard Biesheuvel <ardb+git@google.com> wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Currently, the LA57 CPU feature flag is taken to mean two different
> things at once:
> - whether the CPU implements the LA57 extension, and is therefore
> capable of supporting 5 level paging;
> - whether 5 level paging is currently in use.
>
> This means the LA57 capability of the hardware is hidden when a LA57
> capable CPU is forced to run with 4 levels of paging. It also means that
> the ordinary CPU capability detection code will happily set the LA57
> capability and it needs to be cleared explicitly afterwards to avoid
> inconsistencies.
>
> Separate the two so that the CPU hardware capability can be identified
> unambiguously in all cases.
>
> To avoid breaking existing users that might assume that 5 level paging
> is being used when the "la57" string is visible in /proc/cpuinfo,
> repurpose that string to mean that 5-level paging is in use, and add a
> new string la57_capable to indicate that the CPU feature is implemented
> by the hardware.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> arch/x86/include/asm/cpufeatures.h | 3 ++-
> arch/x86/include/asm/page_64.h | 2 +-
> arch/x86/include/asm/pgtable_64_types.h | 2 +-
> arch/x86/kernel/cpu/common.c | 16 ++--------------
> arch/x86/kvm/x86.h | 4 ++--
> drivers/iommu/amd/init.c | 4 ++--
> drivers/iommu/intel/svm.c | 4 ++--
> tools/testing/selftests/kvm/x86/set_sregs_test.c | 2 +-
> 8 files changed, 13 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index f67a93fc9391..d59bee5907e7 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -395,7 +395,7 @@
> #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
> #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> @@ -483,6 +483,7 @@
> #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
So there's a new complication here: KVM doesn't like the use of
synthetic CPU flags, for understandable reasons:
inlined from ‘intel_pmu_set_msr’ at arch/x86/kvm/vmx/pmu_intel.c:369:7:
...
./arch/x86/kvm/reverse_cpuid.h:102:9: note: in expansion of macro ‘BUILD_BUG_ON’
102 | BUILD_BUG_ON(x86_leaf == CPUID_LNX_5);
| ^~~~~~~~~~~~
(See x86-64 allmodconfig)
Even though previously X86_FEATURE_LA57 was effectively a synthetic
CPU flag (it got artificially turned off by the Linux kernel if 5-level
paging was disabled) ...
So I think the most straightforward solution would be to do the change
below, and pass the LA57 flag through only if 5-level paging is enabled in
the host kernel. This is similar to the firmware having turned off LA57,
and it doesn't bring in all the early-boot complications bare metal has. It
should also match the previous behavior, I think.
Thoughts?
Thanks,
Ingo
=================>
arch/x86/kvm/cpuid.c | 6 ++++++
arch/x86/kvm/x86.h | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 571c906ffcbf..d951d71aea3b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1221,6 +1221,12 @@ void kvm_set_cpu_caps(void)
kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
kvm_cpu_cap_clear(X86_FEATURE_RDPID);
}
+ /*
+ * Clear the LA57 flag in the guest if the host kernel
+ * does not have 5-level paging support:
+ */
+ if (kvm_cpu_cap_has(X86_FEATURE_LA57) && !pgtable_l5_enabled())
+ kvm_cpu_cap_clear(X86_FEATURE_LA57);
}
EXPORT_SYMBOL_GPL(kvm_set_cpu_caps);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d2c093f17ae5..9dc32a409076 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -243,7 +243,7 @@ static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
static inline u8 max_host_virt_addr_bits(void)
{
- return kvm_cpu_cap_has(X86_FEATURE_5LEVEL_PAGING) ? 57 : 48;
+ return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
}
/*
@@ -603,7 +603,7 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
__reserved_bits |= X86_CR4_FSGSBASE; \
if (!__cpu_has(__c, X86_FEATURE_PKU)) \
__reserved_bits |= X86_CR4_PKE; \
- if (!__cpu_has(__c, X86_FEATURE_5LEVEL_PAGING)) \
+ if (!__cpu_has(__c, X86_FEATURE_LA57)) \
__reserved_bits |= X86_CR4_LA57; \
if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
__reserved_bits |= X86_CR4_UMIP; \
* Re: [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-15 6:56 ` Ingo Molnar
@ 2025-05-15 7:50 ` Ingo Molnar
2025-05-15 7:55 ` Kirill A. Shutemov
1 sibling, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 7:50 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Linus Torvalds, Brian Gerst
* Ingo Molnar <mingo@kernel.org> wrote:
>
> * Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > cpu_feature_enabled() uses a ternary alternative, where the late variant
> > is based on code patching and the early variant accesses the capability
> > field in boot_cpu_data directly.
> >
> > This allows cpu_feature_enabled() to be called quite early, but it still
> > requires that the CPU feature detection code runs before being able to
> > rely on the return value of cpu_feature_enabled().
> >
> > This is a problem for the implementation of pgtable_l5_enabled(), which
> > is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
> > called extremely early. Currently, there is a hacky workaround where
> > some source files that may execute before (but also after) CPU feature
> > detection have a different version of pgtable_l5_enabled(), based on the
> > USE_EARLY_PGTABLE_L5 preprocessor macro.
> >
> > Instead, let's make it possible to set CPU features arbitrarily early, so
> > that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
> > entering C code.
> >
> > This involves relying on static initialization of boot_cpu_data and the
> > cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
> > .data. This ensures that they won't be cleared along with the rest of
> > BSS.
> >
> > Note that forcing a capability involves setting it in both
> > boot_cpu_data.x86_capability[] and cpu_caps_set[].
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/kernel/cpu/common.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 6f7827015834..f6f206743d6a 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
> > }
> >
> > /* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
> > -__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > -__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > +__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > +__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
>
> This change is not mentioned in the changelog AFAICS, but it should be
> in a separate patch anyway.
So patch #7 makes this __ro_after_init.
I think we should not introduce the __read_mostly attribute in patch
#2, and introduce __ro_after_init in patch #7 - with the same end
result.
Thanks,
Ingo
* Re: [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-15 6:56 ` Ingo Molnar
2025-05-15 7:50 ` Ingo Molnar
@ 2025-05-15 7:55 ` Kirill A. Shutemov
2025-05-15 8:18 ` Ingo Molnar
1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 7:55 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, x86, Ard Biesheuvel, Linus Torvalds,
Brian Gerst
On Thu, May 15, 2025 at 08:56:59AM +0200, Ingo Molnar wrote:
>
> * Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > cpu_feature_enabled() uses a ternary alternative, where the late variant
> > is based on code patching and the early variant accesses the capability
> > field in boot_cpu_data directly.
> >
> > This allows cpu_feature_enabled() to be called quite early, but it still
> > requires that the CPU feature detection code runs before being able to
> > rely on the return value of cpu_feature_enabled().
> >
> > This is a problem for the implementation of pgtable_l5_enabled(), which
> > is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
> > called extremely early. Currently, there is a hacky workaround where
> > some source files that may execute before (but also after) CPU feature
> > detection have a different version of pgtable_l5_enabled(), based on the
> > USE_EARLY_PGTABLE_L5 preprocessor macro.
> >
> > Instead, let's make it possible to set CPU features arbitrarily early, so
> > that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
> > entering C code.
> >
> > This involves relying on static initialization of boot_cpu_data and the
> > cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
> > .data. This ensures that they won't be cleared along with the rest of
> > BSS.
> >
> > Note that forcing a capability involves setting it in both
> > boot_cpu_data.x86_capability[] and cpu_caps_set[].
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/kernel/cpu/common.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 6f7827015834..f6f206743d6a 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
> > }
> >
> > /* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
> > -__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > -__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > +__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > +__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
>
> This change is not mentioned in the changelog AFAICS, but it should be
> in a separate patch anyway.
And why not __ro_after_init?
--
Kiryl Shutsemau / Kirill A. Shutemov
* [tip: x86/core] x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too
2025-05-14 10:42 ` [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use Ard Biesheuvel
2025-05-15 7:10 ` Ingo Molnar
@ 2025-05-15 7:58 ` tip-bot2 for Ard Biesheuvel
1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Ard Biesheuvel @ 2025-05-15 7:58 UTC (permalink / raw)
To: linux-tip-commits
Cc: Ard Biesheuvel, Ingo Molnar, Brian Gerst, Linus Torvalds, x86,
linux-kernel
The following commit has been merged into the x86/core branch of tip:
Commit-ID: 25219c2578b32c96087569d074e32c8ae634d602
Gitweb: https://git.kernel.org/tip/25219c2578b32c96087569d074e32c8ae634d602
Author: Ard Biesheuvel <ardb@kernel.org>
AuthorDate: Wed, 14 May 2025 12:42:46 +02:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Thu, 15 May 2025 09:12:07 +02:00
x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too
Expose certain 'struct cpuinfo_x86' fields via asm-offsets for x86_64
too, so that it will be possible to set CPU capabilities from 64-bit
asm code.
32-bit already used these fields, so simply move those offset exports into
the unified asm-offsets.c file.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250514104242.1275040-12-ardb+git@google.com
---
arch/x86/kernel/asm-offsets.c | 8 ++++++++
arch/x86/kernel/asm-offsets_32.c | 9 ---------
2 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index ad4ea6f..6259b47 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -33,6 +33,14 @@
static void __used common(void)
{
+ OFFSET(CPUINFO_x86, cpuinfo_x86, x86);
+ OFFSET(CPUINFO_x86_vendor, cpuinfo_x86, x86_vendor);
+ OFFSET(CPUINFO_x86_model, cpuinfo_x86, x86_model);
+ OFFSET(CPUINFO_x86_stepping, cpuinfo_x86, x86_stepping);
+ OFFSET(CPUINFO_cpuid_level, cpuinfo_x86, cpuid_level);
+ OFFSET(CPUINFO_x86_capability, cpuinfo_x86, x86_capability);
+ OFFSET(CPUINFO_x86_vendor_id, cpuinfo_x86, x86_vendor_id);
+
BLANK();
OFFSET(TASK_threadsp, task_struct, thread.sp);
#ifdef CONFIG_STACKPROTECTOR
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index 2b411cd..e0a292d 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -12,15 +12,6 @@ void foo(void);
void foo(void)
{
- OFFSET(CPUINFO_x86, cpuinfo_x86, x86);
- OFFSET(CPUINFO_x86_vendor, cpuinfo_x86, x86_vendor);
- OFFSET(CPUINFO_x86_model, cpuinfo_x86, x86_model);
- OFFSET(CPUINFO_x86_stepping, cpuinfo_x86, x86_stepping);
- OFFSET(CPUINFO_cpuid_level, cpuinfo_x86, cpuid_level);
- OFFSET(CPUINFO_x86_capability, cpuinfo_x86, x86_capability);
- OFFSET(CPUINFO_x86_vendor_id, cpuinfo_x86, x86_vendor_id);
- BLANK();
-
OFFSET(PT_EBX, pt_regs, bx);
OFFSET(PT_ECX, pt_regs, cx);
OFFSET(PT_EDX, pt_regs, dx);
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code
2025-05-14 10:42 ` [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code Ard Biesheuvel
@ 2025-05-15 8:00 ` Kirill A. Shutemov
2025-05-15 9:43 ` Ard Biesheuvel
0 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 8:00 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Wed, May 14, 2025 at 12:42:47PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> In order for pgtable_l5_enabled() to be reliable wherever it is used and
> however early, set the associated CPU capability from asm code before
> entering the startup C code.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
> arch/x86/include/asm/cpufeature.h | 12 +++++++++---
> arch/x86/kernel/cpu/common.c | 3 ---
> arch/x86/kernel/head_64.S | 15 +++++++++++++++
> 3 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 893cbca37fe9..1b5de40e7bf7 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -2,10 +2,10 @@
> #ifndef _ASM_X86_CPUFEATURE_H
> #define _ASM_X86_CPUFEATURE_H
>
> +#ifdef __KERNEL__
> +#ifndef __ASSEMBLER__
> #include <asm/processor.h>
>
> -#if defined(__KERNEL__) && !defined(__ASSEMBLER__)
> -
> #include <asm/asm.h>
> #include <linux/bitops.h>
> #include <asm/alternative.h>
> @@ -137,5 +137,11 @@ static __always_inline bool _static_cpu_has(u16 bit)
> #define CPU_FEATURE_TYPEVAL boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
> boot_cpu_data.x86_model
>
> -#endif /* defined(__KERNEL__) && !defined(__ASSEMBLER__) */
> +#else /* !defined(__ASSEMBLER__) */
> + .macro setup_force_cpu_cap, cap:req
> + btsl $\cap % 32, boot_cpu_data+CPUINFO_x86_capability+4*(\cap / 32)(%rip)
> + btsl $\cap % 32, cpu_caps_set+4*(\cap / 32)(%rip)
> + .endm
> +#endif /* !defined(__ASSEMBLER__) */
> +#endif /* defined(__KERNEL__) */
> #endif /* _ASM_X86_CPUFEATURE_H */
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index f6f206743d6a..c8954dc2fb26 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1752,9 +1752,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
> setup_clear_cpu_cap(X86_FEATURE_PCID);
> #endif
>
> - if (IS_ENABLED(CONFIG_X86_5LEVEL) && (native_read_cr4() & X86_CR4_LA57))
> - setup_force_cpu_cap(X86_FEATURE_5LEVEL_PAGING);
> -
> detect_nopl();
> }
>
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 069420853304..191d5947a762 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -27,6 +27,7 @@
> #include <asm/fixmap.h>
> #include <asm/smp.h>
> #include <asm/thread_info.h>
> +#include <asm/cpufeature.h>
>
> /*
> * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
> @@ -58,6 +59,20 @@ SYM_CODE_START_NOALIGN(startup_64)
> */
> mov %rsi, %r15
>
> +#ifdef CONFIG_X86_5LEVEL
Is #ifdef really needed?
> + /*
> + * Set the X86_FEATURE_5LEVEL_PAGING capability before calling into the
> + * C code, to give it a consistent view of any global pseudo-constants
> + * that are derived from pgtable_l5_enabled().
> + */
> + mov %cr4, %rax
> + btl $X86_CR4_LA57_BIT, %eax
> + jnc 0f
> +
> + setup_force_cpu_cap X86_FEATURE_5LEVEL_PAGING
> +0:
> +#endif
> +
> /* Set up the stack for verify_cpu() */
> leaq __top_init_kernel_stack(%rip), %rsp
>
> --
> 2.49.0.1101.gccaa498523-goog
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 7:45 ` Ingo Molnar
@ 2025-05-15 8:07 ` Kirill A. Shutemov
2025-05-15 8:22 ` Ingo Molnar
2025-05-15 10:12 ` Ard Biesheuvel
1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 8:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, x86, Ard Biesheuvel, Linus Torvalds,
Brian Gerst
On Thu, May 15, 2025 at 09:45:52AM +0200, Ingo Molnar wrote:
>
> * Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Currently, the LA57 CPU feature flag is taken to mean two different
> > things at once:
> > - whether the CPU implements the LA57 extension, and is therefore
> > capable of supporting 5 level paging;
> > - whether 5 level paging is currently in use.
> >
> > This means the LA57 capability of the hardware is hidden when a LA57
> > capable CPU is forced to run with 4 levels of paging. It also means that
> > the ordinary CPU capability detection code will happily set the LA57
> > capability and it needs to be cleared explicitly afterwards to avoid
> > inconsistencies.
> >
> > Separate the two so that the CPU hardware capability can be identified
> > unambiguously in all cases.
> >
> > To avoid breaking existing users that might assume that 5 level paging
> > is being used when the "la57" string is visible in /proc/cpuinfo,
> > repurpose that string to mean that 5-level paging is in use, and add a
> > new string la57_hw to indicate that the CPU feature is implemented
> > by the hardware.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/include/asm/cpufeatures.h | 3 ++-
> > arch/x86/include/asm/page_64.h | 2 +-
> > arch/x86/include/asm/pgtable_64_types.h | 2 +-
> > arch/x86/kernel/cpu/common.c | 16 ++--------------
> > arch/x86/kvm/x86.h | 4 ++--
> > drivers/iommu/amd/init.c | 4 ++--
> > drivers/iommu/intel/svm.c | 4 ++--
> > tools/testing/selftests/kvm/x86/set_sregs_test.c | 2 +-
> > 8 files changed, 13 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index f67a93fc9391..d59bee5907e7 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -395,7 +395,7 @@
> > #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> > #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> > #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> > -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> > +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
> > #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> > #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> > #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> > @@ -483,6 +483,7 @@
> > #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> > #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> > #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> > +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
>
> So there's a new complication here, KVM doesn't like the use of
> synthetic CPU flags, for understandable reasons:
>
> inlined from ‘intel_pmu_set_msr’ at arch/x86/kvm/vmx/pmu_intel.c:369:7:
> ...
> ./arch/x86/kvm/reverse_cpuid.h:102:9: note: in expansion of macro ‘BUILD_BUG_ON’
> 102 | BUILD_BUG_ON(x86_leaf == CPUID_LNX_5);
> | ^~~~~~~~~~~~
>
> (See x86-64 allmodconfig)
>
> Even though previously X86_FEATURE_LA57 was effectively a synthetic
> CPU flag (it got artificially turned off by the Linux kernel if 5-level
> paging was disabled) ...
>
> So I think the most straightforward solution would be to do the change
> below, and pass through the LA57 flag if 5-level paging is enabled in the
> host kernel. This is similar to the firmware having turned off LA57, and
> it doesn't bring in all the early-boot complications bare metal has. It
> should also match the previous behavior I think.
>
> Thoughts?
>
> Thanks,
>
> Ingo
>
> =================>
>
> arch/x86/kvm/cpuid.c | 6 ++++++
> arch/x86/kvm/x86.h | 4 ++--
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 571c906ffcbf..d951d71aea3b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1221,6 +1221,12 @@ void kvm_set_cpu_caps(void)
> kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
> kvm_cpu_cap_clear(X86_FEATURE_RDPID);
> }
> + /*
> + * Clear the LA57 flag in the guest if the host kernel
> + * does not have 5-level paging support:
> + */
> + if (kvm_cpu_cap_has(X86_FEATURE_LA57) && !pgtable_l5_enabled())
X86_FEATURE_LA57 check seems redundant.
> + kvm_cpu_cap_clear(X86_FEATURE_LA57);
> }
> EXPORT_SYMBOL_GPL(kvm_set_cpu_caps);
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index d2c093f17ae5..9dc32a409076 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -243,7 +243,7 @@ static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
>
> static inline u8 max_host_virt_addr_bits(void)
> {
> - return kvm_cpu_cap_has(X86_FEATURE_5LEVEL_PAGING) ? 57 : 48;
> + return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
> }
>
> /*
> @@ -603,7 +603,7 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> __reserved_bits |= X86_CR4_FSGSBASE; \
> if (!__cpu_has(__c, X86_FEATURE_PKU)) \
> __reserved_bits |= X86_CR4_PKE; \
> - if (!__cpu_has(__c, X86_FEATURE_5LEVEL_PAGING)) \
> + if (!__cpu_has(__c, X86_FEATURE_LA57)) \
> __reserved_bits |= X86_CR4_LA57; \
> if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
> __reserved_bits |= X86_CR4_UMIP; \
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-15 7:55 ` Kirill A. Shutemov
@ 2025-05-15 8:18 ` Ingo Molnar
2025-05-15 9:45 ` Ard Biesheuvel
0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 8:18 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ard Biesheuvel, Linus Torvalds,
Brian Gerst
* Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Thu, May 15, 2025 at 08:56:59AM +0200, Ingo Molnar wrote:
> >
> > * Ard Biesheuvel <ardb+git@google.com> wrote:
> >
> > > From: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > cpu_feature_enabled() uses a ternary alternative, where the late variant
> > > is based on code patching and the early variant accesses the capability
> > > field in boot_cpu_data directly.
> > >
> > > This allows cpu_feature_enabled() to be called quite early, but it still
> > > requires that the CPU feature detection code runs before being able to
> > > rely on the return value of cpu_feature_enabled().
> > >
> > > This is a problem for the implementation of pgtable_l5_enabled(), which
> > > is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
> > > called extremely early. Currently, there is a hacky workaround where
> > > some source files that may execute before (but also after) CPU feature
> > > detection have a different version of pgtable_l5_enabled(), based on the
> > > USE_EARLY_PGTABLE_L5 preprocessor macro.
> > >
> > > Instead, let's make it possible to set CPU feature arbitrarily early, so
> > > that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
> > > entering C code.
> > >
> > > This involves relying on static initialization of boot_cpu_data and the
> > > cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
> > > .data. This ensures that they won't be cleared along with the rest of
> > > BSS.
> > >
> > > Note that forcing a capability involves setting it in both
> > > boot_cpu_data.x86_capability[] and cpu_caps_set[].
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > > arch/x86/kernel/cpu/common.c | 9 +++------
> > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index 6f7827015834..f6f206743d6a 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
> > > }
> > >
> > > /* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
> > > -__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > -__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > +__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > +__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> >
> > This change is not mentioned in the changelog AFAICS, but it should be
> > in a separate patch anyway.
>
> And why not __ro_after_init?
That's patch #7 :-)
I got confused about that too.
Patch #2 should not touch this line, and patch #7 should simply
introduce __ro_after_init, and we are good I think.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 8:07 ` Kirill A. Shutemov
@ 2025-05-15 8:22 ` Ingo Molnar
0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 8:22 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ard Biesheuvel, Linus Torvalds,
Brian Gerst
* Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Thu, May 15, 2025 at 09:45:52AM +0200, Ingo Molnar wrote:
> >
> > * Ard Biesheuvel <ardb+git@google.com> wrote:
> >
> > > From: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > Currently, the LA57 CPU feature flag is taken to mean two different
> > > things at once:
> > > - whether the CPU implements the LA57 extension, and is therefore
> > > capable of supporting 5 level paging;
> > > - whether 5 level paging is currently in use.
> > >
> > > This means the LA57 capability of the hardware is hidden when a LA57
> > > capable CPU is forced to run with 4 levels of paging. It also means the
> > > the ordinary CPU capability detection code will happily set the LA57
> > > capability and it needs to be cleared explicitly afterwards to avoid
> > > inconsistencies.
> > >
> > > Separate the two so that the CPU hardware capability can be identified
> > > unambiguously in all cases.
> > >
> > > To avoid breaking existing users that might assume that 5 level paging
> > > is being used when the "la57" string is visible in /proc/cpuinfo,
> > > repurpose that string to mean that 5-level paging is in use, and add a
> > > new string la57_hw to indicate that the CPU feature is implemented
> > > by the hardware.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > > arch/x86/include/asm/cpufeatures.h | 3 ++-
> > > arch/x86/include/asm/page_64.h | 2 +-
> > > arch/x86/include/asm/pgtable_64_types.h | 2 +-
> > > arch/x86/kernel/cpu/common.c | 16 ++--------------
> > > arch/x86/kvm/x86.h | 4 ++--
> > > drivers/iommu/amd/init.c | 4 ++--
> > > drivers/iommu/intel/svm.c | 4 ++--
> > > tools/testing/selftests/kvm/x86/set_sregs_test.c | 2 +-
> > > 8 files changed, 13 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > > index f67a93fc9391..d59bee5907e7 100644
> > > --- a/arch/x86/include/asm/cpufeatures.h
> > > +++ b/arch/x86/include/asm/cpufeatures.h
> > > @@ -395,7 +395,7 @@
> > > #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> > > #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> > > #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> > > -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> > > +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
> > > #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> > > #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> > > #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> > > @@ -483,6 +483,7 @@
> > > #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> > > #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> > > #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> > > +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
> >
> > So there's a new complication here, KVM doesn't like the use of
> > synthetic CPU flags, for understandable reasons:
> >
> > inlined from ‘intel_pmu_set_msr’ at arch/x86/kvm/vmx/pmu_intel.c:369:7:
> > ...
> > ./arch/x86/kvm/reverse_cpuid.h:102:9: note: in expansion of macro ‘BUILD_BUG_ON’
> > 102 | BUILD_BUG_ON(x86_leaf == CPUID_LNX_5);
> > | ^~~~~~~~~~~~
> >
> > (See x86-64 allmodconfig)
> >
> > Even though previously X86_FEATURE_LA57 was effectively a synthetic
> > CPU flag (it got artificially turned off by the Linux kernel if 5-level
> > paging was disabled) ...
> >
> > So I think the most straightforward solution would be to do the change
> > below, and pass through the LA57 flag if 5-level paging is enabled in the
> > host kernel. This is similar to the firmware having turned off LA57, and
> > it doesn't bring in all the early-boot complications bare metal has. It
> > should also match the previous behavior I think.
> >
> > Thoughts?
> >
> > Thanks,
> >
> > Ingo
> >
> > =================>
> >
> > arch/x86/kvm/cpuid.c | 6 ++++++
> > arch/x86/kvm/x86.h | 4 ++--
> > 2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 571c906ffcbf..d951d71aea3b 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1221,6 +1221,12 @@ void kvm_set_cpu_caps(void)
> > kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
> > kvm_cpu_cap_clear(X86_FEATURE_RDPID);
> > }
> > + /*
> > + * Clear the LA57 flag in the guest if the host kernel
> > + * does not have 5-level paging support:
> > + */
> > + if (kvm_cpu_cap_has(X86_FEATURE_LA57) && !pgtable_l5_enabled())
>
> X86_FEATURE_LA57 check seems redundant.
Possibly. I wanted to limit the clearing of X86_FEATURE_LA57 only to
precisely the case where it was set to begin with, i.e. don't clear it
when it wasn't present in the first place.
It shouldn't matter, as this is KVM-module-init-only code.
Thanks,
Ingo
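For reference, the simpler form Kirill is hinting at would presumably be just
the sketch below (not a patch: kvm_cpu_cap_clear() of a bit that was never set
is a harmless no-op, so the kvm_cpu_cap_has() guard adds nothing):
	/*
	 * Sketch only: clear the guest-visible LA57 cap whenever the host
	 * does not run with 5-level paging, regardless of whether the cap
	 * was set beforehand.
	 */
	if (!pgtable_l5_enabled())
		kvm_cpu_cap_clear(X86_FEATURE_LA57);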
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code
2025-05-15 8:00 ` Kirill A. Shutemov
@ 2025-05-15 9:43 ` Ard Biesheuvel
2025-05-15 11:05 ` Kirill A. Shutemov
0 siblings, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-15 9:43 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Thu, 15 May 2025 at 09:00, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, May 14, 2025 at 12:42:47PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > In order for pgtable_l5_enabled() to be reliable wherever it is used and
> > however early, set the associated CPU capability from asm code before
> > entering the startup C code.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/include/asm/cpufeature.h | 12 +++++++++---
> > arch/x86/kernel/cpu/common.c | 3 ---
> > arch/x86/kernel/head_64.S | 15 +++++++++++++++
> > 3 files changed, 24 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> > index 893cbca37fe9..1b5de40e7bf7 100644
> > --- a/arch/x86/include/asm/cpufeature.h
> > +++ b/arch/x86/include/asm/cpufeature.h
> > @@ -2,10 +2,10 @@
> > #ifndef _ASM_X86_CPUFEATURE_H
> > #define _ASM_X86_CPUFEATURE_H
> >
> > +#ifdef __KERNEL__
> > +#ifndef __ASSEMBLER__
> > #include <asm/processor.h>
> >
> > -#if defined(__KERNEL__) && !defined(__ASSEMBLER__)
> > -
> > #include <asm/asm.h>
> > #include <linux/bitops.h>
> > #include <asm/alternative.h>
> > @@ -137,5 +137,11 @@ static __always_inline bool _static_cpu_has(u16 bit)
> > #define CPU_FEATURE_TYPEVAL boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
> > boot_cpu_data.x86_model
> >
> > -#endif /* defined(__KERNEL__) && !defined(__ASSEMBLER__) */
> > +#else /* !defined(__ASSEMBLER__) */
> > + .macro setup_force_cpu_cap, cap:req
> > + btsl $\cap % 32, boot_cpu_data+CPUINFO_x86_capability+4*(\cap / 32)(%rip)
> > + btsl $\cap % 32, cpu_caps_set+4*(\cap / 32)(%rip)
> > + .endm
> > +#endif /* !defined(__ASSEMBLER__) */
> > +#endif /* defined(__KERNEL__) */
> > #endif /* _ASM_X86_CPUFEATURE_H */
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index f6f206743d6a..c8954dc2fb26 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -1752,9 +1752,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
> > setup_clear_cpu_cap(X86_FEATURE_PCID);
> > #endif
> >
> > - if (IS_ENABLED(CONFIG_X86_5LEVEL) && (native_read_cr4() & X86_CR4_LA57))
> > - setup_force_cpu_cap(X86_FEATURE_5LEVEL_PAGING);
> > -
> > detect_nopl();
> > }
> >
> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> > index 069420853304..191d5947a762 100644
> > --- a/arch/x86/kernel/head_64.S
> > +++ b/arch/x86/kernel/head_64.S
> > @@ -27,6 +27,7 @@
> > #include <asm/fixmap.h>
> > #include <asm/smp.h>
> > #include <asm/thread_info.h>
> > +#include <asm/cpufeature.h>
> >
> > /*
> > * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
> > @@ -58,6 +59,20 @@ SYM_CODE_START_NOALIGN(startup_64)
> > */
> > mov %rsi, %r15
> >
> > +#ifdef CONFIG_X86_5LEVEL
>
> Is #ifdef really needed?
>
'Really needed'? No, but setting this capability is semantically murky
if the kernel is not configured to support it. But you wouldn't get
far in that case, so it doesn't matter much in practice.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-15 8:18 ` Ingo Molnar
@ 2025-05-15 9:45 ` Ard Biesheuvel
2025-05-15 12:08 ` Ingo Molnar
0 siblings, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-15 9:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Linus Torvalds, Brian Gerst
On Thu, 15 May 2025 at 09:18, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> > On Thu, May 15, 2025 at 08:56:59AM +0200, Ingo Molnar wrote:
> > >
> > > * Ard Biesheuvel <ardb+git@google.com> wrote:
> > >
> > > > From: Ard Biesheuvel <ardb@kernel.org>
> > > >
> > > > cpu_feature_enabled() uses a ternary alternative, where the late variant
> > > > is based on code patching and the early variant accesses the capability
> > > > field in boot_cpu_data directly.
> > > >
> > > > This allows cpu_feature_enabled() to be called quite early, but it still
> > > > requires that the CPU feature detection code runs before being able to
> > > > rely on the return value of cpu_feature_enabled().
> > > >
> > > > This is a problem for the implementation of pgtable_l5_enabled(), which
> > > > is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
> > > > called extremely early. Currently, there is a hacky workaround where
> > > > some source files that may execute before (but also after) CPU feature
> > > > detection have a different version of pgtable_l5_enabled(), based on the
> > > > USE_EARLY_PGTABLE_L5 preprocessor macro.
> > > >
> > > > Instead, let's make it possible to set CPU feature arbitrarily early, so
> > > > that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
> > > > entering C code.
> > > >
> > > > This involves relying on static initialization of boot_cpu_data and the
> > > > cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
> > > > .data. This ensures that they won't be cleared along with the rest of
> > > > BSS.
> > > >
> > > > Note that forcing a capability involves setting it in both
> > > > boot_cpu_data.x86_capability[] and cpu_caps_set[].
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > > ---
> > > > arch/x86/kernel/cpu/common.c | 9 +++------
> > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > > index 6f7827015834..f6f206743d6a 100644
> > > > --- a/arch/x86/kernel/cpu/common.c
> > > > +++ b/arch/x86/kernel/cpu/common.c
> > > > @@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
> > > > }
> > > >
> > > > /* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
> > > > -__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > > -__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > > +__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > > +__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > >
> > > This change is not mentioned in the changelog AFAICS, but it should be
> > > in a separate patch anyway.
> >
> > And why not __ro_after_init?
>
> That's patch #7 :-)
>
> I got confused about that too.
>
> Patch #2 should not touch this line, and patch #7 should simply
> introduce __ro_after_init, and we are good I think.
>
This change is needed because it prevents these arrays from being
cleared along with the rest of BSS, which occurs after the startup
code executes.
So conceptually, moving these out of BSS is similar to dropping the
memset()s, and therefore this belongs in the same patch.
However, you are correct that moving these into __ro_after_init
achieves the same thing, so I will just reorder that patch with this
one, and clarify in the commit log that we are relying on the fact
that __ro_after_init is not cleared at boot time.
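As a rough illustration of what that reordering would rely on (a sketch under
the assumption that __ro_after_init data is emitted into .data..ro_after_init
and therefore, like .data, is untouched by the clear_bss() pass in head64.c):
	/* Not in .bss: bits set from the early asm path survive clear_bss(). */
	__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __ro_after_init __aligned(sizeof(unsigned long));
	__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __ro_after_init __aligned(sizeof(unsigned long));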
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
2025-05-15 7:06 ` Ingo Molnar
2025-05-15 7:45 ` Ingo Molnar
@ 2025-05-15 9:51 ` Borislav Petkov
2025-05-15 10:17 ` Ard Biesheuvel
2025-05-15 13:11 ` Borislav Petkov
3 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2025-05-15 9:51 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Wed, May 14, 2025 at 12:42:44PM +0200, Ard Biesheuvel wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index f67a93fc9391..d59bee5907e7 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -395,7 +395,7 @@
> #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
Is there any real reason to expose that flag in /proc/cpuinfo?
> #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> @@ -483,6 +483,7 @@
> #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
Or can we stick to that one?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 7:45 ` Ingo Molnar
2025-05-15 8:07 ` Kirill A. Shutemov
@ 2025-05-15 10:12 ` Ard Biesheuvel
2025-05-15 23:24 ` Sean Christopherson
1 sibling, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-15 10:12 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ard Biesheuvel, linux-kernel, x86, Linus Torvalds, Brian Gerst
On Thu, 15 May 2025 at 08:45, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Ard Biesheuvel <ardb+git@google.com> wrote:
>
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Currently, the LA57 CPU feature flag is taken to mean two different
> > things at once:
> > - whether the CPU implements the LA57 extension, and is therefore
> > capable of supporting 5 level paging;
> > - whether 5 level paging is currently in use.
> >
> > This means the LA57 capability of the hardware is hidden when a LA57
> > capable CPU is forced to run with 4 levels of paging. It also means that
> > the ordinary CPU capability detection code will happily set the LA57
> > capability and it needs to be cleared explicitly afterwards to avoid
> > inconsistencies.
> >
> > Separate the two so that the CPU hardware capability can be identified
> > unambiguously in all cases.
> >
> > To avoid breaking existing users that might assume that 5 level paging
> > is being used when the "la57" string is visible in /proc/cpuinfo,
> > repurpose that string to mean that 5-level paging is in use, and add a
> > new string la57_hw to indicate that the CPU feature is implemented
> > by the hardware.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> > arch/x86/include/asm/cpufeatures.h | 3 ++-
> > arch/x86/include/asm/page_64.h | 2 +-
> > arch/x86/include/asm/pgtable_64_types.h | 2 +-
> > arch/x86/kernel/cpu/common.c | 16 ++--------------
> > arch/x86/kvm/x86.h | 4 ++--
> > drivers/iommu/amd/init.c | 4 ++--
> > drivers/iommu/intel/svm.c | 4 ++--
> > tools/testing/selftests/kvm/x86/set_sregs_test.c | 2 +-
> > 8 files changed, 13 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index f67a93fc9391..d59bee5907e7 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -395,7 +395,7 @@
> > #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> > #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> > #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> > -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> > +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
> > #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> > #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> > #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> > @@ -483,6 +483,7 @@
> > #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> > #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> > #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> > +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
>
> So there's a new complication here, KVM doesn't like the use of
> synthetic CPU flags, for understandable reasons:
>
> inlined from ‘intel_pmu_set_msr’ at arch/x86/kvm/vmx/pmu_intel.c:369:7:
> ...
> ./arch/x86/kvm/reverse_cpuid.h:102:9: note: in expansion of macro ‘BUILD_BUG_ON’
> 102 | BUILD_BUG_ON(x86_leaf == CPUID_LNX_5);
> | ^~~~~~~~~~~~
>
> (See x86-64 allmodconfig)
>
> Even though previously X86_FEATURE_LA57 was effectively a synthetic
> CPU flag (it got artificially turned off by the Linux kernel if 5-level
> paging was disabled) ...
>
> So I think the most straightforward solution would be to do the change
> below, and pass through the LA57 flag if 5-level paging is enabled in the
> host kernel. This is similar to the firmware having turned off LA57, and
> it doesn't bring in all the early-boot complications bare metal has. It
> should also match the previous behavior I think.
>
> Thoughts?
>
> Thanks,
>
> Ingo
>
> =================>
>
> arch/x86/kvm/cpuid.c | 6 ++++++
> arch/x86/kvm/x86.h | 4 ++--
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 571c906ffcbf..d951d71aea3b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1221,6 +1221,12 @@ void kvm_set_cpu_caps(void)
> kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
> kvm_cpu_cap_clear(X86_FEATURE_RDPID);
> }
> + /*
> + * Clear the LA57 flag in the guest if the host kernel
> + * does not have 5-level paging support:
> + */
> + if (kvm_cpu_cap_has(X86_FEATURE_LA57) && !pgtable_l5_enabled())
> + kvm_cpu_cap_clear(X86_FEATURE_LA57);
> }
> EXPORT_SYMBOL_GPL(kvm_set_cpu_caps);
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index d2c093f17ae5..9dc32a409076 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -243,7 +243,7 @@ static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
>
> static inline u8 max_host_virt_addr_bits(void)
> {
> - return kvm_cpu_cap_has(X86_FEATURE_5LEVEL_PAGING) ? 57 : 48;
> + return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
Based on the comment that documents is_noncanonical_address(), it
seems to me that this should be using cpu_has(X86_FEATURE_LA57) rather
than kvm_cpu_cap_has(X86_FEATURE_LA57). Whether the host actually runs
with 5-level paging should be irrelevant; what matters is whether it
is capable of doing so.
> }
>
> /*
> @@ -603,7 +603,7 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> __reserved_bits |= X86_CR4_FSGSBASE; \
> if (!__cpu_has(__c, X86_FEATURE_PKU)) \
> __reserved_bits |= X86_CR4_PKE; \
> - if (!__cpu_has(__c, X86_FEATURE_5LEVEL_PAGING)) \
> + if (!__cpu_has(__c, X86_FEATURE_LA57)) \
> __reserved_bits |= X86_CR4_LA57; \
This could be
if (!__cpu_has(__c, X86_FEATURE_LA57) || !pgtable_l5_enabled())
__reserved_bits |= X86_CR4_LA57;
> if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
> __reserved_bits |= X86_CR4_UMIP; \
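Taken together, those two suggestions would presumably boil down to something
like the sketch below. It is illustrative only: it assumes the hardware
capability keeps the X86_FEATURE_LA57 name while the synthetic in-use flag
becomes X86_FEATURE_5LEVEL_PAGING, as in this series, and boot_cpu_has() is
used for concreteness as the "cpu_has" accessor:
static inline u8 max_host_virt_addr_bits(void)
{
	/* Key off the hardware capability, not whether the host uses LA57. */
	return boot_cpu_has(X86_FEATURE_LA57) ? 57 : 48;
}
/* ... and in the __kvm_is_valid_cr4() reserved-bits computation: */
	if (!__cpu_has(__c, X86_FEATURE_LA57) || !pgtable_l5_enabled())
		__reserved_bits |= X86_CR4_LA57;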
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 9:51 ` Borislav Petkov
@ 2025-05-15 10:17 ` Ard Biesheuvel
2025-05-15 10:39 ` Borislav Petkov
0 siblings, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-15 10:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Thu, 15 May 2025 at 10:52, Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, May 14, 2025 at 12:42:44PM +0200, Ard Biesheuvel wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index f67a93fc9391..d59bee5907e7 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -395,7 +395,7 @@
> > #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> > #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> > #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> > -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> > +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
>
> Is there any real reason to expose that flag in /proc/cpuinfo?
>
I'd lean towards not lying about whether the CPU is la57 capable in
/proc/cpuinfo if we don't have to - this flag directly reflects the
CPUID leaf.
> > #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> > #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> > #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> > @@ -483,6 +483,7 @@
> > #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> > #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> > #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> > +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
>
> Or can we stick to that one?
>
This is arguably the important one: as long as "la57" does not change
meaning, we should be fine from compatibility pov.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 10:17 ` Ard Biesheuvel
@ 2025-05-15 10:39 ` Borislav Petkov
2025-05-15 10:57 ` Ard Biesheuvel
0 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2025-05-15 10:39 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Thu, May 15, 2025 at 11:17:47AM +0100, Ard Biesheuvel wrote:
> I'd lean towards not lying about whether the CPU is la57 capable in
> /proc/cpuinfo if we don't have to - this flag directly reflects the
> CPUID leaf.
We have documentation about that whole "not lying" thing; short
version - use kcpuid or some other cpuid tool:
Documentation/arch/x86/cpuinfo.rst
> This is arguably the important one: as long as "la57" does not change
> meaning, we should be fine from compatibility pov.
Yes, those flags in /proc/cpuinfo should show that something is enabled and in
use. And it makes sense here. And you can't change that string anyway anymore.
Whether the hw is capable is not important for userspace and there are a lot
better means for it.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 10:39 ` Borislav Petkov
@ 2025-05-15 10:57 ` Ard Biesheuvel
0 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-15 10:57 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Thu, 15 May 2025 at 11:39, Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, May 15, 2025 at 11:17:47AM +0100, Ard Biesheuvel wrote:
> > I'd lean towards not lying about whether the CPU is la57 capable in
> > /proc/cpuinfo if we don't have to - this flag directly reflects the
> > CPUID leaf.
>
> We have a whole documentation about that whole "not lying" thing: short
> version - use kcpuid or some other cpuid tool:
>
> Documentation/arch/x86/cpuinfo.rst
>
> > This is arguably the important one: as long as "la57" does not change
> > meaning, we should be fine from compatibility pov.
>
> Yes, those flags in /proc/cpuinfo should show that something is enabled and in
> use. And it makes sense here. And you can't change that string anyway anymore.
>
> Whether the hw is capable is not important for userspace and there are a lot
> better means for it.
>
Fair enough.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code
2025-05-15 9:43 ` Ard Biesheuvel
@ 2025-05-15 11:05 ` Kirill A. Shutemov
0 siblings, 0 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2025-05-15 11:05 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ard Biesheuvel, linux-kernel, x86, Ingo Molnar, Linus Torvalds,
Brian Gerst
On Thu, May 15, 2025 at 10:43:27AM +0100, Ard Biesheuvel wrote:
> On Thu, 15 May 2025 at 09:00, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Wed, May 14, 2025 at 12:42:47PM +0200, Ard Biesheuvel wrote:
> > > From: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > In order for pgtable_l5_enabled() to be reliable wherever it is used and
> > > however early, set the associated CPU capability from asm code before
> > > entering the startup C code.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > > arch/x86/include/asm/cpufeature.h | 12 +++++++++---
> > > arch/x86/kernel/cpu/common.c | 3 ---
> > > arch/x86/kernel/head_64.S | 15 +++++++++++++++
> > > 3 files changed, 24 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> > > index 893cbca37fe9..1b5de40e7bf7 100644
> > > --- a/arch/x86/include/asm/cpufeature.h
> > > +++ b/arch/x86/include/asm/cpufeature.h
> > > @@ -2,10 +2,10 @@
> > > #ifndef _ASM_X86_CPUFEATURE_H
> > > #define _ASM_X86_CPUFEATURE_H
> > >
> > > +#ifdef __KERNEL__
> > > +#ifndef __ASSEMBLER__
> > > #include <asm/processor.h>
> > >
> > > -#if defined(__KERNEL__) && !defined(__ASSEMBLER__)
> > > -
> > > #include <asm/asm.h>
> > > #include <linux/bitops.h>
> > > #include <asm/alternative.h>
> > > @@ -137,5 +137,11 @@ static __always_inline bool _static_cpu_has(u16 bit)
> > > #define CPU_FEATURE_TYPEVAL boot_cpu_data.x86_vendor, boot_cpu_data.x86, \
> > > boot_cpu_data.x86_model
> > >
> > > -#endif /* defined(__KERNEL__) && !defined(__ASSEMBLER__) */
> > > +#else /* !defined(__ASSEMBLER__) */
> > > + .macro setup_force_cpu_cap, cap:req
> > > + btsl $\cap % 32, boot_cpu_data+CPUINFO_x86_capability+4*(\cap / 32)(%rip)
> > > + btsl $\cap % 32, cpu_caps_set+4*(\cap / 32)(%rip)
> > > + .endm
> > > +#endif /* !defined(__ASSEMBLER__) */
> > > +#endif /* defined(__KERNEL__) */
> > > #endif /* _ASM_X86_CPUFEATURE_H */
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index f6f206743d6a..c8954dc2fb26 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -1752,9 +1752,6 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
> > > setup_clear_cpu_cap(X86_FEATURE_PCID);
> > > #endif
> > >
> > > - if (IS_ENABLED(CONFIG_X86_5LEVEL) && (native_read_cr4() & X86_CR4_LA57))
> > > - setup_force_cpu_cap(X86_FEATURE_5LEVEL_PAGING);
> > > -
> > > detect_nopl();
> > > }
> > >
> > > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> > > index 069420853304..191d5947a762 100644
> > > --- a/arch/x86/kernel/head_64.S
> > > +++ b/arch/x86/kernel/head_64.S
> > > @@ -27,6 +27,7 @@
> > > #include <asm/fixmap.h>
> > > #include <asm/smp.h>
> > > #include <asm/thread_info.h>
> > > +#include <asm/cpufeature.h>
> > >
> > > /*
> > > * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
> > > @@ -58,6 +59,20 @@ SYM_CODE_START_NOALIGN(startup_64)
> > > */
> > > mov %rsi, %r15
> > >
> > > +#ifdef CONFIG_X86_5LEVEL
> >
> > Is #ifdef really needed?
> >
>
> 'Really needed'? No, but setting this capability is semantically murky
> if the kernel is not configured to support it. But you wouldn't get
> far in that case, so it doesn't matter much in practice.
If we get here with CR4.LA57=1 and CONFIG_X86_5LEVEL=n, we are screwed.
Something is very broken in the stub or the decompressor.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early
2025-05-15 9:45 ` Ard Biesheuvel
@ 2025-05-15 12:08 ` Ingo Molnar
0 siblings, 0 replies; 36+ messages in thread
From: Ingo Molnar @ 2025-05-15 12:08 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Kirill A. Shutemov, Ard Biesheuvel, linux-kernel, x86,
Linus Torvalds, Brian Gerst
* Ard Biesheuvel <ardb@kernel.org> wrote:
> On Thu, 15 May 2025 at 09:18, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > > On Thu, May 15, 2025 at 08:56:59AM +0200, Ingo Molnar wrote:
> > > >
> > > > * Ard Biesheuvel <ardb+git@google.com> wrote:
> > > >
> > > > > From: Ard Biesheuvel <ardb@kernel.org>
> > > > >
> > > > > cpu_feature_enabled() uses a ternary alternative, where the late variant
> > > > > is based on code patching and the early variant accesses the capability
> > > > > field in boot_cpu_data directly.
> > > > >
> > > > > This allows cpu_feature_enabled() to be called quite early, but it still
> > > > > requires that the CPU feature detection code runs before being able to
> > > > > rely on the return value of cpu_feature_enabled().
> > > > >
> > > > > This is a problem for the implementation of pgtable_l5_enabled(), which
> > > > > is based on cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING), and may be
> > > > > called extremely early. Currently, there is a hacky workaround where
> > > > > some source files that may execute before (but also after) CPU feature
> > > > > detection have a different version of pgtable_l5_enabled(), based on the
> > > > > USE_EARLY_PGTABLE_L5 preprocessor macro.
> > > > >
> > > > > Instead, let's make it possible to set CPU feature arbitrarily early, so
> > > > > that the X86_FEATURE_5LEVEL_PAGING capability can be set before even
> > > > > entering C code.
> > > > >
> > > > > This involves relying on static initialization of boot_cpu_data and the
> > > > > cpu_caps_set/cpu_caps_cleared arrays, so they all need to reside in
> > > > > .data. This ensures that they won't be cleared along with the rest of
> > > > > BSS.
> > > > >
> > > > > Note that forcing a capability involves setting it in both
> > > > > boot_cpu_data.x86_capability[] and cpu_caps_set[].
> > > > >
> > > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > > > ---
> > > > > arch/x86/kernel/cpu/common.c | 9 +++------
> > > > > 1 file changed, 3 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > > > index 6f7827015834..f6f206743d6a 100644
> > > > > --- a/arch/x86/kernel/cpu/common.c
> > > > > +++ b/arch/x86/kernel/cpu/common.c
> > > > > @@ -704,8 +704,8 @@ static const char *table_lookup_model(struct cpuinfo_x86 *c)
> > > > > }
> > > > >
> > > > > /* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
> > > > > -__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > > > -__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > > > +__u32 __read_mostly cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > > > +__u32 __read_mostly cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
> > > >
> > > > This change is not mentioned in the changelog AFAICS, but it should be
> > > > in a separate patch anyway.
> > >
> > > And why not __ro_after_init?
> >
> > That's patch #7 :-)
> >
> > I got confused about that too.
> >
> > Patch #2 should not touch this line, and patch #7 should simply
> > introduce __ro_after_init, and we are good I think.
> >
>
> This change is needed because it prevents these arrays from being
> cleared along with the rest of BSS, which occurs after the startup
> code executes.
I see, it's a correctness & bisectability aspect, even if it's
obsoleted later on. Good point, objection withdrawn.
> So conceptually, moving these out of BSS is similar to dropping the
> memset()s, and therefore this belongs in the same patch.
>
> However, you are correct that moving these into __ro_after_init
> achieves the same thing, so I will just reorder that patch with this
> one, and clarify in the commit log that we are relying on the fact
> that __ro_after_init is not cleared at boot time.
That works fine for me too - whichever order you prefer.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
` (2 preceding siblings ...)
2025-05-15 9:51 ` Borislav Petkov
@ 2025-05-15 13:11 ` Borislav Petkov
2025-05-15 13:33 ` Ard Biesheuvel
2025-05-15 18:20 ` Shivank Garg
3 siblings, 2 replies; 36+ messages in thread
From: Borislav Petkov @ 2025-05-15 13:11 UTC (permalink / raw)
To: Ard Biesheuvel, Shivank Garg
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst, Rao, Bharata Bhasker
On Wed, May 14, 2025 at 12:42:44PM +0200, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Currently, the LA57 CPU feature flag is taken to mean two different
> things at once:
> - whether the CPU implements the LA57 extension, and is therefore
> capable of supporting 5 level paging;
> - whether 5 level paging is currently in use.
Btw, that gunk:
We had started simplifying the whole 5-level crap:
https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com/
Shivank, I hear the performance issues got resolved in the meantime?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 13:11 ` Borislav Petkov
@ 2025-05-15 13:33 ` Ard Biesheuvel
2025-05-17 16:59 ` David Laight
2025-05-15 18:20 ` Shivank Garg
1 sibling, 1 reply; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-15 13:33 UTC (permalink / raw)
To: Borislav Petkov
Cc: Ard Biesheuvel, Shivank Garg, linux-kernel, x86, Ingo Molnar,
Linus Torvalds, Brian Gerst, Rao, Bharata Bhasker
On Thu, 15 May 2025 at 14:11, Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, May 14, 2025 at 12:42:44PM +0200, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Currently, the LA57 CPU feature flag is taken to mean two different
> > things at once:
> > - whether the CPU implements the LA57 extension, and is therefore
> > capable of supporting 5 level paging;
> > - whether 5 level paging is currently in use.
>
> Btw, that gunk:
>
> We had started simplifying the whole 5-level crap:
>
> https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com/
>
> Shivank, I hear the performance issues got resolved in the meantime?
>
It would be interesting to know whether CONFIG_X86_5LEVEL=n deviates
from CONFIG_X86_5LEVEL=y with 'no5lvl' on the command line. If passing
'no5lvl' makes up for the performance hit, then I don't think the
performance issues should stop us from removing this Kconfig symbol.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 13:11 ` Borislav Petkov
2025-05-15 13:33 ` Ard Biesheuvel
@ 2025-05-15 18:20 ` Shivank Garg
2025-05-15 19:11 ` Borislav Petkov
1 sibling, 1 reply; 36+ messages in thread
From: Shivank Garg @ 2025-05-15 18:20 UTC (permalink / raw)
To: Borislav Petkov, Ard Biesheuvel
Cc: linux-kernel, x86, Ard Biesheuvel, Ingo Molnar, Linus Torvalds,
Brian Gerst, Rao, Bharata Bhasker, Kirill A. Shutemov,
Dave Hansen, Peter Zijlstra, Thomas Gleixner
On 5/15/2025 6:41 PM, Borislav Petkov wrote:
> On Wed, May 14, 2025 at 12:42:44PM +0200, Ard Biesheuvel wrote:
>> From: Ard Biesheuvel <ardb@kernel.org>
>>
>> Currently, the LA57 CPU feature flag is taken to mean two different
>> things at once:
>> - whether the CPU implements the LA57 extension, and is therefore
>> capable of supporting 5 level paging;
>> - whether 5 level paging is currently in use.
>
> Btw, that gunk:
>
> We had started simplifying the whole 5-level crap:
>
> https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com/
>
> Shivank, I hear the performance issues got resolved in the meantime?
>
> Thx.
>
Hi Boris,
I've re-run the tests for the 5-level paging performance concerns we discussed earlier.
Recent tests on a current kernel don't show any performance issues:
AMD EPYC Zen 5 (SMT enabled).
Linux HEAD 6.15.0-rc6+ 088d13246a46
lmbench/lat_pagefault:
numactl --membind=1 --cpunodebind=1 bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_randomfile
Output values (50 runs, mean with 2.5th/97.5th percentiles, in microseconds):

Configuration                   Mean       2.5%       97.5%
4-level (no5lvl option)         0.138876   0.1384988  0.1392532
4-level (CONFIG_X86_5LEVEL=n)   0.137958   0.1376473  0.1382687
5-level                         0.138904   0.1384789  0.1393291
After repeating the experiments a few times, the observed difference (~1%) in mean values
is within the noise.
I think these results address the performance concerns previously raised[1]. I don't
foresee any issues in proceeding with the 5-level paging implementation
simplification efforts[2].
[1] https://lore.kernel.org/all/80734605-1926-4ac7-9c63-006fe3ea6b6a@amd.com
[2] https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com
Thanks,
Shivank
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 18:20 ` Shivank Garg
@ 2025-05-15 19:11 ` Borislav Petkov
2025-05-16 9:17 ` Kirill A. Shutemov
0 siblings, 1 reply; 36+ messages in thread
From: Borislav Petkov @ 2025-05-15 19:11 UTC (permalink / raw)
To: Shivank Garg, Kirill A. Shutemov
Cc: Ard Biesheuvel, linux-kernel, x86, Ard Biesheuvel, Ingo Molnar,
Linus Torvalds, Brian Gerst, Rao, Bharata Bhasker, Dave Hansen,
Peter Zijlstra, Thomas Gleixner
On Thu, May 15, 2025 at 11:50:17PM +0530, Shivank Garg wrote:
> I've re-tested the performance concerns we discussed earlier regarding 5-level paging.
> Recent tests on a current kernel don't show any performance issues:
>
> AMD EPYC Zen 5 (SMT enabled).
> Linux HEAD 6.15.0-rc6+ 088d13246a46
>
> lmbench/lat_pagefault:
> numactl --membind=1 --cpunodebind=1 bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_randomfile
>
> Output values (50 runs, Mean, 2.5 percentile and 97.5 percentile, in microseconds):
>
> 4-level (no5lvl option)
> Mean: 0.138876
> 2.5% 97.5%
> 0.1384988 0.1392532
>
> 4-level (CONFIG_X86_5LEVEL=n)
> Mean: 0.137958
> 2.5% 97.5%
> 0.1376473 0.1382687
>
> 5-level
> Mean: 0.138904
> 2.5% 97.5%
> 0.1384789 0.1393291
>
> After repeating the experiments a few times, the observed difference(~1%) in mean values
> is under noise levels.
> I think these results address the performance concerns previously raised[1]. I don't
> foresee any issues in proceeding with the 5-level paging implementation
> simplification efforts[2].
>
> [1] https://lore.kernel.org/all/80734605-1926-4ac7-9c63-006fe3ea6b6a@amd.com
> [2] https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com
I guess Kirill could dust off his patchset from [2] and that would get rid of
CONFIG_X86_5LEVEL and likely simplify that aspect considerably...
I'd say.
And then Ard's patches would get even simpler...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 10:12 ` Ard Biesheuvel
@ 2025-05-15 23:24 ` Sean Christopherson
2025-05-16 8:31 ` Ard Biesheuvel
0 siblings, 1 reply; 36+ messages in thread
From: Sean Christopherson @ 2025-05-15 23:24 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Ingo Molnar, Ard Biesheuvel, linux-kernel, x86, Linus Torvalds,
Brian Gerst
On Thu, May 15, 2025, Ard Biesheuvel wrote:
> On Thu, 15 May 2025 at 08:45, Ingo Molnar <mingo@kernel.org> wrote:
> > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > > index f67a93fc9391..d59bee5907e7 100644
> > > --- a/arch/x86/include/asm/cpufeatures.h
> > > +++ b/arch/x86/include/asm/cpufeatures.h
> > > @@ -395,7 +395,7 @@
> > > #define X86_FEATURE_AVX512_BITALG (16*32+12) /* "avx512_bitalg" Support for VPOPCNT[B,W] and VPSHUF-BITQMB instructions */
> > > #define X86_FEATURE_TME (16*32+13) /* "tme" Intel Total Memory Encryption */
> > > #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* "avx512_vpopcntdq" POPCNT for vectors of DW/QW */
> > > -#define X86_FEATURE_LA57 (16*32+16) /* "la57" 5-level page tables */
> > > +#define X86_FEATURE_LA57 (16*32+16) /* "la57_hw" 5-level page tables */
> > > #define X86_FEATURE_RDPID (16*32+22) /* "rdpid" RDPID instruction */
> > > #define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* "bus_lock_detect" Bus Lock detect */
> > > #define X86_FEATURE_CLDEMOTE (16*32+25) /* "cldemote" CLDEMOTE instruction */
> > > @@ -483,6 +483,7 @@
> > > #define X86_FEATURE_PREFER_YMM (21*32+ 8) /* Avoid ZMM registers due to downclocking */
> > > #define X86_FEATURE_APX (21*32+ 9) /* Advanced Performance Extensions */
> > > #define X86_FEATURE_INDIRECT_THUNK_ITS (21*32+10) /* Use thunk for indirect branches in lower half of cacheline */
> > > +#define X86_FEATURE_5LEVEL_PAGING (21*32+11) /* "la57" Whether 5 levels of page tables are in use */
> >
> > So there's a new complication here: KVM doesn't like the use of
> > synthetic CPU flags, for understandable reasons:
> >
> > inlined from ‘intel_pmu_set_msr’ at arch/x86/kvm/vmx/pmu_intel.c:369:7:
> > ...
> > ./arch/x86/kvm/reverse_cpuid.h:102:9: note: in expansion of macro ‘BUILD_BUG_ON’
> > 102 | BUILD_BUG_ON(x86_leaf == CPUID_LNX_5);
> > | ^~~~~~~~~~~~
> >
> > (See x86-64 allmodconfig)
> >
> > Even though previously X86_FEATURE_LA57 was effectively a synthetic
> > CPU flag (it got artificially turned off by the Linux kernel if 5-level
> > paging was disabled) ...
KVM doesn't care if the kernel clears CPUID feature flags. In fact, KVM depends
on it in many cases. What KVM cares about is that the bit number matches CPUID
(the hardware-defined bit) for features that are exposed to guests.
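To make that constraint concrete, here is a minimal sketch of the mapping being relied on (the struct and array below are illustrative assumptions, condensed from the idea behind arch/x86/kvm/reverse_cpuid.h rather than copied from it):

#include <linux/types.h>

/*
 * Every hardware-defined feature word can be traced back to the CPUID
 * leaf and output register it was filled from, so the bit offset inside
 * the word must equal the architectural bit.  X86_FEATURE_LA57 is word 16,
 * bit 16, i.e. CPUID.(EAX=7,ECX=0):ECX[16].  A Linux-defined word such as
 * word 21 (where X86_FEATURE_5LEVEL_PAGING lives) has no CPUID origin to
 * map back to, which is exactly what the BUILD_BUG_ON above catches.
 */
struct cpuid_reg_sketch {
	u32 function;	/* CPUID leaf (EAX input) */
	u32 index;	/* subleaf (ECX input) */
	u32 reg;	/* output register: 0=EAX, 1=EBX, 2=ECX, 3=EDX */
};

static const struct cpuid_reg_sketch reverse_cpuid_sketch[] = {
	[16] = { .function = 7, .index = 0, .reg = 2 },	/* word 16 -> CPUID.7.0:ECX */
	/* word 21 (Linux-defined) deliberately has no entry */
};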
> > So I think the most straightforward solution would be to do the change
> > below, and pass the LA57 flag through if 5-level paging is enabled in the
> > host kernel. This is similar to the firmware turning off LA57, and it
> > doesn't bring in all the early-boot complications bare metal has. It
> > should also match the previous behavior, I think.
> >
> > Thoughts?
> >
> > Thanks,
> >
> > Ingo
> >
> > =================>
> >
> > arch/x86/kvm/cpuid.c | 6 ++++++
> > arch/x86/kvm/x86.h | 4 ++--
> > 2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 571c906ffcbf..d951d71aea3b 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1221,6 +1221,12 @@ void kvm_set_cpu_caps(void)
> > kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
> > kvm_cpu_cap_clear(X86_FEATURE_RDPID);
> > }
> > + /*
> > + * Clear the LA57 flag in the guest if the host kernel
> > + * does not have 5-level paging support:
No, KVM very intentionally goes out of its way to support LA57 in the guest
irrespective of host kernel support. I.e. KVM already looks at CPUID directly
and supports virtualizing LA57 if it's supported in hardware; the kernel's
feature flags are ignored (for LA57).
> > + */
> > + if (kvm_cpu_cap_has(X86_FEATURE_LA57) && !pgtable_l5_enabled())
> > + kvm_cpu_cap_clear(X86_FEATURE_LA57);
> > }
> > EXPORT_SYMBOL_GPL(kvm_set_cpu_caps);
> >
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index d2c093f17ae5..9dc32a409076 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -243,7 +243,7 @@ static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
> >
> > static inline u8 max_host_virt_addr_bits(void)
> > {
> > - return kvm_cpu_cap_has(X86_FEATURE_5LEVEL_PAGING) ? 57 : 48;
> > + return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
>
> Based on the comment that documents is_noncanonical_address(), it
> seems to me that this should be using cpu_has(X86_FEATURE_LA57) rather
> than kvm_cpu_cap_has(X86_FEATURE_LA57). Whether the host actually runs
> with 5-level paging should be irrelevant; it only matters whether it
> is capable of doing so.
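A minimal sketch of the canonicality rule this hinges on (illustrative only, using the usual x86 definition rather than KVM's actual helpers):

#include <linux/types.h>

/*
 * An address is canonical on a CPU implementing N virtual-address bits iff
 * sign-extending it from bit N-1 reproduces the value.  N is 57 whenever
 * the hardware supports LA57 -- independent of whether the host kernel
 * runs with 5 paging levels -- and 48 otherwise, which is the "capable,
 * not in use" distinction being argued above.
 */
static bool is_canonical_sketch(u64 vaddr, unsigned int vaddr_bits)
{
	unsigned int shift = 64 - vaddr_bits;

	return (u64)((s64)(vaddr << shift) >> shift) == vaddr;
}

/*
 * e.g. is_canonical_sketch(0xffff800000000000, 48) is true, while
 *      is_canonical_sketch(0x0000800000000000, 48) is false.
 */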
Ard's patches are completely wrong for KVM and, amusingly, completely unnecessary.
Simply drop all of the KVM changes, including the selftest change. And if you
want to save yourselves some time in the future, use scripts/get_maintainer.pl.
> > }
> >
> > /*
> > @@ -603,7 +603,7 @@ static inline bool __kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > __reserved_bits |= X86_CR4_FSGSBASE; \
> > if (!__cpu_has(__c, X86_FEATURE_PKU)) \
> > __reserved_bits |= X86_CR4_PKE; \
> > - if (!__cpu_has(__c, X86_FEATURE_5LEVEL_PAGING)) \
> > + if (!__cpu_has(__c, X86_FEATURE_LA57)) \
> > __reserved_bits |= X86_CR4_LA57; \
>
> This could be
>
> if (!__cpu_has(__c, X86_FEATURE_LA57) || !pgtable_l5_enabled())
> __reserved_bits |= X86_CR4_LA57;
No, all of this is wrong. __kvm_is_valid_cr4() is used to check guest CR4 against
KVM's supported set of CPUID features, and also to check guest CR4 against the
vCPU's supported set of features.
>
> > if (!__cpu_has(__c, X86_FEATURE_UMIP)) \
> > __reserved_bits |= X86_CR4_UMIP; \
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 23:24 ` Sean Christopherson
@ 2025-05-16 8:31 ` Ard Biesheuvel
0 siblings, 0 replies; 36+ messages in thread
From: Ard Biesheuvel @ 2025-05-16 8:31 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ingo Molnar, Ard Biesheuvel, linux-kernel, x86, Linus Torvalds,
Brian Gerst
On Fri, 16 May 2025 at 00:24, Sean Christopherson <seanjc@google.com> wrote:
>
...
> Ard's patches are completely wrong for KVM and, amusingly, completely unnecessary.
>
> Simply drop all of the KVM changes, including the selftest change. And if you
> want to save yourselves some time in the future, use scripts/get_maintainer.pl.
>
Understood. Thanks for taking a look.
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 19:11 ` Borislav Petkov
@ 2025-05-16 9:17 ` Kirill A. Shutemov
0 siblings, 0 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2025-05-16 9:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Shivank Garg, Ard Biesheuvel, linux-kernel, x86, Ard Biesheuvel,
Ingo Molnar, Linus Torvalds, Brian Gerst, Rao, Bharata Bhasker,
Dave Hansen, Peter Zijlstra, Thomas Gleixner
On Thu, May 15, 2025 at 09:11:31PM +0200, Borislav Petkov wrote:
> On Thu, May 15, 2025 at 11:50:17PM +0530, Shivank Garg wrote:
> > I've re-tested the performance concerns we discussed earlier regarding 5-level paging.
> > Recent tests on a current kernel don't show any performance issues:
> >
> > AMD EPYC Zen 5 (SMT enabled).
> > Linux HEAD 6.15.0-rc6+ 088d13246a46
> >
> > lmbench/lat_pagefault:
> > numactl --membind=1 --cpunodebind=1 bin/x86_64-linux-gnu/lat_pagefault -N 100 1GB_randomfile
> >
> > Output values (50 runs, Mean, 2.5 percentile and 97.5 percentile, in microseconds):
> >
> > 4-level (no5lvl option)
> > Mean: 0.138876
> > 2.5% 97.5%
> > 0.1384988 0.1392532
> >
> > 4-level (CONFIG_X86_5LEVEL=n)
> > Mean: 0.137958
> > 2.5% 97.5%
> > 0.1376473 0.1382687
> >
> > 5-level
> > Mean: 0.138904
> > 2.5% 97.5%
> > 0.1384789 0.1393291
> >
> > After repeating the experiments a few times, the observed difference (~1%) in the mean values
> > is within the noise.
> > I think these results address the performance concerns previously raised[1]. I don't
> > foresee any issues in proceeding with the 5-level paging implementation
> > simplification efforts[2].
> >
> > [1] https://lore.kernel.org/all/80734605-1926-4ac7-9c63-006fe3ea6b6a@amd.com
> > [2] https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com
>
> I guess Kirill could dust off his patchset from [2] and that would get rid of
> CONFIG_X86_5LEVEL and likely simplify that aspect considerably...
https://lore.kernel.org/all/20250516091534.3414310-1-kirill.shutemov@linux.intel.com/
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging
2025-05-15 13:33 ` Ard Biesheuvel
@ 2025-05-17 16:59 ` David Laight
0 siblings, 0 replies; 36+ messages in thread
From: David Laight @ 2025-05-17 16:59 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Borislav Petkov, Ard Biesheuvel, Shivank Garg, linux-kernel, x86,
Ingo Molnar, Linus Torvalds, Brian Gerst, Rao, Bharata Bhasker
On Thu, 15 May 2025 14:33:50 +0100
Ard Biesheuvel <ardb@kernel.org> wrote:
> On Thu, 15 May 2025 at 14:11, Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Wed, May 14, 2025 at 12:42:44PM +0200, Ard Biesheuvel wrote:
> > > From: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > Currently, the LA57 CPU feature flag is taken to mean two different
> > > things at once:
> > > - whether the CPU implements the LA57 extension, and is therefore
> > > capable of supporting 5 level paging;
> > > - whether 5 level paging is currently in use.
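As a rough sketch of the split being described (condensed from the hunks quoted earlier in the thread; the pgtable_l5_enabled() definition is an assumption about where the series ends up, not a quote of the patch):

/*
 *   X86_FEATURE_LA57           "la57_hw"  - the CPU implements LA57,
 *                                           reported by CPUID.(7,0):ECX[16]
 *   X86_FEATURE_5LEVEL_PAGING  "la57"     - 5-level paging is actually in
 *                                           use, set by the boot code
 *
 * so the late variant of pgtable_l5_enabled() can presumably become
 * something like:
 */
#define pgtable_l5_enabled()	cpu_feature_enabled(X86_FEATURE_5LEVEL_PAGING)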
> >
> > Btw, that gunk:
> >
> > We had started simplifying the whole 5-level crap:
> >
> > https://lore.kernel.org/all/20240621164406.256314-1-kirill.shutemov@linux.intel.com/
> >
> > Shivank, I hear the performance issues got resolved in the meantime?
> >
>
> It would be interesting to know whether CONFIG_X86_5LEVEL=n deviates
> from CONFIG_X86_5LEVEL=y with 'no5lvl' on the command line. If passing
> 'no5lvl' makes up for the performance hit, then I don't think the
> performance issues should stop us from removing this Kconfig symbol.
You might then want a Kconfig option to invert the default for the
command line option (and an inverted command line option).
David
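A minimal sketch of that suggestion (every name below -- the config symbol, the second parameter, the handlers -- is hypothetical, invented only to show the shape; the real "no5lvl" handling lives in early CPU setup, not in an early_param):

#include <linux/init.h>
#include <linux/kconfig.h>
#include <linux/types.h>

/* Hypothetical: Kconfig picks the default, the command line overrides it. */
static bool want_la57 __initdata = IS_ENABLED(CONFIG_X86_5LEVEL_DEFAULT_ON);

static int __init no5lvl_setup(char *s)
{
	want_la57 = false;		/* the existing "no5lvl" direction */
	return 0;
}
early_param("no5lvl", no5lvl_setup);

static int __init la57_setup(char *s)
{
	want_la57 = true;		/* the inverted override David mentions */
	return 0;
}
early_param("la57", la57_setup);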
end of thread
Thread overview: 36+ messages
2025-05-14 10:42 [PATCH v3 0/7] x86: Robustify pgtable_l5_enabled() Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 1/7] x86/cpu: Use a new feature flag for 5 level paging Ard Biesheuvel
2025-05-15 7:06 ` Ingo Molnar
2025-05-15 7:45 ` Ingo Molnar
2025-05-15 8:07 ` Kirill A. Shutemov
2025-05-15 8:22 ` Ingo Molnar
2025-05-15 10:12 ` Ard Biesheuvel
2025-05-15 23:24 ` Sean Christopherson
2025-05-16 8:31 ` Ard Biesheuvel
2025-05-15 9:51 ` Borislav Petkov
2025-05-15 10:17 ` Ard Biesheuvel
2025-05-15 10:39 ` Borislav Petkov
2025-05-15 10:57 ` Ard Biesheuvel
2025-05-15 13:11 ` Borislav Petkov
2025-05-15 13:33 ` Ard Biesheuvel
2025-05-17 16:59 ` David Laight
2025-05-15 18:20 ` Shivank Garg
2025-05-15 19:11 ` Borislav Petkov
2025-05-16 9:17 ` Kirill A. Shutemov
2025-05-14 10:42 ` [PATCH v3 2/7] x86/cpu: Allow caps to be set arbitrarily early Ard Biesheuvel
2025-05-15 6:56 ` Ingo Molnar
2025-05-15 7:50 ` Ingo Molnar
2025-05-15 7:55 ` Kirill A. Shutemov
2025-05-15 8:18 ` Ingo Molnar
2025-05-15 9:45 ` Ard Biesheuvel
2025-05-15 12:08 ` Ingo Molnar
2025-05-14 10:42 ` [PATCH v3 3/7] x86/asm-offsets: Export struct cpuinfo_x86 layout for asm use Ard Biesheuvel
2025-05-15 7:10 ` Ingo Molnar
2025-05-15 7:58 ` [tip: x86/core] x86/asm-offsets: Export certain 'struct cpuinfo_x86' fields for 64-bit asm use too tip-bot2 for Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 4/7] x86/boot: Set 5-level paging CPU cap before entering C code Ard Biesheuvel
2025-05-15 8:00 ` Kirill A. Shutemov
2025-05-15 9:43 ` Ard Biesheuvel
2025-05-15 11:05 ` Kirill A. Shutemov
2025-05-14 10:42 ` [PATCH v3 5/7] x86/boot: Drop the early variant of pgtable_l5_enabled() Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 6/7] x86/boot: Drop 5-level paging related variables and early updates Ard Biesheuvel
2025-05-14 10:42 ` [PATCH v3 7/7] x86/cpu: Make CPU capability overrides __ro_after_init Ard Biesheuvel