* [PATCH v8 0/6] Introduce CET supervisor state support
@ 2025-05-22 15:10 Chao Gao
2025-05-22 15:10 ` [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs Chao Gao
` (7 more replies)
0 siblings, 8 replies; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Dave Hansen, Eric Biggers,
H. Peter Anvin, Ingo Molnar, Kees Cook, Maxim Levitsky,
Mitchell Levy, Nikolay Borisov, Oleg Nesterov, Sohil Mehta,
Stanislav Spassov, Vignesh Balasubramanian
Dear maintainers and reviewers,
I kindly request your consideration for merging this series. Most of
patches have received Reviewed-by/Acked-by tags.
Thanks Chang, Rick, Xin, Sean and Dave for their help with this series.
== Changelog ==
v7->v8:
- refine the comment in __fpstate_reset() (Sean)
- provide helpers to provide default feature masks for host and
guest FPUs (Sean)
- v7: https://lore.kernel.org/kvm/20250512085735.564475-1-chao.gao@intel.com/
== Background ==
CET defines two register states: CET user, which includes user-mode control
registers, and CET supervisor, which consists of shadow-stack pointers for
privilege levels 0-2.
Current kernel disables shadow stacks in kernel mode, making the CET
supervisor state unused and eliminating the need for context switching.
== Problem ==
To virtualize CET for guests, KVM must accurately emulate hardware
behavior. A key challenge arises because there is no CPUID flag to indicate
that shadow stack is supported only in user mode. Therefore, KVM cannot
assume guests will not enable shadow stacks in kernel mode and must
preserve the CET supervisor state of vCPUs.
== Solution ==
An initial proposal to manually save and restore CET supervisor states
using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and
its impact on KVM's ABI. Instead, leveraging the kernel's FPU
infrastructure for context switching was favored [1].
The main question then became whether to enable the CET supervisor state
globally for all processes or restrict it to vCPU processes. This decision
involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU
processes and approximately 100 lines of code complexity in the kernel [2].
The agreed approach is to first try this optimal solution [3], i.e.,
restricting the CET supervisor state to guest FPUs only and eliminating
unnecessary space waste.
Key changes in this series are:
1) Fix existing issue regarding enabling guest supervisor states support.
2) Add default features and size for guest FPUs.
3) Add infrastructure to support guest-only features.
4) Add CET supervisor state as the first guest-only feature.
With the series in place, guest FPUs have xstate_bv[12] == xcomp_bv[12] == 1
and CET supervisor state is saved/reloaded when xsaves/xrstors executes on
guest FPUs. non-guest FPUs have xstate_bv[12] == xcomp_bv[12] == 0, then
CET supervisor state is not saved/restored.
== Performance ==
We measured context-switching performance with the benchmark [4] in following
three cases.
case 1: the baseline. i.e., this series isn't applied
case 2: baseline + this series. CET-S space is allocated for guest fpu only.
case 3: baseline + allocate CET-S space for all tasks. Hardware init
optimization avoids writing out CET-S space on each XSAVES.
The performance differences in the three cases are very small and fall within the
run-to-run variation.
[1]: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/
[2]: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/
[3]: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/
[4]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/context_switch1.c
Chao Gao (4):
x86/fpu/xstate: Differentiate default features for host and guest FPUs
x86/fpu: Initialize guest FPU permissions from guest defaults
x86/fpu: Initialize guest fpstate and FPU pseudo container from guest
defaults
x86/fpu: Remove xfd argument from __fpstate_reset()
Yang Weijiang (2):
x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only
feature
arch/x86/include/asm/fpu/types.h | 49 ++++++++++++++++++++++++----
arch/x86/include/asm/fpu/xstate.h | 9 ++++--
arch/x86/kernel/fpu/core.c | 53 +++++++++++++++++++++++--------
arch/x86/kernel/fpu/init.c | 1 +
arch/x86/kernel/fpu/xstate.c | 40 +++++++++++++++++++----
5 files changed, 122 insertions(+), 30 deletions(-)
base-commit: 5d7e238ec229cadaeda63b5f96b4ee90bac489e4
--
2.47.1
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
@ 2025-05-22 15:10 ` Chao Gao
2025-05-29 20:59 ` John Allen
2025-05-22 15:10 ` [PATCH v8 2/6] x86/fpu: Initialize guest FPU permissions from guest defaults Chao Gao
` (6 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Ingo Molnar, Dave Hansen,
H. Peter Anvin, Maxim Levitsky, Mitchell Levy, Kees Cook,
Stanislav Spassov, Eric Biggers, Oleg Nesterov, Nikolay Borisov,
Vignesh Balasubramanian
Currently, guest and host FPUs share the same default features. However,
the CET supervisor xstate is the first feature that needs to be enabled
exclusively for guest FPUs. Enabling it for host FPUs leads to a waste of
24 bytes in the XSAVE buffer.
To support "guest-only" features, add a new structure to hold the
default features and sizes for guest FPUs to clearly differentiate them
from those for host FPUs.
Add two helpers to provide the default feature masks for guest and host
FPUs. Default features are derived by applying the masks to the maximum
supported features.
Note that,
1) for now, guest_default_mask() and host_default_mask() are identical.
This will change in a follow-up patch once guest permissions, default
xfeatures, and fpstate size are all converted to use the guest defaults.
2) only supervisor features will diverge between guest FPUs and host
FPUs, while user features will remain the same [1][2]. So, the new
vcpu_fpu_config struct does not include default user features and size
for the UABI buffer.
An alternative approach is adding a guest_only_xfeatures member to
fpu_kernel_cfg and adding two helper functions to calculate the guest
default xfeatures and size. However, calculating these defaults at runtime
would introduce unnecessary overhead.
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/kvm/aAwdQ759Y6V7SGhv@google.com/ [1]
Link: https://lore.kernel.org/kvm/9ca17e1169805f35168eb722734fbf3579187886.camel@intel.com/ [2]
---
v8:
provide helpers to provide the default masks (Sean)
v6:
Drop vcpu_fpu_config.user_* (Rick)
Reset guest default size when XSAVE is unavaiable or disabled (Chang)
v5:
Add a new vcpu_fpu_config instead of adding new members to
fpu_state_config (Chang)
Extract a helper to set default values (Chang)
---
arch/x86/include/asm/fpu/types.h | 26 ++++++++++++++++++++++++++
arch/x86/kernel/fpu/core.c | 1 +
arch/x86/kernel/fpu/init.c | 1 +
arch/x86/kernel/fpu/xstate.c | 32 ++++++++++++++++++++++++++------
4 files changed, 54 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 1c94121acd3d..abd193a1a52e 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -551,6 +551,31 @@ struct fpu_guest {
struct fpstate *fpstate;
};
+/*
+ * FPU state configuration data for fpu_guest.
+ * Initialized at boot time. Read only after init.
+ */
+struct vcpu_fpu_config {
+ /*
+ * @size:
+ *
+ * The default size of the register state buffer in guest FPUs.
+ * Includes all supported features except independent managed
+ * features and features which have to be requested by user space
+ * before usage.
+ */
+ unsigned int size;
+
+ /*
+ * @features:
+ *
+ * The default supported features bitmap in guest FPUs. Does not
+ * include independent managed features and features which have to
+ * be requested by user space before usage.
+ */
+ u64 features;
+};
+
/*
* FPU state configuration data. Initialized at boot time. Read only after init.
*/
@@ -606,5 +631,6 @@ struct fpu_state_config {
/* FPU state configuration information */
extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
+extern struct vcpu_fpu_config guest_default_cfg;
#endif /* _ASM_X86_FPU_TYPES_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1cda5b78540b..2cd5e1910ff8 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -36,6 +36,7 @@ DEFINE_PER_CPU(u64, xfd_state);
/* The FPU state configuration data for kernel and user space */
struct fpu_state_config fpu_kernel_cfg __ro_after_init;
struct fpu_state_config fpu_user_cfg __ro_after_init;
+struct vcpu_fpu_config guest_default_cfg __ro_after_init;
/*
* Represents the initial FPU state. It's mostly (but not completely) zeroes,
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 6bb3e35c40e2..e19660cdc70c 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -202,6 +202,7 @@ static void __init fpu__init_system_xstate_size_legacy(void)
fpu_kernel_cfg.default_size = size;
fpu_user_cfg.max_size = size;
fpu_user_cfg.default_size = size;
+ guest_default_cfg.size = size;
}
/*
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 1c8410b68108..f15be5c3f0cc 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -742,6 +742,9 @@ static int __init init_xstate_size(void)
fpu_user_cfg.default_size =
xstate_calculate_size(fpu_user_cfg.default_features, false);
+ guest_default_cfg.size =
+ xstate_calculate_size(guest_default_cfg.features, compacted);
+
return 0;
}
@@ -762,6 +765,7 @@ static void __init fpu__init_disable_system_xstate(unsigned int legacy_size)
fpu_kernel_cfg.default_size = legacy_size;
fpu_user_cfg.max_size = legacy_size;
fpu_user_cfg.default_size = legacy_size;
+ guest_default_cfg.size = legacy_size;
/*
* Prevent enabling the static branch which enables writes to the
@@ -772,6 +776,21 @@ static void __init fpu__init_disable_system_xstate(unsigned int legacy_size)
fpstate_reset(x86_task_fpu(current));
}
+static u64 __init host_default_mask(void)
+{
+ /* Exclude dynamic features, which require userspace opt-in. */
+ return ~(u64)XFEATURE_MASK_USER_DYNAMIC;
+}
+
+static u64 __init guest_default_mask(void)
+{
+ /*
+ * Exclude dynamic features, which require userspace opt-in even
+ * for KVM guests.
+ */
+ return ~(u64)XFEATURE_MASK_USER_DYNAMIC;
+}
+
/*
* Enable and initialize the xsave feature.
* Called once per system bootup.
@@ -854,12 +873,13 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
fpu_user_cfg.max_features = fpu_kernel_cfg.max_features;
fpu_user_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
- /* Clean out dynamic features from default */
- fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
- fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
-
- fpu_user_cfg.default_features = fpu_user_cfg.max_features;
- fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+ /*
+ * Now, given maximum feature set, determine default values by
+ * applying default masks.
+ */
+ fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features & host_default_mask();
+ fpu_user_cfg.default_features = fpu_user_cfg.max_features & host_default_mask();
+ guest_default_cfg.features = fpu_kernel_cfg.max_features & guest_default_mask();
/* Store it for paranoia check at the end */
xfeatures = fpu_kernel_cfg.max_features;
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v8 2/6] x86/fpu: Initialize guest FPU permissions from guest defaults
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
2025-05-22 15:10 ` [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs Chao Gao
@ 2025-05-22 15:10 ` Chao Gao
2025-05-29 21:14 ` John Allen
2025-05-22 15:10 ` [PATCH v8 3/6] x86/fpu: Initialize guest fpstate and FPU pseudo container " Chao Gao
` (5 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Ingo Molnar, Dave Hansen,
H. Peter Anvin, Stanislav Spassov, Eric Biggers, Oleg Nesterov,
Kees Cook
Currently, fpu->guest_perm is copied from fpu->perm, which is derived from
fpu_kernel_cfg.default_features.
Guest defaults were introduced to differentiate the features and sizes of
host and guest FPUs. Copying guest FPU permissions from the host will lead
to inconsistencies between the guest default features and permissions.
Initialize guest FPU permissions from guest defaults instead of host
defaults. This ensures that any changes to guest default features are
automatically reflected in guest permissions, which in turn guarantees
that fpstate_realloc() allocates a correctly sized XSAVE buffer for guest
FPUs.
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v8: Refine the comment above fpu->guest_perm.__user_state_size
v6: Drop vcpu_fpu_config.user_* and collect reviews (Rick)
---
arch/x86/kernel/fpu/core.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2cd5e1910ff8..c69432d0ee41 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -553,8 +553,14 @@ void fpstate_reset(struct fpu *fpu)
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
fpu->perm.__state_size = fpu_kernel_cfg.default_size;
fpu->perm.__user_state_size = fpu_user_cfg.default_size;
- /* Same defaults for guests */
- fpu->guest_perm = fpu->perm;
+
+ fpu->guest_perm.__state_perm = guest_default_cfg.features;
+ fpu->guest_perm.__state_size = guest_default_cfg.size;
+ /*
+ * User features and sizes are always identical between host and
+ * guest FPUs, which allows for common guest and userspace ABI.
+ */
+ fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
}
static inline void fpu_inherit_perms(struct fpu *dst_fpu)
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v8 3/6] x86/fpu: Initialize guest fpstate and FPU pseudo container from guest defaults
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
2025-05-22 15:10 ` [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs Chao Gao
2025-05-22 15:10 ` [PATCH v8 2/6] x86/fpu: Initialize guest FPU permissions from guest defaults Chao Gao
@ 2025-05-22 15:10 ` Chao Gao
2025-05-29 21:25 ` John Allen
2025-05-22 15:10 ` [PATCH v8 4/6] x86/fpu: Remove xfd argument from __fpstate_reset() Chao Gao
` (4 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Ingo Molnar, Dave Hansen,
H. Peter Anvin, Kees Cook, Stanislav Spassov, Oleg Nesterov,
Eric Biggers
fpu_alloc_guest_fpstate() currently uses host defaults to initialize guest
fpstate and pseudo containers. Guest defaults were introduced to
differentiate the features and sizes of host and guest FPUs. Switch to
using guest defaults instead.
Adjust __fpstate_reset() to handle different defaults for host and guest
FPUs. And to distinguish between the types of FPUs, move the initialization
of indicators (is_guest and is_valloc) before the reset.
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v8: tweak comment in __fpstate_reset() (Sean)
v7: tweak __fpstate_reset() instead of adding a guest-specific reset
function (Sean/Dave)
v6: Drop vcpu_fpu_config.user_* (Rick)
v5: init is_valloc/is_guest in the guest-specific reset function
(Chang)
---
arch/x86/kernel/fpu/core.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c69432d0ee41..a5a9c55fcf83 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -236,19 +236,22 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
struct fpstate *fpstate;
unsigned int size;
- size = fpu_kernel_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
+ size = guest_default_cfg.size + ALIGN(offsetof(struct fpstate, regs), 64);
+
fpstate = vzalloc(size);
if (!fpstate)
return false;
+ /* Initialize indicators to reflect properties of the fpstate */
+ fpstate->is_valloc = true;
+ fpstate->is_guest = true;
+
/* Leave xfd to 0 (the reset value defined by spec) */
__fpstate_reset(fpstate, 0);
fpstate_init_user(fpstate);
- fpstate->is_valloc = true;
- fpstate->is_guest = true;
gfpu->fpstate = fpstate;
- gfpu->xfeatures = fpu_kernel_cfg.default_features;
+ gfpu->xfeatures = guest_default_cfg.features;
/*
* KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
@@ -535,10 +538,22 @@ void fpstate_init_user(struct fpstate *fpstate)
static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
{
- /* Initialize sizes and feature masks */
- fpstate->size = fpu_kernel_cfg.default_size;
+ /*
+ * Supervisor features (and thus sizes) may diverge between guest
+ * FPUs and host FPUs, as some supervisor features are supported
+ * for guests despite not being utilized by the host. User
+ * features and sizes are always identical, which allows for
+ * common guest and userspace ABI.
+ */
+ if (fpstate->is_guest) {
+ fpstate->size = guest_default_cfg.size;
+ fpstate->xfeatures = guest_default_cfg.features;
+ } else {
+ fpstate->size = fpu_kernel_cfg.default_size;
+ fpstate->xfeatures = fpu_kernel_cfg.default_features;
+ }
+
fpstate->user_size = fpu_user_cfg.default_size;
- fpstate->xfeatures = fpu_kernel_cfg.default_features;
fpstate->user_xfeatures = fpu_user_cfg.default_features;
fpstate->xfd = xfd;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v8 4/6] x86/fpu: Remove xfd argument from __fpstate_reset()
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
` (2 preceding siblings ...)
2025-05-22 15:10 ` [PATCH v8 3/6] x86/fpu: Initialize guest fpstate and FPU pseudo container " Chao Gao
@ 2025-05-22 15:10 ` Chao Gao
2025-05-29 21:26 ` John Allen
2025-05-22 15:10 ` [PATCH v8 5/6] x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set Chao Gao
` (3 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Ingo Molnar, Dave Hansen,
H. Peter Anvin, Kees Cook, Eric Biggers, Stanislav Spassov,
Oleg Nesterov
The initial values for fpstate::xfd differ between guest and host fpstates.
Currently, the initial values are passed as an argument to
__fpstate_reset(). But, __fpstate_reset() already assigns different default
features and sizes based on the type of fpstates (i.e., guest or host). So,
handle fpstate::xfd in a similar way to highlight the differences in the
initial xfd value between guest and host fpstates
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/all/aBuf7wiiDT0Wflhk@google.com/
---
v8: tweak comment in __fpstate_reset() (Sean)
v7: new
---
arch/x86/kernel/fpu/core.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a5a9c55fcf83..4fafb27e9416 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -211,7 +211,7 @@ void fpu_reset_from_exception_fixup(void)
}
#if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
+static void __fpstate_reset(struct fpstate *fpstate);
static void fpu_lock_guest_permissions(void)
{
@@ -246,8 +246,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
fpstate->is_valloc = true;
fpstate->is_guest = true;
- /* Leave xfd to 0 (the reset value defined by spec) */
- __fpstate_reset(fpstate, 0);
+ __fpstate_reset(fpstate);
fpstate_init_user(fpstate);
gfpu->fpstate = fpstate;
@@ -536,7 +535,7 @@ void fpstate_init_user(struct fpstate *fpstate)
fpstate_init_fstate(fpstate);
}
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
+static void __fpstate_reset(struct fpstate *fpstate)
{
/*
* Supervisor features (and thus sizes) may diverge between guest
@@ -544,25 +543,29 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
* for guests despite not being utilized by the host. User
* features and sizes are always identical, which allows for
* common guest and userspace ABI.
+ *
+ * For the host, set XFD to the kernel's desired initialization
+ * value. For guests, set XFD to its architectural RESET value.
*/
if (fpstate->is_guest) {
fpstate->size = guest_default_cfg.size;
fpstate->xfeatures = guest_default_cfg.features;
+ fpstate->xfd = 0;
} else {
fpstate->size = fpu_kernel_cfg.default_size;
fpstate->xfeatures = fpu_kernel_cfg.default_features;
+ fpstate->xfd = init_fpstate.xfd;
}
fpstate->user_size = fpu_user_cfg.default_size;
fpstate->user_xfeatures = fpu_user_cfg.default_features;
- fpstate->xfd = xfd;
}
void fpstate_reset(struct fpu *fpu)
{
/* Set the fpstate pointer to the default fpstate */
fpu->fpstate = &fpu->__fpstate;
- __fpstate_reset(fpu->fpstate, init_fpstate.xfd);
+ __fpstate_reset(fpu->fpstate);
/* Initialize the permission related info in fpu */
fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v8 5/6] x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
` (3 preceding siblings ...)
2025-05-22 15:10 ` [PATCH v8 4/6] x86/fpu: Remove xfd argument from __fpstate_reset() Chao Gao
@ 2025-05-22 15:10 ` Chao Gao
2025-05-30 16:04 ` John Allen
2025-05-22 15:10 ` [PATCH v8 6/6] x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature Chao Gao
` (2 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Ingo Molnar, Dave Hansen,
H. Peter Anvin, Maxim Levitsky, Mitchell Levy,
Vignesh Balasubramanian
From: Yang Weijiang <weijiang.yang@intel.com>
In preparation for upcoming CET virtualization support, the CET supervisor
state will be added as a "guest-only" feature, since it is required only by
KVM (i.e., guest FPUs). Establish the infrastructure for "guest-only"
features.
Define a new XFEATURE_MASK_GUEST_SUPERVISOR mask to specify features that
are enabled by default in guest FPUs but not in host FPUs. Specifically,
for any bit in this set, permission is granted and XSAVE space is allocated
during vCPU creation. Non-guest FPUs cannot enable guest-only features,
even dynamically, and no XSAVE space will be allocated for them.
The mask is currently empty, but this will be changed by a subsequent
patch.
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
v8: rebased
v6: Collect reviews
v5: Explain in detail the reasoning behind the mask name choice below the
"---" separator line.
In previous versions, the mask was named "XFEATURE_MASK_SUPERVISOR_DYNAMIC"
Dave suggested this name [1], but he also noted, "I don't feel strongly about
it and I've said my piece. I won't NAK it one way or the other."
The term "dynamic" was initially preferred because it reflects the impact
on XSAVE buffers—some buffers accommodate dynamic features while others do
not. This naming allows for the introduction of dynamic features that are
not strictly "guest-only", offering flexibility beyond KVM.
However, using "dynamic" has led to confusion [2]. Chang pointed out that
permission granting and buffer allocation are actually static at VCPU
allocation, diverging from the model for user dynamic features. He also
questioned the rationale for introducing a kernel dynamic feature mask
while using it as a guest-only feature mask [3]. Moreover, Thomas remarked
that "the dynamic naming is really bad" [4]. Although his specific concerns
are unclear, we should be cautious about reinstating the "kernel dynamic
feature" naming.
Therefore, in v4, I renamed the mask to "XFEATURE_MASK_SUPERVISOR_GUEST"
and further refined it to "XFEATURE_MASK_GUEST_SUPERVISOR" in this v5.
[1]: https://lore.kernel.org/all/893ac578-baaf-4f4f-96ee-e012dfc073a8@intel.com/#t
[2]: https://lore.kernel.org/kvm/e15d1074-d5ec-431d-86e5-a58bc6297df8@intel.com/
[3]: https://lore.kernel.org/kvm/7bee70fd-b2b9-4466-a694-4bf3486b19c7@intel.com/
[4]: https://lore.kernel.org/all/87sg1owmth.ffs@nanos.tec.linutronix.de/
---
arch/x86/include/asm/fpu/types.h | 9 +++++----
arch/x86/include/asm/fpu/xstate.h | 6 +++++-
arch/x86/kernel/fpu/xstate.c | 7 +++++--
3 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index abd193a1a52e..54ba567258d6 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -592,8 +592,9 @@ struct fpu_state_config {
* @default_size:
*
* The default size of the register state buffer. Includes all
- * supported features except independent managed features and
- * features which have to be requested by user space before usage.
+ * supported features except independent managed features,
+ * guest-only features and features which have to be requested by
+ * user space before usage.
*/
unsigned int default_size;
@@ -609,8 +610,8 @@ struct fpu_state_config {
* @default_features:
*
* The default supported features bitmap. Does not include
- * independent managed features and features which have to
- * be requested by user space before usage.
+ * independent managed features, guest-only features and features
+ * which have to be requested by user space before usage.
*/
u64 default_features;
/*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index b308a76afbb7..a3cd25453f94 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -46,9 +46,13 @@
/* Features which are dynamically enabled for a process on request */
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
+/* Supervisor features which are enabled only in guest FPUs */
+#define XFEATURE_MASK_GUEST_SUPERVISOR 0
+
/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
- XFEATURE_MASK_CET_USER)
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_GUEST_SUPERVISOR)
/*
* A supervisor state component may not always contain valuable information,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index f15be5c3f0cc..f5eb3e84c3dc 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -778,8 +778,11 @@ static void __init fpu__init_disable_system_xstate(unsigned int legacy_size)
static u64 __init host_default_mask(void)
{
- /* Exclude dynamic features, which require userspace opt-in. */
- return ~(u64)XFEATURE_MASK_USER_DYNAMIC;
+ /*
+ * Exclude dynamic features (require userspace opt-in) and features
+ * that are supported only for KVM guests.
+ */
+ return ~((u64)XFEATURE_MASK_USER_DYNAMIC | XFEATURE_MASK_GUEST_SUPERVISOR);
}
static u64 __init guest_default_mask(void)
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH v8 6/6] x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
` (4 preceding siblings ...)
2025-05-22 15:10 ` [PATCH v8 5/6] x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set Chao Gao
@ 2025-05-22 15:10 ` Chao Gao
2025-05-30 16:05 ` John Allen
2025-05-23 16:57 ` [PATCH v8 0/6] Introduce CET supervisor state support Sean Christopherson
2025-06-04 0:56 ` Chao Gao
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-22 15:10 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Chao Gao, Maxim Levitsky, Ingo Molnar,
Dave Hansen, H. Peter Anvin, Mitchell Levy,
Vignesh Balasubramanian
From: Yang Weijiang <weijiang.yang@intel.com>
== Background ==
CET defines two register states: CET user, which includes user-mode control
registers, and CET supervisor, which consists of shadow-stack pointers for
privilege levels 0-2.
Current kernels disable shadow stacks in kernel mode, making the CET
supervisor state unused and eliminating the need for context switching.
== Problem ==
To virtualize CET for guests, KVM must accurately emulate hardware
behavior. A key challenge arises because there is no CPUID flag to indicate
that shadow stack is supported only in user mode. Therefore, KVM cannot
assume guests will not enable shadow stacks in kernel mode and must
preserve the CET supervisor state of vCPUs.
== Solution ==
An initial proposal to manually save and restore CET supervisor states
using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and
its impact on KVM's ABI. Instead, leveraging the kernel's FPU
infrastructure for context switching was favored [1].
The main question then became whether to enable the CET supervisor state
globally for all processes or restrict it to vCPU processes. This decision
involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU
processes and approximately 100 lines of code complexity in the kernel [2].
The agreed approach is to first try this optimal solution [3], i.e.,
restricting the CET supervisor state to guest FPUs only and eliminating
unnecessary space waste.
The guest-only xfeature infrastructure has already been added. Now,
introduce CET supervisor xstate support as the first guest-only feature
to prepare for the upcoming CET virtualization in KVM.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/ [1]
Link: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/ [2]
Link: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/ [3]
---
v5:
Introduce CET supervisor xfeature directly as a guest-only feature, rather
than first introducing it in one patch and then converting it to guest-only
in a subsequent patch. (Chang)
Add new features after cleanups/bug fixes (Chang, Dave, Ingo)
Improve the commit message to follow the suggested
background-problem-solution pattern.
---
arch/x86/include/asm/fpu/types.h | 14 ++++++++++++--
arch/x86/include/asm/fpu/xstate.h | 5 ++---
arch/x86/kernel/fpu/xstate.c | 5 ++++-
3 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 54ba567258d6..93e99d2583d6 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -118,7 +118,7 @@ enum xfeature {
XFEATURE_PKRU,
XFEATURE_PASID,
XFEATURE_CET_USER,
- XFEATURE_CET_KERNEL_UNUSED,
+ XFEATURE_CET_KERNEL,
XFEATURE_RSRVD_COMP_13,
XFEATURE_RSRVD_COMP_14,
XFEATURE_LBR,
@@ -142,7 +142,7 @@ enum xfeature {
#define XFEATURE_MASK_PKRU (1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID (1 << XFEATURE_PASID)
#define XFEATURE_MASK_CET_USER (1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL (1 << XFEATURE_CET_KERNEL)
#define XFEATURE_MASK_LBR (1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG (1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA (1 << XFEATURE_XTILE_DATA)
@@ -268,6 +268,16 @@ struct cet_user_state {
u64 user_ssp;
};
+/*
+ * State component 12 is Control-flow Enforcement supervisor states.
+ * This state includes SSP pointers for privilege levels 0 through 2.
+ */
+struct cet_supervisor_state {
+ u64 pl0_ssp;
+ u64 pl1_ssp;
+ u64 pl2_ssp;
+} __packed;
+
/*
* State component 15: Architectural LBR configuration state.
* The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index a3cd25453f94..7a7dc9d56027 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -47,7 +47,7 @@
#define XFEATURE_MASK_USER_DYNAMIC XFEATURE_MASK_XTILE_DATA
/* Supervisor features which are enabled only in guest FPUs */
-#define XFEATURE_MASK_GUEST_SUPERVISOR 0
+#define XFEATURE_MASK_GUEST_SUPERVISOR XFEATURE_MASK_CET_KERNEL
/* All currently supported supervisor features */
#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
@@ -79,8 +79,7 @@
* Unsupported supervisor features. When a supervisor feature in this mask is
* supported in the future, move it to the supported supervisor feature mask.
*/
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
- XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
/* All supervisor states including supported and unsupported states. */
#define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index f5eb3e84c3dc..bcccb0977809 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -56,7 +56,7 @@ static const char *xfeature_names[] =
"Protection Keys User registers",
"PASID state",
"Control-flow User registers",
- "Control-flow Kernel registers (unused)",
+ "Control-flow Kernel registers (KVM only)",
"unknown xstate feature",
"unknown xstate feature",
"unknown xstate feature",
@@ -80,6 +80,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
[XFEATURE_PKRU] = X86_FEATURE_OSPKE,
[XFEATURE_PASID] = X86_FEATURE_ENQCMD,
[XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
[XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
[XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
[XFEATURE_APX] = X86_FEATURE_APX,
@@ -371,6 +372,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID | \
XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL | \
XFEATURE_MASK_XTILE | \
XFEATURE_MASK_APX)
@@ -572,6 +574,7 @@ static bool __init check_xstate_against_struct(int nr)
case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
case XFEATURE_APX: return XCHECK_SZ(sz, nr, struct apx_state);
case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
default:
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
` (5 preceding siblings ...)
2025-05-22 15:10 ` [PATCH v8 6/6] x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature Chao Gao
@ 2025-05-23 16:57 ` Sean Christopherson
2025-05-23 17:12 ` Dave Hansen
2025-06-04 0:56 ` Chao Gao
7 siblings, 1 reply; 23+ messages in thread
From: Sean Christopherson @ 2025-05-23 16:57 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, pbonzini, peterz,
rick.p.edgecombe, weijiang.yang, john.allen, bp, chang.seok.bae,
xin3.li, Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar,
Kees Cook, Maxim Levitsky, Mitchell Levy, Nikolay Borisov,
Oleg Nesterov, Sohil Mehta, Stanislav Spassov,
Vignesh Balasubramanian
On Thu, May 22, 2025, Chao Gao wrote:
> Chao Gao (4):
> x86/fpu/xstate: Differentiate default features for host and guest FPUs
> x86/fpu: Initialize guest FPU permissions from guest defaults
> x86/fpu: Initialize guest fpstate and FPU pseudo container from guest
> defaults
> x86/fpu: Remove xfd argument from __fpstate_reset()
>
> Yang Weijiang (2):
> x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
> x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only
> feature
Acked-by: Sean Christopherson <seanjc@google.com>
Side topic, and *probably* unrelated to this series, I tripped the following
WARN when running it through the KVM tests (though I don't think it has anything
to do with KVM?). The WARN is the version of xfd_validate_state() that's guarded
by CONFIG_X86_DEBUG_FPU=y.
WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
Modules linked in: kvm_intel kvm irqbypass vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd [last unloaded: kvm_intel]
CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan Tainted: G S 6.15.0-smp--3542d5d75b5c-cet #678 NONE
Tainted: [S]=CPU_OUT_OF_SPEC
Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240807.2-0 10/09/2024
RIP: 0010:xfd_validate_state+0x65/0x70
Code: 10 4c 3b 60 18 74 23 49 81 fe 80 c4 45 ab 74 15 4d 0b 7e 08 49 f7 d7 49 85 df 75 0e 5b 41 5c 41 5e 41 5f 5d c3 40 84 ed 75 f2 <0f> 0b eb ee 0f 1f 80 00 00 00 00 66 0f 1f 00 0f 1f 44 00 00 48 89
RSP: 0018:ff7ada85584a3e08 EFLAGS: 00010246
RAX: ff2c5d2908a53940 RBX: 00000000000e00ff RCX: ff2c5d2908a53940
RDX: 0000000000000001 RSI: 00000000000e00ff RDI: ff2c5d2908a521c0
RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffffaa56fa4d R11: 0000000000000000 R12: 0000000000040000
R13: ff2c5d2908a521c0 R14: ffffffffab45c480 R15: 0000000000000000
FS: 00007f21084d6700(0000) GS:ff2c5da752b41000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001ca832006 CR4: 0000000000f73ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 00000000
Call Trace:
<TASK>
fpu__clear_user_states+0x9c/0x100
arch_do_signal_or_restart+0x142/0x210
exit_to_user_mode_loop+0x55/0x100
do_syscall_64+0x205/0x2c0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x55ad185f2ee0
Code: 8c fc 48 8d 0d 6e d5 8e fc 4c 8d 05 64 cb 78 fc bf 03 00 00 00 ba 25 03 00 00 49 89 c1 31 c0 e8 e6 2e 08 00 cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 49 89 d7 49 89
RSP: 002b:00007f21084d3e38 EFLAGS: 00000246 ORIG_RAX: 00000000000001b9
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000055ad18136f73
RDX: 00007f21084d3e40 RSI: 00007f21084d3f70 RDI: 000000000000001b
RBP: 00007f21084d4f90 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 000011f03fbc2f00 R14: ffffffffffffffff R15: 0000000000000000
</TASK>
irq event stamp: 368
hardirqs last enabled at (367): [<ffffffffaaf5f1b8>] _raw_write_unlock_irq+0x28/0x40
hardirqs last disabled at (368): [<ffffffffaaf5487d>] __schedule+0x1bd/0xea0
softirqs last enabled at (0): [<ffffffffaa2aa1ca>] copy_process+0x38a/0x1350
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace 0000000000000000 ]---
But I've hit the WARN once before, so whatever is going on is pre-existing. I
haven't done any experiments to see if the WARN fires more frequently with this
series. I mentioned it here purely out of convenience.
------------[ cut here ]------------
WARNING: CPU: 77 PID: 14821 at arch/x86/kernel/fpu/xstate.c:1466 xfd_validate_state+0x4a/0x50
Modules linked in: kvm_intel kvm irqbypass vfat fat dummy bridge stp llc intel_vsec cdc_acm cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd sr_mod cdrom loop [last unloaded: kvm]
CPU: 77 UID: 0 PID: 14821 Comm: futex-default-S Tainted: G S W 6.15.0-smp--a2104d5ba341-sink #605 NONE
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
Hardware name: Google Izumi-EMR/izumi, BIOS 0.20240807.2-0 10/09/2024
RIP: 0010:xfd_validate_state+0x4a/0x50
Code: 50 0a a9 4d 8b 80 90 17 00 00 49 3b 48 18 74 1a 48 81 ff 80 a4 65 a7 74 0d 48 0b 47 08 48 f7 d0 48 85 f0 75 05 c3 84 d2 75 fb <0f> 0b c3 0f 1f 00 66 0f 1f 00 0f 1f 44 00 00 48 89 f8 48 8b 7f 10
RSP: 0018:ff1ba89ef124fe58 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffa644871e RCX: 0000000000040000
RDX: 0000000000000001 RSI: 00000000000600ff RDI: ffffffffa765a480
RBP: 0000000000000000 R08: ff137abd4db65bc0 R09: 0000000000000000
R10: ffffffffa6775f8d R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ff137abd4db65b80 R15: 0000000000000000
FS: 00007fea8bce7700(0000) GS:ff137afc151b3000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001a87c0004 CR4: 0000000000f73ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 00000000
Call Trace:
<TASK>
fpu__clear_user_states+0x92/0xf0
arch_do_signal_or_restart+0x134/0x200
syscall_exit_to_user_mode+0x8a/0x110
do_syscall_64+0x8b/0x160
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x563afb5588c0
Code: f0 e9 df fc ff ff 48 8b 5d 88 4d 89 f0 e9 b5 fe ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc <55> 48 89 e5 41 57 41 56 41 55 41 54 53 50 49 89 d7 e8 0a 6f f2 01
RSP: 002b:00007fea8bce4e78 EFLAGS: 00000202 ORIG_RAX: 00000000000000e8
RAX: 0000000000000000 RBX: 000011fcd50a6dd0 RCX: 0000563af84d6b30
RDX: 00007fea8bce4e80 RSI: 00007fea8bce4fb0 RDI: 000000000000001e
RBP: 00007fea8bce5c30 R08: 0000000000000000 R09: 00007fea8bce6ca0
R10: 00000000000007d0 R11: 0000000000000202 R12: 0000000063239328
R13: 00000000680b9c83 R14: 00000000000007d0 R15: 000011fcd5c46150
</TASK>
irq event stamp: 496018
hardirqs last enabled at (496017): [<ffffffffa7158965>] _raw_spin_unlock_irqrestore+0x35/0x50
hardirqs last disabled at (496018): [<ffffffffa714e6fd>] __schedule+0x1bd/0xe90
softirqs last enabled at (495074): [<ffffffffa64c02ec>] __irq_exit_rcu+0x6c/0x130
softirqs last disabled at (495065): [<ffffffffa64c02ec>] __irq_exit_rcu+0x6c/0x130
---[ end trace 0000000000000000 ]---
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-05-23 16:57 ` [PATCH v8 0/6] Introduce CET supervisor state support Sean Christopherson
@ 2025-05-23 17:12 ` Dave Hansen
2025-05-23 17:48 ` Sean Christopherson
2025-05-27 11:01 ` Chao Gao
0 siblings, 2 replies; 23+ messages in thread
From: Dave Hansen @ 2025-05-23 17:12 UTC (permalink / raw)
To: Sean Christopherson, Chao Gao
Cc: x86, linux-kernel, kvm, tglx, pbonzini, peterz, rick.p.edgecombe,
weijiang.yang, john.allen, bp, chang.seok.bae, xin3.li,
Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar, Kees Cook,
Maxim Levitsky, Mitchell Levy, Nikolay Borisov, Oleg Nesterov,
Sohil Mehta, Stanislav Spassov, Vignesh Balasubramanian
On 5/23/25 09:57, Sean Christopherson wrote:
> Side topic, and *probably* unrelated to this series, I tripped the following
> WARN when running it through the KVM tests (though I don't think it has anything
> to do with KVM?). The WARN is the version of xfd_validate_state() that's guarded
> by CONFIG_X86_DEBUG_FPU=y.
>
> WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
Huh, and the two processes getting hit by it:
CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan ...
CPU: 77 UID: 0 PID: 14821 Comm: futex-default-S ...
don't _look_ like KVM test processes. My guess would be it's some
mixture of KVM and a signal handler fighting with XFD state.
I take it this is a Sapphire Rapids system? Is there anything
interesting about the config other than CONFIG_X86_DEBUG_FPU?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-05-23 17:12 ` Dave Hansen
@ 2025-05-23 17:48 ` Sean Christopherson
2025-05-27 11:01 ` Chao Gao
1 sibling, 0 replies; 23+ messages in thread
From: Sean Christopherson @ 2025-05-23 17:48 UTC (permalink / raw)
To: Dave Hansen
Cc: Chao Gao, x86, linux-kernel, kvm, tglx, pbonzini, peterz,
rick.p.edgecombe, weijiang.yang, john.allen, bp, chang.seok.bae,
xin3.li, Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar,
Kees Cook, Maxim Levitsky, Mitchell Levy, Nikolay Borisov,
Oleg Nesterov, Sohil Mehta, Stanislav Spassov,
Vignesh Balasubramanian
On Fri, May 23, 2025, Dave Hansen wrote:
> On 5/23/25 09:57, Sean Christopherson wrote:
> > Side topic, and *probably* unrelated to this series, I tripped the following
> > WARN when running it through the KVM tests (though I don't think it has anything
> > to do with KVM?). The WARN is the version of xfd_validate_state() that's guarded
> > by CONFIG_X86_DEBUG_FPU=y.
> >
> > WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
>
> Huh, and the two processes getting hit by it:
>
> CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan ...
> CPU: 77 UID: 0 PID: 14821 Comm: futex-default-S ...
>
> don't _look_ like KVM test processes.
Yeah, that's why I haven't dug into it, I don't really know where to start, and
I don't even really know what triggered it.
> My guess would be it's some mixture of KVM and a signal handler fighting with
> XFD state.
>
> I take it this is a Sapphire Rapids system?
Emerald Rapids
> Is there anything interesting about the config other than CONFIG_X86_DEBUG_FPU?
The only thing I can think of that's remotely interesting is CONFIG_PROVE_LOCKING=y.
Other than that, it's a pretty vanilla config.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-05-23 17:12 ` Dave Hansen
2025-05-23 17:48 ` Sean Christopherson
@ 2025-05-27 11:01 ` Chao Gao
2025-06-02 19:12 ` Chang S. Bae
1 sibling, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-05-27 11:01 UTC (permalink / raw)
To: Dave Hansen
Cc: Sean Christopherson, x86, linux-kernel, kvm, tglx, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Dave Hansen, Eric Biggers,
H. Peter Anvin, Ingo Molnar, Kees Cook, Maxim Levitsky,
Mitchell Levy, Nikolay Borisov, Oleg Nesterov, Sohil Mehta,
Stanislav Spassov, Vignesh Balasubramanian
On Fri, May 23, 2025 at 10:12:22AM -0700, Dave Hansen wrote:
>On 5/23/25 09:57, Sean Christopherson wrote:
>> Side topic, and *probably* unrelated to this series, I tripped the following
>> WARN when running it through the KVM tests (though I don't think it has anything
>> to do with KVM?). The WARN is the version of xfd_validate_state() that's guarded
>> by CONFIG_X86_DEBUG_FPU=y.
>>
>> WARNING: CPU: 232 PID: 15391 at arch/x86/kernel/fpu/xstate.c:1543 xfd_validate_state+0x65/0x70
>
>Huh, and the two processes getting hit by it:
>
> CPU: 232 UID: 0 PID: 15391 Comm: DefaultEventMan ...
> CPU: 77 UID: 0 PID: 14821 Comm: futex-default-S ...
>
>don't _look_ like KVM test processes. My guess would be it's some
>mixture of KVM and a signal handler fighting with XFD state.
We are hitting the third case in the table below [*]:
MSR | fpstate | cur->fpstate | valid
-------------------------------------
0 | 0 | x | 1 // MSR matches @fpstate
0 | 1 | 0 | 1 // MSR matches cur->fpstate
0 | 1 | 1 | 0 <- *** MSR matches nothing!
1 | 0 | 0 | 0 <- *** MSR matches nothing!
1 | 0 | 1 | 1 // MSR matches cur->fpstate
1 | 1 | x | 1 // MSR matches @fpstate
*: https://lore.kernel.org/all/88cb75d3-01b9-38ea-e29f-b8fefb548573@intel.com/
The issue arises because the XFD MSR retains the value (i.e., 0, indicating
AMX enabled) from the previous process, while both the passed-in fpstate
(init_fpstate) and the current fpstate have AMX disabled.
To reproduce this issue, compile the kernel with CONFIG_PREEMPT=y, apply the
attached diff to the amx selftest and run:
# numactl -C 1 ./tools/testing/selftests/x86/amx_64
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
index 40769c16de1b..4d533d1a530d 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -430,6 +430,10 @@ static inline void validate_tiledata_regs_changed(struct xsave_buffer *xbuf)
fatal_error("TILEDATA registers did not change");
}
+static void dummy_handler(int sig)
+{
+}
+
/* tiledata inheritance test */
static void test_fork(void)
@@ -444,6 +448,10 @@ static void test_fork(void)
/* fork() succeeded. Now in the parent. */
int status;
+ req_xtiledata_perm();
+ load_rand_tiledata(stashed_xsave);
+ while(1);
+
wait(&status);
if (!WIFEXITED(status) || WEXITSTATUS(status))
fatal_error("fork test child");
@@ -452,7 +460,9 @@ static void test_fork(void)
/* fork() succeeded. Now in the child. */
printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tiledata\n");
- load_rand_tiledata(stashed_xsave);
+ signal(SIGSEGV, dummy_handler);
+ while(1)
+ raise(SIGSEGV);
grandchild = fork();
if (grandchild < 0) {
@@ -500,9 +510,6 @@ int main(void)
test_dynamic_state();
- /* Request permission for the following tests */
- req_xtiledata_perm();
-
test_fork();
/*
>
>I take it this is a Sapphire Rapids system? Is there anything
>interesting about the config other than CONFIG_X86_DEBUG_FPU?
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs
2025-05-22 15:10 ` [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs Chao Gao
@ 2025-05-29 20:59 ` John Allen
0 siblings, 0 replies; 23+ messages in thread
From: John Allen @ 2025-05-29 20:59 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, bp, chang.seok.bae,
xin3.li, Ingo Molnar, Dave Hansen, H. Peter Anvin, Maxim Levitsky,
Mitchell Levy, Kees Cook, Stanislav Spassov, Eric Biggers,
Oleg Nesterov, Nikolay Borisov, Vignesh Balasubramanian
On Thu, May 22, 2025 at 08:10:04AM -0700, Chao Gao wrote:
> Currently, guest and host FPUs share the same default features. However,
> the CET supervisor xstate is the first feature that needs to be enabled
> exclusively for guest FPUs. Enabling it for host FPUs leads to a waste of
> 24 bytes in the XSAVE buffer.
>
> To support "guest-only" features, add a new structure to hold the
> default features and sizes for guest FPUs to clearly differentiate them
> from those for host FPUs.
>
> Add two helpers to provide the default feature masks for guest and host
> FPUs. Default features are derived by applying the masks to the maximum
> supported features.
>
> Note that,
> 1) for now, guest_default_mask() and host_default_mask() are identical.
> This will change in a follow-up patch once guest permissions, default
> xfeatures, and fpstate size are all converted to use the guest defaults.
>
> 2) only supervisor features will diverge between guest FPUs and host
> FPUs, while user features will remain the same [1][2]. So, the new
> vcpu_fpu_config struct does not include default user features and size
> for the UABI buffer.
>
> An alternative approach is adding a guest_only_xfeatures member to
> fpu_kernel_cfg and adding two helper functions to calculate the guest
> default xfeatures and size. However, calculating these defaults at runtime
> would introduce unnecessary overhead.
>
> Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 2/6] x86/fpu: Initialize guest FPU permissions from guest defaults
2025-05-22 15:10 ` [PATCH v8 2/6] x86/fpu: Initialize guest FPU permissions from guest defaults Chao Gao
@ 2025-05-29 21:14 ` John Allen
0 siblings, 0 replies; 23+ messages in thread
From: John Allen @ 2025-05-29 21:14 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, bp, chang.seok.bae,
xin3.li, Ingo Molnar, Dave Hansen, H. Peter Anvin,
Stanislav Spassov, Eric Biggers, Oleg Nesterov, Kees Cook
On Thu, May 22, 2025 at 08:10:05AM -0700, Chao Gao wrote:
> Currently, fpu->guest_perm is copied from fpu->perm, which is derived from
> fpu_kernel_cfg.default_features.
>
> Guest defaults were introduced to differentiate the features and sizes of
> host and guest FPUs. Copying guest FPU permissions from the host will lead
> to inconsistencies between the guest default features and permissions.
>
> Initialize guest FPU permissions from guest defaults instead of host
> defaults. This ensures that any changes to guest default features are
> automatically reflected in guest permissions, which in turn guarantees
> that fpstate_realloc() allocates a correctly sized XSAVE buffer for guest
> FPUs.
>
> Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 3/6] x86/fpu: Initialize guest fpstate and FPU pseudo container from guest defaults
2025-05-22 15:10 ` [PATCH v8 3/6] x86/fpu: Initialize guest fpstate and FPU pseudo container " Chao Gao
@ 2025-05-29 21:25 ` John Allen
0 siblings, 0 replies; 23+ messages in thread
From: John Allen @ 2025-05-29 21:25 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, bp, chang.seok.bae,
xin3.li, Ingo Molnar, Dave Hansen, H. Peter Anvin, Kees Cook,
Stanislav Spassov, Oleg Nesterov, Eric Biggers
On Thu, May 22, 2025 at 08:10:06AM -0700, Chao Gao wrote:
> fpu_alloc_guest_fpstate() currently uses host defaults to initialize guest
> fpstate and pseudo containers. Guest defaults were introduced to
> differentiate the features and sizes of host and guest FPUs. Switch to
> using guest defaults instead.
>
> Adjust __fpstate_reset() to handle different defaults for host and guest
> FPUs. And to distinguish between the types of FPUs, move the initialization
> of indicators (is_guest and is_valloc) before the reset.
>
> Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 4/6] x86/fpu: Remove xfd argument from __fpstate_reset()
2025-05-22 15:10 ` [PATCH v8 4/6] x86/fpu: Remove xfd argument from __fpstate_reset() Chao Gao
@ 2025-05-29 21:26 ` John Allen
0 siblings, 0 replies; 23+ messages in thread
From: John Allen @ 2025-05-29 21:26 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, bp, chang.seok.bae,
xin3.li, Ingo Molnar, Dave Hansen, H. Peter Anvin, Kees Cook,
Eric Biggers, Stanislav Spassov, Oleg Nesterov
On Thu, May 22, 2025 at 08:10:07AM -0700, Chao Gao wrote:
> The initial values for fpstate::xfd differ between guest and host fpstates.
> Currently, the initial values are passed as an argument to
> __fpstate_reset(). But, __fpstate_reset() already assigns different default
> features and sizes based on the type of fpstates (i.e., guest or host). So,
> handle fpstate::xfd in a similar way to highlight the differences in the
> initial xfd value between guest and host fpstates
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 5/6] x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
2025-05-22 15:10 ` [PATCH v8 5/6] x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set Chao Gao
@ 2025-05-30 16:04 ` John Allen
0 siblings, 0 replies; 23+ messages in thread
From: John Allen @ 2025-05-30 16:04 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, bp, chang.seok.bae,
xin3.li, Ingo Molnar, Dave Hansen, H. Peter Anvin, Maxim Levitsky,
Mitchell Levy, Vignesh Balasubramanian
On Thu, May 22, 2025 at 08:10:08AM -0700, Chao Gao wrote:
> From: Yang Weijiang <weijiang.yang@intel.com>
>
> In preparation for upcoming CET virtualization support, the CET supervisor
> state will be added as a "guest-only" feature, since it is required only by
> KVM (i.e., guest FPUs). Establish the infrastructure for "guest-only"
> features.
>
> Define a new XFEATURE_MASK_GUEST_SUPERVISOR mask to specify features that
> are enabled by default in guest FPUs but not in host FPUs. Specifically,
> for any bit in this set, permission is granted and XSAVE space is allocated
> during vCPU creation. Non-guest FPUs cannot enable guest-only features,
> even dynamically, and no XSAVE space will be allocated for them.
>
> The mask is currently empty, but this will be changed by a subsequent
> patch.
>
> Co-developed-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 6/6] x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature
2025-05-22 15:10 ` [PATCH v8 6/6] x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature Chao Gao
@ 2025-05-30 16:05 ` John Allen
0 siblings, 0 replies; 23+ messages in thread
From: John Allen @ 2025-05-30 16:05 UTC (permalink / raw)
To: Chao Gao
Cc: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, bp, chang.seok.bae,
xin3.li, Maxim Levitsky, Ingo Molnar, Dave Hansen, H. Peter Anvin,
Mitchell Levy, Vignesh Balasubramanian
On Thu, May 22, 2025 at 08:10:09AM -0700, Chao Gao wrote:
> From: Yang Weijiang <weijiang.yang@intel.com>
>
> == Background ==
>
> CET defines two register states: CET user, which includes user-mode control
> registers, and CET supervisor, which consists of shadow-stack pointers for
> privilege levels 0-2.
>
> Current kernels disable shadow stacks in kernel mode, making the CET
> supervisor state unused and eliminating the need for context switching.
>
> == Problem ==
>
> To virtualize CET for guests, KVM must accurately emulate hardware
> behavior. A key challenge arises because there is no CPUID flag to indicate
> that shadow stack is supported only in user mode. Therefore, KVM cannot
> assume guests will not enable shadow stacks in kernel mode and must
> preserve the CET supervisor state of vCPUs.
>
> == Solution ==
>
> An initial proposal to manually save and restore CET supervisor states
> using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and
> its impact on KVM's ABI. Instead, leveraging the kernel's FPU
> infrastructure for context switching was favored [1].
>
> The main question then became whether to enable the CET supervisor state
> globally for all processes or restrict it to vCPU processes. This decision
> involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU
> processes and approximately 100 lines of code complexity in the kernel [2].
> The agreed approach is to first try this optimal solution [3], i.e.,
> restricting the CET supervisor state to guest FPUs only and eliminating
> unnecessary space waste.
>
> The guest-only xfeature infrastructure has already been added. Now,
> introduce CET supervisor xstate support as the first guest-only feature
> to prepare for the upcoming CET virtualization in KVM.
>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: John Allen <john.allen@amd.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-05-27 11:01 ` Chao Gao
@ 2025-06-02 19:12 ` Chang S. Bae
2025-06-03 6:22 ` Chao Gao
0 siblings, 1 reply; 23+ messages in thread
From: Chang S. Bae @ 2025-06-02 19:12 UTC (permalink / raw)
To: Chao Gao, Dave Hansen
Cc: Sean Christopherson, x86, linux-kernel, kvm, tglx, pbonzini,
peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp, xin3.li,
Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar, Kees Cook,
Maxim Levitsky, Mitchell Levy, Nikolay Borisov, Oleg Nesterov,
Sohil Mehta, Stanislav Spassov, Vignesh Balasubramanian
On 5/27/2025 4:01 AM, Chao Gao wrote:
>
> *: https://lore.kernel.org/all/88cb75d3-01b9-38ea-e29f-b8fefb548573@intel.com/
>
> The issue arises because the XFD MSR retains the value (i.e., 0, indicating
> AMX enabled) from the previous process, while both the passed-in fpstate
> (init_fpstate) and the current fpstate have AMX disabled.
>
> To reproduce this issue, compile the kernel with CONFIG_PREEMPT=y, apply the
> attached diff to the amx selftest and run:
>
> # numactl -C 1 ./tools/testing/selftests/x86/amx_64
>
> diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
> index 40769c16de1b..4d533d1a530d 100644
> --- a/tools/testing/selftests/x86/amx.c
> +++ b/tools/testing/selftests/x86/amx.c
> @@ -430,6 +430,10 @@ static inline void validate_tiledata_regs_changed(struct xsave_buffer *xbuf)
> fatal_error("TILEDATA registers did not change");
> }
>
> +static void dummy_handler(int sig)
> +{
> +}
> +
> /* tiledata inheritance test */
>
> static void test_fork(void)
> @@ -444,6 +448,10 @@ static void test_fork(void)
> /* fork() succeeded. Now in the parent. */
> int status;
>
> + req_xtiledata_perm();
> + load_rand_tiledata(stashed_xsave);
> + while(1);
> +
> wait(&status);
> if (!WIFEXITED(status) || WEXITSTATUS(status))
> fatal_error("fork test child");
> @@ -452,7 +460,9 @@ static void test_fork(void)
> /* fork() succeeded. Now in the child. */
> printf("[RUN]\tCheck tile data inheritance.\n\tBefore fork(), load tiledata\n");
>
> - load_rand_tiledata(stashed_xsave);
> + signal(SIGSEGV, dummy_handler);
> + while(1)
> + raise(SIGSEGV);
>
> grandchild = fork();
> if (grandchild < 0) {
> @@ -500,9 +510,6 @@ int main(void)
>
> test_dynamic_state();
>
> - /* Request permission for the following tests */
> - req_xtiledata_perm();
> -
> test_fork();
>
> /*
The test case creates two processes -- the first uses AMX (task #1), and
the other continuously sends signals without using AMX (task #2).
This leads to task #2 being preempted by task #1.
This behavior aligns with Sean’s report. From my investigation, the
issue appears to have existed for quite some time and is not related to
the changes in this series.
Here’s a summary of my findings:
== Preempt Case ==
To illustrate how the XFD MSR state becomes incorrect in this scenario:
task #1 (fpstate->xfd=0) task #2 (fpstate->xfd=0x80000)
======================== ==============================
handle_signal()
-> setup_rt_frame()
-> get_siframe()
-> copy_fpstate_to_sigframe()
-> fpregs_unlock()
...
...
switch_fpu_return()
-> fpregs_restore_userregs()
-> restore_fpregs_from_fpstate()
-> xfd_write_state()
^ IA32_XFD_MSR = 0
...
...
-> fpu__clear_user_states()
-> fpregs_lock()
-> restore_fpregs_from_init_fpstate()
-> os_rstor()
-> xfd_validate_state()
^ IA32_XFD_MSR != fpstate->xfd
-> fpregs_mark_active()
-> fpregs_unlock()
Since fpu__clear_user_states() marks the FPU state as valid in the end,
an XFD MSR sync-up was clearly missing.
== Return-to-Userspace Path ==
Both tasks at that moment are on the return-to-userspace path, but at
different points in IRQ state:
* task #2 is inside handle_signal() and already re-enabled IRQs.
* task #1 is after IRQ is disabled again when calling
switch_fpu_return().
local_irq_disable_exit_to_user()
exit_to_user_mode_prepare()
-> exit_to_user_mode_loop()
-> local_irq_enable_exit_to_user()
-> arch_do_signal_or_restart()
-> handle_signal()
-> local_irq_disable_exit_to_user()
-> arch_exit_user_mode_prepare()
-> arch_exit_work()
-> switch_fpu_return()
This implies that fpregs_lock()/fpregs_unlock() is necessary inside
handle_signal() when XSAVE instructions are invoked.
But, it should be okay for switch_fpu_return() to call
fpregs_restore_userregs() without fpregs_lock().
== XFD Sanity Checker ==
The XFD sanity checker -- xfd_op_valid() -- correctly caught this issue
in the test case. However, it may have a false negative when AMX usage
was flipped between the two tasks.
Despite that, I don't think extending its coverage is worthwhile, as it
would complicate the logic. The current logic and documentation seem
sound.
== Fix Consideration ==
I think the fix is straightforward: resynchronize the IA32_XFD MSR in
fpu__clear_user_states().
The existing xfd_update_state() function is self-contained and already
performs feature checks and conditional MSR updates. Thus, it is not
necessary to check the TIF_NEED_FPU_LOAD flag for this.
On the other hand, the sigreturn path already performs the XFD resync
introduced by this commit:
672365477ae8a ("x86/fpu: Update XFD state where required")
But I think that change was supposed to cover _both_ signal return and
signal delivery paths. Sorry, it seems I had overlooked the latter
before.
Thanks,
Chang
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-06-02 19:12 ` Chang S. Bae
@ 2025-06-03 6:22 ` Chao Gao
2025-06-03 18:32 ` Chang S. Bae
0 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-06-03 6:22 UTC (permalink / raw)
To: Chang S. Bae
Cc: Dave Hansen, Sean Christopherson, x86, linux-kernel, kvm, tglx,
pbonzini, peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
xin3.li, Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar,
Kees Cook, Maxim Levitsky, Mitchell Levy, Nikolay Borisov,
Oleg Nesterov, Sohil Mehta, Stanislav Spassov,
Vignesh Balasubramanian
On Mon, Jun 02, 2025 at 12:12:51PM -0700, Chang S. Bae wrote:
>== Preempt Case ==
>
>To illustrate how the XFD MSR state becomes incorrect in this scenario:
>
> task #1 (fpstate->xfd=0) task #2 (fpstate->xfd=0x80000)
> ======================== ==============================
> handle_signal()
> -> setup_rt_frame()
> -> get_siframe()
> -> copy_fpstate_to_sigframe()
> -> fpregs_unlock()
> ...
> ...
> switch_fpu_return()
> -> fpregs_restore_userregs()
> -> restore_fpregs_from_fpstate()
> -> xfd_write_state()
> ^ IA32_XFD_MSR = 0
> ...
> ...
> -> fpu__clear_user_states()
> -> fpregs_lock()
> -> restore_fpregs_from_init_fpstate()
> -> os_rstor()
> -> xfd_validate_state()
> ^ IA32_XFD_MSR != fpstate->xfd
> -> fpregs_mark_active()
> -> fpregs_unlock()
>
>Since fpu__clear_user_states() marks the FPU state as valid in the end, an
>XFD MSR sync-up was clearly missing.
Thanks for this analysis. It makes sense.
>
>== Return-to-Userspace Path ==
>
>Both tasks at that moment are on the return-to-userspace path, but at
>different points in IRQ state:
>
> * task #2 is inside handle_signal() and already re-enabled IRQs.
> * task #1 is after IRQ is disabled again when calling
> switch_fpu_return().
>
> local_irq_disable_exit_to_user()
> exit_to_user_mode_prepare()
> -> exit_to_user_mode_loop()
> -> local_irq_enable_exit_to_user()
> -> arch_do_signal_or_restart()
> -> handle_signal()
> -> local_irq_disable_exit_to_user()
> -> arch_exit_user_mode_prepare()
> -> arch_exit_work()
> -> switch_fpu_return()
>
>This implies that fpregs_lock()/fpregs_unlock() is necessary inside
>handle_signal() when XSAVE instructions are invoked.
>
>But, it should be okay for switch_fpu_return() to call
>fpregs_restore_userregs() without fpregs_lock().
>
>== XFD Sanity Checker ==
>
>The XFD sanity checker -- xfd_op_valid() -- correctly caught this issue in
>the test case. However, it may have a false negative when AMX usage was
>flipped between the two tasks.
>
>Despite that, I don't think extending its coverage is worthwhile, as it would
>complicate the logic. The current logic and documentation seem sound.
>
>== Fix Consideration ==
>
>I think the fix is straightforward: resynchronize the IA32_XFD MSR in
>fpu__clear_user_states().
This fix sounds good.
Btw, what do you think the impact of this issue might be?
Aside from the splat, task #2 could execute AMX instructions without
requesting permissions, but its AMX state would be discarded during the
next FPU switch, as RFBM[18] is cleared when executing XSAVES. And, in the
"flipped" scenario you mentioned, task #2 might receive an extra #NM, after
which its fpstate would be re-allocated (although the size won't increase
further).
So, for well-behaved tasks that never use AMX, there is no impact; tasks
that use AMX may receive extra #NM. There won't be any unexpected #PF,
memory corruption, or kernel panic.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-06-03 6:22 ` Chao Gao
@ 2025-06-03 18:32 ` Chang S. Bae
0 siblings, 0 replies; 23+ messages in thread
From: Chang S. Bae @ 2025-06-03 18:32 UTC (permalink / raw)
To: Chao Gao
Cc: Dave Hansen, Sean Christopherson, x86, linux-kernel, kvm, tglx,
pbonzini, peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
xin3.li, Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar,
Kees Cook, Maxim Levitsky, Mitchell Levy, Nikolay Borisov,
Oleg Nesterov, Sohil Mehta, Stanislav Spassov,
Vignesh Balasubramanian
On 6/2/2025 11:22 PM, Chao Gao wrote:
>
> Aside from the splat, task #2 could execute AMX instructions without
> requesting permissions, but its AMX state would be discarded during the
> next FPU switch, as RFBM[18] is cleared when executing XSAVES. And, in the
Right, AMX instructions can be executed when XFD is disarmed. But in
this case, it's inside a signal handler. On sigreturn, XTILE_DATA will
be reloaded with the init state, since fpstate::user_xfeatures[18] is zero.
> "flipped" scenario you mentioned, task #2 might receive an extra #NM, after
> which its fpstate would be re-allocated (although the size won't increase
> further).
Yes.
> So, for well-behaved tasks that never use AMX, there is no impact; tasks
> that use AMX may receive extra #NM. There won't be any unexpected #PF,
> memory corruption, or kernel panic.
A signal handler is expected to stay within the bounds of
async-signal-safe functions, so using AMX in that context is highly
unlikely in practice. While the issue has existed, its real-world impact
appears quite minimal in my view.
Thanks,
Chang
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
` (6 preceding siblings ...)
2025-05-23 16:57 ` [PATCH v8 0/6] Introduce CET supervisor state support Sean Christopherson
@ 2025-06-04 0:56 ` Chao Gao
2025-06-04 18:45 ` Dave Hansen
7 siblings, 1 reply; 23+ messages in thread
From: Chao Gao @ 2025-06-04 0:56 UTC (permalink / raw)
To: x86, linux-kernel, kvm, tglx, dave.hansen, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Dave Hansen, Eric Biggers,
H. Peter Anvin, Ingo Molnar, Kees Cook, Maxim Levitsky,
Mitchell Levy, Nikolay Borisov, Oleg Nesterov, Sohil Mehta,
Stanislav Spassov, Vignesh Balasubramanian
On Thu, May 22, 2025 at 08:10:03AM -0700, Chao Gao wrote:
>Dear maintainers and reviewers,
>
>I kindly request your consideration for merging this series. Most of
>patches have received Reviewed-by/Acked-by tags.
Looks like we now have AMD RB and the other issue (reported by Sean) was
pre-existing. x86 maintainers, please consider applying.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-06-04 0:56 ` Chao Gao
@ 2025-06-04 18:45 ` Dave Hansen
2025-06-16 8:08 ` Chao Gao
0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2025-06-04 18:45 UTC (permalink / raw)
To: Chao Gao, x86, linux-kernel, kvm, tglx, seanjc, pbonzini
Cc: peterz, rick.p.edgecombe, weijiang.yang, john.allen, bp,
chang.seok.bae, xin3.li, Dave Hansen, Eric Biggers,
H. Peter Anvin, Ingo Molnar, Kees Cook, Maxim Levitsky,
Mitchell Levy, Nikolay Borisov, Oleg Nesterov, Sohil Mehta,
Stanislav Spassov, Vignesh Balasubramanian
On 6/3/25 17:56, Chao Gao wrote:
> On Thu, May 22, 2025 at 08:10:03AM -0700, Chao Gao wrote:
>> Dear maintainers and reviewers,
>>
>> I kindly request your consideration for merging this series. Most of
>> patches have received Reviewed-by/Acked-by tags.
> Looks like we now have AMD RB and the other issue (reported by Sean) was
> pre-existing. x86 maintainers, please consider applying.
Hi Chao,
You might want to take a look at this:
https://docs.kernel.org/process/maintainer-tip.html#merge-window
Specifically:
> Please do not expect patches to be reviewed or merged by tip
> maintainers around or during the merge window. The trees are closed
> to all but urgent fixes during this time. They reopen once the merge
> window closes and a new -rc1 kernel has been released.
In other words, your mail to ask us to consider applying is going to get
ignored for at least a week. Best case, it gets put on one of our lists
to go look at later.
My suggestion to you and all other submitters of non-critical fixes is
to spend the time between now and -rc1 reviewing *others* code and
making sure yours is 100% ready once -rc1 is released. The week after
the -rc1 release is a great time to send these emails, not now.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v8 0/6] Introduce CET supervisor state support
2025-06-04 18:45 ` Dave Hansen
@ 2025-06-16 8:08 ` Chao Gao
0 siblings, 0 replies; 23+ messages in thread
From: Chao Gao @ 2025-06-16 8:08 UTC (permalink / raw)
To: Dave Hansen
Cc: x86, linux-kernel, kvm, tglx, seanjc, pbonzini, peterz,
rick.p.edgecombe, weijiang.yang, john.allen, bp, chang.seok.bae,
xin3.li, Dave Hansen, Eric Biggers, H. Peter Anvin, Ingo Molnar,
Kees Cook, Maxim Levitsky, Mitchell Levy, Nikolay Borisov,
Oleg Nesterov, Sohil Mehta, Stanislav Spassov,
Vignesh Balasubramanian
On Wed, Jun 04, 2025 at 11:45:53AM -0700, Dave Hansen wrote:
>My suggestion to you and all other submitters of non-critical fixes is
>to spend the time between now and -rc1 reviewing *others* code and
>making sure yours is 100% ready once -rc1 is released. The week after
>the -rc1 release is a great time to send these emails, not now.
Hi Dave,
Thanks for the suggestion. I just tested this series on the latest tip/master.
It applies cleanly and passes all my tests. Please consider merging it if it
looks good to you.
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-06-16 8:09 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-22 15:10 [PATCH v8 0/6] Introduce CET supervisor state support Chao Gao
2025-05-22 15:10 ` [PATCH v8 1/6] x86/fpu/xstate: Differentiate default features for host and guest FPUs Chao Gao
2025-05-29 20:59 ` John Allen
2025-05-22 15:10 ` [PATCH v8 2/6] x86/fpu: Initialize guest FPU permissions from guest defaults Chao Gao
2025-05-29 21:14 ` John Allen
2025-05-22 15:10 ` [PATCH v8 3/6] x86/fpu: Initialize guest fpstate and FPU pseudo container " Chao Gao
2025-05-29 21:25 ` John Allen
2025-05-22 15:10 ` [PATCH v8 4/6] x86/fpu: Remove xfd argument from __fpstate_reset() Chao Gao
2025-05-29 21:26 ` John Allen
2025-05-22 15:10 ` [PATCH v8 5/6] x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set Chao Gao
2025-05-30 16:04 ` John Allen
2025-05-22 15:10 ` [PATCH v8 6/6] x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature Chao Gao
2025-05-30 16:05 ` John Allen
2025-05-23 16:57 ` [PATCH v8 0/6] Introduce CET supervisor state support Sean Christopherson
2025-05-23 17:12 ` Dave Hansen
2025-05-23 17:48 ` Sean Christopherson
2025-05-27 11:01 ` Chao Gao
2025-06-02 19:12 ` Chang S. Bae
2025-06-03 6:22 ` Chao Gao
2025-06-03 18:32 ` Chang S. Bae
2025-06-04 0:56 ` Chao Gao
2025-06-04 18:45 ` Dave Hansen
2025-06-16 8:08 ` Chao Gao
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).