* [PATCH RFC 0/2] KVM: TDX: MWAIT in guest
@ 2025-08-16 14:44 Adrian Hunter
2025-08-16 14:44 ` [PATCH RFC 1/2] KVM: TDX: Disable general support for " Adrian Hunter
2025-08-16 14:44 ` [PATCH RFC 2/2] KVM: TDX: Add flag to support MWAIT instruction only Adrian Hunter
0 siblings, 2 replies; 13+ messages in thread
From: Adrian Hunter @ 2025-08-16 14:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao, ira.weiny
Hi
TDX support for using the MWAIT instruction in a guest has issues.
One option is just to disable it, see patch 1.
Then perhaps provide a distinct way to enable it, documenting the
limitations TDX has compared with VMX in this regard, so that users
will be made aware of the limitations. See patch 2.
Other options:
1. Do nothing but document the limitations.
2. Patch 1 but not patch 2; look for a better solution
Adrian Hunter (2):
KVM: TDX: Disable general support for MWAIT in guest
KVM: TDX: Add flag to support MWAIT instruction only
Documentation/virt/kvm/x86/intel-tdx.rst | 28 ++++++++++-
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/include/uapi/asm/kvm.h | 3 ++
arch/x86/kvm/vmx/tdx.c | 80 +++++++++++++++++++++++++-------
arch/x86/kvm/x86.c | 8 ++--
5 files changed, 98 insertions(+), 23 deletions(-)
Regards
Adrian
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-16 14:44 [PATCH RFC 0/2] KVM: TDX: MWAIT in guest Adrian Hunter
@ 2025-08-16 14:44 ` Adrian Hunter
2025-08-18 14:05 ` Sean Christopherson
2025-08-16 14:44 ` [PATCH RFC 2/2] KVM: TDX: Add flag to support MWAIT instruction only Adrian Hunter
1 sibling, 1 reply; 13+ messages in thread
From: Adrian Hunter @ 2025-08-16 14:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao, ira.weiny
TDX support for using the MWAIT instruction in a guest has issues, so
disable it for now.
Background
Like VMX, TDX can allow the MWAIT instruction to be executed in a guest.
Unlike VMX, TDX cannot necessarily provide for virtualization of MSRs that
a guest might reasonably expect to exist as well.
For example, in the case of a Linux guest, the default idle driver
intel_idle may access MSR_POWER_CTL or MSR_PKG_CST_CONFIG_CONTROL. To
virtualize those, KVM would need the guest not to enable #VE reduction,
which is not something that KVM can control or even be aware of. Note,
however, that the consequent unchecked MSR access errors might be harmless.
Without #VE reduction enabled, the TDX Module will inject #VE for MSRs that
it does not virtualize itself. The guest can then hypercall the host VMM
for a resolution.
With #VE reduction enabled, accessing MSRs such as the 2 above, results in
the TDX Module injecting #GP.
Currently, Linux guest opts for #VE reduction unconditionally if it is
available, refer reduce_unnecessary_ve(). However, the #VE reduction
feature was not added to the TDX Module until versions 1.5.09 and 2.0.04.
Refer https://github.com/intel/tdx-module/releases
There is also a further issue experienced by a Linux guest. Prior to
TDX Module versions 1.5.09 and 2.0.04, the Always-Running-APIC-Timer (ARAT)
feature (CPUID leaf 6: EAX bit 2) is not exposed. That results in cpuidle
disabling the timer interrupt and invoking the Tick Broadcast framework
to provide a wake-up. Currently, that falls back to the PIT timer which
does not work for TDX, resulting in the guest becoming stuck in the idle
loop.
Conclusion
User's may expect TDX support of MWAIT in a guest to be similar to VMX
support, but KVM cannot ensure that. Consequently KVM should not expose
the capability.
Fixes: 0186dd29a2518 ("KVM: TDX: add ioctl to initialize VM with TDX specific parameters")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/tdx.c | 22 +++++++++++++++++++++-
arch/x86/kvm/x86.c | 8 +++++---
3 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f7af967aa16f..9c8617217adb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1398,6 +1398,8 @@ struct kvm_arch {
gpa_t wall_clock;
+ u64 unsupported_disable_exits;
+
bool mwait_in_guest;
bool hlt_in_guest;
bool pause_in_guest;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9ad460ef97b0..cdf0dc6cf068 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -132,6 +132,17 @@ static void clear_waitpkg(struct kvm_cpuid_entry2 *entry)
entry->ecx &= ~__feature_bit(X86_FEATURE_WAITPKG);
}
+static bool has_mwait(const struct kvm_cpuid_entry2 *entry)
+{
+ return entry->function == 1 &&
+ (entry->ecx & __feature_bit(X86_FEATURE_MWAIT));
+}
+
+static void clear_mwait(struct kvm_cpuid_entry2 *entry)
+{
+ entry->ecx &= ~__feature_bit(X86_FEATURE_MWAIT);
+}
+
static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
{
if (has_tsx(entry))
@@ -139,11 +150,15 @@ static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
if (has_waitpkg(entry))
clear_waitpkg(entry);
+
+ /* Also KVM_X86_DISABLE_EXITS_MWAIT is disallowed in tdx_vm_init() */
+ if (has_mwait(entry))
+ clear_mwait(entry);
}
static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
{
- return has_tsx(entry) || has_waitpkg(entry);
+ return has_tsx(entry) || has_waitpkg(entry) || has_mwait(entry);
}
#define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1)
@@ -615,6 +630,11 @@ int tdx_vm_init(struct kvm *kvm)
kvm->arch.has_protected_state = true;
kvm->arch.has_private_mem = true;
kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT;
+ /*
+ * TDX support for using the MWAIT instruction in a guest has issues,
+ * so disable it for now. See also tdx_clear_unsupported_cpuid().
+ */
+ kvm->arch.unsupported_disable_exits |= KVM_X86_DISABLE_EXITS_MWAIT;
/*
* Because guest TD is protected, VMM can't parse the instruction in TD.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 93636f77c42d..bfd4f52286b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4575,7 +4575,7 @@ static inline bool kvm_can_mwait_in_guest(void)
boot_cpu_has(X86_FEATURE_ARAT);
}
-static u64 kvm_get_allowed_disable_exits(void)
+static u64 kvm_get_allowed_disable_exits(struct kvm *kvm)
{
u64 r = KVM_X86_DISABLE_EXITS_PAUSE;
@@ -4586,6 +4586,8 @@ static u64 kvm_get_allowed_disable_exits(void)
if (kvm_can_mwait_in_guest())
r |= KVM_X86_DISABLE_EXITS_MWAIT;
}
+ if (kvm)
+ r &= ~kvm->arch.unsupported_disable_exits;
return r;
}
@@ -4736,7 +4738,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = KVM_CLOCK_VALID_FLAGS;
break;
case KVM_CAP_X86_DISABLE_EXITS:
- r = kvm_get_allowed_disable_exits();
+ r = kvm_get_allowed_disable_exits(kvm);
break;
case KVM_CAP_X86_SMM:
if (!IS_ENABLED(CONFIG_KVM_SMM))
@@ -6613,7 +6615,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
break;
case KVM_CAP_X86_DISABLE_EXITS:
r = -EINVAL;
- if (cap->args[0] & ~kvm_get_allowed_disable_exits())
+ if (cap->args[0] & ~kvm_get_allowed_disable_exits(kvm))
break;
mutex_lock(&kvm->lock);
--
2.48.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH RFC 2/2] KVM: TDX: Add flag to support MWAIT instruction only
2025-08-16 14:44 [PATCH RFC 0/2] KVM: TDX: MWAIT in guest Adrian Hunter
2025-08-16 14:44 ` [PATCH RFC 1/2] KVM: TDX: Disable general support for " Adrian Hunter
@ 2025-08-16 14:44 ` Adrian Hunter
1 sibling, 0 replies; 13+ messages in thread
From: Adrian Hunter @ 2025-08-16 14:44 UTC (permalink / raw)
To: pbonzini, seanjc
Cc: kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao, ira.weiny
Add a TDX-specific flag to allow for using the MWAIT instruction in a
guest. This provides for users that understand the limitations that TDX
has compared with VMX in this regard.
The limitations are:
1. TDX Module versions prior to 1.5.09 and 2.0.04 do not expose the
Always-Running-APIC-Timer (ARAT) feature (CPUID leaf 6: EAX bit 2),
which a TDX guest may need for correct handling of deep C-states.
For example, with a Linux guest, that results in cpuidle disabling the
timer interrupt and invoking the Tick Broadcast framework to provide a
wake-up. Currently, that falls back to the PIT timer which does not
work for TDX, resulting in the guest becoming stuck in the idle loop.
2. TDX Module versions 1.5.09 and 2.0.04 or later support #VE reduction,
which, if the guest opts to enable it, results in the TDX Module
injecting #GP for accesses to MSRs that the guest could reasonably
assume to exist if the MWAIT feature is available.
A Linux guest could possibly be used with TDX support for MWAIT, for
example by:
a) - Using TDX Module versions 1.5.09 and 2.0.04 or later, and
- Using acpi_idle driver with suitable ACPI tables like _CST
b) - Using TDX Module versions 1.5.09 and 2.0.04 or later, and
- Ignoring unchecked MSR access errors from intel_idle
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
Documentation/virt/kvm/x86/intel-tdx.rst | 28 ++++++++++-
arch/x86/include/uapi/asm/kvm.h | 3 ++
arch/x86/kvm/vmx/tdx.c | 62 ++++++++++++++++--------
3 files changed, 72 insertions(+), 21 deletions(-)
diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst
index bcfa97e0c9e7..b534a092b4c1 100644
--- a/Documentation/virt/kvm/x86/intel-tdx.rst
+++ b/Documentation/virt/kvm/x86/intel-tdx.rst
@@ -70,8 +70,12 @@ Return the TDX capabilities that current KVM supports with the specific TDX
module loaded in the system. It reports what features/capabilities are allowed
to be configured to the TDX guest.
+KVM_TDX_FLAGS_ALLOW_MWAIT flag allows the capability to use the MWAIT
+instruction in a guest (CPUID leaf 1 ECX bit 3), but beware of the limitations,
+see "MWAIT Limitations" below.
+
- id: KVM_TDX_CAPABILITIES
-- flags: must be 0
+- flags: must be 0, or KVM_TDX_FLAGS_ALLOW_MWAIT (if KVM_TDX_CAP_ALLOW_MWAIT)
- data: pointer to struct kvm_tdx_capabilities
- hw_error: must be 0
@@ -111,8 +115,12 @@ KVM_TDX_INIT_VM
Perform TDX specific VM initialization. This needs to be called after
KVM_CREATE_VM and before creating any VCPUs.
+KVM_TDX_FLAGS_ALLOW_MWAIT flag allows the capability to use the MWAIT
+instruction in a guest (CPUID leaf 1 ECX bit 3), but beware of the limitations,
+see "MWAIT Limitations" below.
+
- id: KVM_TDX_INIT_VM
-- flags: must be 0
+- flags: must be 0, or KVM_TDX_FLAGS_ALLOW_MWAIT (if KVM_TDX_CAP_ALLOW_MWAIT)
- data: pointer to struct kvm_tdx_init_vm
- hw_error: must be 0
@@ -282,6 +290,22 @@ control flow is as follows:
#. Run VCPU
+MWAIT Limitations
+=================
+
+- TDX Module versions 1.5.09 and 2.0.04 or later support #VE reduction,
+ which, if the guest opts to enable it, results in the TDX Module
+ injecting #GP for accesses to MSRs that the guest could reasonably
+ assume to exist if the MWAIT feature is available.
+
+- TDX Module versions prior to 1.5.09 and 2.0.04 do not expose the
+ Always-Running-APIC-Timer (ARAT) feature (CPUID leaf 6: EAX bit 2),
+ which a TDX guest may need for correct handling of deep C-states.
+ For example, with a Linux guest, that results in cpuidle disabling the
+ timer interrupt and invoking the Tick Broadcast framework to provide a
+ wake-up. Currently, that falls back to the PIT timer which does not
+ work for TDX, resulting in the guest becoming stuck in the idle loop.
+
References
==========
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e019111e2150..8175e05c9e50 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -945,6 +945,8 @@ enum kvm_tdx_cmd_id {
KVM_TDX_CMD_NR_MAX,
};
+#define KVM_TDX_FLAGS_ALLOW_MWAIT _BITUL(0)
+
struct kvm_tdx_cmd {
/* enum kvm_tdx_cmd_id */
__u32 id;
@@ -964,6 +966,7 @@ struct kvm_tdx_cmd {
};
#define KVM_TDX_CAP_TERMINATE_VM _BITULL(0)
+#define KVM_TDX_CAP_ALLOW_MWAIT _BITULL(1)
struct kvm_tdx_capabilities {
__u64 supported_attrs;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cdf0dc6cf068..db85624e0e78 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -143,7 +143,7 @@ static void clear_mwait(struct kvm_cpuid_entry2 *entry)
entry->ecx &= ~__feature_bit(X86_FEATURE_MWAIT);
}
-static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
+static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry, bool disallow_mwait)
{
if (has_tsx(entry))
clear_tsx(entry);
@@ -152,18 +152,20 @@ static void tdx_clear_unsupported_cpuid(struct kvm_cpuid_entry2 *entry)
clear_waitpkg(entry);
/* Also KVM_X86_DISABLE_EXITS_MWAIT is disallowed in tdx_vm_init() */
- if (has_mwait(entry))
+ if (disallow_mwait && has_mwait(entry))
clear_mwait(entry);
}
-static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry)
+static bool tdx_unsupported_cpuid(const struct kvm_cpuid_entry2 *entry, bool disallow_mwait)
{
- return has_tsx(entry) || has_waitpkg(entry) || has_mwait(entry);
+ return has_tsx(entry) || has_waitpkg(entry) ||
+ (disallow_mwait && has_mwait(entry));
}
#define KVM_TDX_CPUID_NO_SUBLEAF ((__u32)-1)
-static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx)
+static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char idx,
+ bool disallow_mwait)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
@@ -185,14 +187,15 @@ static void td_init_cpuid_entry2(struct kvm_cpuid_entry2 *entry, unsigned char i
if (entry->function == 0x80000008)
entry->eax = tdx_set_guest_phys_addr_bits(entry->eax, 0xff);
- tdx_clear_unsupported_cpuid(entry);
+ tdx_clear_unsupported_cpuid(entry, disallow_mwait);
}
#define TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT BIT(1)
-static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
+static int init_kvm_tdx_caps(struct kvm *kvm, const struct tdx_sys_info_td_conf *td_conf,
struct kvm_tdx_capabilities *caps)
{
+ bool disallow_mwait = kvm->arch.unsupported_disable_exits & KVM_X86_DISABLE_EXITS_MWAIT;
int i;
caps->supported_attrs = tdx_get_supported_attrs(td_conf);
@@ -203,7 +206,7 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
if (!caps->supported_xfam)
return -EIO;
- caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM;
+ caps->supported_caps = KVM_TDX_CAP_TERMINATE_VM | KVM_TDX_CAP_ALLOW_MWAIT;
caps->cpuid.nent = td_conf->num_cpuid_config;
@@ -211,7 +214,7 @@ static int init_kvm_tdx_caps(const struct tdx_sys_info_td_conf *td_conf,
TDVMCALLINFO_SETUP_EVENT_NOTIFY_INTERRUPT;
for (i = 0; i < td_conf->num_cpuid_config; i++)
- td_init_cpuid_entry2(&caps->cpuid.entries[i], i);
+ td_init_cpuid_entry2(&caps->cpuid.entries[i], i, disallow_mwait);
return 0;
}
@@ -2268,7 +2271,9 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
}
}
-static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
+#define KVM_TDX_CAPABILITIES_FLAGS KVM_TDX_FLAGS_ALLOW_MWAIT
+
+static int tdx_get_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
struct kvm_tdx_capabilities __user *user_caps;
@@ -2276,10 +2281,12 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
u32 nr_user_entries;
int ret = 0;
- /* flags is reserved for future use */
- if (cmd->flags)
+ if (cmd->flags & ~KVM_TDX_CAPABILITIES_FLAGS)
return -EINVAL;
+ if (cmd->flags & KVM_TDX_FLAGS_ALLOW_MWAIT)
+ kvm->arch.unsupported_disable_exits &= ~KVM_X86_DISABLE_EXITS_MWAIT;
+
caps = kzalloc(sizeof(*caps) +
sizeof(struct kvm_cpuid_entry2) * td_conf->num_cpuid_config,
GFP_KERNEL);
@@ -2297,7 +2304,7 @@ static int tdx_get_capabilities(struct kvm_tdx_cmd *cmd)
goto out;
}
- ret = init_kvm_tdx_caps(td_conf, caps);
+ ret = init_kvm_tdx_caps(kvm, td_conf, caps);
if (ret)
goto out;
@@ -2356,9 +2363,19 @@ static int setup_tdparams_eptp_controls(struct kvm_cpuid2 *cpuid,
return 0;
}
-static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
+static void tdx_update_mwait_in_guest(struct kvm *kvm, struct kvm_cpuid2 *cpuid)
+{
+ const struct kvm_cpuid_entry2 *entry;
+
+ entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent, 1, 0);
+
+ kvm->arch.mwait_in_guest = entry && has_mwait(entry);
+}
+
+static int setup_tdparams_cpuids(struct kvm *kvm, struct kvm_cpuid2 *cpuid,
struct td_params *td_params)
{
+ bool disallow_mwait = kvm->arch.unsupported_disable_exits & KVM_X86_DISABLE_EXITS_MWAIT;
const struct tdx_sys_info_td_conf *td_conf = &tdx_sysinfo->td_conf;
const struct kvm_cpuid_entry2 *entry;
struct tdx_cpuid_value *value;
@@ -2372,14 +2389,14 @@ static int setup_tdparams_cpuids(struct kvm_cpuid2 *cpuid,
for (i = 0; i < td_conf->num_cpuid_config; i++) {
struct kvm_cpuid_entry2 tmp;
- td_init_cpuid_entry2(&tmp, i);
+ td_init_cpuid_entry2(&tmp, i, disallow_mwait);
entry = kvm_find_cpuid_entry2(cpuid->entries, cpuid->nent,
tmp.function, tmp.index);
if (!entry)
continue;
- if (tdx_unsupported_cpuid(entry))
+ if (tdx_unsupported_cpuid(entry, disallow_mwait))
return -EINVAL;
copy_cnt++;
@@ -2437,10 +2454,12 @@ static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
if (ret)
return ret;
- ret = setup_tdparams_cpuids(cpuid, td_params);
+ ret = setup_tdparams_cpuids(kvm, cpuid, td_params);
if (ret)
return ret;
+ tdx_update_mwait_in_guest(kvm, cpuid);
+
#define MEMCPY_SAME_SIZE(dst, src) \
do { \
BUILD_BUG_ON(sizeof(dst) != sizeof(src)); \
@@ -2745,6 +2764,8 @@ static int tdx_read_cpuid(struct kvm_vcpu *vcpu, u32 leaf, u32 sub_leaf,
return -EIO;
}
+#define KVM_TDX_INIT_VM_FLAGS KVM_TDX_FLAGS_ALLOW_MWAIT
+
static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -2758,9 +2779,12 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
if (kvm_tdx->state != TD_STATE_UNINITIALIZED)
return -EINVAL;
- if (cmd->flags)
+ if (cmd->flags & ~KVM_TDX_INIT_VM_FLAGS)
return -EINVAL;
+ if (cmd->flags & KVM_TDX_FLAGS_ALLOW_MWAIT)
+ kvm->arch.unsupported_disable_exits &= ~KVM_X86_DISABLE_EXITS_MWAIT;
+
init_vm = kmalloc(sizeof(*init_vm) +
sizeof(init_vm->cpuid.entries[0]) * KVM_MAX_CPUID_ENTRIES,
GFP_KERNEL);
@@ -2925,7 +2949,7 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
switch (tdx_cmd.id) {
case KVM_TDX_CAPABILITIES:
- r = tdx_get_capabilities(&tdx_cmd);
+ r = tdx_get_capabilities(kvm, &tdx_cmd);
break;
case KVM_TDX_INIT_VM:
r = tdx_td_init(kvm, &tdx_cmd);
--
2.48.1
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-16 14:44 ` [PATCH RFC 1/2] KVM: TDX: Disable general support for " Adrian Hunter
@ 2025-08-18 14:05 ` Sean Christopherson
2025-08-18 15:07 ` Adrian Hunter
2025-08-18 18:49 ` Edgecombe, Rick P
0 siblings, 2 replies; 13+ messages in thread
From: Sean Christopherson @ 2025-08-18 14:05 UTC (permalink / raw)
To: Adrian Hunter
Cc: pbonzini, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao, ira.weiny
On Sat, Aug 16, 2025, Adrian Hunter wrote:
> TDX support for using the MWAIT instruction in a guest has issues, so
> disable it for now.
>
> Background
>
> Like VMX, TDX can allow the MWAIT instruction to be executed in a guest.
> Unlike VMX, TDX cannot necessarily provide for virtualization of MSRs that
> a guest might reasonably expect to exist as well.
>
> For example, in the case of a Linux guest, the default idle driver
> intel_idle may access MSR_POWER_CTL or MSR_PKG_CST_CONFIG_CONTROL. To
> virtualize those, KVM would need the guest not to enable #VE reduction,
> which is not something that KVM can control or even be aware of. Note,
> however, that the consequent unchecked MSR access errors might be harmless.
>
> Without #VE reduction enabled, the TDX Module will inject #VE for MSRs that
> it does not virtualize itself. The guest can then hypercall the host VMM
> for a resolution.
>
> With #VE reduction enabled, accessing MSRs such as the 2 above, results in
> the TDX Module injecting #GP.
>
> Currently, Linux guest opts for #VE reduction unconditionally if it is
> available, refer reduce_unnecessary_ve(). However, the #VE reduction
> feature was not added to the TDX Module until versions 1.5.09 and 2.0.04.
> Refer https://github.com/intel/tdx-module/releases
>
> There is also a further issue experienced by a Linux guest. Prior to
> TDX Module versions 1.5.09 and 2.0.04, the Always-Running-APIC-Timer (ARAT)
> feature (CPUID leaf 6: EAX bit 2) is not exposed. That results in cpuidle
> disabling the timer interrupt and invoking the Tick Broadcast framework
> to provide a wake-up. Currently, that falls back to the PIT timer which
> does not work for TDX, resulting in the guest becoming stuck in the idle
> loop.
>
> Conclusion
>
> User's may expect TDX support of MWAIT in a guest to be similar to VMX
> support, but KVM cannot ensure that. Consequently KVM should not expose
> the capability.
>
> Fixes: 0186dd29a2518 ("KVM: TDX: add ioctl to initialize VM with TDX specific parameters")
> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
> ---
NAK.
Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
hack-a-fixes to workaround buggy software/firmware. Been there, done that.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-18 14:05 ` Sean Christopherson
@ 2025-08-18 15:07 ` Adrian Hunter
2025-08-18 18:49 ` Edgecombe, Rick P
1 sibling, 0 replies; 13+ messages in thread
From: Adrian Hunter @ 2025-08-18 15:07 UTC (permalink / raw)
To: Brown, Len
Cc: pbonzini, kvm, rick.p.edgecombe, kirill.shutemov, kai.huang,
reinette.chatre, xiaoyao.li, tony.lindgren, binbin.wu,
isaku.yamahata, linux-kernel, yan.y.zhao, chao.gao, ira.weiny,
Sean Christopherson
On 18/08/2025 17:05, Sean Christopherson wrote:
> On Sat, Aug 16, 2025, Adrian Hunter wrote:
>> TDX support for using the MWAIT instruction in a guest has issues, so
>> disable it for now.
>>
>> Background
>>
>> Like VMX, TDX can allow the MWAIT instruction to be executed in a guest.
>> Unlike VMX, TDX cannot necessarily provide for virtualization of MSRs that
>> a guest might reasonably expect to exist as well.
>>
>> For example, in the case of a Linux guest, the default idle driver
>> intel_idle may access MSR_POWER_CTL or MSR_PKG_CST_CONFIG_CONTROL. To
>> virtualize those, KVM would need the guest not to enable #VE reduction,
>> which is not something that KVM can control or even be aware of. Note,
>> however, that the consequent unchecked MSR access errors might be harmless.
>>
>> Without #VE reduction enabled, the TDX Module will inject #VE for MSRs that
>> it does not virtualize itself. The guest can then hypercall the host VMM
>> for a resolution.
>>
>> With #VE reduction enabled, accessing MSRs such as the 2 above, results in
>> the TDX Module injecting #GP.
>>
>> Currently, Linux guest opts for #VE reduction unconditionally if it is
>> available, refer reduce_unnecessary_ve(). However, the #VE reduction
>> feature was not added to the TDX Module until versions 1.5.09 and 2.0.04.
>> Refer https://github.com/intel/tdx-module/releases
>>
>> There is also a further issue experienced by a Linux guest. Prior to
>> TDX Module versions 1.5.09 and 2.0.04, the Always-Running-APIC-Timer (ARAT)
>> feature (CPUID leaf 6: EAX bit 2) is not exposed. That results in cpuidle
>> disabling the timer interrupt and invoking the Tick Broadcast framework
>> to provide a wake-up. Currently, that falls back to the PIT timer which
>> does not work for TDX, resulting in the guest becoming stuck in the idle
>> loop.
>>
>> Conclusion
>>
>> User's may expect TDX support of MWAIT in a guest to be similar to VMX
>> support, but KVM cannot ensure that. Consequently KVM should not expose
>> the capability.
>>
>> Fixes: 0186dd29a2518 ("KVM: TDX: add ioctl to initialize VM with TDX specific parameters")
>> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
>> ---
>
> NAK.
>
> Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
> hack-a-fixes to workaround buggy software/firmware. Been there, done that.
Thanks for the quick reply. Adding Len Brown for intel_idle.
Len, you may recall that MSR_PKG_CST_CONFIG_CONTROL came up in
the following context:
https://bugzilla.kernel.org/show_bug.cgi?id=218792
https://lore.kernel.org/kvm/bug-218792-28872-5sylPIVpHD@https.bugzilla.kernel.org%2F/
For TDX platforms we would need _safe() MSR access for MSR_POWER_CTL
and MSR_PKG_CST_CONFIG_CONTROL. Would that be OK?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-18 14:05 ` Sean Christopherson
2025-08-18 15:07 ` Adrian Hunter
@ 2025-08-18 18:49 ` Edgecombe, Rick P
2025-08-19 5:40 ` Binbin Wu
2025-08-19 7:38 ` Adrian Hunter
1 sibling, 2 replies; 13+ messages in thread
From: Edgecombe, Rick P @ 2025-08-18 18:49 UTC (permalink / raw)
To: Hunter, Adrian, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com,
Chatre, Reinette, Li, Xiaoyao, kirill.shutemov@linux.intel.com,
tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira
Attn: Binbin, Xiaoyao
On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
> NAK.
>
> Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
> hack-a-fixes to workaround buggy software/firmware. Been there, done that.
Yes, I would have thought we should have at least had a TDX module change option
for this.
But side topic. We have an existing arch TODO around creating some guidelines
around how CPUID bit configuration should evolve.
A new directly configurable CPUID bit that affects host state is an obvious no-
no. But how about a directly configurable bit that can't hurt the host, but
requires host changes to virtualize in an x86 arch compliant way? (not quite
like this MWAIT case)
In some ways KVM shouldn't care since it's between userspace and the TDX module.
But userspace may try to set it and then we would have a situation where the bit
would remain malfunctioning until/if KVM decided to add support for the bit. If
KVM never did then it would be silently broken. It's not a kernel regression,
but not great either.
If we required some other opt-in for each such feature, it would further
complicate the CPUID bit configuration interface. I think I'd rather keep
directly configurable CPUID bits as the main way to configure the TD.
Maybe we could have the TDX module enumerate which direct bits require VMM
enabling and KVM could automatically filter them? So then TDX module could add
simple feature bits without fuss, but KVM could manually enable the bits that
require consideration.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-18 18:49 ` Edgecombe, Rick P
@ 2025-08-19 5:40 ` Binbin Wu
2025-08-19 15:59 ` Edgecombe, Rick P
2025-08-19 7:38 ` Adrian Hunter
1 sibling, 1 reply; 13+ messages in thread
From: Binbin Wu @ 2025-08-19 5:40 UTC (permalink / raw)
To: Edgecombe, Rick P, Hunter, Adrian, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, Chatre, Reinette, Li, Xiaoyao,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira
On 8/19/2025 2:49 AM, Edgecombe, Rick P wrote:
> Attn: Binbin, Xiaoyao
>
> On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
>> NAK.
>>
>> Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
>> hack-a-fixes to workaround buggy software/firmware. Been there, done that.
> Yes, I would have thought we should have at least had a TDX module change option
> for this.
>
> But side topic. We have an existing arch TODO around creating some guidelines
> around how CPUID bit configuration should evolve.
>
> A new directly configurable CPUID bit that affects host state is an obvious no-
> no. But how about a directly configurable bit that can't hurt the host, but
> requires host changes to virtualize in an x86 arch compliant way? (not quite
> like this MWAIT case)
>
> In some ways KVM shouldn't care since it's between userspace and the TDX module.
> But userspace may try to set it and then we would have a situation where the bit
> would remain malfunctioning until/if KVM decided to add support for the bit. If
> KVM never did then it would be silently broken. It's not a kernel regression,
> but not great either.
>
> If we required some other opt-in for each such feature, it would further
> complicate the CPUID bit configuration interface. I think I'd rather keep
> directly configurable CPUID bits as the main way to configure the TD.
Currently, KVM TDX code filters out TSX (HLE or RTM) and WAITPKG using
tdx_clear_unsupported_cpuid(), which is sort of blacklist.
I am wondering if we could add another array, e.g., tdx_cpu_caps[], which is the
TDX version of kvm_cpu_caps[].
Using tdx_cpu_caps[] is a whitelist way.
For a new feature
- If the developer doesn't know anything about TDX, the bit just be added to
kvm_cpu_caps[].
- If the developer knows that the feature supported by both non-TDX VMs and TDs
(either the feature doesn't require any additional virtualization support or
the virtualization support is added for TDX), extend the macros to set the bit
both in kvm_cpu_caps[] and tdx_cpu_caps[].
- If there is a feature not supported by non-TDX VMs, but supported by TDs,
extend the macros to set the bit only in tdx_cpu_caps[].
So, tdx_cpu_caps[] could be used as the filter of configurable bits reported
to userspace.
Comparing to blacklist (i.e., tdx_clear_unsupported_cpuid()), there is no risk
that a feature not supported by TDX is forgotten to be added to the blacklist.
Also, tdx_cpu_caps[] could support a feature that not supported for non-TDX VMs.
Then we don't need a host opt-in for these directly configurable bits not
clobbering host states.
Of course, to prevent userspace from setting feature bit that would clobber host
state, but not included in tdx_cpu_caps[], I think a new feature that would
clobber host state should requires a host opt-in to TDX module.
>
> Maybe we could have the TDX module enumerate which direct bits require VMM
> enabling and KVM could automatically filter them? So then TDX module could add
> simple feature bits without fuss, but KVM could manually enable the bits that
> require consideration.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-18 18:49 ` Edgecombe, Rick P
2025-08-19 5:40 ` Binbin Wu
@ 2025-08-19 7:38 ` Adrian Hunter
2025-08-19 15:07 ` Edgecombe, Rick P
1 sibling, 1 reply; 13+ messages in thread
From: Adrian Hunter @ 2025-08-19 7:38 UTC (permalink / raw)
To: Edgecombe, Rick P, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, binbin.wu@linux.intel.com,
Chatre, Reinette, Li, Xiaoyao, kirill.shutemov@linux.intel.com,
tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira, Brown, Len
On 18/08/2025 21:49, Edgecombe, Rick P wrote:
> Attn: Binbin, Xiaoyao
>
> On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
>> NAK.
>>
>> Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
>> hack-a-fixes to workaround buggy software/firmware. Been there, done that.
>
> Yes, I would have thought we should have at least had a TDX module change option
> for this.
That would not help with existing TDX Modules, and would possibly require
a guest opt-in, which would not help with existing guests. Hence, to start
with disabling the feature first, and look for another solution second.
In the MWAIT case, Sean has rejected supporting MSR_PKG_CST_CONFIG_CONTROL
even for VMX, because it is an optional MSR, so altering intel_idle is
being proposed.
>
> But side topic. We have an existing arch TODO around creating some guidelines
> around how CPUID bit configuration should evolve.
>
> A new directly configurable CPUID bit that affects host state is an obvious no-
> no. But how about a directly configurable bit that can't hurt the host, but
> requires host changes to virtualize in an x86 arch compliant way? (not quite
> like this MWAIT case)
It is still "new stuff that breaks old stuff" which is generally
"just don't do that".
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-19 7:38 ` Adrian Hunter
@ 2025-08-19 15:07 ` Edgecombe, Rick P
2025-08-19 23:35 ` Sean Christopherson
0 siblings, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2025-08-19 15:07 UTC (permalink / raw)
To: Hunter, Adrian, seanjc@google.com
Cc: Gao, Chao, Brown, Len, Huang, Kai, binbin.wu@linux.intel.com,
Li, Xiaoyao, Chatre, Reinette, kirill.shutemov@linux.intel.com,
tony.lindgren@linux.intel.com, kvm@vger.kernel.org,
pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira
On Tue, 2025-08-19 at 10:38 +0300, Adrian Hunter wrote:
> On 18/08/2025 21:49, Edgecombe, Rick P wrote:
> > Attn: Binbin, Xiaoyao
> >
> > On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
> > > NAK.
> > >
> > > Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
> > > hack-a-fixes to workaround buggy software/firmware. Been there, done that.
> >
> > Yes, I would have thought we should have at least had a TDX module change option
> > for this.
>
> That would not help with existing TDX Modules, and would possibly require
> a guest opt-in, which would not help with existing guests. Hence, to start
> with disabling the feature first, and look for another solution second.
I think you have the priorities wrong. There are only so many kludges we can ask
KVM to take. Across all the changes people want for TDX, do you think not having
to update the TDX module, backport a guest fix or even just adjust qemu args is
more important the other stuff?
TDX support is still very early. We need to think about long term sustainable
solutions. So a fix that doesn't support existing TDX modules or guests (the
intel_idle fix is also in this category anyway) should absolutely be on the
table.
>
> In the MWAIT case, Sean has rejected supporting MSR_PKG_CST_CONFIG_CONTROL
> even for VMX, because it is an optional MSR, so altering intel_idle is
> being proposed.
It seems reasonable for this specific case.
>
> >
> > But side topic. We have an existing arch TODO around creating some guidelines
> > around how CPUID bit configuration should evolve.
> >
> > A new directly configurable CPUID bit that affects host state is an obvious no-
> > no. But how about a directly configurable bit that can't hurt the host, but
> > requires host changes to virtualize in an x86 arch compliant way? (not quite
> > like this MWAIT case)
>
> It is still "new stuff that breaks old stuff" which is generally
> "just don't do that".
>
I don't think so? It doesn't necessarily break old stuff if userspace doesn't
use the bit yet.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-19 5:40 ` Binbin Wu
@ 2025-08-19 15:59 ` Edgecombe, Rick P
2025-08-28 10:11 ` Binbin Wu
0 siblings, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2025-08-19 15:59 UTC (permalink / raw)
To: Hunter, Adrian, seanjc@google.com, binbin.wu@linux.intel.com
Cc: Gao, Chao, Huang, Kai, Li, Xiaoyao, Chatre, Reinette,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
pbonzini@redhat.com, kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira
On Tue, 2025-08-19 at 13:40 +0800, Binbin Wu wrote:
> Currently, KVM TDX code filters out TSX (HLE or RTM) and WAITPKG using
> tdx_clear_unsupported_cpuid(), which is sort of blacklist.
>
> I am wondering if we could add another array, e.g., tdx_cpu_caps[], which is the
> TDX version of kvm_cpu_caps[].
>
> Using tdx_cpu_caps[] is a whitelist way.
We had something like this in some of the earlier revisions of the TDX CPUID
configuration.
> For a new feature
> - If the developer doesn't know anything about TDX, the bit just be added to
> kvm_cpu_caps[].
> - If the developer knows that the feature supported by both non-TDX VMs and TDs
> (either the feature doesn't require any additional virtualization support or
> the virtualization support is added for TDX), extend the macros to set the bit
> both in kvm_cpu_caps[] and tdx_cpu_caps[].
> - If there is a feature not supported by non-TDX VMs, but supported by TDs,
> extend the macros to set the bit only in tdx_cpu_caps[].
> So, tdx_cpu_caps[] could be used as the filter of configurable bits reported
> to userspace.
In some ways this is the simplest, but having to maintain a big list in KVM was
not ideal. The original solution started with KVM_GET_SUPPORTED_CPUID and then
massaged the results to fit, so maybe just encoding the whole thing separately
is enough to reconsider it.
But what I was thinking is that we could most of that hardcoded list into the
TDX module, and only keep a list of non-trivial features (i.e. not simple
instruction CPUID bits) in KVM. The list of simple features (definition TBD)
could be provided by the TDX module. So KVM could do the full filtering but only
keep a list that today would just look like TSX and WAITPKG that we already
have. So basically the same as what you are proposing, but just shrinks the size
of list KVM has to keep.
>
> Comparing to blacklist (i.e., tdx_clear_unsupported_cpuid()), there is no risk
> that a feature not supported by TDX is forgotten to be added to the blacklist.
> Also, tdx_cpu_caps[] could support a feature that not supported for non-TDX VMs.
We definitely can't have TDX module adding any host affecting features that we
would automatically allow. And having a separate opt-in interface that doesn't
"speak" cpuid bits is going to just complicate the already complicated logic
that is in QEMU.
>
> Then we don't need a host opt-in for these directly configurable bits not
> clobbering host states.
>
> Of course, to prevent userspace from setting feature bit that would clobber host
> state, but not included in tdx_cpu_caps[], I think a new feature that would
> clobber host state should requires a host opt-in to TDX module.
Yes, but if have some way to get the host clobbering type info programatically
we could keep the host opt-in as part of the main CPUID bit configuration. What
I think will be bad is if we grow a separate protocol of opt-ins. KVM and QEMU
manage everything with CPUID, so it will be easier if we stick to that.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-19 15:07 ` Edgecombe, Rick P
@ 2025-08-19 23:35 ` Sean Christopherson
2025-08-20 1:31 ` Xiaoyao Li
0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2025-08-19 23:35 UTC (permalink / raw)
To: Rick P Edgecombe
Cc: Adrian Hunter, Chao Gao, Len Brown, Kai Huang,
binbin.wu@linux.intel.com, Xiaoyao Li, Reinette Chatre,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Isaku Yamahata,
linux-kernel@vger.kernel.org, Yan Y Zhao, Ira Weiny
On Tue, Aug 19, 2025, Rick P Edgecombe wrote:
> On Tue, 2025-08-19 at 10:38 +0300, Adrian Hunter wrote:
> > On 18/08/2025 21:49, Edgecombe, Rick P wrote:
> > > Attn: Binbin, Xiaoyao
> > >
> > > On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
> > > > NAK.
> > > >
> > > > Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
> > > > hack-a-fixes to workaround buggy software/firmware. Been there, done that.
> > >
> > > Yes, I would have thought we should have at least had a TDX module change option
> > > for this.
> >
> > That would not help with existing TDX Modules, and would possibly require
> > a guest opt-in, which would not help with existing guests. Hence, to start
> > with disabling the feature first, and look for another solution second.
>
> I think you have the priorities wrong. There are only so many kludges we can ask
> KVM to take. Across all the changes people want for TDX, do you think not having
> to update the TDX module, backport a guest fix or even just adjust qemu args is
> more important the other stuff?
I'm especially sensitive to fudging around _bugs_ in other pieces of the stack.
KVM has been burned badly, multiple times, by hacking around issues elsewhere.
There are inevitably cases where throwing something into KVM is the least awful
choice (usually because it's the only feasible choise), but this ain't one of
those cases.
> TDX support is still very early. We need to think about long term sustainable
> solutions. So a fix that doesn't support existing TDX modules or guests (the
> intel_idle fix is also in this category anyway) should absolutely be on the
> table.
>
> >
> > In the MWAIT case, Sean has rejected supporting MSR_PKG_CST_CONFIG_CONTROL
> > even for VMX, because it is an optional MSR, so altering intel_idle is
> > being proposed.
No, I rejected support MSR_PKG_CST_CONFIG_CONTROL _in KVM_ because I don't see
any reason to shove information into KVM. AFAICT, it's not an "architectural"
MSR, and all of KVM's existing handling of truly uarch/model-specific MSRs is
painful and ugly.
And userspace (e.g. QEMU) could support emulate MSR_PKG_CST_CONFIG_CONTROL (and
any other MSRs of that nature) via MSR filters. I doubt the MSR is accessed
outside of boot paths, so the cost of a userspace exit should be a non-issue.
Of course, QEMU probably can't provide useful/accurate information.
One option if there is a super strong need to do so would be to add a "disable
exits" capability to let the guest read package c-state MSRs, but that has
obvious downsides and would still just be fudging around a flawed driver.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8dbf19aa66ef..c254aa26ff22 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4120,6 +4120,15 @@ void vmx_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
}
+ if (kvm_package_cstate_in_guest(vcpu->kvm)) {
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_CST_CONFIG_CONTROL, MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C2_RESIDENCY, MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C3_RESIDENCY, MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C6_RESIDENCY, MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C8_RESIDENCY, MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C9_RESIDENCY, MSR_TYPE_R);
+ vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C10_RESIDENCY, MSR_TYPE_R);
+ }
if (kvm_aperfmperf_in_guest(vcpu->kvm)) {
vmx_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
vmx_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-19 23:35 ` Sean Christopherson
@ 2025-08-20 1:31 ` Xiaoyao Li
0 siblings, 0 replies; 13+ messages in thread
From: Xiaoyao Li @ 2025-08-20 1:31 UTC (permalink / raw)
To: Sean Christopherson, Rick P Edgecombe
Cc: Adrian Hunter, Chao Gao, Len Brown, Kai Huang,
binbin.wu@linux.intel.com, Reinette Chatre,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Isaku Yamahata,
linux-kernel@vger.kernel.org, Yan Y Zhao, Ira Weiny
On 8/20/2025 7:35 AM, Sean Christopherson wrote:
> On Tue, Aug 19, 2025, Rick P Edgecombe wrote:
>> On Tue, 2025-08-19 at 10:38 +0300, Adrian Hunter wrote:
>>> On 18/08/2025 21:49, Edgecombe, Rick P wrote:
>>>> Attn: Binbin, Xiaoyao
>>>>
>>>> On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
>>>>> NAK.
>>>>>
>>>>> Fix the guest, or wherever else in the pile there are issues. KVM is NOT carrying
>>>>> hack-a-fixes to workaround buggy software/firmware. Been there, done that.
>>>>
>>>> Yes, I would have thought we should have at least had a TDX module change option
>>>> for this.
>>>
>>> That would not help with existing TDX Modules, and would possibly require
>>> a guest opt-in, which would not help with existing guests. Hence, to start
>>> with disabling the feature first, and look for another solution second.
>>
>> I think you have the priorities wrong. There are only so many kludges we can ask
>> KVM to take. Across all the changes people want for TDX, do you think not having
>> to update the TDX module, backport a guest fix or even just adjust qemu args is
>> more important the other stuff?
>
> I'm especially sensitive to fudging around _bugs_ in other pieces of the stack.
> KVM has been burned badly, multiple times, by hacking around issues elsewhere.
> There are inevitably cases where throwing something into KVM is the least awful
> choice (usually because it's the only feasible choise), but this ain't one of
> those cases.
>
>> TDX support is still very early. We need to think about long term sustainable
>> solutions. So a fix that doesn't support existing TDX modules or guests (the
>> intel_idle fix is also in this category anyway) should absolutely be on the
>> table.
>>
>>>
>>> In the MWAIT case, Sean has rejected supporting MSR_PKG_CST_CONFIG_CONTROL
>>> even for VMX, because it is an optional MSR, so altering intel_idle is
>>> being proposed.
>
> No, I rejected support MSR_PKG_CST_CONFIG_CONTROL _in KVM_ because I don't see
> any reason to shove information into KVM. AFAICT, it's not an "architectural"
> MSR, and all of KVM's existing handling of truly uarch/model-specific MSRs is
> painful and ugly.
E.g., for MSR_IA32_POWER_CTL (I don't know why it has _IA32_ in the
name, it seems to me not an architectural MSR), KVM chose to just
emulate the read/write of it as NOP to workaround the issue that guest
driver will access it after MWAIT is allowed natively for the guest.
TDX allowed the same workaround by returning true for MSR_IA32_POWER_CTL
in tdx_has_emulated_msr(). So that when td guest requests emulation from
KVM on #VE, the workaround can come into play. But with #VE reduction
enabled by the TD guest, no #VE anymore but #GP.
It seems we cannot remove the workaround because that will break the
existing guests who rely on the workaround.
> And userspace (e.g. QEMU) could support emulate MSR_PKG_CST_CONFIG_CONTROL (and
> any other MSRs of that nature) via MSR filters. I doubt the MSR is accessed
> outside of boot paths, so the cost of a userspace exit should be a non-issue.
> Of course, QEMU probably can't provide useful/accurate information.
>
> One option if there is a super strong need to do so would be to add a "disable
> exits" capability to let the guest read package c-state MSRs, but that has
> obvious downsides and would still just be fudging around a flawed driver.
I have thought about this option as well. Unfortunately there are WRITE
case in the driver, and we cannot simply pass through WRITE.
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8dbf19aa66ef..c254aa26ff22 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4120,6 +4120,15 @@ void vmx_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
> vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
> vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
> }
> + if (kvm_package_cstate_in_guest(vcpu->kvm)) {
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_CST_CONFIG_CONTROL, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C2_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C3_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C6_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C8_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C9_RESIDENCY, MSR_TYPE_R);
> + vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C10_RESIDENCY, MSR_TYPE_R);
> + }
> if (kvm_aperfmperf_in_guest(vcpu->kvm)) {
> vmx_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
> vmx_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest
2025-08-19 15:59 ` Edgecombe, Rick P
@ 2025-08-28 10:11 ` Binbin Wu
0 siblings, 0 replies; 13+ messages in thread
From: Binbin Wu @ 2025-08-28 10:11 UTC (permalink / raw)
To: Edgecombe, Rick P, Hunter, Adrian, seanjc@google.com
Cc: Gao, Chao, Huang, Kai, Li, Xiaoyao, Chatre, Reinette,
kirill.shutemov@linux.intel.com, tony.lindgren@linux.intel.com,
pbonzini@redhat.com, kvm@vger.kernel.org, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y, Weiny, Ira
On 8/19/2025 11:59 PM, Edgecombe, Rick P wrote:
> On Tue, 2025-08-19 at 13:40 +0800, Binbin Wu wrote:
>> Currently, KVM TDX code filters out TSX (HLE or RTM) and WAITPKG using
>> tdx_clear_unsupported_cpuid(), which is sort of blacklist.
>>
>> I am wondering if we could add another array, e.g., tdx_cpu_caps[], which is the
>> TDX version of kvm_cpu_caps[].
>>
>> Using tdx_cpu_caps[] is a whitelist way.
> We had something like this in some of the earlier revisions of the TDX CPUID
> configuration.
>
>> For a new feature
>> - If the developer doesn't know anything about TDX, the bit just be added to
>> kvm_cpu_caps[].
>> - If the developer knows that the feature supported by both non-TDX VMs and TDs
>> (either the feature doesn't require any additional virtualization support or
>> the virtualization support is added for TDX), extend the macros to set the bit
>> both in kvm_cpu_caps[] and tdx_cpu_caps[].
>> - If there is a feature not supported by non-TDX VMs, but supported by TDs,
>> extend the macros to set the bit only in tdx_cpu_caps[].
>> So, tdx_cpu_caps[] could be used as the filter of configurable bits reported
>> to userspace.
> In some ways this is the simplest, but having to maintain a big list in KVM was
> not ideal.
Agree.
> The original solution started with KVM_GET_SUPPORTED_CPUID and then
> massaged the results to fit, so maybe just encoding the whole thing separately
> is enough to reconsider it.
>
> But what I was thinking is that we could most of that hardcoded list into the
> TDX module, and only keep a list of non-trivial features (i.e. not simple
> instruction CPUID bits) in KVM. The list of simple features (definition TBD)
> could be provided by the TDX module.
It sounds like a good idea.
Either a list of simple features, or the opposite version is OK.
TDX module already provided the interface to get directly configurable bits.
VMM can get the other part by masking.
But providing a list of non-trivial features may be more direct.
I think non-trivial features should cover both cases:
- a feature clobbers host state
- a feature that requires additional para-virtualization support in VMM. E.g,
the feature related MSR(s) should be virtualized by VMM. Without proper
para-virtualization support in VMM, the guest will experience functionality
issue when using the feature
> So KVM could do the full filtering but only
> keep a list that today would just look like TSX and WAITPKG that we already
> have. So basically the same as what you are proposing, but just shrinks the size
> of list KVM has to keep.
>
>> Comparing to blacklist (i.e., tdx_clear_unsupported_cpuid()), there is no risk
>> that a feature not supported by TDX is forgotten to be added to the blacklist.
>> Also, tdx_cpu_caps[] could support a feature that not supported for non-TDX VMs.
> We definitely can't have TDX module adding any host affecting features that we
> would automatically allow. And having a separate opt-in interface that doesn't
> "speak" cpuid bits is going to just complicate the already complicated logic
> that is in QEMU.
With the list of non-trivial features, VMM can prevent userspace from setting
any bit in the list not supported by VMM.
So can KVM only enforce the consistency for non-trivial feature bits? After all,
these bits are really matters from KVM's view.
If letting userspace, KVM and TDX module have a consistent view of CPUIDs for a
TD is still a target. When a new fixed1 bit is added in a new TDX spec, it still
requires an opt-in interface to allow userspace to get the full picture. Also,
userspace doesn't know which opt-in options are available unless TDX module
provide another interface to report them... yeah, very complicated :(
Ideally, if TDX module never adds new fixed1 bit (including new defined and
converted from other types), or convert a fixed1 bit to fixed0 bit, then
userspace can calculate the right fixed1 bits based on the base spec and the
directly configurable bits without separate opt-in interface.
>
>> Then we don't need a host opt-in for these directly configurable bits not
>> clobbering host states.
>>
>> Of course, to prevent userspace from setting feature bit that would clobber host
>> state, but not included in tdx_cpu_caps[], I think a new feature that would
>> clobber host state should requires a host opt-in to TDX module.
> Yes, but if have some way to get the host clobbering type info programatically
> we could keep the host opt-in as part of the main CPUID bit configuration. What
> I think will be bad is if we grow a separate protocol of opt-ins. KVM and QEMU
> manage everything with CPUID, so it will be easier if we stick to that.
>
Agree.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-08-28 10:11 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-16 14:44 [PATCH RFC 0/2] KVM: TDX: MWAIT in guest Adrian Hunter
2025-08-16 14:44 ` [PATCH RFC 1/2] KVM: TDX: Disable general support for " Adrian Hunter
2025-08-18 14:05 ` Sean Christopherson
2025-08-18 15:07 ` Adrian Hunter
2025-08-18 18:49 ` Edgecombe, Rick P
2025-08-19 5:40 ` Binbin Wu
2025-08-19 15:59 ` Edgecombe, Rick P
2025-08-28 10:11 ` Binbin Wu
2025-08-19 7:38 ` Adrian Hunter
2025-08-19 15:07 ` Edgecombe, Rick P
2025-08-19 23:35 ` Sean Christopherson
2025-08-20 1:31 ` Xiaoyao Li
2025-08-16 14:44 ` [PATCH RFC 2/2] KVM: TDX: Add flag to support MWAIT instruction only Adrian Hunter
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).