* [PATCH 00/16] KVM: TDX: TDX interrupts
@ 2024-12-09 1:07 Binbin Wu
2024-12-09 1:07 ` [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC Binbin Wu
` (17 more replies)
0 siblings, 18 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
Hi,
This patch series introduces support for interrupt handling in TDX
guests, including virtual interrupt injection and VM-Exits caused by
vectored events.
This patch set is one of several patch sets that are all needed to provide
the ability to run a functioning TD VM. It is extracted from the section
"TD vcpu exits/interrupts/hypercalls" of the extensive 130-patch TDX base
enabling series[0]. We would like to get maintainer/reviewer feedback on
the implementation of interrupt handling, especially the open question
about virtual NMI injection.
Base of this series
===================
This series is based off of a kvm-coco-queue commit and some pre-req series:
1. commit ee69eb746754 ("KVM: x86/mmu: Prevent aliased memslot GFNs") (in
kvm-coco-queue).
2. v7 of "TDX host: metadata reading tweaks and bug fixes and info dump" [1]
3. v1 of "KVM: VMX: Initialize TDX when loading KVM module" [2]
4. v2 of “TDX vCPU/VM creation” [3]
5. v2 of "TDX KVM MMU part 2" [4]
6. v1 of "TDX vcpu enter/exit" [5] with a few fixups based on review feedback.
7. v4 of "KVM: x86: Prep KVM hypercall handling for TDX" [6]
8. v1 of "KVM: TDX: TDX hypercalls may exit to userspace" [7]
Opens to discuss
================
- NMI injection: The VMM can't inject back-to-back NMIs into a TDX guest
  when one NMI is already pending in the TDX module. The solution adopted
  in this patch series is: if there is a further pending NMI in KVM,
  collapse it into the one pending in the TDX module.
  Refer to patch "KVM: TDX: Implement methods to inject NMI" for more
  details.
- NMI VM-Exit handling: With the current TDX module, NMI is unblocked after
  SEAMRET from SEAM root mode to VMX root mode due to an NMI VM-Exit. This
  is a TDX module bug; it could cause an NMI handling order issue [8][9],
  which could lead to an unknown-NMI warning message or a nested NMI. There
  is an internal discussion ongoing about changing the TDX module based on
  Sean's suggestion [8].
  This patch series puts the NMI VM-Exit handling in the noinstr section.
  That can't completely prevent the ordering issue from happening, but if
  the code could be instrumented, the probability of the issue would be
  higher. Also, no code change is needed if NMIs end up blocked after TD
  exit to KVM.
Virtual interrupt injection
===========================
Non-NMI Interrupt
-----------------
TDX supports non-NMI interrupt injection only via posted interrupts.
Posted interrupt descriptors (PIDs) are allocated in shared memory, so
KVM can update them directly. To post pending interrupts in the PID, KVM
can generate a self-IPI with the notification vector prior to TD entry.
TDX guest status is protected, so KVM can't get the interrupt status of a
TDX guest. This series assumes interrupts are always allowed. A later
patch set will add support for the TDX guest to call TDVMCALL with HLT,
which passes an interrupt-blocked flag, so that whether an interrupt is
allowed in HLT can be checked against that flag.
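As an illustration (not the kernel code), posting into the shared-memory
PID can be modeled as setting the vector's bit in the 256-bit PIR bitmap
and then setting the outstanding-notification (ON) bit; the struct and
function names below are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Minimal sketch of a posted-interrupt descriptor (PID): a 256-bit
 * posted-interrupt request (PIR) bitmap plus the outstanding-notification
 * (ON) bit. The real descriptor lives in shared memory so KVM can write
 * it directly; names here are hypothetical.
 */
struct pid_model {
	uint64_t pir[4];	/* one bit per vector 0..255 */
	bool on;		/* outstanding notification */
};

/*
 * Post a vector into the PIR. Returns true if a notification (e.g. a
 * self-IPI on the notification vector prior to TD entry) still needs to
 * be sent; false if a previous notification is already outstanding.
 */
static bool pid_post_interrupt(struct pid_model *pid, uint8_t vector)
{
	pid->pir[vector / 64] |= 1ULL << (vector % 64);
	if (pid->on)
		return false;
	pid->on = true;
	return true;
}
```

The second return path mirrors the "if a previous notification has sent
the IPI, nothing to do" optimization in the real delivery code.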
NMI
---
KVM can request the TDX module to inject an NMI into a TDX vCPU by setting
the PEND_NMI TDVPS field to 1. Following that, KVM can call TDH.VP.ENTER to
run the vCPU and the TDX module will attempt to inject the NMI as soon as
possible.
The PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend one NMI
in the TDX module. Also, TDX doesn't allow KVM to request an NMI-window
exit directly. When one NMI is already pending in the TDX module, i.e. it
has not been delivered to the TDX guest yet, and another NMI is pending
in KVM, collapse the pending NMI in KVM into the one pending in the TDX
module. Such collapsing is OK considering that on x86 bare metal,
multiple NMIs can collapse into one, e.g. when NMIs are blocked by an
SMI. It's the OS's responsibility to poll all NMI sources in the NMI
handler to avoid missing NMI events. More details can be found in the
changelog of the patch "KVM: TDX: Implement methods to inject NMI".
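The collapsing rule can be sketched as follows; this is an illustrative
model, not the kernel implementation, and the names are hypothetical:

```c
#include <stdbool.h>

/*
 * Illustrative model of the 1-bit PEND_NMI TDVPS field: the TDX module
 * can track at most one pending NMI per vCPU, so a second injection
 * request collapses into the first, mirroring bare-metal NMI collapsing.
 * Names are hypothetical.
 */
struct tdx_nmi_model {
	bool pend_nmi;	/* mirrors the PEND_NMI TDVPS field */
};

/* Returns the number of NMIs newly pended in the TDX module (0 or 1). */
static int model_inject_nmi(struct tdx_nmi_model *tdx)
{
	if (tdx->pend_nmi)
		return 0;	/* collapse into the already-pending NMI */
	tdx->pend_nmi = true;
	return 1;
}
```

Once the TDX module delivers the pending NMI to the guest, a later
injection request pends a fresh NMI again.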
SMI
---
TDX doesn't support system-management mode (SMM) or system-management
interrupts (SMIs) in guest TDs because the TDX module doesn't provide a
way for the VMM to inject an SMI into a guest TD or to switch a guest
vCPU into SMM. Handle SMI requests as KVM does when CONFIG_KVM_SMM=n,
i.e. return -ENOTTY, and add KVM_BUG_ON() to the SMI-related ops for TDs.
INIT/SIPI event
----------------
TDX defines its own vCPU creation and initialization sequence, involving
multiple SEAMCALLs, which is only allowed during TD build time. Always
block INIT and SIPI events for the TDX guest.
VM-Exits caused by vectored event
=================================
NMI
---
Similar to the VMX case, handle VM-Exit caused by NMIs within
tdx_vcpu_enter_exit(), i.e., handled before leaving the safety of noinstr.
External Interrupt
------------------
Similar to the VMX case, external interrupts are handled in
.handle_exit_irqoff() callback.
Exception
---------
Machine check, which is handled in the .handle_exit_irqoff() callback, is
the only exception type KVM handles for TDX guests. Because TDX guest
state is protected, other exceptions in TDX guests can't be intercepted,
and the TDX VMM isn't supposed to handle them. Exit to userspace with
KVM_EXIT_EXCEPTION if an unexpected exception occurs.
SMI
---
In SEAM root mode (TDX module), all interrupts are blocked. If an SMI
occurs in SEAM non-root mode (TD guest), the SMI causes a VM exit to the
TDX module, followed by SEAMRET to KVM. Once it exits to KVM, the SMI is
delivered and handled by the kernel handler right away.
An SMI can be an "I/O SMI" or an "other SMI". For TDX, there will be no
I/O SMI because I/O instructions inside a TDX guest trigger #VE, and the
TDX guest has to use TDVMCALL to request the VMM to do I/O emulation.
For "other SMI", there are two cases:
- MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, a #MC that occurs
  in a TDX guest will be delivered as an MSMI. It causes an
  EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
  qualification. On VM exit, the TDX module checks whether the "other SMI"
  is caused by an MSMI or not. If so, the TDX module marks the TD as
  fatal, preventing further TD entries, and then completes the TD exit
  flow to KVM with the TDH.VP.ENTER outputs indicating
  TDX_NON_RECOVERABLE_TD. After TD exit, the MSMI is delivered and
  eventually handled by the kernel machine check handler (7911f14
  "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode"),
  i.e., the memory page is marked as poisoned and won't be freed to the
  free list when the TDX guest is terminated. Since the TDX guest is
  dead, follow the other non-recoverable cases and exit to userspace.
- For the non-MSMI case, KVM doesn't need to do anything; just continue
  TDX vCPU execution.
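The two "other SMI" outcomes above can be summarized with a small
decision sketch. This is illustrative only; the enum and function names
are hypothetical, and in the MSMI case KVM keys off the
TDX_NON_RECOVERABLE_TD status reported by TDH.VP.ENTER rather than
reading the exit qualification itself:

```c
#include <stdbool.h>

enum other_smi_action {
	OTHER_SMI_EXIT_TO_USERSPACE,	/* TD is non-recoverable */
	OTHER_SMI_RESUME_VCPU,		/* host already handled the SMI */
};

/*
 * Illustrative dispatch for an EXIT_REASON_OTHER_SMI exit: an MSMI means
 * a #MC was morphed by BIOS eMCA, making the TD non-recoverable; any
 * other SMI was already delivered to and handled by the host kernel
 * after SEAMRET, so the vCPU can simply be resumed.
 */
static enum other_smi_action handle_other_smi(bool td_non_recoverable)
{
	if (td_non_recoverable)
		return OTHER_SMI_EXIT_TO_USERSPACE;
	return OTHER_SMI_RESUME_VCPU;
}
```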
Notable changes since v19
=========================
NMI injection
-------------
If there is a further pending NMI in KVM, collapse it into the one
pending in the TDX module.
Always block INIT/SIPI
----------------------
INIT/SIPI events are always blocked for TDX, so this patch series removes the
unnecessary interfaces and helpers for INIT/SIPI delivery.
Check LVTT vector before waiting for lapic expiration
-----------------------------------------------------
To avoid unnecessary wait calls, only call kvm_wait_lapic_expire() when a
timer IRQ was injected, i.e., POSTED_INTR_ON is set and the vector for
LVTT is set in the PIR.
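The condition can be sketched as: only wait for the lapic timer deadline
when both checks hold. This is an illustrative model with hypothetical
names, not the kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal PID model: 256-bit PIR bitmap plus the ON bit (hypothetical). */
struct pid_state {
	uint64_t pir[4];
	bool on;
};

/*
 * A timer IRQ was posted only if a notification is outstanding (ON set)
 * AND the LVTT vector's bit is set in the PIR; kvm_wait_lapic_expire()
 * would only be called when this returns true.
 */
static bool timer_irq_posted(const struct pid_state *pid, uint8_t lvtt_vec)
{
	if (!pid->on)
		return false;
	return (pid->pir[lvtt_vec / 64] >> (lvtt_vec % 64)) & 1;
}
```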
Repos
=====
The full KVM branch is here:
https://github.com/intel/tdx/tree/tdx_kvm_dev-2024-11-20-exits-userspace
A matching QEMU is here:
https://github.com/intel-staging/qemu-tdx/tree/tdx-qemu-upstream-v6.1
Testing
=======
It has been tested as part of the development branch for the TDX base
series. The testing consisted of TDX kvm-unit-tests, booting a Linux TD,
and TDX-enhanced KVM selftests.
[0] https://lore.kernel.org/kvm/cover.1708933498.git.isaku.yamahata@intel.com
[1] https://lore.kernel.org/kvm/cover.1731318868.git.kai.huang@intel.com
[2] https://lore.kernel.org/kvm/cover.1730120881.git.kai.huang@intel.com
[3] https://lore.kernel.org/kvm/20241030190039.77971-1-rick.p.edgecombe@intel.com
[4] https://lore.kernel.org/kvm/20241112073327.21979-1-yan.y.zhao@intel.com
[5] https://lore.kernel.org/kvm/20241121201448.36170-1-adrian.hunter@intel.com
[6] https://lore.kernel.org/kvm/20241128004344.4072099-1-seanjc@google.com
[7] https://lore.kernel.org/kvm/20241201035358.2193078-1-binbin.wu@linux.intel.com
[8] https://lore.kernel.org/kvm/Z0T_iPdmtpjrc14q@google.com
[9] https://lore.kernel.org/kvm/57aab3bf-4bae-4956-a3b7-d42e810556e3@linux.intel.com
Isaku Yamahata (13):
KVM: VMX: Remove use of struct vcpu_vmx from posted_intr.c
KVM: TDX: Disable PI wakeup for IPIv
KVM: VMX: Move posted interrupt delivery code to common header
KVM: TDX: Implement non-NMI interrupt injection
KVM: TDX: Wait lapic expire when timer IRQ was injected
KVM: TDX: Implement methods to inject NMI
KVM: TDX: Complete interrupts after TD exit
KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM
KVM: TDX: Always block INIT/SIPI
KVM: TDX: Inhibit APICv for TDX guest
KVM: TDX: Add methods to ignore virtual apic related operation
KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
KVM: TDX: Handle EXIT_REASON_OTHER_SMI
Sean Christopherson (3):
KVM: TDX: Add support for find pending IRQ in a protected local APIC
KVM: x86: Assume timer IRQ was injected if APIC state is protected
KVM: VMX: Move NMI/exception handler to common helper
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 13 +-
arch/x86/include/asm/posted_intr.h | 5 +
arch/x86/include/uapi/asm/vmx.h | 1 +
arch/x86/kvm/irq.c | 3 +
arch/x86/kvm/lapic.c | 16 +-
arch/x86/kvm/lapic.h | 2 +
arch/x86/kvm/smm.h | 3 +
arch/x86/kvm/vmx/common.h | 143 +++++++++++++++
arch/x86/kvm/vmx/main.c | 286 ++++++++++++++++++++++++++---
arch/x86/kvm/vmx/posted_intr.c | 51 +++--
arch/x86/kvm/vmx/posted_intr.h | 13 ++
arch/x86/kvm/vmx/tdx.c | 137 +++++++++++++-
arch/x86/kvm/vmx/tdx.h | 16 ++
arch/x86/kvm/vmx/vmx.c | 144 ++-------------
arch/x86/kvm/vmx/vmx.h | 14 +-
arch/x86/kvm/vmx/x86_ops.h | 13 +-
17 files changed, 677 insertions(+), 184 deletions(-)
--
2.46.0
^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2025-01-09 15:38 ` Nikolay Borisov
2024-12-09 1:07 ` [PATCH 02/16] KVM: VMX: Remove use of struct vcpu_vmx from posted_intr.c Binbin Wu
` (16 subsequent siblings)
17 siblings, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Sean Christopherson <seanjc@google.com>
Add a flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if a TDX vCPU has a pending interrupt, KVM must check if
there is an outstanding notification.
Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is
pending, as KVM can't do anything based on _which_ IRQ is pending.
Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.
Note, this doesn't handle interrupts that have been delivered to the vCPU
but not yet recognized by the core, i.e. interrupts that are sitting in
vmcs.GUEST_INTR_STATUS. Querying that state requires a SEAMCALL and will
be supported in a future patch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Dropped vt_protected_apic_has_interrupt() with KVM_BUG_ON(), wire in
tdx_protected_apic_has_interrupt() directly. (Rick)
- Add {} on else in vt_hardware_setup()
---
arch/x86/include/asm/kvm-x86-ops.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/irq.c | 3 +++
arch/x86/kvm/lapic.c | 3 +++
arch/x86/kvm/lapic.h | 2 ++
arch/x86/kvm/vmx/main.c | 3 +++
arch/x86/kvm/vmx/tdx.c | 6 ++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
8 files changed, 21 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index ec1b1b39c6b3..d5faaaee6ac0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -114,6 +114,7 @@ KVM_X86_OP_OPTIONAL(pi_start_assignment)
KVM_X86_OP_OPTIONAL(apicv_pre_state_restore)
KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
+KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt)
KVM_X86_OP_OPTIONAL(set_hv_timer)
KVM_X86_OP_OPTIONAL(cancel_hv_timer)
KVM_X86_OP(setup_mce)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37dc7edef1ca..32c7d58a5d68 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1811,6 +1811,7 @@ struct kvm_x86_ops {
void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+ bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);
int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 63f66c51975a..f0644d0bbe11 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -100,6 +100,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
if (kvm_cpu_has_extint(v))
return 1;
+ if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+ return static_call(kvm_x86_protected_apic_has_interrupt)(v);
+
return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
}
EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 65412640cfc7..684777c2f0a4 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2920,6 +2920,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
if (!kvm_apic_present(vcpu))
return -1;
+ if (apic->guest_apic_protected)
+ return -1;
+
__apic_update_ppr(apic, &ppr);
return apic_has_interrupt_for_ppr(apic, ppr);
}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 1b8ef9856422..82355faf8c0d 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -65,6 +65,8 @@ struct kvm_lapic {
bool sw_enabled;
bool irr_pending;
bool lvt0_in_nmi_mode;
+ /* Select registers in the vAPIC cannot be read/written. */
+ bool guest_apic_protected;
/* Number of bits set in ISR. */
s16 isr_count;
/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4f6faeb6e8e5..a964093b5c03 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -58,6 +58,8 @@ static __init int vt_hardware_setup(void)
vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+ } else {
+ vt_x86_ops.protected_apic_has_interrupt = NULL;
}
return 0;
@@ -356,6 +358,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.sync_pir_to_irr = vmx_sync_pir_to_irr,
.deliver_interrupt = vmx_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+ .protected_apic_has_interrupt = tdx_protected_apic_has_interrupt,
.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d7cdb44be5cf..877cf9e1fd65 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -722,6 +722,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
return -EINVAL;
fpstate_set_confidential(&vcpu->arch.guest_fpu);
+ vcpu->arch.apic->guest_apic_protected = true;
vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
@@ -764,6 +765,11 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
local_irq_enable();
}
+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+ return pi_has_pending_interrupt(vcpu);
+}
+
/*
* Compared to vmx_prepare_switch_to_guest(), there is not much to do
* as SEAMCALL/SEAMRET calls take care of most of save and restore.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 1c18943e0e1d..a3a5b25976f0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -133,6 +133,7 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
@@ -171,6 +172,7 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
}
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath) { return 0; }
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 02/16] KVM: VMX: Remove use of struct vcpu_vmx from posted_intr.c
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
2024-12-09 1:07 ` [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 03/16] KVM: TDX: Disable PI wakeup for IPIv Binbin Wu
` (15 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Unify the layout for vcpu_vmx and vcpu_tdx at the top of the structure
with the members needed by posted_intr.c, and introduce a struct vcpu_pi
to hold the common part. Use vcpu_pi instead of vcpu_vmx to enable TDX
compatibility with posted_intr.c.
To minimize the diff size, code conversion such as changing vmx->pi_desc
to vmx->common->pi_desc is avoided. Instead, compile-time checks are
added to ensure the layout is as expected.
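The compile-time check pattern used here can be illustrated with toy
structs (these are not the kernel types): offsetof() static assertions
guarantee that two structures sharing a common prefix really do line up,
so casting between them is safe.

```c
#include <assert.h>
#include <stddef.h>

/* Toy structs illustrating the common-prefix layout check. */
struct common_prefix {
	int id;
	long cookie;
};

struct wider_type {
	int id;
	long cookie;
	/* Until here, same layout as struct common_prefix. */
	char extra[16];
};

/* Compile-time guarantees that the prefix members line up. */
static_assert(offsetof(struct common_prefix, id) ==
	      offsetof(struct wider_type, id), "id offset mismatch");
static_assert(offsetof(struct common_prefix, cookie) ==
	      offsetof(struct wider_type, cookie), "cookie offset mismatch");
```

If a member is later reordered in only one of the structures, the build
fails instead of silently corrupting accesses through the cast pointer.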
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Renamed from "KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c"
to "KVM: VMX: Remove use of struct vcpu_vmx from posted_intr.c".
- Use vcpu_pi instead vcpu_vmx in pi_wakeup_handler(). (Binbin)
- Update change log.
- Remove functional change from the code refactor patch, move into its
own separate patch. (Chao)
---
arch/x86/kvm/vmx/posted_intr.c | 43 +++++++++++++++++++++++++---------
arch/x86/kvm/vmx/posted_intr.h | 11 +++++++++
arch/x86/kvm/vmx/tdx.c | 1 +
arch/x86/kvm/vmx/tdx.h | 11 +++++++++
arch/x86/kvm/vmx/vmx.h | 14 ++++++-----
5 files changed, 63 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index ec08fa3caf43..8951c773a475 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -11,6 +11,7 @@
#include "posted_intr.h"
#include "trace.h"
#include "vmx.h"
+#include "tdx.h"
/*
* Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
@@ -31,9 +32,29 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
*/
static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
+/*
+ * The layout of the head of struct vcpu_vmx and struct vcpu_tdx must match with
+ * struct vcpu_pi.
+ */
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+ offsetof(struct vcpu_vmx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+ offsetof(struct vcpu_vmx, pi_wakeup_list));
+#ifdef CONFIG_INTEL_TDX_HOST
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+ offsetof(struct vcpu_tdx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+ offsetof(struct vcpu_tdx, pi_wakeup_list));
+#endif
+
+static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
+{
+ return (struct vcpu_pi *)vcpu;
+}
+
static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
- return &(to_vmx(vcpu)->pi_desc);
+ return &vcpu_to_pi(vcpu)->pi_desc;
}
static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
@@ -52,8 +73,8 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+ struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
struct pi_desc old, new;
unsigned long flags;
unsigned int dest;
@@ -90,7 +111,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
*/
if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- list_del(&vmx->pi_wakeup_list);
+ list_del(&vcpu_pi->pi_wakeup_list);
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
}
@@ -145,15 +166,15 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
*/
static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
{
- struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
- struct vcpu_vmx *vmx = to_vmx(vcpu);
+ struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+ struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
struct pi_desc old, new;
unsigned long flags;
local_irq_save(flags);
raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- list_add_tail(&vmx->pi_wakeup_list,
+ list_add_tail(&vcpu_pi->pi_wakeup_list,
&per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
@@ -220,13 +241,13 @@ void pi_wakeup_handler(void)
int cpu = smp_processor_id();
struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
- struct vcpu_vmx *vmx;
+ struct vcpu_pi *pi;
raw_spin_lock(spinlock);
- list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
+ list_for_each_entry(pi, wakeup_list, pi_wakeup_list) {
- if (pi_test_on(&vmx->pi_desc))
- kvm_vcpu_wake_up(&vmx->vcpu);
+ if (pi_test_on(&pi->pi_desc))
+ kvm_vcpu_wake_up(&pi->vcpu);
}
raw_spin_unlock(spinlock);
}
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 1715d2ab07be..e579e8c3ad2e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -5,6 +5,17 @@
#include <linux/find.h>
#include <asm/posted_intr.h>
+struct vcpu_pi {
+ struct kvm_vcpu vcpu;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here common layout between vcpu_vmx and vcpu_tdx. */
+};
+
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 877cf9e1fd65..478da0ebd225 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -723,6 +723,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
fpstate_set_confidential(&vcpu->arch.guest_fpu);
vcpu->arch.apic->guest_apic_protected = true;
+ INIT_LIST_HEAD(&tdx->pi_wakeup_list);
vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index a7dd2b0b5dd7..d6a0fc20ecaa 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,6 +17,10 @@ enum kvm_tdx_state {
TD_STATE_RUNNABLE,
};
+
+#include "irq.h"
+#include "posted_intr.h"
+
struct kvm_tdx {
struct kvm kvm;
@@ -52,6 +56,13 @@ enum tdx_prepare_switch_state {
struct vcpu_tdx {
struct kvm_vcpu vcpu;
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here same layout to struct vcpu_pi. */
+
unsigned long tdvpr_pa;
unsigned long *tdcx_pa;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 37a555c6dfbf..e7f570ca3c19 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -230,6 +230,14 @@ struct nested_vmx {
struct vcpu_vmx {
struct kvm_vcpu vcpu;
+
+ /* Posted interrupt descriptor */
+ struct pi_desc pi_desc;
+
+ /* Used if this vCPU is waiting for PI notification wakeup. */
+ struct list_head pi_wakeup_list;
+ /* Until here same layout to struct vcpu_pi. */
+
u8 fail;
u8 x2apic_msr_bitmap_mode;
@@ -299,12 +307,6 @@ struct vcpu_vmx {
union vmx_exit_reason exit_reason;
- /* Posted interrupt descriptor */
- struct pi_desc pi_desc;
-
- /* Used if this vCPU is waiting for PI notification wakeup. */
- struct list_head pi_wakeup_list;
-
/* Support for a guest hypervisor (nested VMX) */
struct nested_vmx nested;
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 03/16] KVM: TDX: Disable PI wakeup for IPIv
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
2024-12-09 1:07 ` [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC Binbin Wu
2024-12-09 1:07 ` [PATCH 02/16] KVM: VMX: Remove use of struct vcpu_vmx from posted_intr.c Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 04/16] KVM: VMX: Move posted interrupt delivery code to common header Binbin Wu
` (14 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Disable PI wakeup for the IPI virtualization (IPIv) case for TDX.
When a vCPU is being scheduled out, the notification vector is switched
and pi_wakeup_handler() is enabled when the vCPU has interrupts enabled
and a posted interrupt is used to wake up the vCPU.
For VMX, a blocked vCPU can be the target of posted interrupts when using
IPIv or VT-d PI. TDX doesn't support IPIv, so disable PI wakeup for the
IPIv case.
Also, since the guest status of a TD vCPU is protected, assume interrupts
are always enabled for TDs. (The PV HLT hypercall is not supported yet;
with it, the TDX guest will tell the VMM whether HLT was called with
interrupts disabled or not.)
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- This is split out as a new patch from patch
"KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c"
---
arch/x86/kvm/vmx/posted_intr.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 8951c773a475..651ef663845c 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -211,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
* notification vector is switched to the one that calls
* back to the pi_wakeup_handler() function.
*/
- return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
+ return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
+ vmx_can_use_vtd_pi(vcpu->kvm);
}
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
@@ -221,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
if (!vmx_needs_pi_wakeup(vcpu))
return;
- if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+ if (kvm_vcpu_is_blocking(vcpu) &&
+ (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
pi_enable_wakeup_handler(vcpu);
/*
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 04/16] KVM: VMX: Move posted interrupt delivery code to common header
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (2 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 03/16] KVM: TDX: Disable PI wakeup for IPIv Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 05/16] KVM: TDX: Implement non-NMI interrupt injection Binbin Wu
` (13 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Move posted interrupt delivery code to common header so that TDX can
leverage it.
No functional change intended.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- This is split out from patch "KVM: TDX: Implement interrupt injection"
---
arch/x86/kvm/vmx/common.h | 71 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 59 +-------------------------------
2 files changed, 72 insertions(+), 58 deletions(-)
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 7a592467a044..a46f15ddeda1 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,6 +4,7 @@
#include <linux/kvm_host.h>
+#include "posted_intr.h"
#include "mmu.h"
static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
@@ -40,4 +41,74 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}
+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+ int pi_vec)
+{
+#ifdef CONFIG_SMP
+ if (vcpu->mode == IN_GUEST_MODE) {
+ /*
+ * The vector of the virtual has already been set in the PIR.
+ * Send a notification event to deliver the virtual interrupt
+ * unless the vCPU is the currently running vCPU, i.e. the
+ * event is being sent from a fastpath VM-Exit handler, in
+ * which case the PIR will be synced to the vIRR before
+ * re-entering the guest.
+ *
+ * When the target is not the running vCPU, the following
+ * possibilities emerge:
+ *
+ * Case 1: vCPU stays in non-root mode. Sending a notification
+ * event posts the interrupt to the vCPU.
+ *
+ * Case 2: vCPU exits to root mode and is still runnable. The
+ * PIR will be synced to the vIRR before re-entering the guest.
+ * Sending a notification event is ok as the host IRQ handler
+ * will ignore the spurious event.
+ *
+ * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
+ * has already synced PIR to vIRR and never blocks the vCPU if
+ * the vIRR is not empty. Therefore, a blocked vCPU here does
+ * not wait for any requested interrupts in PIR, and sending a
+ * notification event also results in a benign, spurious event.
+ */
+
+ if (vcpu != kvm_get_running_vcpu())
+ __apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
+ return;
+ }
+#endif
+ /*
+ * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+ * otherwise do nothing as KVM will grab the highest priority pending
+ * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+ */
+ kvm_vcpu_wake_up(vcpu);
+}
+
+/*
+ * Send an interrupt to the vCPU via posted interrupt.
+ * 1. If the target vCPU is running (non-root mode), send a posted interrupt
+ * notification and the hardware will sync the PIR to the vIRR atomically.
+ * 2. If the target vCPU isn't running (root mode), kick it to pick up the
+ * interrupt from the PIR on the next VM entry.
+ */
+static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
+ struct pi_desc *pi_desc, int vector)
+{
+ if (pi_test_and_set_pir(vector, pi_desc))
+ return;
+
+ /* If a previous notification has sent the IPI, nothing to do. */
+ if (pi_test_and_set_on(pi_desc))
+ return;
+
+ /*
+ * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+ * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+ * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+ * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+ */
+ kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+}
+
#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f7ae2359cea2..176fd5da3a3c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4167,50 +4167,6 @@ void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
pt_update_intercept_for_msr(vcpu);
}
-static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
- int pi_vec)
-{
-#ifdef CONFIG_SMP
- if (vcpu->mode == IN_GUEST_MODE) {
- /*
- * The vector of the virtual has already been set in the PIR.
- * Send a notification event to deliver the virtual interrupt
- * unless the vCPU is the currently running vCPU, i.e. the
- * event is being sent from a fastpath VM-Exit handler, in
- * which case the PIR will be synced to the vIRR before
- * re-entering the guest.
- *
- * When the target is not the running vCPU, the following
- * possibilities emerge:
- *
- * Case 1: vCPU stays in non-root mode. Sending a notification
- * event posts the interrupt to the vCPU.
- *
- * Case 2: vCPU exits to root mode and is still runnable. The
- * PIR will be synced to the vIRR before re-entering the guest.
- * Sending a notification event is ok as the host IRQ handler
- * will ignore the spurious event.
- *
- * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
- * has already synced PIR to vIRR and never blocks the vCPU if
- * the vIRR is not empty. Therefore, a blocked vCPU here does
- * not wait for any requested interrupts in PIR, and sending a
- * notification event also results in a benign, spurious event.
- */
-
- if (vcpu != kvm_get_running_vcpu())
- __apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
- return;
- }
-#endif
- /*
- * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
- * otherwise do nothing as KVM will grab the highest priority pending
- * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
- */
- kvm_vcpu_wake_up(vcpu);
-}
-
static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
int vector)
{
@@ -4270,20 +4226,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
if (!vcpu->arch.apic->apicv_active)
return -1;
- if (pi_test_and_set_pir(vector, &vmx->pi_desc))
- return 0;
-
- /* If a previous notification has sent the IPI, nothing to do. */
- if (pi_test_and_set_on(&vmx->pi_desc))
- return 0;
-
- /*
- * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
- * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
- * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
- * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
- */
- kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+ __vmx_deliver_posted_interrupt(vcpu, &vmx->pi_desc, vector);
return 0;
}
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 05/16] KVM: TDX: Implement non-NMI interrupt injection
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (3 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 04/16] KVM: VMX: Move posted interrupt delivery code to common header Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 06/16] KVM: x86: Assume timer IRQ was injected if APIC state is protected Binbin Wu
` (12 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Implement non-NMI interrupt injection for TDX via posted interrupts.
Because CPU state is protected and APICv is enabled for TDX guests, TDX
supports non-NMI interrupt injection only via posted interrupts.
Posted-interrupt descriptors (PIDs) are allocated in shared memory, so
KVM can update them directly. If the target vCPU is in non-root mode,
send a posted-interrupt notification to the vCPU and the hardware will
sync the PIR to the vIRR atomically. Otherwise, kick the vCPU so it
picks up the interrupt from the PID on the next VM entry. To post
interrupts already pending in the PID, KVM generates a self-IPI with the
notification vector prior to TD entry.
Since the guest state of a TD vCPU is protected, assume interrupts are
always allowed. Bypass the code paths for the event injection mechanism
and LAPIC emulation for TDX.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX interrupts breakout:
- Renamed from "KVM: TDX: Implement interrupt injection"
to "KVM: TDX: Implement non-NMI interrupt injection"
- Rewrite changelog.
- Add a blank line. (Binbin)
- Split posted interrupt delivery code movement to a separate patch.
- Split kvm_wait_lapic_expire() out to a separate patch. (Chao)
- Use __pi_set_sn() to resolve upstream conflicts.
- Use kvm_x86_call()
---
arch/x86/kvm/vmx/main.c | 94 ++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/posted_intr.c | 2 +-
arch/x86/kvm/vmx/posted_intr.h | 2 +
arch/x86/kvm/vmx/tdx.c | 22 +++++++-
arch/x86/kvm/vmx/vmx.c | 8 ---
arch/x86/kvm/vmx/x86_ops.h | 7 ++-
6 files changed, 115 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a964093b5c03..d59a6a783ead 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -176,6 +176,34 @@ static int vt_handle_exit(struct kvm_vcpu *vcpu,
return vmx_handle_exit(vcpu, fastpath);
}
+static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
+{
+ struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
+
+ pi_clear_on(pi);
+ memset(pi->pir, 0, sizeof(pi->pir));
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return -1;
+
+ return vmx_sync_pir_to_irr(vcpu);
+}
+
+static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
+{
+ if (is_td_vcpu(apic->vcpu)) {
+ tdx_deliver_interrupt(apic, delivery_mode, trig_mode,
+ vector);
+ return;
+ }
+
+ vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
+}
+
static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu)) {
@@ -223,6 +251,54 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
}
+static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_interrupt_shadow(vcpu, mask);
+}
+
+static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return 0;
+
+ return vmx_get_interrupt_shadow(vcpu);
+}
+
+static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_inject_irq(vcpu, reinjected);
+}
+
+static void vt_cancel_injection(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_cancel_injection(vcpu);
+}
+
+static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_interrupt_allowed(vcpu, for_injection);
+}
+
+static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_enable_irq_window(vcpu);
+}
+
static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
{
@@ -331,19 +407,19 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.handle_exit = vt_handle_exit,
.skip_emulated_instruction = vmx_skip_emulated_instruction,
.update_emulated_instruction = vmx_update_emulated_instruction,
- .set_interrupt_shadow = vmx_set_interrupt_shadow,
- .get_interrupt_shadow = vmx_get_interrupt_shadow,
+ .set_interrupt_shadow = vt_set_interrupt_shadow,
+ .get_interrupt_shadow = vt_get_interrupt_shadow,
.patch_hypercall = vmx_patch_hypercall,
- .inject_irq = vmx_inject_irq,
+ .inject_irq = vt_inject_irq,
.inject_nmi = vmx_inject_nmi,
.inject_exception = vmx_inject_exception,
- .cancel_injection = vmx_cancel_injection,
- .interrupt_allowed = vmx_interrupt_allowed,
+ .cancel_injection = vt_cancel_injection,
+ .interrupt_allowed = vt_interrupt_allowed,
.nmi_allowed = vmx_nmi_allowed,
.get_nmi_mask = vmx_get_nmi_mask,
.set_nmi_mask = vmx_set_nmi_mask,
.enable_nmi_window = vmx_enable_nmi_window,
- .enable_irq_window = vmx_enable_irq_window,
+ .enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
.x2apic_icr_is_split = false,
@@ -351,12 +427,12 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
- .apicv_pre_state_restore = vmx_apicv_pre_state_restore,
+ .apicv_pre_state_restore = vt_apicv_pre_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
.hwapic_irr_update = vmx_hwapic_irr_update,
.hwapic_isr_update = vmx_hwapic_isr_update,
- .sync_pir_to_irr = vmx_sync_pir_to_irr,
- .deliver_interrupt = vmx_deliver_interrupt,
+ .sync_pir_to_irr = vt_sync_pir_to_irr,
+ .deliver_interrupt = vt_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt,
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 651ef663845c..87a6964c662a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -52,7 +52,7 @@ static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
return (struct vcpu_pi *)vcpu;
}
-static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
return &vcpu_to_pi(vcpu)->pi_desc;
}
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index e579e8c3ad2e..8b1dccfe4885 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -16,6 +16,8 @@ struct vcpu_pi {
/* Until here common layout between vcpu_vmx and vcpu_tdx. */
};
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu);
+
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 478da0ebd225..0896e0e825ed 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -740,6 +740,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
tdx->prep_switch_state = TDX_PREP_SW_STATE_UNSAVED;
+ tdx->pi_desc.nv = POSTED_INTR_VECTOR;
+ __pi_set_sn(&tdx->pi_desc);
+
tdx->state = VCPU_TD_STATE_UNINITIALIZED;
return 0;
@@ -749,6 +752,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
+ vmx_vcpu_pi_load(vcpu, cpu);
if (vcpu->cpu == cpu)
return;
@@ -970,6 +974,9 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
trace_kvm_entry(vcpu, force_immediate_exit);
+ if (pi_test_on(&tdx->pi_desc))
+ apic->send_IPI_self(POSTED_INTR_VECTOR);
+
tdx_vcpu_enter_exit(vcpu);
if (tdx->host_debugctlmsr & ~TDX_DEBUGCTL_PRESERVED)
@@ -1672,6 +1679,16 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
}
+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector)
+{
+ struct kvm_vcpu *vcpu = apic->vcpu;
+ struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+ /* TDX supports only posted interrupt. No lapic emulation. */
+ __vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
+}
+
int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
{
struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -2598,8 +2615,11 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
if (ret)
return ret;
- tdx->state = VCPU_TD_STATE_INITIALIZED;
+ td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
+ td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
+ td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
+ tdx->state = VCPU_TD_STATE_INITIALIZED;
return 0;
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 176fd5da3a3c..75432e1c9f7f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6884,14 +6884,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
}
-void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
-
- pi_clear_on(&vmx->pi_desc);
- memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
-}
-
void vmx_do_interrupt_irqoff(unsigned long entry);
void vmx_do_nmi_irqoff(void);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index a3a5b25976f0..1ddb7600f162 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -46,7 +46,6 @@ int vmx_check_intercept(struct kvm_vcpu *vcpu,
bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
void vmx_migrate_timers(struct kvm_vcpu *vcpu);
void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
-void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu);
void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
void vmx_hwapic_isr_update(int max_isr);
int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
@@ -136,6 +135,9 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu);
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath);
+
+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
@@ -175,6 +177,9 @@ static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath) { return 0; }
+
+static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+ int trig_mode, int vector) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
--
2.46.0
* [PATCH 06/16] KVM: x86: Assume timer IRQ was injected if APIC state is protected
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (4 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 05/16] KVM: TDX: Implement non-NMI interrupt injection Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 07/16] KVM: TDX: Wait lapic expire when timer IRQ was injected Binbin Wu
` (11 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Sean Christopherson <seanjc@google.com>
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.
Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Renamed from "KVM: x86: Assume timer IRQ was injected if APIC state is proteced"
to "KVM: x86: Assume timer IRQ was injected if APIC state is protected", i.e.,
fix the typo 'proteced'.
---
arch/x86/kvm/lapic.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 684777c2f0a4..474e0a7c1069 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1789,8 +1789,17 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+ u32 reg;
+ /*
+ * Assume a timer IRQ was "injected" if the APIC is protected. KVM's
+ * copy of the vIRR is bogus, it's the responsibility of the caller to
+ * precisely check whether or not a timer IRQ is pending.
+ */
+ if (apic->guest_apic_protected)
+ return true;
+
+ reg = kvm_lapic_get_reg(apic, APIC_LVTT);
if (kvm_apic_hw_enabled(apic)) {
int vec = reg & APIC_VECTOR_MASK;
void *bitmap = apic->regs + APIC_ISR;
--
2.46.0
* [PATCH 07/16] KVM: TDX: Wait lapic expire when timer IRQ was injected
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (5 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 06/16] KVM: x86: Assume timer IRQ was injected if APIC state is protected Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 08/16] KVM: TDX: Implement methods to inject NMI Binbin Wu
` (10 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Call kvm_wait_lapic_expire() before TD entry when POSTED_INTR_ON is set
and the LVTT vector is set in the PIR.
KVM always assumes a timer IRQ was injected if APIC state is protected.
For TDX guests, APIC state is protected and KVM injects the timer IRQ via
posted interrupt. To avoid unnecessary wait calls, only call
kvm_wait_lapic_expire() when a timer IRQ was actually injected, i.e., when
POSTED_INTR_ON is set and the LVTT vector is set in the PIR.
Add a helper to test a vector in the PIR.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Split out from patch "KVM: TDX: Implement interrupt injection". (Chao)
- Check PIR against LVTT vector.
---
arch/x86/include/asm/posted_intr.h | 5 +++++
arch/x86/kvm/vmx/tdx.c | 7 ++++++-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index de788b400fba..bb107ebbe713 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -81,6 +81,11 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
}
+static inline bool pi_test_pir(int vector, struct pi_desc *pi_desc)
+{
+ return test_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
/* Non-atomic helpers */
static inline void __pi_set_sn(struct pi_desc *pi_desc)
{
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0896e0e825ed..dcbe25695d85 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -974,9 +974,14 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
trace_kvm_entry(vcpu, force_immediate_exit);
- if (pi_test_on(&tdx->pi_desc))
+ if (pi_test_on(&tdx->pi_desc)) {
apic->send_IPI_self(POSTED_INTR_VECTOR);
+ if (pi_test_pir(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTT) &
+ APIC_VECTOR_MASK, &tdx->pi_desc))
+ kvm_wait_lapic_expire(vcpu);
+ }
+
tdx_vcpu_enter_exit(vcpu);
if (tdx->host_debugctlmsr & ~TDX_DEBUGCTL_PRESERVED)
--
2.46.0
* [PATCH 08/16] KVM: TDX: Implement methods to inject NMI
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (6 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 07/16] KVM: TDX: Wait lapic expire when timer IRQ was injected Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 09/16] KVM: TDX: Complete interrupts after TD exit Binbin Wu
` (9 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make
the NMI pending in the TDX module. If there is a further pending NMI in KVM,
collapse it to the one pending in the TDX module.
The VMM can request the TDX module to inject an NMI into a TDX vCPU by
setting the PEND_NMI TDVPS field to 1. Following that, the VMM can call
TDH.VP.ENTER to run the vCPU, and the TDX module will attempt to inject
the NMI as soon as possible.
When handling simultaneous NMIs, KVM has the following 3 cases in which
two NMIs need to be injected back-to-back; otherwise, the OS kernel may
fire a warning about an unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI pending in KVM.
KVM requests NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
KVM injects the first NMI and requests NMI window exit to inject the
second NMI.
K3. A previous NMI needs to be rejected and one NMI pending in KVM.
KVM first requests force immediate exit followed by a VM entry to complete
the NMI rejection. Then, during the force immediate exit, KVM requests
NMI window exit to inject the pending NMI.
For TDX, the PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend
one NMI in the TDX module. Also, because the vCPU state is protected, KVM
doesn't know the NMI-blocking state of a TDX vCPU, so it has to assume NMIs
are always unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD
exit, it means the previous NMI needs to be re-injected.
Based on KVM's NMI handling flow, there are following 6 cases:
In NMI handler TDX module KVM
T1. No PEND_NMI=0 1 pending NMI
T2. No PEND_NMI=0 2 pending NMIs
T3. No PEND_NMI=1 1 pending NMI
T4. Yes PEND_NMI=0 1 pending NMI
T5. Yes PEND_NMI=0 2 pending NMIs
T6. Yes PEND_NMI=1 1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: because KVM doesn't know whether an NMI is blocked by another NMI,
cases T5 and T6 can happen.
When handling a pending NMI in KVM for a TDX guest, all KVM can do is set
a pending NMI in the TDX module when PEND_NMI is 0; this handles cases T1
and T4. However, TDX doesn't allow KVM to request an NMI-window exit
directly. If PEND_NMI is already set and there is still an NMI pending in
KVM, the only thing KVM could try is to request a forced immediate exit.
But for cases T5 and T6, a forced immediate exit results in an infinite
loop: it prevents the guest NMI handler from making progress, so the NMI
pending in the TDX module can never be injected.
Consider that on x86 bare metal, multiple NMIs can collapse into one, e.g.
when NMIs are blocked by an SMI. It is the OS's responsibility to poll all
NMI sources in the NMI handler to avoid missing NMI events.
Based on that, among the above 3 cases (K1-K3), only case K1 must inject
the second NMI: the guest NMI handler may have already polled some of the
NMI sources, which could include the source of the pending NMI, so the
pending NMI must be injected to avoid losing it. For cases K2 and K3, the
guest OS will poll all NMI sources (including those behind the second NMI
and any further collapsed NMIs) when handling the first NMI, so KVM has no
need to inject the second NMI.
To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify the KVM's NMI handling common code, to collapse the second
pending NMI for K2 and K3.
- Option 2: Do it in TDX specific way. When the previous NMI is still pending
in the TDX module, i.e. it has not been delivered to TDX guest yet, collapse
the pending NMI in KVM into the previous one.
This patch goes with option 2 because it is simple and doesn't impact
other VM types; option 1 would need more discussion.
This is the first case that needs to access vCPU-scope metadata in the
"management" class; make the needed accessors available.
[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX interrupts breakout:
- Collapse the pending NMI in KVM if there is already one pending in the
TDX module.
---
arch/x86/kvm/vmx/main.c | 61 ++++++++++++++++++++++++++++++++++----
arch/x86/kvm/vmx/tdx.c | 16 ++++++++++
arch/x86/kvm/vmx/tdx.h | 5 ++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
4 files changed, 79 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d59a6a783ead..4b42d14cc62e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -240,6 +240,57 @@ static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
vmx_flush_tlb_guest(vcpu);
}
+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_inject_nmi(vcpu);
+ return;
+ }
+
+ vmx_inject_nmi(vcpu);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ /*
+ * The TDX module manages NMI windows and NMI reinjection, and hides NMI
+ * blocking, all KVM can do is throw an NMI over the wall.
+ */
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+ /*
+ * KVM can't get NMI blocking status for TDX guest, assume NMIs are
+ * always unmasked.
+ */
+ if (is_td_vcpu(vcpu))
+ return false;
+
+ return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+ /* Refer to the comments in tdx_inject_nmi(). */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_enable_nmi_window(vcpu);
+}
+
static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
int pgd_level)
{
@@ -411,14 +462,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.get_interrupt_shadow = vt_get_interrupt_shadow,
.patch_hypercall = vmx_patch_hypercall,
.inject_irq = vt_inject_irq,
- .inject_nmi = vmx_inject_nmi,
+ .inject_nmi = vt_inject_nmi,
.inject_exception = vmx_inject_exception,
.cancel_injection = vt_cancel_injection,
.interrupt_allowed = vt_interrupt_allowed,
- .nmi_allowed = vmx_nmi_allowed,
- .get_nmi_mask = vmx_get_nmi_mask,
- .set_nmi_mask = vmx_set_nmi_mask,
- .enable_nmi_window = vmx_enable_nmi_window,
+ .nmi_allowed = vt_nmi_allowed,
+ .get_nmi_mask = vt_get_nmi_mask,
+ .set_nmi_mask = vt_set_nmi_mask,
+ .enable_nmi_window = vt_enable_nmi_window,
.enable_irq_window = vt_enable_irq_window,
.update_cr8_intercept = vmx_update_cr8_intercept,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dcbe25695d85..65fe7ba8a6c6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1007,6 +1007,22 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
return tdx_exit_handlers_fastpath(vcpu);
}
+void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.nmi_injections;
+ td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+ /*
+ * TDX doesn't allow KVM to request an NMI-window exit. If there is
+ * still a pending vNMI, KVM is not able to inject it along with the
+ * one pending in the TDX module in a back-to-back way. Since the
+ * previous vNMI is still pending in the TDX module, i.e. it has not
+ * been delivered to the TDX guest yet, it's OK to collapse the pending
+ * vNMI into the previous one. The guest is expected to handle all the
+ * NMI sources when handling the first vNMI.
+ */
+ vcpu->arch.nmi_pending = 0;
+}
+
static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
{
vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index d6a0fc20ecaa..b553dd9b0b06 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -151,6 +151,8 @@ static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
"Invalid TD VMCS access for 16-bit field");
}
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass) \
static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx, \
u32 field) \
@@ -200,6 +202,9 @@ static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx, \
TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
#else
static inline void tdx_bringup(void) {}
static inline void tdx_cleanup(void) {}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 1ddb7600f162..1d2bf6972ea9 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -138,6 +138,7 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu,
void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector);
+void tdx_inject_nmi(struct kvm_vcpu *vcpu);
void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
@@ -180,6 +181,7 @@ static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
int trig_mode, int vector) {}
+static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
u64 *info2, u32 *intr_info, u32 *error_code) {}
--
2.46.0
* [PATCH 09/16] KVM: TDX: Complete interrupts after TD exit
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (7 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 08/16] KVM: TDX: Implement methods to inject NMI Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 10/16] KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM Binbin Wu
` (8 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Complete NMI injection by updating the NMI injection status for TDX.
Because TDX virtualizes the vAPIC and non-NMI interrupts are delivered
via the posted-interrupt mechanism, KVM only needs to care about NMI
injection.
For VMX, KVM injects an NMI by setting VM_ENTRY_INTR_INFO_FIELD via the
vector-event injection mechanism. For TDX, KVM requests the TDX module
to inject an NMI into a guest TD vCPU when the vCPU is not active, by
setting the PEND_NMI field within the TDX vCPU-scope metadata (Trust
Domain Virtual Processor State, TDVPS). The TDX module will then attempt
to inject the NMI as soon as possible on TD entry. KVM can read PEND_NMI
back to get the status of the injection: a value of 0 indicates the NMI
has been injected into the guest TD vCPU.
Update KVM's NMI status on TD exit by checking whether the requested NMI
has been injected into the TD. Reading the PEND_NMI field via SEAMCALL
is expensive, so only perform the check if an NMI injection was
requested. If the value read back is 0, the NMI has been injected, so
update KVM's NMI status. If the value read back is 1, no action is
needed since PEND_NMI is still set.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX interrupts breakout:
- Shortlog "tdexit" -> "TD exit" (Reinette)
- Update the changelog as suggested by Reinette, with a small
supplement.
https://lore.kernel.org/lkml/fe9cec78-36ee-4a20-81df-ec837a45f69f@linux.intel.com/
- Fix comment, "nmi" -> "NMI" and add a missing period. (Reinette)
- Add a comment to explain why no need to request KVM_REQ_EVENT.
v19:
- move tdvps_management_check() to this patch
- typo: complete -> Complete in short log
---
arch/x86/kvm/vmx/tdx.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 65fe7ba8a6c6..b0f525069ebd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -853,6 +853,21 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu)
return 1;
}
+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+ /* Avoid costly SEAMCALL if no NMI was injected. */
+ if (vcpu->arch.nmi_injected) {
+ /*
+ * No need to request KVM_REQ_EVENT because PEND_NMI is still
+ * set if NMI re-injection needed. No other event types need
+ * to be handled because TDX doesn't support injection of
+ * exception, SMI or interrupt (via event injection).
+ */
+ vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+ TD_VCPU_PEND_NMI);
+ }
+}
+
struct tdx_uret_msr {
u32 msr;
unsigned int slot;
@@ -1004,6 +1019,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
if (unlikely(tdx_has_exit_reason(vcpu) && tdexit_exit_reason(vcpu).failed_vmentry))
return EXIT_FASTPATH_NONE;
+ tdx_complete_interrupts(vcpu);
+
return tdx_exit_handlers_fastpath(vcpu);
}
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 10/16] KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (8 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 09/16] KVM: TDX: Complete interrupts after TD exit Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 11/16] KVM: TDX: Always block INIT/SIPI Binbin Wu
` (7 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle SMI requests the same way KVM does when CONFIG_KVM_SMM=n, i.e.
return -ENOTTY, and add KVM_BUG_ON() to the SMI-related ops for TDs.
TDX doesn't support system-management mode (SMM) or system-management
interrupts (SMIs) in guest TDs. Because guest state (vCPU state and
memory state) is protected, any change to it must go through the TDX
module APIs. However, the TDX module provides no way for the VMM to
inject an SMI into a guest TD, nor a way for the VMM to switch a guest
vCPU into SMM.
MSR_IA32_SMBASE is not emulated for TDX guests, so -ENOTTY is returned
when an SMI is requested.
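The gating added to kvm_inject_smi() above can be modeled in a few
lines; toy_inject_smi() and its smbase_emulated parameter are
illustrative stand-ins for the real kvm_x86_call(has_emulated_msr)()
check:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Toy model: when the backend reports MSR_IA32_SMBASE as not emulated
 * (as TDX does), an SMI request fails with -ENOTTY instead of queueing
 * a KVM_REQ_SMI-style request.
 */
static bool smi_requested;

static int toy_inject_smi(bool smbase_emulated)
{
	if (!smbase_emulated)
		return -ENOTTY;

	smi_requested = true;	/* stands in for kvm_make_request() */
	return 0;
}
```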
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Renamed from "KVM: TDX: Silently discard SMI request" to
"KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM".
- Change the changelog.
- Handle SMI request as !CONFIG_KVM_SMM for TD, and remove the
unnecessary comment. (Sean)
- Bug the VM if SMI OPs are called for a TD and remove related
tdx_* functions, but still keep the vt_* wrappers. (Sean, Paolo)
- Use kvm_x86_call()
---
arch/x86/kvm/smm.h | 3 +++
arch/x86/kvm/vmx/main.c | 43 +++++++++++++++++++++++++++++++++++++----
2 files changed, 42 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..551703fbe200 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -142,6 +142,9 @@ union kvm_smram {
static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
{
+ if (!kvm_x86_call(has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
+ return -ENOTTY;
+
kvm_make_request(KVM_REQ_SMI, vcpu);
return 0;
}
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4b42d14cc62e..8ec96646faec 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -176,6 +176,41 @@ static int vt_handle_exit(struct kvm_vcpu *vcpu,
return vmx_handle_exit(vcpu, fastpath);
}
+#ifdef CONFIG_KVM_SMM
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return false;
+
+ return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_enter_smm(vcpu, smram);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return 0;
+
+ return vmx_leave_smm(vcpu, smram);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+ return;
+
+ /* RSM will cause a vmexit anyway. */
+ vmx_enable_smi_window(vcpu);
+}
+#endif
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -523,10 +558,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.setup_mce = vmx_setup_mce,
#ifdef CONFIG_KVM_SMM
- .smi_allowed = vmx_smi_allowed,
- .enter_smm = vmx_enter_smm,
- .leave_smm = vmx_leave_smm,
- .enable_smi_window = vmx_enable_smi_window,
+ .smi_allowed = vt_smi_allowed,
+ .enter_smm = vt_enter_smm,
+ .leave_smm = vt_leave_smm,
+ .enable_smi_window = vt_enable_smi_window,
#endif
.check_emulate_instruction = vmx_check_emulate_instruction,
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (9 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 10/16] KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2025-01-08 7:21 ` Xiaoyao Li
2025-01-09 2:51 ` Huang, Kai
2024-12-09 1:07 ` [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest Binbin Wu
` (6 subsequent siblings)
17 siblings, 2 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Always block INIT and SIPI events for TDX guests because the TDX module
doesn't provide an API for the VMM to inject an INIT IPI or SIPI.
TDX defines its own vCPU creation and initialization sequence,
consisting of multiple SEAMCALLs, which is only allowed during TD build
time.
Given that TDX guests are para-virtualized to boot the BSP/APs, there
normally shouldn't be any INIT/SIPI events for a TDX guest. If one does
occur, there are three options to handle it:
1. Always block INIT/SIPI request.
2. (Silently) ignore INIT/SIPI request during delivery.
3. Return error to guest TDs somehow.
Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
INIT handling and the vcpu_deliver_sipi_vector() op are never invoked,
so there is no need to add a new interface or helper function for
INIT/SIPI delivery.
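Option 1 above amounts to a one-line check in the init-blocked query; a
toy model of the dispatch, with toy_vmx_init_blocked() as an
illustrative stand-in for the real VMX-side conditions:

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder for vmx_apic_init_signal_blocked(); the in_smm flag is
 * just one illustrative condition, not the full VMX logic. */
static bool toy_vmx_init_blocked(bool in_smm)
{
	return in_smm;
}

/* Toy model of vt_apic_init_signal_blocked(): a TD vCPU reports INIT
 * as unconditionally blocked, so queued INIT/SIPI events are never
 * acted upon; non-TD vCPUs fall through to the VMX implementation. */
static bool toy_apic_init_signal_blocked(bool is_td, bool in_smm)
{
	if (is_td)
		return true;

	return toy_vmx_init_blocked(in_smm);
}
```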
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Renamed from "KVM: TDX: Silently ignore INIT/SIPI" to
"KVM: TDX: Always block INIT/SIPI".
- Remove KVM_BUG_ON() in tdx_vcpu_reset(). (Rick)
- Drop tdx_vcpu_reset() and move the comment to vt_vcpu_reset().
- Remove unnecessary interface and helpers to delivery INIT/SIPI
because INIT/SIPI events are always blocked for TDX. (Binbin)
- Update changelog.
---
arch/x86/kvm/lapic.c | 2 +-
arch/x86/kvm/vmx/main.c | 19 ++++++++++++++++++-
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 474e0a7c1069..f93c382344ee 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
kvm_vcpu_reset(vcpu, true);
- if (kvm_vcpu_is_bsp(apic->vcpu))
+ if (kvm_vcpu_is_bsp(vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
else
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8ec96646faec..7f933f821188 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -115,6 +115,11 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
+ /*
+ * TDX has its own sequence to do init during TD build time (by
+ * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
+ * runtime.
+ */
if (is_td_vcpu(vcpu))
return;
@@ -211,6 +216,18 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
}
#endif
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+ /*
+ * INIT and SIPI are always blocked for TDX, i.e., INIT handling and
+ * the OP vcpu_deliver_sipi_vector() won't be called.
+ */
+ if (is_td_vcpu(vcpu))
+ return true;
+
+ return vmx_apic_init_signal_blocked(vcpu);
+}
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -565,7 +582,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
#endif
.check_emulate_instruction = vmx_check_emulate_instruction,
- .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+ .apic_init_signal_blocked = vt_apic_init_signal_blocked,
.migrate_timers = vmx_migrate_timers,
.msr_filter_changed = vmx_msr_filter_changed,
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (10 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 11/16] KVM: TDX: Always block INIT/SIPI Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2025-01-03 21:59 ` Vishal Annapurve
2025-01-13 2:03 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation Binbin Wu
` (5 subsequent siblings)
17 siblings, 2 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Inhibit APICv for TDX guests in KVM, since TDX doesn't support APICv
accesses from the host VMM.
Follow how SEV inhibits APICv, i.e. define a new inhibit reason for TDX,
set it on TD initialization, and add the flag to
kvm_x86_ops.required_apicv_inhibits.
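The inhibit-reason mechanism sketched above is a per-VM bitmask checked
against the backend's required mask; a minimal model, with TOY_INHIBIT_*
values and helper names as illustrative stand-ins for the kernel's
APICV_INHIBIT_REASON_* machinery:

```c
#include <assert.h>

/* Each inhibit reason is one bit; APICv stays inactive while any set
 * reason is also present in the backend's required_apicv_inhibits. */
enum {
	TOY_INHIBIT_SEV = 1u << 0,
	TOY_INHIBIT_TDX = 1u << 1,
};

static unsigned int vm_inhibits;

static void toy_set_apicv_inhibit(unsigned int reason)
{
	vm_inhibits |= reason;	/* set once at TD initialization */
}

static int toy_apicv_active(unsigned int required_mask)
{
	return (vm_inhibits & required_mask) == 0;
}
```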
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
tdx_td_vcpu_init(). (Rick)
- Change APICV -> APICv in changelog for consistency.
- Split the changelog to 2 paragraphs.
---
arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
arch/x86/kvm/vmx/main.c | 3 ++-
arch/x86/kvm/vmx/tdx.c | 3 +++
3 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 32c7d58a5d68..df535f08e004 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
*/
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
+ /*********************************************************/
+ /* INHIBITs that are relevant only to the Intel's APICv. */
+ /*********************************************************/
+
+ /*
+ * APICv is disabled because TDX doesn't support it.
+ */
+ APICV_INHIBIT_REASON_TDX,
+
NR_APICV_INHIBIT_REASONS,
};
@@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
__APICV_INHIBIT_REASON(IRQWIN), \
__APICV_INHIBIT_REASON(PIT_REINJ), \
__APICV_INHIBIT_REASON(SEV), \
- __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
+ __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \
+ __APICV_INHIBIT_REASON(TDX)
struct kvm_arch {
unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7f933f821188..13a0ab0a520c 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
+ BIT(APICV_INHIBIT_REASON_TDX))
struct kvm_x86_ops vt_x86_ops __initdata = {
.name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b0f525069ebd..b51d2416acfb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
goto teardown;
}
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
+
return 0;
/*
@@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
return -EIO;
}
+ vcpu->arch.apic->apicv_active = false;
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
return 0;
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (11 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2025-01-03 22:04 ` Vishal Annapurve
2025-01-22 11:34 ` Paolo Bonzini
2024-12-09 1:07 ` [PATCH 14/16] KVM: VMX: Move NMI/exception handler to common helper Binbin Wu
` (4 subsequent siblings)
17 siblings, 2 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
TDX protects the TDX guest's APIC state from the VMM. Implement the
access methods for TDX guest vAPIC state as no-ops, or have them return
zero.
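The vt_* wrapper pattern used throughout this patch can be modeled in a
few lines; the toy_* names below are illustrative stand-ins, not the
kernel's functions:

```c
#include <assert.h>
#include <stdbool.h>

/* Counts calls into the VMX-side implementation, standing in for the
 * real vmx_hwapic_irr_update() work. */
static int vmx_calls;

static void toy_vmx_hwapic_irr_update(int max_irr)
{
	(void)max_irr;
	vmx_calls++;
}

/* Toy model of a vt_* vAPIC op: for a TD vCPU the op does nothing,
 * leaving the guest's protected APIC state untouched; otherwise it
 * falls through to the VMX implementation. */
static void toy_vt_hwapic_irr_update(bool is_td, int max_irr)
{
	if (is_td)
		return;	/* vAPIC state is protected; ignore */

	toy_vmx_hwapic_irr_update(max_irr);
}
```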
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Removed WARN_ON_ONCE() in tdx_set_virtual_apic_mode(). (Rick)
- Open code tdx_set_virtual_apic_mode(). (Binbin)
---
arch/x86/kvm/vmx/main.c | 51 +++++++++++++++++++++++++++++++++++++----
1 file changed, 46 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 13a0ab0a520c..6dcc9ebf6d6e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -228,6 +228,15 @@ static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
return vmx_apic_init_signal_blocked(vcpu);
}
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ /* Only x2APIC mode is supported for TD. */
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_set_virtual_apic_mode(vcpu);
+}
+
static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
{
struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -236,6 +245,22 @@ static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
memset(pi->pir, 0, sizeof(pi->pir));
}
+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(int max_isr)
+{
+ if (is_td_vcpu(kvm_get_running_vcpu()))
+ return;
+
+ return vmx_hwapic_isr_update(max_isr);
+}
+
static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
if (is_td_vcpu(vcpu))
@@ -414,6 +439,22 @@ static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
}
+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu))
+ return;
+
+ vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+ if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ return;
+
+ vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
if (!is_td(kvm))
@@ -527,14 +568,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.update_cr8_intercept = vmx_update_cr8_intercept,
.x2apic_icr_is_split = false,
- .set_virtual_apic_mode = vmx_set_virtual_apic_mode,
- .set_apic_access_page_addr = vmx_set_apic_access_page_addr,
- .refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+ .set_virtual_apic_mode = vt_set_virtual_apic_mode,
+ .set_apic_access_page_addr = vt_set_apic_access_page_addr,
+ .refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
.apicv_pre_state_restore = vt_apicv_pre_state_restore,
.required_apicv_inhibits = VMX_REQUIRED_APICV_INHIBITS,
- .hwapic_irr_update = vmx_hwapic_irr_update,
- .hwapic_isr_update = vmx_hwapic_isr_update,
+ .hwapic_irr_update = vt_hwapic_irr_update,
+ .hwapic_isr_update = vt_hwapic_isr_update,
.sync_pir_to_irr = vt_sync_pir_to_irr,
.deliver_interrupt = vt_deliver_interrupt,
.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 14/16] KVM: VMX: Move NMI/exception handler to common helper
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (12 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 15/16] KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT Binbin Wu
` (3 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Sean Christopherson <sean.j.christopherson@intel.com>
TDX handles NMI/exception exits mostly the same as the VMX case; the
difference is how the exit qualification is retrieved. To share the
code with TDX, move the NMI/exception handling to a common header,
common.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX interrupts breakout:
- Update change log with suggestions. (Binbin)
- Move the NMI handling code to common header and add a helper
__vmx_handle_nmi() for it. (Binbin)
---
arch/x86/kvm/vmx/common.h | 72 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 77 +++++----------------------------------
2 files changed, 82 insertions(+), 67 deletions(-)
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index a46f15ddeda1..809ced4c6cd8 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,8 +4,70 @@
#include <linux/kvm_host.h>
+#include <asm/traps.h>
+#include <asm/fred.h>
+
#include "posted_intr.h"
#include "mmu.h"
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_irqoff(unsigned long entry);
+void vmx_do_nmi_irqoff(void);
+
+static inline void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Save xfd_err to guest_fpu before interrupt is enabled, so the
+ * MSR value is not clobbered by the host activity before the guest
+ * has chance to consume it.
+ *
+ * Do not blindly read xfd_err here, since this exception might
+ * be caused by L1 interception on a platform which doesn't
+ * support xfd at all.
+ *
+ * Do it conditionally upon guest_fpu::xfd. xfd_err matters
+ * only when xfd contains a non-zero value.
+ *
+ * Queuing exception is done in vmx_handle_exit. See comment there.
+ */
+ if (vcpu->arch.guest_fpu.fpstate->xfd)
+ rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+}
+
+static inline void vmx_handle_exception_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ /* if exit due to PF check for async PF */
+ if (is_page_fault(intr_info))
+ vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+ /* if exit due to NM, handle before interrupts are enabled */
+ else if (is_nm_fault(intr_info))
+ vmx_handle_nm_fault_irqoff(vcpu);
+ /* Handle machine checks before interrupts are enabled */
+ else if (is_machine_check(intr_info))
+ kvm_machine_check();
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+ u32 intr_info)
+{
+ unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+
+ if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+ "unexpected VM-Exit interrupt info: 0x%x", intr_info))
+ return;
+
+ kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
+ else
+ vmx_do_interrupt_irqoff(gate_offset((gate_desc *)vmx_host_idt_base + vector));
+ kvm_after_interrupt(vcpu);
+
+ vcpu->arch.at_instruction_boundary = true;
+}
static inline bool vt_is_tdx_private_gpa(struct kvm *kvm, gpa_t gpa)
{
@@ -111,4 +173,14 @@ static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
}
+static __always_inline void __vmx_handle_nmi(struct kvm_vcpu *vcpu)
+{
+ kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
+ else
+ vmx_do_nmi_irqoff();
+ kvm_after_interrupt(vcpu);
+}
+
#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 75432e1c9f7f..d2f926d85c3e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -527,7 +527,7 @@ static const struct kvm_vmx_segment_field {
};
-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;
#if IS_ENABLED(CONFIG_HYPERV)
static bool __read_mostly enlightened_vmcs = true;
@@ -4290,7 +4290,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS); /* 22.2.4 */
vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8); /* 22.2.4 */
- vmcs_writel(HOST_IDTR_BASE, host_idt_base); /* 22.2.4 */
+ vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base); /* 22.2.4 */
vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */
@@ -5160,7 +5160,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
intr_info = vmx_get_intr_info(vcpu);
/*
- * Machine checks are handled by handle_exception_irqoff(), or by
+ * Machine checks are handled by vmx_handle_exception_irqoff(), or by
* vmx_vcpu_run() if a #MC occurs on VM-Entry. NMIs are handled by
* vmx_vcpu_enter_exit().
*/
@@ -5168,7 +5168,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
return 1;
/*
- * Queue the exception here instead of in handle_nm_fault_irqoff().
+ * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
* This ensures the nested_vmx check is not skipped so vmexit can
* be reflected to L1 (when it intercepts #NM) before reaching this
* point.
@@ -6887,58 +6887,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
void vmx_do_interrupt_irqoff(unsigned long entry);
void vmx_do_nmi_irqoff(void);
-static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
-{
- /*
- * Save xfd_err to guest_fpu before interrupt is enabled, so the
- * MSR value is not clobbered by the host activity before the guest
- * has chance to consume it.
- *
- * Do not blindly read xfd_err here, since this exception might
- * be caused by L1 interception on a platform which doesn't
- * support xfd at all.
- *
- * Do it conditionally upon guest_fpu::xfd. xfd_err matters
- * only when xfd contains a non-zero value.
- *
- * Queuing exception is done in vmx_handle_exit. See comment there.
- */
- if (vcpu->arch.guest_fpu.fpstate->xfd)
- rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
-}
-
-static void handle_exception_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
- /* if exit due to PF check for async PF */
- if (is_page_fault(intr_info))
- vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
- /* if exit due to NM, handle before interrupts are enabled */
- else if (is_nm_fault(intr_info))
- handle_nm_fault_irqoff(vcpu);
- /* Handle machine checks before interrupts are enabled */
- else if (is_machine_check(intr_info))
- kvm_machine_check();
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
- u32 intr_info)
-{
- unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-
- if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
- "unexpected VM-Exit interrupt info: 0x%x", intr_info))
- return;
-
- kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
- if (cpu_feature_enabled(X86_FEATURE_FRED))
- fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
- else
- vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector));
- kvm_after_interrupt(vcpu);
-
- vcpu->arch.at_instruction_boundary = true;
-}
-
void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6947,9 +6895,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
return;
if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
- handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ vmx_get_intr_info(vcpu));
else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
- handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
+ vmx_handle_exception_irqoff(vcpu, vmx_get_intr_info(vcpu));
}
/*
@@ -7238,14 +7187,8 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
if ((u16)vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI &&
- is_nmi(vmx_get_intr_info(vcpu))) {
- kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
- if (cpu_feature_enabled(X86_FEATURE_FRED))
- fred_entry_from_kvm(EVENT_TYPE_NMI, NMI_VECTOR);
- else
- vmx_do_nmi_irqoff();
- kvm_after_interrupt(vcpu);
- }
+ is_nmi(vmx_get_intr_info(vcpu)))
+ __vmx_handle_nmi(vcpu);
out:
guest_state_exit_irqoff();
@@ -8309,7 +8252,7 @@ __init int vmx_hardware_setup(void)
int r;
store_idt(&dt);
- host_idt_base = dt.address;
+ vmx_host_idt_base = dt.address;
vmx_setup_user_return_msrs();
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 15/16] KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (13 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 14/16] KVM: VMX: Move NMI/exception handler to common helper Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 16/16] KVM: TDX: Handle EXIT_REASON_OTHER_SMI Binbin Wu
` (2 subsequent siblings)
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX.
NMI Handling: With the current TDX module, NMIs are unblocked after the
SEAMRET from SEAM root mode to VMX root mode caused by an NMI VM-Exit,
which could lead to an NMI handling-order issue. Putting the NMI
VM-Exit handling in the noinstr section can't completely prevent the
ordering issue, but it minimizes the window.
Interrupt and Exception Handling: Similar to the VMX case, external
interrupts and exceptions (machine check is the only exception type
KVM handles for TDX guests) are handled in the .handle_exit_irqoff()
callback.
For other exceptions, because TDX guest state is protected, exceptions
in TDX guests can't be intercepted, and the VMM isn't supposed to
handle them. If an unexpected exception occurs, exit to userspace with
KVM_EXIT_EXCEPTION.
For external interrupts, increase the statistics, the same as in the
VMX case.
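The exit-reason handling described above can be sketched as a toy
dispatcher; vector values follow the x86 exception table and the TOY_*
names are illustrative stand-ins for tdx_handle_exception_nmi():

```c
#include <assert.h>

#define TOY_NMI_VECTOR 2	/* NMI */
#define TOY_MC_VECTOR  18	/* machine check (#MC) */

/* Toy model: NMIs and machine checks were already handled with IRQs
 * off (in the enter/exit and irqoff paths), so the exit handler just
 * resumes the guest (returns 1); any other exception vector is
 * unexpected and escapes to userspace (returns 0, i.e. the
 * KVM_EXIT_EXCEPTION path). */
static int toy_handle_exception_exit(unsigned int vector)
{
	if (vector == TOY_NMI_VECTOR || vector == TOY_MC_VECTOR)
		return 1;	/* already handled; resume the guest */

	return 0;		/* unexpected exception: exit to userspace */
}
```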
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX interrupts breakout:
- Renamed from "KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT"
to "KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT".
- Update changelog.
- Rename tdx_handle_exception() to tdx_handle_exception_nmi() to reflect
that NMI is also checked. (Binbin)
- Add comments in tdx_handle_exception_nmi() about why NMI and machine
checks are ignored. (Chao)
- Exit to userspace with KVM_EXIT_EXCEPTION when unexpected exception
occurs instead of returning -EFAULT. (Chao, Isaku)
- Switch to vp_enter_ret.
- Move the handling of NMI, exception and external interrupt from
"KVM: TDX: Add a place holder to handle TDX VM exit" to this patch.
- Use helper __vmx_handle_nmi() to handle NMI, which including the
support for FRED.
---
arch/x86/kvm/vmx/main.c | 12 +++++++++-
arch/x86/kvm/vmx/tdx.c | 46 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/x86_ops.h | 2 ++
3 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6dcc9ebf6d6e..305425b19cb5 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -181,6 +181,16 @@ static int vt_handle_exit(struct kvm_vcpu *vcpu,
return vmx_handle_exit(vcpu, fastpath);
}
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ if (is_td_vcpu(vcpu)) {
+ tdx_handle_exit_irqoff(vcpu);
+ return;
+ }
+
+ vmx_handle_exit_irqoff(vcpu);
+}
+
#ifdef CONFIG_KVM_SMM
static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
@@ -599,7 +609,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.load_mmu_pgd = vt_load_mmu_pgd,
.check_intercept = vmx_check_intercept,
- .handle_exit_irqoff = vmx_handle_exit_irqoff,
+ .handle_exit_irqoff = vt_handle_exit_irqoff,
.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b51d2416acfb..fb7f825ed1ed 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -961,6 +961,10 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu)
REG(rsi, RSI);
#undef REG
+ if (tdx_check_exit_reason(vcpu, EXIT_REASON_EXCEPTION_NMI) &&
+ is_nmi(tdexit_intr_info(vcpu)))
+ __vmx_handle_nmi(vcpu);
+
guest_state_exit_irqoff();
}
@@ -1040,6 +1044,44 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
vcpu->arch.nmi_pending = 0;
}
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+ if (tdx_check_exit_reason(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT))
+ vmx_handle_external_interrupt_irqoff(vcpu,
+ tdexit_intr_info(vcpu));
+ else if (tdx_check_exit_reason(vcpu, EXIT_REASON_EXCEPTION_NMI))
+ vmx_handle_exception_irqoff(vcpu, tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_exception_nmi(struct kvm_vcpu *vcpu)
+{
+ u32 intr_info = tdexit_intr_info(vcpu);
+
+ /*
+ * Machine checks are handled by vmx_handle_exception_irqoff(), or by
+ * tdx_handle_exit() with TDX_NON_RECOVERABLE set if a #MC occurs on
+ * VM-Entry. NMIs are handled by tdx_vcpu_enter_exit().
+ */
+ if (is_nmi(intr_info) || is_machine_check(intr_info))
+ return 1;
+
+ kvm_pr_unimpl("unexpected exception 0x%x(exit_reason 0x%llx qual 0x%lx)\n",
+ intr_info,
+ to_tdx(vcpu)->vp_enter_ret, tdexit_exit_qual(vcpu));
+
+ vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
+ vcpu->run->ex.exception = intr_info & INTR_INFO_VECTOR_MASK;
+ vcpu->run->ex.error_code = 0;
+
+ return 0;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.irq_exits;
+ return 1;
+}
+
static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
{
vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -1765,6 +1807,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
exit_reason = tdexit_exit_reason(vcpu);
switch (exit_reason.basic) {
+ case EXIT_REASON_EXCEPTION_NMI:
+ return tdx_handle_exception_nmi(vcpu);
+ case EXIT_REASON_EXTERNAL_INTERRUPT:
+ return tdx_handle_external_interrupt(vcpu);
case EXIT_REASON_TDCALL:
return handle_tdvmcall(vcpu);
default:
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 1d2bf6972ea9..28dbef77700c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -133,6 +133,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath);
@@ -176,6 +177,7 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediat
static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
+static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath) { return 0; }
--
2.46.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 16/16] KVM: TDX: Handle EXIT_REASON_OTHER_SMI
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (14 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 15/16] KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT Binbin Wu
@ 2024-12-09 1:07 ` Binbin Wu
2024-12-10 18:24 ` [PATCH 00/16] KVM: TDX: TDX interrupts Paolo Bonzini
2025-01-06 10:51 ` Xiaoyao Li
17 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2024-12-09 1:07 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel, binbin.wu
From: Isaku Yamahata <isaku.yamahata@intel.com>
Handle the VM exit caused by "other SMI" for TDX by returning to
userspace for the Machine Check System Management Interrupt (MSMI) case,
or by ignoring it and resuming the vCPU for the non-MSMI case.
For VMX, an SMM transition can happen in both VMX non-root mode and VMX
root mode. Unlike VMX, in SEAM root mode (the TDX module), all interrupts
are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI
causes a VM exit to the TDX module, followed by SEAMRET to KVM. Once it
exits to KVM, the SMI is delivered and handled by the kernel handler
right away.
An SMI can be an "I/O SMI" or an "other SMI". For TDX, there will be no
I/O SMI because I/O instructions inside a TDX guest trigger #VE, and the
guest must use TDVMCALL to request the VMM to do the I/O emulation.
For "other SMI", there are two cases:
- MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, a #MC that occurs
in the TDX guest will be delivered as an MSMI. It causes an
EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
qualification. On VM exit, the TDX module checks whether the "other SMI"
was caused by an MSMI. If so, the TDX module marks the TD as fatal,
preventing further TD entries, and then completes the TD exit flow to KVM
with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD. After the
TD exit, the MSMI is delivered and eventually handled by the kernel
machine check handler (7911f145de5f x86/mce: Implement recovery for
errors in TDX/SEAM non-root mode), i.e., the memory page is marked as
poisoned and won't be freed to the free list when the TDX guest is
terminated. Since the TDX guest is dead, follow the other non-recoverable
cases and exit to userspace.
- Non-MSMI case. KVM doesn't need to do anything; just continue TDX vCPU
execution.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
TDX interrupts breakout:
- Squashed "KVM: TDX: Handle EXIT_REASON_OTHER_SMI" and
"KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI". (Chao)
- Rewrite the changelog.
- Remove the explicit call of kvm_machine_check() because the MSMI can
be handled by host #MC handler.
- Update comments according to the code change.
---
arch/x86/include/uapi/asm/vmx.h | 1 +
arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 6a9f268a2d2c..f0f4a4cf84a7 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -34,6 +34,7 @@
#define EXIT_REASON_TRIPLE_FAULT 2
#define EXIT_REASON_INIT_SIGNAL 3
#define EXIT_REASON_SIPI_SIGNAL 4
+#define EXIT_REASON_OTHER_SMI 6
#define EXIT_REASON_INTERRUPT_WINDOW 7
#define EXIT_REASON_NMI_WINDOW 8
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index fb7f825ed1ed..3cf8a4e1fc29 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1813,6 +1813,27 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
return tdx_handle_external_interrupt(vcpu);
case EXIT_REASON_TDCALL:
return handle_tdvmcall(vcpu);
+ case EXIT_REASON_OTHER_SMI:
+ /*
+ * Unlike VMX, SMI in SEAM non-root mode (i.e. when
+ * TD guest vCPU is running) will cause VM exit to TDX module,
+ * then SEAMRET to KVM. Once it exits to KVM, SMI is delivered
+ * and handled by kernel handler right away.
+ *
+ * The Other SMI exit can also be caused by the SEAM non-root
+ * machine check delivered via Machine Check System Management
+ * Interrupt (MSMI), but it has already been handled by the
+ * kernel machine check handler, i.e., the memory page has been
+ * marked as poisoned and it won't be freed to the free list
+ * when the TDX guest is terminated (the TDX module marks the
+ * guest as dead and prevents it from running further when a
+ * machine check happens in SEAM non-root).
+ *
+ * - An MSMI will not reach here; it's handled as the non_recoverable
+ * case above.
+ * - If it's not an MSMI, no need to do anything here.
+ */
+ return 1;
default:
break;
}
--
2.46.0
* Re: [PATCH 00/16] KVM: TDX: TDX interrupts
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (15 preceding siblings ...)
2024-12-09 1:07 ` [PATCH 16/16] KVM: TDX: Handle EXIT_REASON_OTHER_SMI Binbin Wu
@ 2024-12-10 18:24 ` Paolo Bonzini
2025-01-06 10:51 ` Xiaoyao Li
17 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2024-12-10 18:24 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
Applied to kvm-coco-queue, thanks.
Paolo
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2024-12-09 1:07 ` [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest Binbin Wu
@ 2025-01-03 21:59 ` Vishal Annapurve
2025-01-06 1:46 ` Binbin Wu
2025-01-13 2:03 ` Binbin Wu
1 sibling, 1 reply; 56+ messages in thread
From: Vishal Annapurve @ 2025-01-03 21:59 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> from host VMM.
>
> Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> TDX interrupts breakout:
> - Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
> tdx_td_vcpu_init(). (Rick)
> - Change APICV -> APICv in changelog for consistency.
> - Split the changelog to 2 paragraphs.
> ---
> arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
> arch/x86/kvm/vmx/main.c | 3 ++-
> arch/x86/kvm/vmx/tdx.c | 3 +++
> 3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 32c7d58a5d68..df535f08e004 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
> */
> APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
>
> + /*********************************************************/
> + /* INHIBITs that are relevant only to the Intel's APICv. */
> + /*********************************************************/
> +
> + /*
> + * APICv is disabled because TDX doesn't support it.
> + */
> + APICV_INHIBIT_REASON_TDX,
> +
> NR_APICV_INHIBIT_REASONS,
> };
>
> @@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
> __APICV_INHIBIT_REASON(IRQWIN), \
> __APICV_INHIBIT_REASON(PIT_REINJ), \
> __APICV_INHIBIT_REASON(SEV), \
> - __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
> + __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \
> + __APICV_INHIBIT_REASON(TDX)
>
> struct kvm_arch {
> unsigned long n_used_mmu_pages;
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 7f933f821188..13a0ab0a520c 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
> BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
> BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
> - BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
> + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
> + BIT(APICV_INHIBIT_REASON_TDX))
>
> struct kvm_x86_ops vt_x86_ops __initdata = {
> .name = KBUILD_MODNAME,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b0f525069ebd..b51d2416acfb 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> goto teardown;
> }
>
> + kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
> +
> return 0;
>
> /*
> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> return -EIO;
> }
>
> + vcpu->arch.apic->apicv_active = false;
With this setting, apic_timer_expired[1] will always cause timer
interrupts to be pending without injecting them right away. Injecting
it after VM exit [2] could cause unbounded delays to timer interrupt
injection.
[1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/lapic.c?h=kvm-coco-queue#n1922
[2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>
> return 0;
> --
> 2.46.0
>
>
* Re: [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation
2024-12-09 1:07 ` [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation Binbin Wu
@ 2025-01-03 22:04 ` Vishal Annapurve
2025-01-06 2:18 ` Binbin Wu
2025-01-22 11:34 ` Paolo Bonzini
1 sibling, 1 reply; 56+ messages in thread
From: Vishal Annapurve @ 2025-01-03 22:04 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Sun, Dec 8, 2024 at 5:12 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> ...
> +}
> +
> static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
> {
> struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
> @@ -236,6 +245,22 @@ static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
> memset(pi->pir, 0, sizeof(pi->pir));
Should this be a nop for TDX VMs? pre_state_restore could cause
pending PIRs to get cleared, as KVM doesn't have the ability to sync them
to vIRR in the absence of access to the vAPIC page.
> }
>
> +static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> +{
> + if (is_td_vcpu(vcpu))
> + return;
> +
> + return vmx_hwapic_irr_update(vcpu, max_irr);
> +}
> +
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-03 21:59 ` Vishal Annapurve
@ 2025-01-06 1:46 ` Binbin Wu
2025-01-06 22:49 ` Vishal Annapurve
0 siblings, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-06 1:46 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel, Binbin Wu
On 1/4/2025 5:59 AM, Vishal Annapurve wrote:
> On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
>> from host VMM.
>>
>> Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
>> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> ---
>> TDX interrupts breakout:
>> - Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
>> tdx_td_vcpu_init(). (Rick)
>> - Change APICV -> APICv in changelog for consistency.
>> - Split the changelog to 2 paragraphs.
>> ---
>> arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
>> arch/x86/kvm/vmx/main.c | 3 ++-
>> arch/x86/kvm/vmx/tdx.c | 3 +++
>> 3 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 32c7d58a5d68..df535f08e004 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
>> */
>> APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
>>
>> + /*********************************************************/
>> + /* INHIBITs that are relevant only to the Intel's APICv. */
>> + /*********************************************************/
>> +
>> + /*
>> + * APICv is disabled because TDX doesn't support it.
>> + */
>> + APICV_INHIBIT_REASON_TDX,
>> +
>> NR_APICV_INHIBIT_REASONS,
>> };
>>
>> @@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
>> __APICV_INHIBIT_REASON(IRQWIN), \
>> __APICV_INHIBIT_REASON(PIT_REINJ), \
>> __APICV_INHIBIT_REASON(SEV), \
>> - __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
>> + __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \
>> + __APICV_INHIBIT_REASON(TDX)
>>
>> struct kvm_arch {
>> unsigned long n_used_mmu_pages;
>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>> index 7f933f821188..13a0ab0a520c 100644
>> --- a/arch/x86/kvm/vmx/main.c
>> +++ b/arch/x86/kvm/vmx/main.c
>> @@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
>> BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
>> BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
>> BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
>> - BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
>> + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
>> + BIT(APICV_INHIBIT_REASON_TDX))
>>
>> struct kvm_x86_ops vt_x86_ops __initdata = {
>> .name = KBUILD_MODNAME,
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index b0f525069ebd..b51d2416acfb 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
>> goto teardown;
>> }
>>
>> + kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
>> +
>> return 0;
>>
>> /*
>> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
>> return -EIO;
>> }
>>
>> + vcpu->arch.apic->apicv_active = false;
> With this setting, apic_timer_expired[1] will always cause timer
> interrupts to be pending without injecting them right away. Injecting
> it after VM exit [2] could cause unbounded delays to timer interrupt
> injection.
When apic->apicv_active is false, it will fall back to incrementing
apic->lapic_timer.pending and requesting KVM_REQ_UNBLOCK.
If apic_timer_expired() is called from the timer function, the target vCPU
will be kicked immediately.
So there is no unbounded delay in timer interrupt injection.
In a previous PUCK session, Sean suggested that apic->apicv_active should
be set to true to align with the hardware setting, because the TDX module
always enables APICv for TDX guests.
Will send a write-up about changing apicv to active later.
>
> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/lapic.c?h=kvm-coco-queue#n1922
> [2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
>
>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>
>> return 0;
>> --
>> 2.46.0
>>
>>
* Re: [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation
2025-01-03 22:04 ` Vishal Annapurve
@ 2025-01-06 2:18 ` Binbin Wu
0 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2025-01-06 2:18 UTC (permalink / raw)
To: Vishal Annapurve
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On 1/4/2025 6:04 AM, Vishal Annapurve wrote:
> On Sun, Dec 8, 2024 at 5:12 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>> ...
>> +}
>> +
>> static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
>> {
>> struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
>> @@ -236,6 +245,22 @@ static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
>> memset(pi->pir, 0, sizeof(pi->pir));
> Should this be a nop for TDX VMs? pre_state_restore could cause
> pending PIRs to get cleared as KVM doesn't have ability to sync them
> to vIRR in absence of access to the VAPIC page.
This callback is called by kvm_lapic_reset() and kvm_apic_set_state().
If it is called by kvm_lapic_reset(), the PIR should be cleared.
If it is called by kvm_apic_set_state(), userspace wants to set up the
LAPIC; that will be needed when live migration is enabled for TDX.
For a VMX VM, the PIR is synced to vIRR and then the state is sent to the
destination VM.
For a TDX guest, I am not sure about the final solution for syncing the
PIR from the source to the destination TDX guest. I guess the TDX module
will probably do the job. Will ask Intel folks to check what the solution is.
For this base series, TDX live migration is not supported yet, so it is OK
to reset the PIR for now.
>
>> }
>>
>> +static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
>> +{
>> + if (is_td_vcpu(vcpu))
>> + return;
>> +
>> + return vmx_hwapic_irr_update(vcpu, max_irr);
>> +}
>> +
* Re: [PATCH 00/16] KVM: TDX: TDX interrupts
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
` (16 preceding siblings ...)
2024-12-10 18:24 ` [PATCH 00/16] KVM: TDX: TDX interrupts Paolo Bonzini
@ 2025-01-06 10:51 ` Xiaoyao Li
2025-01-06 20:08 ` Sean Christopherson
17 siblings, 1 reply; 56+ messages in thread
From: Xiaoyao Li @ 2025-01-06 10:51 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 12/9/2024 9:07 AM, Binbin Wu wrote:
> Hi,
>
> This patch series introduces the support of interrupt handling for TDX
> guests, including virtual interrupt injection and VM-Exits caused by
> vectored events.
>
(I'm not sure if this is the correct place to raise the discussion on
KVM_GET_LAPIC and KVM_SET_LAPIC for TDX, but it seems the most related
series.)
Should KVM reject KVM_GET_LAPIC and KVM_SET_LAPIC for TDX?
* Re: [PATCH 00/16] KVM: TDX: TDX interrupts
2025-01-06 10:51 ` Xiaoyao Li
@ 2025-01-06 20:08 ` Sean Christopherson
2025-01-09 2:44 ` Binbin Wu
0 siblings, 1 reply; 56+ messages in thread
From: Sean Christopherson @ 2025-01-06 20:08 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Binbin Wu, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Mon, Jan 06, 2025, Xiaoyao Li wrote:
> On 12/9/2024 9:07 AM, Binbin Wu wrote:
> > Hi,
> >
> > This patch series introduces the support of interrupt handling for TDX
> > guests, including virtual interrupt injection and VM-Exits caused by
> > vectored events.
> >
>
> (I'm not sure if this is the correct place to raise the discussion on
> KVM_GET_LAPIC and KVM_SET_LAPIC for TDX, but it seems the most related
> series.)
>
> Should KVM reject KVM_GET_LAPIC and KVM_SET_LAPIC for TDX?
Yes, IIRC that was what Paolo suggested in one of the many PUCK calls. Until
KVM supports intra-host migration for TDX guests, getting and setting APIC state
is nonsensical.
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-06 1:46 ` Binbin Wu
@ 2025-01-06 22:49 ` Vishal Annapurve
2025-01-06 23:40 ` Sean Christopherson
0 siblings, 1 reply; 56+ messages in thread
From: Vishal Annapurve @ 2025-01-06 22:49 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Sun, Jan 5, 2025 at 5:46 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
>
>
>
> On 1/4/2025 5:59 AM, Vishal Annapurve wrote:
> > On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> >> From: Isaku Yamahata <isaku.yamahata@intel.com>
> >>
> >> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> >> from host VMM.
> >>
> >> Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
> >> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
> >>
> >> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> >> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> >> ---
> >> TDX interrupts breakout:
> >> - Removed WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm)) in
> >> tdx_td_vcpu_init(). (Rick)
> >> - Change APICV -> APICv in changelog for consistency.
> >> - Split the changelog to 2 paragraphs.
> >> ---
> >> arch/x86/include/asm/kvm_host.h | 12 +++++++++++-
> >> arch/x86/kvm/vmx/main.c | 3 ++-
> >> arch/x86/kvm/vmx/tdx.c | 3 +++
> >> 3 files changed, 16 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >> index 32c7d58a5d68..df535f08e004 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -1281,6 +1281,15 @@ enum kvm_apicv_inhibit {
> >> */
> >> APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
> >>
> >> + /*********************************************************/
> >> + /* INHIBITs that are relevant only to the Intel's APICv. */
> >> + /*********************************************************/
> >> +
> >> + /*
> >> + * APICv is disabled because TDX doesn't support it.
> >> + */
> >> + APICV_INHIBIT_REASON_TDX,
> >> +
> >> NR_APICV_INHIBIT_REASONS,
> >> };
> >>
> >> @@ -1299,7 +1308,8 @@ enum kvm_apicv_inhibit {
> >> __APICV_INHIBIT_REASON(IRQWIN), \
> >> __APICV_INHIBIT_REASON(PIT_REINJ), \
> >> __APICV_INHIBIT_REASON(SEV), \
> >> - __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
> >> + __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \
> >> + __APICV_INHIBIT_REASON(TDX)
> >>
> >> struct kvm_arch {
> >> unsigned long n_used_mmu_pages;
> >> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> >> index 7f933f821188..13a0ab0a520c 100644
> >> --- a/arch/x86/kvm/vmx/main.c
> >> +++ b/arch/x86/kvm/vmx/main.c
> >> @@ -445,7 +445,8 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
> >> BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
> >> BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
> >> BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
> >> - BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
> >> + BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
> >> + BIT(APICV_INHIBIT_REASON_TDX))
> >>
> >> struct kvm_x86_ops vt_x86_ops __initdata = {
> >> .name = KBUILD_MODNAME,
> >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> index b0f525069ebd..b51d2416acfb 100644
> >> --- a/arch/x86/kvm/vmx/tdx.c
> >> +++ b/arch/x86/kvm/vmx/tdx.c
> >> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> >> goto teardown;
> >> }
> >>
> >> + kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
> >> +
> >> return 0;
> >>
> >> /*
> >> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> >> return -EIO;
> >> }
> >>
> >> + vcpu->arch.apic->apicv_active = false;
> > With this setting, apic_timer_expired[1] will always cause timer
> > interrupts to be pending without injecting them right away. Injecting
> > it after VM exit [2] could cause unbounded delays to timer interrupt
> > injection.
>
> When apic->apicv_active is false, it will fallback to increasing the
> apic->lapic_timer.pending and request KVM_REQ_UNBLOCK.
> If apic_timer_expired() is called from timer function, the target vCPU
> will be kicked out immediately.
> So there is no unbounded delay to timer interrupt injection.
Ack. Though, wouldn't it be faster to just post timer interrupts right
away without causing a vCPU exit?
Another scenario I was thinking of is an hrtimer expiring while a vCPU
exit is being handled in KVM/userspace, which will cause the timer
interrupt to be injected only after the next exit [1], delaying the timer
interrupt to the guest.
[1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
>
> In a previous PUCK session, Sean suggested apic->apicv_active should be
> set to true to align with the hardware setting because TDX module always
> enables apicv for TDX guests.
> Will send a write up about changing apicv to active later.
>
> >
> > [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/lapic.c?h=kvm-coco-queue#n1922
> > [2] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
> >
> >> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> >>
> >> return 0;
> >> --
> >> 2.46.0
> >>
> >>
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-06 22:49 ` Vishal Annapurve
@ 2025-01-06 23:40 ` Sean Christopherson
2025-01-07 3:24 ` Chao Gao
0 siblings, 1 reply; 56+ messages in thread
From: Sean Christopherson @ 2025-01-06 23:40 UTC (permalink / raw)
To: Vishal Annapurve
Cc: Binbin Wu, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On Mon, Jan 06, 2025, Vishal Annapurve wrote:
> On Sun, Jan 5, 2025 at 5:46 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> > On 1/4/2025 5:59 AM, Vishal Annapurve wrote:
> > > On Sun, Dec 8, 2024 at 5:11 PM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> > >> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > >> index b0f525069ebd..b51d2416acfb 100644
> > >> --- a/arch/x86/kvm/vmx/tdx.c
> > >> +++ b/arch/x86/kvm/vmx/tdx.c
> > >> @@ -2143,6 +2143,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
> > >> goto teardown;
> > >> }
> > >>
> > >> + kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
> > >> +
> > >> return 0;
> > >>
> > >> /*
> > >> @@ -2528,6 +2530,7 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > >> return -EIO;
> > >> }
> > >>
> > >> + vcpu->arch.apic->apicv_active = false;
> > > With this setting, apic_timer_expired[1] will always cause timer
> > > interrupts to be pending without injecting them right away. Injecting
> > > it after VM exit [2] could cause unbounded delays to timer interrupt
> > > injection.
> >
> > When apic->apicv_active is false, it will fallback to increasing the
> > apic->lapic_timer.pending and request KVM_REQ_UNBLOCK.
> > If apic_timer_expired() is called from timer function, the target vCPU
> > will be kicked out immediately.
> > So there is no unbounded delay to timer interrupt injection.
>
> Ack. Though, wouldn't it be faster to just post timer interrupts right
> away without causing vcpu exit?
Yes, but if and only if hrtimers are offloaded to dedicated "housekeeping" CPUs.
If the hrtimer is running on the same pCPU that the vCPU is running on, then the
expiration of the underlying hardware timer (in practice, the "real" APIC timer)
will generate a host IRQ and thus a VM-Exit. I.e. the vCPU will already be kicked
into the host, and the virtual timer IRQ will be delivered prior to re-entering
the guest.
Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
being disabled to detect that it's worth trying to post a timer interrupt, but off
the top of my head I'm pretty sure that's unnecessary and pointless. The
"vcpu->mode == IN_GUEST_MODE" is super cheap, and I can't think of any harm in
posting the timer interrupt if the target vCPU happens to be in guest mode even
if the host wasn't configured just so.
> Another scenario I was thinking of was hrtimer expiry during vcpu exit
> being handled in KVM/userspace, which will cause timer interrupt
> injection after the next exit [1] delaying timer interrupt to guest.
No, the IRQ won't be delayed. Expiration from the hrtimer callback will set
KVM_REQ_UNBLOCK, which will prevent actually entering the guest (see the call
to kvm_request_pending() in kvm_vcpu_exit_request()).
> This scenario is not specific to TDX VMs though.
>
> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-06 23:40 ` Sean Christopherson
@ 2025-01-07 3:24 ` Chao Gao
2025-01-07 8:09 ` Binbin Wu
0 siblings, 1 reply; 56+ messages in thread
From: Chao Gao @ 2025-01-07 3:24 UTC (permalink / raw)
To: Sean Christopherson
Cc: Vishal Annapurve, Binbin Wu, pbonzini, kvm, rick.p.edgecombe,
kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
tony.lindgren, isaku.yamahata, yan.y.zhao, linux-kernel
>Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
>being disabled to detect that it's worth trying to post a timer interrupt, but off
>the top of my head I'm pretty sure that's unnecessary and pointless. The
Commit 1714a4eb6fb0 gives an explanation:
if only some guests isolated and others not, they would not
have any benefit from posted timer interrupts, and at the same time lose
VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
returns true and therefore forces kvm_can_use_hv_timer() to false.
>"vcpu->mode == IN_GUEST_MODE" is super cheap, and I can't think of any harm in
>posting the timer interrupt if the target vCPU happens to be in guest mode even
>if the host wasn't configured just so.
>
>> Another scenario I was thinking of was hrtimer expiry during vcpu exit
>> being handled in KVM/userspace, which will cause timer interrupt
>> injection after the next exit [1] delaying timer interrupt to guest.
>
>No, the IRQ won't be delayed. Expiration from the hrtimer callback will set
>KVM_REQ_UNBLOCK, which will prevent actually entering the guest (see the call
>to kvm_request_pending() in kvm_vcpu_exit_request()).
>
>> This scenario is not specific to TDX VMs though.
>>
>> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-07 3:24 ` Chao Gao
@ 2025-01-07 8:09 ` Binbin Wu
2025-01-07 21:15 ` Sean Christopherson
0 siblings, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-07 8:09 UTC (permalink / raw)
To: Chao Gao, Sean Christopherson
Cc: Vishal Annapurve, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, xiaoyao.li, tony.lindgren,
isaku.yamahata, yan.y.zhao, linux-kernel
On 1/7/2025 11:24 AM, Chao Gao wrote:
>> Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
>> being disabled to detect that it's worth trying to post a timer interrupt, but off
>> the top of my head I'm pretty sure that's unnecessary and pointless. The
> Commit 1714a4eb6fb0 gives an explanation:
>
> if only some guests isolated and others not, they would not
> have any benefit from posted timer interrupts, and at the same time lose
> VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
> returns true and therefore forces kvm_can_use_hv_timer() to false.
Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest and/or
hlt_in_guest for non-TDX guests. The code doesn't work for TDX guests.
- Whether MWAIT in guest is allowed for TDX depends on the CPUID
configuration in TD_PARAMS, not the capability of the physical CPU checked
by kvm_can_mwait_in_guest().
- HLT for TDX goes through a TDVMCALL, i.e. hlt_in_guest should always be
false for TDX guests.
For TDX guests, I'm not sure whether it's worth fixing
kvm_{mwait,hlt}_in_guest() or adding special handling (i.e., skipping the
mwait/hlt-in-guest checks), since the VMX preemption timer can't be used
anyway, in order to allow a housekeeping CPU to post timer interrupts for
TDX vCPUs when nohz_full is configured, after changing APICv active to true.
>
>> "vcpu->mode == IN_GUEST_MODE" is super cheap, and I can't think of any harm in
>> posting the timer interrupt if the target vCPU happens to be in guest mode even
>> if the host wasn't configured just so.
>>
>>> Another scenario I was thinking of was hrtimer expiry during vcpu exit
>>> being handled in KVM/userspace, which will cause timer interrupt
>>> injection after the next exit [1] delaying timer interrupt to guest.
>> No, the IRQ won't be delayed. Expiration from the hrtimer callback will set
>> KVM_REQ_UNBLOCK, which will prevent actually entering the guest (see the call
>> to kvm_request_pending() in kvm_vcpu_exit_request()).
>>
>>> This scenario is not specific to TDX VMs though.
>>>
>>> [1] https://git.kernel.org/pub/scm/virt/kvm/kvm.git/tree/arch/x86/kvm/x86.c?h=kvm-coco-queue#n11263
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-07 8:09 ` Binbin Wu
@ 2025-01-07 21:15 ` Sean Christopherson
0 siblings, 0 replies; 56+ messages in thread
From: Sean Christopherson @ 2025-01-07 21:15 UTC (permalink / raw)
To: Binbin Wu
Cc: Chao Gao, Vishal Annapurve, pbonzini, kvm, rick.p.edgecombe,
kai.huang, adrian.hunter, reinette.chatre, xiaoyao.li,
tony.lindgren, isaku.yamahata, yan.y.zhao, linux-kernel
On Tue, Jan 07, 2025, Binbin Wu wrote:
> On 1/7/2025 11:24 AM, Chao Gao wrote:
> > > Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT interception
> > > being disabled to detect that it's worth trying to post a timer interrupt, but off
> > > the top of my head I'm pretty sure that's unnecessary and pointless. The
> > Commit 1714a4eb6fb0 gives an explanation:
> >
> > if only some guests isolated and others not, they would not
> > have any benefit from posted timer interrupts, and at the same time lose
> > VMX preemption timer fast paths because kvm_can_post_timer_interrupt()
> > returns true and therefore forces kvm_can_use_hv_timer() to false.
But that's only relevant for setup. Upon expiration, if the target vCPU is in
guest mode and APICv is active, then from_timer_fn must be true, which in turn
means that hrtimers are in use.
> Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest and/or
> hlt_in_guest for non-TDX guests. The code doesn't work for TDX guests.
> - Whether MWAIT in guest is allowed for TDX depends on the CPUID
> configuration in TD_PARAMS, not the capability of the physical CPU checked
> by kvm_can_mwait_in_guest().
> - HLT for TDX goes through a TDVMCALL, i.e. hlt_in_guest should always be
> false for TDX guests.
>
> For TDX guests, I'm not sure whether it's worth fixing
> kvm_{mwait,hlt}_in_guest() or adding special handling (i.e., skipping the
> mwait/hlt-in-guest checks), since the VMX preemption timer can't be used
> anyway, in order to allow a housekeeping CPU to post timer interrupts for
> TDX vCPUs when nohz_full is configured, after changing APICv active to true.
Yes, but I don't think we need any TDX-specific logic beyond noting that the
VMX preemption timer can't be used. As above, consulting kvm_can_post_timer_interrupt()
in the expiration path is silly.
And after staring at all of this for a few hours, I think we should ditch the
entire lapic_timer.pending approach, because at this point it's doing more harm
than good. Tracking pending timer IRQs was primarily done to support delivery
of *all* timer interrupts when multiple interrupts were coalesced in the host,
e.g. because a vCPU was scheduled out. But that got removed by commit
f1ed0450a5fa ("KVM: x86: Remove support for reporting coalesced APIC IRQs"),
partly because posted interrupts threw a wrench in things, but also because
delivering multiple periodic interrupts in quick succession is problematic in
its own right.
With the interrupt coalescing logic gone, I can't think of any reason KVM needs
to kick the vCPU out to the main vcpu_run() loop just to deliver the interrupt,
i.e. pending interrupts before delivering them is unnecessary. E.g. conditioning
direct delivery on apicv_active in the !from_timer_fn case makes no sense, because
KVM is going to manually move the interrupt from the PIR to the IRR anyways.
I also don't like that the behavior of the posting path is subtly different from
the !APICv path. E.g. if an hrtimer fires while KVM is handling a write to TMICT,
KVM will deliver the interrupt if configured to post timer interrupts, but not if
APICv is disabled, because the latter will increment "pending", and "pending" will
be cleared before handling the new TMICT. Ditto for switching APIC timer modes.
Unfortunately, there is a lot to disentangle before KVM can directly deliver all
APIC timer interrupts, as KVM has built up a fair bit of ugliness on top of "pending".
E.g. when switching back to the VMX preemption timer (after a blocking vCPU is
scheduled back in), KVM leaves the hrtimer active. start_hv_timer() accounts for
that by checking lapic_timer.pending, but leaving the hrtimer running in this case
is absurd and unnecessarily relies on "pending" being incremented.
if (kvm_x86_call(set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
return false;
ktimer->hv_timer_in_use = true;
hrtimer_cancel(&ktimer->timer);
/*
* To simplify handling the periodic timer, leave the hv timer running
* even if the deadline timer has expired, i.e. rely on the resulting
* VM-Exit to recompute the periodic timer's target expiration.
*/
if (!apic_lvtt_period(apic)) {
/*
* Cancel the hv timer if the sw timer fired while the hv timer
* was being programmed, or if the hv timer itself expired.
*/
if (atomic_read(&ktimer->pending)) {
cancel_hv_timer(apic);
} else if (expired) {
apic_timer_expired(apic, false);
cancel_hv_timer(apic);
}
}
I'm also not convinced that letting the hrtimer run on a different CPU in the
"normal" case is a good idea. KVM explicitly migrates the timer, *except* for
the posting case, and so not pinning the hrtimer is likely counter-productive,
not to mention confusing.
void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
{
struct hrtimer *timer;
if (!lapic_in_kernel(vcpu) ||
kvm_can_post_timer_interrupt(vcpu))
return;
timer = &vcpu->arch.apic->lapic_timer.timer;
if (hrtimer_cancel(timer))
hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD);
}
And if the hrtimer is pinned to the pCPU, then in practice it should be impossible
for KVM to post a timer IRQ into a vCPU that's in guest mode.
Speaking of which, posting a timer IRQ into a vCPU that's in guest mode is a bit
dodgy. I _think_ it's safe, because all of the flows that can be coincident are
actually mutually exclusive. E.g. apic_timer_expired()'s call to
__kvm_wait_lapic_expire() can't collide with the call from the entry path, as the
entry path's invocation happens with IRQs disabled.
But all of this makes me wary, so I'd much prefer to clean up, harden, and document
the existing code before we try to optimize timer IRQ delivery for more cases.
That's especially true for TDX, as we're already adding significant complexity
and multiple novel paths for TDX; we don't need more at this time :-)
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2024-12-09 1:07 ` [PATCH 11/16] KVM: TDX: Always block INIT/SIPI Binbin Wu
@ 2025-01-08 7:21 ` Xiaoyao Li
2025-01-08 7:53 ` Binbin Wu
2025-01-09 2:51 ` Huang, Kai
1 sibling, 1 reply; 56+ messages in thread
From: Xiaoyao Li @ 2025-01-08 7:21 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao, linux-kernel
On 12/9/2024 9:07 AM, Binbin Wu wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Always block INIT and SIPI events for the TDX guest because the TDX module
> doesn't provide API for VMM to inject INIT IPI or SIPI.
>
> TDX defines its own vCPU creation and initialization sequence including
> multiple seamcalls. Also, it's only allowed during TD build time.
>
> Given that TDX guest is para-virtualized to boot BSP/APs, normally there
> shouldn't be any INIT/SIPI event for TDX guest. If any, three options to
> handle them:
> 1. Always block INIT/SIPI request.
> 2. (Silently) ignore INIT/SIPI request during delivery.
> 3. Return error to guest TDs somehow.
>
> Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
> INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, no
> need to add new interface or helper function for INIT/SIPI delivery.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> TDX interrupts breakout:
> - Renamed from "KVM: TDX: Silently ignore INIT/SIPI" to
> "KVM: TDX: Always block INIT/SIPI".
> - Remove KVM_BUG_ON() in tdx_vcpu_reset(). (Rick)
> - Drop tdx_vcpu_reset() and move the comment to vt_vcpu_reset().
> - Remove unnecessary interface and helpers to deliver INIT/SIPI
> because INIT/SIPI events are always blocked for TDX. (Binbin)
> - Update changelog.
> ---
> arch/x86/kvm/lapic.c | 2 +-
> arch/x86/kvm/vmx/main.c | 19 ++++++++++++++++++-
> 2 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 474e0a7c1069..f93c382344ee 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>
> if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
> kvm_vcpu_reset(vcpu, true);
> - if (kvm_vcpu_is_bsp(apic->vcpu))
> + if (kvm_vcpu_is_bsp(vcpu))
> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> else
> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 8ec96646faec..7f933f821188 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -115,6 +115,11 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
>
> static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> + /*
> + * TDX has its own sequence to do init during TD build time (by
> + * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
> + * runtime.
> + */
The first half is confusing. It seems to mix up init(ialization) with
INIT event.
And this callback is about *reset*, which can be due to INIT event or
not. That's why it has a second parameter of init_event. The comment
needs to clarify why reset is not needed for both cases.
I think we can just say TDX doesn't support vCPU reset, whether it is due to an
INIT event or not.
> if (is_td_vcpu(vcpu))
> return;
>
> @@ -211,6 +216,18 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
> }
> #endif
>
> +static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> + /*
> + * INIT and SIPI are always blocked for TDX, i.e., INIT handling and
> + * the OP vcpu_deliver_sipi_vector() won't be called.
> + */
> + if (is_td_vcpu(vcpu))
> + return true;
> +
> + return vmx_apic_init_signal_blocked(vcpu);
> +}
> +
> static void vt_apicv_pre_state_restore(struct kvm_vcpu *vcpu)
> {
> struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
> @@ -565,7 +582,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> #endif
>
> .check_emulate_instruction = vmx_check_emulate_instruction,
> - .apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> + .apic_init_signal_blocked = vt_apic_init_signal_blocked,
> .migrate_timers = vmx_migrate_timers,
>
> .msr_filter_changed = vmx_msr_filter_changed,
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-08 7:21 ` Xiaoyao Li
@ 2025-01-08 7:53 ` Binbin Wu
2025-01-08 14:40 ` Sean Christopherson
0 siblings, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-08 7:53 UTC (permalink / raw)
To: Xiaoyao Li
Cc: pbonzini, seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, tony.lindgren, isaku.yamahata, yan.y.zhao,
chao.gao, linux-kernel
On 1/8/2025 3:21 PM, Xiaoyao Li wrote:
> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Always block INIT and SIPI events for the TDX guest because the TDX module
>> doesn't provide API for VMM to inject INIT IPI or SIPI.
>>
>> TDX defines its own vCPU creation and initialization sequence including
>> multiple seamcalls. Also, it's only allowed during TD build time.
>>
>> Given that TDX guest is para-virtualized to boot BSP/APs, normally there
>> shouldn't be any INIT/SIPI event for TDX guest. If any, three options to
>> handle them:
>> 1. Always block INIT/SIPI request.
>> 2. (Silently) ignore INIT/SIPI request during delivery.
>> 3. Return error to guest TDs somehow.
>>
>> Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
>> INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, no
>> need to add new interface or helper function for INIT/SIPI delivery.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
>> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
>> ---
>> TDX interrupts breakout:
>> - Renamed from "KVM: TDX: Silently ignore INIT/SIPI" to
>> "KVM: TDX: Always block INIT/SIPI".
>> - Remove KVM_BUG_ON() in tdx_vcpu_reset(). (Rick)
>> - Drop tdx_vcpu_reset() and move the comment to vt_vcpu_reset().
>> - Remove unnecessary interface and helpers to deliver INIT/SIPI
>> because INIT/SIPI events are always blocked for TDX. (Binbin)
>> - Update changelog.
>> ---
>> arch/x86/kvm/lapic.c | 2 +-
>> arch/x86/kvm/vmx/main.c | 19 ++++++++++++++++++-
>> 2 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 474e0a7c1069..f93c382344ee 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>> if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
>> kvm_vcpu_reset(vcpu, true);
>> - if (kvm_vcpu_is_bsp(apic->vcpu))
>> + if (kvm_vcpu_is_bsp(vcpu))
>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> else
>> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>> index 8ec96646faec..7f933f821188 100644
>> --- a/arch/x86/kvm/vmx/main.c
>> +++ b/arch/x86/kvm/vmx/main.c
>> @@ -115,6 +115,11 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
>> static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>> {
>> + /*
>> + * TDX has its own sequence to do init during TD build time (by
>> + * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
>> + * runtime.
>> + */
>
> The first half is confusing. It seems to mix up init(ialization) with INIT event.
>
> And this callback is about *reset*, which can be due to INIT event or not. That's why it has a second parameter of init_event. The comment needs to clarify why reset is not needed for both cases.
>
> I think we can just say TDX doesn't support vcpu reset no matter due to INIT event or not.
>
>
Will update it.
Thanks!
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-08 7:53 ` Binbin Wu
@ 2025-01-08 14:40 ` Sean Christopherson
2025-01-09 2:09 ` Xiaoyao Li
2025-01-09 2:26 ` Binbin Wu
0 siblings, 2 replies; 56+ messages in thread
From: Sean Christopherson @ 2025-01-08 14:40 UTC (permalink / raw)
To: Binbin Wu
Cc: Xiaoyao Li, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Wed, Jan 08, 2025, Binbin Wu wrote:
> On 1/8/2025 3:21 PM, Xiaoyao Li wrote:
> > On 12/9/2024 9:07 AM, Binbin Wu wrote:
...
> > > ---
> > > arch/x86/kvm/lapic.c | 2 +-
> > > arch/x86/kvm/vmx/main.c | 19 ++++++++++++++++++-
> > > 2 files changed, 19 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > index 474e0a7c1069..f93c382344ee 100644
> > > --- a/arch/x86/kvm/lapic.c
> > > +++ b/arch/x86/kvm/lapic.c
> > > @@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
> > > if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
> > > kvm_vcpu_reset(vcpu, true);
> > > - if (kvm_vcpu_is_bsp(apic->vcpu))
> > > + if (kvm_vcpu_is_bsp(vcpu))
> > > vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > > else
> > > vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > > index 8ec96646faec..7f933f821188 100644
> > > --- a/arch/x86/kvm/vmx/main.c
> > > +++ b/arch/x86/kvm/vmx/main.c
> > > @@ -115,6 +115,11 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
> > > static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > {
> > > + /*
> > > + * TDX has its own sequence to do init during TD build time (by
> > > + * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
> > > + * runtime.
> > > + */
> >
> > The first half is confusing. It seems to mix up init(ialization) with INIT
> > event.
> >
> > And this callback is about *reset*, which can be due to INIT event or not.
> > That's why it has a second parameter of init_event. The comment needs to
> > clarify why reset is not needed for both cases.
> >
> > I think we can just say TDX doesn't support vcpu reset no matter due to
> > INIT event or not.
That's not entirely accurate either though. TDX does support KVM's version of
RESET, because KVM's RESET is "power-on", i.e. vCPU creation. Emulation of
runtime RESET is userspace's responsibility.
The real reason why KVM doesn't do anything during KVM's RESET is that what
little setup KVM does/can do needs to be deferred until after guest CPUID is
configured.
KVM should also WARN if a TDX vCPU gets INIT, no?
Side topic, the comment about x2APIC in tdx_vcpu_init() is too specific, e.g.
calling out that x2APIC support is enumerated in CPUID.0x1.ECX isn't necessary,
and stating that userspace must use KVM_SET_CPUID2 is flat out wrong. Very
technically, KVM_SET_CPUID is also a valid option, it's just not used in practice
because it doesn't support setting non-zero indices (but in theory it could be
used to enable x2APIC).
E.g. something like this?
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d2e78e6675b9..e36fba94fa14 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -115,13 +115,10 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
- /*
- * TDX has its own sequence to do init during TD build time (by
- * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
- * runtime.
- */
- if (is_td_vcpu(vcpu))
+ if (is_td_vcpu(vcpu)) {
+ tdx_vcpu_reset(vcpu, init_event);
return;
+ }
vmx_vcpu_reset(vcpu, init_event);
}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9e490fccf073..a587f59167a7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2806,9 +2806,8 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
return -EINVAL;
/*
- * As TDX requires X2APIC, set local apic mode to X2APIC. User space
- * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
- * KVM_SET_CPUID2. Otherwise kvm_apic_set_base() will fail.
+ * TDX requires x2APIC, userspace is responsible for configuring guest
+ * CPUID accordingly.
*/
apic_base = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
(kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0);
@@ -2827,6 +2826,19 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
return 0;
}
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ /*
+ * Yell on INIT, as TDX doesn't support INIT, i.e. KVM should drop all
+ * INIT events.
+ *
+ * Defer initializing vCPU for RESET state until KVM_TDX_INIT_VCPU, as
+ * userspace needs to define the vCPU model before KVM can initialize
+ * vCPU state, e.g. to enable x2APIC.
+ */
+ WARN_ON_ONCE(init_event);
+}
+
struct tdx_gmem_post_populate_arg {
struct kvm_vcpu *vcpu;
__u32 flags;
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-08 14:40 ` Sean Christopherson
@ 2025-01-09 2:09 ` Xiaoyao Li
2025-01-09 2:26 ` Binbin Wu
1 sibling, 0 replies; 56+ messages in thread
From: Xiaoyao Li @ 2025-01-09 2:09 UTC (permalink / raw)
To: Sean Christopherson, Binbin Wu
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, tony.lindgren, isaku.yamahata, yan.y.zhao,
chao.gao, linux-kernel
On 1/8/2025 10:40 PM, Sean Christopherson wrote:
> On Wed, Jan 08, 2025, Binbin Wu wrote:
>> On 1/8/2025 3:21 PM, Xiaoyao Li wrote:
>>> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>
> ...
>
>>>> ---
>>>> arch/x86/kvm/lapic.c | 2 +-
>>>> arch/x86/kvm/vmx/main.c | 19 ++++++++++++++++++-
>>>> 2 files changed, 19 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>>> index 474e0a7c1069..f93c382344ee 100644
>>>> --- a/arch/x86/kvm/lapic.c
>>>> +++ b/arch/x86/kvm/lapic.c
>>>> @@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>>>> if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
>>>> kvm_vcpu_reset(vcpu, true);
>>>> - if (kvm_vcpu_is_bsp(apic->vcpu))
>>>> + if (kvm_vcpu_is_bsp(vcpu))
>>>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>>> else
>>>> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>>>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>>>> index 8ec96646faec..7f933f821188 100644
>>>> --- a/arch/x86/kvm/vmx/main.c
>>>> +++ b/arch/x86/kvm/vmx/main.c
>>>> @@ -115,6 +115,11 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
>>>> static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>>> {
>>>> + /*
>>>> + * TDX has its own sequence to do init during TD build time (by
>>>> + * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
>>>> + * runtime.
>>>> + */
>>>
>>> The first half is confusing. It seems to mix up init(ialization) with INIT
>>> event.
>>>
>>> And this callback is about *reset*, which can be due to INIT event or not.
>>> That's why it has a second parameter of init_event. The comment needs to
>>> clarify why reset is not needed for both cases.
>>>
>>> I think we can just say TDX doesn't support vcpu reset no matter due to
>>> INIT event or not.
>
> That's not entirely accurate either though. TDX does support KVM's version of
> RESET, because KVM's RESET is "power-on", i.e. vCPU creation. Emulation of
> runtime RESET is userspace's responsibility.
>
> The real reason why KVM doesn't do anything during KVM's RESET is that what
> little setup KVM does/can do needs to be deferred until after guest CPUID is
> configured.
It's much clear now.
> KVM should also WARN if a TDX vCPU gets INIT, no?
Agreed.
> Side topic, the comment about x2APIC in tdx_vcpu_init() is too specific, e.g.
> calling out that x2APIC support is enumerated in CPUID.0x1.ECX isn't necessary,
> and stating that userspace must use KVM_SET_CPUID2 is flat out wrong. Very
> technically, KVM_SET_CPUID is also a valid option, it's just not used in practice
> because it doesn't support setting non-zero indices (but in theory it could be
> used to enable x2APIC).
>
> E.g. something like this?
LGTM.
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d2e78e6675b9..e36fba94fa14 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -115,13 +115,10 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
>
> static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> - /*
> - * TDX has its own sequence to do init during TD build time (by
> - * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
> - * runtime.
> - */
> - if (is_td_vcpu(vcpu))
> + if (is_td_vcpu(vcpu)) {
> + tdx_vcpu_reset(vcpu, init_event);
> return;
> + }
>
> vmx_vcpu_reset(vcpu, init_event);
> }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9e490fccf073..a587f59167a7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2806,9 +2806,8 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> return -EINVAL;
>
> /*
> - * As TDX requires X2APIC, set local apic mode to X2APIC. User space
> - * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
> - * KVM_SET_CPUID2. Otherwise kvm_apic_set_base() will fail.
> + * TDX requires x2APIC, userspace is responsible for configuring guest
> + * CPUID accordingly.
> */
> apic_base = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
> (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0);
> @@ -2827,6 +2826,19 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> return 0;
> }
>
> +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +{
> + /*
> + * Yell on INIT, as TDX doesn't support INIT, i.e. KVM should drop all
> + * INIT events.
> + *
> + * Defer initializing vCPU for RESET state until KVM_TDX_INIT_VCPU, as
> + * userspace needs to define the vCPU model before KVM can initialize
> + * vCPU state, e.g. to enable x2APIC.
> + */
> + WARN_ON_ONCE(init_event);
> +}
> +
> struct tdx_gmem_post_populate_arg {
> struct kvm_vcpu *vcpu;
> __u32 flags;
>
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-08 14:40 ` Sean Christopherson
2025-01-09 2:09 ` Xiaoyao Li
@ 2025-01-09 2:26 ` Binbin Wu
2025-01-09 2:46 ` Huang, Kai
1 sibling, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-09 2:26 UTC (permalink / raw)
To: Sean Christopherson
Cc: Xiaoyao Li, pbonzini, kvm, rick.p.edgecombe, kai.huang,
adrian.hunter, reinette.chatre, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On 1/8/2025 10:40 PM, Sean Christopherson wrote:
> On Wed, Jan 08, 2025, Binbin Wu wrote:
>> On 1/8/2025 3:21 PM, Xiaoyao Li wrote:
>>> On 12/9/2024 9:07 AM, Binbin Wu wrote:
> ...
>
>>>> ---
>>>> arch/x86/kvm/lapic.c | 2 +-
>>>> arch/x86/kvm/vmx/main.c | 19 ++++++++++++++++++-
>>>> 2 files changed, 19 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>>> index 474e0a7c1069..f93c382344ee 100644
>>>> --- a/arch/x86/kvm/lapic.c
>>>> +++ b/arch/x86/kvm/lapic.c
>>>> @@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>>>> if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
>>>> kvm_vcpu_reset(vcpu, true);
>>>> - if (kvm_vcpu_is_bsp(apic->vcpu))
>>>> + if (kvm_vcpu_is_bsp(vcpu))
>>>> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>>> else
>>>> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>>>> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
>>>> index 8ec96646faec..7f933f821188 100644
>>>> --- a/arch/x86/kvm/vmx/main.c
>>>> +++ b/arch/x86/kvm/vmx/main.c
>>>> @@ -115,6 +115,11 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
>>>> static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>>>> {
>>>> + /*
>>>> + * TDX has its own sequence to do init during TD build time (by
>>>> + * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
>>>> + * runtime.
>>>> + */
>>> The first half is confusing. It seems to mix up init(ialization) with INIT
>>> event.
>>>
>>> And this callback is about *reset*, which can be due to INIT event or not.
>>> That's why it has a second parameter of init_event. The comment needs to
>>> clarify why reset is not needed for both cases.
>>>
>>> I think we can just say TDX doesn't support vcpu reset no matter due to
>>> INIT event or not.
> That's not entirely accurate either though. TDX does support KVM's version of
> RESET, because KVM's RESET is "power-on", i.e. vCPU creation. Emulation of
> runtime RESET is userspace's responsibility.
>
> The real reason why KVM doesn't do anything during KVM's RESET is that what
> little setup KVM does/can do needs to be deferred until after guest CPUID is
> configured.
>
> KVM should also WARN if a TDX vCPU gets INIT, no?
There was a KVM_BUG_ON() for a TDX vCPU getting INIT in v19, and later it was
removed during the cleanup that dropped WARN_ON_ONCE()s and KVM_BUG_ON()s.
Since INIT/SIPI are always blocked for TDX guests, delivery of an INIT event
is a KVM bug, and a WARN_ON_ONCE() is appropriate for this case.
>
> Side topic, the comment about x2APIC in tdx_vcpu_init() is too specific, e.g.
> calling out that x2APIC support is enumerated in CPUID.0x1.ECX isn't necessary,
> and stating that userspace must use KVM_SET_CPUID2 is flat out wrong. Very
> technically, KVM_SET_CPUID is also a valid option, it's just not used in practice
> because it doesn't support setting non-zero indices (but in theory it could be
> used to enable x2APIC).
Indeed, it is too specific.
>
> E.g. something like this?
LGTM.
Thanks for the suggestion!
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d2e78e6675b9..e36fba94fa14 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -115,13 +115,10 @@ static void vt_vcpu_free(struct kvm_vcpu *vcpu)
>
> static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
> - /*
> - * TDX has its own sequence to do init during TD build time (by
> - * KVM_TDX_INIT_VCPU) and it doesn't support INIT event during TD
> - * runtime.
> - */
> - if (is_td_vcpu(vcpu))
> + if (is_td_vcpu(vcpu)) {
> + tdx_vcpu_reset(vcpu, init_event);
> return;
> + }
>
> vmx_vcpu_reset(vcpu, init_event);
> }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9e490fccf073..a587f59167a7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2806,9 +2806,8 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> return -EINVAL;
>
> /*
> - * As TDX requires X2APIC, set local apic mode to X2APIC. User space
> - * VMM, e.g. qemu, is required to set CPUID[0x1].ecx.X2APIC=1 by
> - * KVM_SET_CPUID2. Otherwise kvm_apic_set_base() will fail.
> + * TDX requires x2APIC, userspace is responsible for configuring guest
> + * CPUID accordingly.
> */
> apic_base = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC |
> (kvm_vcpu_is_reset_bsp(vcpu) ? MSR_IA32_APICBASE_BSP : 0);
> @@ -2827,6 +2826,19 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> return 0;
> }
>
> +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +{
> + /*
> + * Yell on INIT, as TDX doesn't support INIT, i.e. KVM should drop all
> + * INIT events.
> + *
> + * Defer initializing vCPU for RESET state until KVM_TDX_INIT_VCPU, as
> + * userspace needs to define the vCPU model before KVM can initialize
> + * vCPU state, e.g. to enable x2APIC.
> + */
> + WARN_ON_ONCE(init_event);
> +}
> +
> struct tdx_gmem_post_populate_arg {
> struct kvm_vcpu *vcpu;
> __u32 flags;
>
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 00/16] KVM: TDX: TDX interrupts
2025-01-06 20:08 ` Sean Christopherson
@ 2025-01-09 2:44 ` Binbin Wu
0 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2025-01-09 2:44 UTC (permalink / raw)
To: Sean Christopherson, Xiaoyao Li
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, tony.lindgren, isaku.yamahata, yan.y.zhao,
chao.gao, linux-kernel
On 1/7/2025 4:08 AM, Sean Christopherson wrote:
> On Mon, Jan 06, 2025, Xiaoyao Li wrote:
>> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>>> Hi,
>>>
>>> This patch series introduces the support of interrupt handling for TDX
>>> guests, including virtual interrupt injection and VM-Exits caused by
>>> vectored events.
>>>
>> (I'm not sure if it is the correct place to raise the discussion on
>> KVM_SET_LAPIC and KVM_SET_LAPIC for TDX. But it seems the most related
>> series)
>>
>> Should KVM reject KVM_GET_LAPIC and KVM_SET_LAPIC for TDX?
> Yes, IIRC that was what Paolo suggested in one of the many PUCK calls. Until
> KVM supports intra-host migration for TDX guests, getting and setting APIC state
> is nonsensical.
>
By rejecting KVM_GET_LAPIC/KVM_SET_LAPIC for TDX guests (i.e., when
guest_apic_protected is set), I think KVM should return an error code
instead of returning 0.
Then it requires modifications in QEMU TDX support code to avoid requesting
KVM_GET_LAPIC/KVM_SET_LAPIC.
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-09 2:26 ` Binbin Wu
@ 2025-01-09 2:46 ` Huang, Kai
2025-01-09 3:20 ` Binbin Wu
0 siblings, 1 reply; 56+ messages in thread
From: Huang, Kai @ 2025-01-09 2:46 UTC (permalink / raw)
To: seanjc@google.com, binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Chatre, Reinette,
Hunter, Adrian, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Thu, 2025-01-09 at 10:26 +0800, Binbin Wu wrote:
> > > > I think we can just say TDX doesn't support vcpu reset no matter due to
> > > > INIT event or not.
> > That's not entirely accurate either though. TDX does support KVM's version of
> > RESET, because KVM's RESET is "power-on", i.e. vCPU creation. Emulation of
> > runtime RESET is userspace's responsibility.
> >
> > The real reason why KVM doesn't do anything during KVM's RESET is that what
> > little setup KVM does/can do needs to be deferred until after guest CPUID is
> > configured.
> >
> > KVM should also WARN if a TDX vCPU gets INIT, no?
>
> There was a KVM_BUG_ON() if a TDX vCPU gets INIT in v19, and later it was
> removed during the cleanup about removing WARN_ON_ONCE() and KVM_BUG_ON().
>
> Since INIT/SIPI are always blocked for TDX guests, a delivery of INIT
> event is a KVM bug and a WARN_ON_ONCE() is appropriate for this case.
Can a TDX guest issue INIT via IPI? Perhaps KVM_BUG_ON() is safer?
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2024-12-09 1:07 ` [PATCH 11/16] KVM: TDX: Always block INIT/SIPI Binbin Wu
2025-01-08 7:21 ` Xiaoyao Li
@ 2025-01-09 2:51 ` Huang, Kai
1 sibling, 0 replies; 56+ messages in thread
From: Huang, Kai @ 2025-01-09 2:51 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Chatre, Reinette,
Zhao, Yan Y, Hunter, Adrian, tony.lindgren@linux.intel.com,
Yamahata, Isaku, linux-kernel@vger.kernel.org
On Mon, 2024-12-09 at 09:07 +0800, Binbin Wu wrote:
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 474e0a7c1069..f93c382344ee 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -3365,7 +3365,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
>
> if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
> kvm_vcpu_reset(vcpu, true);
> - if (kvm_vcpu_is_bsp(apic->vcpu))
> + if (kvm_vcpu_is_bsp(vcpu))
> vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> else
> vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
This part seems unrelated. If it needs to be done, should it be done in
a separate patch?
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-09 2:46 ` Huang, Kai
@ 2025-01-09 3:20 ` Binbin Wu
2025-01-09 4:01 ` Huang, Kai
0 siblings, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-09 3:20 UTC (permalink / raw)
To: Huang, Kai, seanjc@google.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Chatre, Reinette,
Hunter, Adrian, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On 1/9/2025 10:46 AM, Huang, Kai wrote:
> On Thu, 2025-01-09 at 10:26 +0800, Binbin Wu wrote:
>>>>> I think we can just say TDX doesn't support vcpu reset no matter due to
>>>>> INIT event or not.
>>> That's not entirely accurate either though. TDX does support KVM's version of
>>> RESET, because KVM's RESET is "power-on", i.e. vCPU creation. Emulation of
>>> runtime RESET is userspace's responsibility.
>>>
>>> The real reason why KVM doesn't do anything during KVM's RESET is that what
>>> little setup KVM does/can do needs to be deferred until after guest CPUID is
>>> configured.
>>>
>>> KVM should also WARN if a TDX vCPU gets INIT, no?
>> There was a KVM_BUG_ON() if a TDX vCPU gets INIT in v19, and later it was
>> removed during the cleanup about removing WARN_ON_ONCE() and KVM_BUG_ON().
>>
>> Since INIT/SIPI are always blocked for TDX guests, a delivery of INIT
>> event is a KVM bug and a WARN_ON_ONCE() is appropriate for this case.
> Can TDX guest issue INIT via IPI? Perhaps KVM_BUG_ON() is safer?
TDX guests are not expected to issue INIT, but one could in theory.
There seems to be no serious impact if a guest does it, so I'm not sure
whether it needs to kill the VM or not.
Also, in this patch, kvm_apic_init_sipi_allowed() always returns false
for TDX, so vt_vcpu_reset() will not be called with init=true.
Adding a WARN_ON_ONCE() is a guard for KVM's own logic,
not a guard against guest behavior.
* Re: [PATCH 11/16] KVM: TDX: Always block INIT/SIPI
2025-01-09 3:20 ` Binbin Wu
@ 2025-01-09 4:01 ` Huang, Kai
0 siblings, 0 replies; 56+ messages in thread
From: Huang, Kai @ 2025-01-09 4:01 UTC (permalink / raw)
To: seanjc@google.com, binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Chatre, Reinette,
Hunter, Adrian, tony.lindgren@linux.intel.com,
kvm@vger.kernel.org, pbonzini@redhat.com, Yamahata, Isaku,
linux-kernel@vger.kernel.org, Zhao, Yan Y
On Thu, 2025-01-09 at 11:20 +0800, Binbin Wu wrote:
>
>
> On 1/9/2025 10:46 AM, Huang, Kai wrote:
> > On Thu, 2025-01-09 at 10:26 +0800, Binbin Wu wrote:
> > > > > > I think we can just say TDX doesn't support vcpu reset no matter due to
> > > > > > INIT event or not.
> > > > That's not entirely accurate either though. TDX does support KVM's version of
> > > > RESET, because KVM's RESET is "power-on", i.e. vCPU creation. Emulation of
> > > > runtime RESET is userspace's responsibility.
> > > >
> > > > The real reason why KVM doesn't do anything during KVM's RESET is that what
> > > > little setup KVM does/can do needs to be deferred until after guest CPUID is
> > > > configured.
> > > >
> > > > KVM should also WARN if a TDX vCPU gets INIT, no?
> > > There was a KVM_BUG_ON() if a TDX vCPU gets INIT in v19, and later it was
> > > removed during the cleanup about removing WARN_ON_ONCE() and KVM_BUG_ON().
> > >
> > > Since INIT/SIPI are always blocked for TDX guests, a delivery of INIT
> > > event is a KVM bug and a WARN_ON_ONCE() is appropriate for this case.
> > Can TDX guest issue INIT via IPI? Perhaps KVM_BUG_ON() is safer?
> TDX guests are not expected to issue INIT, but it could in theory.
> It seems no serious impact if guest does it, not sure it needs to kill the
> VM or not.
>
> Also, in this patch, for TDX kvm_apic_init_sipi_allowed() is always
> returning false, so vt_vcpu_reset() will not be called with init=true.
> Adding a WARN_ON_ONCE() is the guard for the KVM's logic itself,
> not the guard for guest behavior.
>
Yeah agreed. I replied too early before looking at the patch.
* Re: [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC
2024-12-09 1:07 ` [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC Binbin Wu
@ 2025-01-09 15:38 ` Nikolay Borisov
2025-01-10 5:36 ` Binbin Wu
0 siblings, 1 reply; 56+ messages in thread
From: Nikolay Borisov @ 2025-01-09 15:38 UTC (permalink / raw)
To: Binbin Wu, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel
On 9.12.24 at 3:07, Binbin Wu wrote:
> From: Sean Christopherson <seanjc@google.com>
>
> Add flag and hook to KVM's local APIC management to support determining
> whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
> APIC page is owned by the TDX module and cannot be accessed by KVM. As a
> result, registers that are virtualized by the CPU, e.g. PPR, cannot be
> read or written by KVM. To deliver interrupts for TDX guests, KVM must
> send an IRQ to the CPU on the posted interrupt notification vector. And
> to determine if a TDX vCPU has a pending interrupt, KVM must check if there
> is an outstanding notification.
>
> Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
> protected to short-circuit the various other flows that try to pull an
> IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is
> pending, as KVM can't do anything based on _which_ IRQ is pending.
>
> Intentionally omit sanity checks from other flows, e.g. PPR update, so as
> not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
> and userspace will never reach those flows for TDX guests, but reaching
> them is not fatal if something does go awry.
>
> Note, this doesn't handle interrupts that have been delivered to the vCPU
> but not yet recognized by the core, i.e. interrupts that are sitting in
> vmcs.GUEST_INTR_STATUS. Querying that state requires a SEAMCALL and will
> be supported in a future patch.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> TDX interrupts breakout:
> - Dropped vt_protected_apic_has_interrupt() with KVM_BUG_ON(), wire in
> tdx_protected_apic_has_interrupt() directly. (Rick)
> - Add {} on else in vt_hardware_setup()
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 1 +
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/irq.c | 3 +++
> arch/x86/kvm/lapic.c | 3 +++
> arch/x86/kvm/lapic.h | 2 ++
> arch/x86/kvm/vmx/main.c | 3 +++
> arch/x86/kvm/vmx/tdx.c | 6 ++++++
> arch/x86/kvm/vmx/x86_ops.h | 2 ++
> 8 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index ec1b1b39c6b3..d5faaaee6ac0 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -114,6 +114,7 @@ KVM_X86_OP_OPTIONAL(pi_start_assignment)
> KVM_X86_OP_OPTIONAL(apicv_pre_state_restore)
> KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
> KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
> +KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt)
> KVM_X86_OP_OPTIONAL(set_hv_timer)
> KVM_X86_OP_OPTIONAL(cancel_hv_timer)
> KVM_X86_OP(setup_mce)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 37dc7edef1ca..32c7d58a5d68 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1811,6 +1811,7 @@ struct kvm_x86_ops {
> void (*apicv_pre_state_restore)(struct kvm_vcpu *vcpu);
> void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
> bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
> + bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);
>
> int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> bool *expired);
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 63f66c51975a..f0644d0bbe11 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -100,6 +100,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
> if (kvm_cpu_has_extint(v))
> return 1;
>
> + if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
> + return static_call(kvm_x86_protected_apic_has_interrupt)(v);
> +
> return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
> }
> EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 65412640cfc7..684777c2f0a4 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2920,6 +2920,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
> if (!kvm_apic_present(vcpu))
> return -1;
>
> + if (apic->guest_apic_protected)
> + return -1;
> +
> __apic_update_ppr(apic, &ppr);
> return apic_has_interrupt_for_ppr(apic, ppr);
> }
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 1b8ef9856422..82355faf8c0d 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -65,6 +65,8 @@ struct kvm_lapic {
> bool sw_enabled;
> bool irr_pending;
> bool lvt0_in_nmi_mode;
> + /* Select registers in the vAPIC cannot be read/written. */
> + bool guest_apic_protected;
Can't this member be eliminated and is_td_vcpu() used instead? As it
stands currently, that member is simply a proxy value for "is this a
TDX vcpu"?
<snip>
* Re: [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC
2025-01-09 15:38 ` Nikolay Borisov
@ 2025-01-10 5:36 ` Binbin Wu
0 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2025-01-10 5:36 UTC (permalink / raw)
To: Nikolay Borisov, pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel
On 1/9/2025 11:38 PM, Nikolay Borisov wrote:
>
>
> On 9.12.24 at 3:07, Binbin Wu wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> <snip>
>> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>> index 1b8ef9856422..82355faf8c0d 100644
>> --- a/arch/x86/kvm/lapic.h
>> +++ b/arch/x86/kvm/lapic.h
>> @@ -65,6 +65,8 @@ struct kvm_lapic {
>> bool sw_enabled;
>> bool irr_pending;
>> bool lvt0_in_nmi_mode;
>> + /* Select registers in the vAPIC cannot be read/written. */
>> + bool guest_apic_protected;
>
> Can't this member be eliminated and is_td_vcpu() used instead? As it stands currently, that member is simply a proxy value for "is this a TDX vcpu"?
By using this member, the code in the common path can be more generic,
instead of using is_td_vcpu(). I.e., if other VM types have the same
characteristic in the future, there is no need to modify the common code.
>
> <snip>
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2024-12-09 1:07 ` [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest Binbin Wu
2025-01-03 21:59 ` Vishal Annapurve
@ 2025-01-13 2:03 ` Binbin Wu
2025-01-13 2:09 ` Binbin Wu
1 sibling, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-13 2:03 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel
On 12/9/2024 9:07 AM, Binbin Wu wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> from host VMM.
>
> Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
For TDX guests, APICv is always enabled by the TDX module. But in the
current TDX basic support patch series, TDX code inhibits APICv for TDX
guests from the view of KVM. Synced with Isaku, the reason was to
prevent the APICv active state from toggling during runtime. Sean
raised the concern in a PUCK session that it is not conceptually right
to "lie" to KVM that APICv is disabled while it is actually enabled.
Instead, it's better to make APICv enabled and prevent it from being
disabled from the view of KVM.

Following is the analysis about the APICv active state for TDX, to kick
off further discussions.

APICv active state
==================
From the view of KVM, whether the APICv state is active or not is
decided by:
1. APIC is hw enabled.
2. VM and vCPU have no inhibit reasons set.

APIC hw enabled
---------------
After TDX vCPU init, the APIC is set to x2APIC mode. However, userspace
could disable the APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.

- KVM_SET_LAPIC
  Currently, KVM allows userspace to request KVM_SET_LAPIC to set the
  LAPIC state for TDX guests.
  There are two options:
  - Force x2APIC mode and the default base address when userspace
    requests KVM_SET_LAPIC.
  - Simply reject KVM_SET_LAPIC for TDX guests
    (apic->guest_apic_protected is true), since migration is not
    supported yet.
  Choose option 2 for simplicity for now.

- KVM_SET_{SREGS, SREGS2}
  KVM rejects userspace attempts to set the APIC base when
  vcpu->kvm->arch.has_protected_state and
  vcpu->arch.guest_state_protected are both set. Currently for TDX,
  kvm->arch.has_protected_state is not set, so userspace is allowed to
  modify the APIC base.
  There are three options:
  - Reject KVM_SET_{SREGS, SREGS2} when either
    vcpu->arch.guest_state_protected or
    vcpu->kvm->arch.has_protected_state is set.
  - Check vcpu->arch.guest_state_protected before kvm_apic_set_base()
    in __set_sregs_common().
  - Set has_protected_state for TDX guests.
  Choose option 3, i.e. set has_protected_state for TDX guests,
  aligning with SEV/SNP.

APICv inhibit reasons
---------------------
APICv could be disabled due to a few inhibit reasons.

- APICV_INHIBIT_REASON_DISABLED
  For TDX, this could be triggered when the module parameter
  enable_apicv is set to false. enable_apicv could be checked in
  tdx_bringup(), and TDX support disabled if !enable_apicv. Then
  APICV_INHIBIT_REASON_DISABLED will not be set during runtime and
  apic->apicv_active is initialized to true.

- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED
  KVM will reject userspace attempts to modify the APIC base, i.e., the
  APIC mode will always be x2APIC mode. The only way this could be set
  is if KVM fails to allocate memory for the KVM apic map.

- APICV_INHIBIT_REASON_PIT_REINJ
  Based on the current code, this is relevant only to AMD's AVIC, so
  this reason will not be set for TDX guests. However, KVM is also not
  able to intercept EOI for TDX guests. For TDX, if the in-kernel PIT
  is enabled and in re-inject mode, the use of the PIT in the guest may
  have problems. Fortunately, modern OSes don't use the PIT.
  Options:
  - Enforce irqchip split for TDX guests, i.e. the in-kernel PIT is not
    supported.
  - Leave it as it is and expect the PIT will not be used.

- Reasons that will not be set for TDX:
  - APICV_INHIBIT_REASON_HYPERV
    TDX doesn't support HyperV guests yet.
  - APICV_INHIBIT_REASON_ABSENT
    In-kernel LAPIC is checked in tdx_vcpu_create().
  - APICV_INHIBIT_REASON_BLOCKIRQ
    TDX doesn't support KVM_SET_GUEST_DEBUG.
  - APICV_INHIBIT_REASON_APIC_ID_MODIFIED
    KVM will reject userspace attempts to modify the APIC base, i.e.,
    the APIC mode will always be x2APIC mode.
  - APICV_INHIBIT_REASON_APIC_BASE_MODIFIED
    KVM will reject userspace attempts to set the APIC base.

- Reasons relevant only to AMD's AVIC:
  - APICV_INHIBIT_REASON_NESTED
  - APICV_INHIBIT_REASON_IRQWIN
  - APICV_INHIBIT_REASON_SEV
  - APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED

Summary about APICv inhibit reasons:
APICv could still be disabled at runtime in some corner cases, e.g.,
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation
failure. After checking enable_apicv in tdx_bringup(),
apic->apicv_active is initialized as true in kvm_create_lapic(). If
APICv is inhibited for any reason at runtime, the
refresh_apicv_exec_ctrl() callback could be used to check whether APICv
is disabled for TDX, and if so, bug the VM.

Changes of APICv active from false to true
==========================================

Lazy check for pending APIC EOI with in-kernel IOAPIC
-----------------------------------------------------
The in-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the
processor accelerates writes to the APIC EOI register and does not trap
if the interrupt is edge-triggered. So there is a workaround: lazily
check for a pending APIC EOI at the time a new IOAPIC irq is set, and
update the IOAPIC EOI if there is no pending APIC EOI. KVM is also not
able to intercept EOI for TDX guests.

- When APICv is enabled
  The lazy check for pending APIC EOI doesn't work for TDX because KVM
  can't get the status of the real IRR and ISR, and the values are 0s
  in vIRR and vISR in apic->regs[], so kvm_apic_pending_eoi() will
  always return false. The RTC pending EOI will always be cleared when
  ioapic_set_irq() is called for the RTC, and userspace may then miss
  coalesced RTC interrupts.
- When APICv is disabled
  ioapic_lazy_update_eoi() will not be called, so the pending EOI
  status for the RTC will not be cleared after setting, and this will
  mislead userspace into seeing coalesced RTC interrupts.

Options:
- Force irqchip split for TDX guests to eliminate the use of the
  in-kernel IOAPIC.
- Leave it as it is, but the use of the RTC may not be accurate.

kvm_can_post_timer_interrupt()
------------------------------
Decides whether a housekeeping CPU can deliver a timer interrupt to the
target vCPU via posted interrupt when the nohz_full option is set.
- When APICv active is false, it always returns false.
- When APICv active is true, it also depends on whether mwait or hlt in
  the guest is set.
For TDX guests, hlt will trigger #VE unconditionally and TDX guests
request HLT via TDVMCALL. Whether mwait is allowed depends on the cpuid
configuration in TD_PARAMS. So the current implementations of
kvm_mwait_in_guest() and kvm_hlt_in_guest() don't reflect the real
status for TDX guests.
However, Sean mentioned "consulting kvm_can_post_timer_interrupt() in
the expiration path is silly", and there could be cleanups for this
part:
https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/
So, don't do any TDX-specific logic for it.

apic_timer_expired()
--------------------
About kvm_can_post_timer_interrupt() in the expiration path, see the
description above.
For the rest, when the function is not called from the timer function:
- If apicv_active, the timer interrupt will be injected via
  kvm_apic_inject_pending_timer_irqs().
- If !apicv_active, the timer interrupt will be handled via the
  lapic_timer.pending approach, and finally the timer interrupt is also
  injected via kvm_apic_inject_pending_timer_irqs().
Basically, they are functionally equivalent with subtle differences.
E.g., if an hrtimer fires while KVM is handling a write to TMICT, KVM
will deliver the interrupt if configured to post the timer, but not if
APICv is disabled, because the latter will increment "pending", and
"pending" will be cleared before handling the new TMICT. Ditto for
switching APIC timer modes.
Sean mentioned the entire lapic_timer.pending approach may need to be
ditched, and the timer interrupt could be delivered directly no matter
whether apicv is active or not:
https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/
This is not TDX specific, leave it for now.

Options:
- Fix kvm_mwait_in_guest()/kvm_hlt_in_guest() for TDX guests.
- The VMX preemption timer can't be used by TDX guests anyway; leave
  kvm_mwait_in_guest()/kvm_hlt_in_guest() as they are, and the posted
  timer interrupt could be used when userspace requested to disable
  exits for mwait/hlt.
- The VMX preemption timer can't be used by TDX guests anyway; skip
  checking kvm_mwait_in_guest()/kvm_hlt_in_guest().

kvm_arch_dy_has_pending_interrupt()
-----------------------------------
Before enabling off-TD debug, there is no functional change because
there is no PAUSE exit for TDX guests. After enabling off-TD debug,
kvm_vcpu_apicv_active(vcpu) should be true to get the pending interrupt
from the PID. Setting APICv to active for TDX is the right thing to do.

update_cr8_intercept()
----------------------
Functionally unchanged because the update_cr8_intercept() callback for
TDX is ignored. Setting APICv to active for TDX lets it return earlier
and skip unnecessary code.

kvm_lapic_reset()
kvm_apic_set_state()
--------------------
The callbacks apicv_post_state_restore(), hwapic_irr_update(), and
hwapic_isr_update() will be called for TDX guests when apicv is active.
These callbacks are already ignored by TDX code, so no functional
changes.

Issues
======

PIC interrupts
--------------
KVM injects PIC interrupts via the event injection path. Currently,
TDX code doesn't handle this, thus PIC interrupts will be lost.
Fortunately, modern OSes don't use the PIC. We could use posted
interrupts to deliver PIC interrupts if needed. Or can we assume the
PIC will not be used by TDX guests?

In-kernel PIT in re-inject mode
-------------------------------
See the description for "APICV_INHIBIT_REASON_PIT_REINJ" above.

Lazy check for pending APIC EOI of in-kernel IOAPIC
---------------------------------------------------
See the description for the same item in "Changes of APICv active from
false to true".

Open: For the issues related to the in-kernel PIT and in-kernel IOAPIC,
should KVM force irqchip split for TDX guests to eliminate the use of
the in-kernel PIT and in-kernel IOAPIC?

Proposed code change
====================
Below is the proposed code change to change APICv active from false to
true for TDX guests. Forcing irqchip split for TDX guests is not
included.
Note, by rejecting KVM_GET_LAPIC/KVM_SET_LAPIC for TDX guests (i.e.,
when guest_apic_protected), KVM returns an error code instead of
returning 0. This requires modifications in the QEMU TDX support code
to avoid requesting KVM_GET_LAPIC/KVM_SET_LAPIC.

8<----------------------------------------------------------------------------
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0787855ab006..97025a240d54 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1289,15 +1289,6 @@ enum kvm_apicv_inhibit {
 	 */
 	APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
 
-	/*********************************************************/
-	/* INHIBITs that are relevant only to the Intel's APICv. */
-	/*********************************************************/
-
-	/*
-	 * APICv is disabled because TDX doesn't support it.
-	 */
-	APICV_INHIBIT_REASON_TDX,
-
 	NR_APICV_INHIBIT_REASONS,
 };
 
@@ -1316,8 +1307,7 @@ enum kvm_apicv_inhibit {
 	__APICV_INHIBIT_REASON(IRQWIN),			\
 	__APICV_INHIBIT_REASON(PIT_REINJ),		\
 	__APICV_INHIBIT_REASON(SEV),			\
-	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED),	\
-	__APICV_INHIBIT_REASON(TDX)
+	__APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
 
 struct kvm_arch {
 	unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9b79b4bb063f..df9cc4a7f2d8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -782,8 +782,10 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 
 static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
-	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+	if (is_td_vcpu(vcpu)) {
+		KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm);
 		return;
+	}
 
 	vmx_refresh_apicv_exec_ctrl(vcpu);
 }
@@ -908,8 +910,7 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	 BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |		\
 	 BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) |	\
 	 BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) |	\
-	 BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) |	\
-	 BIT(APICV_INHIBIT_REASON_TDX))
+	 BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
 
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 67fc391fe798..cc516ab2d990 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -614,6 +614,7 @@ int tdx_vm_init(struct kvm *kvm)
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
 	kvm->arch.has_private_mem = true;
+	kvm->arch.has_protected_state = true;
 
 	/*
 	 * Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -2354,8 +2355,6 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
 		goto teardown;
 	}
 
-	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
-
 	return 0;
 
 	/*
@@ -2741,7 +2740,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
 		return -EIO;
 	}
 
-	vcpu->arch.apic->apicv_active = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 
 	return 0;
@@ -3273,6 +3271,11 @@ int __init tdx_bringup(void)
 		goto success_disable_tdx;
 	}
 
+	if (!enable_apicv) {
+		pr_err("APICv is required for TDX\n");
+		goto success_disable_tdx;
+	}
+
 	if (!tdp_mmu_enabled || !enable_mmio_caching) {
 		pr_err("TDP MMU and MMIO caching is required for TDX\n");
 		goto success_disable_tdx;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e433c8ee63a5..837a287d8c47 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5108,6 +5108,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
 				    struct kvm_lapic_state *s)
 {
+	if (vcpu->arch.apic->guest_apic_protected)
+		return -EINVAL;
+
 	kvm_x86_call(sync_pir_to_irr)(vcpu);
 
 	return kvm_apic_get_state(vcpu, s);
@@ -5118,6 +5121,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
 {
 	int r;
 
+	if (vcpu->arch.apic->guest_apic_protected)
+		return -EINVAL;
+
 	r = kvm_apic_set_state(vcpu, s);
 	if (r)
 		return r;
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-13 2:03 ` Binbin Wu
@ 2025-01-13 2:09 ` Binbin Wu
2025-01-13 17:16 ` Sean Christopherson
2025-01-16 11:55 ` Huang, Kai
0 siblings, 2 replies; 56+ messages in thread
From: Binbin Wu @ 2025-01-13 2:09 UTC (permalink / raw)
To: pbonzini, seanjc, kvm
Cc: rick.p.edgecombe, kai.huang, adrian.hunter, reinette.chatre,
xiaoyao.li, tony.lindgren, isaku.yamahata, yan.y.zhao, chao.gao,
linux-kernel
On 1/13/2025 10:03 AM, Binbin Wu wrote:
>
> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
>> from host VMM.
>>
>> Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
>> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>
Resend due to the format mess.
For TDX guests, APICv is always enabled by TDX module. But in current TDX
basic support patch series, TDX code inhibits APICv for TDX guests from
the view of KVM. Synced with Isaku, the reason was to prevent the APICv
active state from toggling during runtime.
Sean raised the concern in a PUCK session that it is not concept right to
"lie" to KVM that APICv is disabled while it is actually enabled. Instead,
it's better to make APICv enabled and prevent it from being disabled from
the view of KVM.
Following is the analysis about the APICv active state for TDX to kick off
further discussions.
APICv active state
==================
From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.
APIC hw enabled
---------------
After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could
disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.
- KVM_SET_LAPIC
Currently, KVM allows userspace to request KVM_SET_LAPIC to set the state
of LAPIC for TDX guests.
There are two options:
- Force x2APIC mode and default base address when userspace request
KVM_SET_LAPIC.
- Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected
is true), since migration is not supported yet.
Choose option 2 for simplicity for now.
- KVM_SET_{SREGS, SREGS2}
KVM rejects userspace to set APIC base when
vcpu->kvm->arch.has_protected_state and vcpu->arch.guest_state_protected
are both set.
Currently for TDX, kvm->arch.has_protected_state is not set, so userspace
is allowed to modify APIC base.
There are three options:
- Reject KVM_SET_{SREGS, SREGS2} when either vcpu->arch.guest_state_protected
or vcpu->kvm->arch.has_protected_state is set.
- Check vcpu->arch.guest_state_protected before kvm_apic_set_base() in
__set_sregs_common().
- Set has_protected_state for TDX guests.
Choose option 3, i.e. to set has_protected_state for TDX guests, aligning
with SEV/SNP.
APICv inhibit reasons
---------------------
APICv could be disabled due to a few inhibit reasons.
- APICV_INHIBIT_REASON_DISABLED
For TDX, this could be triggered when the module parameter enable_apicv is
set to false.
enable_apicv could be checked in tdx_bringup(). Disable TDX support if
!enable_apicv. So that APICV_INHIBIT_REASON_DISABLED will not be set
during runtime and apic->apicv_active is initialized to true.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED
KVM will reject userspace to modify APIC base, i.e., APIC mode will always
be x2APIC mode, the only reason this could be set is it fails to allocate
memory for KVM apic map.
- APICV_INHIBIT_REASON_PIT_REINJ
Based on current code, this is relevant only to AMD's AVIC, so this reason
will not be set for TDX guests. However, KVM is also not be able to
intercept EOI for TDX guests. For TDX, if in-kernel PIT is enabled and in
re-inject mode, the use of PIT in guest may have problem. Fortunately,
modern OSes don't use PIT.
Options:
- Enforce irqchip split for TDX guests, i.e. in-kernel PIT is not supported.
- Leave it as it is and expect PIT will not be used.
- Reasons will not be set for TDX
- APICV_INHIBIT_REASON_HYPERV
TDX doesn't support HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT
In-kernel LAPIC is checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ
TDX doesn't support KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED
KVM will reject userspace to modify APIC base, i.e., APIC mode will always
be x2APIC mode.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED
KVM will reject userspace to set APIC base.
- Reasons relevant only to AMD's AVIC
- APICV_INHIBIT_REASON_NESTED,
- APICV_INHIBIT_REASON_IRQWIN,
- APICV_INHIBIT_REASON_SEV,
- APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
Summary about APICv inhibit reasons:
APICv could still be disabled runtime in some corner case, e.g,
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
After checking enable_apicv in tdx_bringup(), apic->apicv_active is
initialized as true in kvm_create_lapic(). If APICv is inhibited due to any
reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
if APICv is disabled for TDX, if APICv is disabled, bug the VM.
Changes of APICv active from false to true
==========================================
Lazy check for pending APIC EOI when In-kernel IOAPIC
-----------------------------------------------------
In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
accelerates write to APIC EOI register and does not trap if the interrupt
is edge-triggered. So there is a workaround by lazy check for pending APIC
EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
pending APIC EOI.
KVM is also not be able to intercept EOI for TDX guests.
- When APICv is enabled
The code of lazy check for pending APIC EOI doesn't work for TDX because
KVM can't get the status of real IRR and ISR, and the values are 0s in
vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
is called for RTC. Then userspace may miss the coalesced RTC interrupts.
- When When APICv is disabled
ioapic_lazy_update_eoi() will not be called,then pending EOI status for
RTC will not be cleared after setting and this will mislead userspace to
see coalesced RTC interrupts.
Options:
- Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
- Leave it as it is, but the use of RTC may not be accurate.
kvm_can_post_timer_interrupt()
------------------------------
Whether housekeeping CPU can deliver timer interrupt to target vCPU via
posted interrupt when nohz_full option set.
- When APICv active is false, it always return false.
- When APICv active is true, it also depends on whether mwait or hlt in guest
is set.
For TDX guests, hlt will trigger #VE unconditionally and TDX guests request
HLT via TDVMCALL. Whether mwait is allowed depends on the cpuid configuration
in TD_PARAMS.
So current implementation of kvm_mwait_in_guest() and kvm_hlt_in_guest()
doesn't reflect the real status for TDX guests.
However, Sean mentioned "consulting kvm_can_post_timer_interrupt() in the
expiration path is silly". There could be cleanups for this part.
https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/
So, don't do any TDX-specific logic for it.
apic_timer_expired()
--------------------
About kvm_can_post_timer_interrupt() in the expiration path, see the
description above.
For the rest part, when the function is not called from timer function
- If apicv_active, the timer interrupt will be injected via
kvm_apic_inject_pending_timer_irqs().
- If !apicv_active, the timer interrupt will be handled via
lapic_timer.pending approach, and finally, the timer interrupt is also be
injected via kvm_apic_inject_pending_timer_irqs().
Basically, they are functionally equivalent with subtle differences. E.g.,
if an hrtimer fires while KVM is handling a write to TMICT, KVM will deliver
the interrupt if configured to post timer, but not if APICv is disabled,
because the latter will increment "pending", and "pending" will be cleared
before handling the new TMICT. Ditto for switch APIC timer modes.
Sean mentioned the entire lapic_timer.pending approach may need to be ditched,
and the timer interrupt could be directly delivered no matter apicv is active
or not. https://lore.kernel.org/kvm/Z32ZjGH72WPKBMam@google.com/
This is not TDX specific, leave it for now.
Options:
- Fix kvm_mwait_in_guest()/kvm_hlt_in_guest() for TDX guests.
- VMX preemption timer can't be used by TDX guests anyway, leave
kvm_mwait_in_guest()/kvm_hlt_in_guest() as them are, posted timer interrupt
could be used when userspace requested to disable exit for mwait/hlt.
- VMX preemption timer can't be used by TDX guests anyway, skip checking
kvm_mwait_in_guest()/kvm_hlt_in_guest().
kvm_arch_dy_has_pending_interrupt()
-----------------------------------
Before enabling off-TD debug, there is no functional change because there
is no PAUSE Exit for TDX guests.
After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true
to get the pending interrupt from PID. Set APICv to active for TDX is the
right thing to do.
update_cr8_intercept()
----------------------
Functionally unchanged because the callback update_cr8_intercept() for TDX
is ignored. Set APICv to active for TDX can return earlier to skip unnecessary
code.
kvm_lapic_reset()
kvm_apic_set_state()
--------------------
The callbacks apicv_post_state_restore(), hwapic_irr_update(), and
hwapic_isr_update() will be called for TDX guests when apicv is active,
these callbacks have been ignored by TDX code already, no functional changes.
Issues
======
PIC interrupts
--------------
KVM inject PIC interrupt via event injection path.
Currently, TDX code doesn't handle this, thus PIC interrupts will be lost.
Fortunately, modern OSes don't use PIC.
We could use posted-interrupt in to deliver PIC interrupt if needed.
Or can we assume PIC will not be used by TDX guests?
In-kernel PIT in re-inject mode
-------------------------------
See the description for "APICV_INHIBIT_REASON_PIT_REINJ" above.
Lazy check for pending APIC EOI of In-kernel IOAPIC
---------------------------------------------------
See the description for the same item in "Changes of APICv active from false
to true".
Open:
For the issues related to in-kernel PIT and in-kernel IOAPIC, should KVM
force irqchip split for TDX guests to eliminate the use of in-kernel PIT
and in-kernel IOAPIC?
Proposed code change
====================
Below is the proposed code change to change APICv active from false to true
for TDX guests.
Force irqchip split for TEX guests is not included.
Note, by rejecting KVM_GET_LAPIC/KVM_SET_LAPIC for TDX guests (i.e., when
guest_apic_protected), it returns an error code instead of returning 0.
It requires modifications in QEMU TDX support code to avoid requesting
KVM_GET_LAPIC/KVM_SET_LAPIC.
8<----------------------------------------------------------------------------
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0787855ab006..97025a240d54 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1289,15 +1289,6 @@ enum kvm_apicv_inhibit {
*/
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED,
- /*********************************************************/
- /* INHIBITs that are relevant only to the Intel's APICv. */
- /*********************************************************/
-
- /*
- * APICv is disabled because TDX doesn't support it.
- */
- APICV_INHIBIT_REASON_TDX,
-
NR_APICV_INHIBIT_REASONS,
};
@@ -1316,8 +1307,7 @@ enum kvm_apicv_inhibit {
__APICV_INHIBIT_REASON(IRQWIN), \
__APICV_INHIBIT_REASON(PIT_REINJ), \
__APICV_INHIBIT_REASON(SEV), \
- __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED), \
- __APICV_INHIBIT_REASON(TDX)
+ __APICV_INHIBIT_REASON(LOGICAL_ID_ALIASED)
struct kvm_arch {
unsigned long n_used_mmu_pages;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9b79b4bb063f..df9cc4a7f2d8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -782,8 +782,10 @@ static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
{
- if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+ if (is_td_vcpu(vcpu)) {
+ KVM_BUG_ON(!kvm_vcpu_apicv_active(vcpu), vcpu->kvm);
return;
+ }
vmx_refresh_apicv_exec_ctrl(vcpu);
}
@@ -908,8 +910,7 @@ static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
BIT(APICV_INHIBIT_REASON_BLOCKIRQ) | \
BIT(APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED) | \
BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED) | \
- BIT(APICV_INHIBIT_REASON_TDX))
+ BIT(APICV_INHIBIT_REASON_APIC_BASE_MODIFIED))
struct kvm_x86_ops vt_x86_ops __initdata = {
.name = KBUILD_MODNAME,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 67fc391fe798..cc516ab2d990 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -614,6 +614,7 @@ int tdx_vm_init(struct kvm *kvm)
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
kvm->arch.has_private_mem = true;
+ kvm->arch.has_protected_state = true;
/*
* Because guest TD is protected, VMM can't parse the instruction in TD.
@@ -2354,8 +2355,6 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params,
goto teardown;
}
- kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_TDX);
-
return 0;
/*
@@ -2741,7 +2740,6 @@ static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
return -EIO;
}
- vcpu->arch.apic->apicv_active = false;
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
return 0;
@@ -3273,6 +3271,11 @@ int __init tdx_bringup(void)
goto success_disable_tdx;
}
+ if (!enable_apicv) {
+ pr_err("APICv is required for TDX\n");
+ goto success_disable_tdx;
+ }
+
if (!tdp_mmu_enabled || !enable_mmio_caching) {
pr_err("TDP MMU and MMIO caching is required for TDX\n");
goto success_disable_tdx;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e433c8ee63a5..837a287d8c47 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5108,6 +5108,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s)
{
+ if (vcpu->arch.apic->guest_apic_protected)
+ return -EINVAL;
+
kvm_x86_call(sync_pir_to_irr)(vcpu);
return kvm_apic_get_state(vcpu, s);
@@ -5118,6 +5121,9 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
{
int r;
+ if (vcpu->arch.apic->guest_apic_protected)
+ return -EINVAL;
+
r = kvm_apic_set_state(vcpu, s);
if (r)
return r;
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-13 2:09 ` Binbin Wu
@ 2025-01-13 17:16 ` Sean Christopherson
2025-01-14 8:20 ` Binbin Wu
2025-01-16 11:55 ` Huang, Kai
1 sibling, 1 reply; 56+ messages in thread
From: Sean Christopherson @ 2025-01-13 17:16 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Mon, Jan 13, 2025, Binbin Wu wrote:
> On 1/13/2025 10:03 AM, Binbin Wu wrote:
> >
> > On 12/9/2024 9:07 AM, Binbin Wu wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > >
> > > Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
> > > from host VMM.
> > >
> > > Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
> > > it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
> >
> Resend due to the format mess.
That was a very impressive mess :-)
> After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could
> disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.
>
> - KVM_SET_LAPIC
> Currently, KVM allows userspace to request KVM_SET_LAPIC to set the state
> of LAPIC for TDX guests.
> There are two options:
> - Force x2APIC mode and default base address when userspace request
> KVM_SET_LAPIC.
> - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected
> is true), since migration is not supported yet.
> Choose option 2 for simplicity for now.
Yeah. We'll likely need to support KVM_SET_LAPIC at some point, e.g. to support
PID.PIR save/restore, but that's definitely a future problem.
> Summary about APICv inhibit reasons:
> APICv could still be disabled runtime in some corner case, e.g,
> APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
> After checking enable_apicv in tdx_bringup(), apic->apicv_active is
> initialized as true in kvm_create_lapic(). If APICv is inhibited due to any
> reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
> if APICv is disabled for TDX, if APICv is disabled, bug the VM.
I _think_ this is a non-issue, and that KVM could do KVM_BUG_ON() if APICv is
inihibited by kvm_recalculate_apic_map() for a TDX VM. x2APIC is mandatory
(KVM_APIC_MODE_MAP_DISABLED and "APIC_ID modified" impossible), KVM emulates
APIC_ID as read-only for x2APIC mode (physical aliasing impossible), and LDR is
read-only for x2APIC (logical aliasing impossible).
To ensure no physical aliasing, KVM would need to require KVM_CAP_X2APIC_API be
enabled, but that should probably be required for TDX no matter what.
> kvm_arch_dy_has_pending_interrupt()
> -----------------------------------
> Before enabling off-TD debug, there is no functional change because there
> is no PAUSE Exit for TDX guests.
> After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true
> to get the pending interrupt from PID. Set APICv to active for TDX is the
> right thing to do.
And as alluded to above, for save/restore, e.g. intrahost migration.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-13 17:16 ` Sean Christopherson
@ 2025-01-14 8:20 ` Binbin Wu
2025-01-14 16:59 ` Sean Christopherson
0 siblings, 1 reply; 56+ messages in thread
From: Binbin Wu @ 2025-01-14 8:20 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On 1/14/2025 1:16 AM, Sean Christopherson wrote:
> On Mon, Jan 13, 2025, Binbin Wu wrote:
>> On 1/13/2025 10:03 AM, Binbin Wu wrote:
>>> On 12/9/2024 9:07 AM, Binbin Wu wrote:
>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>
>>>> Inhibit APICv for TDX guest in KVM since TDX doesn't support APICv accesses
>>>> from host VMM.
>>>>
>>>> Follow how SEV inhibits APICv. I.e, define a new inhibit reason for TDX, set
>>>> it on TD initialization, and add the flag to kvm_x86_ops.required_apicv_inhibits.
>> Resend due to the format mess.
> That was a very impressive mess :-)
>
>> After TDX vCPU init, APIC is set to x2APIC mode. However, userspace could
>> disable APIC via KVM_SET_LAPIC or KVM_SET_{SREGS, SREGS2}.
>>
>> - KVM_SET_LAPIC
>> Currently, KVM allows userspace to request KVM_SET_LAPIC to set the state
>> of LAPIC for TDX guests.
>> There are two options:
>> - Force x2APIC mode and default base address when userspace request
>> KVM_SET_LAPIC.
>> - Simply reject KVM_SET_LAPIC for TDX guest (apic->guest_apic_protected
>> is true), since migration is not supported yet.
>> Choose option 2 for simplicity for now.
> Yeah. We'll likely need to support KVM_SET_LAPIC at some point, e.g. to support
> PID.PIR save/restore, but that's definitely a future problem.
>
>> Summary about APICv inhibit reasons:
>> APICv could still be disabled runtime in some corner case, e.g,
>> APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
>> After checking enable_apicv in tdx_bringup(), apic->apicv_active is
>> initialized as true in kvm_create_lapic(). If APICv is inhibited due to any
>> reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
>> if APICv is disabled for TDX, if APICv is disabled, bug the VM.
> I _think_ this is a non-issue, and that KVM could do KVM_BUG_ON() if APICv is
> inihibited by kvm_recalculate_apic_map() for a TDX VM. x2APIC is mandatory
> (KVM_APIC_MODE_MAP_DISABLED and "APIC_ID modified" impossible), KVM emulates
> APIC_ID as read-only for x2APIC mode (physical aliasing impossible), and LDR is
> read-only for x2APIC (logical aliasing impossible).
For logical aliasing, according to the KVM code, it's only relevant to
AMD's AVIC. It's not set in VMX_REQUIRED_APICV_INHIBITS.
Is the reason AVIC using logical-id-addressing while APICv using
physical-id-addressing for IPI virtualization?
>
> To ensure no physical aliasing, KVM would need to require KVM_CAP_X2APIC_API be
> enabled, but that should probably be required for TDX no matter what.
There is no physical aliasing when APIC is in x2apic mode, vcpu_id is used
anyway. Also, KVM is going to reject KVM_SET_LAPIC/KVM_GET_LAPIC from
userspace for TDX guests, functionally, it doesn't matter whether
KVM_CAP_X2APIC_API is enabled or not.
But for future proof, we could enforce KVM_CAP_X2APIC_API being enabled.
>
>> kvm_arch_dy_has_pending_interrupt()
>> -----------------------------------
>> Before enabling off-TD debug, there is no functional change because there
>> is no PAUSE Exit for TDX guests.
>> After enabling off-TD debug, the kvm_vcpu_apicv_active(vcpu) should be true
>> to get the pending interrupt from PID. Set APICv to active for TDX is the
>> right thing to do.
> And as alluded to above, for save/restore, e.g. intrahost migration.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-14 8:20 ` Binbin Wu
@ 2025-01-14 16:59 ` Sean Christopherson
0 siblings, 0 replies; 56+ messages in thread
From: Sean Christopherson @ 2025-01-14 16:59 UTC (permalink / raw)
To: Binbin Wu
Cc: pbonzini, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Tue, Jan 14, 2025, Binbin Wu wrote:
> On 1/14/2025 1:16 AM, Sean Christopherson wrote:
> > On Mon, Jan 13, 2025, Binbin Wu wrote:
> > > Summary about APICv inhibit reasons:
> > > APICv could still be disabled runtime in some corner case, e.g,
> > > APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED due to memory allocation failure.
> > > After checking enable_apicv in tdx_bringup(), apic->apicv_active is
> > > initialized as true in kvm_create_lapic(). If APICv is inhibited due to any
> > > reason runtime, the refresh_apicv_exec_ctrl() callback could be used to check
> > > if APICv is disabled for TDX, if APICv is disabled, bug the VM.
> > I _think_ this is a non-issue, and that KVM could do KVM_BUG_ON() if APICv is
> > inihibited by kvm_recalculate_apic_map() for a TDX VM. x2APIC is mandatory
> > (KVM_APIC_MODE_MAP_DISABLED and "APIC_ID modified" impossible), KVM emulates
> > APIC_ID as read-only for x2APIC mode (physical aliasing impossible), and LDR is
> > read-only for x2APIC (logical aliasing impossible).
>
> For logical aliasing, according to the KVM code, it's only relevant to
> AMD's AVIC. It's not set in VMX_REQUIRED_APICV_INHIBITS.
Ah, right.
> Is the reason AVIC using logical-id-addressing while APICv using
> physical-id-addressing for IPI virtualization?
Ya, more or less. AVIC supports virtualizing both physical and logical IPIs,
APICv only supports physical.
> > To ensure no physical aliasing, KVM would need to require KVM_CAP_X2APIC_API be
> > enabled, but that should probably be required for TDX no matter what.
> There is no physical aliasing when APIC is in x2apic mode, vcpu_id is used
> anyway.
Yeah, ignore this, I misremembered the effects of KVM_CAP_X2APIC_API.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-13 2:09 ` Binbin Wu
2025-01-13 17:16 ` Sean Christopherson
@ 2025-01-16 11:55 ` Huang, Kai
2025-01-16 14:50 ` Sean Christopherson
1 sibling, 1 reply; 56+ messages in thread
From: Huang, Kai @ 2025-01-16 11:55 UTC (permalink / raw)
To: kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
binbin.wu@linux.intel.com
Cc: Gao, Chao, Edgecombe, Rick P, Li, Xiaoyao, Chatre, Reinette,
Zhao, Yan Y, Hunter, Adrian, tony.lindgren@linux.intel.com,
Yamahata, Isaku, linux-kernel@vger.kernel.org
On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> Lazy check for pending APIC EOI when In-kernel IOAPIC
> -----------------------------------------------------
> In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> accelerates write to APIC EOI register and does not trap if the interrupt
> is edge-triggered. So there is a workaround by lazy check for pending APIC
> EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> pending APIC EOI.
> KVM is also not be able to intercept EOI for TDX guests.
> - When APICv is enabled
> The code of lazy check for pending APIC EOI doesn't work for TDX because
> KVM can't get the status of real IRR and ISR, and the values are 0s in
> vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
> false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
> is called for RTC. Then userspace may miss the coalesced RTC interrupts.
> - When When APICv is disabled
> ioapic_lazy_update_eoi() will not be called,then pending EOI status for
> RTC will not be cleared after setting and this will mislead userspace to
> see coalesced RTC interrupts.
> Options:
> - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> - Leave it as it is, but the use of RTC may not be accurate.
Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
upon EOI for level-triggered interrupt. I don't know how does KVM works with
userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
irqchip split for TDX guest" seems not right.
I think the problem is level-triggered interrupt, so I think another option is
to reject level-triggered interrupt for TDX guest.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-16 11:55 ` Huang, Kai
@ 2025-01-16 14:50 ` Sean Christopherson
2025-01-16 20:16 ` Huang, Kai
2025-01-17 0:49 ` Binbin Wu
0 siblings, 2 replies; 56+ messages in thread
From: Sean Christopherson @ 2025-01-16 14:50 UTC (permalink / raw)
To: Kai Huang
Cc: kvm@vger.kernel.org, pbonzini@redhat.com,
binbin.wu@linux.intel.com, Chao Gao, Rick P Edgecombe, Xiaoyao Li,
Reinette Chatre, Yan Y Zhao, Adrian Hunter,
tony.lindgren@linux.intel.com, Isaku Yamahata,
linux-kernel@vger.kernel.org
On Thu, Jan 16, 2025, Kai Huang wrote:
> On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> > Lazy check for pending APIC EOI when In-kernel IOAPIC
> > -----------------------------------------------------
> > In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> > accelerates write to APIC EOI register and does not trap if the interrupt
> > is edge-triggered. So there is a workaround by lazy check for pending APIC
> > EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> > pending APIC EOI.
> > KVM is also not be able to intercept EOI for TDX guests.
> > - When APICv is enabled
> > The code of lazy check for pending APIC EOI doesn't work for TDX because
> > KVM can't get the status of real IRR and ISR, and the values are 0s in
> > vIRR and vISR in apic->regs[], kvm_apic_pending_eoi() will always return
> > false. So the RTC pending EOI will always be cleared when ioapic_set_irq()
> > is called for RTC. Then userspace may miss the coalesced RTC interrupts.
> > - When When APICv is disabled
> > ioapic_lazy_update_eoi() will not be called,then pending EOI status for
> > RTC will not be cleared after setting and this will mislead userspace to
> > see coalesced RTC interrupts.
> > Options:
> > - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> > - Leave it as it is, but the use of RTC may not be accurate.
>
> Looking at the code, it seems KVM only traps EOI for level-triggered interrupt
> for in-kernel IOAPIC chip, but IIUC IOAPIC in userspace also needs to be told
> upon EOI for level-triggered interrupt. I don't know how does KVM works with
> userspace IOAPIC w/o trapping EOI for level-triggered interrupt, but "force
> irqchip split for TDX guest" seems not right.
Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
I/O APIC and the "split" model is the way to concoct such a setup. With a "full"
IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
nonsensical on TDX because it's fully virtual world, i.e. there's no reason to
emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
emulated (see below).
> I think the problem is level-triggered interrupt,
Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
i.e. all EOIs are accelerated and never trigger exits.
> so I think another option is to reject level-triggered interrupt for TDX guest.
This is a "don't do that, it will hurt" situation. With a sane VMM, the level-ness
of GSIs is controlled by the guest. For GSIs that are routed through the I/O APIC,
the level-ness is determined by the corresponding Redirection Table entry. For
"GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
wire up MSIs), and for direct MSI injection (KVM_SIGNAL_MSI), the level-ness is
dictated by the MSI itself, which again is guest controlled.
If the guest induces generation of a level-triggered interrupt, the VMM is left
with the choice of dropping the interrupt, sending it as-is, or converting it to
an edge-triggered interrupt. Ditto for KVM. All of those options will make the
guest unhappy.
So while it _might_ make debugging broken guests easier, I don't think it's worth
the complexity to try and prevent the VMM/guest from sending level-triggered
GSI-routed interrupts. It'd be a bit of a whack-a-mole and there's no architectural
behavior KVM can provide that's better than sending the interrupt and hoping for
the best.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-16 14:50 ` Sean Christopherson
@ 2025-01-16 20:16 ` Huang, Kai
2025-01-16 22:37 ` Sean Christopherson
2025-01-17 0:49 ` Binbin Wu
1 sibling, 1 reply; 56+ messages in thread
From: Huang, Kai @ 2025-01-16 20:16 UTC (permalink / raw)
To: seanjc@google.com
Cc: Gao, Chao, Edgecombe, Rick P, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, Li, Xiaoyao, Chatre, Reinette,
Zhao, Yan Y, Hunter, Adrian, kvm@vger.kernel.org,
pbonzini@redhat.com, tony.lindgren@linux.intel.com,
Yamahata, Isaku
On Thu, 2025-01-16 at 06:50 -0800, Sean Christopherson wrote:
> On Thu, Jan 16, 2025, Kai Huang wrote:
> > On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
> > > Lazy check for pending APIC EOI when In-kernel IOAPIC
> > > -----------------------------------------------------
> > > In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
> > > accelerates write to APIC EOI register and does not trap if the interrupt
> > > is edge-triggered. So there is a workaround by lazy check for pending APIC
> > > EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
> > > pending APIC EOI.
> > > KVM is also not able to intercept EOI for TDX guests.
> > > - When APICv is enabled
> > > The lazy check for a pending APIC EOI doesn't work for TDX because
> > > KVM can't get the status of the real IRR and ISR, and the values in
> > > vIRR and vISR in apic->regs[] are 0s, so kvm_apic_pending_eoi() will
> > > always return false. The RTC pending EOI will therefore always be
> > > cleared when ioapic_set_irq() is called for the RTC, and userspace may
> > > miss coalesced RTC interrupts.
> > > - When APICv is disabled
> > > ioapic_lazy_update_eoi() will not be called, so the pending EOI status
> > > for the RTC will not be cleared after setting, misleading userspace
> > > into seeing coalesced RTC interrupts.
> > > Options:
> > > - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
> > > - Leave it as it is, but the use of RTC may not be accurate.
> >
> > Looking at the code, it seems KVM only traps EOI for level-triggered interrupts
> > for the in-kernel IOAPIC chip, but IIUC an IOAPIC in userspace also needs to be
> > told about EOIs for level-triggered interrupts. I don't know how KVM works with
> > a userspace IOAPIC w/o trapping EOI for level-triggered interrupts, but "force
> > irqchip split for TDX guest" seems not right.
>
> Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> I/O APIC and the "split" model is the way to concoct such a setup. With a "full"
> IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> nonsensical on TDX because it's a fully virtual world, i.e. there's no reason to
> emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> emulated (see below).
Disabling the in-kernel IOAPIC/PIC for TDX guests is fine to me, but I think
that, "conceptually", having the IOAPIC/PIC in userspace doesn't mean disabling
the IOAPIC, because theoretically the userspace IOAPIC still needs to be told
about the EOI for emulation. I just haven't figured out how a userspace IOAPIC
works with KVM in the "split IRQCHIP" case w/o trapping EOI for level-triggered
interrupts. :-)
If the point is to disable in-kernel IOAPIC/PIC for TDX guests, then I think
both KVM_IRQCHIP_NONE and KVM_IRQCHIP_SPLIT should be allowed for TDX, but not
just KVM_IRQCHIP_SPLIT?
>
> > I think the problem is level-triggered interrupt,
>
> Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> i.e. all EOIs are accelerated and never trigger exits.
>
> > so I think another option is to reject level-triggered interrupt for TDX guest.
>
> This is a "don't do that, it will hurt" situation. With a sane VMM, the level-ness
> of GSIs is controlled by the guest. For GSIs that are routed through the I/O APIC,
> the level-ness is determined by the corresponding Redirection Table entry. For
> "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> wire up MSIs), and for direct MSI injection (KVM_SIGNAL_MSI), the level-ness is
> dictated by the MSI itself, which again is guest controlled.
>
> If the guest induces generation of a level-triggered interrupt, the VMM is left
> with the choice of dropping the interrupt, sending it as-is, or converting it to
> an edge-triggered interrupt. Ditto for KVM. All of those options will make the
> guest unhappy.
>
> So while it _might_ make debugging broken guests easier, I don't think it's worth
> the complexity to try and prevent the VMM/guest from sending level-triggered
> GSI-routed interrupts.
>
Could KVM at least print some error message?
> It'd be a bit of a whack-a-mole and there's no architectural
> behavior KVM can provide that's better than sending the interrupt and hoping for
> the best.
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-16 20:16 ` Huang, Kai
@ 2025-01-16 22:37 ` Sean Christopherson
2025-01-17 9:53 ` Huang, Kai
0 siblings, 1 reply; 56+ messages in thread
From: Sean Christopherson @ 2025-01-16 22:37 UTC (permalink / raw)
To: Kai Huang
Cc: Chao Gao, Rick P Edgecombe, linux-kernel@vger.kernel.org,
binbin.wu@linux.intel.com, Xiaoyao Li, Reinette Chatre,
Yan Y Zhao, Adrian Hunter, kvm@vger.kernel.org,
pbonzini@redhat.com, tony.lindgren@linux.intel.com,
Isaku Yamahata
On Thu, Jan 16, 2025, Kai Huang wrote:
> On Thu, 2025-01-16 at 06:50 -0800, Sean Christopherson wrote:
> > On Thu, Jan 16, 2025, Kai Huang wrote:
...
> > > Looking at the code, it seems KVM only traps EOI for level-triggered interrupts
> > > for the in-kernel IOAPIC chip, but IIUC an IOAPIC in userspace also needs to be
> > > told about EOIs for level-triggered interrupts. I don't know how KVM works with
> > > a userspace IOAPIC w/o trapping EOI for level-triggered interrupts, but "force
> > > irqchip split for TDX guest" seems not right.
> >
> > Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> > I/O APIC and the "split" model is the way to concoct such a setup. With a "full"
> > IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> > nonsensical on TDX because it's a fully virtual world, i.e. there's no reason to
> > emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> > Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> > level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> > emulated (see below).
>
> Disabling the in-kernel IOAPIC/PIC for TDX guests is fine to me, but I think
> that, "conceptually", having the IOAPIC/PIC in userspace doesn't mean disabling
> the IOAPIC, because theoretically the userspace IOAPIC still needs to be told
> about the EOI for emulation. I just haven't figured out how a userspace IOAPIC
> works with KVM in the "split IRQCHIP" case w/o trapping EOI for level-triggered
> interrupts. :-)
Userspace I/O APIC _does_ intercept EOI. KVM scans the GSI routes provided by
userspace and intercepts those that are configured to be delivered as level-
triggered interrupts. Whereas with an in-kernel I/O APIC, KVM scans the GSI
routes *and* the I/O APIC Redirection Table (for interrupts that are routed
through the I/O APIC).
> If the point is to disable in-kernel IOAPIC/PIC for TDX guests, then I think
> both KVM_IRQCHIP_NONE and KVM_IRQCHIP_SPLIT should be allowed for TDX, but not
> just KVM_IRQCHIP_SPLIT?
No, because APICv is mandatory for TDX, which rules out KVM_IRQCHIP_NONE.
>
> >
> > > I think the problem is level-triggered interrupt,
> >
> > Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> > i.e. all EOIs are accelerated and never trigger exits.
> >
> > > so I think another option is to reject level-triggered interrupt for TDX guest.
> >
> > This is a "don't do that, it will hurt" situation. With a sane VMM, the level-ness
> > of GSIs is controlled by the guest. For GSIs that are routed through the I/O APIC,
> > the level-ness is determined by the corresponding Redirection Table entry. For
> > "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> > wire up MSIs), and for direct MSI injection (KVM_SIGNAL_MSI), the level-ness is
> > dictated by the MSI itself, which again is guest controlled.
> >
> > If the guest induces generation of a level-triggered interrupt, the VMM is left
> > with the choice of dropping the interrupt, sending it as-is, or converting it to
> > an edge-triggered interrupt. Ditto for KVM. All of those options will make the
> > guest unhappy.
> >
> > So while it _might_ make debugging broken guests easier, I don't think it's worth
> > the complexity to try and prevent the VMM/guest from sending level-triggered
> > GSI-routed interrupts.
> >
>
> Could KVM at least print some error message?
No. A guest can shoot itself any number of ways, and userspace has every
opportunity to log weirdness in this case.
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-16 14:50 ` Sean Christopherson
2025-01-16 20:16 ` Huang, Kai
@ 2025-01-17 0:49 ` Binbin Wu
1 sibling, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2025-01-17 0:49 UTC (permalink / raw)
To: Sean Christopherson, Kai Huang
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, Chao Gao,
Rick P Edgecombe, Xiaoyao Li, Reinette Chatre, Yan Y Zhao,
Adrian Hunter, tony.lindgren@linux.intel.com, Isaku Yamahata,
linux-kernel@vger.kernel.org
On 1/16/2025 10:50 PM, Sean Christopherson wrote:
> On Thu, Jan 16, 2025, Kai Huang wrote:
>> On Mon, 2025-01-13 at 10:09 +0800, Binbin Wu wrote:
>>> Lazy check for pending APIC EOI when In-kernel IOAPIC
>>> -----------------------------------------------------
>>> In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor
>>> accelerates write to APIC EOI register and does not trap if the interrupt
>>> is edge-triggered. So there is a workaround by lazy check for pending APIC
>>> EOI at the time when setting new IOAPIC irq, and update IOAPIC EOI if no
>>> pending APIC EOI.
>>> KVM is also not able to intercept EOI for TDX guests.
>>> - When APICv is enabled
>>> The lazy check for a pending APIC EOI doesn't work for TDX because
>>> KVM can't get the status of the real IRR and ISR, and the values in
>>> vIRR and vISR in apic->regs[] are 0s, so kvm_apic_pending_eoi() will
>>> always return false. The RTC pending EOI will therefore always be
>>> cleared when ioapic_set_irq() is called for the RTC, and userspace may
>>> miss coalesced RTC interrupts.
>>> - When APICv is disabled
>>> ioapic_lazy_update_eoi() will not be called, so the pending EOI status
>>> for the RTC will not be cleared after setting, misleading userspace
>>> into seeing coalesced RTC interrupts.
>>> Options:
>>> - Force irqchip split for TDX guests to eliminate the use of in-kernel IOAPIC.
>>> - Leave it as it is, but the use of RTC may not be accurate.
>> Looking at the code, it seems KVM only traps EOI for level-triggered interrupts
>> for the in-kernel IOAPIC chip, but IIUC an IOAPIC in userspace also needs to be
>> told about EOIs for level-triggered interrupts. I don't know how KVM works with
>> a userspace IOAPIC w/o trapping EOI for level-triggered interrupts, but "force
>> irqchip split for TDX guest" seems not right.
> Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> I/O APIC and the "split" model is the way to concoct such a setup. With a "full"
> IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> nonsensical on TDX because it's a fully virtual world, i.e. there's no reason to
> emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> emulated (see below).
>
>> I think the problem is level-triggered interrupt,
> Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> i.e. all EOIs are accelerated and never trigger exits.
Yes, and I think a description of this and of level-triggered interrupts
needs to be added to the commit message of some patch.
>
>> so I think another option is to reject level-triggered interrupt for TDX guest.
> This is a "don't do that, it will hurt" situation. With a sane VMM, the level-ness
> of GSIs is controlled by the guest. For GSIs that are routed through the I/O APIC,
> the level-ness is determined by the corresponding Redirection Table entry. For
> "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> wire up MSIs), and for direct MSI injection (KVM_SIGNAL_MSI), the level-ness is
> dictated by the MSI itself, which again is guest controlled.
>
> If the guest induces generation of a level-triggered interrupt, the VMM is left
> with the choice of dropping the interrupt, sending it as-is, or converting it to
> an edge-triggered interrupt. Ditto for KVM. All of those options will make the
> guest unhappy.
>
> So while it _might_ make debugging broken guests easier, I don't think it's worth
> the complexity to try and prevent the VMM/guest from sending level-triggered
> GSI-routed interrupts. It'd be a bit of a whack-a-mole and there's no architectural
> behavior KVM can provide that's better than sending the interrupt and hoping for
> the best.
Currently, KVM doesn't do anything special if the guest sends level-triggered
interrupts for TDX guests.
QEMU has a patch to set eoi_intercept_unsupported to true for TDX guests.
https://lore.kernel.org/kvm/20241105062408.3533704-41-xiaoyao.li@intel.com/
And it seems the level_trigger_unsupported info will be passed to the guest
via an ACPI table. I didn't dig deep into it, but I suppose that with this
information, guests will not send level-triggered GSI interrupts?
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-16 22:37 ` Sean Christopherson
@ 2025-01-17 9:53 ` Huang, Kai
2025-01-17 10:46 ` Huang, Kai
0 siblings, 1 reply; 56+ messages in thread
From: Huang, Kai @ 2025-01-17 9:53 UTC (permalink / raw)
To: seanjc@google.com
Cc: Gao, Chao, Edgecombe, Rick P, binbin.wu@linux.intel.com,
Li, Xiaoyao, linux-kernel@vger.kernel.org, Hunter, Adrian,
Chatre, Reinette, kvm@vger.kernel.org, Zhao, Yan Y,
tony.lindgren@linux.intel.com, pbonzini@redhat.com,
Yamahata, Isaku
On Thu, 2025-01-16 at 14:37 -0800, Sean Christopherson wrote:
> On Thu, Jan 16, 2025, Kai Huang wrote:
> > On Thu, 2025-01-16 at 06:50 -0800, Sean Christopherson wrote:
> > > On Thu, Jan 16, 2025, Kai Huang wrote:
>
> ...
>
> > > > Looking at the code, it seems KVM only traps EOI for level-triggered interrupts
> > > > for the in-kernel IOAPIC chip, but IIUC an IOAPIC in userspace also needs to be
> > > > told about EOIs for level-triggered interrupts. I don't know how KVM works with
> > > > a userspace IOAPIC w/o trapping EOI for level-triggered interrupts, but "force
> > > > irqchip split for TDX guest" seems not right.
> > >
> > > Forcing a "split" IRQ chip is correct, in the sense that TDX doesn't support an
> > > I/O APIC and the "split" model is the way to concoct such a setup. With a "full"
> > > IRQ chip, KVM is responsible for emulating the I/O APIC, which is more or less
> > > nonsensical on TDX because it's a fully virtual world, i.e. there's no reason to
> > > emulate legacy devices that only know how to talk to the I/O APIC (or PIC, etc.).
> > > Disallowing an in-kernel I/O APIC is ideal from KVM's perspective, because
> > > level-triggered interrupts and thus the I/O APIC as a whole can't be faithfully
> > > emulated (see below).
> >
> > Disabling the in-kernel IOAPIC/PIC for TDX guests is fine to me, but I think
> > that, "conceptually", having the IOAPIC/PIC in userspace doesn't mean disabling
> > the IOAPIC, because theoretically the userspace IOAPIC still needs to be told
> > about the EOI for emulation. I just haven't figured out how a userspace IOAPIC
> > works with KVM in the "split IRQCHIP" case w/o trapping EOI for level-triggered
> > interrupts. :-)
>
> Userspace I/O APIC _does_ intercept EOI. KVM scans the GSI routes provided by
> userspace and intercepts those that are configured to be delivered as level-
> triggered interrupts.
>
Yeah, I see it now (I believe you mean kvm_scan_ioapic_routes()). Thanks!
> Whereas with an in-kernel I/O APIC, KVM scans the GSI
> routes *and* the I/O APIC Redirection Table (for interrupts that are routed
> through the I/O APIC).
Right.
But neither of them works with TDX, because TDX doesn't support EOI exits.
So in the sense that we don't want KVM to support an in-kernel IOAPIC for TDX,
I agree we can force a split IRQCHIP. But my point is that this doesn't seem
to resolve the problem. :-)
Btw, IIUC, in the case of a split IRQCHIP, KVM uses KVM_IRQ_ROUTING_MSI for
the routes of GSIs. But it seems KVM only allows level-triggered MSIs to be
signaled (which is surprising):
int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
		struct kvm *kvm, int irq_source_id, int level, bool line_status)
{
	struct kvm_lapic_irq irq;

	if (kvm_msi_route_invalid(kvm, e))
		return -EINVAL;

	if (!level)
		return -1;

	kvm_set_msi_irq(kvm, e, &irq);

	return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
}
>
> > If the point is to disable in-kernel IOAPIC/PIC for TDX guests, then I think
> > both KVM_IRQCHIP_NONE and KVM_IRQCHIP_SPLIT should be allowed for TDX, but not
> > just KVM_IRQCHIP_SPLIT?
>
> No, because APICv is mandatory for TDX, which rules out KVM_IRQCHIP_NONE.
Yeah I missed this obvious thing.
>
> >
> > >
> > > > I think the problem is level-triggered interrupt,
> > >
> > > Yes, because the TDX Module doesn't allow the hypervisor to modify the EOI-bitmap,
> > > i.e. all EOIs are accelerated and never trigger exits.
> > >
> > > > so I think another option is to reject level-triggered interrupt for TDX guest.
> > >
> > > This is a "don't do that, it will hurt" situation. With a sane VMM, the level-ness
> > > of GSIs is controlled by the guest. For GSIs that are routed through the I/O APIC,
> > > the level-ness is determined by the corresponding Redirection Table entry. For
> > > "GSIs" that are actually MSIs (KVM piggybacks legacy GSI routing to let userspace
> > > wire up MSIs), and for direct MSI injection (KVM_SIGNAL_MSI), the level-ness is
> > > dictated by the MSI itself, which again is guest controlled.
> > >
> > > If the guest induces generation of a level-triggered interrupt, the VMM is left
> > > with the choice of dropping the interrupt, sending it as-is, or converting it to
> > > an edge-triggered interrupt. Ditto for KVM. All of those options will make the
> > > guest unhappy.
> > >
> > > So while it _might_ make debugging broken guests easier, I don't think it's worth
> > > the complexity to try and prevent the VMM/guest from sending level-triggered
> > > GSI-routed interrupts.
> > >
> >
> > Could KVM at least print some error message?
>
> No. A guest can shoot itself any number of ways, and userspace has every
> opportunity to log weirdness in this case.
Agreed.
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-17 9:53 ` Huang, Kai
@ 2025-01-17 10:46 ` Huang, Kai
2025-01-17 15:08 ` Sean Christopherson
0 siblings, 1 reply; 56+ messages in thread
From: Huang, Kai @ 2025-01-17 10:46 UTC (permalink / raw)
To: seanjc@google.com
Cc: Gao, Chao, Edgecombe, Rick P, binbin.wu@linux.intel.com,
Li, Xiaoyao, linux-kernel@vger.kernel.org, Hunter, Adrian,
Chatre, Reinette, kvm@vger.kernel.org, Zhao, Yan Y,
tony.lindgren@linux.intel.com, pbonzini@redhat.com,
Yamahata, Isaku
On Fri, 2025-01-17 at 09:53 +0000, Huang, Kai wrote:
> Btw, IIUC, in the case of a split IRQCHIP, KVM uses KVM_IRQ_ROUTING_MSI for
> the routes of GSIs. But it seems KVM only allows level-triggered MSIs to be
> signaled (which is surprising):
>
> int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> 		struct kvm *kvm, int irq_source_id, int level, bool line_status)
> {
> 	struct kvm_lapic_irq irq;
> 
> 	if (kvm_msi_route_invalid(kvm, e))
> 		return -EINVAL;
> 
> 	if (!level)
> 		return -1;
> 
> 	kvm_set_msi_irq(kvm, e, &irq);
> 
> 	return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
> }
Ah, sorry, this 'level' is not trig_mode. Please ignore. :-)
* Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest
2025-01-17 10:46 ` Huang, Kai
@ 2025-01-17 15:08 ` Sean Christopherson
0 siblings, 0 replies; 56+ messages in thread
From: Sean Christopherson @ 2025-01-17 15:08 UTC (permalink / raw)
To: Kai Huang
Cc: Chao Gao, Rick P Edgecombe, binbin.wu@linux.intel.com, Xiaoyao Li,
linux-kernel@vger.kernel.org, Adrian Hunter, Reinette Chatre,
kvm@vger.kernel.org, Yan Y Zhao, tony.lindgren@linux.intel.com,
pbonzini@redhat.com, Isaku Yamahata
On Fri, Jan 17, 2025, Kai Huang wrote:
> On Fri, 2025-01-17 at 09:53 +0000, Huang, Kai wrote:
> > Btw, IIUC, in case of IRQCHIP split, KVM uses KVM_IRQ_ROUTING_MSI for routes of
> > GSIs. But it seems KVM only allows level-triggered MSI to be signaled (which is
> > a surprising):
> >
> > int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
> > struct kvm *kvm, int irq_source_id, int level, bool line_status)
> > {
> > struct kvm_lapic_irq irq;
> >
> > if (kvm_msi_route_invalid(kvm, e))
> > return -EINVAL;
> >
> > if (!level)
> > return -1;
> >
> > kvm_set_msi_irq(kvm, e, &irq);
> >
> > return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
> > }
>
> Ah sorry this 'level' is not trig_mode. Please ignore :-)
Yeah :-( I have misread the use of "level" so, so many times in KVM's IRQ code.
* Re: [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation
2024-12-09 1:07 ` [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation Binbin Wu
2025-01-03 22:04 ` Vishal Annapurve
@ 2025-01-22 11:34 ` Paolo Bonzini
2025-01-22 13:59 ` Binbin Wu
1 sibling, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2025-01-22 11:34 UTC (permalink / raw)
To: Binbin Wu
Cc: seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On Mon, Dec 9, 2024 at 2:06 AM Binbin Wu <binbin.wu@linux.intel.com> wrote:
> - .hwapic_irr_update = vmx_hwapic_irr_update,
> - .hwapic_isr_update = vmx_hwapic_isr_update,
> + .hwapic_irr_update = vt_hwapic_irr_update,
> + .hwapic_isr_update = vt_hwapic_isr_update,
Just a note, hwapic_irr_update is gone in 6.14 and thus in kvm-coco-queue.
Paolo
* Re: [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation
2025-01-22 11:34 ` Paolo Bonzini
@ 2025-01-22 13:59 ` Binbin Wu
0 siblings, 0 replies; 56+ messages in thread
From: Binbin Wu @ 2025-01-22 13:59 UTC (permalink / raw)
To: Paolo Bonzini
Cc: seanjc, kvm, rick.p.edgecombe, kai.huang, adrian.hunter,
reinette.chatre, xiaoyao.li, tony.lindgren, isaku.yamahata,
yan.y.zhao, chao.gao, linux-kernel
On 1/22/2025 7:34 PM, Paolo Bonzini wrote:
> On Mon, Dec 9, 2024 at 2:06 AM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>> - .hwapic_irr_update = vmx_hwapic_irr_update,
>> - .hwapic_isr_update = vmx_hwapic_isr_update,
>> + .hwapic_irr_update = vt_hwapic_irr_update,
>> + .hwapic_isr_update = vt_hwapic_isr_update,
> Just a note, hwapic_irr_update is gone in 6.14 and thus in kvm-coco-queue.
>
> Paolo
>
Thanks for the info.
end of thread, other threads:[~2025-01-22 13:59 UTC | newest]
Thread overview: 56+ messages
-- links below jump to the message on this page --
2024-12-09 1:07 [PATCH 00/16] KVM: TDX: TDX interrupts Binbin Wu
2024-12-09 1:07 ` [PATCH 01/16] KVM: TDX: Add support for find pending IRQ in a protected local APIC Binbin Wu
2025-01-09 15:38 ` Nikolay Borisov
2025-01-10 5:36 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 02/16] KVM: VMX: Remove use of struct vcpu_vmx from posted_intr.c Binbin Wu
2024-12-09 1:07 ` [PATCH 03/16] KVM: TDX: Disable PI wakeup for IPIv Binbin Wu
2024-12-09 1:07 ` [PATCH 04/16] KVM: VMX: Move posted interrupt delivery code to common header Binbin Wu
2024-12-09 1:07 ` [PATCH 05/16] KVM: TDX: Implement non-NMI interrupt injection Binbin Wu
2024-12-09 1:07 ` [PATCH 06/16] KVM: x86: Assume timer IRQ was injected if APIC state is protected Binbin Wu
2024-12-09 1:07 ` [PATCH 07/16] KVM: TDX: Wait lapic expire when timer IRQ was injected Binbin Wu
2024-12-09 1:07 ` [PATCH 08/16] KVM: TDX: Implement methods to inject NMI Binbin Wu
2024-12-09 1:07 ` [PATCH 09/16] KVM: TDX: Complete interrupts after TD exit Binbin Wu
2024-12-09 1:07 ` [PATCH 10/16] KVM: TDX: Handle SMI request as !CONFIG_KVM_SMM Binbin Wu
2024-12-09 1:07 ` [PATCH 11/16] KVM: TDX: Always block INIT/SIPI Binbin Wu
2025-01-08 7:21 ` Xiaoyao Li
2025-01-08 7:53 ` Binbin Wu
2025-01-08 14:40 ` Sean Christopherson
2025-01-09 2:09 ` Xiaoyao Li
2025-01-09 2:26 ` Binbin Wu
2025-01-09 2:46 ` Huang, Kai
2025-01-09 3:20 ` Binbin Wu
2025-01-09 4:01 ` Huang, Kai
2025-01-09 2:51 ` Huang, Kai
2024-12-09 1:07 ` [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest Binbin Wu
2025-01-03 21:59 ` Vishal Annapurve
2025-01-06 1:46 ` Binbin Wu
2025-01-06 22:49 ` Vishal Annapurve
2025-01-06 23:40 ` Sean Christopherson
2025-01-07 3:24 ` Chao Gao
2025-01-07 8:09 ` Binbin Wu
2025-01-07 21:15 ` Sean Christopherson
2025-01-13 2:03 ` Binbin Wu
2025-01-13 2:09 ` Binbin Wu
2025-01-13 17:16 ` Sean Christopherson
2025-01-14 8:20 ` Binbin Wu
2025-01-14 16:59 ` Sean Christopherson
2025-01-16 11:55 ` Huang, Kai
2025-01-16 14:50 ` Sean Christopherson
2025-01-16 20:16 ` Huang, Kai
2025-01-16 22:37 ` Sean Christopherson
2025-01-17 9:53 ` Huang, Kai
2025-01-17 10:46 ` Huang, Kai
2025-01-17 15:08 ` Sean Christopherson
2025-01-17 0:49 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 13/16] KVM: TDX: Add methods to ignore virtual apic related operation Binbin Wu
2025-01-03 22:04 ` Vishal Annapurve
2025-01-06 2:18 ` Binbin Wu
2025-01-22 11:34 ` Paolo Bonzini
2025-01-22 13:59 ` Binbin Wu
2024-12-09 1:07 ` [PATCH 14/16] KVM: VMX: Move NMI/exception handler to common helper Binbin Wu
2024-12-09 1:07 ` [PATCH 15/16] KVM: TDX: Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT Binbin Wu
2024-12-09 1:07 ` [PATCH 16/16] KVM: TDX: Handle EXIT_REASON_OTHER_SMI Binbin Wu
2024-12-10 18:24 ` [PATCH 00/16] KVM: TDX: TDX interrupts Paolo Bonzini
2025-01-06 10:51 ` Xiaoyao Li
2025-01-06 20:08 ` Sean Christopherson
2025-01-09 2:44 ` Binbin Wu