[PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion
@ 2025-01-11  1:24 Sean Christopherson
  2025-01-11  1:24 ` [PATCH 1/5] KVM: x86: Document that KVM_EXIT_HYPERCALL requires completion Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:24 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman,
	Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

Add a flag to kvm_run, KVM_RUN_NEEDS_COMPLETION, that is set by KVM to
inform userspace that KVM_RUN needs to be re-invoked prior to state
save/restore.  The current approach of relying on KVM developers to
update documentation, and on VMM developers to read said documentation,
is brittle and error prone.

Note, this series is *very* lightly tested (borderline RFC).

Sean Christopherson (5):
  KVM: x86: Document that KVM_EXIT_HYPERCALL requires completion
  KVM: Clear vcpu->run->flags at start of KVM_RUN for all architectures
  KVM: Add a common kvm_run flag to communicate an exit needs completion
  KVM: selftests: Provide separate helper for KVM_RUN with
    immediate_exit
  KVM: selftests: Rely on KVM_RUN_NEEDS_COMPLETION to complete userspace
    exits

 Documentation/virt/kvm/api.rst                | 77 ++++++++++++-------
 arch/arm64/kvm/arm.c                          |  1 -
 arch/arm64/kvm/handle_exit.c                  |  2 +-
 arch/powerpc/kvm/book3s_emulate.c             |  1 +
 arch/powerpc/kvm/book3s_hv.c                  |  5 +-
 arch/powerpc/kvm/book3s_pr.c                  |  2 +
 arch/powerpc/kvm/booke.c                      |  1 +
 arch/x86/include/uapi/asm/kvm.h               |  7 +-
 arch/x86/kvm/x86.c                            |  3 +-
 include/uapi/linux/kvm.h                      |  3 +
 .../testing/selftests/kvm/include/kvm_util.h  | 13 +++-
 tools/testing/selftests/kvm/lib/kvm_util.c    |  6 +-
 .../testing/selftests/kvm/lib/ucall_common.c  |  2 +-
 .../testing/selftests/kvm/lib/x86/processor.c |  8 +-
 .../kvm/x86/nested_exceptions_test.c          |  3 +-
 .../kvm/x86/triple_fault_event_test.c         |  3 +-
 virt/kvm/kvm_main.c                           |  4 +
 17 files changed, 90 insertions(+), 51 deletions(-)


base-commit: 10b2c8a67c4b8ec15f9d07d177f63b563418e948
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] KVM: x86: Document that KVM_EXIT_HYPERCALL requires completion
  2025-01-11  1:24 [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion Sean Christopherson
@ 2025-01-11  1:24 ` Sean Christopherson
  2025-01-11  1:24 ` [PATCH 2/5] KVM: Clear vcpu->run->flags at start of KVM_RUN for all architectures Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:24 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman,
	Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

Update KVM's documentation to call out that KVM_EXIT_HYPERCALL requires
userspace to do KVM_RUN before state save/restore, so that KVM can skip
the hypercall instruction, otherwise resuming the vCPU after restore will
restart the instruction and potentially lead to a spurious MAP_GPA_RANGE.

Fixes: 0dbb11230437 ("KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst | 39 +++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 454c2aaa155e..c92c8d4e8779 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6615,13 +6615,29 @@ The 'data' member contains, in its first 'len' bytes, the value as it would
 appear if the VCPU performed a load or store of the appropriate width directly
 to the byte array.
 
+It is strongly recommended that userspace use ``KVM_EXIT_IO`` (x86) or
+``KVM_EXIT_MMIO`` (all except s390) to implement functionality that
+requires a guest to interact with host userspace.
+
+.. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
+
+		/* KVM_EXIT_HYPERCALL */
+		struct {
+			__u64 nr;
+			__u64 args[6];
+			__u64 ret;
+			__u64 flags;
+		} hypercall;
+
+
 .. note::
 
       For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
-      KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
-      operations are complete (and guest state is consistent) only after userspace
-      has re-entered the kernel with KVM_RUN.  The kernel side will first finish
-      incomplete operations and then check for pending signals.
+      KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR, KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL
+      the corresponding operations are complete (and guest state is consistent)
+      only after userspace has re-entered the kernel with KVM_RUN.  The kernel
+      side will first finish incomplete operations and then check for pending
+      signals.
 
       The pending state of the operation is not preserved in state which is
       visible to userspace, thus userspace should ensure that the operation is
@@ -6632,21 +6648,6 @@ to the byte array.
 
 ::
 
-		/* KVM_EXIT_HYPERCALL */
-		struct {
-			__u64 nr;
-			__u64 args[6];
-			__u64 ret;
-			__u64 flags;
-		} hypercall;
-
-
-It is strongly recommended that userspace use ``KVM_EXIT_IO`` (x86) or
-``KVM_EXIT_MMIO`` (all except s390) to implement functionality that
-requires a guest to interact with host userspace.
-
-.. note:: KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
-
 For arm64:
 ----------
 
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] KVM: Clear vcpu->run->flags at start of KVM_RUN for all architectures
  2025-01-11  1:24 [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion Sean Christopherson
  2025-01-11  1:24 ` [PATCH 1/5] KVM: x86: Document that KVM_EXIT_HYPERCALL requires completion Sean Christopherson
@ 2025-01-11  1:24 ` Sean Christopherson
  2025-01-11  1:24 ` [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion Sean Christopherson
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:24 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman,
	Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

Clear kvm_run.flags at the start of KVM_RUN for all architectures to
minimize the probability of leaving a stale flag set.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/arm64/kvm/arm.c         | 1 -
 arch/arm64/kvm/handle_exit.c | 2 +-
 arch/powerpc/kvm/book3s_hv.c | 4 +---
 arch/x86/kvm/x86.c           | 1 -
 virt/kvm/kvm_main.c          | 3 +++
 5 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a102c3aebdbc..925fa010bb7b 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1128,7 +1128,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 	ret = 1;
 	run->exit_reason = KVM_EXIT_UNKNOWN;
-	run->flags = 0;
 	while (ret > 0) {
 		/*
 		 * Check conditions before entering the guest
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index d7c2990e7c9e..63692c254a07 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -186,7 +186,7 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu)
 	run->exit_reason = KVM_EXIT_DEBUG;
 	run->debug.arch.hsr = lower_32_bits(esr);
 	run->debug.arch.hsr_high = upper_32_bits(esr);
-	run->flags = KVM_DEBUG_ARCH_HSR_HIGH_VALID;
+	run->flags |= KVM_DEBUG_ARCH_HSR_HIGH_VALID;
 
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_WATCHPT_LOW:
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 25429905ae90..b253f7372774 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1704,9 +1704,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 		/* Exit to guest with KVM_EXIT_NMI as exit reason */
 		run->exit_reason = KVM_EXIT_NMI;
 		run->hw.hardware_exit_reason = vcpu->arch.trap;
-		/* Clear out the old NMI status from run->flags */
-		run->flags &= ~KVM_RUN_PPC_NMI_DISP_MASK;
-		/* Now set the NMI status */
+		/* Note, run->flags is cleared at the start of KVM_RUN. */
 		if (vcpu->arch.mce_evt.disposition == MCE_DISPOSITION_RECOVERED)
 			run->flags |= KVM_RUN_PPC_NMI_DISP_FULLY_RECOV;
 		else
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b04092ec76a..a8aa12e0911d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11465,7 +11465,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
-	kvm_run->flags = 0;
 	kvm_load_guest_fpu(vcpu);
 
 	kvm_vcpu_srcu_read_lock(vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index de2c11dae231..7d2076439081 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4336,6 +4336,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
 
 			put_pid(oldpid);
 		}
+
+		vcpu->run->flags = 0;
+
 		vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit__unsafe);
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		vcpu->wants_to_run = false;
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-11  1:24 [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion Sean Christopherson
  2025-01-11  1:24 ` [PATCH 1/5] KVM: x86: Document that KVM_EXIT_HYPERCALL requires completion Sean Christopherson
  2025-01-11  1:24 ` [PATCH 2/5] KVM: Clear vcpu->run->flags at start of KVM_RUN for all architectures Sean Christopherson
@ 2025-01-11  1:24 ` Sean Christopherson
  2025-01-11 11:01   ` Marc Zyngier
  2025-01-13  2:09   ` Chao Gao
  2025-01-11  1:24 ` [PATCH 4/5] KVM: selftests: Provide separate helper for KVM_RUN with immediate_exit Sean Christopherson
  2025-01-11  1:24 ` [PATCH 5/5] KVM: selftests: Rely on KVM_RUN_NEEDS_COMPLETION to complete userspace exits Sean Christopherson
  4 siblings, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:24 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman,
	Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
that KVM_RUN needs to be re-executed prior to save/restore in order to
complete the instruction/operation that triggered the userspace exit.

KVM's current approach of adding notes in the Documentation is beyond
brittle, e.g. there is at least one known case where a KVM developer added
a new userspace exit type, and then that same developer forgot to handle
completion when adding userspace support.

On x86, there are multiple exits that need completion, but they are all
conveniently funneled through a single callback, i.e. in theory, this is a
one-time thing for KVM x86 (and other architectures could follow suit with
additional refactoring).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst    | 48 ++++++++++++++++++++++---------
 arch/powerpc/kvm/book3s_emulate.c |  1 +
 arch/powerpc/kvm/book3s_hv.c      |  1 +
 arch/powerpc/kvm/book3s_pr.c      |  2 ++
 arch/powerpc/kvm/booke.c          |  1 +
 arch/x86/include/uapi/asm/kvm.h   |  7 +++--
 arch/x86/kvm/x86.c                |  2 ++
 include/uapi/linux/kvm.h          |  3 ++
 virt/kvm/kvm_main.c               |  1 +
 9 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c92c8d4e8779..8e172675d8d6 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6505,7 +6505,7 @@ local APIC is not used.
 
 	__u16 flags;
 
-More architecture-specific flags detailing state of the VCPU that may
+Common and architecture-specific flags detailing state of the VCPU that may
 affect the device's behavior. Current defined flags::
 
   /* x86, set if the VCPU is in system management mode */
@@ -6518,6 +6518,8 @@ affect the device's behavior. Current defined flags::
   /* arm64, set for KVM_EXIT_DEBUG */
   #define KVM_DEBUG_ARCH_HSR_HIGH_VALID  (1 << 0)
 
+  /* all architectures, set when the exit needs completion (via KVM_RUN) */
+  #define KVM_RUN_NEEDS_COMPLETION  (1 << 15)
 ::
 
 	/* in (pre_kvm_run), out (post_kvm_run) */
@@ -6632,19 +6634,10 @@ requires a guest to interact with host userspace.
 
 .. note::
 
-      For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
-      KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR, KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL
-      the corresponding operations are complete (and guest state is consistent)
-      only after userspace has re-entered the kernel with KVM_RUN.  The kernel
-      side will first finish incomplete operations and then check for pending
-      signals.
+      For some exits, userspace must re-enter the kernel with KVM_RUN to
+      complete the exit and ensure guest state is consistent.
 
-      The pending state of the operation is not preserved in state which is
-      visible to userspace, thus userspace should ensure that the operation is
-      completed before performing a live migration.  Userspace can re-enter the
-      guest with an unmasked signal pending or with the immediate_exit field set
-      to complete pending operations without allowing any further instructions
-      to be executed.
+      See KVM_CAP_NEEDS_COMPLETION for details.
 
 ::
 
@@ -8239,7 +8232,7 @@ Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the
 core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.
 
 7.36 KVM_CAP_X86_GUEST_MODE
-------------------------------
+---------------------------
 
 :Architectures: x86
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
@@ -8252,6 +8245,33 @@ KVM exits with the register state of either the L1 or L2 guest
 depending on which executed at the time of an exit. Userspace must
 take care to differentiate between these cases.
 
+7.37 KVM_CAP_NEEDS_COMPLETION
+-----------------------------
+
+:Architectures: all
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will set
+KVM_RUN_NEEDS_COMPLETION in kvm_run.flags if KVM requires userspace to re-enter
+the kernel KVM_RUN to complete the exit.
+
+For select exits, userspace must re-enter the kernel with KVM_RUN to complete
+the corresponding operation, only after which is guest state guaranteed to be
+consistent.  On such a KVM_RUN, the kernel side will first finish incomplete
+operations and then check for pending signals.
+
+The pending state of the operation for such exits is not preserved in state
+which is visible to userspace, thus userspace should ensure that the operation
+is completed before performing state save/restore, e.g. for live migration.
+Userspace can re-enter the guest with an unmasked signal pending or with the
+immediate_exit field set to complete pending operations without allowing any
+further instructions to be executed.
+
+Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
+and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
+KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
+KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c
index de126d153328..15014a66c167 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -374,6 +374,7 @@ int kvmppc_core_emulate_op_pr(struct kvm_vcpu *vcpu,
 			}
 
 			vcpu->run->exit_reason = KVM_EXIT_PAPR_HCALL;
+			vcpu->run->flags |= KVM_RUN_NEEDS_COMPLETION;
 			vcpu->arch.hcall_needed = 1;
 			emulated = EMULATE_EXIT_USER;
 			break;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b253f7372774..05ad0c3494f1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1767,6 +1767,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 		for (i = 0; i < 9; ++i)
 			run->papr_hcall.args[i] = kvmppc_get_gpr(vcpu, 4 + i);
 		run->exit_reason = KVM_EXIT_PAPR_HCALL;
+		run->flags |= KVM_RUN_NEEDS_COMPLETION;
 		vcpu->arch.hcall_needed = 1;
 		r = RESUME_HOST;
 		break;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 83bcdc80ce51..c45beb64905a 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1310,6 +1310,7 @@ int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr)
 				run->papr_hcall.args[i] = gpr;
 			}
 			run->exit_reason = KVM_EXIT_PAPR_HCALL;
+			run->flags |= KVM_RUN_NEEDS_COMPLETION;
 			vcpu->arch.hcall_needed = 1;
 			r = RESUME_HOST;
 		} else if (vcpu->arch.osi_enabled &&
@@ -1320,6 +1321,7 @@ int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr)
 			int i;
 
 			run->exit_reason = KVM_EXIT_OSI;
+			run->flags |= KVM_RUN_NEEDS_COMPLETION;
 			for (i = 0; i < 32; i++)
 				gprs[i] = kvmppc_get_gpr(vcpu, i);
 			vcpu->arch.osi_needed = 1;
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 6a5be025a8af..3a0e00178fa5 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -751,6 +751,7 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu)
 		vcpu->run->epr.epr = 0;
 		vcpu->arch.epr_needed = true;
 		vcpu->run->exit_reason = KVM_EXIT_EPR;
+		vcpu->run->flags |= KVM_RUN_NEEDS_COMPLETION;
 		r = 0;
 	}
 
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 88585c1de416..e2ec32a8970c 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -104,9 +104,10 @@ struct kvm_ioapic_state {
 #define KVM_IRQCHIP_IOAPIC       2
 #define KVM_NR_IRQCHIPS          3
 
-#define KVM_RUN_X86_SMM		 (1 << 0)
-#define KVM_RUN_X86_BUS_LOCK     (1 << 1)
-#define KVM_RUN_X86_GUEST_MODE   (1 << 2)
+#define KVM_RUN_X86_SMM			(1 << 0)
+#define KVM_RUN_X86_BUS_LOCK		(1 << 1)
+#define KVM_RUN_X86_GUEST_MODE		(1 << 2)
+#define KVM_RUN_X86_NEEDS_COMPLETION	(1 << 2)
 
 /* for KVM_GET_REGS and KVM_SET_REGS */
 struct kvm_regs {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a8aa12e0911d..67034b089371 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10154,6 +10154,8 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu)
 		kvm_run->flags |= KVM_RUN_X86_SMM;
 	if (is_guest_mode(vcpu))
 		kvm_run->flags |= KVM_RUN_X86_GUEST_MODE;
+	if (vcpu->arch.complete_userspace_io)
+		kvm_run->flags |= KVM_RUN_NEEDS_COMPLETION;
 }
 
 static void update_cr8_intercept(struct kvm_vcpu *vcpu)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 343de0a51797..91dbee3ae0bc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -192,6 +192,8 @@ struct kvm_xen_exit {
 /* Flags that describe what fields in emulation_failure hold valid data. */
 #define KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES (1ULL << 0)
 
+#define KVM_RUN_NEEDS_COMPLETION	(1 << 15)
+
 /*
  * struct kvm_run can be modified by userspace at any time, so KVM must be
  * careful to avoid TOCTOU bugs. In order to protect KVM, HINT_UNSAFE_IN_KVM()
@@ -933,6 +935,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_PRE_FAULT_MEMORY 236
 #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
 #define KVM_CAP_X86_GUEST_MODE 238
+#define KVM_CAP_NEEDS_COMPLETION 239
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7d2076439081..28aa89e5ba85 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4746,6 +4746,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_CHECK_EXTENSION_VM:
 	case KVM_CAP_ENABLE_CAP_VM:
 	case KVM_CAP_HALT_POLL:
+	case KVM_CAP_NEEDS_COMPLETION:
 		return 1;
 #ifdef CONFIG_KVM_MMIO
 	case KVM_CAP_COALESCED_MMIO:
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] KVM: selftests: Provide separate helper for KVM_RUN with immediate_exit
  2025-01-11  1:24 [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-01-11  1:24 ` [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion Sean Christopherson
@ 2025-01-11  1:24 ` Sean Christopherson
  2025-01-11  1:24 ` [PATCH 5/5] KVM: selftests: Rely on KVM_RUN_NEEDS_COMPLETION to complete userspace exits Sean Christopherson
  4 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:24 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman,
	Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

Turn vcpu_run_complete_io() into a wrapper for a dedicated helper for
doing KVM_RUN with immediate_exit = true, so that a future patch can do
userspace exit completion if and only if it's actually necessary,
whereas x86's nested exceptions test wants to unconditionally do KVM_RUN
with an immediate exit.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h           | 9 ++++++++-
 tools/testing/selftests/kvm/lib/kvm_util.c               | 2 +-
 tools/testing/selftests/kvm/x86/nested_exceptions_test.c | 3 +--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 4c4e5a847f67..78fd597c1b60 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -631,7 +631,14 @@ static inline int __vcpu_run(struct kvm_vcpu *vcpu)
 	return __vcpu_ioctl(vcpu, KVM_RUN, NULL);
 }
 
-void vcpu_run_complete_io(struct kvm_vcpu *vcpu);
+
+void vcpu_run_immediate_exit(struct kvm_vcpu *vcpu);
+
+static inline void vcpu_run_complete_io(struct kvm_vcpu *vcpu)
+{
+	vcpu_run_immediate_exit(vcpu);
+}
+
 struct kvm_reg_list *vcpu_get_reg_list(struct kvm_vcpu *vcpu);
 
 static inline void vcpu_enable_cap(struct kvm_vcpu *vcpu, uint32_t cap,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 33fefeb3ca44..c9a33766f673 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1665,7 +1665,7 @@ void vcpu_run(struct kvm_vcpu *vcpu)
 	TEST_ASSERT(!ret, KVM_IOCTL_ERROR(KVM_RUN, ret));
 }
 
-void vcpu_run_complete_io(struct kvm_vcpu *vcpu)
+void vcpu_run_immediate_exit(struct kvm_vcpu *vcpu)
 {
 	int ret;
 
diff --git a/tools/testing/selftests/kvm/x86/nested_exceptions_test.c b/tools/testing/selftests/kvm/x86/nested_exceptions_test.c
index 3eb0313ffa39..4f144576a6da 100644
--- a/tools/testing/selftests/kvm/x86/nested_exceptions_test.c
+++ b/tools/testing/selftests/kvm/x86/nested_exceptions_test.c
@@ -238,8 +238,7 @@ int main(int argc, char *argv[])
 
 	/* Pend #SS and request immediate exit.  #SS should still be pending. */
 	queue_ss_exception(vcpu, false);
-	vcpu->run->immediate_exit = true;
-	vcpu_run_complete_io(vcpu);
+	vcpu_run_immediate_exit(vcpu);
 
 	/* Verify the pending events comes back out the same as it went in. */
 	vcpu_events_get(vcpu, &events);
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] KVM: selftests: Rely on KVM_RUN_NEEDS_COMPLETION to complete userspace exits
  2025-01-11  1:24 [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-01-11  1:24 ` [PATCH 4/5] KVM: selftests: Provide separate helper for KVM_RUN with immediate_exit Sean Christopherson
@ 2025-01-11  1:24 ` Sean Christopherson
  4 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-11  1:24 UTC (permalink / raw)
  To: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman,
	Sean Christopherson
  Cc: kvm, linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

Add selftests coverage for KVM_RUN_NEEDS_COMPLETION by redoing KVM_RUN if
and only if KVM states that completion is required.

Opportunistically rename the helper to replace "io" with "exit", as exits
that require completion are no longer limited to I/O.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h            | 8 ++++++--
 tools/testing/selftests/kvm/lib/kvm_util.c                | 4 ++++
 tools/testing/selftests/kvm/lib/ucall_common.c            | 2 +-
 tools/testing/selftests/kvm/lib/x86/processor.c           | 8 +-------
 tools/testing/selftests/kvm/x86/triple_fault_event_test.c | 3 +--
 5 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 78fd597c1b60..86e1850e4e49 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -29,6 +29,8 @@
 
 #define NSEC_PER_SEC 1000000000L
 
+extern bool kvm_has_needs_completion;
+
 struct userspace_mem_region {
 	struct kvm_userspace_memory_region2 region;
 	struct sparsebit *unused_phy_pages;
@@ -634,9 +636,11 @@ static inline int __vcpu_run(struct kvm_vcpu *vcpu)
 
 void vcpu_run_immediate_exit(struct kvm_vcpu *vcpu);
 
-static inline void vcpu_run_complete_io(struct kvm_vcpu *vcpu)
+static inline void vcpu_run_complete_exit(struct kvm_vcpu *vcpu)
 {
-	vcpu_run_immediate_exit(vcpu);
+	if (!kvm_has_needs_completion ||
+	    (vcpu->run->flags & KVM_RUN_NEEDS_COMPLETION))
+		vcpu_run_immediate_exit(vcpu);
 }
 
 struct kvm_reg_list *vcpu_get_reg_list(struct kvm_vcpu *vcpu);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index c9a33766f673..95ac9b981049 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -19,6 +19,8 @@
 
 #define KVM_UTIL_MIN_PFN	2
 
+bool kvm_has_needs_completion;
+
 uint32_t guest_random_seed;
 struct guest_random_state guest_rng;
 static uint32_t last_guest_seed;
@@ -2253,6 +2255,8 @@ void __attribute((constructor)) kvm_selftest_init(void)
 	/* Tell stdout not to buffer its content. */
 	setbuf(stdout, NULL);
 
+	kvm_has_needs_completion = kvm_check_cap(KVM_CAP_NEEDS_COMPLETION);
+
 	guest_random_seed = last_guest_seed = random();
 	pr_info("Random seed: 0x%x\n", guest_random_seed);
 
diff --git a/tools/testing/selftests/kvm/lib/ucall_common.c b/tools/testing/selftests/kvm/lib/ucall_common.c
index 42151e571953..125584a94ba8 100644
--- a/tools/testing/selftests/kvm/lib/ucall_common.c
+++ b/tools/testing/selftests/kvm/lib/ucall_common.c
@@ -154,7 +154,7 @@ uint64_t get_ucall(struct kvm_vcpu *vcpu, struct ucall *uc)
 			    "Guest failed to allocate ucall struct");
 
 		memcpy(uc, addr, sizeof(*uc));
-		vcpu_run_complete_io(vcpu);
+		vcpu_run_complete_exit(vcpu);
 	} else {
 		memset(uc, 0, sizeof(*uc));
 	}
diff --git a/tools/testing/selftests/kvm/lib/x86/processor.c b/tools/testing/selftests/kvm/lib/x86/processor.c
index bd5a802fa7a5..1db4764e413b 100644
--- a/tools/testing/selftests/kvm/lib/x86/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86/processor.c
@@ -1077,13 +1077,7 @@ struct kvm_x86_state *vcpu_save_state(struct kvm_vcpu *vcpu)
 			    nested_size, sizeof(state->nested_));
 	}
 
-	/*
-	 * When KVM exits to userspace with KVM_EXIT_IO, KVM guarantees
-	 * guest state is consistent only after userspace re-enters the
-	 * kernel with KVM_RUN.  Complete IO prior to migrating state
-	 * to a new VM.
-	 */
-	vcpu_run_complete_io(vcpu);
+	vcpu_run_complete_exit(vcpu);
 
 	state = malloc(sizeof(*state) + msr_list->nmsrs * sizeof(state->msrs.entries[0]));
 	TEST_ASSERT(state, "-ENOMEM when allocating kvm state");
diff --git a/tools/testing/selftests/kvm/x86/triple_fault_event_test.c b/tools/testing/selftests/kvm/x86/triple_fault_event_test.c
index 56306a19144a..82d732788bc1 100644
--- a/tools/testing/selftests/kvm/x86/triple_fault_event_test.c
+++ b/tools/testing/selftests/kvm/x86/triple_fault_event_test.c
@@ -97,8 +97,7 @@ int main(void)
 	events.flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
 	events.triple_fault.pending = true;
 	vcpu_events_set(vcpu, &events);
-	run->immediate_exit = true;
-	vcpu_run_complete_io(vcpu);
+	vcpu_run_complete_exit(vcpu);
 
 	vcpu_events_get(vcpu, &events);
 	TEST_ASSERT(events.flags & KVM_VCPUEVENT_VALID_TRIPLE_FAULT,
-- 
2.47.1.613.gc27f4b7a9f-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-11  1:24 ` [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion Sean Christopherson
@ 2025-01-11 11:01   ` Marc Zyngier
  2025-01-13 15:44     ` Sean Christopherson
  2025-01-13  2:09   ` Chao Gao
  1 sibling, 1 reply; 15+ messages in thread
From: Marc Zyngier @ 2025-01-11 11:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Sat, 11 Jan 2025 01:24:48 +0000,
Sean Christopherson <seanjc@google.com> wrote:
> 
> Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
> that KVM_RUN needs to be re-executed prior to save/restore in order to
> complete the instruction/operation that triggered the userspace exit.
> 
> KVM's current approach of adding notes in the Documentation is beyond
> brittle, e.g. there is at least one known case where a KVM developer added
> a new userspace exit type, and then that same developer forgot to handle
> completion when adding userspace support.

Is this going to fix anything? If they couldn't be bothered to read
the documentation, let alone update it, how is that going to be
improved by extra rules and regulations?

I don't see how someone ignoring the documented behaviour of a given
exit reason is, all of a sudden, have an epiphany and take a *new*
flag into account.

> 
> On x86, there are multiple exits that need completion, but they are all
> conveniently funneled through a single callback, i.e. in theory, this is a
> one-time thing for KVM x86 (and other architectures could follow suit with
> additional refactoring).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  Documentation/virt/kvm/api.rst    | 48 ++++++++++++++++++++++---------
>  arch/powerpc/kvm/book3s_emulate.c |  1 +
>  arch/powerpc/kvm/book3s_hv.c      |  1 +
>  arch/powerpc/kvm/book3s_pr.c      |  2 ++
>  arch/powerpc/kvm/booke.c          |  1 +
>  arch/x86/include/uapi/asm/kvm.h   |  7 +++--
>  arch/x86/kvm/x86.c                |  2 ++
>  include/uapi/linux/kvm.h          |  3 ++
>  virt/kvm/kvm_main.c               |  1 +
>  9 files changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index c92c8d4e8779..8e172675d8d6 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6505,7 +6505,7 @@ local APIC is not used.
>  
>  	__u16 flags;
>  
> -More architecture-specific flags detailing state of the VCPU that may
> +Common and architecture-specific flags detailing state of the VCPU that may
>  affect the device's behavior. Current defined flags::
>  
>    /* x86, set if the VCPU is in system management mode */
> @@ -6518,6 +6518,8 @@ affect the device's behavior. Current defined flags::
>    /* arm64, set for KVM_EXIT_DEBUG */
>    #define KVM_DEBUG_ARCH_HSR_HIGH_VALID  (1 << 0)
>  
> +  /* all architectures, set when the exit needs completion (via KVM_RUN) */
> +  #define KVM_RUN_NEEDS_COMPLETION  (1 << 15)
>  ::
>  
>  	/* in (pre_kvm_run), out (post_kvm_run) */
> @@ -6632,19 +6634,10 @@ requires a guest to interact with host userspace.
>  
>  .. note::
>  
> -      For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
> -      KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR, KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL
> -      the corresponding operations are complete (and guest state is consistent)
> -      only after userspace has re-entered the kernel with KVM_RUN.  The kernel
> -      side will first finish incomplete operations and then check for pending
> -      signals.
> +      For some exits, userspace must re-enter the kernel with KVM_RUN to
> +      complete the exit and ensure guest state is consistent.
>  
> -      The pending state of the operation is not preserved in state which is
> -      visible to userspace, thus userspace should ensure that the operation is
> -      completed before performing a live migration.  Userspace can re-enter the
> -      guest with an unmasked signal pending or with the immediate_exit field set
> -      to complete pending operations without allowing any further instructions
> -      to be executed.
> +      See KVM_CAP_NEEDS_COMPLETION for details.
>  
>  ::
>  
> @@ -8239,7 +8232,7 @@ Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the
>  core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest.
>  
>  7.36 KVM_CAP_X86_GUEST_MODE
> -------------------------------
> +---------------------------
>  
>  :Architectures: x86
>  :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> @@ -8252,6 +8245,33 @@ KVM exits with the register state of either the L1 or L2 guest
>  depending on which executed at the time of an exit. Userspace must
>  take care to differentiate between these cases.
>  
> +7.37 KVM_CAP_NEEDS_COMPLETION
> +-----------------------------
> +
> +:Architectures: all
> +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> +
> +The presence of this capability indicates that KVM_RUN will set
> +KVM_RUN_NEEDS_COMPLETION in kvm_run.flags if KVM requires userspace to re-enter
> +the kernel KVM_RUN to complete the exit.
> +
> +For select exits, userspace must re-enter the kernel with KVM_RUN to complete
> +the corresponding operation, only after which is guest state guaranteed to be
> +consistent.  On such a KVM_RUN, the kernel side will first finish incomplete
> +operations and then check for pending signals.
> +
> +The pending state of the operation for such exits is not preserved in state
> +which is visible to userspace, thus userspace should ensure that the operation
> +is completed before performing state save/restore, e.g. for live migration.
> +Userspace can re-enter the guest with an unmasked signal pending or with the
> +immediate_exit field set to complete pending operations without allowing any
> +further instructions to be executed.
> +
> +Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
> +and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
> +KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
> +KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.

So once you advertise KVM_CAP_NEEDS_COMPLETION, the completion flag
must be present for all of these exits, right? And from what I can
tell, this capability is unconditionally advertised.

Yet, you don't amend arm64 to publish that flag. Not that I think this
causes any issue (even if you save the state at that point without
reentering the guest, it will be still be consistent), but that
directly contradicts the documentation (isn't that ironic? ;-).

Or is your intent to *relax* the requirements on arm64 (and anything
else but x86 and POWER)?

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-11  1:24 ` [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion Sean Christopherson
  2025-01-11 11:01   ` Marc Zyngier
@ 2025-01-13  2:09   ` Chao Gao
  2025-01-13  9:01     ` Binbin Wu
  2025-01-13 16:59     ` Sean Christopherson
  1 sibling, 2 replies; 15+ messages in thread
From: Chao Gao @ 2025-01-13  2:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Fri, Jan 10, 2025 at 05:24:48PM -0800, Sean Christopherson wrote:
>Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
>that KVM_RUN needs to be re-executed prior to save/restore in order to
>complete the instruction/operation that triggered the userspace exit.
>
>KVM's current approach of adding notes in the Documentation is beyond
>brittle, e.g. there is at least one known case where a KVM developer added
>a new userspace exit type, and then that same developer forgot to handle
>completion when adding userspace support.

This answers one question I had:
https://lore.kernel.org/kvm/Z1bmUCEdoZ87wIMn@intel.com/

So, it is the VMM's (i.e., QEMU's) responsibility to re-execute KVM_RUN in this
case.

Btw, can this flag be used to address the issue [*] with steal time accounting?
We can set the new flag for each vCPU in the PM notifier and we need to change
the re-execution to handle steal time accounting (not just IO completion).

[*]: https://lore.kernel.org/kvm/Z36XJl1OAahVkxhl@google.com/

one nit below,

>--- a/arch/x86/include/uapi/asm/kvm.h
>+++ b/arch/x86/include/uapi/asm/kvm.h
>@@ -104,9 +104,10 @@ struct kvm_ioapic_state {
> #define KVM_IRQCHIP_IOAPIC       2
> #define KVM_NR_IRQCHIPS          3
> 
>-#define KVM_RUN_X86_SMM		 (1 << 0)
>-#define KVM_RUN_X86_BUS_LOCK     (1 << 1)
>-#define KVM_RUN_X86_GUEST_MODE   (1 << 2)
>+#define KVM_RUN_X86_SMM			(1 << 0)
>+#define KVM_RUN_X86_BUS_LOCK		(1 << 1)
>+#define KVM_RUN_X86_GUEST_MODE		(1 << 2)
>+#define KVM_RUN_X86_NEEDS_COMPLETION	(1 << 2)

This X86_NEEDS_COMPLETION should be dropped. It is never used.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-13  2:09   ` Chao Gao
@ 2025-01-13  9:01     ` Binbin Wu
  2025-01-13 16:59     ` Sean Christopherson
  1 sibling, 0 replies; 15+ messages in thread
From: Binbin Wu @ 2025-01-13  9:01 UTC (permalink / raw)
  To: Chao Gao, Sean Christopherson
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel




On 1/13/2025 10:09 AM, Chao Gao wrote:
> On Fri, Jan 10, 2025 at 05:24:48PM -0800, Sean Christopherson wrote:
>> Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
>> that KVM_RUN needs to be re-executed prior to save/restore in order to
>> complete the instruction/operation that triggered the userspace exit.
>>
>> KVM's current approach of adding notes in the Documentation is beyond
>> brittle, e.g. there is at least one known case where a KVM developer added
>> a new userspace exit type, and then that same developer forgot to handle
>> completion when adding userspace support.
> This answers one question I had:
> https://lore.kernel.org/kvm/Z1bmUCEdoZ87wIMn@intel.com/
In current QEMU code, it always returns back to KVM via KVM_RUN after it
successfully handled a KVM exit reason, no matter what the exit reason is.
The complete_userspace_io() callback will be called if it has been setup.
So if a new kvm exit reason is added in QEMU, it seems QEMU doesn't need
special handing to make the complete_userspace_io() callback be called.

However, QEMU is not the only userspace VMM that supports KVM, it makes
sense to make the solution generic and clear for different userspace VMMs.

Regarding the support of MapGPA for TDX when live migration is considered,
since a big range will be split into 2MB chunks, in order the status is
right after TD live migration, it needs to set the return code to retry
with the next_gpa in the complete_userspace_io() callback if vcpu->wants_to_run
is false or vcpu->run->immediate_exit__unsafe is set, otherwise, TDX guest
will see return code as successful and think the whole range has been converted
successfully.

@@ -1093,7 +1093,8 @@ static int tdx_complete_vmcall_map_gpa(struct kvm_vcpu *vcpu)
          * immediately after STI or MOV/POP SS.
          */
         if (pi_has_pending_interrupt(vcpu) ||
-           kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending) {
+           kvm_test_request(KVM_REQ_NMI, vcpu) || vcpu->arch.nmi_pending ||
+           !vcpu->wants_to_run) {
                 tdvmcall_set_return_code(vcpu, TDVMCALL_STATUS_RETRY);
                 tdx->vp_enter_args.r11 = tdx->map_gpa_next;
                 return 1;

Of course, it can be addressed later when TD live migration is supported.


>
> So, it is the VMM's (i.e., QEMU's) responsibility to re-execute KVM_RUN in this
> case.
>
> Btw, can this flag be used to address the issue [*] with steal time accounting?
> We can set the new flag for each vCPU in the PM notifier and we need to change
> the re-execution to handle steal time accounting (not just IO completion).
>
> [*]: https://lore.kernel.org/kvm/Z36XJl1OAahVkxhl@google.com/
>
> one nit below,
>
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -104,9 +104,10 @@ struct kvm_ioapic_state {
>> #define KVM_IRQCHIP_IOAPIC       2
>> #define KVM_NR_IRQCHIPS          3
>>
>> -#define KVM_RUN_X86_SMM		 (1 << 0)
>> -#define KVM_RUN_X86_BUS_LOCK     (1 << 1)
>> -#define KVM_RUN_X86_GUEST_MODE   (1 << 2)
>> +#define KVM_RUN_X86_SMM			(1 << 0)
>> +#define KVM_RUN_X86_BUS_LOCK		(1 << 1)
>> +#define KVM_RUN_X86_GUEST_MODE		(1 << 2)
>> +#define KVM_RUN_X86_NEEDS_COMPLETION	(1 << 2)
> This X86_NEEDS_COMPLETION should be dropped. It is never used.
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-11 11:01   ` Marc Zyngier
@ 2025-01-13 15:44     ` Sean Christopherson
  2025-01-13 17:58       ` Marc Zyngier
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-01-13 15:44 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Sat, Jan 11, 2025, Marc Zyngier wrote:
> On Sat, 11 Jan 2025 01:24:48 +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
> > that KVM_RUN needs to be re-executed prior to save/restore in order to
> > complete the instruction/operation that triggered the userspace exit.
> > 
> > KVM's current approach of adding notes in the Documentation is beyond
> > brittle, e.g. there is at least one known case where a KVM developer added
> > a new userspace exit type, and then that same developer forgot to handle
> > completion when adding userspace support.
> 
> Is this going to fix anything? If they couldn't be bothered to read
> the documentation, let alone update it, how is that going to be
> improved by extra rules and regulations?
> 
> I don't see how someone ignoring the documented behaviour of a given
> exit reason is, all of a sudden, have an epiphany and take a *new*
> flag into account.

The idea is to reduce the probability of introducing bugs, in KVM or userspace,
every time KVM attaches a completion callback.  Yes, userspace would need to be
updated to handle KVM_RUN_NEEDS_COMPLETION, but once that flag is merged, neither
KVM's documentation nor userspace would never need to be updated again.  And if
all architectures took an approach of handling completion via function callback,
I'm pretty sure we'd never need to manually update KVM itself either.

> > +7.37 KVM_CAP_NEEDS_COMPLETION
> > +-----------------------------
> > +
> > +:Architectures: all
> > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > +
> > +The presence of this capability indicates that KVM_RUN will set
> > +KVM_RUN_NEEDS_COMPLETION in kvm_run.flags if KVM requires userspace to re-enter
> > +the kernel KVM_RUN to complete the exit.
> > +
> > +For select exits, userspace must re-enter the kernel with KVM_RUN to complete
> > +the corresponding operation, only after which is guest state guaranteed to be
> > +consistent.  On such a KVM_RUN, the kernel side will first finish incomplete
> > +operations and then check for pending signals.
> > +
> > +The pending state of the operation for such exits is not preserved in state
> > +which is visible to userspace, thus userspace should ensure that the operation
> > +is completed before performing state save/restore, e.g. for live migration.
> > +Userspace can re-enter the guest with an unmasked signal pending or with the
> > +immediate_exit field set to complete pending operations without allowing any
> > +further instructions to be executed.
> > +
> > +Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
> > +and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
> > +KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
> > +KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.
> 
> So once you advertise KVM_CAP_NEEDS_COMPLETION, the completion flag
> must be present for all of these exits, right? And from what I can
> tell, this capability is unconditionally advertised.
> 
> Yet, you don't amend arm64 to publish that flag. Not that I think this
> causes any issue (even if you save the state at that point without
> reentering the guest, it will be still be consistent), but that
> directly contradicts the documentation (isn't that ironic? ;-).

It does cause issues, I missed this code in kvm_arch_vcpu_ioctl_run():

	if (run->exit_reason == KVM_EXIT_MMIO) {
		ret = kvm_handle_mmio_return(vcpu);
		if (ret <= 0)
			return ret;
	}

> Or is your intent to *relax* the requirements on arm64 (and anything
> else but x86 and POWER)?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-13  2:09   ` Chao Gao
  2025-01-13  9:01     ` Binbin Wu
@ 2025-01-13 16:59     ` Sean Christopherson
  1 sibling, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-13 16:59 UTC (permalink / raw)
  To: Chao Gao
  Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Mon, Jan 13, 2025, Chao Gao wrote:
> On Fri, Jan 10, 2025 at 05:24:48PM -0800, Sean Christopherson wrote:
> >Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
> >that KVM_RUN needs to be re-executed prior to save/restore in order to
> >complete the instruction/operation that triggered the userspace exit.
> >
> >KVM's current approach of adding notes in the Documentation is beyond
> >brittle, e.g. there is at least one known case where a KVM developer added
> >a new userspace exit type, and then that same developer forgot to handle
> >completion when adding userspace support.
> 
> This answers one question I had:
> https://lore.kernel.org/kvm/Z1bmUCEdoZ87wIMn@intel.com/
> 
> So, it is the VMM's (i.e., QEMU's) responsibility to re-execute KVM_RUN in this
> case.

Yep.

> Btw, can this flag be used to address the issue [*] with steal time accounting?
> We can set the new flag for each vCPU in the PM notifier and we need to change
> the re-execution to handle steal time accounting (not just IO completion).
> 
> [*]: https://lore.kernel.org/kvm/Z36XJl1OAahVkxhl@google.com/

Uh, hmm.  Partially?  And not without creating new, potentially worse problems.

I like the idea, but (a) there's no guarantee a vCPU would be "in" KVM_RUN at
the time of suspend, and (b) KVM would need to take vcpu->mutex in the PM notifier
in order to avoid clobbering the current completion callback, which is definitely
a net negative (hello, deadlocks).

E.g. if a vCPU task is in userspace processing emulated MMIO at the time of
suspend+resume, KVM's completion callback will be non-zero and must be preserved.
And if a vCPU task is in userspace processing an exit that _doesn't_ require
completion, setting KVM_RUN_NEEDS_COMPLETION would likely be missed by userspace,
e.g. if userspace checks the flag only after regaining control from KVM_RUN.

In general, I think setting KVM_RUN_NEEDS_COMPLETION outside of KVM_RUN would add
too much complexity.

> one nit below,
> 
> >--- a/arch/x86/include/uapi/asm/kvm.h
> >+++ b/arch/x86/include/uapi/asm/kvm.h
> >@@ -104,9 +104,10 @@ struct kvm_ioapic_state {
> > #define KVM_IRQCHIP_IOAPIC       2
> > #define KVM_NR_IRQCHIPS          3
> > 
> >-#define KVM_RUN_X86_SMM		 (1 << 0)
> >-#define KVM_RUN_X86_BUS_LOCK     (1 << 1)
> >-#define KVM_RUN_X86_GUEST_MODE   (1 << 2)
> >+#define KVM_RUN_X86_SMM			(1 << 0)
> >+#define KVM_RUN_X86_BUS_LOCK		(1 << 1)
> >+#define KVM_RUN_X86_GUEST_MODE		(1 << 2)
> >+#define KVM_RUN_X86_NEEDS_COMPLETION	(1 << 2)
> 
> This X86_NEEDS_COMPLETION should be dropped. It is never used.

Gah, thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-13 15:44     ` Sean Christopherson
@ 2025-01-13 17:58       ` Marc Zyngier
  2025-01-13 18:58         ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Zyngier @ 2025-01-13 17:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Mon, 13 Jan 2025 15:44:28 +0000,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Sat, Jan 11, 2025, Marc Zyngier wrote:
> > On Sat, 11 Jan 2025 01:24:48 +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
> > > that KVM_RUN needs to be re-executed prior to save/restore in order to
> > > complete the instruction/operation that triggered the userspace exit.
> > > 
> > > KVM's current approach of adding notes in the Documentation is beyond
> > > brittle, e.g. there is at least one known case where a KVM developer added
> > > a new userspace exit type, and then that same developer forgot to handle
> > > completion when adding userspace support.
> > 
> > Is this going to fix anything? If they couldn't be bothered to read
> > the documentation, let alone update it, how is that going to be
> > improved by extra rules and regulations?
> > 
> > I don't see how someone ignoring the documented behaviour of a given
> > exit reason is, all of a sudden, have an epiphany and take a *new*
> > flag into account.
> 
> The idea is to reduce the probability of introducing bugs, in KVM or userspace,
> every time KVM attaches a completion callback.  Yes, userspace would need to be
> updated to handle KVM_RUN_NEEDS_COMPLETION, but once that flag is merged, neither
> KVM's documentation nor userspace would never need to be updated again.  And if
> all architectures took an approach of handling completion via function callback,
> I'm pretty sure we'd never need to manually update KVM itself either.

You are assuming that we need this completion, and I dispute this
assertion.

>
> > > +7.37 KVM_CAP_NEEDS_COMPLETION
> > > +-----------------------------
> > > +
> > > +:Architectures: all
> > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > +
> > > +The presence of this capability indicates that KVM_RUN will set
> > > +KVM_RUN_NEEDS_COMPLETION in kvm_run.flags if KVM requires userspace to re-enter
> > > +the kernel KVM_RUN to complete the exit.
> > > +
> > > +For select exits, userspace must re-enter the kernel with KVM_RUN to complete
> > > +the corresponding operation, only after which is guest state guaranteed to be
> > > +consistent.  On such a KVM_RUN, the kernel side will first finish incomplete
> > > +operations and then check for pending signals.
> > > +
> > > +The pending state of the operation for such exits is not preserved in state
> > > +which is visible to userspace, thus userspace should ensure that the operation
> > > +is completed before performing state save/restore, e.g. for live migration.
> > > +Userspace can re-enter the guest with an unmasked signal pending or with the
> > > +immediate_exit field set to complete pending operations without allowing any
> > > +further instructions to be executed.
> > > +
> > > +Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
> > > +and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
> > > +KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
> > > +KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.
> > 
> > So once you advertise KVM_CAP_NEEDS_COMPLETION, the completion flag
> > must be present for all of these exits, right? And from what I can
> > tell, this capability is unconditionally advertised.
> > 
> > Yet, you don't amend arm64 to publish that flag. Not that I think this
> > causes any issue (even if you save the state at that point without
> > reentering the guest, it will be still be consistent), but that
> > directly contradicts the documentation (isn't that ironic? ;-).
> 
> It does cause issues, I missed this code in kvm_arch_vcpu_ioctl_run():
> 
> 	if (run->exit_reason == KVM_EXIT_MMIO) {
> 		ret = kvm_handle_mmio_return(vcpu);
> 		if (ret <= 0)
> 			return ret;
> 	}

That's satisfying a load from the guest forwarded to userspace. If the
VMM did a save of the guest at this stage, restored and resumed it,
*nothing* bad would happen, as PC still points to the instruction that
got forwarded. You'll see the same load again.

As for all arm64 synchronous exceptions, they are idempotent, and can
be repeated as often as you want without side effects.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-13 17:58       ` Marc Zyngier
@ 2025-01-13 18:58         ` Sean Christopherson
  2025-01-13 19:38           ` Marc Zyngier
  0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-01-13 18:58 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Mon, Jan 13, 2025, Marc Zyngier wrote:
> On Mon, 13 Jan 2025 15:44:28 +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Sat, Jan 11, 2025, Marc Zyngier wrote:
> > > On Sat, 11 Jan 2025 01:24:48 +0000,
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
> > > > that KVM_RUN needs to be re-executed prior to save/restore in order to
> > > > complete the instruction/operation that triggered the userspace exit.
> > > > 
> > > > KVM's current approach of adding notes in the Documentation is beyond
> > > > brittle, e.g. there is at least one known case where a KVM developer added
> > > > a new userspace exit type, and then that same developer forgot to handle
> > > > completion when adding userspace support.
> > > 
> > > Is this going to fix anything? If they couldn't be bothered to read
> > > the documentation, let alone update it, how is that going to be
> > > improved by extra rules and regulations?
> > > 
> > > I don't see how someone ignoring the documented behaviour of a given
> > > exit reason is, all of a sudden, have an epiphany and take a *new*
> > > flag into account.
> > 
> > The idea is to reduce the probability of introducing bugs, in KVM or userspace,
> > every time KVM attaches a completion callback.  Yes, userspace would need to be
> > updated to handle KVM_RUN_NEEDS_COMPLETION, but once that flag is merged, neither
> > KVM's documentation nor userspace would never need to be updated again.  And if
> > all architectures took an approach of handling completion via function callback,
> > I'm pretty sure we'd never need to manually update KVM itself either.
> 
> You are assuming that we need this completion, and I dispute this
> assertion.

Ah, gotcha.

> > > > +The pending state of the operation for such exits is not preserved in state
> > > > +which is visible to userspace, thus userspace should ensure that the operation
> > > > +is completed before performing state save/restore, e.g. for live migration.
> > > > +Userspace can re-enter the guest with an unmasked signal pending or with the
> > > > +immediate_exit field set to complete pending operations without allowing any
> > > > +further instructions to be executed.
> > > > +
> > > > +Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
> > > > +and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
> > > > +KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
> > > > +KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.
> > > 
> > > So once you advertise KVM_CAP_NEEDS_COMPLETION, the completion flag
> > > must be present for all of these exits, right? And from what I can
> > > tell, this capability is unconditionally advertised.
> > > 
> > > Yet, you don't amend arm64 to publish that flag. Not that I think this
> > > causes any issue (even if you save the state at that point without
> > > reentering the guest, it will be still be consistent), but that
> > > directly contradicts the documentation (isn't that ironic? ;-).
> > 
> > It does cause issues, I missed this code in kvm_arch_vcpu_ioctl_run():
> > 
> > 	if (run->exit_reason == KVM_EXIT_MMIO) {
> > 		ret = kvm_handle_mmio_return(vcpu);
> > 		if (ret <= 0)
> > 			return ret;
> > 	}
> 
> That's satisfying a load from the guest forwarded to userspace.

And MMIO stores, no?  I.e. PC needs to be incremented on stores as well.

> If the VMM did a save of the guest at this stage, restored and resumed it,
> *nothing* bad would happen, as PC still points to the instruction that got
> forwarded. You'll see the same load again.

But replaying an MMIO store could cause all kinds of problems, and even MMIO
loads could theoretically be problematic, e.g. if there are side effects in the
device that trigger on access to a device register.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-13 18:58         ` Sean Christopherson
@ 2025-01-13 19:38           ` Marc Zyngier
  2025-01-13 22:04             ` Sean Christopherson
  0 siblings, 1 reply; 15+ messages in thread
From: Marc Zyngier @ 2025-01-13 19:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Mon, 13 Jan 2025 18:58:45 +0000,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Mon, Jan 13, 2025, Marc Zyngier wrote:
> > On Mon, 13 Jan 2025 15:44:28 +0000,
> > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Sat, Jan 11, 2025, Marc Zyngier wrote:
> > > > On Sat, 11 Jan 2025 01:24:48 +0000,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > > 
> > > > > Add a kvm_run flag, KVM_RUN_NEEDS_COMPLETION, to communicate to userspace
> > > > > that KVM_RUN needs to be re-executed prior to save/restore in order to
> > > > > complete the instruction/operation that triggered the userspace exit.
> > > > > 
> > > > > KVM's current approach of adding notes in the Documentation is beyond
> > > > > brittle, e.g. there is at least one known case where a KVM developer added
> > > > > a new userspace exit type, and then that same developer forgot to handle
> > > > > completion when adding userspace support.
> > > > 
> > > > Is this going to fix anything? If they couldn't be bothered to read
> > > > the documentation, let alone update it, how is that going to be
> > > > improved by extra rules and regulations?
> > > > 
> > > > I don't see how someone ignoring the documented behaviour of a given
> > > > exit reason is, all of a sudden, have an epiphany and take a *new*
> > > > flag into account.
> > > 
> > > The idea is to reduce the probability of introducing bugs, in KVM or userspace,
> > > every time KVM attaches a completion callback.  Yes, userspace would need to be
> > > updated to handle KVM_RUN_NEEDS_COMPLETION, but once that flag is merged, neither
> > > KVM's documentation nor userspace would never need to be updated again.  And if
> > > all architectures took an approach of handling completion via function callback,
> > > I'm pretty sure we'd never need to manually update KVM itself either.
> > 
> > You are assuming that we need this completion, and I dispute this
> > assertion.
> 
> Ah, gotcha.
> 
> > > > > +The pending state of the operation for such exits is not preserved in state
> > > > > +which is visible to userspace, thus userspace should ensure that the operation
> > > > > +is completed before performing state save/restore, e.g. for live migration.
> > > > > +Userspace can re-enter the guest with an unmasked signal pending or with the
> > > > > +immediate_exit field set to complete pending operations without allowing any
> > > > > +further instructions to be executed.
> > > > > +
> > > > > +Without KVM_CAP_NEEDS_COMPLETION, KVM_RUN_NEEDS_COMPLETION will never be set
> > > > > +and userspace must assume that exits of type KVM_EXIT_IO, KVM_EXIT_MMIO,
> > > > > +KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR,
> > > > > +KVM_EXIT_X86_WRMSR, and KVM_EXIT_HYPERCALL require completion.
> > > > 
> > > > So once you advertise KVM_CAP_NEEDS_COMPLETION, the completion flag
> > > > must be present for all of these exits, right? And from what I can
> > > > tell, this capability is unconditionally advertised.
> > > > 
> > > > Yet, you don't amend arm64 to publish that flag. Not that I think this
> > > > causes any issue (even if you save the state at that point without
> > > > reentering the guest, it will be still be consistent), but that
> > > > directly contradicts the documentation (isn't that ironic? ;-).
> > > 
> > > It does cause issues, I missed this code in kvm_arch_vcpu_ioctl_run():
> > > 
> > > 	if (run->exit_reason == KVM_EXIT_MMIO) {
> > > 		ret = kvm_handle_mmio_return(vcpu);
> > > 		if (ret <= 0)
> > > 			return ret;
> > > 	}
> > 
> > That's satisfying a load from the guest forwarded to userspace.
> 
> And MMIO stores, no?  I.e. PC needs to be incremented on stores as well.

Yes, *after* the store as completed. If you replay the instruction,
the same store comes out.

>
> > If the VMM did a save of the guest at this stage, restored and resumed it,
> > *nothing* bad would happen, as PC still points to the instruction that got
> > forwarded. You'll see the same load again.
> 
> But replaying an MMIO store could cause all kinds of problems, and even MMIO
> loads could theoretically be problematic, e.g. if there are side effects in the
> device that trigger on access to a device register.

But that's the VMM's problem. If it has modified its own state and
doesn't return to the guest to complete the instruction, that's just
as bad as a load, which *do* have side effects as well.

Overall, the guest state exposed by KVM is always correct, and
replaying the instruction is not going to change that. It is if the
VMM is broken that things turn ugly *for the VMM itself*, and I claim
that no amount of flag being added is going to help that.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion
  2025-01-13 19:38           ` Marc Zyngier
@ 2025-01-13 22:04             ` Sean Christopherson
  0 siblings, 0 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-01-13 22:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Paolo Bonzini, Oliver Upton, Michael Ellerman, kvm,
	linux-arm-kernel, kvmarm, linuxppc-dev, linux-kernel

On Mon, Jan 13, 2025, Marc Zyngier wrote:
> On Mon, 13 Jan 2025 18:58:45 +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Mon, Jan 13, 2025, Marc Zyngier wrote:
> > > On Mon, 13 Jan 2025 15:44:28 +0000,
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > On Sat, Jan 11, 2025, Marc Zyngier wrote:
> > > > > Yet, you don't amend arm64 to publish that flag. Not that I think this
> > > > > causes any issue (even if you save the state at that point without
> > > > > reentering the guest, it will be still be consistent), but that
> > > > > directly contradicts the documentation (isn't that ironic? ;-).
> > > > 
> > > > It does cause issues, I missed this code in kvm_arch_vcpu_ioctl_run():
> > > > 
> > > > 	if (run->exit_reason == KVM_EXIT_MMIO) {
> > > > 		ret = kvm_handle_mmio_return(vcpu);
> > > > 		if (ret <= 0)
> > > > 			return ret;
> > > > 	}
> > > 
> > > That's satisfying a load from the guest forwarded to userspace.
> > 
> > And MMIO stores, no?  I.e. PC needs to be incremented on stores as well.
> 
> Yes, *after* the store as completed. If you replay the instruction,
> the same store comes out.
> 
> > > If the VMM did a save of the guest at this stage, restored and resumed it,
> > > *nothing* bad would happen, as PC still points to the instruction that got
> > > forwarded. You'll see the same load again.
> > 
> > But replaying an MMIO store could cause all kinds of problems, and even MMIO
> > loads could theoretically be problematic, e.g. if there are side effects in the
> > device that trigger on access to a device register.
> 
> But that's the VMM's problem. If it has modified its own state and
> doesn't return to the guest to complete the instruction, that's just
> as bad as a load, which *do* have side effects as well.

Agreed, just wanted to make sure I wasn't completely misunderstanding something
about arm64.

> Overall, the guest state exposed by KVM is always correct, and
> replaying the instruction is not going to change that. It is if the
> VMM is broken that things turn ugly *for the VMM itself*, 
> and I claim that no amount of flag being added is going to help that.

On x86 at least, adding KVM_RUN_NEEDS_COMPLETION reduces the chances for human
error.  x86 has had bugs in both KVM (patch 1) and userspace (Google's VMM when
handling MSR exits) that would have been avoided if KVM_RUN_NEEDS_COMPLETION existed.
Unless the VMM is doing something decidely odd, userspace needs to write code once
(maybe even just once for all architectures).  For KVM, the flag is set based on
whether or not the vCPU has a valid completion callback, i.e. will be correct so
long as the underlying KVM code is correct.

Contrast that with the current approach, where the KVM developer needs to get
the KVM code correct and remember to update KVM's documentation.  Documentation
is especially problematic, because in practice it can't be tested, i.e. is much
more likely to be missed by the developer and the maintainer.  The VMM either
needs to blindly redo KVM_RUN (as selftests do, and apparently as QEMU does), or
the developer adding VMM support needs to be diligent in reading KVM's documentation.
And like KVM documentation, testing that the VMM is implemented to KVM's "spec"
is effectively impossible in practice, because 99.9999% of the time userspace
exits and save/restore will work just fine.

I do agree that the VMM is likely going to run into problems sooner or later if
the developers/maintainers don't fundamentally understand the need to redo KVM_RUN,
but I also think there's significant value in reducing the chances for simple
human error to result in broken VMs.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-01-13 22:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-11  1:24 [PATCH 0/5] KVM: Add a kvm_run flag to signal need for completion Sean Christopherson
2025-01-11  1:24 ` [PATCH 1/5] KVM: x86: Document that KVM_EXIT_HYPERCALL requires completion Sean Christopherson
2025-01-11  1:24 ` [PATCH 2/5] KVM: Clear vcpu->run->flags at start of KVM_RUN for all architectures Sean Christopherson
2025-01-11  1:24 ` [PATCH 3/5] KVM: Add a common kvm_run flag to communicate an exit needs completion Sean Christopherson
2025-01-11 11:01   ` Marc Zyngier
2025-01-13 15:44     ` Sean Christopherson
2025-01-13 17:58       ` Marc Zyngier
2025-01-13 18:58         ` Sean Christopherson
2025-01-13 19:38           ` Marc Zyngier
2025-01-13 22:04             ` Sean Christopherson
2025-01-13  2:09   ` Chao Gao
2025-01-13  9:01     ` Binbin Wu
2025-01-13 16:59     ` Sean Christopherson
2025-01-11  1:24 ` [PATCH 4/5] KVM: selftests: Provide separate helper for KVM_RUN with immediate_exit Sean Christopherson
2025-01-11  1:24 ` [PATCH 5/5] KVM: selftests: Rely on KVM_RUN_NEEDS_COMPLETION to complete userspace exits Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).