* [PATCH v8 0/3] KVM: x86: Include host suspended time in steal time
@ 2025-07-22  5:50 Suleiman Souhlal
  2025-07-22  5:50 ` [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend Suleiman Souhlal
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Suleiman Souhlal @ 2025-07-22  5:50 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, Sergey Senozhatsky,
	Konrad Rzeszutek Wilk, Tzung-Bi Shih, John Stultz, kvm,
	linux-kernel, ssouhlal, Suleiman Souhlal

This series accounts the time that the host spends suspended as steal
time in guests.

When the host resumes from suspend, the guest sees any task that was
running across the suspend as having run for a very long time, even
though its effective run time was much shorter, which can have
negative effects on guest scheduling.

To mitigate this issue, include the time that the host was suspended
in steal time when the guest requests it, which lets the guest
subtract that duration from the affected tasks' runtime. Add new ABI
to make this behavior opt-in per guest.
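
For reference, the guest consumes steal time through the existing
kvm_steal_clock() paravirt hook, roughly the snippet below (adapted
from arch/x86/kernel/kvm.c); the suspend duration accumulated by this
series simply shows up there as additional stolen time:

	static u64 kvm_steal_clock(int cpu)
	{
		struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
		int version;
		u64 steal;

		do {
			version = src->version;
			virt_rmb();
			steal = src->steal;	/* stolen time, in nanoseconds */
			virt_rmb();
		} while ((version & 1) || (version != src->version));

		return steal;
	}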

In addition, make guest TSC behavior consistent regardless of whether
the host TSC went backwards across the suspend.

v8:
- Avoid #ifdef in function body.

v7: https://lore.kernel.org/lkml/20250714033649.4024311-1-suleiman@google.com/
- Fix build.
- Make advancing TSC dependent on X86_64.

v6: https://lore.kernel.org/kvm/20250709070450.473297-1-suleiman@google.com/
- Use true/false for bools.
- Indentation.
- Remove superfluous flag. 
- Use atomic operations for accumulating suspend duration.
- Reuse generic vcpu block/kick infrastructure instead of rolling our own.
- Add ABI to make the behavior opt-in per-guest.
- Add command line parameter to make guest use this.
- Reword commit messages in imperative mood.

v5: https://lore.kernel.org/kvm/20250325041350.1728373-1-suleiman@google.com/
- Fix grammar mistakes in commit message.

v4: https://lore.kernel.org/kvm/20250221053927.486476-1-suleiman@google.com/
- Advance guest TSC on suspends where host TSC goes backwards.
- Block vCPUs from running until resume notifier.
- Move suspend duration accounting out of machine-independent kvm to
  x86.
- Merge code and documentation patches.
- Reworded documentation.

v3: https://lore.kernel.org/kvm/20250107042202.2554063-1-suleiman@google.com/
- Use PM notifier instead of syscore ops (kvm_suspend()/kvm_resume()),
  because the latter doesn't get called on shallow suspend.
- Don't call function under UACCESS.
- Whitespace.

v2: https://lore.kernel.org/kvm/20240820043543.837914-1-suleiman@google.com/
- Accumulate suspend time at machine-independent kvm layer and track per-VCPU
  instead of per-VM.
- Document changes.

v1: https://lore.kernel.org/kvm/20240710074410.770409-1-suleiman@google.com/

Suleiman Souhlal (3):
  KVM: x86: Advance guest TSC after deep suspend.
  KVM: x86: Include host suspended duration in steal time
  KVM: x86: Add "suspendsteal" cmdline to request host to add suspend
    duration in steal time

 .../admin-guide/kernel-parameters.txt         |   5 +
 Documentation/virt/kvm/x86/cpuid.rst          |   4 +
 Documentation/virt/kvm/x86/msr.rst            |  14 ++
 arch/x86/include/asm/kvm_host.h               |   6 +
 arch/x86/include/uapi/asm/kvm_para.h          |   2 +
 arch/x86/kernel/kvm.c                         |  15 ++
 arch/x86/kvm/cpuid.c                          |   4 +-
 arch/x86/kvm/x86.c                            | 129 +++++++++++++++++-
 8 files changed, 172 insertions(+), 7 deletions(-)

-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend.
  2025-07-22  5:50 [PATCH v8 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
@ 2025-07-22  5:50 ` Suleiman Souhlal
  2025-08-21  0:24   ` Sean Christopherson
  2025-07-22  5:50 ` [PATCH v8 2/3] KVM: x86: Include host suspended duration in steal time Suleiman Souhlal
  2025-07-22  5:50 ` [PATCH v8 3/3] KVM: x86: Add "suspendsteal" cmdline to request host to add suspend " Suleiman Souhlal
  2 siblings, 1 reply; 7+ messages in thread
From: Suleiman Souhlal @ 2025-07-22  5:50 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, Sergey Senozhatsky,
	Konrad Rzeszutek Wilk, Tzung-Bi Shih, John Stultz, kvm,
	linux-kernel, ssouhlal, Suleiman Souhlal

Try to advance the guest TSC to the current time after a suspend
during which the host TSC went backwards.

This makes the behavior consistent between suspends where the host TSC
resets and suspends where it doesn't, such as suspend-to-idle:
previously, when the host TSC reset, the guests' TSCs would be "frozen"
due to KVM's backwards-TSC prevention, while in the latter case they
would advance.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/x86.c              | 49 ++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fb01e456b624..e57d51e9f2be 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1415,6 +1415,9 @@ struct kvm_arch {
 	u64 cur_tsc_offset;
 	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;
+#ifdef CONFIG_X86_64
+	bool host_was_suspended;
+#endif
 
 	u32 default_tsc_khz;
 	bool user_set_tsc;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d992d5652f..422c7fcc5d83 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2779,7 +2779,7 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
 	kvm_vcpu_write_tsc_offset(vcpu, tsc_offset + adjustment);
 }
 
-static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
+static inline void __adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
 {
 	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio)
 		WARN_ON(adjustment < 0);
@@ -4995,6 +4995,52 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
 
 static DEFINE_PER_CPU(struct kvm_vcpu *, last_vcpu);
 
+#ifdef CONFIG_X86_64
+static void kvm_set_host_was_suspended(struct kvm *kvm)
+{
+	kvm->arch.host_was_suspended = true;
+}
+
+static void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, u64 adj)
+{
+	unsigned long flags;
+	struct kvm *kvm;
+	bool advance;
+	u64 kernel_ns, l1_tsc, offset, tsc_now;
+
+	kvm = vcpu->kvm;
+	advance = kvm_get_time_and_clockread(&kernel_ns, &tsc_now);
+	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+	/*
+	 * Advance the guest's TSC to current time instead of only preventing
+	 * it from going backwards, while making sure all the vCPUs use the
+	 * same offset.
+	 */
+	if (kvm->arch.host_was_suspended && advance) {
+		l1_tsc = nsec_to_cycles(vcpu,
+					kvm->arch.kvmclock_offset + kernel_ns);
+		offset = kvm_compute_l1_tsc_offset(vcpu, l1_tsc);
+		kvm->arch.cur_tsc_offset = offset;
+		kvm_vcpu_write_tsc_offset(vcpu, offset);
+	} else if (advance) {
+		kvm_vcpu_write_tsc_offset(vcpu, kvm->arch.cur_tsc_offset);
+	} else {
+		__adjust_tsc_offset_host(vcpu, adj);
+	}
+	kvm->arch.host_was_suspended = false;
+	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+}
+#else
+static void kvm_set_host_was_suspended(struct kvm *kvm)
+{
+}
+
+static void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, u64 adj)
+{
+	__adjust_tsc_offset_host(vcpu, adj);
+}
+#endif /* CONFIG_X86_64 */
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -12729,6 +12775,7 @@ int kvm_arch_enable_virtualization_cpu(void)
 				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 			}
 
+			kvm_set_host_was_suspended(kvm);
 			/*
 			 * We have to disable TSC offset matching.. if you were
 			 * booting a VM while issuing an S4 host suspend....
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH v8 2/3] KVM: x86: Include host suspended duration in steal time
  2025-07-22  5:50 [PATCH v8 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
  2025-07-22  5:50 ` [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend Suleiman Souhlal
@ 2025-07-22  5:50 ` Suleiman Souhlal
  2025-08-21  0:43   ` Sean Christopherson
  2025-07-22  5:50 ` [PATCH v8 3/3] KVM: x86: Add "suspendsteal" cmdline to request host to add suspend " Suleiman Souhlal
  2 siblings, 1 reply; 7+ messages in thread
From: Suleiman Souhlal @ 2025-07-22  5:50 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, Sergey Senozhatsky,
	Konrad Rzeszutek Wilk, Tzung-Bi Shih, John Stultz, kvm,
	linux-kernel, ssouhlal, Suleiman Souhlal

Introduce MSR_KVM_SUSPEND_STEAL which controls whether or not a guest
wants the duration of host suspend to be included in steal time.

This lets guests subtract the duration during which the host was
suspended from the runtime of tasks that were running over the suspend,
in order to prevent cases where host suspend causes long runtimes in
guest tasks, even though their effective runtime was much shorter.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 Documentation/virt/kvm/x86/cpuid.rst |  4 ++
 Documentation/virt/kvm/x86/msr.rst   | 14 +++++
 arch/x86/include/asm/kvm_host.h      |  3 ++
 arch/x86/include/uapi/asm/kvm_para.h |  2 +
 arch/x86/kvm/cpuid.c                 |  4 +-
 arch/x86/kvm/x86.c                   | 80 ++++++++++++++++++++++++++--
 6 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
index bda3e3e737d7..71b42b649973 100644
--- a/Documentation/virt/kvm/x86/cpuid.rst
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -103,6 +103,10 @@ KVM_FEATURE_HC_MAP_GPA_RANGE       16          guest checks this feature bit bef
 KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
                                                using MSR_KVM_MIGRATION_CONTROL
 
+KVM_FEATURE_SUSPEND_STEAL          18          guest checks this feature bit
+                                               before using
+                                               MSR_KVM_SUSPEND_STEAL.
+
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                                per-cpu warps are expected in
                                                kvmclock
diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 3aecf2a70e7b..7c33f9ee11f5 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -296,6 +296,12 @@ data:
 		the amount of time in which this vCPU did not run, in
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
+		If the guest sets the enable bit in MSR_KVM_SUSPEND_STEAL,
+		steal time includes the duration during which the host is
+		suspended. The case where the host suspends during a VM
+		migration might not be accounted if VCPUs aren't entered
+		post-resume. A workaround would be for the VMM to ensure that
+		the guest is entered with KVM_RUN after resuming from suspend.
 
 	preempted:
 		indicate the vCPU who owns this struct is running or
@@ -388,3 +394,11 @@ data:
         guest is communicating page encryption status to the host using the
         ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
         allow live migration of the guest.
+
+MSR_KVM_SUSPEND_STEAL:
+	0x4b564d09
+
+data:
+	This MSR is available if KVM_FEATURE_SUSPEND_STEAL is present in
+	CPUID. Bit 0 controls whether the host should include the duration it
+	has been suspended in steal time (1), or not (0).
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e57d51e9f2be..eeb63852f062 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -933,6 +933,8 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
+		u64 suspend_ts;
+		atomic64_t suspend_ns;
 		struct gfn_to_hva_cache cache;
 	} st;
 
@@ -1029,6 +1031,7 @@ struct kvm_vcpu_arch {
 	} pv_eoi;
 
 	u64 msr_kvm_poll_control;
+	u64 msr_kvm_suspend_steal;
 
 	/* pv related host specific info */
 	struct {
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..678ebc3d7eeb 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -36,6 +36,7 @@
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
 #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
 #define KVM_FEATURE_MIGRATION_CONTROL	17
+#define KVM_FEATURE_SUSPEND_STEAL	18
 
 #define KVM_HINTS_REALTIME      0
 
@@ -58,6 +59,7 @@
 #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
 #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
+#define MSR_KVM_SUSPEND_STEAL	0x4b564d09
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f84bc0569c9c..2ba85f208f87 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1621,8 +1621,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
 			     (1 << KVM_FEATURE_ASYNC_PF_INT);
 
-		if (sched_info_on())
+		if (sched_info_on()) {
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
+			entry->eax |= (1 << KVM_FEATURE_SUSPEND_STEAL);
+		}
 
 		entry->ebx = 0;
 		entry->ecx = 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 422c7fcc5d83..655fc85ab942 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3753,6 +3753,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	if (unlikely(atomic64_read(&vcpu->arch.st.suspend_ns)))
+		steal += atomic64_xchg(&vcpu->arch.st.suspend_ns, 0);
 	unsafe_put_user(steal, &st->steal, out);
 
 	version += 1;
@@ -4058,6 +4060,17 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.msr_kvm_poll_control = data;
 		break;
 
+	case MSR_KVM_SUSPEND_STEAL:
+		if (!guest_pv_has(vcpu, KVM_FEATURE_SUSPEND_STEAL) ||
+		    !guest_pv_has(vcpu, KVM_FEATURE_STEAL_TIME))
+			return 1;
+
+		if (!(data & KVM_MSR_ENABLED))
+			return 1;
+
+		vcpu->arch.msr_kvm_suspend_steal = data;
+		break;
+
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -4404,6 +4417,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		msr_info->data = vcpu->arch.msr_kvm_poll_control;
 		break;
+	case MSR_KVM_SUSPEND_STEAL:
+		if (!guest_pv_has(vcpu, KVM_FEATURE_SUSPEND_STEAL))
+			return 1;
+		msr_info->data = vcpu->arch.msr_kvm_suspend_steal;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -7027,13 +7045,52 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
 {
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
+	bool kick_vcpus = false;
 
-	/*
-	 * Ignore the return, marking the guest paused only "fails" if the vCPU
-	 * isn't using kvmclock; continuing on is correct and desirable.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (vcpu->arch.msr_kvm_suspend_steal & KVM_MSR_ENABLED) {
+			kick_vcpus = true;
+			WRITE_ONCE(vcpu->arch.st.suspend_ts,
+				   ktime_get_boottime_ns());
+		}
+		/*
+		 * Ignore the return, marking the guest paused only "fails" if
+		 * the vCPU isn't using kvmclock; continuing on is correct and
+		 * desirable.
+		 */
 		(void)kvm_set_guest_paused(vcpu);
+	}
+
+	if (kick_vcpus)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+
+	return NOTIFY_DONE;
+}
+
+static int
+kvm_arch_resume_notifier(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		u64 suspend_ns = ktime_get_boottime_ns() -
+				 vcpu->arch.st.suspend_ts;
+
+		WRITE_ONCE(vcpu->arch.st.suspend_ts, 0);
+
+		/*
+		 * Only accumulate the suspend time if suspend steal-time is
+		 * enabled, but always clear suspend_ts and kick the vCPU as
+		 * the vCPU could have disabled suspend steal-time after the
+		 * suspend notifier grabbed suspend_ts.
+		 */
+		if (vcpu->arch.msr_kvm_suspend_steal & KVM_MSR_ENABLED)
+			atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
+
+		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
 
 	return NOTIFY_DONE;
 }
@@ -7044,6 +7101,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 	case PM_HIBERNATION_PREPARE:
 	case PM_SUSPEND_PREPARE:
 		return kvm_arch_suspend_notifier(kvm);
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		return kvm_arch_resume_notifier(kvm);
 	}
 
 	return NOTIFY_DONE;
@@ -11233,6 +11293,16 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 static bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * During host SUSPEND/RESUME tasks get frozen after SUSPEND notifiers
+	 * run, and thawed before RESUME notifiers, i.e. vCPUs can be actively
+	 * running when KVM sees the system as suspended.  Block the vCPU if
+	 * KVM sees the vCPU as suspended to ensure the suspend steal time is
+	 * accounted before the guest can run, and to the correct guest task.
+	 */
+	if (READ_ONCE(vcpu->arch.st.suspend_ts))
+		return false;
+
 	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
 		!vcpu->arch.apf.halted);
 }
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH v8 3/3] KVM: x86: Add "suspendsteal" cmdline to request host to add suspend duration in steal time
  2025-07-22  5:50 [PATCH v8 0/3] KVM: x86: Include host suspended time in steal time Suleiman Souhlal
  2025-07-22  5:50 ` [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend Suleiman Souhlal
  2025-07-22  5:50 ` [PATCH v8 2/3] KVM: x86: Include host suspended duration in steal time Suleiman Souhlal
@ 2025-07-22  5:50 ` Suleiman Souhlal
  2025-08-21  0:44   ` Sean Christopherson
  2 siblings, 1 reply; 7+ messages in thread
From: Suleiman Souhlal @ 2025-07-22  5:50 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Chao Gao, David Woodhouse, Sergey Senozhatsky,
	Konrad Rzeszutek Wilk, Tzung-Bi Shih, John Stultz, kvm,
	linux-kernel, ssouhlal, Suleiman Souhlal

Introduce a new command line parameter, "suspendsteal", enabling the
guest to use MSR_KVM_SUSPEND_STEAL, which tells the host that it would
like host suspend duration to be included in steal time.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
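A simple way to exercise this: boot the guest kernel with "suspendsteal"
on its command line (the exact invocation depends on the VMM), then
suspend and resume the host, for example with rtcwake. The guest's
reported steal time (the "steal" column in /proc/stat or top) should
grow by roughly the suspend duration once the vCPUs run again.
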
 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 arch/x86/kernel/kvm.c                           | 15 +++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 07e22ba5bfe3..9a5490539bb2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7074,6 +7074,11 @@
 			improve throughput, but will also increase the
 			amount of memory reserved for use by the client.
 
+	suspendsteal
+			[X86,PV_OPS]
+			Enable requesting the host to include the duration the
+			host was suspended in steal time. Disabled by default.
+
 	suspend.pm_test_delay=
 			[SUSPEND]
 			Sets the number of seconds to remain in a suspend test
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 921c1c783bc1..35d1bb2283c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -320,6 +320,18 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static bool suspend_steal;
+
+static int __init suspendsteal_setup(char *s)
+{
+	if (kvm_para_has_feature(KVM_FEATURE_SUSPEND_STEAL))
+		suspend_steal = true;
+
+	return 0;
+}
+
+early_param("suspendsteal", suspendsteal_setup);
+
 static void kvm_register_steal_time(void)
 {
 	int cpu = smp_processor_id();
@@ -331,6 +343,9 @@ static void kvm_register_steal_time(void)
 	wrmsrq(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
 	pr_debug("stealtime: cpu %d, msr %llx\n", cpu,
 		(unsigned long long) slow_virt_to_phys(st));
+
+	if (suspend_steal)
+		wrmsrq(MSR_KVM_SUSPEND_STEAL, KVM_MSR_ENABLED);
 }
 
 static DEFINE_PER_CPU_DECRYPTED(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
-- 
2.50.0.727.gbf7dc18ff4-goog



* Re: [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend.
  2025-07-22  5:50 ` [PATCH v8 1/3] KVM: x86: Advance guest TSC after deep suspend Suleiman Souhlal
@ 2025-08-21  0:24   ` Sean Christopherson
  0 siblings, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2025-08-21  0:24 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse,
	Sergey Senozhatsky, Konrad Rzeszutek Wilk, Tzung-Bi Shih,
	John Stultz, kvm, linux-kernel, ssouhlal

On Tue, Jul 22, 2025, Suleiman Souhlal wrote:
> Try to advance the guest TSC to the current time after a suspend
> during which the host TSC went backwards.
> 
> This makes the behavior consistent between suspends where the host TSC
> resets and suspends where it doesn't, such as suspend-to-idle:
> previously, when the host TSC reset, the guests' TSCs would be "frozen"
> due to KVM's backwards-TSC prevention, while in the latter case they
> would advance.

Before you waste too much time reading through the various pieces of feedback...

I'm leaning towards scrapping this patch.  I still like the idea, but making it
do the right thing given all the edge cases and caveats with TSC management in
KVM seems practically impossible.  E.g. as you called out in an earlier version,
fast-forwarding to "now" is probably undesirable if a vCPU's TSC was completel
disassociated from "now" and/or arch.kvmclock_offset.

We could probably figure out ways to cobble together a solution that works for
most situations, but given that no one is clamoring for KVM to operate this way,
I doubt it'd be worth the effort and complexity.  And it certainly isn't worth
taking on the risk of breaking an existing setup/user.

For the suspend steal time angle, a PV feature bit means both the host and the
guest need to opt-in.  A host opt-in means we can punt to documentation, e.g.
we can document that there are caveats with running VMs across deep suspend, and
the user should consider whether or not they care before enable suspend steal time.

And then if someone really wants KVM to fast-forward time (or comes up with a
simple solution), they can have the honor of figuring out how to do so correctly for
all of the crazy TSC flows.

> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 ++
>  arch/x86/kvm/x86.c              | 49 ++++++++++++++++++++++++++++++++-
>  2 files changed, 51 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fb01e456b624..e57d51e9f2be 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1415,6 +1415,9 @@ struct kvm_arch {
>  	u64 cur_tsc_offset;
>  	u64 cur_tsc_generation;
>  	int nr_vcpus_matched_tsc;
> +#ifdef CONFIG_X86_64
> +	bool host_was_suspended;

Adding an #ifdef to save a single bool isn't worth it, especially since it
necessitates a wrapper (or more #ifdefs).  For something like this, I'd just
set it unconditionally, and then essentially ignore it for 32-bit, along with a
comment explaining why we can't do anything useful for 32-bit.

> +#endif
>  
>  	u32 default_tsc_khz;
>  	bool user_set_tsc;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a9d992d5652f..422c7fcc5d83 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2779,7 +2779,7 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
>  	kvm_vcpu_write_tsc_offset(vcpu, tsc_offset + adjustment);
>  }
>  
> -static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
> +static inline void __adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
>  {
>  	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio)
>  		WARN_ON(adjustment < 0);
> @@ -4995,6 +4995,52 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
>  
>  static DEFINE_PER_CPU(struct kvm_vcpu *, last_vcpu);
>  
> +#ifdef CONFIG_X86_64
> +static void kvm_set_host_was_suspended(struct kvm *kvm)
> +{
> +	kvm->arch.host_was_suspended = true;
> +}
> +
> +static void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, u64 adj)
> +{
> +	unsigned long flags;
> +	struct kvm *kvm;
> +	bool advance;
> +	u64 kernel_ns, l1_tsc, offset, tsc_now;
> +
> +	kvm = vcpu->kvm;
> +	advance = kvm_get_time_and_clockread(&kernel_ns, &tsc_now);
> +	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> +	/*
> +	 * Advance the guest's TSC to current time instead of only preventing
> +	 * it from going backwards, while making sure all the vCPUs use the
> +	 * same offset.
> +	 */
> +	if (kvm->arch.host_was_suspended && advance) {
> +		l1_tsc = nsec_to_cycles(vcpu,
> +					kvm->arch.kvmclock_offset + kernel_ns);
> +		offset = kvm_compute_l1_tsc_offset(vcpu, l1_tsc);

This is where my idea really falls apart.  For this to be correct, KVM would need
to know the relationship between a vCPU's TSC offset and kvmclock_offset.  The
simplest thing would likely be to do something like force an update if and only
if TSCs are matched across all vCPUs, but trying to reason about how that would
work when vCPUs account for the suspend time at different and arbitrary times
makes my head hurt.

One idea I might fiddle with is to snapshot get_kvmclock_base_ns() on suspend
and then grab it again on resume, and use _that_ delta for tsc_offset_adjustment.
But that's something that can be pursued separately (if at all; I don't see any
reason to gate suspend steal time on it).
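
Completely untested sketch of that idea (suspend_clock_ns is a made-up
field, purely for illustration):

	/* in the suspend notifier */
	kvm->arch.suspend_clock_ns = get_kvmclock_base_ns();

	/* in the resume notifier, for each vCPU */
	delta_ns = get_kvmclock_base_ns() - kvm->arch.suspend_clock_ns;
	vcpu->arch.tsc_offset_adjustment += nsec_to_cycles(vcpu, delta_ns);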


* Re: [PATCH v8 2/3] KVM: x86: Include host suspended duration in steal time
  2025-07-22  5:50 ` [PATCH v8 2/3] KVM: x86: Include host suspended duration in steal time Suleiman Souhlal
@ 2025-08-21  0:43   ` Sean Christopherson
  0 siblings, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2025-08-21  0:43 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse,
	Sergey Senozhatsky, Konrad Rzeszutek Wilk, Tzung-Bi Shih,
	John Stultz, kvm, linux-kernel, ssouhlal

On Tue, Jul 22, 2025, Suleiman Souhlal wrote:
> Introduce MSR_KVM_SUSPEND_STEAL which controls whether or not a guest
> wants the duration of host suspend to be included in steal time.
>
> This lets guests subtract the duration during which the host was
> suspended from the runtime of tasks that were running over the suspend,
> in order to prevent cases where host suspend causes long runtimes in
> guest tasks, even though their effective runtime was much shorter.

This is far too terse.  There are a _lot_ of subtleties and wrinkles in this
code that need to be captured.  A persistent reader can find them in the lore links,
but I want to cover the main points in the changelog.

I actually don't mind putting that together; it's a good way to remind myself of
what this is doing, so feel free to ignore this and I can fix it up when applying.

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index a1efa7907a0b..678ebc3d7eeb 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -36,6 +36,7 @@
>  #define KVM_FEATURE_MSI_EXT_DEST_ID	15
>  #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
>  #define KVM_FEATURE_MIGRATION_CONTROL	17
> +#define KVM_FEATURE_SUSPEND_STEAL	18
>  
>  #define KVM_HINTS_REALTIME      0
>  
> @@ -58,6 +59,7 @@
>  #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
>  #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
>  #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
> +#define MSR_KVM_SUSPEND_STEAL	0x4b564d09

I'm pretty sure we don't need a new MSR.  There are 4 bits (I think; KVM's
#defines are horrific) in MSR_KVM_STEAL_TIME that are currently reserved; we
can simply grab one of those.

Using MSR_KVM_STEAL_TIME also makes it more obvious that SUSPEND_STEAL has a
hard dependency on STEAL_TIME being enabled.

> @@ -7027,13 +7045,52 @@ static int kvm_arch_suspend_notifier(struct kvm *kvm)
>  {
>  	struct kvm_vcpu *vcpu;
>  	unsigned long i;
> +	bool kick_vcpus = false;
>  
> -	/*
> -	 * Ignore the return, marking the guest paused only "fails" if the vCPU
> -	 * isn't using kvmclock; continuing on is correct and desirable.
> -	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (vcpu->arch.msr_kvm_suspend_steal & KVM_MSR_ENABLED) {
> +			kick_vcpus = true;
> +			WRITE_ONCE(vcpu->arch.st.suspend_ts,
> +				   ktime_get_boottime_ns());
> +		}
> +		/*
> +		 * Ignore the return, marking the guest paused only "fails" if
> +		 * the vCPU isn't using kvmclock; continuing on is correct and
> +		 * desirable.
> +		 */
>  		(void)kvm_set_guest_paused(vcpu);
> +	}
> +

This should have a comment.  I'm pretty sure I suggested this idea, and it still
took me a good 5 minutes to figure out what this code was doing.

	/*
	 * Force vCPUs to exit from KVM's inner run loop to ensure they see
	 * suspend_ts and block until KVM's resume notifier runs.
	 */


> +	if (kick_vcpus)
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static int

Do not wrap before the function name.  Linus has a nice explanation/rant on this:

https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=GhjSY6nFmuB_wA8fYd5iLBNXjo9Bw@mail.gmail.com

> +kvm_arch_resume_notifier(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	unsigned long i;
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		u64 suspend_ns = ktime_get_boottime_ns() -
> +				 vcpu->arch.st.suspend_ts;
> +
> +		WRITE_ONCE(vcpu->arch.st.suspend_ts, 0);
> +
> +		/*
> +		 * Only accumulate the suspend time if suspend steal-time is
> +		 * enabled, but always clear suspend_ts and kick the vCPU as
> +		 * the vCPU could have disabled suspend steal-time after the
> +		 * suspend notifier grabbed suspend_ts.
> +		 */
> +		if (vcpu->arch.msr_kvm_suspend_steal & KVM_MSR_ENABLED)

This should arguably also check STEAL_TIME is enabled.  And regardless of which
MSR is used, this needs to be READ_ONCE() since the vCPU could modify the value
while KVM is running this code.

The other option would be to only check the MSR from within KVM_RUN, but getting
that right would probably be even harder, and in practice a false negative/positive
is a non-issue.

Related to toggling the MSR, arguably arch.st.suspend_ns should be zeroed if
SUSPEND_STEAL is disabled, but I don't expect a guest to ever disable SUSPEND_STEAL
and then re-enable it at a later time, _and_ AFAICT STEAL_TIME doesn't work that
way either, i.e. it keeps accumulating stolen time and throws it all at the guest
if/when STEAL_TIME is enabled.

> +			atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
> +
> +		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
> +		kvm_vcpu_kick(vcpu);
> +	}

Compile tested only, but this?

---
 Documentation/virt/kvm/x86/cpuid.rst |  3 +
 Documentation/virt/kvm/x86/msr.rst   |  6 ++
 arch/x86/include/asm/kvm_host.h      |  2 +
 arch/x86/include/uapi/asm/kvm_para.h |  2 +
 arch/x86/kvm/cpuid.c                 |  4 +-
 arch/x86/kvm/x86.c                   | 83 +++++++++++++++++++++++++---
 6 files changed, 92 insertions(+), 8 deletions(-)

diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst
index bda3e3e737d7..0fa96a134718 100644
--- a/Documentation/virt/kvm/x86/cpuid.rst
+++ b/Documentation/virt/kvm/x86/cpuid.rst
@@ -103,6 +103,9 @@ KVM_FEATURE_HC_MAP_GPA_RANGE       16          guest checks this feature bit bef
 KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
                                                using MSR_KVM_MIGRATION_CONTROL
 
+KVM_FEATURE_SUSPEND_STEAL          18          guest checks this feature bit
+                                               before enabling KVM_STEAL_SUSPEND_TIME
+
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                                per-cpu warps are expected in
                                                kvmclock
diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst
index 3aecf2a70e7b..367842311ed3 100644
--- a/Documentation/virt/kvm/x86/msr.rst
+++ b/Documentation/virt/kvm/x86/msr.rst
@@ -296,6 +296,12 @@ data:
 		the amount of time in which this vCPU did not run, in
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
+		If the guest sets KVM_STEAL_SUSPEND_TIME, steal time includes
+		the duration during which the host is suspended.  The case
+		where the host suspends during a VM migration might not be
+		accounted if VCPUs don't enter the guest post-resume.  A
+		workaround would be for the VMM to ensure that the guest is
+		entered with KVM_RUN after resuming from suspend.
 
 	preempted:
 		indicate the vCPU who owns this struct is running or
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0d3cc0fc27af..8e11d08dc701 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -944,6 +944,8 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
+		u64 suspend_ts;
+		atomic64_t suspend_ns;
 		struct gfn_to_hva_cache cache;
 	} st;
 
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..1bbe2db93890 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -36,6 +36,7 @@
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
 #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
 #define KVM_FEATURE_MIGRATION_CONTROL	17
+#define KVM_FEATURE_SUSPEND_STEAL	18
 
 #define KVM_HINTS_REALTIME      0
 
@@ -80,6 +81,7 @@ struct kvm_clock_pairing {
 	__u32 pad[9];
 };
 
+#define KVM_STEAL_SUSPEND_TIME			(1 << 1)
 #define KVM_STEAL_ALIGNMENT_BITS 5
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ad6cadf09930..9dbb3899bee7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1626,8 +1626,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
 			     (1 << KVM_FEATURE_ASYNC_PF_INT);
 
-		if (sched_info_on())
+		if (sched_info_on()) {
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
+			entry->eax |= (1 << KVM_FEATURE_SUSPEND_STEAL);
+		}
 
 		entry->ebx = 0;
 		entry->ecx = 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7ba2cdfdac44..cae1c5a8eb99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3784,6 +3784,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	if (unlikely(atomic64_read(&vcpu->arch.st.suspend_ns)))
+		steal += atomic64_xchg(&vcpu->arch.st.suspend_ns, 0);
 	unsafe_put_user(steal, &st->steal, out);
 
 	version += 1;
@@ -4052,14 +4054,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			kvm_check_async_pf_completion(vcpu);
 		}
 		break;
-	case MSR_KVM_STEAL_TIME:
+	case MSR_KVM_STEAL_TIME: {
+		u64 reserved = KVM_STEAL_RESERVED_MASK;
+
 		if (!guest_pv_has(vcpu, KVM_FEATURE_STEAL_TIME))
 			return 1;
 
 		if (unlikely(!sched_info_on()))
 			return 1;
 
-		if (data & KVM_STEAL_RESERVED_MASK)
+		if (guest_pv_has(vcpu, KVM_FEATURE_SUSPEND_STEAL))
+			reserved &= ~KVM_STEAL_SUSPEND_TIME;
+
+		if (data & reserved)
 			return 1;
 
 		vcpu->arch.st.msr_val = data;
@@ -4070,6 +4077,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 
 		break;
+	}
 	case MSR_KVM_PV_EOI_EN:
 		if (!guest_pv_has(vcpu, KVM_FEATURE_PV_EOI))
 			return 1;
@@ -6858,18 +6866,66 @@ long kvm_arch_vm_compat_ioctl(struct file *filp, unsigned int ioctl,
 }
 #endif
 
+static bool kvm_is_suspend_steal_enabled(struct kvm_vcpu *vcpu)
+{
+	u64 val = READ_ONCE(vcpu->arch.st.msr_val);
+
+	return (val & KVM_MSR_ENABLED) && (val & KVM_STEAL_SUSPEND_TIME);
+}
+
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 static int kvm_arch_suspend_notifier(struct kvm *kvm)
 {
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
+	bool kick_vcpus = false;
 
-	/*
-	 * Ignore the return, marking the guest paused only "fails" if the vCPU
-	 * isn't using kvmclock; continuing on is correct and desirable.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (kvm_is_suspend_steal_enabled(vcpu)) {
+			kick_vcpus = true;
+			WRITE_ONCE(vcpu->arch.st.suspend_ts,
+				   ktime_get_boottime_ns());
+		}
+		/*
+		 * Ignore the return, marking the guest paused only "fails" if
+		 * the vCPU isn't using kvmclock; continuing on is correct and
+		 * desirable.
+		 */
 		(void)kvm_set_guest_paused(vcpu);
+	}
+
+	/*
+	 * Force vCPUs to exit from KVM's inner run loop to ensure they see
+	 * suspend_ts and block until KVM's resume notifier runs.
+	 */
+	if (kick_vcpus)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+
+	return NOTIFY_DONE;
+}
+
+static int kvm_arch_resume_notifier(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		u64 suspend_ns = ktime_get_boottime_ns() -
+				 vcpu->arch.st.suspend_ts;
+		WRITE_ONCE(vcpu->arch.st.suspend_ts, 0);
+
+		/*
+		 * Only accumulate the suspend time if suspend steal-time is
+		 * enabled, but always clear suspend_ts and kick the vCPU as
+		 * the vCPU could have disabled suspend steal-time after the
+		 * suspend notifier grabbed suspend_ts.
+		 */
+		if (kvm_is_suspend_steal_enabled(vcpu))
+			atomic64_add(suspend_ns, &vcpu->arch.st.suspend_ns);
+
+		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
 
 	return NOTIFY_DONE;
 }
@@ -6880,6 +6936,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 	case PM_HIBERNATION_PREPARE:
 	case PM_SUSPEND_PREPARE:
 		return kvm_arch_suspend_notifier(kvm);
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		return kvm_arch_resume_notifier(kvm);
 	}
 
 	return NOTIFY_DONE;
@@ -11104,6 +11163,16 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 static bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * During host SUSPEND/RESUME tasks get frozen after SUSPEND notifiers
+	 * run, and thawed before RESUME notifiers, i.e. vCPUs can be actively
+	 * running when KVM sees the system as suspended.  Block the vCPU if
+	 * KVM sees the vCPU as suspended to ensure the suspend steal time is
+	 * accounted before the guest can run, and to the correct guest task.
+	 */
+	if (READ_ONCE(vcpu->arch.st.suspend_ts))
+		return false;
+
 	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
 		!vcpu->arch.apf.halted);
 }

base-commit: 91b392ada892a2e8b1c621b9493c50f6fb49880f
--


* Re: [PATCH v8 3/3] KVM: x86: Add "suspendsteal" cmdline to request host to add suspend duration in steal time
  2025-07-22  5:50 ` [PATCH v8 3/3] KVM: x86: Add "suspendsteal" cmdline to request host to add suspend " Suleiman Souhlal
@ 2025-08-21  0:44   ` Sean Christopherson
  0 siblings, 0 replies; 7+ messages in thread
From: Sean Christopherson @ 2025-08-21  0:44 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Chao Gao, David Woodhouse,
	Sergey Senozhatsky, Konrad Rzeszutek Wilk, Tzung-Bi Shih,
	John Stultz, kvm, linux-kernel, ssouhlal

On Tue, Jul 22, 2025, Suleiman Souhlal wrote:
> Introduce a new command line parameter, "suspendsteal", enabling the
> guest to use MSR_KVM_SUSPEND_STEAL, which tells the host that it would
> like host suspend duration to be included in steal time.
> 
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---

And then if we reuse MSR_KVM_STEAL_TIME:

---
 Documentation/admin-guide/kernel-parameters.txt |  5 +++++
 arch/x86/kernel/kvm.c                           | 13 ++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 747a55abf494..8e80094317c3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7178,6 +7178,11 @@
 			improve throughput, but will also increase the
 			amount of memory reserved for use by the client.
 
+	suspendsteal
+			[X86,PV_OPS]
+			Enable requesting the host to include the duration the
+			host was suspended in steal time. Disabled by default.
+
 	suspend.pm_test_delay=
 			[SUSPEND]
 			Sets the number of seconds to remain in a suspend test
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8ae750cde0c6..1eea3e82c85b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -71,6 +71,7 @@ static DEFINE_PER_CPU_READ_MOSTLY(bool, async_pf_enabled);
 static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible;
 static int has_steal_clock = 0;
+static bool suspend_steal;
 
 static int has_guest_poll = 0;
 /*
@@ -320,6 +321,15 @@ static void __init paravirt_ops_setup(void)
 #endif
 }
 
+static int __init suspendsteal_setup(char *s)
+{
+	if (kvm_para_has_feature(KVM_FEATURE_SUSPEND_STEAL))
+		suspend_steal = true;
+
+	return 0;
+}
+early_param("suspendsteal", suspendsteal_setup);
+
 static void kvm_register_steal_time(void)
 {
 	int cpu = smp_processor_id();
@@ -328,7 +338,8 @@ static void kvm_register_steal_time(void)
 	if (!has_steal_clock)
 		return;
 
-	wrmsrq(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
+	wrmsrq(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED) |
+				   (suspend_steal ? KVM_STEAL_SUSPEND_TIME : 0));
 	pr_debug("stealtime: cpu %d, msr %llx\n", cpu,
 		(unsigned long long) slow_virt_to_phys(st));
 }
--

